## PDF Extract Automation

Below are the steps taken by the automation to extract information from PDFs in a given folder. All data for these PDFs was generated by the Faker library.

### 1.0 Import Packages
The main package used for PDF extraction is pypdf. Its predecessor, PyPDF2, can also be used for this purpose, but it is deprecated in favor of pypdf.

In [1]:
import time
t0 = time.time()

import os
import re
import faker
import random
import warnings
import pandas as pd
from datetime import date
from pypdf import PdfReader

warnings.filterwarnings("ignore", category=FutureWarning) 

### 2.0 Extract data from PDFs in a given folder
Data from every PDF in the folder below is extracted and stored in the dictionary extract_dict. The list columnlist holds the names of the PDF form fields to extract, which were identified ahead of time.

In [2]:
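folder = 'SamplePDFs'

# Names of the PDF form fields to extract
columnlist = ['Year', 'Employee Name', 'Employee SSN', 'Agency Name', 'Agency code', 'Agency Contact', 'Agency Phone', 'DOE DP amount', 'DOE 67 amount']
file_names = []

# One empty list per field; each PDF appends one value to every list
extract_dict = {column: [] for column in columnlist}

# Sort for a reproducible row order and skip anything that is not a PDF
file_list = sorted(f for f in os.listdir(folder) if f.lower().endswith('.pdf'))

for file_name in file_list:
    # os.path.join builds the path portably instead of hard-coding a separator
    file = os.path.join(folder, file_name)
    # The context manager closes the file even if reading fails
    with open(file, 'rb') as pdf_object:
        pdf_dict = PdfReader(pdf_object).get_form_text_fields()
    file_names.append(file)
    for column in columnlist:
        # .get() records None instead of raising KeyError when a field is missing
        extract_dict[column].append(pdf_dict.get(column))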
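
The collection step can be separated from file handling so that it can be checked without any PDFs on disk. The following is a minimal sketch, assuming each PDF yields a plain dictionary mapping field names to values (the shape returned by get_form_text_fields). The helper name collect_columns and the sample records are hypothetical and not part of the original notebook.

```python
from typing import Any


def collect_columns(records: list[dict[str, Any]], columns: list[str]) -> dict[str, list[Any]]:
    """Turn per-file field dictionaries into one list of values per column."""
    table: dict[str, list[Any]] = {column: [] for column in columns}
    for record in records:
        for column in columns:
            # dict.get returns None when a record lacks the field,
            # so every column list keeps the same length
            table[column].append(record.get(column))
    return table


# Made-up records standing in for the dictionaries read from two PDFs
sample_records = [
    {"Year": "2023", "Employee Name": "Jane Doe"},
    {"Year": "2022"},  # this record has no 'Employee Name' field
]
columns = collect_columns(sample_records, ["Year", "Employee Name"])
print(columns)
```

Keeping all lists the same length matters because pandas raises an error when a DataFrame is built from a dictionary whose lists differ in length.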

    

### 3.0 Store in .csv output file
The extracted data is stored in the CSV file 'Bono_Class.csv'.

In [3]:
df = pd.DataFrame.from_dict(data = extract_dict)
df['File'] = file_names
# regex=True is required: pandas 2.0 and later treat the pattern as a literal string by default
df['Agency Phone'] = df['Agency Phone'].astype(str)
df['Agency Phone'] = df['Agency Phone'].str.replace(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", regex=True)

df['Employee SSN'] = df['Employee SSN'].astype(str)
df['Employee SSN'] = df['Employee SSN'].str.replace(r"(\d{3})(\d{2})(\d{4})", r"\1-\2-\3", regex=True)

df.to_csv('Bono_Class.csv', index=False)
t1 = time.time()
total_time = t1 - t0
total_time = round(total_time,2)

today = date.today()
print('This automation ran on: ' + today.strftime('%B %d, %Y'))

print('This automation ran in: ' + str(total_time) + ' seconds')



This automation ran on: April 17, 2023
This automation ran in: 2.04 seconds
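
The phone and SSN formatting in section 3.0 relies on regex capture groups that are re-inserted with \1, \2 and \3. The following is a minimal sketch of the same substitutions using only the standard-library re module, so the patterns can be tried on single strings. The function names format_phone and format_ssn are hypothetical, and anchoring the patterns to the whole string is an added assumption; the notebook's patterns are unanchored.

```python
import re

# \A and \Z anchor the match to the whole string, so only values that are
# exactly 10 (phone) or 9 (SSN) digits are reformatted
PHONE_PATTERN = re.compile(r"\A(\d{3})(\d{3})(\d{4})\Z")
SSN_PATTERN = re.compile(r"\A(\d{3})(\d{2})(\d{4})\Z")


def format_phone(raw: str) -> str:
    """Format a bare 10-digit string as (XXX) XXX-XXXX; return other input unchanged."""
    return PHONE_PATTERN.sub(r"(\1) \2-\3", raw)


def format_ssn(raw: str) -> str:
    """Format a bare 9-digit string as XXX-XX-XXXX; return other input unchanged."""
    return SSN_PATTERN.sub(r"\1-\2-\3", raw)


print(format_phone("5551234567"))    # (555) 123-4567
print(format_ssn("123456789"))       # 123-45-6789
print(format_phone("555-123-4567"))  # already formatted, returned unchanged
```

Without the anchors, a longer digit string would have only its first matching run reformatted, which can hide malformed input in the output file.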
