# Data Pipeline Preparation

This Jupyter notebook demonstrates how the data pipeline is prepared:
* Text is extracted from PDF files and saved to a SQLite database.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

from glob import glob

# Python package to extract text from pdf
from pdfminer.high_level import extract_text

Save the file paths in a NumPy array

In [2]:
# load filenames for pdf files
pdf_files = np.array(glob("pdf/*"))

In [3]:
# number of files in the 'pdf' folder
len(pdf_files)

5
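The `pdf/*` pattern matches every entry in the folder, including non-PDF files and subdirectories. The following is a minimal sketch of a stricter listing, assuming the files of interest carry a `.pdf` extension. `list_pdf_files` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return sorted paths of regular files in `folder` that end in .pdf.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        # skip subdirectories and match the extension case-insensitively
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

It could be called as `pdf_files = list_pdf_files("pdf")`. Sorting makes the file order reproducible between runs, which `glob` alone does not guarantee.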

## Extract text from PDF files

In [4]:
# create empty dictionary called text_dict
text_dict = {}

# loop through the filenames, extract text from PDF files and save to the dictionary
for file in pdf_files:
    text = extract_text(file)
    text_dict[file] = text
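A single unreadable PDF would stop the loop above with an exception. One possible way to keep going is sketched below, under the assumption that failed files should be recorded rather than abort the run. `extract_all` is a hypothetical helper, and the extractor is passed in as an argument so the sketch does not depend on any particular library.

```python
def extract_all(paths, extractor):
    """Apply `extractor` to each path and return (texts, failures).

    Hypothetical helper for illustration; not part of the original notebook.
    `texts` maps each path to its extracted text, and `failures` maps each
    path that raised an exception to a description of that exception.
    """
    texts, failures = {}, {}
    for path in paths:
        try:
            texts[path] = extractor(path)
        except Exception as exc:  # broad on purpose: parser errors vary by file
            failures[path] = repr(exc)
    return texts, failures
```

In this notebook it could be used as `text_dict, failed = extract_all(pdf_files, extract_text)`, after which `failed` lists any files that need a closer look.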

In [5]:
# convert dictionary to pandas DataFrame
df = pd.DataFrame(list(text_dict.items()), columns=['file_path', 'raw_text'])

The file paths and raw text of the PDF files in the `pdf` folder are now stored in a DataFrame.

In [6]:
df

|   | file_path | raw_text |
|---|-----------|----------|
| 0 | pdf/Maple Knoll Communities success story.pdf | Residents \nfirst, technology \nsecond\n\nSUC... |
| 1 | pdf/circle-of-life-hospice.pdf | Committed \nto providing \ncompassionate \nc... |
| 2 | pdf/Concord Regional VNA Systems Success Story... | EHR software \ndelivers increased \nproductivi... |
| 3 | pdf/willow-health.pdf | Overcoming \nEHR adoption \nhurdles\n\nSUCCES... |
| 4 | pdf/first-choice-home-health-and-hospice.pdf | Eliminating paper \nand improving \norganizati... |


## Save Data

Save the dataset to a SQLite database

In [7]:
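# import the engine factory from SQLAlchemy
from sqlalchemy import create_engine

# create (or open) the SQLite file Text.db in the working directory
engine = create_engine('sqlite:///Text.db')
# write the DataFrame to Text_table, replacing the table if it already exists
df.to_sql('Text_table', engine, if_exists='replace', index=False)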
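
After saving, it can be useful to confirm that the table was written as intended. The sketch below reads the row count and column names back using only the standard-library `sqlite3` module. `summarize_table` is a hypothetical helper and is not part of the notebook.

```python
import sqlite3


def summarize_table(db_path, table):
    """Return (row_count, column_names) for `table` in the SQLite file at `db_path`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Table names cannot be bound as SQL parameters, so the name is quoted
    # inline; only pass trusted table names here.
    quoted = '"' + table.replace('"', '""') + '"'
    conn = sqlite3.connect(db_path)
    try:
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        # the second field of each table_info row is the column name
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
    finally:
        conn.close()
    return row_count, columns
```

Calling `summarize_table("Text.db", "Text_table")` after the cell above should report one row per PDF file and the two columns `file_path` and `raw_text`, assuming the save completed without errors.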