Title: pdfScraping
Author: MCKeuken
Date: Nov. 2018

Challenge: When calculating the healthcare costs of institutions, we receive a number of data files. Occasionally
        a particular figure is missing from them. That is a nuisance, because you then have to go through the
        financial reports by hand to find that number. Fortunately, the layout of these financial reports is
        fairly consistent, which means that a given number is usually preceded by the same description.
        
Goal:   The goal of this script is to search PDF documents for a given piece of text that is followed by a number.
        The code first converts each PDF file to a txt file, then searches the txt file for a regular expression,
        and finally writes the number that follows the match to a dataframe.
        The PDFs are converted to text because PDFs are not easy to search directly with regular expressions.

Layout:
        1) Importing standard modules
        2) Define the PDF-to-TXT conversion functions
        3) Convert the PDF to TXT
        4) Search for the regular expression  
        5) Check
    
Requirements:  two folders:  - pdfDir (this contains your pdf files)
                             - txtDir (this is the output folder)
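
The core idea from the goal above, a description followed by a number, can be tried on a small string before any PDF is involved. This is a minimal sketch: `sample_text` is made up for illustration and is not taken from a real report.

```python
import re

# Made-up text standing in for one converted report; not real data.
sample_text = "Totaal Keuken  1250\nOverige kosten 300\nkeuken 75"

search_term = "Keuken"

# Capture the digits that follow the search term, allowing whitespace in between.
pattern = re.compile(search_term + r"\s*(\d+)")

# Case-sensitive search: only the capitalised "Keuken" line matches.
print(pattern.findall(sample_text))  # ['1250']

# With re.I the lowercase "keuken" line matches as well.
print(re.findall(search_term + r"\s*(\d+)", sample_text, re.I))  # ['1250', '75']
```

Only the captured group (the digits) is returned, which is why the description itself does not show up in the result.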
 

In [1]:
# 1) Import Modules
import os, re
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import pandas as pd

In [2]:
#  2) Defining the functions

# Function 1) Get the text content of a PDF as a string:
def convert(fname, pages=None):
    # An empty set makes pdfminer process every page
    pagenums = set(pages) if pages else set()
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The with-block closes the PDF even if a page fails to parse
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text

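# Function 2) Convert multiple PDF files:
def convertMultiple(pdfDir, txtDir):
    # Fall back to the current working directory if no PDF folder is given
    if pdfDir == "": pdfDir = os.getcwd()
    for pdf in os.listdir(pdfDir):
        # Only process files with a .pdf extension
        if pdf.endswith(".pdf"):
            # os.path.join works with or without a trailing separator on the folder
            pdfFilename = os.path.join(pdfDir, pdf)
            print(pdfFilename)
            text = convert(pdfFilename)
            # The output keeps the original name, e.g. report.pdf -> report.pdf.txt
            textFilename = os.path.join(txtDir, pdf + ".txt")
            with open(textFilename, "w") as textFile:
                textFile.write(text)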
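
Before running a long conversion it can help to see which files would be read and where their text would be written. The sketch below uses a hypothetical helper, `plan_conversions`, which is not part of the notebook. It only works out file names with `pathlib` and does not call pdfminer. It assumes the same naming scheme as `convertMultiple` (`report.pdf` becomes `report.pdf.txt`) and, as an extra assumption, creates the output folder if it is missing.

```python
from pathlib import Path

def plan_conversions(pdf_dir, txt_dir):
    """Return (pdf_path, txt_path) pairs for every PDF in pdf_dir.

    Hypothetical helper for illustration: it only computes paths and
    never opens or converts a PDF.
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    # Create the output folder if it does not exist yet
    txt_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for pdf_path in sorted(pdf_dir.iterdir()):
        # Compare the extension case-insensitively so "REPORT.PDF" is included
        if pdf_path.suffix.lower() == ".pdf":
            # Same naming scheme as convertMultiple: report.pdf -> report.pdf.txt
            pairs.append((pdf_path, txt_dir / (pdf_path.name + ".txt")))
    return pairs
```

Calling `plan_conversions(pdfDir, txtDir)` with the two folders from the requirements returns the list of input and output paths, which can be printed and checked before any conversion is started.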

In [3]:
# 3) Convert the PDF to TXT
# This is where you actually convert the PDFs to TXT

# Change the following paths to your own folders
pdfDir = "/Users/mckitchen/Documents/Werk/Projecten/pdfScraping/pdfDir/"
txtDir = "/Users/mckitchen/Documents/Werk/Projecten/pdfScraping/txtDir/"

# Call function 2 (which will call function 1):
convertMultiple(pdfDir, txtDir)

# Congrats, you have just converted the PDFs to TXT!

/Users/mckitchen/Documents/Werk/Projecten/pdfScraping/pdfDir/keuken_test.pdf


In [4]:
# 4) Search for the regular expression 
# The next step is to search the TXT files for a given regular expression.
searchTerm = "Keuken"

# Collect one row per hit:
output=[]
# For all files in the txtDir folder:
for txt in os.listdir(txtDir): 
    # Only look at files that end with .txt:
    if txt.endswith(".txt"):
        # Open the text file and read the text:
        path = os.path.join(txtDir, txt)
        with open(path, 'r') as file:
            fileText = file.read()
        # Now that we have the text in memory we can search it.
        # First, count how often the searchTerm occurs (ignoring case):
        matches = re.findall(searchTerm, fileText, re.I) 
        freq = len(matches) 
        # This is the crux of the code: we only care about the combination
        #   of the searchTerm AND the number that follows it.
        # Note that this search is case-sensitive, unlike the count above.
        numberAfterMatch = re.findall(searchTerm + r"\s*(\d+)", fileText) 
        # There may be several valid combinations in one file, so save every hit:
        for hit in numberAfterMatch:
            output.append((txt, searchTerm, freq, hit))
# For a somewhat nicer output we add column names:      
cols = ['file_name', 'Search term', 'frequency','numberAfterMatch']    
# Store the results in a dataframe:
output = pd.DataFrame(output, columns = cols)
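
The pattern in step 4 captures only an unbroken run of digits (`\d+`), so an amount written as 1.234.567 would be captured as 1. The sketch below is one possible extension, not part of the original notebook. It assumes Dutch-style formatting, with `.` as the thousands separator and `,` as the decimal mark; `find_amounts` and the sample text are hypothetical.

```python
import re

def find_amounts(search_term, text):
    """Return the numbers that directly follow search_term, as floats.

    Assumption: '.' separates thousands and ',' marks decimals.
    """
    # First alternative: grouped thousands such as 1.234.567 or 3.500,75
    # Second alternative: plain digits such as 42 or 42,5
    pattern = search_term + r"\s*(\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?)"
    amounts = []
    for raw in re.findall(pattern, text):
        # Drop the thousands separators, then switch to a decimal point
        amounts.append(float(raw.replace(".", "").replace(",", ".")))
    return amounts

# Made-up sample text, not taken from a real report.
sample = "Keuken 1.234.567\nKeuken 42\nKeuken   3.500,75\nBadkamer 99"
print(find_amounts("Keuken", sample))  # [1234567.0, 42.0, 3500.75]
```

Reports that use a different number format would need a different pattern, so the assumption about separators is worth checking against a few real files first.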

In [5]:
output


Unnamed: 0,file_naam,zoekterm,frequentie,getalNaZoekterm
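
The layout lists a fifth step, a check of the results. One possible check, sketched here as an assumption about what is useful rather than as part of the original notebook, is to list the text files in which no number was found, since those are the reports that still need to be looked up by hand. The helper `files_without_hits` and the example values are hypothetical; the rows have the same shape as the tuples collected in step 4.

```python
def files_without_hits(all_files, hits):
    """Return the file names in all_files that have no entry in hits.

    hits is a list of (file_name, search_term, frequency, number) tuples,
    the same shape as the rows collected in step 4.
    """
    files_with_hits = {row[0] for row in hits}
    return sorted(set(all_files) - files_with_hits)

# Made-up example values, not results from real reports.
all_files = ["a.pdf.txt", "b.pdf.txt", "c.pdf.txt"]
hits = [("a.pdf.txt", "Keuken", 2, "1250"), ("a.pdf.txt", "Keuken", 2, "75")]
print(files_without_hits(all_files, hits))  # ['b.pdf.txt', 'c.pdf.txt']
```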
