# Assignment 1.3

1. Download the folders containing PDF files provided via Moodle (1-3-pdf-files.zip). Since the PDF format cannot be directly used to process text, you first have to convert the file contents to plain text. Find two different methods to convert PDFs to text, and compare their performance. You should provide a quantitative and qualitative analysis. For comparison of two generated files, use Python’s SequenceMatcher.ratio(). With this analysis as a basis, choose one of the methods, and provide the processed raw text files in the submission folder. Justify your decision!


### First method: using PyPDF2
PyPDF2 is a Python package for reading and manipulating PDF files; among other things, it can extract the text of each page.

In [None]:
import os
import PyPDF2
 
pdfFolders = ["1-3-pdf-files/flyers", "1-3-pdf-files/iban", "1-3-pdf-files/scans"]
txtFolders = ["1-3-txt-files-pypdf2/flyers", "1-3-txt-files-pypdf2/iban", "1-3-txt-files-pypdf2/scans"]

for readFolder, writeFolder in zip(pdfFolders, txtFolders):
    # make sure the output folder exists before writing to it
    os.makedirs(writeFolder, exist_ok=True)
    for filename in os.listdir(readFolder):
        # skip hidden files and anything that is not a PDF
        if not filename.lower().endswith('.pdf'):
            continue
        inputFile = readFolder + "/" + filename
        outputFile = writeFolder + "/" + filename[:-4] + '.txt'

        # open the PDF in binary mode; the with-block closes both files again
        with open(inputFile, 'rb') as pdfFileObj, open(outputFile, 'w', encoding='utf-8') as f:
            # PdfReader replaces the deprecated PdfFileReader (removed in PyPDF2 3.0)
            pdfReader = PyPDF2.PdfReader(pdfFileObj)

            # iterate over all pages and write the extracted text of each page
            for pdfObject in pdfReader.pages:
                f.write(pdfObject.extract_text())

 

### Second method: using pdftotext

'pdftotext' is an online PDF-to-text converter, available at [pdftotext.com](https://pdftotext.com/). The cell below compares its output with the PyPDF2 output, file by file.

In [None]:
from difflib import SequenceMatcher

txtFolders2 = ["1-3-txt-files-pdftotext/flyers", "1-3-txt-files-pdftotext/iban", "1-3-txt-files-pdftotext/scans"]

for i in range(len(txtFolders)):
    readFolder = txtFolders[i]
    for filename in os.listdir(readFolder):
        if filename.startswith('.'):
            continue
        inputFile1 = readFolder + "/" + filename
        inputFile2 = txtFolders2[i] + '/' + filename

        # read both files and drop empty lines, so that differing blank-line
        # conventions do not influence the similarity score
        with open(inputFile1, "r", encoding='utf-8-sig', errors='ignore') as f:
            txt1 = os.linesep.join(s for s in f.read().splitlines() if s)

        with open(inputFile2, "r", encoding='utf-8-sig', errors='ignore') as f:
            txt2 = os.linesep.join(s for s in f.read().splitlines() if s)

        
        seqMatch = SequenceMatcher(a = txt1, b = txt2).ratio()

        print(f"Original file: {filename.replace('.txt', '.pdf')}")
        print(f"Similarity of PyPDF2 text file and pdftotext text file: {seqMatch}\n")
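
To make the similarity score easier to interpret, here is a minimal sketch of how `SequenceMatcher.ratio()` behaves. It returns `2*M/T`, where `M` is the number of matched characters and `T` is the total length of both strings. The two sample strings are made up for illustration; they imitate a two-column page read column by column and line by line.

```python
from difflib import SequenceMatcher

# Hypothetical two-column page: left column A1, A2 and right column B1, B2.
column_wise = "A1 A2 B1 B2"   # each column read top to bottom
row_wise = "A1 B1 A2 B2"      # read line by line across both columns

identical = SequenceMatcher(a=column_wise, b=column_wise).ratio()
reordered = SequenceMatcher(a=column_wise, b=row_wise).ratio()

print(identical)   # identical strings give 1.0
print(reordered)   # same words in a different order give a lower ratio
```

A low ratio therefore does not necessarily mean that text is missing; a different reading order alone is enough to lower it.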

### Observations:

<ul>
    <li>The 'double_ocr.txt' files produced by PyPDF2 and by pdftotext differ considerably: their similarity is only about 0.13. <br> 
    Comparing each file with the original PDF showed that PyPDF2 traverses the page column by column (each column top to bottom, columns from left to right), so the logical order of the text is preserved. <br>
    pdftotext, in contrast, traverses the page top to bottom without respecting the columns, so text from different columns is interleaved.</li>
    <br>
    <li>'bahnstadt.txt' is another file with significant differences. Again, pdftotext traverses the pages left to right, top to bottom, without preserving the logical order of the original. On the "Inhalt" page, for example, sections, numbers and subsections are mixed up, whereas PyPDF2 keeps them in order.<br> 
    Furthermore, PyPDF2 handles the page break between pages 4 and 5 correctly, whereas pdftotext inserts the following text into the middle of the sentence that starts on page 4:<br>
    4<br>
    | BAHNSTADT - BERGHEIM - WESTSTADT<br>
    Ursula Gross, Redaktion
    </li>
    <br>
    <li>Unfortunately, neither method performed well on 'single_ocr.txt'. Both text files are garbled because the PDF was traversed from left to right without respecting the column boundaries.
    </li>
    <br>
</ul>

Based on these observations, we chose to proceed with the text files generated by PyPDF2.

2. Why is a high quality conversion from PDF to plain text hard? Your answer does not need to be exhaustive but should outline some of the most important reasons.

### Reasons why PDF to plain text conversion is hard

The text on a PDF page is described by a sequence of instructions, each of which does one of the following: it sets the font used by subsequent instructions, it sets the text position and direction used by subsequent instructions, or it draws text given as string arguments.

Upon converting a pdf to plain text, the following problems might occur:
<ol>
    <li>The string drawing instructions can occur in different orders, so a converter cannot simply take the string arguments in the order they appear and concatenate them. Text positioning instructions also need to be taken into account.</li>
    <br>
    <li>The text in a pdf might be organized over multiple columns. Hence, the question arises how to read the data from such a file: line by line without respecting column boundaries or column by column, starting left and ending right. Furthermore, the text might be multi-columnar without the pdf containing any hints in this respect.</li>
    <br>
    <li>Spaces are sometimes created by text positioning instructions rather than by drawing a space glyph. A text extractor that does not take this possibility into account might return a result without spaces.</li>
    <br>
    <li>Bold text is usually drawn with a separate bold font program. If none is available, bold is sometimes emulated by drawing the same text twice with a minute offset; with a slightly larger offset and a different color, a shadow effect can be emulated in the same way. A text extractor that does not recognize this can end up with duplicate characters in its output.</li>
    <br>
    <li>Text may be drawn using the same color as the background; to recognize this, a text extractor has to take into account anything drawn beforehand in the location of the text.</li>
    <br>
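    <li>Text extraction (or OCR, for scanned documents) sometimes misses hyperlinks, most often when natural anchor text is displayed instead of the actual URL, because the link target is then not part of the visible text.</li>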
</ol>
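
The first and third points can be illustrated with a small sketch. The fragments, their coordinates, and the parameters `char_width` and `gap_factor` are hypothetical; real extractors work with actual glyph metrics and have to detect columns as well, which this sketch does not attempt.

```python
# Hypothetical text fragments as (x, y, text); y grows downwards here.
# They are listed in drawing order, which differs from the reading order.
fragments = [
    (120.0, 10.0, "world"),
    (10.0, 30.0, "second"),
    (28.0, 10.0, "lo"),
    (10.0, 10.0, "Hel"),
    (60.0, 30.0, "line"),
]

def reading_order_text(fragments, char_width=6.0, gap_factor=0.5):
    """Sort fragments by line (y) and position (x); insert spaces at gaps."""
    lines = {}
    for x, y, text in fragments:
        lines.setdefault(y, []).append((x, text))

    out = []
    for y in sorted(lines):
        pieces = sorted(lines[y])
        first_x, line = pieces[0]
        prev_end = first_x + len(line) * char_width
        for x, text in pieces[1:]:
            # a horizontal gap between two fragments stands for a space,
            # even though no space glyph was drawn
            if x - prev_end > gap_factor * char_width:
                line += " "
            line += text
            prev_end = x + len(text) * char_width
        out.append(line)
    return "\n".join(out)

result = reading_order_text(fragments)
print(result)
```

Concatenating the strings in drawing order would give "worldsecondloHelline", whereas sorting by position and turning gaps into spaces recovers the two lines "Hello world" and "second line".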

3. i) From the PDF files in the folder flyers/ extract as many valid phone numbers as possible.

In [1]:
import re

In [8]:
flyersFolder = "1-3-txt-files-pypdf2/flyers"
telephones = set()

for filename in os.listdir(flyersFolder):
    readFile = flyersFolder + "/" + filename

    with open(readFile, "r", encoding = 'utf-8-sig', errors = 'ignore') as f:
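        for line in f:
            # candidate: starts and ends with a digit, digits and separators in between
            tel = re.findall(r"\d[0-9\s/\(\)\[\]-]*\d", line)
            for new_tel in tel:
                # strip the separators and keep only numbers with exactly 11 digits
                telephone = re.sub(r'[\s/\(\)\[\]-]', '', new_tel)
                if len(telephone) == 11:
                    telephones.add(telephone)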


In [9]:
phoneNrFile = "results/phone_numbers_flyers.txt"
with open(phoneNrFile, "w") as f:
    for telephone in telephones:
        f.write(telephone)
        f.write("\n")
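
As a quick check of the extraction logic, the following sketch wraps the same regular expressions in a function and applies them to two made-up lines. The function name `extract_phone_numbers` and the sample strings are assumptions for illustration and are not taken from the flyers.

```python
import re

PHONE_CANDIDATE = re.compile(r"\d[0-9\s/()\[\]-]*\d")
SEPARATORS = re.compile(r"[\s/()\[\]-]")

def extract_phone_numbers(line, length=11):
    """Return all digit sequences in `line` that have exactly `length` digits."""
    found = set()
    for candidate in PHONE_CANDIDATE.findall(line):
        digits = SEPARATORS.sub("", candidate)
        if len(digits) == length:
            found.add(digits)
    return found

# Made-up sample lines
national = extract_phone_numbers("Tel. (06221) 58-1234")
international = extract_phone_numbers("Tel. +49 6221 581234")

print(national)        # the separators are removed, 11 digits remain
print(international)   # 12 digits after cleaning, so nothing is kept
```

The second line shows a limitation of the fixed length: a number written with a country code yields 12 digits after cleaning and is therefore discarded.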

3. ii) Extract valid URLs and Email addresses from the files in the folder flyers/.

We assume that valid URLs start with "http(s)" or "www".

In [56]:
urls = set()
email_addresses = set()

for filename in os.listdir(flyersFolder):
    readFile = flyersFolder + "/" + filename

    with open(readFile, "r", encoding = 'utf-8-sig', errors = 'ignore') as f:
        for line in f:
            email_adr = re.findall(r"[\w\.-]+@[\w\.-]*[a-zA-Z]+", line)
            # one pattern for both prefixes, so that "https://www.…" is not
            # additionally stored a second time as a separate "www.…" URL
            url = re.findall(r"(?:https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}", line)
            email_addresses.update(email_adr)
            urls.update(url)

In [48]:
emailsFile = "results/email-addresses.txt"
with open(emailsFile, "w") as f:
    for email in email_addresses:
        f.write(email)
        f.write("\n")

In [57]:
urlsFile = "results/urls.txt"
with open(urlsFile, "w") as f:
    for url in urls:
        f.write(url)
        f.write("\n")
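
The following sketch shows how the matching behaves on a made-up line. It uses a single URL pattern with the same character classes as in the extraction cell; the sample line and the constant names are assumptions for illustration.

```python
import re

URL = re.compile(r"(?:https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}")
EMAIL = re.compile(r"[\w.-]+@[\w.-]*[a-zA-Z]+")

# Made-up sample line
line = "Info: https://www.example.org/events or mail to info@example.org"

found_urls = URL.findall(line)
found_emails = EMAIL.findall(line)

print(found_urls)     # scheme and host only
print(found_emails)
```

Since "/" is not part of the character class, the match stops at the host name and the path "/events" is not captured. This is acceptable if only the site is of interest, but it should be kept in mind when the full link is needed.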

3. iii) From the PDF file in the folder iban/ extract all IBANs.

In [18]:
iban_file = "1-3-txt-files-pypdf2/iban/liste1.txt"
iban_list = []

with open(iban_file, "r") as f:
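    for line in f:
        # candidate: two-letter country code followed by digits, capitals and whitespace
        ibans = re.findall(r"[A-Z]{2}[0-9A-Z\s]+", line)
        for element in ibans:
            # remove whitespace; 34 characters is the maximum IBAN length
            e = re.sub(r'\s', '', element)
            if 24 <= len(e) <= 34:
                iban_list.append(e)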

In [19]:
ibanFile = "results/ibans.txt"
with open(ibanFile, "w") as f:
    for iban in iban_list:
        f.write(iban)
        f.write("\n")
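
The length filter alone cannot tell a real IBAN from any other capitalised string of similar length. As an optional extra check, not part of the solution above, candidates could be validated with the standard IBAN checksum (mod 97). The following is a minimal sketch; the function name `is_valid_iban` is hypothetical, and the sample values are commonly used example IBANs, not entries from the provided file.

```python
import re

def is_valid_iban(iban):
    """Check the IBAN format and its mod-97 checksum."""
    iban = "".join(iban.split()).upper()
    # country code, two check digits, then the account part; 15 to 34 characters
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", iban):
        return False
    # move the first four characters to the end
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps the digits to 0-9 and the letters A-Z to 10-35
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(is_valid_iban("DE89 3704 0044 0532 0130 00"))   # well-known example IBAN
print(is_valid_iban("DE88 3704 0044 0532 0130 00"))   # check digits altered
print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))   # well-known example IBAN
```

Such a check would also catch candidates in which the text extraction dropped or merged characters.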

4. Apply your solution from task 3-(i) to the files in the folder scans/ which consists of pages scanned from a phone book. Analyse how well your solution performs by giving examples.

In [16]:
scansFolder = "1-3-txt-files-pypdf2/scans"
telephones = set()

for filename in os.listdir(scansFolder):
    readFile = scansFolder + "/" + filename

    with open(readFile, "r", encoding = 'utf-8-sig', errors = 'ignore') as f:
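        for line in f:
            # same extraction as in 3. i): digits with optional separators in between
            tel = re.findall(r"\d[0-9\s/\(\)\[\]-]*\d", line)
            for new_tel in tel:
                # strip the separators and keep only numbers with exactly 11 digits
                telephone = re.sub(r'[\s/\(\)\[\]-]', '', new_tel)
                if len(telephone) == 11:
                    telephones.add(telephone)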


In [17]:
phoneNrFile = "results/phone_numbers_scans.txt"
with open(phoneNrFile, "w") as f:
    for telephone in telephones:
        f.write(telephone)
        f.write("\n")

The code from 3. i) does not perform well on the text files from the scans directory, mainly because it accepts only telephone numbers with exactly 11 digits. Most telephone numbers in "double_ocr.txt", however, have fewer than 11 digits, since they all share the same prefix, 06221, and are listed without it.<br>
So numbers such as "33836 70 " or "4339009 " (from "double_ocr.txt") are not recognized, even though they are valid telephone numbers.
The same happens with the phone numbers in "single_ocr.txt", most of which have 5 or 6 digits.


Moreover, house numbers and telephone numbers are merged in some cases because they appear immediately after one another (e.g. in "double_ocr.txt": Kirschgartenstr. 19 33836 70, where 19 is the house number and 33836 is part of the telephone number). If the two together happen to have 11 digits, the code accepts them as a single telephone number.

Even though the two scans consist mainly of telephone numbers, the code recognizes only 12 of them. If we allowed numbers with fewer than 11 digits, we might also pick up entries that are not valid telephone numbers, such as fax numbers or combinations of house and telephone numbers.
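
One way to quantify how many candidates the length filter discards is to count the candidates by their number of digits. The following is a minimal sketch that uses the two examples quoted above as input; the function name `digit_length_histogram` is hypothetical.

```python
import re
from collections import Counter

CANDIDATE = re.compile(r"\d[0-9\s/()\[\]-]*\d")
SEPARATORS = re.compile(r"[\s/()\[\]-]")

def digit_length_histogram(lines):
    """Count the extracted candidates by their number of digits."""
    counts = Counter()
    for line in lines:
        for candidate in CANDIDATE.findall(line):
            counts[len(SEPARATORS.sub("", candidate))] += 1
    return counts

# The two examples from "double_ocr.txt" quoted above
sample_lines = ["Kirschgartenstr. 19 33836 70", "4339009 "]
histogram = digit_length_histogram(sample_lines)

print(histogram)
```

In the first line the house number and the telephone number are merged into one candidate with 9 digits; the second line yields a candidate with 7 digits. Neither has 11 digits, so both are rejected by the filter.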