### OCR notebook: preparing data for subsequent NLP analysis with spaCy (template: one notebook per book)

1. **pre-processing (i):** cut irrelevant material (entire pages: front page, table of contents, introduction, index, etc.) using PyPDF2
2. **pre-processing (ii):** convert the PDF into JPEGs using pdf2image
3. **OCR:** run Tesseract on all JPEGs in one folder (one folder = one book)
4. **post-processing (i):** discard faulty OCR output by eliminating sentences that contain words with low recognition confidence
5. **post-processing (ii):** write the OCR output to a single coherent txt file
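Because every step needs book-specific paths, one way to keep a per-book notebook consistent is to derive all of them from a single author/title/section triple. The sketch below is an assumption, not part of the original notebook: the `BookPaths` class and its property names are hypothetical, and the directory layout simply mirrors the paths used in the cells of this notebook.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class BookPaths:
    """Hypothetical helper: derives every book-specific path from one place."""
    root: Path      # data directory, e.g. Path.home() / "data"
    author: str     # e.g. "Descartes"
    title: str      # e.g. "Meditations"
    section: str    # e.g. "meditation_one"
    raw_name: str   # file name of the uncut PDF, e.g. "meditations.pdf"

    @property
    def input_pdf(self) -> Path:
        return self.root / "input" / "raw" / self.author / self.title / self.raw_name

    @property
    def output_pdf(self) -> Path:
        return self.root / "input" / "cut" / self.author / self.title / f"{self.section}.pdf"

    @property
    def jpeg_folder(self) -> Path:
        return self.root / "output" / "jpegs" / self.author / self.title / self.section

    @property
    def txt_file(self) -> Path:
        return self.root / "output" / "txt" / self.author / self.title / f"{self.section}.txt"


paths = BookPaths(Path.home() / "data", "Descartes", "Meditations",
                  "meditation_one", "meditations.pdf")
print(paths.input_pdf)
print(paths.jpeg_folder)
```

With such a helper, adapting the template to a new book would mean changing one constructor call instead of editing a path in each cell.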

**1. pre-processing (i):** cut irrelevant material (entire pages: front page, table of contents, introduction, index, etc.) using PyPDF2

*required adjustments for book-specific notebooks: 1. path input_pdf, 2. path output_pdf, 3. page range*

In [5]:
from PyPDF2 import PdfFileReader, PdfFileWriter
from pathlib import Path

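home = str(Path.home())
# Book-specific paths: adjust for each notebook.
input_pdf = home + "/data/input/raw/Descartes/Meditations/meditations.pdf"
output_pdf = home + "/data/input/cut/Descartes/Meditations/meditation_one.pdf"

# PdfFileReader/PdfFileWriter are the PyPDF2 1.x/2.x names (PyPDF2 3.x uses PdfReader/PdfWriter).
pdf_reader = PdfFileReader(input_pdf)
pdf_writer = PdfFileWriter()

# Page indices are zero-based: range(15, 19) keeps the 16th to 19th page of the PDF.
for n in range(15, 19):
    page = pdf_reader.getPage(n)
    pdf_writer.addPage(page)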

with open(output_pdf, 'wb') as out:
    pdf_writer.write(out)
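Since the page range is the adjustment most prone to off-by-one mistakes, a small helper can translate an inclusive, 1-based page span (position in the PDF file, not the printed page number) into the zero-based indices the loop expects. This is only a sketch: `page_indices` is a hypothetical name, and the total of 100 pages in the usage line is an invented value.

```python
def page_indices(first_page: int, last_page: int, total_pages: int) -> range:
    """Convert an inclusive, 1-based page span into zero-based page indices."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"invalid page span {first_page}-{last_page} for a PDF with {total_pages} pages"
        )
    # Index of the first page is first_page - 1; range() excludes its upper bound,
    # so last_page itself is the correct stop value.
    return range(first_page - 1, last_page)


# The 16th to 19th page correspond to range(15, 19) as used in the cell above.
indices = page_indices(16, 19, total_pages=100)  # total_pages is a made-up value
print(list(indices))  # [15, 16, 17, 18]
```

Validating against the page count up front gives a readable error instead of an index failure halfway through writing the output PDF.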

**2. pre-processing (ii):** convert the PDF into JPEGs using pdf2image

*required adjustments for book-specific notebooks: 1. path output_folder_imgs*

In [6]:
from pdf2image import convert_from_path

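# Book-specific path: adjust for each notebook.
output_folder_imgs = home + "/data/output/jpegs/Descartes/Meditations/meditation_one"

# Render every page of the cut PDF at 300 dpi and save it as a JPEG in output_folder_imgs.
images = convert_from_path(
    pdf_path=output_pdf,
    dpi=300,
    fmt='jpeg',
    output_folder=output_folder_imgs
)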


**3. OCR:** run Tesseract on all JPEGs in one folder (one folder = one book)

*required adjustments for book-specific notebooks: 1. path glob*

In [49]:
from pathlib import Path
import os
from glob import glob

home = str(Path.home())
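# Book-specific glob pattern: adjust for each notebook.
path = home + "/data/output/jpegs/Descartes/Meditations/meditation_one/*.jpg"
# Sorting by modification time assumes the JPEGs were written in page order.
sorted_pages = sorted(glob(path), key=os.path.getmtime)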

print(sorted_pages)

['/home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-1.jpg', '/home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-2.jpg', '/home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-3.jpg', '/home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-4.jpg']
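Modification times can change when files are copied or regenerated, so an alternative is to sort on the page counter at the end of each file name, as seen in the output above (`...-1.jpg`, `...-2.jpg`). The following is a sketch under that naming assumption; `page_number` is a hypothetical helper and the example file names are made up.

```python
import re
from pathlib import Path


def page_number(path: str) -> int:
    """Return the trailing page counter of file names like '<uuid>-12.jpg'."""
    match = re.search(r"-(\d+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no page number in file name: {path}")
    # int() also handles zero-padded counters such as '-03'.
    return int(match.group(1))


example_files = ["abc-10.jpg", "abc-2.jpg", "abc-1.jpg"]  # made-up names
print(sorted(example_files, key=page_number))
# ['abc-1.jpg', 'abc-2.jpg', 'abc-10.jpg']
```

A plain alphabetical sort would place `abc-10.jpg` before `abc-2.jpg`, which is why the counter is compared as an integer.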


*required adjustments for book-specific notebooks: none*

In [65]:
%%time

from tesserocr import PyTessBaseAPI

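LIMIT = 65  # minimum word confidence (0-100) for a word to be kept
REMOVE_WORD = "++PLEASE_REMOVE_ME++"  # placeholder that replaces low-confidence words
PROCESSED_TEXT = " "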

with PyTessBaseAPI(path=home + "/tessdata_best/.", psm=1, lang='eng') as api:
    for img in sorted_pages:
        api.SetImageFile(img)
        
        # print(api.GetUTF8Text())
        
        print("~~~~~~~")
        print("kind computer currently doing:", img)
        api.Recognize()
        words = api.AllWords()
        confi = api.AllWordConfidences()
                     
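        # Replace every word whose confidence is below LIMIT with the placeholder.
        for x in range(len(confi)):
            if confi[x] < LIMIT:
                print("we removed word index:", x, "value:", words[x], "confidence:", confi[x])
                words[x] = REMOVE_WORD

        output = ' '.join(words)

        print("processing done, page begins with:", output[0:60])
        # The trailing space keeps the last word of this page apart from the first word of the next.
        PROCESSED_TEXT += output + " "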


~~~~~~~
kind computer currently doing: /home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-1.jpg
we removed word index: 202 value: free confidence: 64
we removed word index: 203 value: time. confidence: 64
we removed word index: 367 value: re-+ed. confidence: 31
we removed word index: 441 value: 12 confidence: 60
processing done, page begins with: 17 18 MEDITATIONS ON FIRST PHILOSOPHY in which are demonstra
~~~~~~~
kind computer currently doing: /home/jovyan/data/output/jpegs/Descartes/Meditations/meditation_one/064d004a-f229-4ad0-9d5a-8203cad8839e-2.jpg
we removed word index: 20 value: | confidence: 44
we removed word index: 59 value: I confidence: 52
we removed word index: 161 value: experiences’ confidence: 26
we removed word index: 205 value: | confidence: 59
we removed word index: 254 value: [ confidence: 64
we removed word index: 342 value: | confidence: 50
we removed word index: 356 value: I confidence: 49
we removed word inde
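The masking logic of the cell above can be isolated from Tesseract so that it is easy to check on its own. The sketch below assumes two parallel lists, as returned by `AllWords()` and `AllWordConfidences()` in the cell above; `mask_low_confidence` is a hypothetical name, and the words and confidences in the usage lines are invented.

```python
from typing import List, Sequence, Tuple


def mask_low_confidence(
    words: Sequence[str],
    confidences: Sequence[int],
    limit: int = 65,
    placeholder: str = "++PLEASE_REMOVE_ME++",
) -> Tuple[List[str], List[Tuple[int, str, int]]]:
    """Replace words below `limit` with `placeholder`; also report what was replaced."""
    if len(words) != len(confidences):
        raise ValueError("words and confidences must have the same length")
    masked = [w if c >= limit else placeholder for w, c in zip(words, confidences)]
    removed = [(i, w, c) for i, (w, c) in enumerate(zip(words, confidences)) if c < limit]
    return masked, removed


words = ["I", "had", "free", "time."]  # invented sample
confs = [96, 93, 64, 64]               # invented sample
masked, removed = mask_low_confidence(words, confs)
print(" ".join(masked))
for index, word, conf in removed:
    print("we removed word index:", index, "value:", word, "confidence:", conf)
```

Returning the removed words alongside the masked text keeps the log output of the original cell available without printing inside the function.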

**4. post-processing (i):** discard faulty OCR output by eliminating sentences that contain words with low recognition confidence (i.e. words that were replaced by "++PLEASE_REMOVE_ME++" in step 3)

*required adjustments for book-specific notebooks: none*

In [66]:
import pandas as pd
import re
import spacy

from spacy.lang.en import English

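nlp = English()

# spaCy v2 API; in spaCy v3 this becomes nlp.add_pipe('sentencizer').
nlp.add_pipe(nlp.create_pipe('sentencizer'))

doc = nlp(PROCESSED_TEXT)

# One stripped string per sentence (Span.text is available in spaCy v2 and v3).
sentences = [sent.text.strip() for sent in doc.sents]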

In [76]:
# Blank out every sentence that contains the placeholder inserted in step 3.
for x in range(len(sentences)):
    if REMOVE_WORD in sentences[x]:
        # print("+++++ we removed sentence:", sentences[x])
        sentences[x] = ''

sentences_string = ' '.join(sentences)

print(sentences_string)

17 18 MEDITATIONS ON FIRST PHILOSOPHY in which are demonstrated the existence of God and the distinction between the human soul and the body FIRST MEDITATION What can be called into doubt Some years ago I was struck by the large number of falsehoods that I had accepted as true in my childhood, and by the highly doubtful nature of the whole edifice that I had subsequently based on them. I realized that it was necessary, once in the course of my life, to demolish everything completely and start again right from the foundations if I wanted to establish anything at all in the sciences that was stable and likely to last. But the task looked an enormous one, and I began to wait until I should reach a mature enough age to ensure that no subsequent time of life would be more suitable for tackling such inquiries. This led me to put the project off for so long that I would now be to blame if by pondering over it any further I wasted the time still left for carrying it out.  But to accomplish thi
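Blanking a sentence and then joining with spaces leaves a double space where the sentence used to be, as visible before "But to accomplish" in the output above. A variant that filters instead of blanking, and also reports how much text was lost, could look like the sketch below. `drop_flagged` is a hypothetical name and the sample sentences are invented.

```python
from typing import List, Sequence, Tuple


def drop_flagged(
    sentences: Sequence[str],
    marker: str = "++PLEASE_REMOVE_ME++",
) -> Tuple[List[str], int]:
    """Keep only sentences without the marker; return them and the number dropped."""
    kept = [s for s in sentences if marker not in s]
    return kept, len(sentences) - len(kept)


sample = [  # invented sample
    "I realized that it was necessary to start again.",
    "I had ++PLEASE_REMOVE_ME++ ++PLEASE_REMOVE_ME++",
    "But the task looked an enormous one.",
]
kept, dropped = drop_flagged(sample)
print(" ".join(kept))
print(f"dropped {dropped} of {len(sample)} sentences")
```

The dropped count gives a rough signal of OCR quality per book: a high share of discarded sentences may suggest lowering `LIMIT` or improving the scan.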

**5. post-processing (ii):** write the OCR output to a single coherent txt file

*required adjustments for book-specific notebooks: 1. path file_output_processed*

In [77]:
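# Book-specific path: adjust for each notebook.
file_output_processed = home + "/data/output/txt/Descartes/Meditations/meditation_one.txt"

# The with-block closes the file even if writing fails; UTF-8 preserves non-ASCII characters.
with open(file_output_processed, "w", encoding="utf-8") as f:
    f.write(sentences_string)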
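Writing fails if the author/title folder does not exist yet, which is likely the first time the template is used for a new book. A minimal sketch that creates missing parent folders before writing is shown below; `write_text_file` is a hypothetical helper, and the temporary directory and sample text are only there to make the example self-contained.

```python
import tempfile
from pathlib import Path


def write_text_file(target: Path, text: str) -> Path:
    """Create missing parent folders, then write `text` to `target` as UTF-8."""
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# A temporary directory stands in for the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    out = write_text_file(
        Path(tmp) / "txt" / "Descartes" / "Meditations" / "meditation_one.txt",
        "Some years ago I was struck by the large number of falsehoods.",
    )
    print(out.read_text(encoding="utf-8"))
```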