This Jupyter notebook describes and executes our conversion of the World Bank loan documents from PDFs to plain-text files.

Most of the loan documents are native PDFs with the loan information embedded as text; we use pdfminer.six to extract the text from these.
The documents on which this fails are largely scanned image files, for which we use Tesseract OCR (through pytesseract).

Preconditions:
1. pdfminer.six, pytesseract, pdf2image, and pandas are installed in your Python environment
2. The [Tesseract software](https://github.com/tesseract-ocr/tesseract) is also installed (dependency for pytesseract)
3. Poppler is also installed; see [here for Windows installation](https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows) (dependency for pdf2image)
4. The World Bank Loans are stored in a single directory, 'world_bank_loans', in the data directory.
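
The first three preconditions can be checked before starting the long-running cells. The following is a minimal sketch, not part of the original pipeline: `check_preconditions` is a hypothetical helper, and the executable names `tesseract` and `pdftoppm` (one of the Poppler tools) are assumptions that may differ on your system.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above
# (pdfminer.six is imported as "pdfminer").
REQUIRED_MODULES = ["pdfminer", "pytesseract", "pdf2image", "pandas"]

# Executable names are assumptions: "tesseract" for Tesseract and
# "pdftoppm" for Poppler; adjust them if your installation differs.
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]


def check_preconditions():
    """Return a dict mapping each requirement name to True (found) or False."""
    status = {}
    for module in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be located
        status[module] = importlib.util.find_spec(module) is not None
    for binary in REQUIRED_BINARIES:
        # shutil.which returns None when the executable is not on PATH
        status[binary] = shutil.which(binary) is not None
    return status


for name, found in check_preconditions().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A `MISSING` entry for a binary only means it is not on `PATH`; on Windows, Poppler and Tesseract are sometimes installed elsewhere and configured explicitly.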

This notebook saves the resulting text files to two subdirectories of the data directory: `pdfminer_texts` and `pytesseract_texts`.

In [24]:
import pdfminer.high_level
import os
import os.path
import pandas as pd

import pytesseract
from pdf2image import convert_from_path

In [17]:
DATA_DIR = os.path.join('..','data')
WORLD_BANK_LOANS_DIR = os.path.join(DATA_DIR,'world_bank_loans')

files = os.listdir(path='world_bank_loans')
files.sort()
print(files)
print(len(files))

['1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf', '1990_april_24_668811468165272290_conformed-copy--c2120--water-supply-project--loan-agreement.pdf', '1990_april_25_904191468298750561_conformed-copy--l3190--environment-management-project--loan-agreement.pdf', '1990_april_30_410811468040573756_conformed-copy--l3180--rural-electrification-project--loan-agreement.pdf', '1990_april_30_725911468042268845_conformed-copy--l3182--third-telecommunications-project--loan-agreement.pdf', '1990_april_5_790191468211457471_conformed-copy--l3076--west-mitidja-irrigation-project--loan-agreement.pdf', '1990_august_10_460651468271855106_conformed-copy--l3203--universities-science-and-technology-research-project--loan-agreement.pdf', '1990_august_10_885551468047414878_conformed-copy--l3202--second-technology-advancement-project--loan-agreement.pdf', '1990_august_10_939121468273631507_conformed-copy--l3178--juam-regional-water-supply-project--loan-agree

In [32]:
first_parse_list=[]

## WARNING

This next block is **slow**.
It reads every file in world_bank_loans/, and uses pdfminer's extract_text() function to extract pdf text.
This may take up to an hour.

In [33]:
for file in files:
    raw_parse = pdfminer.high_level.extract_text(os.path.join('world_bank_loans',file))
    first_parse_list.append({'filename':file,'raw_parse':raw_parse})
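
The loop above halts at the first PDF on which `extract_text` raises an error, which is costly in a run this long. Below is a minimal sketch of a more forgiving variant. `extract_all` and its `extract` parameter are hypothetical names of ours, not part of pdfminer, and we assume that recording a failed file as empty text is acceptable because very short texts are sent to OCR later anyway.

```python
import os


def extract_all(pdf_dir, filenames, extract=None):
    """Extract text from each PDF, recording failures instead of stopping.

    `extract` is a hypothetical hook for swapping in another extractor;
    by default it is pdfminer.high_level.extract_text.
    """
    if extract is None:
        from pdfminer.high_level import extract_text as extract

    results, failures = [], []
    for filename in filenames:
        try:
            raw_parse = extract(os.path.join(pdf_dir, filename))
        except Exception as error:  # broad on purpose: keep the run going
            failures.append({'filename': filename, 'error': repr(error)})
            raw_parse = ''  # empty text falls below the length cutoff later
        results.append({'filename': filename, 'raw_parse': raw_parse})
    return results, failures
```

With this sketch, `first_parse_list, failures = extract_all('world_bank_loans', files)` would produce the same list of dicts as the loop above, plus a list of the files that could not be read.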

In [34]:
first_parse_list[0:5]

[{'filename': '1990_april_24_587321468019152780_conformed-copy--l3186--forestry-sector-project--loan-agreement.pdf',
  'raw_parse': 'd\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\nl\n\ni\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\ni\n\nl\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\n\nt\n\n \n\nu\nA\ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\nl\n\ni\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\ni\n\nl\n\ni\nl\n\nb\nu\nP\n\nCONFORMED COPY\n\nLOAN NUMBER 3186 IVC\n\nLoan Agreement\n\n(Forestry Sector Project)\n\nREPUBLIC OF COTE D’IVOIRE\n\nbetween\n\nand\n\nINTERNATIONAL BANK FOR RECONSTRUCTION\nAND DEVELOPMENT\n\nDated April 24, 1990\n\nLOAN AGREEMENT\n\nLOAN NUMBER 3186 IVC\n\nAGREEMENT, dated April 24, 1990, between REPUBLIC OF COTE D’IVOIRE (the \nBorrower) and INTERNATIONAL BANK FOR RECONSTRUCTION AND DEVELOPMENT (the Bank).\n\nWHEREAS (A) the Borrower, having satisfied itself as to the feasibil

This gets stored in a DataFrame to make the rest of the processing more transparent.

In [37]:
first_parse_df = pd.DataFrame(first_parse_list)

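Most of the extracted texts begin with the following junk string, apparently the "Public Disclosure Authorized" stamp read one character per line in reverse, which we strip away. Note that `str.strip` treats its argument as a set of characters rather than a literal prefix, so it removes any run of these characters from both ends of the text: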

In [41]:
first_parse_df['de_headed'] = first_parse_df['raw_parse'].apply(lambda x: x.strip('d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\nl\n\ni\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\ni\n\nl\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\n\nt\n\n \n\nu\nA\ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\nl\n\ni\n\ni\nl\n\nb\nu\nP\n\nd\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no\nc\ns\nD\n \nc\n\ni\n\nl\n\ni\nl\n\nb\nu\nP\n\n'))
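
Because `str.strip` works on a character set, it can remove more than the header. The toy example below illustrates the difference from a literal prefix removal. `remove_prefix` is a hypothetical helper, and the header and document strings are shortened, made-up stand-ins rather than text from the corpus.

```python
def remove_prefix(text, prefix):
    """Remove `prefix` from the start of `text` only if it is literally there."""
    return text[len(prefix):] if text.startswith(prefix) else text


header = 'd\ne\nz\n'               # shortened stand-in for the junk header
document = 'd\ne\nz\nzoned deed'   # made-up text, not from the corpus

# strip() removes any of the characters d, e, z and newline from both ends
print(repr(document.strip(header)))           # 'oned '

# the literal-prefix version leaves the body of the text untouched
print(repr(remove_prefix(document, header)))  # 'zoned deed'
```

In this corpus the text after the header typically starts with capital letters such as "CONFORMED COPY", which are not in the stripped set, so the start of each document is mostly safe; the end of a document may still lose a few trailing characters.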

After stripping away this junk header, we can use the length of pdfminer's parsed text to determine whether pdfminer successfully parsed a given document.

A prior search showed that 1000 characters is a good cutoff: if the stripped text is 1000 characters long or shorter, we assume pdfminer failed to parse the document.
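
The search itself is not recorded in this notebook. One way such a search could look, as a sketch only, is to count how many documents fall at or below each candidate cutoff and look for a wide range over which the count does not change. The sample values are a few of the lengths this notebook reports further down, and `count_at_or_below` is a hypothetical helper.

```python
from bisect import bisect_right

# A handful of text lengths taken from this notebook's own output;
# a real search would use the lengths of all documents.
lengths = sorted([40838, 16557, 32534, 40263, 32938, 17860, 17, 19, 22, 13])


def count_at_or_below(sorted_lengths, cutoff):
    """Number of documents whose text length is <= cutoff."""
    return bisect_right(sorted_lengths, cutoff)


for cutoff in (10, 100, 1000, 20000):
    print(cutoff, count_at_or_below(lengths, cutoff))
# 10 0
# 100 4
# 1000 4
# 20000 6
```

In this small sample the count is the same at 100 and at 1000, which suggests a gap between failed and successful extractions; whether the full corpus shows the same gap is something only the complete data can confirm.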

In [44]:
first_parse_df['de_headed_length'] = first_parse_df['de_headed'].apply(len)

In [48]:
first_parse_df['de_headed_length']

0       40838
1       16557
2       32534
3       40263
4       32938
        ...  
3200    17860
3201       17
3202       19
3203       22
3204       13
Name: de_headed_length, Length: 3205, dtype: int64

In [62]:
unparseables_1000 = first_parse_df[first_parse_df['de_headed_length']<=1000]
print(len(unparseables_1000))
unparseables_1000

400


Unnamed: 0,filename,raw_parse,de_headed,de_headed_length
24,1990_december_4_833971468016863921_china--rura...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,13
64,1990_june_12_146771468325288731_pakistan--agri...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,28
84,1990_march_27_681121468040453518_india--hydera...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
89,1990_may_15_598881468034838695_india--integrat...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
90,1990_may_15_955111468244536485_india--integrat...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
91,1990_may_17_749401468251094028_india--seventh-...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
92,1990_may_17_788441468273707898_india--seventh-...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
95,1990_may_1_670561468035072614_india--technical...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,11
97,1990_may_22_349231468225306873_caribbean-regio...,,,22
125,1990_october_4_185891468252289466_india--andhr...,d\ne\nz\ni\nr\no\nh\nt\nu\nA\n \ne\nr\nu\ns\no...,,12


For the files in unparseables_1000, it makes sense to attempt OCR with pytesseract instead.

We first try this on a single randomly sampled file to check that the results justify the processing time:

In [79]:
# Pick one file at random (fixed seed so the choice is reproducible)
sample_files = unparseables_1000['filename'].sample(n=1,random_state=8001)
reparse_attempt_list = []
for file in sample_files:
    # Render each PDF page as an image, then OCR the pages one by one
    pages = convert_from_path(os.path.join('world_bank_loans',file))
    text = ''
    for page in pages:
        text += pytesseract.image_to_string(page)
    reparse_attempt_list.append(text)
reparse_df = pd.DataFrame({'filename':sample_files,'reparse':reparse_attempt_list})

In [85]:
reparse_df['reparse'].values

array(['1\n\nPubli¢ Disclosure Authorized\n\nPublic Disclosure Authorized\n\nOFFICIAL |\n| DOCUMENTS |\n\n. —\n\n \n\né\n\nLOAN NUMBER 8339-PE\n\nLoan Agreement\n\n(Cusco Transport Improvement Project —\nMejoramiento del Transporte en la Ciudad del Cusco)\n\nbetween\nREPUBLIC OF PERU\nand\n\nINTERNATIONAL BANK FOR RECONSTRUCTION\nAND DEVELOPMENT\n\nDated 5@3 QY L2014\n\n \n\n \n\x0cLOAN AGREEMENT\n\nAgreement dated 3!% 24 , 2014, between the REPUBLIC\nOF PERU (“Borrower”) and INTE ATIONAL BANK FOR RECONSTRUCTION\n\nAND DEVELOPMENT (“Bank”). The Borrower and the Bank hereby agree as follows:\n\n1.01.\n\n2.01.\n\n2.02.\n\n2.03.\n\nARTICLE I — GENERAL CONDITIONS; DEFINITIONS\n\nThe General Conditions (as defined in the Appendix to this Agreement)\nconstitute an integral part of this Agreement.\n\nUnless the context requires otherwise, the capitalized terms used in this\nAgreement have the meanings ascribed to them in the General Conditions or in\nthe Appendix to this Agreement.\n\nARTICLE

Looks good! This will take a while; let's proceed. 

## WARNING
This next code block is ***extremely slow***. It took a full night to run, as it processes 400 PDFs with OCR. This could potentially be parallelized.
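
As a rough sketch of what such parallelization might look like (we did not run the pipeline this way), the per-file work can be handed to a thread pool. `ocr_pdf`, `ocr_many`, and the default of 4 workers are hypothetical choices of ours. The sketch assumes that threads are sufficient because pdf2image and pytesseract delegate the heavy work to the external Poppler and Tesseract programs.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_pdf(pdf_path):
    """OCR every page of one PDF and return the concatenated text."""
    # Imported here so that ocr_many can be tried out without OCR installed.
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)
    return ''.join(pytesseract.image_to_string(page) for page in pages)


def ocr_many(pdf_paths, ocr_one=ocr_pdf, max_workers=4):
    """Apply `ocr_one` to each path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as pdf_paths
        return list(pool.map(ocr_one, pdf_paths))
```

A call such as `ocr_many([os.path.join('world_bank_loans', f) for f in unparseables_1000['filename']])` would then return the texts in the same order as the filenames. Each worker holds all page images of one PDF in memory, so a higher `max_workers` also raises memory use.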

In [86]:
reparse_attempt_list = []
for file in unparseables_1000['filename']:
    # Render each PDF page as an image, then OCR the pages one by one
    pages = convert_from_path(os.path.join('world_bank_loans',file))
    text = ''
    for page in pages:
        text += pytesseract.image_to_string(page)
    reparse_attempt_list.append(text)
reparse_df = pd.DataFrame({'filename':unparseables_1000['filename'],'reparse':reparse_attempt_list})

In [87]:
reparse_df

Unnamed: 0,filename,reparse
24,1990_december_4_833971468016863921_china--rura...,Public Disclosure Authorizea\n\nLOAN NUMBER 32...
64,1990_june_12_146771468325288731_pakistan--agri...,Public Disclosure Authorized\n\nPublic Disclos...
84,1990_march_27_681121468040453518_india--hydera...,Public Disclosure Authorized\n\n‘.nnmx:.-‘mmmm...
89,1990_may_15_598881468034838695_india--integrat...,P ] )\n\nORI\n\n_—\n\n- Ay 2T e\n- pat v R AT\...
90,1990_may_15_955111468244536485_india--integrat...,P ] )\n\nORI\n\n_—\n\n- Ay 2T e\n- pat v R AT\...
91,1990_may_17_749401468251094028_india--seventh-...,Public Disclosure Authorized\n\n[ OFFICIAL |\n...
92,1990_may_17_788441468273707898_india--seventh-...,Public Disclosure Authorized\n\n[ OFFICIAL |\n...
95,1990_may_1_670561468035072614_india--technical...,Public Disclosure Authorized\n\nPublic Disclos...
97,1990_may_22_349231468225306873_caribbean-regio...,LOAN NUMBER 3200 CRG\n\nLoan Agreement\n\n(Fif...
125,1990_october_4_185891468252289466_india--andhr...,Public Disclosure Authérized\n\nPublic Disclos...


We are now ready to write our processed text files, and we do so over the following two blocks.

In the first block, we will write the successful pdfminer extractions.

In [22]:
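PDFMINER_TEXT_DIR = os.path.join(DATA_DIR,'pdfminer_texts')

# Create the output directory if it does not already exist
os.makedirs(PDFMINER_TEXT_DIR, exist_ok=True)

# Write only the documents pdfminer parsed successfully (length above the cutoff)
for _,pdfminer_result in first_parse_df[first_parse_df['de_headed_length']>1000].iterrows():
    # Explicit UTF-8 so characters such as ’ are written the same on every platform
    with open(os.path.join(PDFMINER_TEXT_DIR,pdfminer_result['filename'].replace('.pdf','.txt')),'w',encoding='utf-8') as f:
        f.write(pdfminer_result['de_headed'])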

And now we will write the pytesseract results to text files as well.

In [24]:
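TESSERACT_TEXT_DIR = os.path.join(DATA_DIR,'pytesseract_texts')

# Create the output directory if it does not already exist
os.makedirs(TESSERACT_TEXT_DIR, exist_ok=True)

# Write the OCR text of every document that pdfminer could not parse
for _,pytesseract_result in reparse_df.iterrows():
    # Explicit UTF-8 so non-ASCII OCR output is written the same on every platform
    with open(os.path.join(TESSERACT_TEXT_DIR,pytesseract_result['filename'].replace('.pdf','.txt')),'w',encoding='utf-8') as f:
        f.write(pytesseract_result['reparse'])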