# Automated PDF Text Search and Analyzer (Beta Version)
Prepared by Hiro Yokoi, July 10, 2019<br><br>
This is a provisional PDF analyzer that simplifies **text search** for the **Portfolio Review and Analysis (PRA)** on managing urban spatial growth.

**Limitations of this Script**
- This PDF analyzer currently handles only ONE PDF file at a time. A later version is intended to analyze all the PDF files in a folder at once.
- If the PDF is a scanned document that has to be read through OCR, the extracted text may be inaccurate, particularly for multi-word phrases.

**What you have to do**
- Change `your_folder_path` and `your_pdf_file_name`. The notebook then analyzes the text in the PDF automatically.
- To change the search text, edit the `search_phrases` list, for example `search_phrases = ['aaa', 'bbb', 'ccc', 'ddd']`. Type every phrase in **lower case**: the extracted text is converted to lower case before the case-sensitive search, so a phrase containing capital letters will never match.

### Import packages


In [11]:
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re
import os
import glob
#from textblob import TextBlob
#from nltk.tokenize import word_tokenize
#from nltk.corpus import stopwords
#import nltk
#nltk.download('punkt')

### Input your folder path and file name (you need to change both values)

In [12]:
# Please change the folder path and file name.
# Replace only the part inside the quotes after r'. The path should be something like r'C:\Users\wbXXXXXX\......'.
your_folder_path = r'C:\Users\wb535782\Desktop\PRA\P050772'
your_pdf_file_name = r'P050772_ICR.pdf'

### Extract text from PDF file

In [13]:
# Establish complete file path
complete_path = os.path.join(your_folder_path, your_pdf_file_name)

# Open the PDF file; the with-block closes it once all pages are read.
with open(complete_path, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # Get the number of pages so that every page is searched.
    num_pages = pdfReader.numPages

    # Extract the text of each page.
    text = ""
    for count in range(num_pages):
        pageObj = pdfReader.getPage(count)
        text += pageObj.extractText()

# If no text was extracted, the PDF is probably a scanned document, so fall back to OCR.
# textract.process() returns bytes, so decode the result into a string.
if text == "":
    text = textract.process(complete_path, method='tesseract', language='eng').decode('utf-8')

# Clean the text: remove newlines and tabs, convert to lower case, and collapse repeated white space.
text = text.replace('\n','').replace('\t', '').lower()
text = " ".join(text.split())

In [14]:
# Show the first 200 characters.
text[:200]

'document of the world bank report no: icr00001050 implementation completion and results report (ibrd-70370) on a loan in the amount of eur 218.2 million (us$ 202.1 million equivalent) to the federativ'
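
The cleaning step can be tried on a short string before running it on a whole report. The sketch below is a minimal variant, not the notebook's exact code: the helper name `clean_text` and the sample string are made up for illustration, and it assumes that a line break should become a space rather than be deleted, so that words on either side of a line break stay separate.

```python
def clean_text(raw):
    """Lower-case the text and collapse every run of whitespace into one space."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # newlines) and drops empty pieces, so joining with " " normalizes spacing.
    return " ".join(raw.lower().split())


# Hypothetical sample with double spaces, a newline, and a tab.
sample = "Urban  Development\n\tand Land   Use"
print(clean_text(sample))  # urban development and land use
```

Whether line breaks should be deleted or turned into spaces depends on how the PDF extractor places them, so it is worth checking a few pages of real output before choosing.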

### Define search phrases

In [15]:
search_phrases = [
    'raise awareness',
    'regulatory reform',
    'institutional capacity',
    'policy reform',
    'informal settlement',
    'urban',
    'peri-urban',
    'gender',
    'poverty map',
    'land governance assessment framework country diagnostics',
    'lgaf',
    'annual land and poverty conference',
    'land market assessment course',
    'urbanization review',
    'city development strategies',
    'city development strategy',
    'urbanization review',
    'urban research symposium',
    'land use planning course',
    'land market assessment toolkits',
    'tod implementation resources',
    'transforming transportation conference',
    'tokyo distance learning center',
    'tdlc',
    'leaders in urban transport planning course',
    'land readjustment course',
    'approaches to urban slums',
    'street addressing',
    'street addressing and the management of cities course',
    'upgrading urban informal settlements course',
    'cadastre law',
    'cadaster law',
    'cadastre modernization',
    'cadaster modernization',
    'property rights',
    'titling',
    'land use',
    'land assembly regulation',
    'property tax',
    'public land management',
    'expropriation mechanism',
    'land readjustment regulatory framework',
    'public-private investment',
    'ppp',
    'separation and clarity of institutional mandates',
    'participatory practice',
    'metropolitan',
    'peri-urban',
    'multi-use cadaster',
    'multi-use cadastre',
    'multi use cadaster',
    'multi use cadastre',
    'integrated cadaster',
    'integrated cadastre',
    'hardware',
    'equipment',
    'software',
    'database',
    'management information system',
    'geospatial data',
    'geographic information system',
    'gis',
    'innovation',
    'innovative technology',
    'land allocated for public infrastructure',
    'delineation',
    'land use',
    'regulated land use',
    'building code',
    'monitoring land use',
    'land use monitoring',
    'map',
    'land use planning',
    'ngo',
    'cso',
    'planning professionals',
    'universities',
    'university',
    'academia',
    'spatial planning',
    'participatory urban and territorial planning',
    'mapping system',
    'national planning agencies',
    'national planning agency',
    'urban plan',
    'territorial plan',
    'institutional arrangement',
    'transit oriented development',
    'transport-led land use',
    'slum upgrading',
    'land readjustment',
    'land development',
    'upgrade',
    'upgrading',
    'rehabilitation',
    'modernization',
    'consolidating',
    'consolidate',
    'land value',]
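
The list above contains a few phrases more than once (`'urbanization review'`, `'peri-urban'`, `'land use'`), and each repeat becomes a repeated column in the result table. A minimal sketch for spotting and removing repeats is shown below; the helper names `find_duplicates` and `deduplicate` and the short example list are illustrative and not part of the original notebook.

```python
from collections import Counter


def find_duplicates(phrases):
    """Return the phrases that occur more than once, in first-seen order."""
    counts = Counter(phrases)
    # dict.fromkeys keeps one copy of each phrase in its original position.
    return [p for p in dict.fromkeys(phrases) if counts[p] > 1]


def deduplicate(phrases):
    """Drop repeated phrases while keeping the original order."""
    return list(dict.fromkeys(phrases))


# Hypothetical short list standing in for search_phrases.
example_phrases = ['urban', 'peri-urban', 'land use', 'peri-urban', 'land use']
print(find_duplicates(example_phrases))  # ['peri-urban', 'land use']
print(deduplicate(example_phrases))      # ['urban', 'peri-urban', 'land use']
```

Running `search_phrases = deduplicate(search_phrases)` before the analysis would be one way to keep each phrase as a single column.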

### Analyze the frequency of search phrases in the PDF file

In [16]:
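#phrases = ['urban', 'corresponding to']

# Count how often each search phrase occurs in the text.
dicts = {}

for phrase in search_phrases:
    if phrase in text:
        dicts[phrase] = text.count(phrase)

# Build the one-row result table once, after all phrases are counted.
# Phrases that never occur in the text stay empty (null).
df = pd.DataFrame(dicts, index = [your_pdf_file_name], columns = search_phrases)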
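
Note that `text.count(phrase)` counts substrings, so a short phrase such as `'map'` is also counted inside `'mapping'`, and `'gis'` inside `'registry'`. If whole-word matches are preferred, a regular expression with word-boundary checks is one option. The sketch below is an alternative to the notebook's counting, not a description of it; the helper name `count_whole_phrase` and the sample sentence are made up.

```python
import re


def count_whole_phrase(phrase, text):
    """Count occurrences of phrase that are not embedded in a longer word."""
    # (?<!\w) and (?!\w) require a non-word character (or the string edge)
    # on both sides; re.escape keeps characters such as '-' literal.
    pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
    return len(re.findall(pattern, text))


# Hypothetical sample sentence, already lower-cased like the notebook's text.
sample_text = "the gis unit updated the land registry map and the mapping system"
print(sample_text.count('map'), count_whole_phrase('map', sample_text))  # 2 1
print(sample_text.count('gis'), count_whole_phrase('gis', sample_text))  # 2 1
```

The trade-off is that whole-word matching no longer picks up plurals or other inflected forms (for example `'upgrade'` would not match `'upgrades'`), so those forms would need their own entries in the phrase list.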

### Show the result including null values.

In [17]:
df

| | raise awareness | regulatory reform | institutional capacity | policy reform | informal settlement | urban | peri-urban | gender | poverty map | land governance assessment framework country diagnostics | ... | slum upgrading | land readjustment | land development | upgrade | upgrading | rehabilitation | modernization | consolidating | consolidate | land value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P050772_ICR.pdf | | | | | | 8 | 4 | 2 | | | ... | | | | | 1 | 1 | | 1 | 3 | |


### Remove null values and show the results.

In [18]:
df_na_dropped = df.dropna(axis=1, how='all')
df_na_dropped

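| | urban | peri-urban | gender | peri-urban.1 | equipment | database | management information system | gis | innovation | map | ngo | university | institutional arrangement | upgrading | rehabilitation | consolidating | consolidate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P050772_ICR.pdf | 8 | 4 | 2 | 4 | 7 | 1 | 5 | 11 | 1 | 2 | 9 | 12 | 3 | 1 | 1 | 1 | 3 |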


### Export the result with null values

In [19]:
df.to_csv(your_pdf_file_name + '.csv')

### Export the result without null values

In [20]:
df_na_dropped.to_csv(your_pdf_file_name + '_nonnull.csv')
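
Because the export cells append to the full file name, the outputs are named `P050772_ICR.pdf.csv` and `P050772_ICR.pdf_nonnull.csv`. If cleaner names are wanted, the `.pdf` extension can be dropped first. This is a small sketch only; the helper name `csv_name` is hypothetical.

```python
import os


def csv_name(pdf_file_name, suffix=''):
    """Build a CSV file name from a PDF file name, dropping the '.pdf' part."""
    # os.path.splitext splits 'name.pdf' into ('name', '.pdf').
    stem, _extension = os.path.splitext(pdf_file_name)
    return stem + suffix + '.csv'


print(csv_name('P050772_ICR.pdf'))              # P050772_ICR.csv
print(csv_name('P050772_ICR.pdf', '_nonnull'))  # P050772_ICR_nonnull.csv
```

The result could then be passed to `to_csv`, for example `df.to_csv(csv_name(your_pdf_file_name))`.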

<br><br><br><br>
## (Under Construction) Analyzing Multiple PDF files all at once

In [None]:
pdf_dir = r"C:\Users\wb535782\Desktop\PRA\P050772"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
pdf_files

In [None]:
# Collect the cleaned text of every PDF, keyed by file path.
dicti = {}

for pdf_file in pdf_files:

    # Open the PDF file; the with-block closes it once all pages are read.
    with open(pdf_file, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # Get the number of pages so that every page is searched.
        num_pages = pdfReader.numPages

        # Extract the text of each page.
        text = ""
        for count in range(num_pages):
            pageObj = pdfReader.getPage(count)
            text += pageObj.extractText()

    # If no text was extracted, the PDF is probably a scanned document, so fall back to OCR.
    # textract.process() returns bytes, so decode the result into a string.
    if text == "":
        text = textract.process(pdf_file, method='tesseract', language='eng').decode('utf-8')

    # Clean the text: remove newlines and tabs, convert to lower case, and collapse repeated white space.
    text = text.replace('\n','').replace('\t', '').lower()
    text = " ".join(text.split())

    # Store the text of this file without overwriting the earlier files.
    dicti[pdf_file] = text

In [None]:
df2 = pd.DataFrame.from_dict(dicti,  orient = 'index', columns = ['text'])
df2

In [None]:
df2.to_csv('test.csv')
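
Once the text of every file is stored in a dictionary, the single-file phrase count can be repeated per file. The sketch below shows one possible shape for that step, using plain dictionaries so that it runs without any PDF; the helper names, file names, and sample texts are all hypothetical, and it uses the same substring counting as the single-file analysis.

```python
def count_phrases(text, phrases):
    """Return {phrase: count} for the phrases that occur in text."""
    return {p: text.count(p) for p in phrases if p in text}


def count_phrases_per_file(texts_by_file, phrases):
    """Return {file name: {phrase: count}} for every file's extracted text."""
    return {name: count_phrases(text, phrases)
            for name, text in texts_by_file.items()}


# Hypothetical extracted texts standing in for real PDF content.
texts_by_file = {
    'a.pdf': 'urban land use and peri-urban land use',
    'b.pdf': 'gender and property rights',
}
phrases = ['urban', 'land use', 'gender']
print(count_phrases_per_file(texts_by_file, phrases))
# {'a.pdf': {'urban': 2, 'land use': 2}, 'b.pdf': {'gender': 1}}
```

Passing the nested dictionary to `pd.DataFrame.from_dict(result, orient='index')` should then give one row per file, with null values where a phrase does not occur in that file.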