# OCR Lightweight Pipeline Proof of Concept

This notebook explores using OCR libraries to read policy paper PDFs related to forestry. It demonstrates only the bare bones of how the pipeline will work.

Install the libraries `pytesseract`, `opencv-python`, `pillow`, and `wand`, along with their dependencies (the `brew` commands assume Homebrew is available):

`pip install pillow`

`brew install tesseract`

`pip install pytesseract`

`pip install opencv-python`

`pip install wand`

`brew install freetype imagemagick`

In [2]:
# PIL is imported as a module (used as PIL.Image) so it does not clash with wand's Image
import PIL
from wand.image import Image
import pytesseract
import cv2
import os
import numpy as np
import io

# import PyPDF2 
# import textract
# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords

# import wand

## Text Extraction

Inspiration from https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/

In [7]:
# If Tesseract is not on your PATH, point tesseract_cmd at the tesseract executable
# (not at pytesseract.py). The path below is only an example; run `which tesseract` to find yours.
# pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
os.listdir("sample data/Kenya")[0]

'.DS_Store'
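The first directory entry above is a hidden macOS file rather than a PDF, so a batch run would need to filter the listing. A minimal sketch of one way to do that follows; the helper name `list_pdfs` is hypothetical and not part of this notebook, and the extension match is case-insensitive because the sample file ends in `.PDF`.

```python
import os

def list_pdfs(directory):
    """Return sorted paths of PDF files in `directory`, skipping hidden files."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        # Skip dotfiles such as .DS_Store and anything without a .pdf extension
        if not name.startswith(".") and name.lower().endswith(".pdf")
    )

# Possible usage with the sample data folder used in this notebook:
# pdf_paths = list_pdfs("sample data/Kenya")
```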

In [4]:
# Load the PDF with the wand library, rasterizing its pages at a resolution of 300
image_pdf = Image(filename="sample data/Kenya/AgricultureFisheriesandFoodAuthorityNo13of2013.PDF", resolution=300)

In [4]:
# Convert the pages to JPEG so that each one can be fed into pytesseract
image_jpeg = image_pdf.convert('jpeg')

In [5]:
toy_image = []
final_text = []

In [6]:
# Convert the images on each page to blobs (binary strings)
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    toy_image.append(img_page.make_blob('jpeg'))


In [7]:
type(toy_image[0])

bytes

Now that we have these image representations, we can run OCR on each image in `toy_image`.

In [8]:
len(toy_image)

32

This makes sense, as the corresponding PDF has 32 pages. Therefore, each element of the `toy_image` list represents a page in the original PDF.

In [9]:
for img in toy_image:
    text = pytesseract.image_to_string(PIL.Image.open(io.BytesIO(img)))
    final_text.append(text)

In [13]:
# Page 1
final_text[0]

'SPECIAL ISSUE\n\nKenya Gazette Supplement No. 25 (Acts No. 13)\n\n \n\n \n\nREPUBLIC OF KENYA\n\nKENYA GAZETTE SUPPLEMENT\n\nACTS, 2013\n\nNAIROBI, 25th January, 2013\n\n \n\nCONTENT\n\nAct—\n\nThe Agriculture, Fisheries and Food Authority Act, 2013 ...... ccc 183\n\nearner\n\nate\n\nNATIONAL COUNCIL FOR LAW REPORTING\n{\n| RECEIVED\n\nbods\nbie\n\n \n  \n \n \n\n©, Box 10444-00700\nNAIROBI, KE\n\nFELL? 19231 FAK er tab\neee nnn\n\nLe\n\nssamcns anes ten\n\nPRINTED AND PUBLISHED BY THE GOVERNMENT PRINTER, NAIROBI'

In [14]:
# Page 2 
final_text[1]

'THE AGRICULTURE, FISHERIES AND FOOD AUTHORITY\nACT\n\nNo. 13 of 2013\nDate of Assent: 14th January, 2013\nDate of Commencement: 25th January, 2013\nARRANGEMENT OF SECTIONS\n\nSection\nPART I—PRELIMINARY\n\n1—-Short title and commencement.\n\n2—-Interpretation.\n\nPART II—ESTABLISHMENT, FUNCTIONS AND POWERS OF\nTHE AUTHORITY\n\n3—Establishment of the Authority.\n4—Functions of the Authority.\n5—Board of the Authority.\n6— Powers of the Authority.\n7—Conduct of business and affairs of the Authority.\n8—Delegation by the Authority.\n9—Remuneration of members of the Board.\n10-—— The Director General.\n1*1—Organization of the Secretariat of the Authority.\n12—Staff.\n13~-The common seal of the Authority.\n14—Protection from personal liability.\n15—Liability for damages.\nPART ITI—FINANCES OF THE AUTHORITY\n16—Funds of the Authority.\n17—Financial year.\n18—Annual estimates.\n19—Accounts and audit.\n\n20—Investment of funds.'

In [15]:
# Page 3
final_text[2]

'184\nNo. 13 Agriculture, Fisheries and Food Authority 2013\n\nPART IV—POLICY GUIDELINES ON DEVELOPMENT,\nPRESERVATION AND UTILIZATION OF AGRICULTURAL\nLAND\n\n21—Land development guidclincs.\n\n22—Rules on preservation, utilization and development of agricultural\nland.\n\n23—Land preservation guidelines.\nPART V—PROVISIONS ON NOXIOUS OR INVASIVE WEEDS\n24—Power to declare plant a noxious or invasive weed.\n25—Duty to report.\n26—Power of county government officer to enter land.\n27—Order by county government to clear land.\n28—Eradication of weed by county government.\nPART VI—RESPONSIBILITY OF COUNTY GOVERNMENTS\n29—Respective roles of national and county governments.\n30— Penalty for non-comphance with order.\n31— Register of land development orders.\n32—- Land preservation orders.\n33— Appeal against a land preservation order.\n34— Cancellation and amendments of orders.\n35— Register of orders.\n36— Failure to comply with an order.\n37— Penalty for failure to comply.\n\n38— Right 

## Data Cleaning and Format Conversion

### Idea for conversion

For topical analysis, it might be valuable to store each document as a dictionary. The keys would be the paragraph positions (starting at 0), and the values would be the paragraphs themselves.

For example, let's assume our text is: 

"Aristotle was born in 384 B.C. in Stagira in northern Greece. Both of his parents were members of traditional medical families, and his father, Nicomachus, served as court physician to King Amyntus III of Macedonia. His parents died while he was young, and he was likely raised at his family’s home in Stagira. At age 17 he was sent to Athens to enroll in Plato’s Academy. He spent 20 years as a student and teacher at the school, emerging with both a great respect and a good deal of criticism for his teacher’s theories. Plato’s own later writings, in which he softened some earlier positions, likely bear the mark of repeated discussions with his most gifted student.

When Plato died in 347, control of the Academy passed to his nephew Speusippus. Aristotle left Athens soon after, though it is not clear whether frustrations at the Academy or political difficulties due to his family’s Macedonian connections hastened his exit. He spent five years on the coast of Asia Minor as a guest of former students at Assos and Lesbos. It was here that he undertook his pioneering research into marine biology and married his wife Pythias, with whom he had his only daughter, also named Pythias.

In 342 Aristotle was summoned to Macedonia by King Philip II to tutor his son, the future Alexander the Great—a meeting of great historical figures that, in the words of one modern commentator, “made remarkably little impact on either of them.”"

If we store this in `dictionary`, then:

`dictionary[0]` =  "Aristotle was born in 384 B.C. in Stagira in northern Greece. Both of his parents were members of traditional medical families, and his father, Nicomachus, served as court physician to King Amyntus III of Macedonia. His parents died while he was young, and he was likely raised at his family’s home in Stagira. At age 17 he was sent to Athens to enroll in Plato’s Academy. He spent 20 years as a student and teacher at the school, emerging with both a great respect and a good deal of criticism for his teacher’s theories. Plato’s own later writings, in which he softened some earlier positions, likely bear the mark of repeated discussions with his most gifted student."

`dictionary[1]` = "When Plato died in 347, control of the Academy passed to his nephew Speusippus. Aristotle left Athens soon after, though it is not clear whether frustrations at the Academy or political difficulties due to his family’s Macedonian connections hastened his exit. He spent five years on the coast of Asia Minor as a guest of former students at Assos and Lesbos. It was here that he undertook his pioneering research into marine biology and married his wife Pythias, with whom he had his only daughter, also named Pythias."

`dictionary[2]` = "In 342 Aristotle was summoned to Macedonia by King Philip II to tutor his son, the future Alexander the Great—a meeting of great historical figures that, in the words of one modern commentator, “made remarkably little impact on either of them.”"
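The conversion illustrated above can be sketched in a few lines. This is a minimal sketch, assuming paragraphs are separated by blank lines (as in the example text); the function name `paragraphs_to_dict` is hypothetical, and the sample paragraphs are shortened to their first sentences. Real OCR output may break paragraphs differently and would likely need cleaning first.

```python
import re

def paragraphs_to_dict(text):
    """Map each paragraph's position (0-based) to the paragraph text."""
    # Split wherever one or more blank (empty or whitespace-only) lines occur
    chunks = re.split(r"\n\s*\n", text.strip())
    # Trim each chunk and drop any that are empty
    paragraphs = [chunk.strip() for chunk in chunks if chunk.strip()]
    return dict(enumerate(paragraphs))

sample = (
    "Aristotle was born in 384 B.C. in Stagira in northern Greece.\n\n"
    "When Plato died in 347, control of the Academy passed to his nephew Speusippus.\n\n"
    "In 342 Aristotle was summoned to Macedonia by King Philip II to tutor his son."
)
dictionary = paragraphs_to_dict(sample)
print(dictionary[1])
```

Integer keys keep the original paragraph order explicit, which makes it easy to refer back to a position after filtering or reordering paragraphs during analysis.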

For documents with subsections, we can expand this conversion into a nested dictionary: the outer keys are the subsections, and each inner dictionary maps paragraph positions to the paragraphs within that subsection.
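A rough sketch of the nested layout is shown below. It assumes that subsection headings are lines beginning with `PART` followed by a Roman numeral, as in the OCR output of the Kenyan act above; the `HEADING` pattern, the function names, and the `PREAMBLE` key for text before the first heading are all illustrative assumptions. Headings that wrap onto a second line are not handled: the continuation line would end up in the section body.

```python
import re

# Assumption: a subsection heading is a line starting with "PART" and capital letters
HEADING = re.compile(r"^PART\s+[A-Z]+\b")

def split_paragraphs(text):
    """Map paragraph position to paragraph text, splitting on blank lines."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return dict(enumerate(c.strip() for c in chunks if c.strip()))

def sections_to_nested_dict(text, preamble_key="PREAMBLE"):
    """Group lines under the most recent heading, then split each group into paragraphs."""
    sections = {preamble_key: []}
    current = preamble_key
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            current = stripped
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    nested = {}
    for heading, lines in sections.items():
        paragraphs = split_paragraphs("\n".join(lines))
        if paragraphs:  # skip sections with no body text
            nested[heading] = paragraphs
    return nested

sample = (
    "PART I—PRELIMINARY\n"
    "1—Short title and commencement.\n\n"
    "2—Interpretation.\n\n"
    "PART V—PROVISIONS ON NOXIOUS OR INVASIVE WEEDS\n"
    "24—Power to declare plant a noxious or invasive weed.\n"
)
nested = sections_to_nested_dict(sample)
print(nested["PART I—PRELIMINARY"][1])
```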

### Data Cleaning
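One possible first cleaning step, suggested by the whitespace-only lines in the page 1 output above, is to normalize whitespace before splitting into paragraphs. The sketch below is an assumption about what this step could look like, not a settled design; `clean_page_text` is a hypothetical name, and the function does nothing about misrecognized characters.

```python
import re

def clean_page_text(text):
    """Normalize whitespace in one page of OCR output."""
    # Strip spaces from every line; whitespace-only lines become empty
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()

raw = "REPUBLIC OF KENYA\n\n \n\n \n\nKENYA GAZETTE SUPPLEMENT\n\nACTS, 2013"
print(clean_page_text(raw))
```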

### Format Conversion