[PyPDF2 Tutorial](https://roytuts.com/extract-text-from-pdf-file-using-python/) <br>
[Tabula and Camelot Tutorial](https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/)

The following code is adapted from [this](https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests) Stack Overflow post. It downloads a PDF from a URL, so users do not need to download the file manually or edit the rest of the code. Only the URL of the PDF they want to work with has to change.

In [2]:
import requests
import shutil

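def download_file(url):
    #Name the local file after the last segment of the URL
    local_filename = url.split('/')[-1]
    #Stream the response so large files are not loaded into memory at once
    with requests.get(url, stream=True) as r:
        #Stop early on HTTP errors instead of saving an error page as a PDF
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename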
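
Taking the last `/`-separated segment of the URL works for the sample link used here, but it would keep any query string as part of the file name. As a minimal sketch, and not part of the original Stack Overflow answer, a hypothetical helper `filename_from_url` could derive the name with the standard library instead. The `example.com` URLs and the `download.pdf` fallback are assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.pdf"):
    """Return the last path segment of `url`, ignoring query string and fragment."""
    # urlsplit separates the path from any ?query or #fragment part
    path = urlsplit(url).path
    # unquote turns escapes such as %20 back into plain characters
    name = PurePosixPath(unquote(path)).name
    # Fall back to a default (assumed) name when the URL has no file segment
    return name or default


print(filename_from_url("http://www.africau.edu/images/default/sample.pdf"))
print(filename_from_url("https://example.com/files/report.pdf?version=2"))
print(filename_from_url("https://example.com/"))
```

The first two calls print `sample.pdf` and `report.pdf`, and the last falls back to `download.pdf`.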

## Using PyPDF2

Documentation available [here](https://pypi.org/project/PyPDF2/) <br>
We pull a PDF from the web and extract its text, partly as an exercise in downloading files with Python. The file is also included, so you can change the code to open the existing local copy instead.

In [3]:
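import PyPDF2 as pydf

#Download the PDF and obtain its local file name
file_name = download_file("http://www.africau.edu/images/default/sample.pdf")

text = []
#Open the file in a context manager so it is closed when we are done
with open(file_name, 'rb') as pdf_file:
    #Note: PyPDF2 3.x renames these calls to PdfReader, len(reader.pages),
    #reader.pages[i] and page.extract_text()
    pdf_reader = pydf.PdfFileReader(pdf_file)
    page_count = pdf_reader.numPages

    #Extract text from every page
    for page in range(page_count):
        #Try except block in case of corrupted data
        try:
            pdf_page = pdf_reader.getPage(page)
            text.append(pdf_page.extractText())
        except Exception:
            #Keep an empty entry so list indices still line up with pages
            text.append("")

full_text = "\n".join(text)
print("Full text: " + full_text)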

Full text:  A Simple PDF File  This is a small demonstration .pdf file -  just for use in the Virtual Mechanics tutorials. More text. And more  text. And more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. Boring, zzzzz. And more text. And more text. And  more text. And more text. And more text. And more text. And more text.  And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2  ...continued from page 1. Yet more text. And more text. And more text.  And more text. And more text. And more text. And more text. And more  text. Oh, how boring typing this stuff. But not as boring as watching  paint dry. And more text. And more text. And more text. And more text.  Boring.  More, a little more text. The end, and just as well. 


Note that because the text is stored per page in a list, you can easily look up text by page number, as long as you account for list indices starting at 0, whereas PDF page numbers start at 1.
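
The index offset can be sketched with a small helper. `pages_containing` is a hypothetical name, and `sample_pages` is a shortened stand-in for the per-page `text` list built above, not the real extraction output.

```python
def pages_containing(page_texts, needle):
    """Return the 1-based page numbers of the pages whose text contains `needle`."""
    # enumerate(..., start=1) converts 0-based list indices to PDF page numbers
    return [
        page_number
        for page_number, page_text in enumerate(page_texts, start=1)
        if needle in page_text
    ]


# Stand-in for the per-page list (one string per page)
sample_pages = [
    "A Simple PDF File ... Continued on page 2 ...",
    "Simple PDF File 2 ...continued from page 1. The end, and just as well.",
]

print(pages_containing(sample_pages, "The end"))

# Going the other way: page 2 of the PDF lives at list index 1
page_number = 2
print(sample_pages[page_number - 1])
```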

In [37]:
#This sample PDF contains the word "text" many times.
#Let's search the full string output to see exactly how many times.
str_to_find = "text"
print(f"String '{str_to_find}' was found {full_text.count(str_to_find)} times in the PDF")

String 'text' was found 40 times in the PDF
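
Keep in mind that `str.count` counts case-sensitive substring matches, so words such as "context" or "texts" would also be counted while "Text" would not. If whole-word, case-insensitive counting is wanted instead, one possible sketch uses the `re` module. The helper name `count_whole_word` and the sample sentence are made up for illustration.

```python
import re


def count_whole_word(haystack, word):
    """Count case-insensitive occurrences of `word` as a whole word."""
    # \b marks a word boundary; re.escape guards against special characters
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, haystack, flags=re.IGNORECASE))


sample = "Text and more text. Some context, two texts."

# Substring count: matches "text", "context" and "texts", but not "Text"
print(sample.count("text"))
# Whole-word count: matches "Text" and "text" only
print(count_whole_word(sample, "text"))
```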


## Using Tabula and Camelot

PyPDF2 extracts plain text from a PDF, but some data, such as tables, may only be available in PDF format, and plain-text parsing does not retain the layout of that information. Tabula and Camelot address this by extracting tables directly.

In [5]:
from tabula import read_pdf
import pandas as pd

path = r'assets/Test.pdf'
#Read every table on every page of the PDF at this path
tables = read_pdf(path, pages="all",
                  output_format="dataframe")
#read_pdf returns a LIST of DataFrames; access by index to get the first table
df = pd.DataFrame(data=tables[0])
df


Unnamed: 0,Pos,Player,Team,Span,Innings,Runs,Highest,Average,Striking
0,,,,,,,Score,,Rate
1,1.0,Sachin Tendular,India,1989-2012,452.0,18426.0,200,44.83,86.23
2,2.0,Kumar Sangakkara,Sri Lanka,2000-2015,380.0,14234.0,169,41.98,78.86
3,3.0,Ricky Ponting,Australia,1995-2012,365.0,13704.0,164,42.03,80.39
4,4.0,Sanath Jayasuriya,Sri Lanka,1989-2011,433.0,13430.0,189,32.36,91.2
5,5.0,Mahela Jayawardene,Sri Lanka,1998-2015,418.0,12650.0,144,33.37,78.96
6,6.0,Virat Kohli,India,2008-2020,236.0,11867.0,183,59.85,93.39
7,7.0,Inzamam-ul-Haq,Pakistan,1991-200,350.0,11739.0,137,39.52,74.24
8,8.0,Jacques Kalis,South Africa,1996-2014,314.0,11579.0,139,44.36,72.89
9,9.0,Saurav Ganguly,India,1992-2007,300.0,11363.0,183,41.02,73.7


Tabula tends not to play nicely with messier PDFs.

In [6]:
#One of the PDFs provided in a data set for Matthew's replication
path = r"assets/t01_01.pdf"
df = read_pdf(path, pages="all",
              output_format="dataframe")
df = pd.DataFrame(data=df[0])
df

Unnamed: 0,10,9,8,7,6,5,4,3,2,1,Total


Some PDFs are much harder for Tabula to read than others. One approach is to have Tabula write the PDF's tables to a CSV file and then clean the data manually in Excel or another human-readable format. The rest of the cleaning can be handled once the data can be imported into pandas, but getting to that first importable DataFrame may take some human effort.
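
Some of this cleanup can be scripted once the rows are readable. As one hedged example, the first Tabula output above splits two column names across the header and row 0 ("Highest" / "Score" and "Striking" / "Rate"). The sketch below merges such a pair of rows. `merge_header_rows` is a hypothetical helper, and it assumes the cells have already been converted to strings with missing values replaced by empty strings.

```python
def merge_header_rows(header, continuation):
    """Join each header cell with the cell below it, skipping empty cells."""
    merged = []
    for top, bottom in zip(header, continuation):
        # Keep only the non-empty pieces, stripped of surrounding whitespace
        parts = [part.strip() for part in (top, bottom) if part and part.strip()]
        merged.append(" ".join(parts))
    return merged


# Header and first data row as they appear in the Tabula output above
header = ["Pos", "Player", "Team", "Span", "Innings", "Runs",
          "Highest", "Average", "Striking"]
continuation = ["", "", "", "", "", "", "Score", "", "Rate"]

print(merge_header_rows(header, continuation))
```

With a pandas DataFrame, the merged names could then be assigned to `df.columns` and the leftover first row dropped.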

Camelot tends to be more flexible when it comes to reading harder PDFs, but this library's dependencies are quite annoying to handle.

Required packages that are not installed automatically: opencv-python and a local installation of Ghostscript.
`pip install camelot-py` followed by `pip install opencv-python` should handle the cv2 dependency that tends to raise an error; `pip install camelot-py[cv]` should work as well.

For Ghostscript, visit their website, since a local machine installation is necessary. On Windows, the `C:\Program Files\gs\gs"x.xx.x"\bin` directory must also be added to the PATH environment variable; after that, Camelot should work!
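
Before importing Camelot, it can help to confirm that a Ghostscript executable is actually reachable through PATH. The check below is only a sketch: `find_ghostscript` is a hypothetical helper, and the executable names (`gs` on Linux/macOS, `gswin64c` and `gswin32c` on Windows) are the commonly used ones, assumed here rather than taken from the Camelot documentation.

```python
import shutil

# Executable names Ghostscript is commonly installed under (assumed list)
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in names:
        # shutil.which searches the directories listed in PATH
        location = shutil.which(name)
        if location:
            return location
    return None


print(find_ghostscript() or "Ghostscript executable not found on PATH")
```

If nothing is found, the PATH entry described above is the first thing to check.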

In [7]:
import camelot
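# extract the tables from the PDF file (Camelot reads only page 1 by default)
tables = camelot.read_pdf(path)
# each detected table exposes its contents as a pandas DataFrame via .df
tables[0].df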

Unnamed: 0,0,1,2
0,Deciles\nםינורישע,לכה ךס\nTotal,
1,10\n9\n8\n7\n6\n5\n4\n3\n2\n1,,
2,84.0\n83.7\n81.0\n81.1\n79.9\n77.4\n71.7\n63.6...,70.2\n70.0\n71.1\n70.8\n71.2\n69.6\n70.4\n70.6...,1997\n1998\n1999\n2000\n2001\n2002\n2003\n2004...


It's not in a very readable format, but it's definitely an improvement over Tabula! Additionally, Camelot's settings can be tweaked to better read the table region. Inexplicably, Camelot could not read the sample table from GeeksforGeeks using their provided code, even with additional tweaking.
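
One way to make such output more usable is to split the newline-joined cells back into separate rows. The sketch below works on plain strings rather than on the Camelot table itself. `split_stacked_cells` is a hypothetical helper, and `stacked` is a shortened stand-in for row 2 of the output above. Whether the pieces really line up row by row depends on the PDF, so treat this as a starting point only.

```python
def split_stacked_cells(cells):
    """Split newline-joined cells and regroup the pieces into rows."""
    # Each cell becomes a list of the values that were stacked inside it
    columns = [cell.split("\n") for cell in cells]
    # zip(*columns) pairs up the first values, the second values, and so on;
    # it stops at the shortest column, so uneven cells lose trailing values
    return [list(row) for row in zip(*columns)]


# Shortened stand-in for row 2 of the Camelot output above
stacked = ["84.0\n83.7\n81.0", "70.2\n70.0\n71.1", "1997\n1998\n1999"]

for row in split_stacked_cells(stacked):
    print(row)
```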