# PDF Data Extraction with Python (almost complete)
> How to extract data from a PDF using Python

- toc: true
- badges: false
- comments: false
- categories: [Python, Tutorial, Data]
- image: images/chart-preview.png

### Extracting data from PDFs using Tabula
Data extraction from a PDF document can be an incredibly arduous task. It may not be too bad as a once-off task, but when the process needs to be repeated many times over, it can be truly crushing. Working with data in a PDF is difficult because the data is formatted differently to how it is in a spreadsheet. This means that before we can work with and manipulate the data, we must extract it from the PDF, correct any misalignments and format interpretation errors, and then store it in a more data-friendly format like a *csv* or *xlsx* spreadsheet.

One advantage of working with formal reports published in PDF format is that they are typically consistently structured, with only the content changing. We are therefore often able to construct an automated process for extracting data from PDF tables, which really helps when extracting data from many files.

In this tutorial, we will demonstrate how to use a Python module called [Tabula](https://pypi.org/project/tabula-py/). Tabula allows you to pull data from a PDF and load it into a [Pandas](https://pypi.org/project/pandas/) dataframe. It is important to note that this is only the first part of the PDF data extraction process. Once the data is in a dataframe, it still needs to be cleaned and arranged in a manner that is consistent across all datasets before being stored. In this tutorial, we will focus only on the initial extraction process, using Tabula.

### Getting Started
To follow along with this tutorial, a basic understanding of the Python programming language and Python environments is assumed. You will need to have Python installed, along with the Tabula module, which can be installed using pip or [Anaconda](https://anaconda.org/conda-forge/tabula-py). The module is a wrapper around a Java library, so a Java runtime must also be installed.
For this tutorial, we have used this [PDF](https://ckan.africadatahub.org/dataset/south-africa-inflation-data/resource/1d22865e-8ace-46e1-8222-ec5352334889) file. 

In [1]:
import tabula
import pandas as pd
import glob

Say we wish to extract "Table 1 - Consumer price indices for the total country", as seen below.
![](images/pdf_extraction/raw_pdf.png)

In [2]:
#hide_output
tables = tabula.read_pdf('./data/pdf_extraction/p0141june2022_tables.pdf', pages="all")

Got stderr: Oct 04, 2022 2:38:34 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>

This causes Tabula to parse the entire PDF document and extract every table that it identifies, returning them in a list that we have called `tables`. We can reduce the runtime by specifying which pages of the document to search. In this case, Table 1 can be found on pages 3 and 4.

In [3]:
#hide_output
tables = tabula.read_pdf('./data/pdf_extraction/p0141june2022_tables.pdf', pages=(3,4))

Got stderr: Oct 04, 2022 2:38:40 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



We can also tell Tabula how to identify rows and columns: `stream=True` uses the white space between cells, while `lattice=True` uses the ruling lines drawn around them. In my experience thus far, `stream` typically provides better results.
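Which mode works better depends on the document, so it can help to try both and compare the results. The sketch below is one rough way to do that and is not part of Tabula itself: `empty_fraction` and `compare_modes` are hypothetical helper names, and using the share of empty cells as a sign of misaligned columns is only an assumed heuristic.

```python
import pandas as pd


def empty_fraction(df):
    """Share of cells in a dataframe that are missing (NaN or None)."""
    if df.size == 0:
        return 1.0  # treat an empty extraction as entirely missing
    return float(df.isna().to_numpy().mean())


def compare_modes(pdf_path, pages):
    """Extract the same pages in stream and lattice mode and score each table."""
    import tabula  # imported here so empty_fraction can be used without Java

    results = {}
    for mode in ("stream", "lattice"):
        tables = tabula.read_pdf(pdf_path, pages=pages, **{mode: True})
        results[mode] = [(df.shape, empty_fraction(df)) for df in tables]
    return results


# Hypothetical dataframe standing in for a poorly aligned extraction
demo = pd.DataFrame({"a": [1.0, None], "b": [None, None]})
print(empty_fraction(demo))  # 0.75
```

A lower score does not guarantee a better table, so it is still worth inspecting the dataframes by eye before settling on a mode.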

In [4]:
#hide_output
tables = tabula.read_pdf('./data/pdf_extraction/p0141june2022_tables.pdf', pages=(3,4), stream=True)

Got stderr: Oct 04, 2022 2:38:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Once we have our list of tables, we can access the appropriate table from the list. Note that the table we're interested in, Table 1, is split across two pages. Tabula treats this as two separate tables, so to access Table 1 we need to pull the first two tables from the list.

In [5]:
df_1 = tables[0]
df_2 = tables[1]
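
Since the two pieces belong to the same table, they can be stacked into a single dataframe. The following is a minimal sketch, assuming both pieces come out with the same columns in the same order (which Tabula does not guarantee). `combine_split_table` is a hypothetical helper, and the two small dataframes are placeholders standing in for `tables[0]` and `tables[1]`.

```python
import pandas as pd


def combine_split_table(parts):
    """Stack the pieces of a table that spans several pages into one dataframe."""
    # Assumption: every piece has identical column names in the same order.
    return pd.concat(parts, ignore_index=True)


# Placeholder data standing in for the two halves of Table 1
page_3 = pd.DataFrame({"group": ["A", "B"], "index": [1.0, 2.0]})
page_4 = pd.DataFrame({"group": ["C"], "index": [3.0]})

table_1 = combine_split_table([page_3, page_4])
print(table_1.shape)  # (3, 2)
```

With the real data this would be called as `combine_split_table([df_1, df_2])`, after checking that the column headers of the two pieces match.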

We have now extracted the data for Table 1 from the PDF and have it in a Pandas dataframe. From here the data can be cleaned and processed as required. 
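
Because reports of this kind usually keep the same layout from one release to the next, the same extraction call can be applied to a whole folder of PDFs. The sketch below shows one way this might look. `extract_from_folder` is a hypothetical helper, and the folder pattern in the usage comment assumes all the reports sit next to the file used in this tutorial.

```python
import glob
import os


def extract_from_folder(pattern, extract):
    """Apply `extract` to every file matching `pattern`, keyed by file name."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        try:
            results[os.path.basename(path)] = extract(path)
        except Exception as err:  # keep going if one report fails to parse
            print(f"Skipping {path}: {err}")
    return results


# Intended use with Tabula (requires tabula-py and Java):
#   import tabula
#   all_tables = extract_from_folder(
#       './data/pdf_extraction/*.pdf',
#       lambda path: tabula.read_pdf(path, pages=(3, 4), stream=True),
#   )
```

Passing the extraction step in as a function keeps the loop independent of the page numbers and options, which may differ from one series of reports to another.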

In some cases, Tabula is not able to identify the table that you wish to extract automatically. It is then necessary to specify the area of the page that you are interested in. The easiest way to find the appropriate page coordinates is to open Tabula in your browser, upload the PDF document and then manually select the area of the table of interest by clicking and dragging with your mouse. For this you have to download and install the Tabula tool from [here](https://tabula.technology/). **Note** that this is separate from the Tabula module that you have installed with Python. Once you have selected your table area, you can export it as a script. Open the script in any text editor and you will see something like this:

`java -jar tabula-java.jar  -a 143.249,62.261,670.851,546.702 -p 3 "$1"  `

Copy the numbers `143.249,62.261,670.851,546.702`, which give the top, left, bottom and right edges of the selected area, and include them in your Python command as follows:

In [6]:
#hide_output
tables = tabula.read_pdf('./data/pdf_extraction/p0141june2022_tables.pdf', pages=(3,4), stream=True, area=(143.249,62.261,670.851,546.702))

Got stderr: Oct 04, 2022 2:38:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Oct 04, 2022 2:38:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
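
When there are many exported scripts, copying the numbers by hand becomes tedious. The following is a small sketch that reads them out of the script line instead. `area_from_script` is a hypothetical helper, and it assumes the script contains a single `-a` option in the form shown above.

```python
import re


def area_from_script(command):
    """Pull the four numbers after `-a` out of an exported Tabula script line."""
    match = re.search(r"-a\s+([\d.,]+)", command)
    if match is None:
        raise ValueError("no -a option found in command")
    # The values are ordered top, left, bottom, right
    top, left, bottom, right = (float(x) for x in match.group(1).split(","))
    return [top, left, bottom, right]


script_line = 'java -jar tabula-java.jar  -a 143.249,62.261,670.851,546.702 -p 3 "$1"'
area = area_from_script(script_line)
print(area)  # [143.249, 62.261, 670.851, 546.702]
```

The resulting list can be passed directly as the `area` argument of `tabula.read_pdf`.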



### Conclusion
In this tutorial, we have shown you how to extract data from a PDF document using a Python module called Tabula. We have used this method very successfully in extracting data from many African country Consumer Price Index reports in order to produce our Inflation database, which is used to service the ADH African Inflation observer. 