# PDF Data Extraction with Python (almost complete)
> How to extract data from a PDF using Python

- toc: true
- badges: false
- comments: false
- categories: [Python, Tutorial, Data]
- image: images/chart-preview.png

### Extracting data from PDFs using Tabula
Extracting data from a PDF document can be an arduous task. It may be tolerable as a one-off job, but it quickly becomes tedious when the process has to be repeated many times. PDF data is difficult to work with because it is formatted differently from data in a spreadsheet. Before we can work with and manipulate the data, we must extract it from the PDF, correct any misalignments and format interpretation errors, and then store it in a more data-friendly format such as a *csv* or *xlsx* spreadsheet.

One advantage of working with formal reports published as PDFs is that they are typically consistently structured, with only the content changing. This often makes it possible to build an automated process for extracting data from PDF tables, which saves a great deal of effort when there are many files to process.

In this tutorial, we demonstrate how to use a Python module called [Tabula](https://pypi.org/project/tabula-py/). Tabula pulls data from a PDF and loads it into a [Pandas](https://pypi.org/project/pandas/) dataframe. Note that this is only the first part of the PDF data extraction process: once the data is in a dataframe, it still needs to be cleaned and arranged consistently across all datasets before being stored. This tutorial focuses only on the initial extraction step, using Tabula.

### Getting Started
To follow along with this tutorial, a basic understanding of the Python programming language and Python environments is assumed. You will need to have Python installed, along with the Tabula module, which can be installed using pip (the package is named `tabula-py`) or [Anaconda](https://anaconda.org/conda-forge/tabula-py).
For this tutorial, we have used this [PDF](https://ckan.africadatahub.org/dataset/south-africa-inflation-data/resource/1d22865e-8ace-46e1-8222-ec5352334889) file. 
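
Before going further, it can help to confirm that the prerequisites are in place. The snippet below is a minimal sketch rather than part of the original workflow: it checks whether the `tabula` and `pandas` modules can be imported and whether a `java` executable is on the system path. The Java check is included because `tabula-py` is a wrapper around the Java library tabula-java. The helper name `check_prerequisites` is made up for illustration.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether the tools this tutorial relies on appear to be available.

    Hypothetical helper for illustration only; it does not install anything.
    """
    return {
        # find_spec returns None when a top-level module cannot be found
        "tabula": importlib.util.find_spec("tabula") is not None,
        "pandas": importlib.util.find_spec("pandas") is not None,
        # tabula-py calls out to Java, so look for a java executable on PATH
        "java": shutil.which("java") is not None,
    }


for name, available in check_prerequisites().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If anything is reported as missing, install it before running the cells that follow.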

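```python
import tabula        # tabula-py: reads tables from PDF files
import pandas as pd  # dataframes that hold the extracted tables
import glob          # finds files whose names match a pattern
```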

Say we wish to extract "Table 1 - Consumer price indices for the total country", as seen below.
![](images/pdf_extraction/raw_pdf.png)
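
As a rough sketch of how such a table might be pulled out, the snippet below wraps `tabula.read_pdf` in a small helper. The helper name `extract_tables`, the file name `report.pdf`, and the page number are assumptions made for illustration; the page that actually holds Table 1 has to be checked in the PDF itself. With `multiple_tables=True`, `read_pdf` returns a list of dataframes, one per table it detects on the requested pages.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all"):
    """Return a list of dataframes, one per table detected on the given pages.

    Hypothetical helper for illustration; `pages` is passed straight through
    to tabula.read_pdf and may be an int, a list, or a string such as "1-3".
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No PDF found at {pdf_path}")

    # Imported here so that a missing file is reported before tabula is loaded
    import tabula

    return tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)


pdf_file = Path("report.pdf")  # assumed local file name, adjust as needed
if pdf_file.is_file():
    tables = extract_tables(pdf_file, pages=1)  # page number is a guess
    print(f"Found {len(tables)} table(s)")
    if tables:
        print(tables[0].head())
```

The extracted dataframes usually still need cleaning, since header rows and merged cells are often misread.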