# Step-by-step guide to extracting data from PDF files with Python

___

* There are several great Python packages for working with PDF files (e.g., PDFMiner and pyPDF).
* This tutorial introduces the **tabula** package (installed as `tabula-py`) step by step.

---

### 1. Install and import **tabula**.

* First, install [Java](https://www.java.com/en/download/), which **tabula** needs in order to run.

In [69]:
pip install tabula-py




In [70]:
import tabula

___

### 2. Check if everything is ready.

In [63]:
tabula.environment_info()

Python version:
    3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Java version:
    java version "1.8.0_261"
Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
Java HotSpot(TM) Client VM (build 25.261-b12, mixed mode)
tabula-py version: 2.1.1
platform: Windows-10-10.0.18362-SP0
uname:
    uname_result(system='Windows', node='DESKTOP-J05OL2A', release='10', version='10.0.18362', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 9, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
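
If `tabula.environment_info()` fails or shows no Java version, it can help to check for Java directly. The snippet below is a minimal sketch using only the standard library; the helper name `java_version_output` is made up for this example and is not part of **tabula**.

```python
import shutil
import subprocess


def java_version_output():
    """Return the text printed by `java -version`, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` normally writes its report to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


report = java_version_output()
if report is None:
    print("Java was not found on PATH; install it before using tabula.")
else:
    print(report)
```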
    


___

### 3. Read your PDF file.

In [None]:
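# Replace with the path to your own PDF file
file = r"Your_path/file_name.pdf"

# Returns a list of DataFrames, one per table found on the requested pages
tables = tabula.read_pdf(file, pages="all", multiple_tables=True)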
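
Before going further, it can be useful to see how many tables were found and how large each one is. The sketch below assumes `tables` is the list of pandas DataFrames returned by `tabula.read_pdf`; `summarize_tables` is a hypothetical helper written for this example and relies only on each table's `.shape` attribute.

```python
def summarize_tables(tables):
    """Return one (index, n_rows, n_columns) tuple per extracted table."""
    return [(i, table.shape[0], table.shape[1]) for i, table in enumerate(tables)]


# Usage, after running the cell above:
# for index, n_rows, n_cols in summarize_tables(tables):
#     print(f"table {index}: {n_rows} rows x {n_cols} columns")
```

A table with very few rows or a single column is often a sign that the extraction area needs to be set by hand, as described in the next step.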

___

### 4. Specify coordinates to micro-extract data from PDF file.

* If the tables in the PDF file are well organized, then you can simply use the command above or its variation. To learn more about the syntax, visit https://tabula-py.readthedocs.io/en/latest/
* However, whenever possible, it is better to specify the exact location of the data in the PDF file, i.e., its coordinates.
* Here is one way to find out coordinates of a PDF file:
    1. Go to https://tabula.technology/ and install **Tabula**.
    2. Unzip the downloaded file and run Tabula.exe.
    3. A new window will open (or you can type in "http://127.0.0.1:8080/" into your browser).
    4. Upload your PDF file and click **Extract Data**.
    5. Your PDF file will open. Select the area from which you want to extract data.
    6. Click **Preview & Export Extracted Data**.
    7. Select **Script** from the drop-down menu at the top of the page, then click **Export**.
    8. This will download a script file. Open it with a text editor.
    9. It will look something like the line below. Copy the four coordinates in the middle (representing the top, left, bottom, and right coordinates).
    
        java -jar tabula-java.jar  -a 96.773,50.108,177.098,208.463 -p 1 "$1" 
        
       
    10. Pass area = (96.773,50.108,177.098,208.463) to your **tabula** command.

In [None]:
# area is given as (top, left, bottom, right), copied from the exported script
data = tabula.read_pdf(r'your_path\file_name.pdf', area=(96.773, 50.108, 177.098, 208.463), pages="all")
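
Instead of copying the four numbers by hand, they can be read out of the exported script line. This is a minimal sketch that assumes the script contains an `-a top,left,bottom,right` option in the form shown in step 9; `parse_area` is a hypothetical helper, not a **tabula** function.

```python
import re


def parse_area(script_line):
    """Extract (top, left, bottom, right) from the -a option of a Tabula script line."""
    match = re.search(r"(?<!\S)-a\s+([\d.]+),([\d.]+),([\d.]+),([\d.]+)", script_line)
    if match is None:
        raise ValueError("no '-a top,left,bottom,right' option found")
    return tuple(float(value) for value in match.groups())


line = 'java -jar tabula-java.jar  -a 96.773,50.108,177.098,208.463 -p 1 "$1"'
area = parse_area(line)
print(area)  # (96.773, 50.108, 177.098, 208.463)
```

The resulting tuple can be passed directly as the `area` argument of `tabula.read_pdf`.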

---

### 5. Export the extracted data to CSV.

* The exporting method will depend on the structure of your extracted data.

In [75]:
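import pandas as pd

# read_pdf returns a list of pandas DataFrames (one per detected table),
# so csv.writer.writerows(data) would only write each table's column names.
# Stack the tables and write them to a single CSV file instead.
# This assumes the tables share the same columns.
pd.concat(data, ignore_index=True).to_csv('test.csv', index=False)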
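
If the extracted tables have different structures, writing each one to its own file may be more practical than a single `test.csv`. The sketch below assumes `data` is a list of pandas DataFrames, as returned by `tabula.read_pdf`; the helper `export_tables`, the folder name `csv_out`, and the file name pattern are all made up for this example.

```python
from pathlib import Path


def export_tables(tables, out_dir="csv_out", stem="table"):
    """Write each table to <out_dir>/<stem>_<n>.csv and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out / f"{stem}_{number}.csv"
        # DataFrame.to_csv writes the header row followed by the data rows.
        table.to_csv(path, index=False)
        paths.append(path)
    return paths


# Usage, after extracting `data` in step 4:
# written = export_tables(data)
```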

---

#### Endnote:
1. For more details about **tabula**, visit https://tabula-py.readthedocs.io/en/latest/
2. To learn Markdown syntax, visit the [website](https://daringfireball.net/projects/markdown/syntax) of its creator, John Gruber.
3. This Jupyter Notebook is uploaded to my GitHub repository at https://github.com/open-data-society/jwchung/blob/master and is also rendered at https://nbviewer.jupyter.org/
4. The link to this Notebook is https://nbviewer.jupyter.org/github/open-data-society/jwchung/blob/gh-pages/ReadPDFfiles.ipynb