## PDF File Table Data Extractor
This notebook is a helper for extracting tables from a PDF file.
The PDF contains planting data for different types of vegetables. It is published by EMBRAPA, a Brazilian state-owned company.

## Install Dependencies

For general data extraction from a PDF file, I'm using the [tabula-py](https://tabula-py.readthedocs.io/en/latest/getting_started.html#requirements) package. Note that this package needs Java 8 or newer installed on your machine to run correctly.
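
Before installing, it can help to confirm that Java is actually reachable from the notebook. The following is a minimal sketch using only the standard library; `find_java` is a hypothetical helper name, and it assumes the `java` executable is expected on the system `PATH`.

```python
import shutil
import subprocess


def find_java():
    """Return the path to the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("Java not found on PATH; install Java 8 or newer before using tabula-py.")
else:
    # `java -version` usually writes its banner to stderr rather than stdout.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
```

If nothing is found, install a Java runtime and restart the notebook kernel so the updated `PATH` is picked up.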

In [None]:
# Install dependencies: tabula-py
%pip install tabula-py

## Import tabula-py package and select the file

Make sure to set the correct file path for the PDF.

In [1]:
import tabula
import pandas as pd
import os

# --> Load PDF file
# Specify pdf file name (inside pdf folder)
pdf_name = "tabela_embrapa_clean.pdf"
pdf_file = os.path.join("pdf", pdf_name)

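# --> Output CSV file (named after the PDF, without the ".pdf" extension)
# Create the output folder if it does not exist yet
os.makedirs("output", exist_ok=True)
csv_file = os.path.join("output", os.path.splitext(pdf_name)[0] + ".csv")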

columns = ["Cultura", "Sul", "Sudeste", "Centro-Oeste", "Norte", "Entre Linhas", "Entre Plantas", "Tipo de Plantio", "Colheita", "Produção m2"]


## Read the table

If you only want to load the table into the notebook, use the `read_pdf` function.

You can read every page with `pages="all"` or a single page by passing its number (for example, `pages=2`). The `multiple_tables` argument controls whether a document with several tables is split into separate DataFrames.

In [None]:
table = tabula.read_pdf(pdf_file, pages="all", multiple_tables=False)
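
The `columns` list defined earlier is not applied by the read step itself. Below is a minimal sketch of one way to attach those labels to the extracted data. `label_tables` is a hypothetical helper, the demo frames are placeholder values standing in for tabula output, and the sketch assumes every extracted DataFrame has the same ten columns in the same order.

```python
import pandas as pd

# Column labels as defined earlier in this notebook.
columns = ["Cultura", "Sul", "Sudeste", "Centro-Oeste", "Norte", "Entre Linhas",
           "Entre Plantas", "Tipo de Plantio", "Colheita", "Produção m2"]


def label_tables(tables, column_names):
    """Stack a list of DataFrames vertically and apply the given column names."""
    combined = pd.concat(tables, ignore_index=True)
    if combined.shape[1] != len(column_names):
        raise ValueError(
            f"Expected {len(column_names)} columns, found {combined.shape[1]}"
        )
    combined.columns = column_names
    return combined


# Placeholder frames (not real EMBRAPA data): two "pages" with two rows each.
demo_tables = [
    pd.DataFrame([[f"p{p}r{r}c{c}" for c in range(10)] for r in range(2)])
    for p in range(2)
]

labelled = label_tables(demo_tables, columns)
print(labelled.shape)
```

With real output, the call would be something like `label_tables(table, columns)`, provided the extraction produced exactly ten columns; the length check raises an error otherwise instead of silently mislabelling the data.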

## Export table to CSV

To export the table to a CSV file, run the following command.

As with reading, you can convert all the pages or just a specific page. The `java_options` argument sets the file encoding to UTF-8, so that accented characters are preserved in the output file.

In [None]:
tabula.convert_into(pdf_file, csv_file, output_format="csv", pages="all", java_options="-Dfile.encoding=UTF8")
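
To check that accented characters survived the export, the CSV can be read back with an explicit UTF-8 encoding. The sketch below uses only the standard library; `read_csv_rows` is a hypothetical helper, and the sample rows are made-up values written to a temporary file so the example runs without the real PDF.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of rows, decoding it with the given encoding."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


# Made-up rows standing in for the exported table.
sample_rows = [["Cultura", "Produção m2"], ["exemplo com acentuação", "0"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    # Write the sample as UTF-8, mirroring the encoding requested from tabula.
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(sample_rows)
    rows_read = read_csv_rows(sample_path)

print(rows_read[0])
```

For the real export, calling `read_csv_rows(csv_file)` and inspecting the first few rows should show whether the accents came through intact.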