# PDF File Data Extractor
This notebook is a helper for extracting data from a PDF file.
The PDF contains planting data for different types of vegetables. It is made available by EMBRAPA, a Brazilian state-owned company.

## Install Dependencies

For general data extraction from a PDF file, I'm using the [PyMuPDF](https://github.com/pymupdf/PyMuPDF-Utilities) package.

In [4]:
# Install dependencies - PyMuPDF
%pip install pymupdf

Note: you may need to restart the kernel to use updated packages.


## Import PyMuPDF package and configurations

PyMuPDF is loaded into Python via the fitz module.

Here I'm also setting the first and last pages of the PDF that I want to read.


In [1]:
import fitz
import os

# --> Load PDF file
# Specify pdf file name (inside pdf folder)
pdf_name = "tabela_embrapa_clean.pdf"
pdf_file = os.path.join("pdf", pdf_name)

# --> Set Initial and Final Pages of the Document to search
initial_page = 8
final_page = 57

# Load PDF
pdf_document = fitz.open(pdf_file)
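
PyMuPDF indexes pages from zero, while the page numbers set above read like the 1-based numbers shown in a PDF viewer. A minimal sketch of the conversion, assuming `initial_page` and `final_page` are both 1-based and inclusive (`page_indices` is a hypothetical helper, not part of PyMuPDF):

```python
def page_indices(first_page, last_page, page_count):
    """Convert a 1-based, inclusive page range into 0-based page indices.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    if not 1 <= first_page <= last_page <= page_count:
        raise ValueError("page range is outside the document")
    # range() excludes its end value, so last_page itself is the right end
    return range(first_page - 1, last_page)


# The values used in this notebook, with an assumed 60-page document
indices = page_indices(8, 57, 60)
print(indices[0], indices[-1], len(indices))  # prints: 7 56 50
```

In the notebook, the document's real length could be passed as `pdf_document.page_count` instead of the assumed 60.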

## Data Extraction

The data can be extracted as words or as blocks. Which option gives more relevant results depends on the PDF file being read. In this case, I found that reading the blocks of content from each page gave better results.

The get_text() method with the blocks option returns, for each page, a list with all the elements on the page, such as text blocks and images. Each of those items is itself a tuple with the position of the block in the document and either the data (if it is an image or an image mask) or the text, which is what we are interested in!

In [3]:
pages = []

# Pages are zero-indexed and range() excludes its end, so this includes final_page
for page in range(initial_page-1, final_page):
    pdf_page = pdf_document[page]
    pages.append(pdf_page.get_text("blocks"))
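
To make the shape of a block concrete, here is a small sketch that works on hand-written tuples instead of the real PDF. It assumes the layout described in the PyMuPDF documentation as I understand it: `(x0, y0, x1, y1, text, block_no, block_type)`, with `block_type` equal to 0 for text and 1 for images. The coordinates and texts are made up, and `text_blocks` is a hypothetical helper.

```python
# Made-up blocks shaped like the items returned by get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
sample_blocks = [
    (56.7, 70.9, 300.0, 85.0, "Nome popular – Abóbora\n", 0, 0),
    (56.7, 90.0, 300.0, 250.0, "<image: DeviceRGB, width: 400, height: 300, bpc: 8>", 1, 1),
]


def text_blocks(blocks):
    """Return the text of the text blocks only (block_type == 0)."""
    return [block[4] for block in blocks if block[6] == 0]


print(text_blocks(sample_blocks))
```

Index 4 is the field used throughout the cleaning step in this notebook, which is why the code there reads `item[4]`.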

## Data Cleaning

If you look at the results of the extraction above, you will see that the data is not in a very usable shape. I want the results both as a list and as a dict, so I can choose later which suits my application best.

As a first step, I need to pick out, among all the blocks on the page, just the ones I'm interested in: those that contain the Name, Scientific Name, Description, Planting Season and Regions, and Recommendations. To search for those I'm using the find() method.

While finding the relevant data, I'm also cleaning it and splitting it into a name:data pair (that's why I also generate a dict, which could be used as JSON later). To do that I'm using the replace(old, new, count) method, which searches for a string and replaces it with the second argument. The third argument limits how many occurrences are replaced; here I want only the first one replaced. In this case it turns the first hyphen into an en dash, which is the character the text is then split on.

To clean things up, I'm also replacing the line breaks with spaces and then splitting the data. As with replace(), I want the split to happen only once.
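
The replace and split steps described above can be sketched on a single string. The sample text is made up for illustration (it is not copied from the PDF), and `split_label` is a hypothetical helper that mirrors the steps one at a time:

```python
def split_label(text):
    """Split a block text into a [name, data] pair."""
    # Replace only the first hyphen, so later hyphens are left alone
    text = text.replace("-", "–", 1)
    # Replace line breaks with spaces
    text = text.replace("\n", " ")
    # Split only at the first en dash
    pair = text.split("–", 1)
    pair[0] = pair[0].rstrip()
    pair[1] = pair[1].lstrip()
    return pair


# Made-up block text with a line break and a hyphen inside the data
sample = "Nome popular -\nCouve-flor"
print(split_label(sample))  # prints: ['Nome popular', 'Couve-flor']
```

Because the count is 1, the hyphen in "Couve-flor" survives. One caveat of this approach: if a block already uses an en dash as separator, the first hyphen inside the data is the one that gets replaced.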

In [16]:
# Clean data
# Labels that identify the blocks of interest
labels = (
    "Nome popular",
    "Nome científico",
    "Descrição",
    "Época e regiões para plantio",
    "Recomendações para aproveitamento",
)

clean_page = []  # one list of [name, data] pairs per page
clean_dict = []  # one dict of name: data per page
for page in pages:
    clean_list = []
    for item in page:
        text = item[4]  # index 4 holds the text of the block
        if any(text.find(label) != -1 for label in labels):
            # Turn the first hyphen into an en dash, replace line breaks
            # with spaces, then split once into a [name, data] pair
            temp = text.replace("-", "–", 1).replace('\n', ' ').split("–", 1)
            temp[0] = temp[0].rstrip()
            temp[1] = temp[1].lstrip()
            clean_list.append(temp)
    clean_page.append(clean_list)
    clean_dict.append(dict(clean_list))
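
Since the dict is meant to be usable as JSON later, here is a minimal sketch of that step with the standard json module. The sample entry is made up and only imitates the shape of one item of clean_dict; it is not data taken from the PDF.

```python
import json

# Made-up result for one page, shaped like one entry of clean_dict
sample_clean_dict = [
    {
        "Nome popular": "Abóbora",
        "Nome científico": "Cucurbita moschata",
    }
]

# ensure_ascii=False keeps accented characters readable instead of \u escapes
as_json = json.dumps(sample_clean_dict, ensure_ascii=False, indent=2)
print(as_json)

# Reading the JSON back gives the same structure
restored = json.loads(as_json)
```

When writing to a file instead, opening it with `encoding="utf-8"` keeps the accented characters intact.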



## Finally

And that's it! We now have clean data, ready to use in different scenarios. With this data, I'm planning to build a kitchen garden management tool that will provide useful information about the main crops. The next step will be to extract data from a PDF table. That could be more challenging, as the PDF doesn't have a structure for the table; it is just a bunch of text blocks and graphical data (representing the lines). If we try to read the blocks, we just get a jumble of text that is hard to make sense of.

Luckily for us, there are many other tools to help with that!

Until next time!