### License agreement, Illumination Works, LLC.

# Table extraction from native PDF
This tutorial contains complete code to ....

In this notebook, you will:

* Learn how to read a PDF using Python libraries
* Format the content into a table through a series of steps that use outlier detection, clustering, n-grams, and pandas grouping

### Required libraries
* pandas
* NumPy
* PyMuPDF (imported as `fitz`). Install with pip using the command *pip install pymupdf*
* NLTK. Install with pip using the command *pip install nltk*
* scikit-learn. Install with pip using the command *pip install scikit-learn*
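
Before running the notebook, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch; the helper name `missing_modules` is hypothetical and not part of `pdf_tables`. Note that PyMuPDF is imported as `fitz` and scikit-learn as `sklearn`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip package names) for the libraries listed above.
required = ["pandas", "numpy", "fitz", "nltk", "sklearn"]
print("Missing:", missing_modules(required) or "none")
```

Any name reported as missing can be installed with the matching pip command from the list.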


### Custom library - pdf_tables
This Jupyter notebook is accompanied by a Python script, ***pdf_tables.py***, that contains the functions used to process the contents of the PDF document.


## Setup

In [None]:
import itertools
import numpy as np
import os
import pandas as pd

import fitz

import pdf_tables as pdf

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.options.display.float_format = '{:.2f}'.format
pd.options.display.width = 600
pd.options.display.max_colwidth = 300
pd.options.display.max_columns = 10

## PDF location

In [None]:
data_dir = os.path.join('..', 'data')
pdf_file_location = os.path.join(data_dir,  'Aircraft database sample.pdf')

## Step 1: Get text and positional INFORMATION

In [None]:
pdf_handle = fitz.open(pdf_file_location)

page_elements = pdf.get_page_elements(pdf_handle, page_num=0)
pd.concat([page_elements.head(5), page_elements.tail(5)])
pdf_handle.close()
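
`pdf.get_page_elements` lives in the accompanying script and its body is not shown here. As a rough idea of what such a function might do, PyMuPDF's `page.get_text("words")` returns one tuple per word together with its bounding box. The sketch below turns tuples of that shape into records with `x_avg` and `y_avg` values (the column names used in Step 3). Treating them as bounding-box midpoints is an assumption, and `words_to_elements` is a hypothetical helper.

```python
def words_to_elements(words):
    """Convert PyMuPDF word tuples into dicts with text and centre points.

    Each tuple is expected in the shape returned by page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    elements = []
    for x0, y0, x1, y1, text, *_ in words:
        elements.append({
            "text": text,
            "x_avg": (x0 + x1) / 2,  # horizontal midpoint (assumed definition)
            "y_avg": (y0 + y1) / 2,  # vertical midpoint (assumed definition)
        })
    return elements


# With a real document the tuples would come from PyMuPDF, for example:
#   words = fitz.open(pdf_file_location)[0].get_text("words")
# Hand-made tuples stand in for that call here.
sample_words = [
    (10.0, 20.0, 30.0, 30.0, "ID", 0, 0, 0),
    (50.0, 20.0, 90.0, 30.0, "Model", 0, 0, 1),
]
elements = words_to_elements(sample_words)
print(elements[0])
```

Passing the resulting list to `pd.DataFrame` gives one row per word, which is a convenient shape for the later steps.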

## Step 2: REMOVE unrelated text
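
The notebook has no separate cell for this step; the Step 3 call passes `remove_outliers=True` with `outlier_range` set to 1.6. How `pdf_tables` applies that value is not shown. One common approach, sketched here purely as an assumption, treats `outlier_range` as a multiplier on the interquartile range (IQR) and drops elements whose position falls outside the resulting band. The function name `iqr_outlier_mask` and the sample positions are made up for illustration.

```python
import statistics


def iqr_outlier_mask(values, outlier_range=1.6):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = outlier_range * (q3 - q1)
    low, high = q1 - spread, q3 + spread
    return [v < low or v > high for v in values]


# Illustrative y positions: a page title far above evenly spaced table rows.
y_avg = [40.0, 200.0, 212.0, 224.0, 236.0, 248.0, 260.0]
mask = iqr_outlier_mask(y_avg)
kept = [y for y, is_outlier in zip(y_avg, mask) if not is_outlier]
print(kept)
```

In this example only the title line at 40.0 is flagged, so the table rows pass through unchanged.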


## Step 3: Build table by GROUPING on X and Y axis

In [None]:
outlier_props = {'outlier_range': 1.6}
cluster_props = ({'eps': 11, 'min_samples': 1, 'metric': 'manhattan'}
                 , {'eps': 0.5, 'min_samples': 1, 'metric': 'manhattan'}
                )

clustered_table, cluster_details = pdf.get_table_via_clustering(data=page_elements
                                                                , remove_outliers=True
                                                                , outlierprops=outlier_props
                                                                , cluster_data_columns=[['x_avg'], ['y_avg']]
                                                                , clusterprops=cluster_props
                                                               )
pd.concat([clustered_table.head(5), clustered_table.tail(5)])
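
The `eps`, `min_samples`, and `metric` keys match the parameters of scikit-learn's `DBSCAN`, so it seems likely that `get_table_via_clustering` clusters `x_avg` into columns and `y_avg` into rows with it. That is an assumption, since the function's body is not shown. For a single column with `min_samples=1`, the idea can be sketched without scikit-learn: every point is a core point, so clusters are the runs of sorted values whose consecutive gaps do not exceed `eps`. The helper `cluster_1d` and the sample positions are hypothetical.

```python
def cluster_1d(values, eps):
    """Label 1-D values so that neighbours at most `eps` apart share a label.

    Sorts the values and starts a new cluster wherever the gap between
    consecutive values exceeds `eps`. Labels increase with position.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] > eps:
            current += 1  # gap too wide: start a new cluster
        labels[idx] = current
    return labels


# Illustrative positions for six words.
x_avg = [20.0, 22.5, 70.0, 75.0, 140.0, 21.0]
y_avg = [25.0, 25.2, 40.0, 40.4, 55.0, 25.1]

column_ids = cluster_1d(x_avg, eps=11)   # wide tolerance across a column
row_ids = cluster_1d(y_avg, eps=0.5)     # tight tolerance within a text line
print(list(zip(column_ids, row_ids)))
```

The two label lists together give each word a (column, row) cell. The wide `eps` for x and the tight `eps` for y mirror the values in `cluster_props`; scikit-learn may number the clusters in a different order than this sketch does.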

## Step 4: Find table HEADERS

In [None]:
table_with_headers = pdf.get_page_headers(data=clustered_table, header_row_detector='Model Full Name')
table_with_headers.head(5)
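
The `header_row_detector` argument suggests that the header row is located by searching for a known column title. A minimal sketch of that idea follows, using plain lists instead of a DataFrame. The helper `apply_headers`, the `Engines` column, and the row values are made up for illustration and are not taken from the sample PDF.

```python
def apply_headers(rows, header_row_detector):
    """Use the first row containing `header_row_detector` as column names."""
    for position, row in enumerate(rows):
        if header_row_detector in row:
            headers = row
            body = rows[position + 1:]  # everything above the header is dropped
            return [dict(zip(headers, cells)) for cells in body]
    raise ValueError(f"No row contains {header_row_detector!r}")


rows = [
    ["Aircraft database", "", ""],          # title line above the table
    ["ID", "Model Full Name", "Engines"],   # header row
    ["1", "Example Model A", "2"],
    ["2", "Example Model B", "4"],
]
table = apply_headers(rows, "Model Full Name")
print(table[0])
```

Raising an error when the detector text is absent makes a wrong page or a misspelled title visible immediately instead of producing an unlabeled table.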

## Step 5: CONSOLIDATE rows

In [None]:
clean_complete_table = pdf.group_rows(data=table_with_headers, row_grouper_columns=['ID'])
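
Cells that wrap onto several lines in the PDF end up as separate rows after clustering, so they need to be merged back into one logical row. The sketch below assumes that a continuation row has an empty `ID` and belongs to the nearest row above it that has one. That rule, the helper name `merge_continuation_rows`, and the sample rows are assumptions; `pdf.group_rows` may work differently.

```python
def merge_continuation_rows(rows, key="ID"):
    """Merge rows with an empty `key` into the preceding row that has one."""
    merged = []
    for row in rows:
        if row[key] or not merged:
            merged.append(dict(row))  # a new logical row starts here
            continue
        target = merged[-1]
        for column, text in row.items():
            if text:
                # Append the wrapped text to what the cell already holds.
                target[column] = f"{target.get(column, '')} {text}".strip()
    return merged


rows = [
    {"ID": "1", "Model Full Name": "Example Model", "Notes": "first line"},
    {"ID": "", "Model Full Name": "A", "Notes": "second line"},
    {"ID": "2", "Model Full Name": "Example Model B", "Notes": ""},
]
grouped = merge_continuation_rows(rows)
print(grouped[0])
```

With pandas, a similar effect can be had by forward-filling the `ID` column and then aggregating each group with a string join.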

## Finally: Review output

In [None]:
clean_complete_table

*****************************************************************************************************