Goal: automated table extraction for knowledge graph development. [Camelot](https://camelot-py.readthedocs.io/en/master/) provides tools to extract tables from PDFs into pandas DataFrames. The PDF used here is "Appendix C", which describes the data format for the form "OPNAV 4790/2K". This [Medium article](https://medium.com/@luchensf/retrieve-table-contents-from-pdf-df514b779d07) is a useful introduction to extracting tables with Camelot.

## Install camelot from conda-forge
```bash
mamba install -c conda-forge camelot-py
```

### Note:
Installing with mamba from the conda-forge channel places the Ghostscript `gs` executable in the environment's `bin` directory, which probably won't be on your `PATH`. To fix this:

```bash
export PATH=/Users/cvardema/mambaforge/envs/pdfmunge/bin:$PATH
```

Additionally, and annoyingly, conda installs the Python [ghostscript](https://pypi.org/project/ghostscript/) package in the user site-packages directory (`.local`), which may not be on the Python path. Here I used `sys.path.insert` to add `.local/lib/python3.10/site-packages` to `sys.path`.

Both paths (executable and module) must be set correctly, or camelot will fail with irritating module-not-found error messages.
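
A quick way to confirm both paths before importing camelot is to ask Python where it finds the executable and whether the module is importable. This is a minimal sketch using only the standard library; the helper name `check_camelot_deps` is hypothetical and not part of Camelot.

```python
import importlib.util
import shutil


def check_camelot_deps(executable="gs", module="ghostscript"):
    """Report where the executable is found on PATH and whether the module is importable."""
    return {
        # shutil.which returns the full path to the executable, or None if it is not on PATH
        "executable": shutil.which(executable),
        # find_spec returns None when the top-level module cannot be found on sys.path
        "module": importlib.util.find_spec(module) is not None,
    }


print(check_camelot_deps())
```

If `executable` is `None` or `module` is `False`, the corresponding path still needs fixing.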


In [9]:
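import os
import sys
# `!export` runs in a throwaway subshell and does not change the kernel's PATH,
# so prepend the environment's bin directory through os.environ instead.
os.environ["PATH"] = "/Users/ccunnin8/mambaforge/envs/pdfmunge/bin" + os.pathsep + os.environ["PATH"]
sys.path.insert(0, "/Users/ccunnin8/.local/lib/python3.10/site-packages")
print(sys.path)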

['/Users/ccunnin8/.local/lib/python3.10/site-packages', '/Users/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/git/decoder-ring', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python310.zip', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/lib-dynload', '', '/home/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/site-packages']


In [10]:
import camelot
from camelot import utils
from pathlib import Path

In [11]:
import ghostscript

In [14]:
datapath = Path('./data')
# Build the PDF path from datapath; pages='all' scans every page for tables
pdf_table_reader = camelot.read_pdf(str(datapath / 'JFMM-VI-19-2K.pdf'), pages='all')

In [15]:
print("Number of Tables detected: ", pdf_table_reader.n)
print(pdf_table_reader[0].parsing_report)

Number of Tables detected:  47
{'accuracy': 100.0, 'whitespace': 0.0, 'order': 1, 'page': 3}
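
Each table's `parsing_report` is a dictionary like the one printed above, so the reports can be scanned in bulk to shortlist tables worth inspecting by hand. The sketch below works on plain dictionaries. The helper name `flag_suspect_tables` and both thresholds are hypothetical, and nothing here shows that the junk tables in this PDF actually score poorly on these metrics.

```python
def flag_suspect_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the indexes of reports with low accuracy or a high whitespace percentage."""
    suspects = []
    for index, report in enumerate(reports):
        if report["accuracy"] < min_accuracy or report["whitespace"] > max_whitespace:
            suspects.append(index)
    return suspects


# With Camelot, the reports would come from:
#   reports = [table.parsing_report for table in pdf_table_reader]
# The dictionaries below are made-up examples in the same shape.
example_reports = [
    {"accuracy": 100.0, "whitespace": 0.0, "order": 1, "page": 3},
    {"accuracy": 100.0, "whitespace": 95.0, "order": 1, "page": 20},
    {"accuracy": 40.0, "whitespace": 10.0, "order": 2, "page": 20},
]
print(flag_suspect_tables(example_reports))
```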


- Pandas df index 0: When Discovered Codes
- Pandas df index 1: Status Codes
- Pandas df index 2: Cause Codes
- Pandas df index 3-4: Deferral Codes
- Pandas df index 5: Safety Hazard Codes
- Pandas df index 6: Alteration Type Codes
- Pandas df index 7: Rank or Rate Codes
- Pandas df index 8: Priority Codes
- Pandas df index 9-10: Type Availability Codes/Usage of Type Availability Codes
- Pandas df index 11-12: Action To be Taken Codes
- Pandas df index 13-18: Action Taken Codes Part 1 (Verify on HTTPS://OARS.NSLC.NAVY.MIL/OARS/DOCS/REF/INDEX.HTML)
- Pandas df index 19: When Discovered Codes
- Pandas df index 20: Status Codes
- Pandas df index 21: Cause Codes
- Pandas df index 22: Safety Hazard Codes
- Pandas df index 23-26: Action Taken Codes Part 2

- Pandas df index 27-46: Junk (47 tables were detected, so the last valid index is 46)
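
The mapping above can be kept as data so that later cells look up a table's name instead of hard-coding index ranges. This is a minimal sketch; `TABLE_GROUPS` and `group_for_index` are hypothetical names, and the junk indexes are deliberately left out so that they return `None`.

```python
# (name, first index, last index), transcribed from the list above
TABLE_GROUPS = [
    ("When Discovered Codes", 0, 0),
    ("Status Codes", 1, 1),
    ("Cause Codes", 2, 2),
    ("Deferral Codes", 3, 4),
    ("Safety Hazard Codes", 5, 5),
    ("Alteration Type Codes", 6, 6),
    ("Rank or Rate Codes", 7, 7),
    ("Priority Codes", 8, 8),
    ("Type Availability Codes", 9, 10),
    ("Action To be Taken Codes", 11, 12),
    ("Action Taken Codes Part 1", 13, 18),
    ("When Discovered Codes", 19, 19),
    ("Status Codes", 20, 20),
    ("Cause Codes", 21, 21),
    ("Safety Hazard Codes", 22, 22),
    ("Action Taken Codes Part 2", 23, 26),
]


def group_for_index(index):
    """Return the table group name for a table index, or None for unlisted (junk) indexes."""
    for name, first, last in TABLE_GROUPS:
        if first <= index <= last:
            return name
    return None


print(group_for_index(4))
print(group_for_index(30))
```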





In [44]:
pdf_table_reader[29].df

Unnamed: 0,0
0,
1,
2,
3,
4,
5,
6,
7,


In [7]:
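import pandas as pd
# DataFrame.merge returns a new frame (the original call discarded it) and joins on shared columns.
# Stacking the first 14 tables row-wise with concat is assumed to be the intent here.
table_c1 = pd.concat([pdf_table_reader[index].df for index in range(14)], ignore_index=True)
table_c1.to_csv('junk.csv')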

In [8]:

for index in range(0,14):
    file_name = "./data/2k_datastruct_" + str(index) + ".csv"
    pdf_table_reader[index].df.to_csv(file_name)