Goal: automated table extraction for knowledge graph development. [Camelot](https://camelot-py.readthedocs.io/en/master/) extracts tables from PDFs into pandas DataFrames. The PDF used here is "Appendix C", which describes the data format for the form "OPNAV 4790/2K". This [Medium article](https://medium.com/@luchensf/retrieve-table-contents-from-pdf-df514b779d07) is a useful introduction to extracting tables with Camelot.

## Install camelot from conda-forge
```bash
mamba install -c conda-forge camelot-py
```

### Note:
Installing with mamba from the conda-forge channel places the Ghostscript `gs` executable in the environment's `bin` directory, which probably won't be on your `PATH`. To fix this:

```bash
export PATH=/Users/cvardema/mambaforge/envs/pdfmunge/bin:$PATH
```

Additionally, and annoyingly, conda installs the Python [ghostscript](https://pypi.org/project/ghostscript/) package in the user site-packages directory (`.local`), which may not be on the Python path. Here I used `sys.path.insert` to add `.local/lib/python3.10/site-packages` to `sys.path`.

Both paths (executable and module) must be set correctly, or camelot will fail with unhelpful module-not-found errors.
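
Before importing camelot, it can help to confirm that both pieces are visible from the running kernel. The following is a minimal sketch using only the standard library. The helper name `check_ghostscript` is hypothetical, and it assumes the executable is called `gs` and the module `ghostscript`, as described above.

```python
import importlib.util
import shutil


def check_ghostscript(executable="gs", module="ghostscript"):
    """Report whether the Ghostscript executable and Python module are findable."""
    return {
        # shutil.which searches the directories listed in PATH
        "executable_path": shutil.which(executable),
        # find_spec returns None when the module is not on sys.path
        "module_found": importlib.util.find_spec(module) is not None,
    }


status = check_ghostscript()
print(status)
```

If `executable_path` is `None` or `module_found` is `False`, the corresponding path fix above has not taken effect in this process.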


In [1]:
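import os
import sys
# "!export" runs in a throwaway subshell, so set PATH on the kernel process instead.
os.environ["PATH"] = "/Users/ccunnin8/mambaforge/envs/pdfmunge/bin" + os.pathsep + os.environ["PATH"]
sys.path.insert(0, "/Users/ccunnin8/.local/lib/python3.10/site-packages")
print(sys.path)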

['/Users/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/git/decoder-ring', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python310.zip', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/lib-dynload', '', '/home/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/site-packages']


In [2]:
import camelot
from camelot import utils
from pathlib import Path

In [3]:
import ghostscript

In [4]:
datapath = Path('./data')
# Build the PDF path from datapath; read_pdf expects a string path.
pdf_table_reader = camelot.read_pdf(str(datapath / 'data_elements.pdf'), pages='all')

In [5]:
print("Number of Tables detected: ", pdf_table_reader.n)
print(pdf_table_reader[0].parsing_report)

Number of Tables detected:  51
{'accuracy': 100.0, 'whitespace': 0.0, 'order': 1, 'page': 2}
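
The report above covers only the first table. To spot weak extractions across all 51 results, the per-table reports can be collected and filtered. This is a sketch, assuming each table object exposes a `parsing_report` dict with the keys shown above. The function name `low_accuracy_tables` and the 90.0 threshold are illustrative choices, and the stand-in objects are made up so the block runs without the PDF.

```python
from types import SimpleNamespace


def low_accuracy_tables(tables, threshold=90.0):
    """Return (index, report) pairs whose 'accuracy' falls below threshold."""
    flagged = []
    for index, table in enumerate(tables):
        report = table.parsing_report
        if report["accuracy"] < threshold:
            flagged.append((index, report))
    return flagged


# Stand-ins for camelot tables (fabricated values, for illustration only).
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 100.0, "whitespace": 0.0, "order": 1, "page": 2}),
    SimpleNamespace(parsing_report={"accuracy": 72.5, "whitespace": 10.0, "order": 1, "page": 3}),
]
flagged = low_accuracy_tables(fake_tables)
print(flagged)
```

With the real results, `low_accuracy_tables(pdf_table_reader)` should list the tables worth inspecting by hand.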


- Pandas df index 0-14: Data Elements Specifications, Table C-1, Appendix C pp C-2 to C-15
- Pandas df index 15-16: Maintenance Action Codes, Table C-2, pp C-16 to C-17
- Pandas df index 17: Configuration Changes Maintenance Action Codes, Table C-3, pp C-18
- Pandas df index 18-19: Other Alterations, Table C-4, pp C-20 to C-21
- Pandas df index 20-21: Availability Codes, Table C-5, pp C-21 to C-22
- Pandas df index 22-23: Cause Codes, Table C-6, pp C-23 to C-24
- Pandas df index 24-25: Deferral Reason Codes, Table C-7, pp C-26 to C-27
- Pandas df index 26: Final Action Codes, Table C-8, pp C-29
- Pandas df index 27-29: Funding Activity Codes, Table C-9, pp C-29 to C-31
- Pandas df index 30-36: IMA Repair Work Center Codes, Table C-10, pp C-32 to C-38
- Pandas df index 37: In-Progress Codes, Table C-11, pp C-39
- Pandas df index 38: Key Operation Codes, Table C-12, pp C-41
- Pandas df index 39: Department Key, Table C-13, pp C-42
- Pandas df index 40: Deferred Maintenance Action Priority, Table C-14, pp C-45. Note: row 1 is missing CODE 1 Mandatory from page 44.
- Pandas df index 41: Rate Data Elements, Table C-15, pp C-47
- Pandas df index 42: Risk Assessment Codes, Table C-16, pp C-50
- Pandas df index 43-44: Screening (TYCOM), Table C-17, pp C-51 to C-52
- Pandas df index 45-46: Special Purpose, Table C-18, pp C-53 to C-54
- Pandas df index 47: Effect of Failure, Table C-19, pp C-55
- Pandas df index 48: Availability Type, Table C-20, pp C-56
- Pandas df index 49-50: When Discovered Codes, Table C-21, pp C-56
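
The index ranges above can be kept in code so that later cells refer to tables by name rather than by position. This sketch covers only the first few entries. The dictionary name `TABLE_INDEXES`, the key strings, and the helper `table_for_index` are hypothetical, and the ranges are copied from the list.

```python
# Inclusive (first, last) result indexes per appendix table, copied from the list above.
TABLE_INDEXES = {
    "C-1 Data Elements Specifications": (0, 14),
    "C-2 Maintenance Action Codes": (15, 16),
    "C-3 Configuration Changes Maintenance Action Codes": (17, 17),
    "C-4 Other Alterations": (18, 19),
}


def table_for_index(index, table_indexes=TABLE_INDEXES):
    """Return the table name whose inclusive range contains index, else None."""
    for name, (first, last) in table_indexes.items():
        if first <= index <= last:
            return name
    return None


print(table_for_index(16))
```

Storing the ranges as inclusive pairs mirrors how they are written in the list, which makes it harder to introduce off-by-one mistakes when turning them into `range(first, last + 1)`.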


In [77]:
pdf_table_reader[50].df

Unnamed: 0,0,1
0,CODE,DESCRIPTION
1,6,During PMS
2,7,Securing
3,8,During AEC (Assessment of Equipment) Program
4,9,"No Failure, PMS Accomplishment Only"
5,0,Not Applicable (use when reporting printing se...


In [76]:
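import pandas as pd
# Table C-1 spans result indexes 0-14. Stack the page fragments with concat:
# merge joins on column values, and its return value was being discarded.
table_c1 = pd.concat(
    [pdf_table_reader[index].df for index in range(15)],
    ignore_index=True,
)
table_c1.to_csv('junk.csv')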
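
When the page fragments of one table are combined, each fragment may carry its own copy of the header row (the output for index 50 above starts with a `CODE, DESCRIPTION` row). A minimal sketch of promoting that row to column names and dropping the repeats follows. The helper name `stack_fragments` is hypothetical, the two small frames are made-up stand-ins for per-page results, and the approach assumes every fragment has the same columns.

```python
import pandas as pd


def stack_fragments(frames):
    """Concatenate per-page fragments, using the first fragment's row 0 as header."""
    header = list(frames[0].iloc[0])
    combined = pd.concat(frames, ignore_index=True)
    combined.columns = header
    # Flag every row that merely repeats the header (possibly one per fragment).
    is_header = combined.apply(lambda row: list(row) == header, axis=1)
    return combined[~is_header].reset_index(drop=True)


# Made-up fragments shaped like camelot output: integer columns, header in row 0.
page_1 = pd.DataFrame([["CODE", "DESCRIPTION"], ["6", "During PMS"]])
page_2 = pd.DataFrame([["CODE", "DESCRIPTION"], ["7", "Securing"]])
stacked = stack_fragments([page_1, page_2])
print(stacked)
```

If a continuation page does not repeat the header, its rows simply pass through unchanged, since only rows equal to the header are dropped.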

In [15]:

# Table C-1 occupies result indexes 0-14, so the range must stop at 15.
for index in range(15):
    file_name = datapath / f"2k_datastruct_{index}.csv"
    pdf_table_reader[index].df.to_csv(file_name)