Goal: Automated table extraction for knowledge graph development. [Camelot](https://camelot-py.readthedocs.io/en/master/) provides tools to extract tables from PDFs into pandas DataFrames. The PDF used here is "Appendix C", which describes the data format for the form “OPNAV 4790/2K”. This [Medium article](https://medium.com/@luchensf/retrieve-table-contents-from-pdf-df514b779d07) is a useful introduction to extracting tables with Camelot.

## Install camelot from conda-forge
```bash
mamba install -c conda-forge camelot-py
```

### Note:
Installing with mamba from the conda-forge channel places the Ghostscript `gs` executable dependency in the environment's bin directory, which probably won't be on your PATH. To fix this:

```bash
export PATH=/Users/cvardema/mambaforge/envs/pdfmunge/bin:$PATH
```

Additionally and annoyingly, conda installs the Python [ghostscript](https://pypi.org/project/ghostscript/) package in the user site-packages directory (.local), which may not be on the Python path. Here I used sys.path.insert to put .local/lib/python3.10/site-packages at the front of sys.path.

Both of these paths (executable and module) must be set correctly, or camelot will fail with irritating module-not-found error messages.
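Both settings can also be applied and checked from inside Python before camelot is imported. The following is a minimal sketch rather than part of the original notebook: the helper names `prepend_to_path` and `check_camelot_deps` are hypothetical, and the directories you pass in are whatever your own environment uses.

```python
import importlib.util
import os
import shutil
import sys


def prepend_to_path(bin_dir, site_dir):
    """Put bin_dir first on PATH and site_dir first on sys.path (hypothetical helper)."""
    os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
    if site_dir in sys.path:
        sys.path.remove(site_dir)
    sys.path.insert(0, site_dir)


def check_camelot_deps(executable="gs", module="ghostscript"):
    """Report whether the executable and the Python module can be located."""
    return {
        "executable_found": shutil.which(executable) is not None,
        "module_found": importlib.util.find_spec(module) is not None,
    }
```

Calling `check_camelot_deps()` after `prepend_to_path(...)` should return `True` for both keys once the paths are right; a `False` value tells you which of the two settings still needs attention.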


In [2]:
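import os
import sys
# `!export` runs in a temporary subshell and does not change the kernel's PATH,
# so the environment variable is set through os.environ instead.
os.environ["PATH"] = "/Users/ccunnin8/mambaforge/envs/pdfmunge/bin" + os.pathsep + os.environ["PATH"]
sys.path.insert(0, "/Users/ccunnin8/.local/lib/python3.10/site-packages")
print(sys.path)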

['/Users/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/git/decoder-ring', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python310.zip', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/lib-dynload', '', '/home/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/site-packages']


In [3]:
import camelot
from camelot import utils
from pathlib import Path

In [4]:
import ghostscript

In [5]:
datapath = Path('./data')
# Build the file path from datapath and read every page of the PDF.
pdf_table_reader = camelot.read_pdf(str(datapath / 'JFMM-VI-19-2K.pdf'), pages='all')

In [6]:
print("Number of Tables detected: ", pdf_table_reader.n)
print(pdf_table_reader[0].parsing_report)

Number of Tables detected:  47
{'accuracy': 100.0, 'whitespace': 0.0, 'order': 1, 'page': 3}
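The parsing report is a plain dictionary, so it can be used to screen all detected tables at once instead of inspecting them one by one. The sketch below assumes a hypothetical helper `flag_suspect_tables` and illustrative thresholds; neither comes from Camelot itself, and the second sample report is made up for demonstration.

```python
def flag_suspect_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return indices of tables whose parsing report looks unreliable.

    `reports` is a list of dicts shaped like Camelot's parsing_report.
    The thresholds are illustrative assumptions, not Camelot defaults.
    """
    suspects = []
    for index, report in enumerate(reports):
        if report["accuracy"] < min_accuracy or report["whitespace"] > max_whitespace:
            suspects.append(index)
    return suspects


# With the reader above, the reports could be collected like this:
# reports = [pdf_table_reader[i].parsing_report for i in range(pdf_table_reader.n)]
sample_reports = [
    {"accuracy": 100.0, "whitespace": 0.0, "order": 1, "page": 3},
    {"accuracy": 62.5, "whitespace": 80.0, "order": 1, "page": 40},  # made-up values
]
print(flag_suspect_tables(sample_reports))  # prints [1]
```

A low score does not prove a table is junk, and a high score does not prove it is useful, so the flagged indices are best treated as a shortlist for manual review.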


- Pandas df index 0: When Discovered Codes
- Pandas df index 1: Status Codes
- Pandas df index 2: Cause Codes
- Pandas df index 3-4: Deferral Codes
- Pandas df index 5: Safety Hazard Codes
- Pandas df index 6: Alteration Type Codes
- Pandas df index 7: Rank or Rate Codes
- Pandas df index 8: Priority Codes
- Pandas df index 9-10: Type Availability Codes/Usage of Type Availability Codes
- Pandas df index 11-12: Action to Be Taken Codes
- Pandas df index 13-18: Action Taken Codes Part 1 (Verify on HTTPS://OARS.NSLC.NAVY.MIL/OARS/DOCS/REF/INDEX.HTML)
- Pandas df index 19: When Discovered Codes
- Pandas df index 20: Status Codes
- Pandas df index 21: Cause Codes
- Pandas df index 22: Safety Hazard Codes
- Pandas df index 23-26: Action Taken Codes Part 2

- Pandas df index 27-46: Junk (47 tables were detected, so the last index is 46)
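The same mapping can be kept as data so that later cells look labels up instead of hard-coding index numbers. This is a sketch only: the `TABLE_LABELS` structure and the `label_for` helper are hypothetical names, and the ranges are transcribed from the list above.

```python
# (first index, last index, label) triples transcribed from the list above.
TABLE_LABELS = [
    (0, 0, "When Discovered Codes"),
    (1, 1, "Status Codes"),
    (2, 2, "Cause Codes"),
    (3, 4, "Deferral Codes"),
    (5, 5, "Safety Hazard Codes"),
    (6, 6, "Alteration Type Codes"),
    (7, 7, "Rank or Rate Codes"),
    (8, 8, "Priority Codes"),
    (9, 10, "Type Availability Codes/Usage of Type Availability Codes"),
    (11, 12, "Action to Be Taken Codes"),
    (13, 18, "Action Taken Codes Part 1"),
    (19, 19, "When Discovered Codes"),
    (20, 20, "Status Codes"),
    (21, 21, "Cause Codes"),
    (22, 22, "Safety Hazard Codes"),
    (23, 26, "Action Taken Codes Part 2"),
]


def label_for(index):
    """Return the label for a table index; anything outside the listed ranges counts as junk."""
    for first, last, label in TABLE_LABELS:
        if first <= index <= last:
            return label
    return "Junk"
```

A list of triples is used rather than a dictionary keyed by label because several labels (for example "Status Codes") appear twice in the document.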





In [12]:
pdf_table_reader[4].df

Unnamed: 0,0,1
0,Code,Deferral Reason
1,5,Inadequate School Practical Training
2,6,Lack of Facilities or Capabilities
3,7,Not Authorized for Ship’s Force or Unit Accomp...
4,8,For Ship’s Force or Unit Overhaul of Availabil...
5,9,Lack of Technical Documentation
6,0,Other - or Not Applicable (explain in block 35)
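As the output shows, Camelot leaves the header ("Code", "Deferral Reason") as data row 0 and numbers the columns 0 and 1. One way to promote that row to real column names is sketched below. `promote_header` is a hypothetical helper, not a Camelot or pandas function, and it assumes the first row really is a header.

```python
import pandas as pd


def promote_header(df):
    """Use the first row as column names and drop it from the data."""
    result = df.iloc[1:].reset_index(drop=True)
    result.columns = df.iloc[0].tolist()
    return result


# Small stand-in for a Camelot table; the values are taken from the output above.
raw = pd.DataFrame([
    ["Code", "Deferral Reason"],
    ["5", "Inadequate School Practical Training"],
    ["6", "Lack of Facilities or Capabilities"],
])
clean = promote_header(raw)
print(list(clean.columns))
```

With named columns, the exported CSV files no longer need a separate step to skip the header row when they are read back.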


In [8]:
import pandas as pd
#when discovered
whenDisc = pdf_table_reader[0].df
whenDisc.to_csv('data/whenDiscoveredCode.csv')
#status
status = pdf_table_reader[1].df
status.to_csv('data/statusCode.csv')
#cause
cause = pdf_table_reader[2].df
cause.to_csv('data/causeCode.csv')
#deferral
table_c3 = pdf_table_reader[3].df
table_c41 = pdf_table_reader[4].df
table_c42 = table_c41.drop(table_c41.index[0])
df = pd.concat([table_c3,table_c42])
df.to_csv('data/deferralCode.csv')
#safety hazard
safety = pdf_table_reader[5].df
safety.to_csv('data/safetyHazardCode.csv')
#alteration type
alteration = pdf_table_reader[6].df
alteration.to_csv('data/alterationCode.csv')
#rank or rate
rate = pdf_table_reader[7].df
rate.to_csv('data/rateCode.csv')
#priority
priority = pdf_table_reader[8].df
priority.to_csv('data/priorityCode.csv')
#type availability
# Named type_avail so the built-in type() is not shadowed.
type_avail = pdf_table_reader[9].df
type_avail.to_csv('data/typeAvailabilityCode.csv')
#type availability scenarios
scenario = pdf_table_reader[10].df
scenario.to_csv('data/typeAvailabilityScenarioCode.csv')
#action to be taken
table_c11 = pdf_table_reader[11].df
table_c121 = pdf_table_reader[12].df
table_c122 = table_c121.drop(table_c121.index[0])
df = pd.concat([table_c11,table_c122])
df.to_csv('data/actionToBeTakenCode.csv')
#action taken 1
action1 = pdf_table_reader[13].df

action2 = pdf_table_reader[14].df
action21 = action2.iloc[:5, :]
action22 = action2.iloc[5:, :]
action21.to_csv('data/actionTakenCodes/at3_2ndCharacter.csv')

action3 = pdf_table_reader[15].df
action31 = action3.iloc[:5, :]
action32 = action3.iloc[5:, :]
action31.to_csv('data/actionTakenCodes/at7_2ndCharacter.csv')

action4 = pdf_table_reader[16].df
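# Copy the slice so the appended row does not write into a view of action4.
action41 = action4.iloc[:11, :].copy()
# Append one extra row at the next free index label of action41.
action41.loc[len(action41)] = ['','A', 'FOTE, multimode heavy duty MQJs utilized']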
action41.to_csv('data/actionTakenCodes/at9_2ndCharacter.csv')
action42 = action4.iloc[11:,:]

action5 = pdf_table_reader[17].df
action51 = action5.drop(action5.index[0])
action6 = pdf_table_reader[18].df

firstNumber = pd.concat([action1,action22,action32,action42,action51])
firstNumber.to_csv('data/actionTakenCodes/actionTakenPrimaryCode.csv')
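The cell above repeats one pattern several times: a table that continues on the next page is rejoined by dropping the repeated header row from the continuation and concatenating. A minimal sketch of that pattern as a reusable function follows. `stitch_tables` is a hypothetical helper, the sample rows are placeholders, and it assumes every continuation begins with a repeated header row, which is what the cell above assumes for tables 3/4 and 11/12.

```python
import pandas as pd


def stitch_tables(first, *continuations):
    """Concatenate a table with its continuation pages.

    Assumes each continuation repeats the header in its first row;
    that row is dropped before concatenating.
    """
    parts = [first] + [part.iloc[1:] for part in continuations]
    return pd.concat(parts, ignore_index=True)


# Placeholder frames standing in for two pages of one table.
page_one = pd.DataFrame([["Code", "Deferral Reason"], ["1", "placeholder reason A"]])
page_two = pd.DataFrame([["Code", "Deferral Reason"], ["5", "placeholder reason B"]])
combined = stitch_tables(page_one, page_two)
print(combined.shape)  # prints (3, 2)
```

Passing `ignore_index=True` renumbers the rows, which avoids the duplicate index labels that a plain `pd.concat` of two Camelot tables would carry into the CSV.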



In [9]:
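import pandas as pd
# DataFrame.merge returns a new frame, so the result must be assigned back.
# Start from table 3 and fold in its continuation (table 4, per the index list above).
table_c1 = pdf_table_reader[3].df
for index in range(4, 5):
    table_c1 = table_c1.merge(pdf_table_reader[index].df, how='outer')
table_c1.to_csv('merge1.csv')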

In [10]:

for index in range(0,14):
    file_name = "./data/2k_datastruct_" + str(index) + ".csv"
    pdf_table_reader[index].df.to_csv(file_name)