## JETDS Information Pull
This Jupyter Notebook allows for automated table extraction for knowledge graph development and general data analytics usage. [Camelot](https://camelot-py.readthedocs.io/en/master/) provides tools to extract tables from PDFs to pandas data frames. The data frames are then cleaned up and saved as CSV files in a format usable for the code disambiguation subgraph built in codeTypeBuilder.ipynb.

The PDF processed here is a 2018 copy of the Department of Defense Standard Practice JETDS document, which covers the designator codes that appear in the 2kilos form. Here is a useful [Medium article](https://medium.com/@luchensf/retrieve-table-contents-from-pdf-df514b779d07) on extracting tables using Camelot.

### Installation: camelot from conda-forge
```bash
mamba install -c conda-forge camelot-py
```

### Note:
Installing with mamba from the conda-forge channel puts the ghostscript `gs` executable dependency in the environment's bin directory, which probably won't be in your PATH. To fix this:

```bash
export PATH=/Users/cvardema/mambaforge/envs/pdfmunge/bin:$PATH
```

Additionally, conda installs the Python [ghostscript](https://pypi.org/project/ghostscript/) package in the user site-packages directory (.local), which may not be on the Python path. Here I used sys.path.insert to add .local/lib/python3.10/site-packages to the Python path.

Both of these paths (executable and module) must be set correctly, or camelot will fail with module-not-found errors.
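
Before importing camelot, it can help to check that both lookups succeed. The snippet below is a minimal sketch, not part of the original notebook: `find_missing` is a hypothetical helper, and it assumes the executable is named `gs` and the modules are importable as `ghostscript` and `camelot`.

```python
import importlib.util
import shutil


def find_missing(executables, modules):
    """Return the names that cannot be found on PATH or on sys.path."""
    missing = []
    for exe in executables:
        # shutil.which searches the directories listed in PATH
        if shutil.which(exe) is None:
            missing.append(exe)
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(mod) is None:
            missing.append(mod)
    return missing


# Names below are assumptions about this environment, not verified facts
missing = find_missing(executables=["gs"], modules=["ghostscript", "camelot"])
print("Missing:", missing if missing else "nothing")
```

If anything is reported missing, fix the PATH or sys.path entry first; the check itself does not import camelot.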

### Environment Files
Until we adopt the TAI Frameworks setup for package dependency resolution, I have attempted to clone out some environment.yml files for usage. Try utilizing these if necessary.


In [1]:
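import os
import sys
# "!export" runs in a throwaway subshell and does not change this kernel's PATH,
# so prepend the environment's bin directory through os.environ instead.
os.environ["PATH"] = "/Users/ccunnin8/mambaforge/envs/pdfmunge/bin" + os.pathsep + os.environ["PATH"]
sys.path.insert(0, "/Users/ccunnin8/.local/lib/python3.10/site-packages")
print(sys.path)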

['/Users/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/git/JETDS', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python310.zip', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/lib-dynload', '', '/home/ccunnin8/.local/lib/python3.10/site-packages', '/home/ccunnin8/mambaforge/envs/pdfmunge/lib/python3.10/site-packages']


In [2]:
import camelot
from camelot import utils
from pathlib import Path

In [3]:
import ghostscript

### PDF Table Reader
We start by feeding Camelot the selected PDF. Camelot returns every table it identifies as a dataframe, which you can access by index with pdf_table_reader[i].df. Note that tables that span a page break or are formatted unusually often get split into separate dataframes. These fragments are trimmed and recombined with pandas in the cleaning step.

In [5]:
datapath = Path('./data')
# Build the PDF path from datapath; camelot expects a string path
pdf_table_reader = camelot.read_pdf(str(datapath / 'JETDS_2018.pdf'), pages='all')



In [6]:
print("Number of Tables detected: ", pdf_table_reader.n)
print(pdf_table_reader[0].parsing_report)

Number of Tables detected:  36
{'accuracy': 100.0, 'whitespace': 0.0, 'order': 1, 'page': 11}
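
Since not every detected table is usable, the parsing report can serve as a first filter. The following is only a sketch under assumptions: `select_tables` is a hypothetical helper, the thresholds are arbitrary, and the second sample report is made up for illustration. It works on plain dictionaries shaped like the report printed above.

```python
def select_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the positions of reports that pass both thresholds."""
    keep = []
    for position, report in enumerate(reports):
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            keep.append(position)
    return keep


# Only the first report comes from the output above; the second is invented
sample_reports = [
    {"accuracy": 100.0, "whitespace": 0.0, "order": 1, "page": 11},
    {"accuracy": 42.5, "whitespace": 80.0, "order": 1, "page": 12},
]
print(select_tables(sample_reports))
```

In the notebook, the reports could be gathered with something like `[t.parsing_report for t in pdf_table_reader]`. A high accuracy score does not guarantee a meaningful table, so manual inspection is still needed.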


In [8]:
pdf_table_reader[3].df

Unnamed: 0,0,1
0,Group example of use indicators,Family name \n(Not to be construed as limiting...
1,OA Miscellaneous groups,Groups not otherwise listed. Do not use if a ...
2,OB Multiplexer and/or demultiplexer groups,All types
3,OD Indicator groups,All types
4,OE \nAntenna groups,All Types
5,OF \nAdapter groups,All types
6,OG Amplifier groups,All types
7,OH Simulator groups,All types
8,OI \nCryptographic groups,All types
9,OJ \nConsoles and console groups,All types
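
In the raw output above, the indicator code and the group name share column 0, sometimes separated by a line break. One possible way to separate them is sketched below; `split_indicator` is a hypothetical helper, and the sketch assumes every code is a single whitespace-free token at the start of the cell. The three rows are copied from the output above.

```python
import pandas as pd

# Three rows copied from the raw table above (columns 0 and 1 as displayed)
raw = pd.DataFrame({
    0: ["OB Multiplexer and/or demultiplexer groups", "OE \nAntenna groups", "OI \nCryptographic groups"],
    1: ["All types", "All Types", "All types"],
})


def split_indicator(frame):
    """Split column 0 into an indicator code and a group name."""
    # Remove embedded line breaks, then split once on the first whitespace run
    parts = frame[0].str.replace("\n", "", regex=False).str.split(n=1, expand=True)
    return pd.DataFrame({
        "Indicator": parts[0],
        "Group example of use indicators": parts[1],
        "Family name": frame[1],
    })


tidy = split_indicator(raw)
print(tidy)
```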


### Breakdown of Usable Tables
- Pandas df index 0: Item Levels (Main Hierarchy)
- Pandas df index 3: Group Indicators
- Pandas df index 4-8: Unit Indicators
- Pandas df index 3-4: Deferral Codes
- Pandas df index 5: Safety Hazard Codes
- Pandas df index 6: Alteration Type Codes
- Pandas df index 7: Rank or Rate Codes
- Pandas df index 8: Priority Codes
- Pandas df index 9-10: Type Availability Codes/Usage of Type Availability Codes
- Pandas df index 11-12: Action To be Taken Codes
- Pandas df index 13-18: Action Taken Codes Part 1 (Verify on HTTPS://OARS.NSLC.NAVY.MIL/OARS/DOCS/REF/INDEX.HTML)
- Pandas df index 19: When Discovered Codes
- Pandas df index 20: Status Codes
- Pandas df index 21: Cause Codes
- Pandas df index 22: Safety Hazard Codes
- Pandas df index 23-26: Action Taken Codes Part 2

- Pandas df index 27-47: Junk





### Datastruct Cleaning
We then go through the desired dataframes, clean up the output, and save the results as CSV files under the ./tables directory.
Note: this code will be cleaned up and streamlined later.

In [9]:
import pandas as pd

# Item levels (table 0)
itemLevels = pdf_table_reader[0].df
itemLevels.to_csv('tables/itemLevels.csv')

# Group indicators (table 3)
group = pdf_table_reader[3].df
group.to_csv('tables/group.csv')

# Unit indicators are split across tables 4-8; stack them into one frame
unit_all = pd.concat([pdf_table_reader[i].df for i in range(4, 9)])
unit_all.to_csv('tables/unit.csv')
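
A plain concatenation keeps each fragment's own row numbers and any header row that a fragment carries. The sketch below shows one way to stack fragments with a clean index; `combine_fragments` is a hypothetical helper, and it assumes each fragment may begin with a copy of the header row, which has not been verified against the PDF. The sample rows are taken from the unit table.

```python
import pandas as pd


def combine_fragments(fragments):
    """Stack table fragments, using the first row of the first fragment as the header."""
    header = list(fragments[0].iloc[0])
    stacked = pd.concat(fragments, ignore_index=True)
    # Flag every row that merely repeats the header
    is_header = stacked.apply(lambda row: list(row) == header, axis=1)
    body = stacked[~is_header].reset_index(drop=True)
    body.columns = header
    return body


# Two toy fragments standing in for tables split by a page break
part1 = pd.DataFrame([["Unit indicators", "Family name"], ["AB", "Support for antennas"], ["AM", "Amplifiers"]])
part2 = pd.DataFrame([["Unit indicators", "Family name"], ["TU", "Television"], ["TW", "Tape units"]])

combined = combine_fragments([part1, part2])
print(combined)
```

Resetting the index here avoids duplicate row labels in the saved CSV.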

In [11]:
# Table cleanup for KG building
from pathlib import Path

pd.set_option('display.max_columns', None)  # show every column when displaying

# Group indicators: assign explicit column names, then drop the header row.
# The four names assume the CSV already holds four columns
# (saved index, indicator, group name, family name).
p = Path('./tables/group.csv')
df = pd.read_csv(p)
df.columns = ['0', 'Indicator', 'Group example of use indicators', 'Family name']
df = df.iloc[1:]
display(df)
df.to_csv(p)

# Item levels: promote the first row to column names, then drop it
p = Path('./tables/itemLevels.csv')
df = pd.read_csv(p)
df.columns = df.iloc[0]
df = df.iloc[1:]
display(df)
df.to_csv(p)

# Unit indicators: same header promotion as item levels
p = Path('./tables/unit.csv')
df = pd.read_csv(p)
df.columns = df.iloc[0]
df = df.iloc[1:]
df.to_csv(p)
display(df)

Unnamed: 0,0,Indicator,Group example of use indicators,Family name
1,1,OA,Miscellaneous groups,Groups not otherwise listed. Do not use if a ...
2,2,OB,Multiplexer and/or demultiplexer groups,All types
3,3,OD,Indicator groups,All types
4,4,OE,Antenna groups,All Types
5,5,OF,Adapter groups,All types
6,6,OG,Amplifier groups,All types
7,7,OH,Simulator groups,All types
8,8,OI,Cryptographic groups,All types
9,9,OJ,Consoles and console groups,All types
10,10,OK,Control groups,All types


Unnamed: 0,0,Item Level \nName,Description,Examples
1,1,Unit,An item that may be capable of independent ope...,"Radio, computer, digital \nPower Supply, Anten..."
2,2,Group,A collection of units or assemblies that are n...,Antenna group may be \n“used with” or “part \n...
3,3,Set,"A unit or units and necessary assemblies, suba...","Radio terminal set or \nsound measuring set, \..."
4,4,Subsystem,"A combination of sets, groups, etc., which per...",Intercept-Aerial Guided \nMissile Subsystem
5,5,System,"A combination of two or more sets, which may b...",Integrated Shipboard \nComputer System and a \...
6,6,Center,A collection of units and items in one locatio...,an Operations Center
7,7,Central,"A grouping of sets, units or combinations ther...","Operations Central, \nCentral, Communications"


Unnamed: 0,0,Unit indicators,Family name,Examples of use \n(Not to be construed as limiting the \napplication of the unit)
1,1,AB,Support for antennas,"Antenna mounts, mast bases, mast \nsections, t..."
2,2,AM,Amplifiers,"Power, audio, interphone, radio \nfrequency, v..."
3,3,AS,"Antenna, simple and complex","Arrays, parabolic type, masthead \nwhip or tel..."
4,4,BA,"Battery, primary type","Batteries, battery packs, etc."
5,5,BB,"Battery, secondary type","Batteries, battery packs, etc."
...,...,...,...,...
88,9,TT,Teletypewriter and facsimile \napparatus,"Teletype, tape, facsimile \nmiscellaneous equi..."
89,10,TU,Television,Special types
90,11,TW,Tape units,Preprogrammed with operational \ntest and chec...
91,12,V,Vehicles,"Carts, dollies, vans peculiar to \nelectronic ..."
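
The extra numeric column in the outputs above is what pandas produces when a frame is written with its index and read back. The following sketch demonstrates the effect in memory and how `index=False` avoids it; `round_trip` is a hypothetical helper and the two-row frame is toy data based on the group table.

```python
import io

import pandas as pd

frame = pd.DataFrame({"Indicator": ["OB", "OD"], "Family name": ["All types", "All types"]})


def round_trip(df, write_index):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(frame, write_index=True)
without_index = round_trip(frame, write_index=False)
print(list(with_index.columns))     # ['Unnamed: 0', 'Indicator', 'Family name']
print(list(without_index.columns))  # ['Indicator', 'Family name']
```

Passing `index=False` to `to_csv` (or `index_col=0` to `read_csv`) would keep the saved tables from gaining a column on every save and reload.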
