# Introduction

This script uses the [tabula-py](https://github.com/chezou/tabula-py) library to extract tables from ICE Significant Incident Summary forms (form G-324A, rev. 2019). These forms are created as text and were previously distributed as text-based pdf files. In February 2021, under the Biden administration and under the direction of DHS Secretary Mayorkas and ICE Director Tae Johnson, ICE inexplicably began distributing them as a series of flattened .jpg images bound into a pdf. It is not clear what procedure was used to convert perfectly good text-based pdf files into the equivalent of scanned images. Doing so makes the files harder to analyze properly, and ICE's additional manipulation introduces a source of error for the analyst, because further transformations are required to convert the files back into text.

For this project, Adobe Acrobat DC was used to apply optical character recognition (OCR) to the ICE-modified files. Once OCR was applied, the [Tabula](https://tabula.technology/) GUI application was used to create a table extraction template. This template, saved as a .json file, was placed in the project's document directory, while the OCR'd pdfs were placed in a data directory. This script then calls the `tabula.read_pdf_with_template()` function to apply the template to each pdf in the data directory. Each pdf contains multiple tables, each of which is best broken down into smaller component parts to avoid recognition errors. In this case there are six component tables, several of which need to be merged manually by analysts. The script writes an individual .csv file for each component table defined in the Tabula template file.
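
Because the script indexes the six component tables by position, it can help to confirm that the template really defines six extraction areas before running the full batch. The following is a minimal sketch, not part of the original workflow. It assumes the exported Tabula template is a JSON list with one object per selection and that each object has a `page` key; `summarize_template` is a hypothetical helper name.

```python
import json
from collections import Counter


def summarize_template(template_path):
    """Return (number of extraction areas, Counter of areas per page)."""
    with open(template_path, encoding="utf-8") as handle:
        areas = json.load(handle)
    # Assumption: the template is a JSON list with one object per selection,
    # and each object stores its page number under the "page" key.
    pages = Counter(area.get("page") for area in areas)
    return len(areas), pages
```

Calling `summarize_template("./docs/templates/2021-SIS-v2.tabula-template.json")` should then report six areas if the template matches the description above.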

# Additional Notes

Initially, the code below was developed and run in RStudio as part of an R Markdown file, since most of the project analysis takes place in R. However, debugging proved difficult because RStudio was not particularly verbose in its error reporting. Development of the script therefore shifted to Jupyter, which produced this document.

Some of the pdf files generated errors. The cause has not yet been identified. These files also generate errors when the template is applied in the Tabula GUI, but creating extraction zones specific to each file seems to produce the desired output. The problematic files will be processed individually by hand.

Lastly, there is an R library for Tabula called `tabulizer`. However, at the time this script was developed, it did not appear to support template files. Therefore, the decision was made to use tabula-py.
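
Since some files raise errors for reasons that are not yet understood, one option is to let the batch continue and collect the failing files for later manual processing. The sketch below is an illustration rather than the approach used in this project. `extract_all` and `extract_one` are hypothetical names, and the broad `except` is a deliberate assumption because the cause of the errors is unknown.

```python
def extract_all(stems, extract_one):
    """Apply extract_one to each file stem and return (results, failures)."""
    results, failures = {}, {}
    for stem in stems:
        try:
            results[stem] = extract_one(stem)
        except Exception as error:  # broad on purpose: the cause is unknown
            # keep the error text so the file can be reviewed by hand later
            failures[stem] = repr(error)
    return results, failures
```

Here `extract_one` would be a small function that wraps the `tabula.read_pdf_with_template()` call for a single file, and the keys of `failures` give the list of pdfs to process by hand.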

In [2]:
# Import libraries
import pandas as pd
import tabula
import os

In [32]:
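# get the base names (without extension) of the pdf files in the data directory;
# filtering on ".pdf" skips the csv files that earlier runs wrote to the same folder
listdir = os.listdir("./data/")
filelist = [os.path.splitext(x)[0] for x in listdir if x.endswith(".pdf")]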


# process the pdf files in a for loop and write csv files for each table
for file in filelist:

    output = tabula.read_pdf_with_template(
        input_path="./data/" + file + ".pdf",
        template_path="./docs/templates/2021-SIS-v2.tabula-template.json",
        pandas_options={"header": None})

    #output_df = output[0].append(output[1])
    # write one csv per component table, suffixed _A through _F
    for index, suffix in enumerate("ABCDEF"):
        output[index].to_csv("./data/" + file + "_" + suffix + ".csv")
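
Several component tables need to be merged, and the commented-out line in the loop relies on `DataFrame.append`, which was removed in pandas 2.0. The sketch below shows the same vertical merge with `pd.concat`. The helper `merge_components` and the placeholder frames are hypothetical, and it is an assumption that the tables to be merged share the same column layout.

```python
import pandas as pd


def merge_components(first, second):
    """Stack two component tables vertically with a fresh row index."""
    return pd.concat([first, second], ignore_index=True)


# Hypothetical stand-ins for two extracted component tables
part_a = pd.DataFrame([["field 1", "value 1"], ["field 2", "value 2"]])
part_b = pd.DataFrame([["field 3", "value 3"]])
merged = merge_components(part_a, part_b)
```

Inside the loop, the equivalent call would be `merge_components(output[0], output[1])`, assuming those are two of the tables that belong together.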