# Introduction to the Phil Analysis Project

The data for this project comes from a data extraction and engineering application I developed: [PhilAuditSystem](https://github.com/jackvaughan09/PhilippinesAuditSystem).

Below is a synopsis of how that system works, to give context for the analysis conducted in the following notebooks.

Political and economic development researcher [Mike Denly](https://mikedenly.com/) has been studying codified corruption in the governments of the developing world for a number of years. His current work focuses on quantifying corrupt practices through a novel methodology: using internal government audit reports to detect underhand dealings and legal gymnastics. This approach [works exceptionally well in some places](https://mikedenly.com/research/audit-measurement), which is why I was recruited to take on the task described here!

In late 2022, we discovered a treasure trove of audit reports for the government of the Philippines [posted online](https://www.coa.gov.ph/reports/annual-audit-reports/aar-local-government-units/#167-428-leyte). Unfortunately, the site is guarded against scraping by Cloudflare, so we've had to download the roughly 12,000 reports manually as zip files.

Each zip file contains a file structure that looks something like this:
```
├── MunicipalityYearAuditReport
│   ├── Auditors_reportYear.docx
│   ├── AuditCertificate.doc
│   ├── AuditStatus-report-year.pdf
│   ├── etc.
└── etc.
```
Files like `Auditors_reportYear.docx` and `AuditStatus-report-year.pdf` contain tables listing the audit observations.
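To make the layout above concrete, here is a minimal sketch of how candidate report files could be picked out of one archive. It is not the code from the repository: the function names, the extension set, and the filename keywords are all assumptions based only on the example names shown above.

```python
import re
import zipfile
from pathlib import PurePosixPath

# Assumption: illustrative filters only, inferred from the example file names.
CANDIDATE_EXTENSIONS = {".doc", ".docx", ".pdf"}
CANDIDATE_KEYWORDS = ("auditorsreport", "auditstatus")  # hypothetical keywords


def is_candidate(member_name):
    """Return True if an archive member looks like a file holding observation tables."""
    path = PurePosixPath(member_name)
    # Drop punctuation and case so "Auditors_report" and "AuditStatus-report" compare cleanly.
    stem = re.sub(r"[^a-z0-9]", "", path.stem.lower())
    has_keyword = any(keyword in stem for keyword in CANDIDATE_KEYWORDS)
    return path.suffix.lower() in CANDIDATE_EXTENSIONS and has_keyword


def find_candidate_reports(zip_path):
    """List the members of a zip archive that pass the (hypothetical) filter."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if is_candidate(name)]
```

With these assumed keywords, `Auditors_reportYear.docx` and `AuditStatus-report-year.pdf` would be kept, while `AuditCertificate.doc` would be skipped.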


### Step by step, the audit extraction system:

**I.** Unzips all zip files input to the software

**II.** Filters out the many unwanted files, keeping only the audit reports

**III.** Converts all relevant files (.doc/.docx files and non-standard PDFs) to a standard PDF format using the [LibreOffice](https://www.libreoffice.org/) command-line tools on Linux and shell scripting.

**IV.** Scrapes the PDF:
   1. Opens the PDF with `PyPDF2`
   2. Locates the portion of the document containing the relevant tables using fuzzy matching.
   3. Uses [camelot](https://camelot-py.readthedocs.io/en/master/), a table extraction library that relies on computer vision, to extract the tables containing the audit observations
   4. Implements an array of hand-crafted data cleaning tools to filter out bad data and consolidate good observations. For each document it:
   
      *i.* Establishes canonical headers for the document so that each table from the document can be joined in a single dataframe. 

         - Some tables do not have headers (column names) and get missed for this reason

         - Some tables have extra headers that become entrenched in the observation space

      *ii.* Filters out rows of entrenched headers

      *iii.* Attempts to coerce each document's headers into a standard set for the whole corpus, using fuzzy string matching to find a "good match" among the canon headers for each column name. We do this so that all tables from all documents passed to the system can be joined in a single file at the end (this requires standard column names).

            
            CANON_HEADERS = [
               "audit observation",
               "recommendations",
               "references",
               "status of implementation",
               "reasons for partial/non-implementation",
               "management action"
            ]            

      *iv.* Conducts overflow repair:

         - Some documents have table rows that span multiple pages, so some observations are cut in half. 
        
         - I use a number of rules derived from studying the corpus to locate these rows and concatenate them back onto the originating row
        
      *v.* Tags each observation with the source document name
      
      *vi.* Compiles all scraped observations into a dataframe

**V.** Compiles all document dataframes into a single .xlsx file

**VI.** Returns the collated data, the converted PDFs, and a log of filtering/data loss from the extraction process
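As a rough illustration of the header handling in step IV (*ii* and *iii*), the sketch below maps a scraped column name onto the canon headers and flags rows that are really repeated headers. It is an assumption-laden stand-in, not the system's code: it uses Python's standard-library `difflib` in place of whatever fuzzy matcher the project uses, and the function names, the `cutoff` of 0.7, and the "more than half the cells" rule are all hypothetical.

```python
from difflib import SequenceMatcher

CANON_HEADERS = [
    "audit observation",
    "recommendations",
    "references",
    "status of implementation",
    "reasons for partial/non-implementation",
    "management action",
]


def normalize(text):
    """Lowercase and collapse whitespace (scraped headers often contain line breaks)."""
    return " ".join(str(text).lower().split())


def best_canon_match(header, cutoff=0.7):
    """Return the most similar canon header, or None if nothing reaches the cutoff."""
    cleaned = normalize(header)
    best, best_score = None, 0.0
    for canon in CANON_HEADERS:
        score = SequenceMatcher(None, cleaned, canon).ratio()
        if score > best_score:
            best, best_score = canon, score
    return best if best_score >= cutoff else None


def is_entrenched_header(row, cutoff=0.7):
    """Hypothetical rule: a row is a stray header if most non-empty cells match a canon header."""
    cells = [cell for cell in row if normalize(cell)]
    if not cells:
        return False
    hits = sum(best_canon_match(cell, cutoff) is not None for cell in cells)
    return hits / len(cells) > 0.5
```

A column scraped as `"Status of\nImplementation"` would map to `"status of implementation"`, while an unrelated column name returns `None` and can be logged as data loss rather than silently joined.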


> The entire program runs inside of a Docker container that interacts with the host filesystem to load and return data. 
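The overflow repair in step IV (*iv*) is the least obvious part of the pipeline, so here is a minimal sketch of the general idea. The actual rules come from studying the corpus and are not reproduced here. The rule below (a row continues the previous one when its observation cell is empty or starts with a lowercase letter) is purely a hypothetical placeholder, as are the function names, and rows are shown as plain dictionaries rather than dataframe rows.

```python
def looks_like_continuation(row, key_column="audit observation"):
    """Hypothetical rule: the row has text, but its key cell is empty or starts lowercase."""
    key = (row.get(key_column) or "").strip()
    has_text = any((value or "").strip() for value in row.values())
    return has_text and (key == "" or key[0].islower())


def repair_overflow(rows, key_column="audit observation"):
    """Merge rows that were split across a page break back into the originating row."""
    repaired = []
    for row in rows:
        if repaired and looks_like_continuation(row, key_column):
            previous = repaired[-1]
            for column, value in row.items():
                fragment = (value or "").strip()
                if fragment:
                    # Append the overflow text to the matching cell of the originating row.
                    existing = (previous.get(column) or "").strip()
                    previous[column] = f"{existing} {fragment}".strip()
        else:
            repaired.append(dict(row))  # copy so the input rows are left untouched
    return repaired
```

Under this placeholder rule, a row whose observation reads `"liquidated on time."` would be folded back into a preceding row ending in `"Cash advances were not"`.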

### System Diagram:

![System Diagram](../img/introduction/auditsystemdiagram.png)


## Without further ado...

> Let's get into the analysis!