# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF files. Because the text files are highly structured, the extraction is done with regular expressions (regex).
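As a rough illustration of why regex suits structured text, here is a minimal sketch that pulls `LABEL: value` pairs out of a few lines. The sample text, the labels, and the `extract_fields` helper are all made up for this example; the actual Caltrans files and the patterns used in this repository are different.

```python
import re

# Hypothetical excerpt for illustration only; the real files use their own layout.
SAMPLE_TEXT = """\
CONTRACT NUMBER: 01-0A0001
BID OPENING DATE: 01/15/2020
NUMBER OF BIDDERS: 4
"""

# One "LABEL: value" pair per line. re.MULTILINE makes ^ and $ match at
# every line boundary instead of only at the start and end of the string.
FIELD_PATTERN = re.compile(
    r"^(?P<label>[A-Z ]+):\s*(?P<value>.+?)\s*$",
    re.MULTILINE,
)


def extract_fields(text: str) -> dict[str, str]:
    """Return a {label: value} mapping for every 'LABEL: value' line in text."""
    return {
        match.group("label").strip(): match.group("value")
        for match in FIELD_PATTERN.finditer(text)
    }


fields = extract_fields(SAMPLE_TEXT)
print(fields)
```

Named groups keep the pattern readable and make it easy to test on regex101.com before moving it into code.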

## Setup

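When running in Google Colab, the cell below downloads the code and the data and installs the required packages. When running locally, install the packages listed at the end of `setup()` if you don't have them yet: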

In [None]:
try:
    import google.colab
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False

from pathlib import Path

def setup():
    # First get GitHub code:
    !wget https://github.com/nesaboz/caltrans_data_extraction/archive/refs/heads/main.zip
    # unzip it
    !unzip main.zip
    # copy all the files to root
    !mv ./caltrans_data_extraction-main/* .
    # delete the empty folder
    !rm -r caltrans_data_extraction-main

    # now get all the data (should take under 30 seconds):
    # modify as needed:
    print("Downloading data.zip ...")
    !gdown '1y-ufhK56J3h994I5HKiarcFzbCkB6h_h'  
    
    # IMPORTANT: unzip to a folder called data
    !unzip data.zip -d data

    # delete the zip files and main.ipynb since it's confusing to have it here:
    !rm main.zip
    !rm data.zip
    !rm main.ipynb

    # install libraries:
    # note: 1.5.3 is the last pandas release in the 1.5 series
    !pip install pandas==1.5.3 numpy==1.26.4 tqdm==4.66.2 openpyxl==3.1.2 pytest==8.1.1 pyperclip==1.8.2

if IS_COLAB: 
    if not Path('data').exists():
        setup()
    else:
        response = input("The 'data' folder already exists. Run setup again anyway? yes or [no]: ").lower().strip()
        if response == "yes":
            setup()

In [None]:
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# Note: to print DataFrame fully use:
# pd.set_option('display.max_rows', None)  # to set globally, or use: 
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
#   display(df)

## Classify contracts into types

Some documents are better presented in lineprinter format (type1), others in table format (type2). In addition, some documents contain multiple contracts, and those need to be split:

In [None]:
# sort the contracts (comment out the following line to skip this step):
sort_contracts()
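To give an idea of what splitting a multi-contract document can look like, here is a minimal sketch. It assumes, purely for illustration, that every contract starts with a line beginning with `CONTRACT NUMBER`; the marker and the `split_contracts` helper are hypothetical and are not taken from `sort_contracts()`.

```python
import re

# Hypothetical marker: assume each contract begins with a line that
# starts with "CONTRACT NUMBER". The real files may use a different header.
CONTRACT_START = re.compile(r"^CONTRACT NUMBER", re.MULTILINE)


def split_contracts(text: str) -> list[str]:
    """Split text into one chunk per contract, dropping any preamble."""
    starts = [match.start() for match in CONTRACT_START.finditer(text)]
    # Each chunk runs from one marker to the next (or to the end of the text).
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


document = (
    "page header\n"
    "CONTRACT NUMBER 1\nline a\n"
    "CONTRACT NUMBER 2\nline b\n"
)
chunks = split_contracts(document)
print(len(chunks))
```

Locating the marker positions first and slicing afterwards keeps the marker line inside each chunk, so every piece can be processed as if it were a standalone file.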

## Single Contract example

Let's look at a single contract:

In [None]:
c = Contract('t2_12752')

To copy the file contents to the clipboard so they can be pasted elsewhere (for example, into regex101.com):

In [None]:
pyperclip.copy(c.file_contents)

Four attributes of the contract are extracted: `info`, `bids`, `subcontractors`, and `items`. For example:

In [None]:
c.extract()

In [None]:
c.info.df

In [None]:
c.bids.df

In [None]:
c.subcontractors.df

In [None]:
c.items.df

## Process a single contract

In [None]:
ex = Experiment('t1_2652')
ex.run()

## Process several contracts

We now run the extraction on a small sample of contracts; `num_contracts` sets the sample size:

In [None]:
filepaths = get_contract_filepaths(contract_type=1, num_contracts=5)
ex = Experiment(filepaths)
ex.run()

## Process all type 2 contracts

Let's first process the type 2 contracts, since there are only 168 of them:

In [None]:
filepaths = get_contract_filepaths(contract_type=2)
ex = Experiment(filepaths)
ex.run()

## Process all type 1 contracts

And now the type 1 contracts. These have to be split into 2 batches: processing them in a single batch runs into memory constraints and produces an error about the sheet size when writing to Excel.

In [None]:
filepaths = get_contract_filepaths(contract_type=1)

In [None]:
ex1 = Experiment(filepaths[:10000])
ex1.run()

In [None]:
ex1.write_to_disk()

In [None]:
ex2 = Experiment(filepaths[10000:])
ex2.run()
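The hard-coded split at 10000 above can be generalized to any number of batches. The sketch below is one possible way to do it; `batched` and the placeholder file names are illustrative and not part of this repository. For reference, a single `.xlsx` worksheet holds at most 1,048,576 rows, which is why one very large batch can fail when written to Excel.

```python
from typing import Iterator, Sequence

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single .xlsx worksheet


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder names standing in for the real list of contract file paths.
paths = [f"contract_{i}.txt" for i in range(25)]
batches = list(batched(paths, batch_size=10))
print([len(batch) for batch in batches])
```

Each batch could then be passed to its own `Experiment` and written to disk separately, in the same way as `ex1` and `ex2` above.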