# CalTrans Data Extraction

This project extracts relevant data from text files that were previously converted from PDF files. Because the text files are highly structured, the extraction is done with regular expressions (regex).
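
As a rough illustration of the regex approach, here is a minimal sketch that pulls labeled fields out of structured text. The sample text, field names, and patterns are all hypothetical; the real CalTrans files and the project's own patterns will differ.

```python
import re
from typing import Dict, Optional

# Hypothetical excerpt; the real CalTrans text files are laid out differently.
SAMPLE_TEXT = """\
CONTRACT NUMBER 01-0A1234
BID OPENING DATE 03/15/2021
NUMBER OF BIDDERS 4
"""

# One pattern per field (all hypothetical). re.MULTILINE lets ^ match at the
# start of every line, not only at the start of the whole text.
FIELD_PATTERNS = {
    "contract_number": re.compile(r"^CONTRACT NUMBER\s+(?P<value>[\w-]+)", re.MULTILINE),
    "bid_date": re.compile(r"^BID OPENING DATE\s+(?P<value>\d{2}/\d{2}/\d{4})", re.MULTILINE),
    "num_bidders": re.compile(r"^NUMBER OF BIDDERS\s+(?P<value>\d+)", re.MULTILINE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None if the field is missing."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group("value") if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
```

Keeping one small pattern per field, rather than one large pattern for the whole document, makes each pattern easy to test on its own in a tool such as regex101.com.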

## Setup for Google Colab

In [None]:
def get_github_code():
    # First get GitHub code:
    !wget https://github.com/nesaboz/caltrans_data_extraction/archive/refs/heads/main.zip
    # unzip it
    !unzip main.zip
    # copy all the files to root
    !mv ./caltrans_data_extraction-main/* .
    # delete the empty folder
    !rm -r caltrans_data_extraction-main
    # delete zip file
    !rm main.zip
    # delete main.ipynb since it's confusing to have it in Colab:
    !rm main.ipynb


def get_data_from_google_drive(data_file: str):
    """
    `data_file` can be 'raw' or 'sorted'.
    Downloading and unzipping should take under 30 seconds.

    Each file ID is part of the file's Google Drive share link (make sure the link is created
    with unrestricted access, i.e. the "Anyone with the link" option):
    https://drive.google.com/file/d/<THIS IS THE FILE ID>/view?usp=share_link
    """

    data_files_ids = {
        'raw': '1miDDg2C3MtfdZD4y_GrBFU4FTTW74Lu-',
        'sorted': '1tbJ7vcO6K1NqKW7c_ef1MxxoQgySDMcZ'
        }
        
    if data_file not in data_files_ids:
        print("Use only 'raw' or 'sorted' keywords.")
        return
    print(f"Downloading {data_file} data ...")
    !gdown {data_files_ids[data_file]}
    print("Unzipping ...")
    !unzip {data_file}_data.zip -d {data_file}_data > /dev/null 2>&1
    

def install_packages():
    !pip install pandas==1.5.3 numpy==1.26.4 tqdm==4.66.2 openpyxl==3.1.2 pyperclip==1.8.2
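
The docstring of `get_data_from_google_drive` describes where the file ID sits inside a Google Drive share link. A minimal sketch of pulling the ID out programmatically is shown below; the helper name `extract_drive_file_id` is hypothetical and not part of the project, and the pattern assumes the `/file/d/<FILE ID>/` link format quoted in that docstring.

```python
import re
from typing import Optional

# Matches the "/file/d/<FILE ID>" part of a share link. The allowed characters
# (letters, digits, "_" and "-") are an assumption based on the IDs used above.
DRIVE_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_drive_file_id(share_link: str) -> Optional[str]:
    """Hypothetical helper: return the file ID from a share link, or None."""
    match = DRIVE_FILE_ID_PATTERN.search(share_link)
    return match.group(1) if match else None


# Link assembled from the documented format and the 'raw' file ID listed above.
raw_link = "https://drive.google.com/file/d/1miDDg2C3MtfdZD4y_GrBFU4FTTW74Lu-/view?usp=share_link"
raw_id = extract_drive_file_id(raw_link)
```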


In [None]:
try:
    import google.colab
    IS_COLAB = True
except ModuleNotFoundError:
    IS_COLAB = False


if IS_COLAB: 
    response = input("Do you want to set up everything? ([yes]/no): ").lower().strip()
    if response != "no":
        !rm -r sample_data
        get_github_code()
        get_data_from_google_drive('raw')
        get_data_from_google_drive('sorted')
        install_packages()

# Imports

In [None]:
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

## Sort contracts into types

Some documents are better presented in line-printer format (type 1), others in table format (type 2). In addition, some documents contain multiple contracts, which need to be split:

In [None]:
# Sort the contracts into types (comment out the following line to skip this step):
sort_contracts()
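
To illustrate the splitting step, here is a minimal sketch of cutting one text file into several contracts. It is not the logic inside `sort_contracts()`; it assumes, hypothetically, that every contract begins with a line starting with `CONTRACT NUMBER`, and the name `split_contracts` is made up for this example.

```python
import re
from typing import List

# Hypothetical marker: assume each contract starts with a line that begins
# with "CONTRACT NUMBER". The real files may use a different header.
CONTRACT_START = re.compile(r"^CONTRACT NUMBER\b", re.MULTILINE)


def split_contracts(text: str) -> List[str]:
    """Split text into one chunk per contract, based on the start marker.

    Any text before the first marker is dropped. If no marker is found,
    the whole text is returned as a single chunk.
    """
    starts = [match.start() for match in CONTRACT_START.finditer(text)]
    if not starts:
        return [text]
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


document = (
    "CONTRACT NUMBER 01-111111\nITEM A\n"
    "CONTRACT NUMBER 02-222222\nITEM B\n"
)
chunks = split_contracts(document)
```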

## Single contract example

Let's look at a single contract:

In [None]:
c = Contract('t2_12752')

To copy the file contents to the clipboard so they can be pasted elsewhere (e.g. regex101.com):

In [None]:
pyperclip.copy(c.file_contents)

Four attributes of the contract get extracted: `info`, `bids`, `subcontractors`, and `items`. For example:

In [None]:
c.extract()

In [None]:
c.info.df

In [None]:
c.bids.df

In [None]:
c.subcontractors.df

In [None]:
c.items.df

# Process a single contract

In [None]:
ex = Experiment('t1_2652')
ex.run()

# Process several contracts

We now run extraction on a small sample of contracts defined by `num_contracts`:

In [None]:
filepaths = get_contract_filepaths(contract_type=1, num_contracts=5)
ex = Experiment(filepaths)
ex.run()

# Process all type 2 contracts

Let's first analyze type 2 since there are only 168 of them:

In [None]:
filepaths = get_contract_filepaths(contract_type=2)
ex = Experiment(filepaths)
ex.run()

# Process all type 1 contracts

And now type 1. These have to be split into two batches because of memory constraints; running them as a single batch also produces a sheet-size error in Excel:

In [None]:
filepaths = get_contract_filepaths(contract_type=1)

In [None]:
ex1 = Experiment(filepaths[:10000])
ex1.run()

In [None]:
ex1.write_to_disk()

In [None]:
ex2 = Experiment(filepaths[10000:])
ex2.run()
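
The manual split above can be generalized to any number of batches. The sketch below uses a hypothetical helper, `iter_batches`, that is not part of the project. For context, an Excel worksheet holds at most 1,048,576 rows, so smaller batches may be needed if a single batch still produces too many rows.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Small stand-in for the real list of contract file paths.
example_paths = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
batches = list(iter_batches(example_paths, batch_size=2))

# Possible usage with the project's objects (untested assumption; whether
# successive write_to_disk() calls overwrite each other is not stated here):
# for batch in iter_batches(filepaths, batch_size=10000):
#     ex = Experiment(batch)
#     ex.run()
#     ex.write_to_disk()
```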