# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF files. Because the text files are highly structured, we use regular expressions (regex) for the extraction.
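To make the regex approach concrete, here is a minimal sketch. The sample text, the pattern, and the `extract_bids` helper are all made up for illustration; the real Caltrans files and the project's own patterns will look different.

```python
import re

# Made-up sample text; the real files are laid out differently.
SAMPLE = """\
CONTRACT NUMBER 09-354304
NUMBER OF BIDDERS 3
BID RANK 1  BID TOTAL 1,701,500.00
"""

# Hypothetical pattern: named groups keep the extracted fields self-describing.
BID_PATTERN = re.compile(
    r"BID RANK\s+(?P<rank>\d+)\s+BID TOTAL\s+(?P<total>[\d,]+\.\d{2})"
)


def extract_bids(text):
    """Return one dict per match, with numeric fields converted."""
    return [
        {"rank": int(m["rank"]), "total": float(m["total"].replace(",", ""))}
        for m in BID_PATTERN.finditer(text)
    ]


print(extract_bids(SAMPLE))
```

Named groups make it easy to map each match straight onto a table column, which is why they suit structured text like this.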

## Setup

Install the following packages if you don't have them yet:

In [5]:
# pip install pandas numpy tqdm ipykernel notebook python-dotenv openpyxl pyperclip

In [6]:
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# Note: to print DataFrame fully use:
# pd.set_option('display.max_rows', None)  # to set globally, or use: 
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
#   display(df)

## Classify contracts into types

Contracts fall into two main types, labelled 1 and 2 (identifier prefixes `t1_` and `t2_`):

In [7]:
# check_lineprinter_table_files()

# filepaths_lineprinter = list(RAW_DATA_PATH_LINEPRINTER.glob('*.txt'))
# filepaths_doc = list(RAW_DATA_PATH_DOC.glob('*.txt'))

# sort_contracts(filepaths_lineprinter + filepaths_doc, PROCESSED_DATA_PATH)

In [8]:
contract_types, _ = get_contract_types()
contract_types

| Filename | Tag | Identifier | Contract_Type |
|---|---|---|---|
| 09-354304.pdf_5533 | 5533 | t1_5533 | 1 |
| 02-360704.pdf_5397 | 5397 | t1_5397 | 1 |
| 12-0S9004.pdf_12386 | 12386 | t1_12386 | 1 |
| 02-0H2904.pdf_6987 | 6987 | t1_6987 | 1 |
| 04-0435E4.pdf_2074 | 2074 | t1_2074 | 1 |
| ... | ... | ... | ... |
| doc_4479 | 4479 | t1_4479_00 | 1 |
| doc_4479 | 4479 | t1_4479_01 | 1 |
| doc_4479 | 4479 | t1_4479_02 | 1 |
| doc_6508 | 6508 | t1_6508_00 | 1 |
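
The identifiers in the table above appear to follow the shape `t<type>_<tag>`, with an optional `_<part>` suffix when one source file yields several contracts. A small sketch that splits an identifier into those pieces, assuming that format holds; `parse_identifier` is a hypothetical helper, not part of the project's code:

```python
import re

# Assumed format, inferred only from the identifiers shown in the table:
# t<type>_<tag> with an optional _<part> suffix.
IDENTIFIER_RE = re.compile(r"^t(?P<type>\d)_(?P<tag>\d+)(?:_(?P<part>\d+))?$")


def parse_identifier(identifier):
    """Split an identifier such as 't1_4479_02' into its components."""
    m = IDENTIFIER_RE.match(identifier)
    if m is None:
        raise ValueError(f"Unrecognized identifier: {identifier!r}")
    part = m["part"]
    return {
        "contract_type": int(m["type"]),
        "tag": int(m["tag"]),
        "part": int(part) if part is not None else None,
    }


print(parse_identifier("t1_5533"))
print(parse_identifier("t1_4479_02"))
```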


## Single Contract Analysis

Let's look at a single contract:

In [9]:
c = Contract('t2_12752')
# c = Contract('t1_2774')

To copy file contents to clipboard:

In [10]:
pyperclip.copy(c.file_contents)

Four attributes of the contract get extracted: `info`, `bids`, `subcontractors`, and `items`. For example:

In [11]:
c.extract()

In [12]:
c.bids.df

|   | Identifier | Bid_Rank | A_plus_B_indicator | Bid_Total | Bidder_ID | Bidder_Name | CSLB_Number | Contract_Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | 12752 | 1 | 1 | 1701500.0 | VC1300004803 | KEVIN MACK CONSTRUCTION INC | 471831 | SB PREF CLAIMED |
| 1 | 12752 | 2 | 1 | 1727679.0 | VC2100002296 | SAN PATRICIO CONSTRUCTION | 1025193 | SB PREF CLAIMED |
| 2 | 12752 | 3 | 1 | 1920000.0 | VC2000001958 | MCCUEN CONSTRUCTION, INC. | 880160 | SB PREF CLAIMED |


# Process a single contract

In [13]:
ex = Experiment('t1_2652')
ex.run()

Processing 1/1 ... 
Done processing 1 files.
Writing to disk, please wait ...
Writing Info ...
Writing Bids ...
Writing Subcontractors ...
Writing Items ...
Saved data to: results/04-02-2024-23:13:13:_t1_2652.


# Process several contracts

We now run extraction on a small sample of contracts:

In [14]:
filepaths = get_contract_filepaths(contract_type=1, num_contracts=5, seed=45)
ex = Experiment(filepaths)
ex.run()

Processing 1/5 ... 
Done processing 5 files.
Writing to disk, please wait ...
Writing Info ...
Writing Bids ...
Writing Subcontractors ...
Writing Items ...
Saved data to: results/04-02-2024-23:13:14:_5_contracts_t1.
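
Passing `seed` makes the sample reproducible, so the same five contracts can be re-run after a code change. The sketch below shows one way seeded sampling could work; `sample_filepaths` is a hypothetical stand-in and makes no claim about how `get_contract_filepaths` is implemented.

```python
import random
from pathlib import Path


def sample_filepaths(filepaths, num_contracts=None, seed=None):
    """Return all paths, or a reproducible random subset of them.

    Hypothetical stand-in, not the project's get_contract_filepaths.
    """
    paths = sorted(filepaths)  # fixed order, so the seed fully determines the sample
    if num_contracts is None or num_contracts >= len(paths):
        return paths
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(paths, num_contracts)


all_paths = [Path(f"t1_{i}.txt") for i in range(100)]  # made-up file names
first = sample_filepaths(all_paths, num_contracts=5, seed=45)
second = sample_filepaths(all_paths, num_contracts=5, seed=45)
print(first == second)  # same seed, same sample
```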


# Process all contracts

Let's first process type 2, since it has only 168 contracts and runs quickly:

In [15]:
filepaths = get_contract_filepaths(contract_type=2)
ex = Experiment(filepaths)
ex.run()

Processing 1/168 ... 
{'Contract_Type': 't2', 'Identifier': 't2_3534', 'Error': ValueError('Failed to extract basic info for 3534')}
{'Contract_Type': 't2', 'Identifier': 't2_3555', 'Error': ValueError('Failed to extract basic info for 3555')}
Processing 101/168 ... 
{'Contract_Type': 't2', 'Identifier': 't2_10657', 'Error': ValueError('Failed to extract basic info for 10657')}
{'Contract_Type': 't2', 'Identifier': 't2_7827', 'Error': ValueError('Failed to extract basic info for 7827')}
Done processing 168 files.
Writing to disk, please wait ...
Writing Info ...
Writing Bids ...
Writing Subcontractors ...
Writing Items ...
Writing Errors ...
Saved data to: results/04-02-2024-23:13:14:_168_contracts_t2.
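
The log above shows that a failed extraction is reported as a dict and the run carries on with the remaining files. Here is a minimal sketch of that collect-and-continue pattern. It assumes nothing about how `Experiment.run` is actually written; `run_batch` and `fake_extract` are hypothetical names used only for this example.

```python
def run_batch(identifiers, extract, report_every=100):
    """Run `extract` on each identifier, collecting failures instead of raising."""
    results, errors = [], []
    total = len(identifiers)
    for i, identifier in enumerate(identifiers):
        if i % report_every == 0:
            print(f"Processing {i + 1}/{total} ... ")
        try:
            results.append(extract(identifier))
        except ValueError as exc:
            # Record the failure and keep going with the remaining files.
            error = {"Identifier": identifier, "Error": exc}
            print(error)
            errors.append(error)
    print(f"Done processing {total} files.")
    return results, errors


def fake_extract(identifier):
    """Made-up extractor that fails for one identifier, for demonstration only."""
    if identifier == "t2_3534":
        raise ValueError(f"Failed to extract basic info for {identifier}")
    return {"Identifier": identifier}


results, errors = run_batch(["t2_12752", "t2_3534", "t2_7827"], fake_extract)
```

Keeping the errors in their own list means a handful of malformed files does not abort a long batch, and the failures can be reviewed afterwards.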


And now type 1:

In [16]:
filepaths = get_contract_filepaths(contract_type=1)
ex = Experiment(filepaths)
ex.run()

Processing 1/18662 ... 
Processing 101/18662 ... 
Processing 201/18662 ... 
Processing 301/18662 ... 
Processing 401/18662 ... 
Processing 501/18662 ... 
{'Contract_Type': 't1', 'Identifier': 't1_6245', 'Error': ValueError('Failed to extract basic info for 6245')}
Processing 601/18662 ... 
Processing 701/18662 ... 
Processing 801/18662 ... 
Processing 901/18662 ... 
Processing 1001/18662 ... 
Processing 1101/18662 ... 
Processing 1201/18662 ... 
Processing 1301/18662 ... 
Processing 1401/18662 ... 
{'Contract_Type': 't1', 'Identifier': 't1_7564', 'Error': ValueError('Failed to extract basic info for 7564')}
Processing 1501/18662 ... 
Processing 1601/18662 ... 
Processing 1701/18662 ... 
{'Contract_Type': 't1', 'Identifier': 't1_6844', 'Error': ValueError('Failed to extract basic info for 6844')}
Processing 1801/18662 ... 
Processing 1901/18662 ... 
Processing 2001/18662 ... 
{'Contract_Type': 't1', 'Identifier': 't1_6851', 'Error': ValueError('Failed to extract basic info for 6851')}
...

In [None]:
# Some known errors to exclude:
# exclude_type1 = ['12-0K0234_11520', '12-0R2704_10602', '06-0L8404_3005', '07-0W0404_12652', '07-338004_11638']
# exclude_type2 = ['09-237704_3534']
# exclude = exclude_type1 + exclude_type2
# filepaths = [x for x in filepaths if x.stem not in exclude]
# len(filepaths)
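
The exclusion filter in the cell above compares each path's `stem` against a list of names. A self-contained sketch of the same idea, using made-up paths whose names mirror the ones seen in this notebook:

```python
from pathlib import Path

# Made-up paths; only the naming style is taken from this notebook.
filepaths = [
    Path("data/12-0K0234_11520.txt"),
    Path("data/09-354304.pdf_5533.txt"),
    Path("data/09-237704_3534.txt"),
]
exclude = {"12-0K0234_11520", "09-237704_3534"}  # a set gives fast membership tests

# Path.stem drops only the final suffix, so "09-354304.pdf_5533.txt"
# has the stem "09-354304.pdf_5533".
kept = [p for p in filepaths if p.stem not in exclude]
print([p.name for p in kept])
```

Because `stem` removes only the last extension, entries in the exclusion list must match the file name exactly up to the final `.txt`, including any `.pdf` fragment inside the name.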