# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF. Because the text files are highly structured, we use regular expressions (regex) for the extraction.
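
As a rough illustration of the approach, the sketch below pulls header fields out of a structured text snippet, one compiled pattern per field. The sample text, the field labels, and the `extract_header` helper are assumptions made for this example; the project's real patterns live in its own modules and may look different.

```python
import re

# Hypothetical sample; the real text files may be laid out differently.
SAMPLE = """\
BID OPENING DATE 05/12/09
CONTRACT NUMBER 04-290834
"""

# One pattern per field; re.MULTILINE lets ^ and $ match at every line boundary.
PATTERNS = {
    "Bid_Opening_Date": re.compile(r"^BID OPENING DATE\s+(\d{2}/\d{2}/\d{2})\s*$", re.MULTILINE),
    "Contract_Number": re.compile(r"^CONTRACT NUMBER\s+(\d{2}-\w{6})\s*$", re.MULTILINE),
    "Number_of_Bidders": re.compile(r"^NUMBER OF BIDDERS\s+(\d+)\s*$", re.MULTILINE),
}


def extract_header(text):
    """Return a dict of field -> matched string, or None when the field is absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


print(extract_header(SAMPLE))
```

Returning `None` for a missing field, rather than raising, keeps one malformed file from aborting a batch run; the missing values then show up as empty cells in the resulting table.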

## Setup

Install the following packages if you don't have them yet:

In [2]:
# pip install pandas numpy tqdm ipykernel notebook python-dotenv openpyxl pyperclip

In [3]:
from contract import *
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# Note: to print DataFrame fully use:
# pd.set_option('display.max_rows', None)  # to set globally, or use: 
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
#   display(df)

## Rename the mislabeled contracts

Some files were mislabeled, so we rename them here. A file that cannot be found (for example, because it was already renamed on an earlier run) is reported and skipped:

In [3]:
d = {'07-0W0404..pdf_12652.txt': '07-0W0404.pdf_12652.txt',
     '110427R0.pdf_3052.txt': '01-0A0804.pdf_3052.txt',
     '110427R0.pdf_2981.txt': '01-0A0804.pdf_2981.txt',
     '120712R0.pdf_4254.txt': '11-270804.pdf_4254.txt',
     '08-1N0304..pdf_12819.txt': '08-1N0304.pdf_12819.txt',
     '120928R0.pdf_4565.txt': '04-4S1204.pdf_4565.txt',
     '121016R0.pdf_4699.txt': '04-4S0304.pdf_4699.txt',
     '130220R0.pdf_4863.txt': '03-3E6204.pdf_4863.txt',
     '120717R0.pdf_4252.txt': '05-1A9704.pdf_4252.txt',
     '08-1G2804..pdf_12877.txt': '08-1G2804.pdf_12877.txt'
}

for key, value in d.items():
    try:
        path1 = RAW_DATA_PATH_LINEPRINTER / key
        path2 = RAW_DATA_PATH_LINEPRINTER / value
        path1.rename(path2)

        path1 = RAW_DATA_PATH_TABLE / key
        path2 = RAW_DATA_PATH_TABLE / value
        path1.rename(path2)
    except FileNotFoundError:
        print(f'File not found: {key}')
        continue


File not found: 07-0W0404..pdf_12652.txt
File not found: 110427R0.pdf_3052.txt
File not found: 110427R0.pdf_2981.txt
File not found: 120712R0.pdf_4254.txt
File not found: 08-1N0304..pdf_12819.txt
File not found: 120928R0.pdf_4565.txt
File not found: 121016R0.pdf_4699.txt
File not found: 130220R0.pdf_4863.txt
File not found: 120717R0.pdf_4252.txt
File not found: 08-1G2804..pdf_12877.txt
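
Note that the loop above wraps both renames in a single `try` block, so a missing file in the first directory also skips the rename in the second one. A minimal sketch of a variant that treats each directory independently follows; `rename_in_dirs` is a hypothetical helper, and the temporary folders only stand in for `RAW_DATA_PATH_LINEPRINTER` and `RAW_DATA_PATH_TABLE`.

```python
import tempfile
from pathlib import Path


def rename_in_dirs(mapping, directories):
    """Rename old -> new in every directory; return the (directory, old_name) pairs not found."""
    missing = []
    for directory in directories:
        for old_name, new_name in mapping.items():
            source = directory / old_name
            if source.exists():
                source.rename(directory / new_name)
            else:
                missing.append((directory, old_name))
    return missing


# Demo in throwaway folders standing in for the two raw-data directories.
with tempfile.TemporaryDirectory() as tmp:
    lineprinter = Path(tmp) / "lineprinter"
    table = Path(tmp) / "table"
    lineprinter.mkdir()
    table.mkdir()
    # Only the table folder still holds the mislabeled file.
    (table / "07-0W0404..pdf_12652.txt").touch()

    mapping = {"07-0W0404..pdf_12652.txt": "07-0W0404.pdf_12652.txt"}
    missing = rename_in_dirs(mapping, [lineprinter, table])
    renamed_ok = (table / "07-0W0404.pdf_12652.txt").exists()
    missing_names = [(directory.name, name) for directory, name in missing]

print(renamed_ok, missing_names)
```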


## Classify contracts into types

There are three main contract types (type 3 consists of merged contracts that need to be parsed first):

In [4]:
# sort_contracts()

In [5]:
contract_types, _ = get_contract_types()
contract_types

Filename,Contract_Number,Tag,Identifier,Contract_Type,Relative_Path,Original_Identifier
09-354304.pdf_5533,09-354304,5533,09-354304_5533,1,type1/09-354304_5533.txt,
02-360704.pdf_5397,02-360704,5397,02-360704_5397,1,type1/02-360704_5397.txt,
12-0S9004.pdf_12386,12-0S9004,12386,12-0S9004_12386,1,type1/12-0S9004_12386.txt,
02-0H2904.pdf_6987,02-0H2904,6987,02-0H2904_6987,1,type1/02-0H2904_6987.txt,
04-0435E4.pdf_2074,04-0435E4,2074,04-0435E4_2074,1,type1/04-0435E4_2074.txt,
...,...,...,...,...,...,...
01-488504.pdf_1428,01-488504,1428,01-488504_1428,1,type1/01-488504_1428.txt,
03-1G0004.pdf_11241,03-1G0004,11241,03-1G0004_11241,1,type1/03-1G0004_11241.txt,
03-3M9504.pdf_4004,03-3M9504,4004,03-3M9504_4004,1,type1/03-3M9504_4004.txt,
04-2K7104.pdf_12426,04-2K7104,12426,04-2K7104_12426,1,type1/04-2K7104_12426.txt,
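
The rows above show how the columns relate to the file name: `09-354304.pdf_5533` gives contract number `09-354304`, tag `5533`, and identifier `09-354304_5533`. A minimal sketch of that split is shown below; `parse_filename` and its pattern are hypothetical, written only to illustrate the naming scheme, and are not the project's actual implementation.

```python
import re

# <contract number>.pdf_<tag>, as seen in the Filename column above.
FILENAME_RE = re.compile(r"^(?P<number>\d{2}-\w{6})\.pdf_(?P<tag>\d+)$")


def parse_filename(stem):
    """Split a file stem into contract number, tag, and identifier."""
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {stem!r}")
    number, tag = match["number"], match["tag"]
    return {"Contract_Number": number, "Tag": tag, "Identifier": f"{number}_{tag}"}


print(parse_filename("09-354304.pdf_5533"))
```

Under this pattern a mislabeled name such as `110427R0.pdf_3052` raises a `ValueError`, which is one reason for renaming those files before classification.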


## Single Contract Analysis

Let's look at a single contract:

In [6]:
c = Contract('04-290834_619')

To copy the file contents to the clipboard:

In [7]:
# pyperclip.copy(c.file_contents)

Now we extract:

In [8]:
c.extract()

Extraction populates four attributes of the contract: `info`, `bids`, `subcontractors`, and `items`:

In [9]:
c.info.df

Unnamed: 0,Identifier,Postponed_Contract,Bid_Opening_Date,Contract_Date,Contract_Number,Contract_Code,Number_of_Contract_Items,Total_Number_of_Working_Days,Number_of_Bidders,Engineers_Est,Amount_Over,Amount_Under,Percent_Est_Over,Percent_Est_Under,Contract_Description
0,04-290834_619,1,05/12/09,05/18/09,04-290834,A,179,,,,,,,,CONSTRUCT HOV LANES


In [10]:
c.bids.df

In [11]:
c.subcontractors.df

In [12]:
c.items.df

We can also write all the extracted information to an Excel file:

In [13]:
c.write_to_disk()

Saved to Excel file at: results/single_contracts/04-290834_619.xlsx.


In [14]:
# filepath = RAW_DATA_PATH.parent / 'sample' / '01-0A3804.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A3804.pdf_4353.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0904.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A1204.pdf_11468.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0F4304.pdf_12346.txt'  # issue # 11
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K6104.pdf_12731.txt'  # issue # 9
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K4604.pdf_12040.txt'  # issue # 1
# filename = '01-0H3204.pdf_9871.txt'  # issue # 5
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0404.pdf_10165.txt'  # different format
# filepath = RAW_DATA_PATH_LINEPRINTER / '04-4G6404.pdf_7310.txt'


## Multiple Contract Analysis

We now run extraction on a small sample of contracts:

In [15]:
df_contract_types, _ = get_contract_types()
df_contract_types

Filename,Contract_Number,Tag,Identifier,Contract_Type,Relative_Path,Original_Identifier
09-354304.pdf_5533,09-354304,5533,09-354304_5533,1,type1/09-354304_5533.txt,
02-360704.pdf_5397,02-360704,5397,02-360704_5397,1,type1/02-360704_5397.txt,
12-0S9004.pdf_12386,12-0S9004,12386,12-0S9004_12386,1,type1/12-0S9004_12386.txt,
02-0H2904.pdf_6987,02-0H2904,6987,02-0H2904_6987,1,type1/02-0H2904_6987.txt,
04-0435E4.pdf_2074,04-0435E4,2074,04-0435E4_2074,1,type1/04-0435E4_2074.txt,
...,...,...,...,...,...,...
01-488504.pdf_1428,01-488504,1428,01-488504_1428,1,type1/01-488504_1428.txt,
03-1G0004.pdf_11241,03-1G0004,11241,03-1G0004_11241,1,type1/03-1G0004_11241.txt,
03-3M9504.pdf_4004,03-3M9504,4004,03-3M9504_4004,1,type1/03-3M9504_4004.txt,
04-2K7104.pdf_12426,04-2K7104,12426,04-2K7104_12426,1,type1/04-2K7104_12426.txt,


In [16]:
filepaths = get_some_contracts()
ex = Experiment(filepaths, tag='5')
ex.run()

Processing 1/5 ... 
Done processing 5 files.
Writing to disk ...
Saved data to: results/03-27-2024-16:04:08_tag:_5_type:_1.
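
The output folder name above combines a timestamp, the experiment tag, and the contract type. A minimal sketch of how such a name could be assembled is shown below; `results_dir` is a hypothetical helper and the format string is inferred from the printed path, so the actual code may build it differently.

```python
from datetime import datetime
from pathlib import PurePosixPath


def results_dir(tag, contract_type, now=None, root="results"):
    """Build a folder name like <MM-DD-YYYY-HH:MM:SS>_tag:_<tag>_type:_<type>."""
    now = now or datetime.now()
    stamp = now.strftime("%m-%d-%Y-%H:%M:%S")
    return PurePosixPath(root) / f"{stamp}_tag:_{tag}_type:_{contract_type}"


# Fixed timestamp so the example is reproducible.
example = results_dir("5", 1, now=datetime(2024, 3, 27, 16, 4, 8))
print(example)
```

Colons are not allowed in Windows file names, so a name in this format is only portable across POSIX systems.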


## All TYPE1 contracts

In [17]:
filepaths = get_some_contracts(num_contracts=None)
len(filepaths), filepaths[0]

(8776,
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/type1/09-354304_5533.txt'))

Let's exclude contracts that are known to cause errors:

In [18]:
exclude_type1 = ['12-0K0234_11520', '12-0R2704_10602', '06-0L8404_3005', '07-0W0404_12652', '07-338004_11638']
exclude_type2 = ['09-237704_3534']
exclude = exclude_type1 + exclude_type2
filepaths = [x for x in filepaths if x.stem not in exclude]
len(filepaths)

8772
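
The count drops from 8776 to 8772, so only four of the six listed identifiers were actually among the paths; the filter silently ignores the rest. A minimal sketch of the same stem-based filter, extended to report which identifiers matched nothing, is shown below. The paths here are a small hypothetical sample rather than the real file list.

```python
from pathlib import PurePosixPath

# A set gives constant-time membership checks.
exclude = {"12-0K0234_11520", "09-237704_3534"}

# Hypothetical paths; only the stem (file name without ".txt") is compared.
filepaths = [
    PurePosixPath("data/type1/09-354304_5533.txt"),
    PurePosixPath("data/type1/12-0K0234_11520.txt"),
    PurePosixPath("data/type1/02-360704_5397.txt"),
]

kept = [path for path in filepaths if path.stem not in exclude]

# Identifiers in the exclusion list that did not match any path.
absent = sorted(exclude - {path.stem for path in filepaths})

print(len(kept), absent)
```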

In [19]:
ex = Experiment(filepaths, tag='all')
ex.run()

Processing 1/8772 ... 
Processing 101/8772 ... 
Processing 201/8772 ... 
Processing 301/8772 ... 
Processing 401/8772 ... 
Processing 501/8772 ... 
Processing 601/8772 ... 
Processing 701/8772 ... 
Processing 801/8772 ... 
Processing 901/8772 ... 
Processing 1001/8772 ... 
Processing 1101/8772 ... 
Processing 1201/8772 ... 
Processing 1301/8772 ... 
Processing 1401/8772 ... 
Processing 1501/8772 ... 
Processing 1601/8772 ... 
Processing 1701/8772 ... 
Processing 1801/8772 ... 
Processing 1901/8772 ... 
Processing 2001/8772 ... 
Processing 2101/8772 ... 
Processing 2201/8772 ... 
Processing 2301/8772 ... 
Processing 2401/8772 ... 
Processing 2501/8772 ... 
Processing 2601/8772 ... 
Processing 2701/8772 ... 
Processing 2801/8772 ... 
Processing 2901/8772 ... 
Processing 3001/8772 ... 
Processing 3101/8772 ... 
Processing 3201/8772 ... 
Processing 3301/8772 ... 
Processing 3401/8772 ... 
Processing 3501/8772 ... 
Processing 3601/8772 ... 
Processing 3701/8772 ... 
Processing 3801/8772 ...
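
The log above reports progress once every 100 files. A minimal sketch of a batch loop with that reporting cadence is shown below; it also collects per-file failures instead of stopping, which is one way to find identifiers for an exclusion list. `run_all`, `report_every`, and `double_unless_seven` are hypothetical names, and this is not the actual `Experiment.run` implementation.

```python
def run_all(items, process, report_every=100):
    """Apply process() to each item, printing progress and collecting failures."""
    results, failures = [], []
    total = len(items)
    for index, item in enumerate(items):
        if index % report_every == 0:
            print(f"Processing {index + 1}/{total} ... ")
        try:
            results.append(process(item))
        except Exception as error:
            # Record the failure and continue with the next item.
            failures.append((item, error))
    print(f"Done processing {total} files.")
    return results, failures


def double_unless_seven(value):
    """Stand-in for per-file extraction that fails on one input."""
    if value == 7:
        raise ValueError("cannot process 7")
    return value * 2


results, failures = run_all(list(range(250)), double_unless_seven)
failed_items = [item for item, _ in failures]
```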

## Process TYPE3 contracts

See `type3_contracts.ipynb` first for how these contracts are defined.

In [20]:
filepaths = list(TYPE3_PATH.glob('*.txt'))
ex = Experiment(filepaths, tag='all', contract_type=ContractType.TYPE3)
ex.run()

Processing 1/75 ... 
Done processing 75 files.
Writing to disk ...
Saved data to: results/03-27-2024-16:15:07_tag:_all_type:_3.


## Process one TYPE2 contract

In [4]:
c = Contract('02-1J1704_12603', ContractType.TYPE2)

In [22]:
# pyperclip.copy(c.file_contents)

In [5]:
c

<contract.Contract at 0x1121a3c50>