# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF files. Because the text files have a fairly regular structure, we use regular expressions (regex) for the extraction.
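
To illustrate the regex approach, here is a minimal sketch. The sample line and the `HEADER_RE` pattern are made up for illustration; they are not copied from a Caltrans file or from this project's code.

```python
import re

# Hypothetical line in the style of a fixed-layout report (not real Caltrans data).
sample = "BID OPENING DATE 11/03/15     CONTRACT NUMBER 04-4G6404"

# Named groups keep the pattern readable and make the result self-describing.
HEADER_RE = re.compile(
    r"BID OPENING DATE\s+(?P<bid_opening_date>\d{2}/\d{2}/\d{2})\s+"
    r"CONTRACT NUMBER\s+(?P<contract_number>\d{2}-\w{6})"
)

match = HEADER_RE.search(sample)
# Fall back to an empty dict when the line does not match the expected layout.
fields = match.groupdict() if match else {}
print(fields)
```

Returning an empty result on a failed match, rather than raising, makes it easier to collect the files whose layout deviates from the expected one.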

## Setup

Install the following packages if you don't have them yet:

In [1]:
# pip install pandas numpy tqdm ipykernel notebook python-dotenv openpyxl

In [2]:
from contract import *
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# pd.set_option('display.max_rows', None)  # optional to see all rows in DataFrames

## Rename the mislabeled contracts

Some files were mislabeled, so we rename them here:

In [3]:
d = {'07-0W0404..pdf_12652.txt': '07-0W0404.pdf_12652.txt',
     '110427R0.pdf_3052.txt': '01-0A0804.pdf_3052.txt',
     '110427R0.pdf_2981.txt': '01-0A0804.pdf_2981.txt',
     '120712R0.pdf_4254.txt': '11-270804.pdf_4254.txt',
     '08-1N0304..pdf_12819.txt': '08-1N0304.pdf_12819.txt',
     '120928R0.pdf_4565.txt': '04-4S1204.pdf_4565.txt',
     '121016R0.pdf_4699.txt': '04-4S0304.pdf_4699.txt',
     '130220R0.pdf_4863.txt': '03-3E6204.pdf_4863.txt',
     '120717R0.pdf_4252.txt': '05-1A9704.pdf_4252.txt',
     '08-1G2804..pdf_12877.txt': '08-1G2804.pdf_12877.txt'
}

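for key, value in d.items():
    try:
        # Rename the file in the lineprinter directory first
        path1 = RAW_DATA_PATH_LINEPRINTER / key
        path2 = RAW_DATA_PATH_LINEPRINTER / value
        path1.rename(path2)

        # Then rename its counterpart in the table directory
        path1 = RAW_DATA_PATH_TABLE / key
        path2 = RAW_DATA_PATH_TABLE / value
        path1.rename(path2)
    except FileNotFoundError:
        # Note: a missing lineprinter file also skips the table rename
        print(f'File not found: {key}')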


File not found: 07-0W0404..pdf_12652.txt
File not found: 110427R0.pdf_3052.txt
File not found: 110427R0.pdf_2981.txt
File not found: 120712R0.pdf_4254.txt
File not found: 08-1N0304..pdf_12819.txt
File not found: 120928R0.pdf_4565.txt
File not found: 121016R0.pdf_4699.txt
File not found: 130220R0.pdf_4863.txt
File not found: 120717R0.pdf_4252.txt
File not found: 08-1G2804..pdf_12877.txt
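
Every rename above reports `File not found`. The same message would appear if the files had already been renamed by an earlier run; the notebook does not say which case applies. As a rough sketch, a helper can distinguish the two cases. `rename_if_present` is a hypothetical name, not part of this project, and the demo runs in a temporary directory so no real data is touched.

```python
from pathlib import Path
import tempfile


def rename_if_present(directory: Path, old: str, new: str) -> str:
    """Rename directory/old to directory/new and report what happened."""
    src, dst = directory / old, directory / new
    if src.exists():
        src.rename(dst)
        return "renamed"
    if dst.exists():
        # The target is already there, so a previous run probably did the work.
        return "already renamed"
    return "missing"


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "07-0W0404..pdf_12652.txt").write_text("dummy")
    first = rename_if_present(tmp_dir, "07-0W0404..pdf_12652.txt", "07-0W0404.pdf_12652.txt")
    second = rename_if_present(tmp_dir, "07-0W0404..pdf_12652.txt", "07-0W0404.pdf_12652.txt")
    third = rename_if_present(tmp_dir, "no-such-file.txt", "whatever.txt")

print(first, second, third)
```

Calling the helper once per directory would also avoid skipping the table rename when only the lineprinter file is missing.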


## Classify

There are two types of contract, so we first classify the files into two groups, type1 and type2:

In [4]:
# save_contract_types()

In [5]:
contract_types, _ = get_contract_types()
contract_types

Filename,Relative_Path,Contract_Number,Tag,Identifier,Contract_Type
09-354304.pdf_5533,lineprinter_txt_files/09-354304.pdf_5533.txt,09-354304,5533,09-354304_5533,1
02-360704.pdf_5397,lineprinter_txt_files/02-360704.pdf_5397.txt,02-360704,5397,02-360704_5397,1
12-0S9004.pdf_12386,lineprinter_txt_files/12-0S9004.pdf_12386.txt,12-0S9004,12386,12-0S9004_12386,1
02-0H2904.pdf_6987,lineprinter_txt_files/02-0H2904.pdf_6987.txt,02-0H2904,6987,02-0H2904_6987,1
04-0435E4.pdf_2074,lineprinter_txt_files/04-0435E4.pdf_2074.txt,04-0435E4,2074,04-0435E4_2074,1
...,...,...,...,...,...
01-488504.pdf_1428,lineprinter_txt_files/01-488504.pdf_1428.txt,01-488504,1428,01-488504_1428,1
03-1G0004.pdf_11241,lineprinter_txt_files/03-1G0004.pdf_11241.txt,03-1G0004,11241,03-1G0004_11241,1
03-3M9504.pdf_4004,lineprinter_txt_files/03-3M9504.pdf_4004.txt,03-3M9504,4004,03-3M9504_4004,1
04-2K7104.pdf_12426,lineprinter_txt_files/04-2K7104.pdf_12426.txt,04-2K7104,12426,04-2K7104_12426,1
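
The table suggests that `Contract_Number`, `Tag`, and `Identifier` are all derived from the filename. A minimal sketch of that relationship, assuming every filename stem follows the `<contract number>.pdf_<tag>` shape seen in the rows above (`parse_filename` is a hypothetical helper, not the project's own function):

```python
def parse_filename(stem: str) -> dict:
    """Split a stem such as '09-354304.pdf_5533' into the table's columns."""
    contract_number, sep, tag = stem.partition(".pdf_")
    if not sep or not tag:
        # Fail loudly on names that do not follow the assumed pattern.
        raise ValueError(f"Unexpected filename: {stem!r}")
    return {
        "Contract_Number": contract_number,
        "Tag": tag,
        "Identifier": f"{contract_number}_{tag}",
    }


row = parse_filename("09-354304.pdf_5533")
print(row)
```

The `Identifier` produced this way has the same form as the string later passed to `Contract(...)`.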


## Single Contract

Let's look at a single contract:

In [6]:
c = Contract('04-4G6404_7310')

In [7]:
c.extract()

Four attributes of the contract are extracted: info, bids, subcontractors, and items:

In [8]:
c.info.df

,Identifier,Postponed_Contract,Bid_Opening_Date,Contract_Date,Contract_Number,Contract_Code,Number_of_Contract_Items,Total_Number_of_Working_Days,Number_of_Bidders,Engineers_Est,Amount_Over,Amount_Under,Percent_Est_Over,Percent_Est_Under,Contract_Description
0,04-4G6404_7310,0,11/03/15,11/05/15,04-4G6404,D,121,90,8,4178805.0,,791668.0,,18.94,ROUTES 84/280 SEPARATION CONSTRUCT
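
One way to sanity-check an extracted row is to recompute a derived field from the others. The sketch below assumes that `Percent_Est_Under` is `Amount_Under` expressed as a percentage of `Engineers_Est`; that relationship is inferred from the values above, not taken from Caltrans documentation, and whether the source rounds or truncates is unknown, hence the tolerance.

```python
# Values copied from the info row above.
engineers_est = 4178805.0
amount_under = 791668.0
percent_est_under = 18.94

# Assumed relationship: percent under = 100 * amount under / engineer's estimate.
recomputed = 100 * amount_under / engineers_est

# Allow a small tolerance because the report shows only two decimals.
consistent = abs(recomputed - percent_est_under) < 0.01
print(f"recomputed={recomputed:.4f} consistent={consistent}")
```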


In [9]:
c.bids.df

,Identifier,Bid_Rank,A_plus_B_indicator,Bid_Total,Bidder_ID,Bidder_Name,Bidder_Phone,Extra,Contract_Notes,CSLB_Number,Has_Third_Row
0,04-4G6404_7310,1,1,3657137.0,3,GHILOTTI CONSTRUCTION CO. INC.,707 585-1221,,,644515,0
1,04-4G6404_7310,2,1,4039158.0,8,"GORDON N. BALL, INC.",925 838-5675,,,710807,0
2,04-4G6404_7310,3,1,4189774.0,4,"RGW CONSTRUCTION, INC.",925 606-2400,,,591940,0
3,04-4G6404_7310,4,1,4298995.26,1,GRANITE CONSTRUCTION,408 327-7013,,,89,0
4,04-4G6404_7310,5,1,4519443.0,5,GRANITE ROCK COMPANY,408 574-1400,,,22,0
5,04-4G6404_7310,6,1,4609762.0,2,"DISNEY CONSTRUCTION, INC.",650 259-9545,,,866974,0
6,04-4G6404_7310,7,1,4768691.0,6,VALENTINE CORPORATION,415 453-3732,,,229225,0
7,04-4G6404_7310,8,1,5255322.0,7,BUGLER CONSTRUCTION,925 416-0700,,,740863,0


In [10]:
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
display(c.subcontractors.df)

,Identifier,Bidder_ID,Subcontractor_Name,Subcontracted_Line_Item,Bidder_ID1,Subcontractor_Name1,Subcontracted_Line_Item1,Item_Numbers,Percent,Subcontractor_License_Number
0,04-4G6404_7310,03,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 77 (100%),03,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 77 (100%),77,100,906815
1,04-4G6404_7310,03,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 95 (100%),,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 95 (100%),95,100,906815
2,04-4G6404_7310,03,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 96 (100%),,"AVAR CONSTRUCTION SYSTEMS, INC.",ITEM 96 (100%),96,100,906815
3,04-4G6404_7310,03,CAMBLIN STEEL SERVICES INC.,ITEM 105 (21%),,CAMBLIN STEEL SERVICES INC.,ITEM 105 (21%),105,21,218839
4,04-4G6404_7310,03,CAMBLIN STEEL SERVICES INC.,ITEM 106 (15%),,CAMBLIN STEEL SERVICES INC.,ITEM 106 (15%),106,15,218839
...,...,...,...,...,...,...,...,...,...,...
431,04-4G6404_7310,07,PACIFIC COAST DRILLING,DRILL HOLE (HORZ DRAIN),,PACIFIC COAST DRILLING,DRILL HOLE (HORZ DRAIN),,,539855
432,04-4G6404_7310,07,PACIFIC COAST DRILLING,LEAN CONC BACKFILL,,PACIFIC COAST DRILLING,LEAN CONC BACKFILL,,,539855
433,04-4G6404_7310,07,PACIFIC COAST DRILLING,"42"" DRILLED HOLE",,PACIFIC COAST DRILLING,"42"" DRILLED HOLE",,,539855
434,04-4G6404_7310,07,R AN B PROTETIVE COATINGS,CLEAN AND PAINT STEEL SOLDIER PILING,,R AN B PROTETIVE COATINGS,CLEAN AND PAINT STEEL SOLDIER PILING,,,na
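
In the rows above, `Item_Numbers` and `Percent` are filled only when `Subcontracted_Line_Item` has the form `ITEM <n> (<p>%)`; free-text descriptions leave both empty. A minimal sketch of such a split, assuming one item per line (`ITEM_RE` and `parse_line_item` are hypothetical, and the project's actual pattern may cover more variants):

```python
import re

# Matches e.g. "ITEM 77 (100%)"; anything else is treated as free text.
ITEM_RE = re.compile(r"^ITEM\s+(?P<item>\d+)\s+\((?P<percent>\d+(?:\.\d+)?)%\)$")


def parse_line_item(text: str):
    """Return (item_number, percent) as strings, or (None, None) for free text."""
    m = ITEM_RE.match(text.strip())
    if m is None:
        return None, None
    return m.group("item"), m.group("percent")


structured = parse_line_item("ITEM 105 (21%)")
free_text = parse_line_item("DRILL HOLE (HORZ DRAIN)")
print(structured, free_text)
```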


In [11]:
c.items.df

,Identifier,Item_Number,Extra1,Item_Code,Item_Description,Extra2,Item_Dollar_Amount
0,04-4G6404_7310,1,,070030,LEAD COMPLIANCE PLAN,"LS LUMP SUM 1,500.00",1500.00
1,04-4G6404_7310,2,,120090,CONSTRUCTION AREA SIGNS,"LS LUMP SUM 5,000.00",5000.00
2,04-4G6404_7310,3,,120100,TRAFFIC CONTROL SYSTEM,"LS LUMP SUM 35,000.00",35000.00
3,04-4G6404_7310,4,,120159,TEMPORARY TRAFFIC STRIPE (PAINT),LF 670 1.80,1206.00
4,04-4G6404_7310,5,,120165,CHANNELIZER (SURFACE MOUNTED),EA 12 45.00,540.00
...,...,...,...,...,...,...,...
116,04-4G6404_7310,117,,995100,WATER METER CHARGES,"LS LUMP SUM 6,600.00",6600.00
117,04-4G6404_7310,118,,995200,IRRIGATION WATER SERVICE CHARGES,"LS LUMP SUM 1,000.00",1000.00
118,04-4G6404_7310,119,,000003,ITEM DELETED PER ADDENDUM,LS LUMP SUM .00,0.00
119,04-4G6404_7310,120,,208424,"1 1/4"" BACKFLOW PREVENTER ASSEMBLY","EA 1 4,000.00",4000.00
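
The `Extra2` column appears to hold the unit, the quantity (or `LUMP SUM`), and the unit price, whose product matches `Item_Dollar_Amount` in the rows shown. The sketch below parses those three parts and recomputes the amount. The pattern is an assumption based only on the rows above (`EXTRA2_RE` and `parse_extra2` are hypothetical names), so other unit or quantity formats may need extra cases.

```python
import re
from decimal import Decimal

# unit, then either "LUMP SUM" or a numeric quantity, then a price with two decimals
EXTRA2_RE = re.compile(
    r"^(?P<unit>[A-Z]+)\s+"
    r"(?P<quantity>LUMP SUM|[\d,]*\.?\d+)\s+"
    r"(?P<unit_price>[\d,]*\.\d{2})$"
)


def to_decimal(text: str) -> Decimal:
    """Convert '4,000.00' or '.00' to a Decimal, dropping thousands separators."""
    return Decimal(text.replace(",", ""))


def parse_extra2(text: str):
    m = EXTRA2_RE.match(text.strip())
    if m is None:
        return None
    unit_price = to_decimal(m.group("unit_price"))
    # Treat a lump sum as a quantity of one.
    if m.group("quantity") == "LUMP SUM":
        quantity = Decimal(1)
    else:
        quantity = to_decimal(m.group("quantity"))
    return {
        "unit": m.group("unit"),
        "quantity": quantity,
        "unit_price": unit_price,
        "amount": quantity * unit_price,
    }


stripe = parse_extra2("LF 670 1.80")
lump = parse_extra2("LS LUMP SUM 1,500.00")
print(stripe["amount"], lump["amount"])
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact when the recomputed amount is compared with the extracted one.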


We can also write all the extracted information to an Excel file:

In [12]:
c.write_to_excel()

Saved to Excel file at: results/single_contracts/04-4G6404_7310.xlsx.


## One sample study

In [13]:
# filepath = RAW_DATA_PATH.parent / 'sample' / '01-0A3804.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A3804.pdf_4353.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0904.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A1204.pdf_11468.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0F4304.pdf_12346.txt'  # issue # 11
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K6104.pdf_12731.txt'  # issue # 9
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K4604.pdf_12040.txt'  # issue # 1
# filename = '01-0H3204.pdf_9871.txt'  # issue # 5
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0404.pdf_10165.txt'  # different format
# filepath = RAW_DATA_PATH_LINEPRINTER / '04-4G6404.pdf_7310.txt'


In [14]:
contract = Contract('10-0U8304_2834')


In [15]:
contract.extract()

In [16]:
# pyperclip.copy(contract.file_contents)

## Extracting all the data from several contracts

We now run the extraction on a small sample. First we load the contract types, then we pick the file paths:

In [17]:
df_contract_types, _ = get_contract_types()
df_contract_types

Filename,Relative_Path,Contract_Number,Tag,Identifier,Contract_Type
09-354304.pdf_5533,lineprinter_txt_files/09-354304.pdf_5533.txt,09-354304,5533,09-354304_5533,1
02-360704.pdf_5397,lineprinter_txt_files/02-360704.pdf_5397.txt,02-360704,5397,02-360704_5397,1
12-0S9004.pdf_12386,lineprinter_txt_files/12-0S9004.pdf_12386.txt,12-0S9004,12386,12-0S9004_12386,1
02-0H2904.pdf_6987,lineprinter_txt_files/02-0H2904.pdf_6987.txt,02-0H2904,6987,02-0H2904_6987,1
04-0435E4.pdf_2074,lineprinter_txt_files/04-0435E4.pdf_2074.txt,04-0435E4,2074,04-0435E4_2074,1
...,...,...,...,...,...
01-488504.pdf_1428,lineprinter_txt_files/01-488504.pdf_1428.txt,01-488504,1428,01-488504_1428,1
03-1G0004.pdf_11241,lineprinter_txt_files/03-1G0004.pdf_11241.txt,03-1G0004,11241,03-1G0004_11241,1
03-3M9504.pdf_4004,lineprinter_txt_files/03-3M9504.pdf_4004.txt,03-3M9504,4004,03-3M9504_4004,1
04-2K7104.pdf_12426,lineprinter_txt_files/04-2K7104.pdf_12426.txt,04-2K7104,12426,04-2K7104_12426,1


In [18]:
filepaths = get_some_contracts()
filepaths

[PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter_txt_files/03-4M4804.pdf_4764.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter_txt_files/02-0J1404.pdf_11341.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter_txt_files/11-408004.pdf_7191.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter_txt_files/12-0M4804.pdf_10206.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter_txt_files/12-0K3704.pdf_2018.txt')]

In [19]:
ex = Experiment(filepaths, tag='5')

In [20]:
ex.run()

Processing file 1/5


In [21]:
ex.write_to_disk()

Saved data to: results/03-26-2024-17:15:25_tag:_5_type:_1.


## Extracting all the data from type 1 contracts

In [22]:
filepaths = get_some_contracts(num_contracts=None)

In [23]:
len(filepaths)

8809

In [24]:
ex = Experiment(filepaths, tag='all')
ex.run()


Processing file 1/8809
Processing file 101/8809
Processing file 201/8809
Processing file 301/8809
Processing file 401/8809
Processing file 501/8809
Processing file 601/8809
Processing file 701/8809
Processing file 801/8809
Processing file 901/8809
Processing file 1001/8809
Processing file 1101/8809
Processing file 1201/8809
Processing file 1301/8809
Processing file 1401/8809
Processing file 1501/8809
Processing file 1601/8809
Processing file 1701/8809
Processing file 1801/8809
Processing file 1901/8809
Processing file 2001/8809


In [None]:
ex.write_to_disk()

KeyboardInterrupt: 
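
The full run above appears to have been interrupted before `write_to_disk()` completed. As a rough sketch of one way to keep partial results in that situation, the loop below records per-file failures and returns whatever was processed when interrupted. `run_batch` and `fake_process` are hypothetical names for illustration; they are not part of the `Experiment` class.

```python
def run_batch(paths, process, report_every=100):
    """Apply `process` to each path and return (results, failures)."""
    results, failures = {}, {}
    total = len(paths)
    try:
        for i, path in enumerate(paths):
            # Report progress on the 1st, 101st, 201st, ... file.
            if i % report_every == 0:
                print(f"Processing file {i + 1}/{total}")
            try:
                results[path] = process(path)
            except Exception as exc:
                # Record the failure and continue with the next file.
                failures[path] = repr(exc)
    except KeyboardInterrupt:
        done = len(results) + len(failures)
        print(f"Interrupted after {done} files; returning partial results")
    return results, failures


# Usage with a stand-in processing function.
def fake_process(path):
    if path.endswith("bad.txt"):
        raise ValueError("could not parse")
    return len(path)


results, failures = run_batch(["a.txt", "bad.txt", "ccc.txt"], fake_process)
print(results, list(failures))
```

With this shape, the partial `results` can still be written to disk after an interruption, and `failures` lists the files that need a closer look.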