# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF files. Because the text files are highly structured, we use regular expressions (regex) for the extraction.
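
As a rough illustration of the regex approach, here is a minimal sketch. The `SAMPLE` line and the `HEADER_RE` pattern are hypothetical: they only imitate the kind of fixed-layout header line such files tend to contain, and the project's actual patterns (in `contract.py`) may look quite different.

```python
import re

# Hypothetical header line; the layout of the real text files may differ.
SAMPLE = "CONTRACT NUMBER 04-294944    BID OPENING DATE 07/08/14    NUMBER OF BIDDERS 6"

# Named groups keep the pattern readable and map directly onto column names.
HEADER_RE = re.compile(
    r"CONTRACT NUMBER\s+(?P<Contract_Number>\d{2}-\w{6})\s+"
    r"BID OPENING DATE\s+(?P<Bid_Opening_Date>\d{2}/\d{2}/\d{2})\s+"
    r"NUMBER OF BIDDERS\s+(?P<Number_of_Bidders>\d+)"
)

match = HEADER_RE.search(SAMPLE)
# groupdict() returns one string per named group; an empty dict signals no match.
fields = match.groupdict() if match else {}
print(fields)
```

Because the groups are named after the target columns, the resulting dictionary can be fed straight into a DataFrame row.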

## Setup

Install the following packages if you don't have them yet:

In [10]:
# pip install pandas numpy tqdm ipykernel notebook python-dotenv openpyxl pyperclip

In [11]:
from contract import *
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# Note: to print DataFrame fully use:
# pd.set_option('display.max_rows', None)  # to set globally, or use: 
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
#   display(df)

## Rename the mislabeled contracts

There are some files that were mistakenly labeled, so here we rename them:

In [12]:
d = {'07-0W0404..pdf_12652.txt': '07-0W0404.pdf_12652.txt',
     '110427R0.pdf_3052.txt': '01-0A0804.pdf_3052.txt',
     '110427R0.pdf_2981.txt': '01-0A0804.pdf_2981.txt',
     '120712R0.pdf_4254.txt': '11-270804.pdf_4254.txt',
     '08-1N0304..pdf_12819.txt': '08-1N0304.pdf_12819.txt',
     '120928R0.pdf_4565.txt': '04-4S1204.pdf_4565.txt',
     '121016R0.pdf_4699.txt': '04-4S0304.pdf_4699.txt',
     '130220R0.pdf_4863.txt': '03-3E6204.pdf_4863.txt',
     '120717R0.pdf_4252.txt': '05-1A9704.pdf_4252.txt',
     '08-1G2804..pdf_12877.txt': '08-1G2804.pdf_12877.txt'
}

for key, value in d.items():
    try:
        path1 = RAW_DATA_PATH_LINEPRINTER / key
        path2 = RAW_DATA_PATH_LINEPRINTER / value
        path1.rename(path2)

        path1 = RAW_DATA_PATH_TABLE / key
        path2 = RAW_DATA_PATH_TABLE / value
        path1.rename(path2)
    except FileNotFoundError:
        print(f'File not found: {key}')
        continue


File not found: 07-0W0404..pdf_12652.txt
File not found: 110427R0.pdf_3052.txt
File not found: 110427R0.pdf_2981.txt
File not found: 120712R0.pdf_4254.txt
File not found: 08-1N0304..pdf_12819.txt
File not found: 120928R0.pdf_4565.txt
File not found: 121016R0.pdf_4699.txt
File not found: 130220R0.pdf_4863.txt
File not found: 120717R0.pdf_4252.txt
File not found: 08-1G2804..pdf_12877.txt
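
In the cell above, both renames share one `try` block, so a missing file in the lineprinter folder also skips the rename in the table folder, and rerunning the cell after a successful rename reports every file as not found. One possible alternative, sketched here under the assumption that each folder should be handled independently, is a helper that stays quiet when the target name already exists. The name `rename_in_folders` and the temporary demo folders are hypothetical.

```python
import tempfile
from pathlib import Path

def rename_in_folders(mapping, folders):
    """Rename old -> new in every folder; return the sources that were truly missing."""
    missing = []
    for old, new in mapping.items():
        for folder in folders:
            src, dst = folder / old, folder / new
            if src.exists():
                src.rename(dst)
            elif not dst.exists():
                # Neither the old nor the new name is present in this folder.
                missing.append(src)
    return missing

# Demo on throwaway directories standing in for the two raw-data folders.
with tempfile.TemporaryDirectory() as tmp:
    lineprinter, table = Path(tmp) / "lineprinter", Path(tmp) / "table"
    for folder in (lineprinter, table):
        folder.mkdir()
    # Only the lineprinter folder contains the mislabeled file.
    (lineprinter / "07-0W0404..pdf_12652.txt").touch()
    mapping = {"07-0W0404..pdf_12652.txt": "07-0W0404.pdf_12652.txt"}

    first_run = rename_in_folders(mapping, [lineprinter, table])
    second_run = rename_in_folders(mapping, [lineprinter, table])  # safe to repeat
    renamed = (lineprinter / "07-0W0404.pdf_12652.txt").exists()
    print(renamed, [p.parent.name for p in first_run])
```

On the second run the lineprinter file is no longer reported, because the corrected name is already in place; only the folder that never had the file is listed.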


## Classify

There are two types of contracts. We first classify them into two groups, type 1 and type 2:

In [13]:
# save_contract_types()

In [14]:
contract_types, _ = get_contract_types()
contract_types

Filename,Relative_Path,Contract_Number,Tag,Identifier,Contract_Type
09-354304.pdf_5533,lineprinter/09-354304.pdf_5533.txt,09-354304,5533,09-354304_5533,1
02-360704.pdf_5397,lineprinter/02-360704.pdf_5397.txt,02-360704,5397,02-360704_5397,1
12-0S9004.pdf_12386,lineprinter/12-0S9004.pdf_12386.txt,12-0S9004,12386,12-0S9004_12386,1
02-0H2904.pdf_6987,lineprinter/02-0H2904.pdf_6987.txt,02-0H2904,6987,02-0H2904_6987,1
04-0435E4.pdf_2074,lineprinter/04-0435E4.pdf_2074.txt,04-0435E4,2074,04-0435E4_2074,1
...,...,...,...,...,...
01-488504.pdf_1428,lineprinter/01-488504.pdf_1428.txt,01-488504,1428,01-488504_1428,1
03-1G0004.pdf_11241,lineprinter/03-1G0004.pdf_11241.txt,03-1G0004,11241,03-1G0004_11241,1
03-3M9504.pdf_4004,lineprinter/03-3M9504.pdf_4004.txt,03-3M9504,4004,03-3M9504_4004,1
04-2K7104.pdf_12426,lineprinter/04-2K7104.pdf_12426.txt,04-2K7104,12426,04-2K7104_12426,1
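
The `Contract_Number`, `Tag`, and `Identifier` columns in the table above all follow from the file name. A minimal sketch of that derivation is shown below; `parse_filename` and `NAME_RE` are hypothetical names, and the pattern is inferred only from the rows displayed here, so the project's own parsing may be more permissive.

```python
import re
from typing import NamedTuple

class ParsedName(NamedTuple):
    contract_number: str
    tag: str
    identifier: str

# Assumed shape: two digits, a dash, six alphanumerics, then ".pdf_<tag>" and an optional ".txt".
NAME_RE = re.compile(r"^(?P<number>\d{2}-\w{6})\.pdf_(?P<tag>\d+)(?:\.txt)?$")

def parse_filename(filename: str) -> ParsedName:
    """Split a file name into contract number, tag, and the combined identifier."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    number, tag = m["number"], m["tag"]
    return ParsedName(number, tag, f"{number}_{tag}")

print(parse_filename("09-354304.pdf_5533.txt"))
```

A strict pattern like this also rejects the mislabeled names from the renaming step (for example `110427R0.pdf_3052.txt` or names containing `..pdf`), which can help spot such files early.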


## Single Contract

Let's look at a single contract:

In [15]:
c = Contract('04-294944_6026')

To copy the file contents to the clipboard:

In [16]:
# pyperclip.copy(c.file_contents)

Now we extract:

In [17]:
c.extract()

Four attributes of the contract get extracted: info, bids, subcontractors, and items:

In [18]:
c.info.df

,Identifier,Postponed_Contract,Bid_Opening_Date,Contract_Date,Contract_Number,Contract_Code,Number_of_Contract_Items,Total_Number_of_Working_Days,Number_of_Bidders,Engineers_Est,Amount_Over,Amount_Under,Percent_Est_Over,Percent_Est_Under,Contract_Description
0,04-294944_6026,0,07/08/14,07/10/14,04-294944,B,78,1000,6,2524266.25,,165736.35,,6.57,ROADWAY PLANTING


In [19]:
c.bids.df

,Identifier,Bid_Rank,A_plus_B_indicator,Bid_Total,Bidder_ID,Bidder_Name,Bidder_Phone,Extra,Contract_Notes,CSLB_Number,Has_Third_Row
0,04-294944_6026,1,0,2358529.9,5,MARINA LANDSCAPE INC,714 939-6600,NSB PREF CLAIMED,,492862,0
1,04-294944_6026,2,0,2423203.0,4,BORTOLUSSI & WATKIN INC,415 453-4675,SB PREF CLAIMED,,962905,0
2,04-294944_6026,3,0,2492868.1,1,EMPIRE LANDSCAPING INC,530 400-3943,SB PREF CLAIMED,,811554,0
3,04-294944_6026,4,0,2797309.74,3,GREEN GROWTH INDUSTRIES INC,925 484-0830,SB PREF CLAIMED,,662718,0
4,04-294944_6026,5,0,4314835.0,6,WABO LANDSCAPE & CONSTRUCTION INC,510 741-9226,SB PREF CLAIMED,,962263,0
5,04-294944_6026,6,0,2287001.8,2,JJ NGUYEN INC,408 259-7982,SB PREF CLAIMED,(IRREGULAR),857202,0
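
The info and bids tables can be cross-checked against each other. The sketch below recomputes `Amount_Under` and `Percent_Est_Under` from `Engineers_Est` and the rank 1 `Bid_Total`, with the values copied by hand from the two outputs above. It assumes the comparison is made against the rank 1 bid rather than the numerically lowest total; in this contract the rank 6 bid is lower but flagged `(IRREGULAR)`.

```python
from decimal import Decimal

engineers_est = Decimal("2524266.25")  # Engineers_Est from c.info.df
rank1_bid = Decimal("2358529.90")      # Bid_Total of Bid_Rank 1 from c.bids.df

# Decimal avoids binary floating-point noise in currency arithmetic.
amount_under = engineers_est - rank1_bid
percent_under = (amount_under / engineers_est * 100).quantize(Decimal("0.01"))

print(amount_under, percent_under)
```

Both results agree with the extracted columns (165736.35 and 6.57), which suggests a simple consistency check of this kind could be run over every contract.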


In [20]:
c.subcontractors.df

,Identifier,Bidder_ID,Subcontractor_Name,Subcontracted_Line_Item,Item_Numbers,Subcontractor_License_Number
0,04-294944_6026,5,HUGHES TREE SERVICE INC,"ITEM 20, 21","20, 21",
1,04-294944_6026,5,MIKE BROWN ELECTRIC CO,"ITEM 69, 70","69, 70",
2,04-294944_6026,4,ATLAS TREE SURGERY,PRUNE EXISTING PLANTS,,
3,04-294944_6026,4,FARIAS GARDEN SERVICE,PLANT ESTABLISHMENT WORK,,
4,04-294944_6026,4,FREEDLUN HYDROSEEDING,TEMPORARY TACKED STRAW,,
5,04-294944_6026,4,MIKE BROWN ELECTRIC,ELECTRICAL SERVICE (IRRIGATION),,
6,04-294944_6026,1,NONE,NONE,,
7,04-294944_6026,3,D MERCURIO ENTERPRISES INC,ITEM 15,15,
8,04-294944_6026,3,LEE CONSTRACTORS & CONSULTANTS INC,"ITEMS 69, 70","69, 70",
9,04-294944_6026,6,FREEDLUN HYDROSEEDING,TEM TACKLED STRAW,,
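
In the table above, `Item_Numbers` is filled only when `Subcontracted_Line_Item` has the form `ITEM 15` or `ITEMS 69, 70`; free-text descriptions leave it empty. A minimal sketch of that rule follows. The names `ITEMS_RE` and `item_numbers` are hypothetical, the pattern is inferred from the rows shown, and returning an empty string for non-matches is an assumption.

```python
import re

# "ITEM" or "ITEMS", then one or more comma-separated integers, and nothing else.
ITEMS_RE = re.compile(r"^ITEMS?\s+(\d+(?:\s*,\s*\d+)*)$")

def item_numbers(line_item: str) -> str:
    """Return the item numbers as 'n, m, ...' or '' when the text is a description."""
    m = ITEMS_RE.match(line_item.strip())
    if m is None:
        return ""
    # Normalize spacing so "69,70" and "69 , 70" both become "69, 70".
    return ", ".join(n.strip() for n in m.group(1).split(","))

for text in ("ITEM 20, 21", "ITEMS 69, 70", "ITEM 15", "PRUNE EXISTING PLANTS"):
    print(repr(text), "->", repr(item_numbers(text)))
```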


In [21]:
c.items.df

,Identifier,Item_Number,Extra1,Item_Code,Item_Description,Extra2,Item_Dollar_Amount
0,04-294944_6026,1,,070030,LEAD COMPLIANCE PLAN,LS LUMP SUM 600.00,600.00
1,04-294944_6026,2,,120090,CONSTRUCTION AREA SIGNS,"LS LUMP SUM 11,000.00",11000.00
2,04-294944_6026,3,,120100,TRAFFIC CONTROL SYSTEM,"LS LUMP SUM 30,000.00",30000.00
3,04-294944_6026,4,,128652,PORTABLE CHANGEABLE MESSAGE SIGN (LS),"LS LUMP SUM 20,000.00",20000.00
4,04-294944_6026,5,,130100,JOB SITE MANAGEMENT,"LS LUMP SUM 10,000.00",10000.00
...,...,...,...,...,...,...,...
73,04-294944_6026,74,,202038,PACKET FERTILIZER,EA 10 1.00,10.00
74,04-294944_6026,75,,204008,PLANT (GROUP H),EA 10 70.00,700.00
75,04-294944_6026,76,,731510,"MINOR CONCRETE (CURB, GUTTER, SIDEWALKAND DRIV...","CY 3 2,000.00",6000.00
76,04-294944_6026,77,,860774,SPRINKLER CONTROL CONDUIT (BRIDGE) (LF),LF 310 17.00,5270.00
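
The `Extra2` column appears to hold a unit of measure, a quantity (or `LUMP SUM`), and a unit price, with `Item_Dollar_Amount` equal to quantity times price. That reading is an assumption based on the rows shown above, for example `EA 10 70.00` next to `700.00`. A minimal sketch for splitting the column and checking the amount, using the hypothetical names `EXTRA2_RE` and `parse_extra2`:

```python
import re
from decimal import Decimal

# Unit code, then either the literal "LUMP SUM" or a number, then a price with two decimals.
EXTRA2_RE = re.compile(
    r"^(?P<unit>[A-Z]+)\s+"
    r"(?P<quantity>LUMP SUM|[\d,]+(?:\.\d+)?)\s+"
    r"(?P<price>[\d,]+\.\d{2})$"
)

def to_decimal(text: str) -> Decimal:
    """Convert '2,000.00' style numbers to Decimal."""
    return Decimal(text.replace(",", ""))

def parse_extra2(extra2: str):
    """Return (unit, quantity, unit_price, amount); quantity is None for lump sums."""
    m = EXTRA2_RE.match(extra2.strip())
    if m is None:
        raise ValueError(f"Unrecognized Extra2 value: {extra2!r}")
    price = to_decimal(m["price"])
    if m["quantity"] == "LUMP SUM":
        return m["unit"], None, price, price
    quantity = to_decimal(m["quantity"])
    return m["unit"], quantity, price, quantity * price

for text in ("LS LUMP SUM 11,000.00", "EA 10 70.00", "CY 3 2,000.00", "LF 310 17.00"):
    print(text, "->", parse_extra2(text))
```

Comparing the computed amount with `Item_Dollar_Amount` gives a cheap per-row sanity check on the extraction.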


We can also write all the information to an Excel file:

In [22]:
c.write_to_disk()

Saved to Excel file at: results/single_contracts/04-294944_6026.xlsx.


In [23]:
# filepath = RAW_DATA_PATH.parent / 'sample' / '01-0A3804.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A3804.pdf_4353.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0904.pdf_2724.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A1204.pdf_11468.txt'
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0F4304.pdf_12346.txt'  # issue # 11
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K6104.pdf_12731.txt'  # issue # 9
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0K4604.pdf_12040.txt'  # issue # 1
# filename = '01-0H3204.pdf_9871.txt'  # issue # 5
# filepath = RAW_DATA_PATH_LINEPRINTER / '01-0A0404.pdf_10165.txt'  # different format
# filepath = RAW_DATA_PATH_LINEPRINTER / '04-4G6404.pdf_7310.txt'


## Several Contracts

We now run the extraction on a small sample. First we list the contract types and select a few file paths:

In [24]:
df_contract_types, _ = get_contract_types()
df_contract_types

Filename,Relative_Path,Contract_Number,Tag,Identifier,Contract_Type
09-354304.pdf_5533,lineprinter/09-354304.pdf_5533.txt,09-354304,5533,09-354304_5533,1
02-360704.pdf_5397,lineprinter/02-360704.pdf_5397.txt,02-360704,5397,02-360704_5397,1
12-0S9004.pdf_12386,lineprinter/12-0S9004.pdf_12386.txt,12-0S9004,12386,12-0S9004_12386,1
02-0H2904.pdf_6987,lineprinter/02-0H2904.pdf_6987.txt,02-0H2904,6987,02-0H2904_6987,1
04-0435E4.pdf_2074,lineprinter/04-0435E4.pdf_2074.txt,04-0435E4,2074,04-0435E4_2074,1
...,...,...,...,...,...
01-488504.pdf_1428,lineprinter/01-488504.pdf_1428.txt,01-488504,1428,01-488504_1428,1
03-1G0004.pdf_11241,lineprinter/03-1G0004.pdf_11241.txt,03-1G0004,11241,03-1G0004_11241,1
03-3M9504.pdf_4004,lineprinter/03-3M9504.pdf_4004.txt,03-3M9504,4004,03-3M9504_4004,1
04-2K7104.pdf_12426,lineprinter/04-2K7104.pdf_12426.txt,04-2K7104,12426,04-2K7104_12426,1


In [25]:
filepaths = get_some_contracts()
filepaths

[PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter/03-4M4804.pdf_4764.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter/02-0J1404.pdf_11341.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter/11-408004.pdf_7191.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter/12-0M4804.pdf_10206.txt'),
 PosixPath('/Users/nenadbozinovic/Documents/caltrans_data_extraction/data/lineprinter/12-0K3704.pdf_2018.txt')]

In [26]:
ex = Experiment(filepaths, tag='5')

In [27]:
ex.run()

Processing file 1/5


In [28]:
ex.write_to_disk()

Saved data to: results/03-27-2024-00:48:26_tag:_5_type:_1.


## All Type 1 Contracts

In [29]:
filepaths = get_some_contracts(num_contracts=None)

In [30]:
len(filepaths)

8809

In [31]:
ex = Experiment(filepaths, tag='all')
ex.run()

Processing file 1/8809
Processing file 101/8809
Processing file 201/8809
Processing file 301/8809
Processing file 401/8809
Processing file 501/8809
Processing file 601/8809
Processing file 701/8809
Processing file 801/8809
Processing file 901/8809
Processing file 1001/8809
Processing file 1101/8809
Processing file 1201/8809
Processing file 1301/8809
Processing file 1401/8809
Processing file 1501/8809
Processing file 1601/8809
Processing file 1701/8809
Processing file 1801/8809
Processing file 1901/8809
Processing file 2001/8809
Processing file 2101/8809
Processing file 2201/8809
Processing file 2301/8809
Processing file 2401/8809
Processing file 2501/8809
Processing file 2601/8809
Processing file 2701/8809
Processing file 2801/8809
Processing file 2901/8809
Processing file 3001/8809
Processing file 3101/8809
Processing file 3201/8809
Processing file 3301/8809
Processing file 3401/8809
Processing file 3501/8809
Processing file 3601/8809
Processing file 3701/8809
Processing file 3801/880
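
The progress lines above are printed once every 100 files. How `Experiment.run` is implemented is not shown here, so the following is only a sketch of one way to structure such a batch loop: it reports progress at the same cadence and collects failures instead of stopping at the first bad file. `run_batch`, `process`, and `fake_extract` are hypothetical names.

```python
def run_batch(filepaths, process, report_every=100):
    """Apply `process` to each path, printing progress and collecting failures."""
    results, failures = [], []
    total = len(filepaths)
    for i, path in enumerate(filepaths):
        if i % report_every == 0:
            print(f"Processing file {i + 1}/{total}")
        try:
            results.append(process(path))
        except Exception as exc:
            # Keep going; failed paths can be inspected one by one afterwards.
            failures.append((path, exc))
    return results, failures

def fake_extract(path):
    """Stand-in for a per-file extraction step; fails on one chosen file."""
    if path == "file_42.txt":
        raise ValueError("unexpected format")
    return len(path)

paths = [f"file_{n}.txt" for n in range(250)]
results, failures = run_batch(paths, fake_extract)
print(len(results), "succeeded;", len(failures), "failed")
```

With thousands of contracts, recording failures rather than raising makes it possible to finish the run and then open the problem files individually with `Contract`.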

In [32]:
ex.write_to_disk()