# Caltrans Data Extraction

The goal of this project is to extract relevant data from text files that were previously converted from PDF files. Because the text files are highly structured, the extraction is done with regular expressions.
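As a rough illustration of the regex approach, here is a minimal sketch. The sample text, the field labels, and the pattern are all invented for this example and are not taken from the project's code or from a real Caltrans file; only the contract number is one that appears later in this notebook.

```python
import re

# Hypothetical excerpt: the real text files may be laid out differently,
# and the "NUMBER OF BIDDERS" value is made up for illustration.
sample_text = """\
CONTRACT NUMBER 03-2G4804
NUMBER OF BIDDERS 4
"""

# Each line is "<UPPERCASE LABEL> <value>"; MULTILINE makes ^ and $ match per line.
FIELD_PATTERN = re.compile(
    r"^(?P<label>[A-Z ]+?)\s+(?P<value>[0-9A-Z-]+)$",
    re.MULTILINE,
)

# Collect every label/value pair found in the text.
fields = {m.group("label"): m.group("value") for m in FIELD_PATTERN.finditer(sample_text)}
print(fields)
```

Named groups keep the pattern readable as more fields are added, which matters when many such patterns have to be maintained side by side.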

## Setup

Install the following packages if you don't have them yet:

In [1]:
# pip install pandas numpy tqdm ipykernel notebook python-dotenv openpyxl pyperclip

In [2]:
from experiment import *

import pyperclip

%reload_ext autoreload
%autoreload 2

# Note: to print DataFrame fully use:
# pd.set_option('display.max_rows', None)  # to set globally, or use: 
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None, 
#                        'display.width', None, 
#                        'display.max_colwidth', None):
#   display(df)

## Rename the mislabeled contracts

Some files were mislabeled, so we rename them here. A file that is not found (for example, because it was already renamed in an earlier run) is reported and skipped:

In [3]:
d = {'07-0W0404..pdf_12652.txt': '07-0W0404.pdf_12652.txt',
     '110427R0.pdf_3052.txt': '01-0A0804.pdf_3052.txt',
     '110427R0.pdf_2981.txt': '01-0A0804.pdf_2981.txt',
     '120712R0.pdf_4254.txt': '11-270804.pdf_4254.txt',
     '08-1N0304..pdf_12819.txt': '08-1N0304.pdf_12819.txt',
     '120928R0.pdf_4565.txt': '04-4S1204.pdf_4565.txt',
     '121016R0.pdf_4699.txt': '04-4S0304.pdf_4699.txt',
     '130220R0.pdf_4863.txt': '03-3E6204.pdf_4863.txt',
     '120717R0.pdf_4252.txt': '05-1A9704.pdf_4252.txt',
     '08-1G2804..pdf_12877.txt': '08-1G2804.pdf_12877.txt'
}

# Apply each rename in both raw-data folders.
for key, value in d.items():
    try:
        path1 = RAW_DATA_PATH_LINEPRINTER / key
        path2 = RAW_DATA_PATH_LINEPRINTER / value
        path1.rename(path2)

        path1 = RAW_DATA_PATH_TABLE / key
        path2 = RAW_DATA_PATH_TABLE / value
        path1.rename(path2)
    except FileNotFoundError:
        # Note: a missing file in the lineprinter folder also skips the table folder.
        print(f'File not found: {key}')


File not found: 07-0W0404..pdf_12652.txt
File not found: 110427R0.pdf_3052.txt
File not found: 110427R0.pdf_2981.txt
File not found: 120712R0.pdf_4254.txt
File not found: 08-1N0304..pdf_12819.txt
File not found: 120928R0.pdf_4565.txt
File not found: 121016R0.pdf_4699.txt
File not found: 130220R0.pdf_4863.txt
File not found: 120717R0.pdf_4252.txt
File not found: 08-1G2804..pdf_12877.txt
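In the rename cell, a missing file in the lineprinter folder raises before the table folder is tried, so the second rename is skipped for that entry. The sketch below shows one way to handle each folder independently. `rename_in_folders` is a hypothetical helper, not part of the project, and the temporary `lineprinter` and `table` folders only stand in for `RAW_DATA_PATH_LINEPRINTER` and `RAW_DATA_PATH_TABLE`.

```python
import tempfile
from pathlib import Path


def rename_in_folders(folders, renames):
    """Rename files in each folder independently.

    Returns the (folder name, old file name) pairs that were not found.
    Hypothetical helper for illustration only.
    """
    missing = []
    for folder in folders:
        for old_name, new_name in renames.items():
            source = folder / old_name
            if not source.exists():
                missing.append((folder.name, old_name))
                continue
            source.rename(folder / new_name)
    return missing


# Demo with throwaway folders standing in for the two raw-data folders.
with tempfile.TemporaryDirectory() as tmp:
    lineprinter = Path(tmp) / "lineprinter"
    table = Path(tmp) / "table"
    lineprinter.mkdir()
    table.mkdir()
    # Only the table folder still holds the mislabeled file.
    (table / "07-0W0404..pdf_12652.txt").write_text("dummy")

    missing = rename_in_folders(
        [lineprinter, table],
        {"07-0W0404..pdf_12652.txt": "07-0W0404.pdf_12652.txt"},
    )
    renamed_ok = (table / "07-0W0404.pdf_12652.txt").exists()

print(missing)
print(renamed_ok)
```

Returning the missing pairs instead of printing them also makes it easy to tell which folder a file was absent from.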


## Classify contracts into types

There are three main contract types. Type 3 consists of merged contracts that need to be parsed first:

In [4]:
sort_contracts()

Found 8977 files in lineprinter/table folder. Started sorting ...


 78%|███████▊  | 7014/8977 [00:04<00:01, 1515.73it/s]

Found duplicate new identifier when parsing: 07-338004_11638


100%|██████████| 8977/8977 [00:06<00:00, 1449.44it/s]


Saved 8776 contracts to type1 folder
Saved 168 contracts to type2 folder
Saved 75 contracts to type3 folder


In [5]:
contract_types, _ = get_contract_types()
contract_types

| Filename | Contract_Number | Tag | Identifier | Contract_Type | Relative_Path | Original_Identifier |
|---|---|---|---|---|---|---|
| 09-354304.pdf_5533 | 09-354304 | 5533 | 09-354304_5533 | 1 | type1/09-354304_5533.txt | |
| 02-360704.pdf_5397 | 02-360704 | 5397 | 02-360704_5397 | 1 | type1/02-360704_5397.txt | |
| 12-0S9004.pdf_12386 | 12-0S9004 | 12386 | 12-0S9004_12386 | 1 | type1/12-0S9004_12386.txt | |
| 02-0H2904.pdf_6987 | 02-0H2904 | 6987 | 02-0H2904_6987 | 1 | type1/02-0H2904_6987.txt | |
| 04-0435E4.pdf_2074 | 04-0435E4 | 2074 | 04-0435E4_2074 | 1 | type1/04-0435E4_2074.txt | |
| ... | ... | ... | ... | ... | ... | ... |
| 01-488504.pdf_1428 | 01-488504 | 1428 | 01-488504_1428 | 1 | type1/01-488504_1428.txt | |
| 03-1G0004.pdf_11241 | 03-1G0004 | 11241 | 03-1G0004_11241 | 1 | type1/03-1G0004_11241.txt | |
| 03-3M9504.pdf_4004 | 03-3M9504 | 4004 | 03-3M9504_4004 | 1 | type1/03-3M9504_4004.txt | |
| 04-2K7104.pdf_12426 | 04-2K7104 | 12426 | 04-2K7104_12426 | 1 | type1/04-2K7104_12426.txt | |
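The rows above suggest how the columns relate to the file name: the part before `.pdf_` is the contract number, the part after it is the tag, and the identifier joins the two with an underscore. A minimal sketch of that mapping follows. `parse_stem` and `STEM_PATTERN` are hypothetical names, and the pattern is inferred from the sample rows only, so the project's own parsing may differ.

```python
import re

# Inferred from the sample rows: two digits, a dash, six alphanumerics,
# then ".pdf_" and a numeric tag. This is an assumption, not the project's rule.
STEM_PATTERN = re.compile(r"^(?P<contract_number>\d{2}-[0-9A-Z]{6})\.pdf_(?P<tag>\d+)$")


def parse_stem(stem):
    """Split a file stem such as '09-354304.pdf_5533' into its parts."""
    match = STEM_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file stem: {stem!r}")
    contract_number = match.group("contract_number")
    tag = match.group("tag")  # kept as a string here
    return {
        "Contract_Number": contract_number,
        "Tag": tag,
        "Identifier": f"{contract_number}_{tag}",
    }


row = parse_stem("09-354304.pdf_5533")
print(row)
```

Raising on an unexpected stem, rather than returning partial data, surfaces mislabeled files early instead of letting them slip into a later step.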


## Single Contract Analysis

Let's look at a single contract:

In [None]:
c = Contract('type2/03-2G4804_12594')

To copy file contents to clipboard:

In [None]:
pyperclip.copy(c.file_contents)

Four attributes are extracted from a contract: `info`, `bids`, `subcontractors`, and `items`. For example:

In [None]:
c.extract()
c.info.df

## Process a single contract

In [None]:
ex = Experiment('type3/07-338004_11638')
ex.run()

## Process several contracts

We now run extraction on a small sample of contracts:

In [None]:
filepaths = get_contract_filepaths(ContractType.TYPE1, num_contracts=5)
ex = Experiment(filepaths)
ex.run()

## Process all contracts

In [None]:
filepaths = get_contract_filepaths(contract_type=ContractType.TYPE1)
ex = Experiment(filepaths)
ex.run()

In [None]:
filepaths = get_contract_filepaths(contract_type=ContractType.TYPE2)
ex = Experiment(filepaths)
ex.run()

In [None]:
filepaths = get_contract_filepaths(contract_type=ContractType.TYPE3)
ex = Experiment(filepaths)
ex.run()

In [None]:
# Some known errors to exclude:
# exclude_type1 = ['12-0K0234_11520', '12-0R2704_10602', '06-0L8404_3005', '07-0W0404_12652', '07-338004_11638']
# exclude_type2 = ['09-237704_3534']
# exclude = exclude_type1 + exclude_type2
# filepaths = [x for x in filepaths if x.stem not in exclude]
# len(filepaths)
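The commented-out filter can be tried on its own with a few stand-in paths. This sketch assumes the file paths are `pathlib.Path` objects, as the use of `x.stem` suggests. The paths are built by hand here instead of coming from `get_contract_filepaths`, and the exclusion list is shortened to two of the identifiers listed in the cell.

```python
from pathlib import Path

# Two identifiers taken from the exclusion lists in the notebook cell.
exclude = {"12-0K0234_11520", "09-237704_3534"}

# Stand-in paths; in the notebook these would come from get_contract_filepaths().
filepaths = [
    Path("type1/12-0K0234_11520.txt"),
    Path("type1/09-354304_5533.txt"),
    Path("type2/09-237704_3534.txt"),
]

# Keep only the files whose stem (file name without extension) is not excluded.
kept = [p for p in filepaths if p.stem not in exclude]
print(kept)
print(len(kept))
```

Using a set for `exclude` keeps the membership test fast even when the list of file paths runs into the thousands.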