## Adding data from FOIA requests

The data in the Chicago Data Portal only includes employee reimbursements paid after Jan. 1 of the previous calendar year. I am in the process of submitting FOIA requests to get data from further back, so that my site can host both historical and current data in one place.

After importing the necessary packages, I define the path to the data folder and use `glob` to collect every file in it that ends with the extension .xlsx. Each file shared with me holds the reimbursements paid in one quarter of one requested year, so my first return of 5 years of transactions gave me 20 Excel files.

In [96]:
import pandas as pd
import glob
import os

In [97]:
path = '/Users/hgorledeenn/Desktop/Chicago-Reimbursements-site/foia_data'
all_files = glob.glob(os.path.join(path, "*.xlsx"))
columns_to_use = ['Dept Code', 'Pymt Date', 'Vendor Name', 'Vendor Number', 'Voucher (Batch) Number', 'Invoice Line Nbr', 'Invoice Line Amount', 'Invoice Number', 'Invoice Date', 'Invoice Description']
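To show what the `*.xlsx` pattern does and does not match, here is a minimal sketch that runs against a temporary folder. The file names and the `list_xlsx_files` helper are hypothetical and exist only for this illustration; they are not part of the FOIA data.

```python
import glob
import os
import tempfile

def list_xlsx_files(folder):
    """Return the sorted paths of every .xlsx file directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.xlsx")))

# Hypothetical file names, used only to show what the pattern matches.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2021_Q1.xlsx", "2021_Q2.xlsx", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    matched = [os.path.basename(p) for p in list_xlsx_files(tmp)]

print(matched)
```

Sorting the result is optional, but `glob.glob` does not promise any particular order, so sorting makes the order in which files are read repeatable.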

## Reading in the Data

I use a for loop to read in each Excel file, then combine them all into one large dataframe.

In [98]:
df_list = []

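for filename in all_files:
    # Read only the columns needed from each quarterly file
    df_small = pd.read_excel(filename, usecols=columns_to_use)
    df_list.append(df_small)

# Combine all of the files once, after the loop has finished
df = pd.concat(df_list, ignore_index=True)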

df.head()

Unnamed: 0,Dept Code,Pymt Date,Vendor Name,Vendor Number,Voucher (Batch) Number,Invoice Line Nbr,Invoice Line Amount,Invoice Number,Invoice Date,Invoice Description
0,1.0,2021-02-05,"HALL, PATRICK J",10530390.0,PV01210100003,1,305.7,00003-2021,2021-01-25,Travel to Springfield 1/8-1/13/2021
1,6.0,2021-03-29,"NOLAN, CARLETON",10551496.0,PV06190600581,1,1093.4,52585266,2019-06-11,LODGING REIMBURSEMENT TO CIO CARLETON NOLAN/NY...
2,15.0,2021-01-05,"CARDONA, FELIX JR.",10560242.0,PV15201554053,1,18.0,194B0039969,2020-12-18,"Parking, gasoline, insurance and maintenance o..."
3,15.0,2021-01-05,"CARDONA, FELIX JR.",10560242.0,PV15201554068,1,55.0,029046,2020-12-18,"Postage, shipping and messenger fees"
4,15.0,2021-01-05,"CARDONA, FELIX JR.",10560242.0,PV15201554068,1,32.98,9569293,2020-12-21,Office supplies
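The read-then-combine pattern can be tried without any Excel files. The sketch below uses two small in-memory dataframes with invented values in place of the quarterly files, and the `combine_frames` helper is a hypothetical name for this example only.

```python
import pandas as pd

def combine_frames(frames):
    """Stack a list of DataFrames into one, renumbering the index from 0."""
    return pd.concat(frames, ignore_index=True)

# Two toy frames standing in for two quarterly files (values are invented).
q1 = pd.DataFrame({"Vendor Name": ["A"], "Invoice Line Amount": [10.0]})
q2 = pd.DataFrame({"Vendor Name": ["B", "C"], "Invoice Line Amount": [5.0, 2.5]})

combined = combine_frames([q1, q2])
print(combined)
```

Without `ignore_index=True`, each frame would keep its own index, so row labels such as 0 would appear more than once in the combined result.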


## Data Cleaning

I did some data cleaning: I filtered out the 'Sum:' subtotal rows and the rows where the invoice amount is $0. I also changed two column types to make them more useful for me (or just easier to read), converting the department code to an integer and the vendor number to a generic object.

In [99]:
df = df[df['Invoice Line Nbr'] != 'Sum:']
# .copy() makes the filtered result its own dataframe before columns are reassigned below
df = df[df['Invoice Line Amount'] != 0].copy()

df['Dept Code'] = df['Dept Code'].astype('Int64')
df['Vendor Number'] = df['Vendor Number'].astype('object')

df.dtypes

Dept Code                          Int64
Pymt Date                 datetime64[ns]
Vendor Name                       object
Vendor Number                     object
Voucher (Batch) Number            object
Invoice Line Nbr                  object
Invoice Line Amount              float64
Invoice Number                    object
Invoice Date              datetime64[ns]
Invoice Description               object
dtype: object
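The same cleaning steps can be checked on a few invented rows. This is only a sketch on toy data, not the FOIA records; it combines the two filters into one boolean mask, which should give the same result as filtering twice.

```python
import pandas as pd

# Toy data (invented) with a 'Sum:' subtotal row and a zero-amount row.
toy = pd.DataFrame({
    "Dept Code": [1.0, 6.0, None, 15.0],
    "Invoice Line Nbr": [1, 2, "Sum:", 1],
    "Invoice Line Amount": [305.7, 0.0, 305.7, 18.0],
})

# Keep rows that are neither subtotals nor zero-dollar lines.
keep = (toy["Invoice Line Nbr"] != "Sum:") & (toy["Invoice Line Amount"] != 0)
cleaned = toy[keep].copy()  # .copy() gives an independent dataframe

# 'Int64' (capital I) is pandas' nullable integer type; it tolerates missing values.
cleaned["Dept Code"] = cleaned["Dept Code"].astype("Int64")
print(cleaned)
```

The capital-I `'Int64'` matters here: the plain `int64` type cannot hold missing values, so converting a column that still contains blanks to it would raise an error.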

Some rows of the dataset represent only part of a paid invoice (indicated by a value greater than 1 in the ```Invoice Line Nbr``` column). I grouped by the ```Voucher (Batch) Number``` column to create a new dataframe holding the total amount for each voucher, then merged in the other columns from the original dataframe and kept only the first row for each voucher.

In [100]:
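# Total the line amounts for each voucher
df_sums = df.groupby('Voucher (Batch) Number')['Invoice Line Amount'].sum()

# Bring the descriptive columns back in from the original dataframe
other_columns = ['Voucher (Batch) Number', 'Dept Code', 'Pymt Date', 'Vendor Name', 'Vendor Number', 'Invoice Number', 'Invoice Date', 'Invoice Description']
df = pd.merge(df_sums, df[other_columns], on='Voucher (Batch) Number', how='left')

# The merge repeats the total on every line of a voucher, so keep one row per voucher
df = df.drop_duplicates(subset=['Voucher (Batch) Number', 'Invoice Line Amount'], keep='first')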

df.head()

Unnamed: 0,Voucher (Batch) Number,Invoice Line Amount,Dept Code,Pymt Date,Vendor Name,Vendor Number,Invoice Number,Invoice Date,Invoice Description
0,PV01180100144,394.8,1,2019-02-21,"COLLINS, ADAM",10230254.0,144,2018-11-14,Travel 11/9-11/11/18
2,PV01180100145,578.18,1,2019-02-22,"SCHWESKA, PATRICK R",10231322.0,145,2018-12-03,Travel 11/26-11/29/18
4,PV01180100146,1298.0,1,2019-02-22,"NEWBERN, TIFFANY G",10202626.0,146,2018-12-06,Travel 3/5-3/7/18
8,PV01180100147,277.03,1,2019-02-22,"MCGRATH, MATTHEW D",10514870.0,147,2018-12-19,Mayoral meeting
9,PV01180100148,147.72,1,2019-02-22,"CASTRO, VERONICA",10131120.0,148,2018-12-12,Mayoral meeting 12/12/18
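For comparison, one alternative to the group, merge, and de-duplicate steps is a single `groupby(...).agg(...)` call that sums the amount and takes the first value of the other columns. This is a sketch on invented rows, not the approach used above, and it differs slightly: `'first'` takes the first non-missing value in each column rather than the whole first row.

```python
import pandas as pd

# Toy data (invented): voucher PV2 is split across two invoice lines.
toy = pd.DataFrame({
    "Voucher (Batch) Number": ["PV1", "PV2", "PV2"],
    "Invoice Line Amount": [100.0, 40.0, 60.0],
    "Vendor Name": ["DOE, JANE", "ROE, RICHARD", "ROE, RICHARD"],
})

# Sum the amounts per voucher and keep the first vendor name seen for each one.
totals = toy.groupby("Voucher (Batch) Number", as_index=False).agg(
    {"Invoice Line Amount": "sum", "Vendor Name": "first"}
)
print(totals)
```

Either route should leave one row per voucher; the single-call version mainly avoids the intermediate merge.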
