## Final Project - NYC Citywide Payroll Data

### Meaghan Burke - Data 608

#### Data Source: https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e

### May 11th, 2019


### Data Citation
**Dataset:** City employee base and overtime salary by fiscal year.

*This dataset reflects the employee's final base and gross salary at the end of the year*

**Columns Included:**

|Column Name|Column Description|
|:----------- |:--------- |
|Payroll Description|The payroll agency that the employee works for|
|Last Name|Last name of employee|
|First Name|First name of employee|
|Middle Initial|Middle initial of employee|
|Agency Start Date|Date on which the employee began working for their current agency|
|Work Location Borough|Borough of employee's primary work location|
|Title Description|Civil service title description of the employee|
|Leave Status as of Jun 30|Status of employee as of the close of the relevant fiscal year: Active, Ceased, or On Leave|
|Base Salary|Base salary assigned to the employee|
|Pay Basis|Lists whether the employee is paid on an hourly, per diem or annual basis|
|Regular Hours|Number of regular hours the employee worked in the fiscal year|
|Regular Gross Paid|The amount paid to the employee for base salary during the fiscal year|
|OT Hours|Overtime hours worked by the employee in the fiscal year|
|Total OT Paid|Total overtime pay paid to the employee in the fiscal year|
|Total Other Pay|Includes any compensation in addition to gross salary and overtime pay, i.e., differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.|


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [50]:
# Load the raw payroll file, then do all of the cleaning and filtering:
# keep only per-annum jobs, active employees and rows with a non-null title.
pay_dataset = pd.read_csv("Citywide_Payroll_Data__Fiscal_Year_.csv", low_memory=False)
pay_dataset['Total Pay'] = pay_dataset[['Regular Gross Paid', 'Total OT Paid', 'Total Other Pay']].sum(axis=1)
# Upper-case and trim every string value so the comparisons below are consistent.
pay_dataset = pay_dataset.applymap(lambda s: s.upper().strip() if isinstance(s, str) else s)
# .copy() makes filtered_pay an independent frame rather than a view of pay_dataset.
filtered_pay = pay_dataset[(pay_dataset['Pay Basis'] == 'PER ANNUM') &
                           (pay_dataset['Leave Status as of June 30'] == 'ACTIVE') &
                           (pay_dataset['Title Description'].notnull())].copy()

# Label missing boroughs explicitly instead of leaving them null.
filtered_pay.loc[filtered_pay['Work Location Borough'].isnull(), 'Work Location Borough'] = 'UNKNOWN'
# Delete the unfiltered frame to free memory.
del pay_dataset
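
The cleaning steps in the cell above can be tried on a small scale before running them on the full file. The sketch below is only an illustration: the four rows are made-up values, and the names `raw`, `clean`, `mask` and `kept` are hypothetical. It uses the vectorized `.str` accessor in place of `applymap`, which is a swapped-in technique, not what the notebook itself does.

```python
import pandas as pd

# Made-up rows with inconsistent casing and whitespace (not real payroll data).
raw = pd.DataFrame({
    "Pay Basis": [" per Annum", "per Hour", "PER ANNUM ", "per annum"],
    "Leave Status as of June 30": ["Active", "ACTIVE", "ceased", "active "],
    "Title Description": ["Clerk", "Lifeguard", "Analyst", None],
    "Work Location Borough": [None, "Queens", "Bronx", "Manhattan"],
})

# Normalize every text column: upper-case, then trim surrounding whitespace.
clean = raw.copy()
for col in clean.columns:
    clean[col] = clean[col].str.upper().str.strip()

# Keep per-annum, active rows that have a title.
mask = (
    (clean["Pay Basis"] == "PER ANNUM")
    & (clean["Leave Status as of June 30"] == "ACTIVE")
    & clean["Title Description"].notna()
)
kept = clean[mask].copy()

# Replace missing boroughs with an explicit label.
kept["Work Location Borough"] = kept["Work Location Borough"].fillna("UNKNOWN")
print(kept)
```

Only the first row survives: the second is hourly, the third has ceased, and the fourth has no title.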

In [51]:
# Recreate an employee ID, because Payroll Number has too many NAs. Checking the unique
# counts showed that the combination of the columns below is unique to each employee.
# Anonymization approach from:
# https://stackoverflow.com/questions/48008334/anonymize-specific-columns-with-pii-in-pandas-dataframe-python
cols = ['Agency Name', 'Last Name', 'First Name', 'Mid Init', 'Agency Start Date', 'Pay Basis']
filtered_pay['Employee_Id'] = filtered_pay[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1).astype('category').cat.codes
# Drop the personally identifying columns and the leave status, which is constant after filtering.
filtered_pay = filtered_pay.drop(['Last Name', 'Payroll Number', 'First Name', 'Mid Init', 'Leave Status as of June 30'], axis=1)
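
To make the surrogate-key step concrete, here is a minimal sketch on made-up records. The frame `toy`, the shortened `key_cols` list and the names in it are assumptions for illustration only. It follows the same idea as the cell above: join the identifying columns into one string, then let pandas category codes assign one integer per distinct string.

```python
import pandas as pd

# Made-up records standing in for payroll rows; the same person appears in two fiscal years.
toy = pd.DataFrame({
    "Agency Name": ["PARKS", "PARKS", "FIRE", "PARKS"],
    "Last Name": ["DOE", "DOE", "ROE", "POE"],
    "First Name": ["JANE", "JANE", "RICHARD", "EDGAR"],
    "Fiscal Year": [2017, 2018, 2018, 2018],
})
key_cols = ["Agency Name", "Last Name", "First Name"]

# Build one string key per row from the identifying columns.
key = toy[key_cols].astype(str).agg("_".join, axis=1)

# Category codes give each distinct key one integer label.
toy["Employee_Id"] = key.astype("category").cat.codes

# Drop the name columns once the surrogate ID exists.
anonymized = toy.drop(columns=["Last Name", "First Name"])
print(anonymized)
```

The codes are assigned from the sorted set of keys present in the data, so they are labels for this particular extract and can change if rows are added or removed. They hide the names in the output but are not a cryptographic anonymization.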

In [52]:
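# Count the unique values per column for each fiscal year;
# this descriptive information is displayed as a table in the Dash application.
unique_values = filtered_pay.groupby(['Fiscal Year']).nunique()

# Summary statistics for the numeric columns.
describe_table = filtered_pay.describe()

unique_values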

|Fiscal Year (index)|Fiscal Year|Agency Name|Agency Start Date|Work Location Borough|Title Description|Base Salary|Pay Basis|Regular Hours|Regular Gross Paid|OT Hours|Total OT Paid|Total Other Pay|Total Pay|Employee_Id|
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
|2014|1|144|10713|5|1202|19453|1|11146|103730|24570|108685|103529|201618|269984|
|2015|1|144|10861|19|1214|23895|1|11221|148845|25537|114468|111498|231606|276448|
|2016|1|144|10968|19|1215|24414|1|11783|133073|26766|115555|116382|217993|285090|
|2017|1|147|11048|19|1225|24813|1|13549|134388|26028|119483|117741|225601|294890|
|2018|1|148|11076|19|1243|23786|1|12814|134315|25186|120379|112577|230863|298016|


In [79]:
# One aggregation per column, applied to each (fiscal year, agency) group.
calcs = {'Work Location Borough': ['nunique'],'Employee_Id':['nunique'], 'Title Description':['nunique'],
        'Regular Hours': ['sum'],'Regular Gross Paid':['mean'], 'OT Hours':['mean'],'Total Other Pay':['mean'],
        'Total Pay':['mean']}
consolidated_table = filtered_pay.groupby(['Fiscal Year','Agency Name']).agg(calcs).reset_index()
# Drop the aggregation-name level so the columns keep their original names.
consolidated_table.columns = consolidated_table.columns.droplevel(-1)
# Largest agencies (by unique employees) first.
consolidated_table = consolidated_table.sort_values('Employee_Id', ascending = False)
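
As a possible alternative to the dictionary of lists followed by `droplevel`, pandas named aggregation (available in pandas 0.25 and later) produces flat, descriptive column names in one step. The sketch below uses made-up numbers; `toy`, `summary` and the output column names are hypothetical and chosen for illustration.

```python
import pandas as pd

# Made-up per-employee rows (not real payroll figures).
toy = pd.DataFrame({
    "Fiscal Year": [2018, 2018, 2018, 2017],
    "Agency Name": ["FIRE", "FIRE", "PARKS", "FIRE"],
    "Employee_Id": [1, 2, 3, 1],
    "Regular Hours": [2000.0, 1800.0, 1900.0, 2000.0],
    "Total Pay": [90000.0, 70000.0, 60000.0, 85000.0],
})

# Each keyword names an output column as (source column, aggregation).
summary = (
    toy.groupby(["Fiscal Year", "Agency Name"])
    .agg(
        employees=("Employee_Id", "nunique"),
        regular_hours=("Regular Hours", "sum"),
        mean_total_pay=("Total Pay", "mean"),
    )
    .reset_index()
    .sort_values("employees", ascending=False)
)
print(summary)
```

Because the output names differ from the source names, a column such as `employees` makes it clear that it holds a count of unique IDs, not the IDs themselves.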

In [82]:
filtered_pay.to_csv("filtered_nyc_payset.csv")

### Project Goal:

I would like to better understand how the city's financial resources are allocated. I would also like to analyze and visualize how NYC employees' salaries vary by occupation, title, time, overtime and borough. As a lifelong New Yorker and the daughter and sister of NYC civil servants, I am very interested to see how NYC compensates its employees. I believe that this project is very relevant to the "current policy" and "business" topics noted in the assignment parameters.

### Tech Toolset:

The data analysis component of this project will be done in Python with the pandas library. For the visualizations, I will deliver embedded images built with the seaborn and matplotlib libraries, develop a Dash or Flask application, or both.
