
# Analysis of the 2021 Chicago Employee Database



The data set I am analyzing is the current employee database for the City of Chicago.

---

I will predominantly analyze the Annual Salary column, computing descriptive statistics and examining how salaries relate to the qualitative fields in this data set.



In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

To read the data, I first downloaded the .csv file from the City of Chicago website and uploaded it to my GitHub repository, where I retrieved the raw-data URL. I then passed that URL to pd.read_csv() to load the file into a DataFrame.

In [3]:
url = 'https://raw.githubusercontent.com/milan-rajababoo/Chicago-Employee-Information/main/Current_Employee_Names__Salaries__and_Position_Titles.csv'
df = pd.read_csv(url)

# Clean the column names by removing spaces and hyphens.
# regex=False treats each pattern as a literal string.
df.columns = df.columns.str.replace(' ', '', regex=False)
df.columns = df.columns.str.replace('-', '', regex=False)


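_Note: Every row in this data set contains at least one null ("NaN") value. In the preview below, salaried employees have no TypicalHours or HourlyRate, and hourly employees have no AnnualSalary. Calling .dropna() with no arguments would therefore remove every row, so it is not usable here._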
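
The effect described in the note can be reproduced on a small stand-in table. The sketch below uses a hypothetical three-row DataFrame (`toy`) that copies only the column layout of the real data; it is meant to illustrate how `dropna()` behaves with and without the `subset` argument, not to describe how the rest of this notebook handles nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the column layout of the employee table.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'SalaryorHourly': ['Salary', 'Salary', 'Hourly'],
    'TypicalHours': [np.nan, np.nan, 40.0],
    'AnnualSalary': [118998.0, 97440.0, np.nan],
    'HourlyRate': [np.nan, np.nan, 44.4],
})

# With no arguments, dropna() removes any row containing at least one NaN.
all_dropped = toy.dropna()

# Restricting the check to one column keeps the rows that have that value.
salaried = toy.dropna(subset=['AnnualSalary'])
hourly = toy.dropna(subset=['HourlyRate'])

print(len(all_dropped), len(salaried), len(hourly))
```

The aggregations used later in this notebook (`mean()`, `median()`, `var()`, and so on) skip NaN values by default, which is why they still work on the unfiltered column.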

In [4]:
df.head()

Unnamed: 0,Name,JobTitles,Department,FullorPartTime,SalaryorHourly,TypicalHours,AnnualSalary,HourlyRate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,118998.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,97440.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,DAIS,F,Salary,,121272.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,119712.0,
4,"ABARCA, EMMANUEL",CONCRETE LABORER,TRANSPORTN,F,Hourly,40.0,,44.4


In [5]:
# Descriptive statistics for Annual Salary

# Mean
df_mean = df['AnnualSalary'].mean()

# Median
df_median = df['AnnualSalary'].median()

# Mean absolute deviation, computed directly because
# Series.mad() was removed in pandas 2.0
df_mad = (df['AnnualSalary'] - df_mean).abs().mean()

# Variance
df_var = df['AnnualSalary'].var()

# Min
df_min = df['AnnualSalary'].min()

# Max
df_max = df['AnnualSalary'].max()

# Printing a table for easier readability.
stat_dict = {
    'Mean': df_mean,
    'Median': df_median,
    'Mean Abs Dev': df_mad,
    'Variance': df_var,
    'Min': df_min,
    'Max': df_max
}
for name, value in stat_dict.items():
  print('{}: {:.3f}'.format(name, value))

Mean: 92413.141
Median: 90024.000
Mean Abs Dev: 16343.042
Variance: 479826005.289
Min: 20400.000
Max: 275004.000
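
The dispersion figure of 16343.042 printed above is a mean absolute deviation, which is a different measure from the standard deviation. The standard deviation is the square root of the variance, which for the variance reported above works out to about 21,905. The sketch below shows the difference on a hypothetical five-value series (`salaries`); the numbers are made up for illustration and are not drawn from the Chicago data.

```python
import pandas as pd

# Hypothetical salaries, chosen only to make the arithmetic easy to follow.
salaries = pd.Series([50000.0, 60000.0, 70000.0, 80000.0, 140000.0])

# Mean absolute deviation: the average distance from the mean.
mean_abs_dev = (salaries - salaries.mean()).abs().mean()

# Sample variance and standard deviation (pandas uses ddof=1 by default).
variance = salaries.var()
std_dev = salaries.std()

print('Mean absolute deviation: {:.3f}'.format(mean_abs_dev))
print('Standard deviation: {:.3f}'.format(std_dev))
print('Variance: {:.3f}'.format(variance))
```

Because the standard deviation squares each distance before averaging, it weights large deviations more heavily and comes out larger than the mean absolute deviation here, as it does for the salary data.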


In [27]:
# Distribution of Salary
fig = go.Figure(data=[go.Histogram(x=df['AnnualSalary'], marker_color = 'teal')])

fig.update_layout(
    title = 'Distribution of Annual Salary',
    xaxis_title_text = 'Annual Salary (dollars)',
    yaxis_title_text = 'Count',
    bargap = 0.2
)

fig.show()

# Skew of the data
df_skew = df['AnnualSalary'].skew()
print('The skewness of this distribution has the value of {:.3f}.'.format(df_skew))

# Kurtosis of the data
df_kurtosis = df['AnnualSalary'].kurt()
print('The kurtosis of this distribution has the value of {:.3f}.'.format(df_kurtosis))

fig1 = px.box(
    df,
    x = 'AnnualSalary',
    orientation = 'h',
)
fig1.show()

The skewness of this distribution has the value of 0.587.
The kurtosis of this distribution has the value of 1.442.
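
The values 0.587 and 1.442 are easier to read against reference distributions. pandas' `Series.kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0 on both measures. The sketch below is an illustration on synthetic samples only; the seed, sample size, and variable names are arbitrary choices and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic reference samples; the seed and size are arbitrary.
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=100_000))

# A normal sample should have skewness and excess kurtosis near 0.
normal_skew = normal_sample.skew()
normal_kurt = normal_sample.kurt()

# An exponential sample has a long right tail, so both values are clearly positive.
skewed_skew = right_skewed.skew()
skewed_kurt = right_skewed.kurt()

print('Normal:      skew {:.3f}, excess kurtosis {:.3f}'.format(normal_skew, normal_kurt))
print('Exponential: skew {:.3f}, excess kurtosis {:.3f}'.format(skewed_skew, skewed_kurt))
```

Read this way, a positive skewness such as 0.587 indicates a longer right tail than left, and a positive excess kurtosis such as 1.442 indicates heavier tails than a normal distribution would have.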


In [67]:
# Displays the count of how many employees belong to each department.

# rename_axis/reset_index(name=...) yields the columns 'Department' and 'Count'
# regardless of how the installed pandas version names value_counts() output.
num_of_emp_per_department = (
    df['Department'].value_counts().rename_axis('Department').reset_index(name='Count')
)

# Creating a new column which contains the proportion of employees per department
num_of_emp = df.shape[0]
divided = (num_of_emp_per_department['Count'] / num_of_emp).round(decimals = 4)
num_of_emp_per_department['Percent'] = divided


fig = px.pie(
    num_of_emp_per_department,
    names = 'Department',
    values = 'Percent',
    width = 1000,
    height = 800,
    color_discrete_sequence = px.colors.sequential.Viridis_r
)
fig.update_layout(
    title = 'Percentage of Employees in Each Department'
)
fig.show()

In [33]:
# Average annual salary by department
average_salary = df.groupby(['Department'])['AnnualSalary'].mean().reset_index().round(decimals = 2)
average_salary.sort_values(['AnnualSalary'], inplace=True)
fig = px.bar(
    average_salary,
    x = 'Department',
    y = 'AnnualSalary',
    color = 'AnnualSalary',
    color_continuous_scale = 'Viridis'
)
fig.update_layout(
    title = 'Average Annual Salary by Department',
    xaxis_title_text = 'Department',
    yaxis_title_text = 'Annual Salary'
)
fig.show()

In [34]:
# Median annual salary by department
median_salary = df.groupby(['Department'])['AnnualSalary'].median().reset_index().round(decimals = 2)
median_salary.sort_values(['AnnualSalary'], inplace=True)
fig = px.bar(
    median_salary,
    x = 'Department',
    y = 'AnnualSalary',
    color = 'AnnualSalary',
    color_continuous_scale = 'Viridis'
)
fig.update_layout(
    title = 'Median Annual Salary by Department',
    xaxis_title_text = 'Department',
    yaxis_title_text = 'Annual Salary'
)
fig.show()

In [59]:
# Average Hours Worked by Department
# Departments with no TypicalHours data produce NaN and are dropped.
avg_hours = df.groupby(['Department'])['TypicalHours'].mean().reset_index().round(decimals = 2).dropna()
avg_hours.sort_values(['TypicalHours'], inplace=True)
fig = px.bar(
    avg_hours,
    x = 'Department',
    y = 'TypicalHours',
    color = 'TypicalHours',
    color_continuous_scale = 'Viridis'
)
fig.update_layout(
    title = 'Average Hours Worked by Department',
    xaxis_title_text = 'Department',
    yaxis_title_text = 'Typical Hours'
)
fig.show()

In [58]:
# Average Hourly Rate by Department
avg_hourly_rate = df.groupby(['Department'])['HourlyRate'].mean().reset_index().round(decimals = 2).dropna()
avg_hourly_rate.sort_values(['HourlyRate'], inplace=True)
fig = px.bar(
    avg_hourly_rate,
    x = 'Department',
    y = 'HourlyRate',
    color = 'HourlyRate',
    color_continuous_scale = 'Viridis'
)
fig.update_layout(
    title = 'Average Hourly Rate by Department',
    xaxis_title_text = 'Department',
    yaxis_title_text = 'Hourly Rate'
)
fig.show()

print('Note: There were some departments with no hourly rate data.')