### **``Exploration Notebook``** 
``Equity Impact on Employee Attrition in the Workplace``

``Created by: Mijail Q. Mariano``

``13AUGUST2022``

----

In [1]:
# notebook dependencies
%matplotlib inline
import matplotlib as mlp
# mlp.rcParams['figure.dpi'] = 200

# disabling warning messages
import warnings
warnings.filterwarnings("ignore")

# importing key libraries
import pandas as pd
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.2f}'.format

# numpy import
import numpy as np

# importing acquire module
import acquire

# importing data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set(style = "darkgrid")

# scipy import
# import scipy.stats as stats
# from math import sqrt

# importing datetime module
# import datetime

# sklearn data science library
# from sklearn.impute import KNNImputer
# from sklearn.impute import MissingIndicator
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.metrics import mean_squared_error
# from sklearn.impute import IterativeImputer
# from sklearn.cluster import KMeans

# from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, PolynomialFeatures
# from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
# from sklearn.feature_selection import SelectKBest, RFE, f_regression
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.inspection import permutation_importance

----
#### **``Initial Planning/Ideas``**

Individual Data Science Project:

Mijail Mariano

August 13th 2022

**<u>``1. Formulating the question``</u>**

``This question should be:``

* About social equity or of similar importance (i.e., inequality, racial discrimination, social-mobility, equal opportunity)
* Framed so that it can be measured quantitatively in terms of organizational value, while also raising the question: “How equal, diverse, or fair is an organization's current workplace?”

**<u>``2. Exploration questions:``</u>**

**``What are you attempting to predict/help to address:``**

``Employee/Company Attrition Rate``

* What is company attrition?
* Why is company attrition important?
* What are the employee attrition demographics?
* Are there pros to attrition? If so, what are these?

**``What specifically are you attempting to investigate/understand:``**

``Equity in the workplace and its impact on attrition``

*Ok, but what specifically?...*

``Do socioeconomic/location factors impact whether or not an employee remains with a company? Factors such as:``

* Where an employee is from or grew up (county level)
* The high-school graduation rate
* Incarceration/prison rate
* Fraction of population married by 35 years old
* Poverty rate
* Teenage birth rate

``Are there other questions that may be important to answer?``

* How much does an employee's geographical background (where they are from) impact their decision to remain with or leave the company?
* Are there socioeconomic/employee demographic differences between those employees who leave the company and those who remain? (descriptive/summary statistics)
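The descriptive comparison in the second question could be sketched as a group-wise summary. This is only a minimal illustration: the `toy` frame and its values are invented for the example, and the two socioeconomic column names are assumed to match the county table used in this project.

```python
import pandas as pd

# toy data, invented purely for illustration (not taken from either dataset)
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "poverty_rate": [0.31, 0.07, 0.13, 0.17, 0.15, 0.10],
    "incarceration_rate": [0.02, 0.00, 0.01, 0.01, 0.02, 0.01],
})

# share of employees who left (the attrition rate)
attrition_rate = (toy["Attrition"] == "Yes").mean()

# mean and median of each socioeconomic variable, split by attrition status
summary = (
    toy.groupby("Attrition")[["poverty_rate", "incarceration_rate"]]
    .agg(["mean", "median"])
)

print(f"attrition rate: {attrition_rate:.2f}")
print(summary)
```

Differences between the "Yes" and "No" rows of such a table are only descriptive; whether they are statistically meaningful would still need a formal test.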

**<u>``3. Methodology:``</u>**

**``Note:``** 

For this project I assume the company is located in New York City, NY and that employees come only from the three (3) states of the NYC tri-state area: Connecticut, New Jersey, and New York. To conduct the analysis I also use a random generator to blindly assign each employee a county where they were born/grew up, together with that county's socioeconomic variables, so that these variables can be explored statistically.

``Where’s the data from?``

To conduct this analysis and potentially generate a predictive company attrition model, I combine real socioeconomic and economic data from Harvard’s Opportunity Atlas with an artificially created 2017 IBM Human Resources Kaggle dataset of a small-to-medium sized company (~1500 records).

The Opportunity Atlas is a collaborative social equality project between Harvard University, the US Census Bureau, and the US Internal Revenue Service. The initiative’s aim is to track and plot socioeconomic data by US state, county, city, and neighborhood in order to understand children’s outcomes and prospects for social mobility.

The Atlas is composed of ~21 million Americans born between 1978-1983 who are in their mid-to-late thirties today. The platform and estimates are based on:

* The 2000 and 2010 Decennial Census short form
* Federal income tax returns for 1989, 1994, 1995, and 1998-2015
* Data from the American Community Survey

<u>Reference Links:</u>
* https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
* https://www.opportunityatlas.org/

``Why couldn’t you use a real dataset?``

Given the sensitive nature of real employee information, it is relatively difficult to obtain similar publicly available data from businesses. Additionally, since organizations do not commonly collect the socioeconomic information/drivers that I attempt to investigate, combining synthetic and real data seemed like an adequate method for scientific testing.

``So how should I think about this data?``

You can think about this data and the subsequent estimates as a way to understand how geographical/environmental characteristics potentially play a role in employee tenure. Additionally, these estimates help employers understand potential equity differences among their employees so that they can address them and successfully retain those employees.

``Why might these employees decide to leave their company?`` 

(said another way)....
How might these demographic differences contribute to an employee’s decision to stay or leave their company?

Ok, so what happens if employers don’t retain these employees?

**<u>``4. What can employers do to retain these employees?``</u>**

(placeholder for recommendations)


``Opportunity Atlas (Equity DF): features/variables``
1. High_School_Graduation_Rate_rP_gP_pall
2. Household_Income_at_Age_35_rP_gP_pall
3. Incarceration_Rate_rP_gP_pall
4. Fraction_Married_at_Age_35_rP_gP_pall
5. Poverty_Rate_in_2012-16
6. Teenage_Birth_Rate_women_only_rP_gF_pall

``IBM Dataset: features/variables``
1. Age
2. Attrition
3. BusinessTravel
4. DailyRate
5. Department
6. DistanceFromHome
7. Education
8. EducationField
9. EmployeeCount
10. EmployeeNumber
11. EnvironmentSatisfaction
12. Gender
13. HourlyRate
14. JobInvolvement
15. JobLevel
16. JobRole
17. JobSatisfaction
18. MaritalStatus
19. MonthlyIncome
20. MonthlyRate
21. NumCompaniesWorked
22. Over18
23. OverTime
24. PercentSalaryHike
25. PerformanceRating
26. RelationshipSatisfaction
27. StandardHours
28. StockOptionLevel
29. TotalWorkingYears
30. TrainingTimesLastYear
31. WorkLifeBalance
32. YearsAtCompany
33. YearsInCurrentRole
34. YearsSinceLastPromotion
35. YearsWithCurrManager


----

### **``Data Acquisition and Preparation``**

In [2]:
# let's import the IBM employee data first

ibm_df = pd.read_csv("/Users/mijailmariano/Desktop/IBM_HR-Employee-Attrition.csv")
print()
print(f'IBM dataset shape: {ibm_df.shape}')
ibm_df.head()


IBM dataset shape: (1470, 35)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [3]:
# let's import the opportunity atlas data

equity_df = pd.read_csv("/Users/mijailmariano/Desktop/equity_table.csv")
print()
print(f'Equity dataset shape: {equity_df.shape}')
equity_df.head()


Equity dataset shape: (91, 9)


Unnamed: 0,cty,county_name,state,househouse_income_at_35,high-school_graduation_rate,percentage_married_by_35,incarceration_rate,women_teenage_birthrate,poverty_rate
0,cty36001,Albany County,NY,54559,0.92,0.49,0.01,0.1,0.13
1,cty36003,Allegany County,NY,43477,0.89,0.53,0.01,0.17,0.17
2,cty34001,Atlantic County,NJ,40515,0.87,0.39,0.02,0.21,0.15
3,cty34003,Bergen County,NJ,63424,0.94,0.51,0.0,0.04,0.07
4,cty36005,Bronx County,NY,32542,0.78,0.22,0.02,0.28,0.31


In [4]:
# let's use a random sampler to create 1470 geographical location records
# using Pandas' '.sample()' method with the 'replace' parameter set to True to allow for duplicate records
# resetting the index number
# setting a random state for reproducibility

sample_df = equity_df.sample(n = 1470, replace = True, ignore_index = True, random_state = 528)
sample_df.shape

(1470, 9)
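Because `.sample()` draws rows uniformly when no weights are given, every county is equally likely to be assigned regardless of its population, so a state's share of employees should roughly track its share of counties in the table. A minimal sketch of how this could be checked, using a made-up four-county table (`counties` is hypothetical and stands in for `equity_df`):

```python
import pandas as pd

# hypothetical stand-in for the county table (3 NY counties, 1 NJ county)
counties = pd.DataFrame({
    "county_name": ["Albany County", "Bronx County", "Allegany County", "Bergen County"],
    "state": ["NY", "NY", "NY", "NJ"],
})

# same call pattern as the cell above: uniform draw over rows, with replacement
draws = counties.sample(n=1000, replace=True, ignore_index=True, random_state=528)

# observed state shares vs. the shares implied by the number of counties per state
observed = draws["state"].value_counts(normalize=True)
expected = counties["state"].value_counts(normalize=True)

print(pd.DataFrame({"observed": observed, "expected": expected}))
```

If population-weighted assignment were preferred instead, `.sample()` also accepts a `weights` argument, though that would require a population column that is not part of the table shown here.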

In [5]:
# let's also shuffle the ibm df so that employees are paired with locations at random
# resetting the index number (can use unique employee id for predictions/future indexing)
# setting a random state for reproducibility

ibm_shuffled = ibm_df.sample(n = 1470, replace = False, ignore_index = True, random_state = 528)

print()
print(f'dataframe shape: {ibm_shuffled.shape}')
ibm_shuffled.head() # checks out!


dataframe shape: (1470, 35)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,53,No,Travel_Rarely,1223,Research & Development,7,2,Medical,1,1201,...,2,80,1,26,6,3,7,7,4,7
1,42,No,Travel_Rarely,933,Research & Development,29,3,Life Sciences,1,836,...,4,80,1,10,3,2,9,8,7,8
2,30,Yes,Travel_Rarely,740,Sales,1,3,Life Sciences,1,1562,...,4,80,1,10,4,3,10,8,6,7
3,41,No,Travel_Rarely,1411,Research & Development,19,2,Life Sciences,1,334,...,1,80,2,17,2,2,1,0,0,0
4,34,No,Travel_Rarely,628,Research & Development,8,3,Medical,1,2068,...,1,80,0,6,3,4,4,3,1,2


In [6]:
# concatenating the two (2) dataframes column-wise

df = pd.concat([ibm_shuffled, sample_df], axis = 1)

print()
print(f'dataframe shape: {df.shape}')
df.head() # checks out!


dataframe shape: (1470, 44)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,YearsWithCurrManager,cty,county_name,state,househouse_income_at_35,high-school_graduation_rate,percentage_married_by_35,incarceration_rate,women_teenage_birthrate,poverty_rate
0,53,No,Travel_Rarely,1223,Research & Development,7,2,Medical,1,1201,...,7,cty36015,Chemung County,NY,44789,0.89,0.49,0.01,0.18,0.16
1,42,No,Travel_Rarely,933,Research & Development,29,3,Life Sciences,1,836,...,8,cty36065,Oneida County,NY,48064,0.9,0.5,0.01,0.12,0.18
2,30,Yes,Travel_Rarely,740,Sales,1,3,Life Sciences,1,1562,...,7,cty09009,New Haven County,CT,50468,0.91,0.45,0.01,0.11,0.13
3,41,No,Travel_Rarely,1411,Research & Development,19,2,Life Sciences,1,334,...,0,cty09009,New Haven County,CT,50468,0.91,0.45,0.01,0.11,0.13
4,34,No,Travel_Rarely,628,Research & Development,8,3,Medical,1,2068,...,2,cty36119,Westchester County,NY,57101,0.92,0.45,0.01,0.09,0.1
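`pd.concat(..., axis = 1)` pairs rows by index label rather than by position, which is why both frames above were sampled with `ignore_index = True` before being joined. A minimal sketch of the difference, using made-up three-row frames (`left` and `right` are hypothetical stand-ins for the shuffled IBM data and the sampled county data):

```python
import pandas as pd

left = pd.DataFrame({"Age": [53, 42, 30]})
right = pd.DataFrame({"state": ["NY", "NY", "CT"]})

# reorder the rows while keeping the original index labels (2, 0, 1)
left_shuffled = left.iloc[[2, 0, 1]]

# concat aligns on index labels, so each Age is paired with the same
# state it would have had before the reorder
aligned = pd.concat([left_shuffled, right], axis=1)

# resetting the index first pairs the rows by position instead
positional = pd.concat([left_shuffled.reset_index(drop=True), right], axis=1)

# quick sanity checks: same row count as the inputs and no missing values
assert positional.shape == (3, 2)
assert not positional.isna().any().any()

print(aligned.sort_index())
print(positional)
```

The same two checks (row count and no missing values) could be run on the combined `df` to confirm the join behaved as intended.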
