# Project 2
## Lawrence Liu, Jackie McGinley

# Background
#### A large company named Canterra employs around 4000 people at any given time. However, every year around 15% of its employees leave the company and must be replaced from the talent pool available in the job market. Management believes that this level of attrition (employees leaving, either on their own or because they were fired) is bad for the company for the following reasons:
- The former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners
- A sizable department has to be maintained, for the purposes of recruiting new talent
- More often than not, the new employees have to be trained for the job and/or given time to acclimatize themselves to the company
#### Management hypothesizes that higher job satisfaction and a higher number of total working years will reduce employee attrition. Additionally, marketing management wants to know whether demographic variables such as gender, education, and age affect employee attrition. Hence, management has contracted you as a consultant to determine whether job satisfaction and total working years are the two factors they should focus on in order to curb attrition. In other words, they want to know whether changes in their internal and external recruitment strategies would help retain employees.



In [6]:
### Step 0: Setup ----
# Load any libraries used for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import statsmodels.api as sm

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_curve,
    roc_auc_score
)


In [8]:
# -------------------------------------------------------------------
# Step 1: Pre-processing
# -------------------------------------------------------------------

print("Do we have missing values? Look at 'Non-Null Count'")
df = pd.read_excel('Employee_Data_Project.xlsx')
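# df.info() prints the row count, column count, and per-column non-null counts itself;
# wrapping it in print() would only add a trailing "None"
df.info()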


Do we have missing values? Look at 'Non-Null Count'
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   DistanceFromHome         4410 non-null   int64  
 4   Education                4410 non-null   int64  
 5   EmployeeID               4410 non-null   int64  
 6   Gender                   4410 non-null   object 
 7   JobLevel                 4410 non-null   int64  
 8   MaritalStatus            4410 non-null   object 
 9   Income                   4410 non-null   int64  
 10  NumCompaniesWorked       4391 non-null   float64
 11  StandardHours            4410 non-null   int64  
 12  TotalWorkingYears        4401 non-null   float64
 13  TrainingTimesLastYear    4

#### There are missing values in the NumCompaniesWorked, TotalWorkingYears, EnvironmentSatisfaction, and JobSatisfaction columns.
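Reading the 'Non-Null Count' column by eye works, but the same check can be done programmatically. The sketch below is only an illustration: `toy` is a made-up four-row frame that reuses the four column names, and `missing_per_column`, `columns_with_missing`, and `rows_with_missing` are hypothetical variable names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "NumCompaniesWorked": [1.0, np.nan, 4.0, 0.0],
    "TotalWorkingYears": [6.0, 10.0, np.nan, 3.0],
    "EnvironmentSatisfaction": [3.0, 2.0, 4.0, np.nan],
    "JobSatisfaction": [4.0, np.nan, 1.0, 2.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Names of the columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print(columns_with_missing)
print(rows_with_missing)
```

Running the same three expressions on `df` instead of `toy` should list the affected columns and the number of incomplete rows without scanning the `info()` output manually.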

In [None]:
# -------------------------------------------------------------------
# Step 1: Pre-processing (check rows with nulls and visualize those rows)
# -------------------------------------------------------------------

# Get the index labels of every row that contains at least one null
null_rows = df[df.isnull().any(axis=1)].index
# Use those labels to view only the affected rows
df.loc[null_rows]
# Confirmed: 73 rows contain at least one NaN


Unnamed: 0,Age,Attrition,BusinessTravel,DistanceFromHome,Education,EmployeeID,Gender,JobLevel,MaritalStatus,Income,NumCompaniesWorked,StandardHours,TotalWorkingYears,TrainingTimesLastYear,YearsAtCompany,YearsWithCurrManager,EnvironmentSatisfaction,JobSatisfaction
11,36,No,Travel_Rarely,28,1,12,Male,1,Married,33770,0.0,8,16.0,2,15,11,,4.0
23,42,No,Travel_Rarely,4,4,24,Male,1,Married,89260,1.0,8,,4,20,6,2.0,3.0
40,36,No,Travel_Frequently,8,3,41,Female,3,Married,69620,4.0,8,4.0,2,1,0,3.0,
111,31,No,Travel_Rarely,1,3,112,Male,4,Single,28670,0.0,8,3.0,5,2,2,,2.0
115,27,No,Travel_Rarely,2,3,116,Male,1,Divorced,23670,,8,5.0,2,5,4,4.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4226,36,No,Travel_Rarely,2,3,4227,Male,2,Married,43200,,8,5.0,2,1,0,2.0,4.0
4332,31,No,Travel_Rarely,2,5,4333,Male,2,Married,27280,8.0,8,7.0,3,4,2,,4.0
4345,43,No,Non-Travel,6,2,4346,Male,1,Divorced,20280,4.0,8,7.0,2,5,2,4.0,
4395,40,No,Travel_Rarely,2,3,4396,Male,1,Divorced,27180,,8,9.0,4,9,7,1.0,4.0


In [15]:
# -------------------------------------------------------------------
# Step 1: Pre-processing (remove NaNs)
# -------------------------------------------------------------------
df = df.dropna()
# df.info() reports the remaining row count and confirms no nulls are left
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4337 entries, 0 to 4408
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4337 non-null   int64  
 1   Attrition                4337 non-null   object 
 2   BusinessTravel           4337 non-null   object 
 3   DistanceFromHome         4337 non-null   int64  
 4   Education                4337 non-null   int64  
 5   EmployeeID               4337 non-null   int64  
 6   Gender                   4337 non-null   object 
 7   JobLevel                 4337 non-null   int64  
 8   MaritalStatus            4337 non-null   object 
 9   Income                   4337 non-null   int64  
 10  NumCompaniesWorked       4337 non-null   float64
 11  StandardHours            4337 non-null   int64  
 12  TotalWorkingYears        4337 non-null   float64
 13  TrainingTimesLastYear    4337 non-null   int64  
 14  YearsAtCompany           4337
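As a quick sanity check on the cost of dropping incomplete rows, the two row counts reported by `df.info()` above (4410 before, 4337 after) can be compared directly. This is a small arithmetic sketch; the variable names are illustrative only.

```python
rows_before = 4410  # entries reported by df.info() before dropna()
rows_after = 4337   # entries reported by df.info() after dropna()

# Rows removed by listwise deletion, and their share of the original data
rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(f"Dropped {rows_dropped} rows ({share_dropped:.2%} of the data)")
```

The difference matches the 73 rows with NaNs identified earlier, which is under 2% of the original data.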

In [None]:
# Scatterplot of total working years against age, colored by attrition status
base = alt.Chart(df).mark_point(opacity=0.4).encode(
    alt.X('Age:Q', title='Age').axis(ticks=False).scale(zero=False),
    alt.Y('TotalWorkingYears:Q', title='Total Working Years').axis(ticks=False),
    color='Attrition:N',
    tooltip=['EmployeeID', 'Attrition', 'JobSatisfaction']
).properties(
    title='Scatterplot of Total Working Years vs. Age',
    width=1000,
    height=800
)

# One linear trendline per attrition group
trendline = base.transform_regression('Age', 'TotalWorkingYears', method='linear', groupby=['Attrition']).mark_line(size=2)

chart = (base + trendline).configure_axis(grid=False)
chart.show()