## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
# Google Colab attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df = pd.read_csv('./Resources/attrition.csv')
attrition_df.head()

,Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,...,3,1,0,8,0,1,6,4,0,5
1,1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,...,4,4,1,10,3,3,10,7,1,7
2,2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,...,3,2,0,7,3,3,0,0,0,0
3,3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,...,3,3,0,8,3,3,8,7,3,0
4,4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Look at the number of null values in each column
attrition_df.isnull().sum()

Unnamed: 0                  0
Age                         0
Attrition                   0
BusinessTravel              0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

In [3]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Unnamed: 0                  1470
Age                           43
Attrition                      2
BusinessTravel                 3
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EnvironmentSatisfaction        4
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
NumCompaniesWorked            10
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64

In [4]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]

y_df.head()

,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [5]:
# Create a list of at least 10 column names to use as X data
selected_columns = ['Age', 'BusinessTravel', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']


# Create X_df using your selected columns
X_df = attrition_df[selected_columns]


# Show the data types for X_df
X_df.dtypes



Age                          int64
BusinessTravel              object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
EnvironmentSatisfaction      int64
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object
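
The dtype listing shows `EnvironmentSatisfaction` twice because the name appears twice in `selected_columns`, so the same feature is passed to the model in two columns. A minimal sketch of how a column list could be checked for repeats before it is used to index the DataFrame; the helper names `find_duplicates` and `dedupe_preserving_order` and the shortened list `cols` are hypothetical and not part of the notebook:

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


def dedupe_preserving_order(names):
    """Drop repeated names while keeping the original order."""
    return list(dict.fromkeys(names))


# Shortened, illustrative version of the notebook's column list
cols = ['Age', 'EnvironmentSatisfaction', 'EnvironmentSatisfaction', 'HourlyRate']

duplicates = find_duplicates(cols)
unique_cols = dedupe_preserving_order(cols)

print(duplicates)
print(unique_cols)
```

Running a check like this on the full list would flag the repeated entry; whether to drop it is a modelling decision, since removing a column changes the input width of the network.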

In [6]:
# Display X_df
X_df.head()

,Age,BusinessTravel,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,EnvironmentSatisfaction.1,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Travel_Rarely,1,2,Life Sciences,2,2,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,Travel_Frequently,8,1,Life Sciences,3,3,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Travel_Rarely,2,2,Other,4,4,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,Travel_Frequently,3,4,Life Sciences,4,4,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,Travel_Rarely,2,1,Medical,1,1,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [7]:
# Preprocess the y data: label-encode the binary Attrition field

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
y_df = y_df.copy()
le = LabelEncoder()
y_df['Attrition'] = le.fit_transform(y_df['Attrition'])

# Use OneHotEncoder for the multi-category Department field
dept_encoder = OneHotEncoder(sparse_output=False)
dept_encoded = dept_encoder.fit_transform(y_df[['Department']])
dept_encoded_columns = dept_encoder.get_feature_names_out(['Department'])
dept_encoded_df = pd.DataFrame(dept_encoded, columns=dept_encoded_columns)

y_df_processed = pd.concat([y_df, dept_encoded_df], axis=1)

# Drop the original Department column (the encoded Attrition column is kept as the binary target)
y_df_processed = y_df_processed.drop(columns=['Department'])

y_df_processed.rename(columns={'Department_Human Resources':'Department_Human_Resources', 'Department_Research & Development':'Department_Research_and_Development'}, inplace=True)
y_df_processed.head(20)

,Attrition,Department_Human_Resources,Department_Research_and_Development,Department_Sales
0,1,0.0,0.0,1.0
1,0,0.0,1.0,0.0
2,1,0.0,1.0,0.0
3,0,0.0,1.0,0.0
4,0,0.0,1.0,0.0
5,0,0.0,1.0,0.0
6,0,0.0,1.0,0.0
7,0,0.0,1.0,0.0
8,0,0.0,1.0,0.0
9,0,0.0,1.0,0.0
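
In the table above, `Yes` has become 1 and `No` has become 0. `LabelEncoder` assigns integer codes in the sorted order of the classes it sees, which is why `No` comes first. A small sketch on made-up values (the list `toy_attrition` is illustrative, not taken from the dataset) showing how the mapping can be inspected and reversed:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only
toy_attrition = ['Yes', 'No', 'Yes', 'No', 'No']

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_attrition)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(codes))
```

Keeping the fitted encoder around is useful later, when predicted 0/1 values need to be translated back into `No`/`Yes`.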


In [8]:
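# Preprocess X data

# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
X_df = X_df.copy()

# Use LabelEncoder for the binary OverTime field
X_df['OverTime'] = le.fit_transform(X_df['OverTime'])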

# Use OneHotEncoder for multi-categorical fields

# Encode BusinessTravel
biz_travel_encoder = OneHotEncoder(sparse_output=False)
biz_travel_encoded = biz_travel_encoder.fit_transform(X_df[['BusinessTravel']])
biz_travel_encoded_columns = biz_travel_encoder.get_feature_names_out(['BusinessTravel'])
biz_travel_encoded_df = pd.DataFrame(biz_travel_encoded, columns=biz_travel_encoded_columns)

# Encode EducationField
edu_field_encoder = OneHotEncoder(sparse_output=False)
edu_field_encoded =edu_field_encoder.fit_transform(X_df[['EducationField']])
edu_field_encoded_columns = edu_field_encoder.get_feature_names_out(['EducationField'])
edu_field_encoded_df = pd.DataFrame(edu_field_encoded, columns=edu_field_encoded_columns)

# Encode JobRole
job_role_encoder = OneHotEncoder(sparse_output=False)
job_role_encoded =job_role_encoder.fit_transform(X_df[['JobRole']])
job_role_encoded_columns = job_role_encoder.get_feature_names_out(['JobRole'])
job_role_encoded_df = pd.DataFrame(job_role_encoded, columns=job_role_encoded_columns)

# Encode MaritalStatus
marital_status_encoder = OneHotEncoder(sparse_output=False)
marital_status_encoded =marital_status_encoder.fit_transform(X_df[['MaritalStatus']])
marital_status_encoded_columns = marital_status_encoder.get_feature_names_out(['MaritalStatus'])
marital_status_encoded_df = pd.DataFrame(marital_status_encoded, columns=marital_status_encoded_columns)

# Concatenate the X dataframes
X_processed_df = pd.concat([X_df, biz_travel_encoded_df, edu_field_encoded_df, job_role_encoded_df, marital_status_encoded_df], axis=1)

# Drop original columns
X_processed_df = X_processed_df.drop(columns=['BusinessTravel', 'EducationField', 'JobRole', 'MaritalStatus'])

pd.set_option('display.max_columns', None)
X_processed_df.head()


,Age,DistanceFromHome,Education,EnvironmentSatisfaction,EnvironmentSatisfaction.1,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,JobRole_Healthcare Representative,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single
0,41,1,2,2,2,94,3,2,4,8,1,11,3,1,0,8,0,1,6,4,0,5,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,49,8,1,3,3,61,2,2,2,1,0,23,4,4,1,10,3,3,10,7,1,7,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,37,2,2,4,4,92,2,1,3,6,1,15,3,2,0,7,3,3,0,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,33,3,4,4,4,56,3,1,3,1,1,11,3,3,0,8,3,3,8,7,3,0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,27,2,1,1,1,40,3,1,2,9,0,12,3,4,1,6,3,3,2,2,2,2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
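
The four encoder blocks above follow the same pattern: one-hot encode a column, wrap the result in a DataFrame, concatenate, then drop the original. As a more compact alternative (a swapped-in technique, not what the notebook does), `pd.get_dummies` can produce the same kind of columns in one call. The sketch below uses a made-up three-row frame, `toy_df`. Note that `get_dummies` only creates columns for categories present in the frame it is given, so unlike a fitted `OneHotEncoder` it has no separate fit and transform steps.

```python
import pandas as pd

# Illustrative data only; column names mirror two of the notebook's categorical fields
toy_df = pd.DataFrame({
    'Age': [41, 49, 37],
    'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Travel_Rarely'],
    'MaritalStatus': ['Single', 'Married', 'Single'],
})

# Encode the listed columns, drop the originals, and keep everything else unchanged
encoded_df = pd.get_dummies(
    toy_df,
    columns=['BusinessTravel', 'MaritalStatus'],
    dtype=float,
)

print(encoded_df.columns.tolist())
```

The resulting column names use the same `<column>_<category>` pattern that `get_feature_names_out` produced above.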


In [24]:
# Split the data into training and testing sets (train_test_split was imported in the first cell)
y_processed_attrition_df = y_df_processed['Attrition']

y_processed_department_df = y_df_processed[['Department_Human_Resources', 'Department_Research_and_Development', 'Department_Sales']]

X_train, X_test, y_attrition_train, y_attrition_test, y_department_train, y_department_test = train_test_split(X_processed_df, y_processed_attrition_df, y_processed_department_df)

X_train.head()


,Age,DistanceFromHome,Education,EnvironmentSatisfaction,EnvironmentSatisfaction.1,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Non-Travel,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,EducationField_Human Resources,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,JobRole_Healthcare Representative,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single
941,30,6,3,1,1,48,2,2,4,0,0,12,3,1,1,10,6,3,9,2,6,7,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1175,39,12,3,4,4,66,3,2,2,4,0,21,4,3,0,7,3,3,5,4,1,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1172,29,10,3,3,3,42,2,2,3,9,0,11,3,3,0,8,2,3,5,2,1,4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1463,31,5,3,2,2,74,3,2,1,0,0,19,3,2,0,10,2,3,9,4,1,7,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1308,38,2,4,2,2,77,1,2,4,2,1,20,4,1,2,20,4,2,4,2,0,3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
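
The split above passes three objects to `train_test_split`, which shuffles them with the same row permutation so features and both targets stay aligned. Because no `random_state` is set, the rows shown will differ from run to run. A minimal sketch on synthetic data (all `*_toy` names and values are made up) of a reproducible variant that also stratifies on the binary target, assuming that keeping the class ratio equal in both sets is desired:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 rows, 20 positives in the binary target
X_toy = pd.DataFrame({'feature': rng.normal(size=100)})
y_attr_toy = pd.Series([1] * 20 + [0] * 80, name='Attrition')
y_dept_toy = pd.DataFrame({'Department_Sales': rng.integers(0, 2, size=100).astype(float)})

X_tr, X_te, ya_tr, ya_te, yd_tr, yd_te = train_test_split(
    X_toy, y_attr_toy, y_dept_toy,
    random_state=42,      # fixed seed so the split is repeatable
    stratify=y_attr_toy,  # keep the 20/80 class ratio in both sets
)

print(len(X_tr), len(X_te), int(ya_te.sum()))
```

With the default `test_size` of 0.25, the 100 synthetic rows split into 75 training and 25 testing rows, and the index of each training target matches the index of the training features.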


In [10]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary

# ***** Already preprocessed all data to numeric data types *****



In [25]:
# Create a StandardScaler
scaler = StandardScaler()


# Fit the StandardScaler to the training data
X_scaler = scaler.fit(X_train)


# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
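
The scaler is fit on the training data only and then applied to both sets, so the test rows are standardized with the training mean and standard deviation. A small sketch on made-up arrays (`train`, `test`, and `toy_scaler` are illustrative names) that reproduces the transform by hand to show what `StandardScaler` computes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative arrays only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler().fit(train)
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population form, ddof=0, which is NumPy's default)
manual_mean = train.mean(axis=0)
manual_std = train.std(axis=0)
manual_test_scaled = (test - manual_mean) / manual_std

print(test_scaled)
print(manual_test_scaled)
```

The scaled training columns have mean 0, while the scaled test row does not have to, because it is measured against the training statistics.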




In [12]:
# Create a OneHotEncoder for the Department column


# Fit the encoder to the training data


# Create two new variables by applying the encoder
# to the training and testing data

# ***** Already preprocessed all data to numeric data types *****


In [13]:
# Create a OneHotEncoder for the Attrition column


# Fit the encoder to the training data


# Create two new variables by applying the encoder
# to the training and testing data

# ***** Already preprocessed all data to numeric data types *****

## Create, Compile, and Train the Model

In [26]:
# Find the number of columns in the X training data
column_count = X_train_scaled.shape[1]

# Create the input layer
input = layers.Input(shape=(column_count,), name='InputLayer')

# Create at least two shared layers
shared1 = layers.Dense(64, activation='relu')(input)
shared2 = layers.Dense(128, activation='relu')(shared1)

In [27]:
# Create a branch for Department
# with a hidden layer and an output layer

# Create the hidden layer
department_hidden = layers.Dense(32, activation='relu')(shared2)

# Create the output layer
department_output = layers.Dense(3, activation='softmax', name='department')(department_hidden)

In [28]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = layers.Dense(32, activation='relu')(shared2)

# Create the output layer
attrition_output = layers.Dense(1, activation='sigmoid', name='attrition')(attrition_hidden)

In [29]:
# Create the model
model = Model(inputs=input, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss={'department': 'categorical_crossentropy', 'attrition': 'binary_crossentropy'},
              metrics={'department': 'accuracy', 'attrition': 'accuracy'})


# Summarize the model
model.summary()
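
The model has two heads, each with its own loss: categorical cross-entropy on the softmax output for `department` and binary cross-entropy on the sigmoid output for `attrition`. When no `loss_weights` are passed to `compile`, the single `loss` value Keras reports is the sum of the per-output losses. The NumPy sketch below illustrates that arithmetic on made-up logits and targets; it omits the small epsilon clipping Keras applies and is not the library's implementation.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))


def categorical_crossentropy(y_true, y_prob):
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=-1)))


def binary_crossentropy(y_true, y_prob):
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


# Made-up pre-activation values and targets for two samples
dept_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
dept_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
attr_logits = np.array([1.2, -0.7])
attr_true = np.array([1.0, 0.0])

dept_prob = softmax(dept_logits)   # each row sums to 1
attr_prob = sigmoid(attr_logits)   # each value lies strictly between 0 and 1

dept_loss = categorical_crossentropy(dept_true, dept_prob)
attr_loss = binary_crossentropy(attr_true, attr_prob)
total_loss = dept_loss + attr_loss

print(dept_loss, attr_loss, total_loss)
```

Because the two losses are simply added, a head with a larger loss scale can dominate training; the `loss_weights` argument of `compile` exists to rebalance that if needed.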


In [30]:
# Train the model
model.fit(
    X_train_scaled,
    {'department': y_department_train, 'attrition': y_attrition_train},
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=2,
)


Epoch 1/100
28/28 - 3s - 92ms/step - attrition_accuracy: 0.7582 - department_accuracy: 0.6186 - loss: 1.3453 - val_attrition_accuracy: 0.8462 - val_department_accuracy: 0.8281 - val_loss: 1.0489
Epoch 2/100
28/28 - 0s - 4ms/step - attrition_accuracy: 0.8388 - department_accuracy: 0.8910 - loss: 0.7995 - val_attrition_accuracy: 0.8462 - val_department_accuracy: 0.9276 - val_loss: 0.7177
Epoch 3/100
28/28 - 0s - 4ms/step - attrition_accuracy: 0.8422 - department_accuracy: 0.9535 - loss: 0.5288 - val_attrition_accuracy: 0.8597 - val_department_accuracy: 0.9412 - val_loss: 0.5550
Epoch 4/100
28/28 - 0s - 4ms/step - attrition_accuracy: 0.8695 - department_accuracy: 0.9773 - loss: 0.4182 - val_attrition_accuracy: 0.8733 - val_department_accuracy: 0.9548 - val_loss: 0.4972
Epoch 5/100
28/28 - 0s - 4ms/step - attrition_accuracy: 0.8831 - department_accuracy: 0.9841 - loss: 0.3565 - val_attrition_accuracy: 0.8733 - val_department_accuracy: 0.9548 - val_loss: 0.4878
Epoch 6/100
28/28 - 0s - 4ms/

<keras.src.callbacks.history.History at 0x2792aa06780>
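
Training runs for a fixed 100 epochs, while the log shows validation loss already flattening near 0.49 by epoch 5 and the final test loss is 1.8691, which may indicate overfitting. Keras provides an `EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments) for this situation. The pure-Python sketch below only illustrates the patience rule behind it; the function name `best_epoch_with_patience` and the loss values are hypothetical and are not results from this notebook.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Scan validation losses and stop once `patience` epochs pass without improvement.

    Returns the zero-based index and value of the best loss seen before stopping.
    """
    best_loss = float('inf')
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop scanning: no improvement for `patience` epochs
    return best_epoch, best_loss


# Hypothetical validation-loss curve, for illustration only
hypothetical_val_losses = [1.05, 0.72, 0.56, 0.50, 0.49, 0.51, 0.55, 0.60, 0.48]

best_epoch, best_loss = best_epoch_with_patience(hypothetical_val_losses, patience=3)
print(best_epoch, best_loss)
```

In this made-up curve the scan stops after three non-improving epochs and never reaches the later 0.48, which is the trade-off a patience setting makes.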

In [31]:
# Evaluate the model with the testing data
results = model.evaluate(
    X_test_scaled,
    {'department': y_department_test, 'attrition': y_attrition_test},
    verbose=2,
)




12/12 - 0s - 4ms/step - attrition_accuracy: 0.8234 - department_accuracy: 0.9701 - loss: 1.8691


In [35]:
# Print the accuracy for both department and attrition

print(f"Attrition Accuracy: {results[1]*100:.2f}% Department Accuracy: {results[2]*100:.2f}% Loss: {results[0]:.4f}")
results


Attrition Accuracy: 82.34% Department Accuracy: 97.01% Loss: 1.8691


[1.8690659999847412, 0.823369562625885, 0.970108687877655]
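
The print statement relies on the position of each value in `results`, which is easy to get wrong if outputs or metrics are reordered. One way to make this more robust is to pair each value with its name; `model.evaluate` also accepts `return_dict=True` to return such a mapping directly. The sketch below rebuilds the mapping by hand from the numbers printed above. The order of `metric_names` is an assumption read off the evaluation log line, not queried from the model.

```python
# Names in the order they appear in the evaluation log (assumed, see lead-in)
metric_names = ['loss', 'attrition_accuracy', 'department_accuracy']

# Values copied from the notebook output above
results = [1.8690659999847412, 0.823369562625885, 0.970108687877655]

results_by_name = dict(zip(metric_names, results))

summary = (
    f"Attrition Accuracy: {results_by_name['attrition_accuracy'] * 100:.2f}% "
    f"Department Accuracy: {results_by_name['department_accuracy'] * 100:.2f}% "
    f"Loss: {results_by_name['loss']:.4f}"
)
print(summary)
```

Looking values up by name keeps the report correct even if the list order changes between Keras versions or model definitions.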

# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Yes, I believe accuracy is the best metric because the cost of being wrong is high. For example, it could be extremely costly if a key employee is predicted to stay but then leaves. Predicting that someone will leave when they do not is less problematic, because there is little harm in addressing the factors that suggest they might leave (it is far less expensive to retain an employee than to replace and train one). That said, accuracy weighs both kinds of error equally, so recall on the attrition class is worth reporting alongside it. It is also important to know the key contributing factors to attrition, so further analysis would be needed to determine them. Once known, those features can be used to build a retention plan for employees who appear likely to leave the company.
2. For Department I chose softmax. Softmax is a good choice when there are more than two classes and every record belongs to exactly one of the classes seen in the training data. One caveat is that softmax always picks one of the known classes, so an unseen value in the 'department' data would be misclassified as one of them. For Attrition I chose sigmoid, as it is well suited to a binary outcome.
3. It would be ideal to have the employment and salary history for the employees' past positions. Since this information is often acquired during the interview process, it should be available. It would allow the model to recognize job-change patterns and use them to improve accuracy.
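
Answer 1 distinguishes two kinds of error: predicting that an employee will stay when they leave, and predicting that they will leave when they stay. A minimal sketch with made-up labels (`y_true` and `y_pred` are illustrative, not model output) showing how accuracy, precision, and recall separate those cases; scikit-learn's `sklearn.metrics` module offers `recall_score`, `precision_score`, and `confusion_matrix` for the same purpose.

```python
import numpy as np

# Illustrative labels only: 1 = employee left, 0 = employee stayed
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # leavers correctly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # leavers missed (the costly case)
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # stayers wrongly flagged
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # stayers correctly cleared

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In this toy example accuracy is 0.80 even though only one of the three leavers is caught, which is the situation the first answer describes as the expensive one.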