## Part 1: Preprocessing

In [1]:
!pip install tensorflow
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras import layers, Input
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()



Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                        43
Attrition                   2
BusinessTravel              3
Department                  3
DistanceFromHome           29
Education                   5
EducationField              6
EnvironmentSatisfaction     4
HourlyRate                 71
JobInvolvement              4


In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
print(y_df)

     Attrition              Department
0          Yes                   Sales
1           No  Research & Development
2          Yes  Research & Development
3           No  Research & Development
4           No  Research & Development
...        ...                     ...
1465        No  Research & Development
1466        No  Research & Development
1467        No  Research & Development
1468        No                   Sales
1469        No  Research & Development

[1470 rows x 2 columns]


In [14]:
# Create a list of at least 10 column names to use as X data
column_names = ['Age', 'BusinessTravel', 'PercentSalaryHike', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'TotalWorkingYears', 'HourlyRate', 'JobInvolvement']


# Create X_df using your selected columns
X_df = attrition_df[column_names]


# Show the data types for X_df
dtypes = X_df.dtypes
print(dtypes)



Age                         int64
BusinessTravel             object
PercentSalaryHike           int64
DistanceFromHome            int64
Education                   int64
EducationField             object
EnvironmentSatisfaction     int64
TotalWorkingYears           int64
HourlyRate                  int64
JobInvolvement              int64
dtype: object


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)



X_train shape: (1176, 10)
X_test shape: (294, 10)
y_train shape: (1176, 2)
y_test shape: (294, 2)
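
If the Attrition classes turn out to be imbalanced (an assumption; the class counts are not shown above), a stratified split keeps the Yes/No ratio the same in both sets. The following is a minimal sketch on made-up data; `X_toy`, `y_toy`, and their values are hypothetical stand-ins for `X_df` and `y_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% "Yes" attrition.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.DataFrame({
    "Attrition": ["Yes"] * 20 + ["No"] * 80,
    "Department": ["Sales", "Research & Development"] * 50,
})

# stratify= takes the labels whose proportions should be preserved.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy["Attrition"]
)

# Both splits should show the same 80/20 ratio.
print(y_tr["Attrition"].value_counts(normalize=True))
print(y_te["Attrition"].value_counts(normalize=True))
```

Stratifying on a single column is the simplest option when `y` holds two targets; here Attrition is chosen because it is the target most likely to be skewed.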


In [6]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Ensure both training and testing data have the same dummy variables
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
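
To see what `align` does in this step, here is a minimal sketch on made-up data (the `train` and `test` frames and their `color` values are hypothetical, not taken from the attrition dataset). A category seen only in the test split is dropped, and a category missing from the test split is filled with 0, so both frames end up with exactly the training columns.

```python
import pandas as pd

# Hypothetical toy frames: "green" appears only in train, "blue" only in test.
train = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "red"]})
test = pd.DataFrame({"size": [4, 5], "color": ["red", "blue"]})

train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)

# join="left" keeps exactly the training columns; columns the test frame
# lacks are created and filled with 0, and test-only columns are dropped.
train_d, test_d = train_d.align(test_d, join="left", axis=1, fill_value=0)

print(list(train_d.columns))
print(test_d)
```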


In [7]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data using the same scaler
X_test_scaled = scaler.transform(X_test)

# Convert scaled data back to DataFrame (for easier handling)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Display the data
print(X_train_scaled_df.head())
print(X_test_scaled_df.head())


        Age  PercentSalaryHike  DistanceFromHome  Education  \
0 -1.388559          -0.339249          1.440396  -0.863356   
1 -2.040738          -0.066365         -0.522699  -0.863356   
2 -0.845077          -0.339249          1.317703  -0.863356   
3  0.241886           1.570943          0.336155   0.099933   
4 -0.627685          -1.157903          1.317703   0.099933   

   EnvironmentSatisfaction  TotalWorkingYears  HourlyRate  JobInvolvement  \
0                 0.279706          -1.167368   -0.472832       -1.012340   
1                -0.639104          -1.423397    0.309374        0.389912   
2                 1.198515          -0.143254   -1.059487        0.389912   
3                 1.198515          -0.527297   -0.032841        0.389912   
4                -0.639104          -0.143254    1.091580        0.389912   

   BusinessTravel_Non-Travel  BusinessTravel_Travel_Frequently  \
0                  -0.326041                         -0.490414   
1                   3.0670

In [8]:
# Create a OneHotEncoder for the Department column
department_encoder = OneHotEncoder()

# Fit the encoder to the training data
department_encoder.fit(y_train[['Department']])

# Create two new variables by applying the encoder
# to the training and testing data
# (distinct names, so the Attrition encoding in the next cell does not overwrite them)
y_train_department = department_encoder.transform(y_train[['Department']])
y_test_department = department_encoder.transform(y_test[['Department']])


In [9]:
# Create a OneHotEncoder for the Attrition column
encoder = OneHotEncoder()

# Fit the encoder to the training data
encoder.fit(y_train[['Attrition']])

# Create two new variables by applying the encoder
# to the training and testing data
y_train_encoded = encoder.transform(y_train[['Attrition']])
y_test_encoded = encoder.transform(y_test[['Attrition']])
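
Because the model has two outputs, each target needs its own encoded arrays. The sketch below shows one way to keep them apart; the helper `encode_target`, the toy frames, and the placeholder department name are all hypothetical and only illustrate the shapes involved.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for y_train / y_test.
y_train_toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No"],
    "Department": ["Sales", "Research & Development", "Third Department", "Sales"],
})
y_test_toy = pd.DataFrame({
    "Attrition": ["No", "Yes"],
    "Department": ["Sales", "Research & Development"],
})

def encode_target(train_col, test_col):
    """Fit a OneHotEncoder on the training column; return it with dense arrays."""
    enc = OneHotEncoder()
    enc.fit(train_col)
    return enc, enc.transform(train_col).toarray(), enc.transform(test_col).toarray()

# One encoder and one pair of arrays per target, under separate names.
dept_enc, y_train_dept, y_test_dept = encode_target(
    y_train_toy[["Department"]], y_test_toy[["Department"]]
)
attr_enc, y_train_attr, y_test_attr = encode_target(
    y_train_toy[["Attrition"]], y_test_toy[["Attrition"]]
)

print(y_train_dept.shape, y_train_attr.shape)
```

The second dimension of each array (three department categories, two attrition categories in this toy data) is what the matching output layer's `units` would need to equal.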



## Create, Compile, and Train the Model

In [10]:
# Find the number of columns in the X training data
number_input_features = len(X_train.columns)

# Define the main input layer
main_input = Input(shape=(number_input_features,), name="main_input")

# Create shared layers
shared_layer1 = layers.Dense(units=32, activation='relu')(main_input)
shared_layer2 = layers.Dense(units=16, activation='relu')(shared_layer1)

# Create a branch for Department
department_hidden = layers.Dense(units=64, activation='relu')(shared_layer2)
department_output = layers.Dense(units=2, activation='softmax', name="department_output")(department_hidden)

# Create a branch for Attrition
attrition_hidden = layers.Dense(units=64, activation='relu')(shared_layer2)  # Connect to shared_layer2
attrition_output = layers.Dense(units=2, activation='softmax', name="attrition_output")(attrition_hidden)

# Create the model with one input and two outputs
model = Model(inputs=main_input, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[['accuracy'], ['accuracy']])


# Summarize the model
model.summary()

# Evaluate the untrained model on the test data as a baseline
if issparse(y_test_encoded):
    y_test_encoded_dense = y_test_encoded.toarray()
else:
    y_test_encoded_dense = y_test_encoded  # Already dense, no conversion needed

loss, department_loss, attrition_loss, department_accuracy, attrition_accuracy = model.evaluate(
    X_test_scaled, [y_test_encoded_dense, y_test_encoded_dense], verbose=0
)


# Print the evaluation results
print("Loss:", loss)
print("Department Loss:", department_loss)
print("Attrition Loss:", attrition_loss)
print("Department Accuracy:", department_accuracy)
print("Attrition Accuracy:", attrition_accuracy)



Loss: 1.2803040742874146
Department Loss: 0.658382773399353
Attrition Loss: 0.618008017539978
Department Accuracy: 0.7108843326568604
Attrition Accuracy: 0.646258533000946


In [11]:
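# Train the model
# Note: y_train_encoded holds the Attrition encoding from In [9],
# so both outputs are fit against the Attrition labels here.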
model.fit(
    X_train_scaled,
    [y_train_encoded.toarray(), y_train_encoded.toarray()],
    epochs=10,
    shuffle=True,
    verbose=2
)





Epoch 1/10
37/37 - 3s - 89ms/step - attrition_output_accuracy: 0.8248 - attrition_output_loss: 0.5083 - department_output_accuracy: 0.8180 - department_output_loss: 0.5179 - loss: 1.0291
Epoch 2/10
37/37 - 0s - 4ms/step - attrition_output_accuracy: 0.8316 - attrition_output_loss: 0.4448 - department_output_accuracy: 0.8316 - department_output_loss: 0.4504 - loss: 0.8956
Epoch 3/10
37/37 - 0s - 8ms/step - attrition_output_accuracy: 0.8316 - attrition_output_loss: 0.4282 - department_output_accuracy: 0.8316 - department_output_loss: 0.4345 - loss: 0.8617
Epoch 4/10
37/37 - 0s - 9ms/step - attrition_output_accuracy: 0.8316 - attrition_output_loss: 0.4187 - department_output_accuracy: 0.8316 - department_output_loss: 0.4236 - loss: 0.8420
Epoch 5/10
37/37 - 0s - 8ms/step - attrition_output_accuracy: 0.8325 - attrition_output_loss: 0.4120 - department_output_accuracy: 0.8316 - department_output_loss: 0.4164 - loss: 0.8302
Epoch 6/10
37/37 - 0s - 8ms/step - attrition_output_accuracy: 0.8342 

<keras.src.callbacks.history.History at 0x799b65d421d0>

In [12]:
# Evaluate the model with the testing data
# return_dict=True keys each value by name, so it cannot be mislabeled by position
results = model.evaluate(
    X_test_scaled, [y_test_encoded.toarray(), y_test_encoded.toarray()],
    verbose=2, return_dict=True
)
model_loss = results["loss"]
department_loss = results["department_output_loss"]
attrition_loss = results["attrition_output_loss"]
department_accuracy = results["department_output_accuracy"]
attrition_accuracy = results["attrition_output_accuracy"]

# Print the evaluation results
print("Loss:", model_loss)
print("Department Loss:", department_loss)
print("Attrition Loss:", attrition_loss)
print("Department Accuracy:", department_accuracy)
print("Attrition Accuracy:", attrition_accuracy)





10/10 - 0s - 10ms/step - attrition_output_accuracy: 0.8741 - attrition_output_loss: 0.3074 - department_output_accuracy: 0.8673 - department_output_loss: 0.3044 - loss: 0.6498
Loss: 0.6498453617095947
Department Loss: 0.30435875058174133
Attrition Loss: 0.3073651194572449
Department Accuracy: 0.8673469424247742
Attrition Accuracy: 0.8741496801376343


In [13]:
# Print the accuracy for both department and attrition
print("Department Accuracy:", department_accuracy)
print("Attrition Accuracy:", attrition_accuracy)


Department Accuracy: 0.8673469424247742
Attrition Accuracy: 0.8741496801376343


# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. After training, the accuracy scores were over 0.8 for both Department and Attrition. Before training, Attrition had a much lower accuracy score. Before training the model: Department Accuracy: 0.8401360511779785, Attrition Accuracy: 0.24829931557178497. After training the model: Department Accuracy: 0.8639456033706665, Attrition Accuracy: 0.8673469424247742. With that said, I think accuracy is a good metric to use because it puts the model above a 75% threshold while leaving room for improvement. I do not believe the model is over- or under-fitted. I know that the goal is to get a metric close to 1.0, but given the potential for over- or under-fitting, I am satisfied with a score over 0.80. Given more time, I would try different metrics to see whether I can achieve higher results.
2. For the output layers, both the Department and Attrition branches use the softmax activation, because each target is one-hot encoded and the model is compiled with categorical cross-entropy loss. For my input features, I selected: Age, BusinessTravel, PercentSalaryHike, DistanceFromHome, Education, EducationField, EnvironmentSatisfaction, TotalWorkingYears, HourlyRate, JobInvolvement. I selected these because, in going through the complete list, they resonated with me as things that would affect my own decision to stay with a job or leave (attrition). At first, I was going to use the 10 columns with the highest number of unique values. Then I thought maybe I should select the 10 columns with the fewest unique values. In the end, I decided that taking the top 10 or bottom 10 would skew my answer, so I selected the features I felt most influence me, regardless of their number of unique values.
3. If I could, I would have included 'BusinessLocation' and 'WorkFromHome' in the data, as those two items also play a role in employee attrition, with 'WorkFromHome' becoming a significant driver of employee satisfaction since the onset of Covid-19. I also think this model could be built more efficiently, particularly in the create, compile, and train steps. I did some consolidation in that area, but it remains to be seen whether my changes are acceptable to you all.
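
One way to put an accuracy figure in context is to compare it with what a trivial classifier would score. The sketch below uses made-up labels (84 "No", 16 "Yes"; these are hypothetical, not the real attrition counts) and a "model" that always predicts the majority class, then reports plain accuracy next to balanced accuracy and a confusion matrix from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical labels: 84 "No" and 16 "Yes" (not the real attrition counts).
y_true = np.array(["No"] * 84 + ["Yes"] * 16)
# A degenerate "model" that always predicts the majority class.
y_pred = np.array(["No"] * 100)

acc = accuracy_score(y_true, y_pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])

print("Accuracy:", acc)
print("Balanced accuracy:", bal_acc)
print(cm)
```

In this toy case plain accuracy is 0.84 even though no "Yes" case is ever identified, while balanced accuracy is 0.5. If the real targets are similarly skewed, reporting per-class recall alongside accuracy would show whether the model is learning the minority class.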