## Part 1: Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

!pip install tensorflow
import tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras import layers

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()



Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column.
attrition_df.nunique()

Age                         43
Attrition                    2
BusinessTravel               3
Department                   3
DistanceFromHome            29
Education                    5
EducationField               6
EnvironmentSatisfaction      4
HourlyRate                  71
JobInvolvement               4
JobLevel                     5
JobRole                      9
JobSatisfaction              4
MaritalStatus                3
NumCompaniesWorked          10
OverTime                     2
PercentSalaryHike           15
PerformanceRating            2
RelationshipSatisfaction     4
StockOptionLevel             4
TotalWorkingYears           40
TrainingTimesLastYear        7
WorkLifeBalance              4
YearsAtCompany              37
YearsInCurrentRole          19
YearsSinceLastPromotion     16
YearsWithCurrManager        18
dtype: int64

In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]
y_df.head()


Unnamed: 0,Attrition,Department
0,Yes,Sales
1,No,Research & Development
2,Yes,Research & Development
3,No,Research & Development
4,No,Research & Development


In [4]:
# Create a list of at least 10 column names to use as X data
column_names = ["Education", "Age",
                "DistanceFromHome",
                "JobSatisfaction",
                "OverTime",
                "StockOptionLevel",
                "WorkLifeBalance",
                "YearsAtCompany",
                "YearsSinceLastPromotion",
                "NumCompaniesWorked"]

# Create X_df using your selected columns
X_df = attrition_df[column_names].copy()
X_df.head()

# Show the data types for X_df
X_df.dtypes

Education                   int64
Age                         int64
DistanceFromHome            int64
JobSatisfaction             int64
OverTime                   object
StockOptionLevel            int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsSinceLastPromotion     int64
NumCompaniesWorked          int64
dtype: object

In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, random_state=42)


In [6]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
print(X_train.value_counts('OverTime'))

# Map the Yes/No strings to 1/0
X_train['OverTime'] = X_train['OverTime'].map({'Yes': 1, 'No': 0})
X_test['OverTime'] = X_test['OverTime'].map({'Yes': 1, 'No': 0})

# Display the value counts of the 'OverTime' column in the training and testing sets
print("Value counts of 'OverTime' column in training set:")
print(X_train.value_counts('OverTime'))

print("\nValue counts of 'OverTime' column in testing set:")
print(X_test.value_counts('OverTime'))

OverTime
No     780
Yes    322
Name: count, dtype: int64
Value counts of 'OverTime' column in training set:
OverTime
0    780
1    322
Name: count, dtype: int64

Value counts of 'OverTime' column in testing set:
OverTime
0    274
1     94
Name: count, dtype: int64


In [7]:
# Create a StandardScaler
scaler = StandardScaler()

# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert the scaled data back to DataFrame to maintain column names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)



In [8]:
# Create a OneHotEncoder for the Department column
from sklearn.preprocessing import OneHotEncoder
department_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the training data
department_encoder.fit(y_train[['Department']])

# Transform the 'Department' column in the training and testing data
y_train_department_encoded = department_encoder.transform(y_train[['Department']])
y_test_department_encoded = department_encoder.transform(y_test[['Department']])

# Convert the encoded columns to DataFrames
y_train_department_encoded_df = pd.DataFrame(y_train_department_encoded, columns=department_encoder.get_feature_names_out(['Department']))
y_test_department_encoded_df = pd.DataFrame(y_test_department_encoded, columns=department_encoder.get_feature_names_out(['Department']))

# Print the resulting arrays
print("\nEncoded 'Department' column in training set:")
print(y_train_department_encoded_df[:5])

print("\nEncoded 'Department' column in testing set:")
print(y_test_department_encoded_df[:5])


Encoded 'Department' column in training set:
   Department_Human Resources  Department_Research & Development  \
0                         0.0                                1.0   
1                         0.0                                0.0   
2                         0.0                                0.0   
3                         0.0                                0.0   
4                         0.0                                0.0   

   Department_Sales  
0               0.0  
1               1.0  
2               1.0  
3               1.0  
4               1.0  

Encoded 'Department' column in testing set:
   Department_Human Resources  Department_Research & Development  \
0                         0.0                                0.0   
1                         0.0                                1.0   
2                         1.0                                0.0   
3                         0.0                                1.0   
4                         0.

In [9]:
# Create a OneHotEncoder for the Attrition column

attrition_encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder to the 'Attrition' column in the training data
attrition_encoder.fit(y_train[['Attrition']])

# Transform the 'Attrition' column in the training and testing data
y_train_attrition_encoded = attrition_encoder.transform(y_train[['Attrition']])
y_test_attrition_encoded = attrition_encoder.transform(y_test[['Attrition']])

# Convert the encoded columns to DataFrames
y_train_attrition_encoded_df = pd.DataFrame(y_train_attrition_encoded, columns=attrition_encoder.get_feature_names_out(['Attrition']))
y_test_attrition_encoded_df = pd.DataFrame(y_test_attrition_encoded, columns=attrition_encoder.get_feature_names_out(['Attrition']))

# Print the resulting arrays for verification
print("\nEncoded 'Attrition' column in training set:")
print(y_train_attrition_encoded_df[:5])

print("\nEncoded 'Attrition' column in testing set:")
print(y_test_attrition_encoded_df[:5])



Encoded 'Attrition' column in training set:
   Attrition_No  Attrition_Yes
0           1.0            0.0
1           1.0            0.0
2           1.0            0.0
3           1.0            0.0
4           1.0            0.0

Encoded 'Attrition' column in testing set:
   Attrition_No  Attrition_Yes
0           1.0            0.0
1           1.0            0.0
2           0.0            1.0
3           1.0            0.0
4           1.0            0.0


## Create, Compile, and Train the Model

In [10]:
# Find the number of columns in the X training data
X_train_scaled.shape[1]

# Create the input layer
input_layer = layers.Input(shape=(X_train_scaled.shape[1],))

# Create at least two shared layers
shared_layer1 = layers.Dense(64, activation='relu')(input_layer)
shared_layer2 = layers.Dense(128, activation='relu')(shared_layer1)

In [11]:
# Create a branch for Department with a hidden layer and an output layer

# Create the hidden layer
department_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
department_output = layers.Dense(3, activation='softmax', name='department_output')(department_hidden)


In [12]:
# Create a branch for Attrition
# with a hidden layer and an output layer

# Create the hidden layer
attrition_hidden = layers.Dense(32, activation='relu')(shared_layer2)

# Create the output layer
attrition_output = layers.Dense(2, activation='sigmoid', name='attrition_output')(attrition_hidden)


In [13]:
# Create the model

model = Model(inputs=input_layer, outputs=[department_output, attrition_output])

# Compile the model
model.compile(optimizer='adam',
              loss={'department_output': 'categorical_crossentropy',
                    'attrition_output': 'binary_crossentropy'},
              metrics={'department_output': 'accuracy',
                       'attrition_output': 'accuracy'})

# Summarize the model
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 10)]                 0         []                            
                                                                                                  
 dense (Dense)               (None, 64)                   704       ['input_1[0][0]']             
                                                                                                  
 dense_1 (Dense)             (None, 128)                  8320      ['dense[0][0]']               
                                                                                                  
 dense_2 (Dense)             (None, 32)                   4128      ['dense_1[0][0]']             
                                                                                              

In [14]:
# Train the model
history = model.fit(X_train_scaled,
                    {'department_output': y_train_department_encoded_df, 'attrition_output': y_train_attrition_encoded},
                    epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


In [17]:
# Evaluate the model with the testing data
model.evaluate(X_test_scaled,
               {'department_output': y_test_department_encoded_df, 'attrition_output': y_test_attrition_encoded})



[2.1177473068237305,
 1.4467602968215942,
 0.6709871888160706,
 0.570652186870575,
 0.820652186870575]

In [18]:
# Print the final-epoch training accuracy for both department and attrition
# (the test-set accuracies are the last two values returned by model.evaluate above)
print("Department Training Accuracy:", history.history['department_output_accuracy'][-1])
print("Attrition Training Accuracy:", history.history['attrition_output_accuracy'][-1])

Department Training Accuracy: 0.9239500761032104
Attrition Training Accuracy: 0.9738932847976685


In [19]:
from sklearn.metrics import confusion_matrix, classification_report

# Make predictions
predictions = model.predict(X_test_scaled)

# Get department predictions
department_pred_probs = predictions[0]  # First output
department_preds = np.argmax(department_pred_probs, axis=1)  # Convert to class labels
department_true = np.argmax(y_test_department_encoded, axis=1)  # True labels

# Get attrition predictions
attrition_pred_probs = predictions[1]  # Second output
attrition_preds = np.argmax(attrition_pred_probs, axis=1)  # Convert to class labels
attrition_true = np.argmax(y_test_attrition_encoded, axis=1)  # True labels

# Generate confusion matrices
department_conf_matrix = confusion_matrix(department_true, department_preds)
attrition_conf_matrix = confusion_matrix(attrition_true, attrition_preds)

# Print confusion matrices
print("Confusion Matrix for Department Prediction:")
print(department_conf_matrix)

print("\nConfusion Matrix for Attrition Prediction:")
print(attrition_conf_matrix)

# Optional: Print classification reports for more detailed metrics
print("\nClassification Report for Department Prediction:")
print(classification_report(department_true, department_preds))

print("\nClassification Report for Attrition Prediction:")
print(classification_report(attrition_true, attrition_preds))


Confusion Matrix for Department Prediction:
[[  0  14   4]
 [  4 182  54]
 [  1  81  28]]

Confusion Matrix for Attrition Prediction:
[[292  28]
 [ 38  10]]

Classification Report for Department Prediction:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        18
           1       0.66      0.76      0.70       240
           2       0.33      0.25      0.29       110

    accuracy                           0.57       368
   macro avg       0.33      0.34      0.33       368
weighted avg       0.53      0.57      0.54       368


Classification Report for Attrition Prediction:
              precision    recall  f1-score   support

           0       0.88      0.91      0.90       320
           1       0.26      0.21      0.23        48

    accuracy                           0.82       368
   macro avg       0.57      0.56      0.57       368
weighted avg       0.80      0.82      0.81       368



# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?

**1.**
**No**:
Judged by the training accuracy alone (about 0.92 for Department and 0.97 for Attrition), the model appears to perform very well. The test-set results tell a different story: accuracy drops to about 0.57 for Department and 0.82 for Attrition, and the confusion matrices show that the minority classes are rarely predicted correctly.

Accuracy is misleading here because the classes are imbalanced. Only 48 of the 368 test employees left the company, so a model that always predicts "No" would score about 0.87 on Attrition, which is higher than this model's 0.82. Per-class precision, recall, and F1-score are more informative: recall for the "Yes" class is only 0.21, and the Human Resources department (class 0) is never predicted correctly.

The gap between training and test accuracy also points to overfitting: the model may be learning the training data too well, capturing noise and details that do not generalize to new, unseen data.
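
As a rough check on why accuracy can mislead, the sketch below recomputes accuracy, the majority-class baseline, and per-class precision and recall directly from the Attrition confusion matrix printed earlier in the notebook. The helper name `per_class_metrics` is an illustrative assumption and not part of the notebook; the matrix values are copied from the output above.

```python
# Attrition confusion matrix copied from the notebook output
# (rows = true class, columns = predicted class; 0 = "No", 1 = "Yes").
attrition_cm = [[292, 28],
                [38, 10]]


def per_class_metrics(cm):
    """Return (accuracy, majority_baseline, [(precision, recall), ...])."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Accuracy: correct predictions (the diagonal) over all samples.
    accuracy = sum(cm[i][i] for i in range(n)) / total
    # Baseline: accuracy of always predicting the most frequent true class.
    majority_baseline = max(sum(row) for row in cm) / total
    metrics = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = true_positives / predicted_k if predicted_k else 0.0
        recall = true_positives / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return accuracy, majority_baseline, metrics


accuracy, baseline, metrics = per_class_metrics(attrition_cm)
print(f"accuracy={accuracy:.3f}  majority baseline={baseline:.3f}")
for label, (precision, recall) in enumerate(metrics):
    print(f"class {label}: precision={precision:.3f}  recall={recall:.3f}")
```

For this matrix the majority baseline comes out above the model's accuracy, which is the sense in which accuracy alone says little about how well attrition is being detected.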

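**2.**
**Softmax and sigmoid**:
The Department output layer uses softmax because the three departments are mutually exclusive, and softmax turns the three outputs into probabilities that sum to 1. The Attrition output layer uses sigmoid because attrition is a binary (Yes/No) outcome; it is paired with the binary cross-entropy loss.

ReLU was used in the shared and hidden layers, not the output layers. It was chosen for its computational efficiency, which speeds up training, and for its ability to learn shared features in deep networks. Its tendency to produce sparse activations can also help the model focus on the most relevant features.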
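
The output layers defined earlier in the notebook use `softmax` (Department, 3 units) and `sigmoid` (Attrition, 2 units). The sketch below shows the practical difference between the two using only the standard library. The logit values are made-up inputs for illustration, not values taken from the trained model.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Squash a single score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-activation scores for one employee.
department_logits = [2.0, 1.0, 0.1]
attrition_logits = [0.3, -1.2]

department_probs = softmax(department_logits)
attrition_probs = [sigmoid(z) for z in attrition_logits]

print("softmax:", department_probs, "sum =", sum(department_probs))
print("sigmoid:", attrition_probs, "sum =", sum(attrition_probs))
```

Softmax couples the outputs so they compete for a total probability of 1, which suits mutually exclusive classes. Sigmoid treats each unit independently, so the two Attrition outputs need not sum to 1; a single sigmoid unit, or a 2-unit softmax, would be a common alternative for a Yes/No target.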

**3.**
***Model Tuning:***
Experiment with different architectures, such as adding more layers or changing the number of neurons in existing layers.
Tune hyperparameters like learning rate, batch size, and number of epochs.

***Balanced Dataset:***
Check whether the classes are balanced for both outputs. Both are imbalanced here (Human Resources accounts for only 18 of the 368 test rows, and only 48 of the 368 test employees left the company), so techniques like oversampling, undersampling, or class weighting should be considered.
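
One way to act on class imbalance without resampling is to weight rare classes more heavily in the loss. The sketch below computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)`. The function name is hypothetical, and the label counts reuse the Department support figures from the test-set classification report purely for illustration; in practice the weights would be computed from the training labels.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * count), so a perfectly balanced
    dataset yields a weight of 1.0 for every class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Illustrative labels: 18 Human Resources (0), 240 R&D (1), 110 Sales (2).
labels = [0] * 18 + [1] * 240 + [2] * 110
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

For a two-output model like this one, the per-class weights would probably need to be converted into per-sample weights for each output before being passed to `model.fit`; treat that wiring as an assumption to verify against the Keras version in use.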

***Regularization:***
Apply regularization techniques like dropout or L2 regularization to prevent overfitting and improve generalization.
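
To make the two regularization techniques concrete, here is a minimal standard-library sketch of what each one computes. The function names, weights, and rates are illustrative assumptions. In Keras the same ideas are available as `layers.Dropout(rate)` and `kernel_regularizer=regularizers.l2(...)` on a `Dense` layer.

```python
import random


def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights.

    This value is added to the training loss, so large weights are
    penalized and the optimizer is nudged toward smaller ones.
    """
    return lam * sum(w * w for w in weights)


def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` during training.

    Surviving activations are scaled by 1 / (1 - rate) so the expected
    value of each activation is unchanged.
    """
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Hypothetical layer weights and activations.
weights = [1.0, -2.0, 0.5]
penalty = l2_penalty(weights, lam=0.01)

rng = random.Random(0)  # seeded for reproducibility
dropped = inverted_dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng)

print("L2 penalty:", penalty)
print("after dropout:", dropped)
```

Dropout is applied only during training; at inference time the activations pass through unchanged, which is why the surviving values are rescaled during training.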