<a href="https://colab.research.google.com/github/meikaykam/neural-network-challenge-2/blob/main/attrition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preprocessing

In [1]:
# Import our dependencies
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

#  Import and read the attrition data
attrition_df = pd.read_csv('https://static.bc-edx.com/ai/ail-v-1-0/m19/lms/datasets/attrition.csv')
attrition_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,Sales,1,2,Life Sciences,2,94,3,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,Research & Development,8,1,Life Sciences,3,61,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,Research & Development,2,2,Other,4,92,2,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,Research & Development,3,4,Life Sciences,4,56,3,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,Research & Development,2,1,Medical,1,40,3,...,3,4,1,6,3,3,2,2,2,2


In [2]:
# Determine the number of unique values in each column
attrition_df.nunique()

Unnamed: 0,0
Age,43
Attrition,2
BusinessTravel,3
Department,3
DistanceFromHome,29
Education,5
EducationField,6
EnvironmentSatisfaction,4
HourlyRate,71
JobInvolvement,4


In [3]:
# Create y_df with the Attrition and Department columns
y_df = attrition_df[['Attrition', 'Department']]

In [4]:
# Create a list of at least 10 column names to use as X data
list_column = ['HourlyRate',
               'PerformanceRating',
               'TotalWorkingYears',
               'YearsAtCompany',
               'YearsInCurrentRole',
               'YearsSinceLastPromotion',
               'DistanceFromHome',
               'WorkLifeBalance',
               'JobSatisfaction',
               'EnvironmentSatisfaction']

# Create X_df using your selected columns
X_df = attrition_df[list_column]

# Show the data types for X_df
X_df.dtypes

Unnamed: 0,0
HourlyRate,int64
PerformanceRating,int64
TotalWorkingYears,int64
YearsAtCompany,int64
YearsInCurrentRole,int64
YearsSinceLastPromotion,int64
DistanceFromHome,int64
WorkLifeBalance,int64
JobSatisfaction,int64
EnvironmentSatisfaction,int64


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)

In [6]:
# Display the shapes of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (1176, 10)
X_test shape: (294, 10)
y_train shape: (1176, 2)
y_test shape: (294, 2)


In [7]:
# Convert your X data to numeric data types however you see fit
# Add new code cells as necessary
X_df.dtypes


Unnamed: 0,0
HourlyRate,int64
PerformanceRating,int64
TotalWorkingYears,int64
YearsAtCompany,int64
YearsInCurrentRole,int64
YearsSinceLastPromotion,int64
DistanceFromHome,int64
WorkLifeBalance,int64
JobSatisfaction,int64
EnvironmentSatisfaction,int64


In [8]:
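# Convert categorical columns to numeric
# (all ten selected columns are already int64, so this step and the two below
# leave the values unchanged; they also act on X_df, not on the X_train/X_test
# splits created in the earlier cell)
X_df = pd.get_dummies(X_df, drop_first=True)

# Convert non-numeric values to NaN (if any)
X_df = X_df.apply(pd.to_numeric, errors='coerce')

# Replace missing values with the median
X_df = X_df.fillna(X_df.median())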

# Check data type again
print(X_df.dtypes)

HourlyRate                 int64
PerformanceRating          int64
TotalWorkingYears          int64
YearsAtCompany             int64
YearsInCurrentRole         int64
YearsSinceLastPromotion    int64
DistanceFromHome           int64
WorkLifeBalance            int64
JobSatisfaction            int64
EnvironmentSatisfaction    int64
dtype: object


In [9]:
# Create a StandardScaler
X_scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Verify the transformation
print("X_train_scaled shape:", X_train_scaled.shape)
print("X_test_scaled shape:", X_test_scaled.shape)

X_train_scaled shape: (1176, 10)
X_test_scaled shape: (294, 10)
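
The scaler above is fit on the training split only, and the same training mean and standard deviation are then applied to the test split. A minimal sketch of that arithmetic in plain Python follows; the feature values and the `standardize` helper are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
import math

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics supplied by the caller."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values; not taken from attrition.csv.
train = [40.0, 60.0, 80.0]
test = [100.0]

# Statistics come from the training split only. The population standard
# deviation (ddof=0) is used, which matches what StandardScaler computes.
train_mean = sum(train) / len(train)
train_std = math.sqrt(sum((v - train_mean) ** 2 for v in train) / len(train))

train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

print(train_scaled)
print(test_scaled)
```

The scaled training values are centered on zero, while the test value is expressed in training-set standard deviations, so no information from the test split leaks into the fit.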


In [10]:
# Create a OneHotEncoder for the Department column
dept_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

# Fit the encoder to the training data
dept_encoder.fit(y_train[['Department']])

# Create two new variables by applying the encoder to the training and testing data
# Transform the Department column in both training and testing sets
dept_train_encoded = dept_encoder.transform(y_train[['Department']])
dept_test_encoded = dept_encoder.transform(y_test[['Department']])

# Convert to DataFrame with meaningful column names
dept_columns = dept_encoder.get_feature_names_out(['Department'])
dept_train_df = pd.DataFrame(dept_train_encoded, columns=dept_columns, index=y_train.index)
dept_test_df = pd.DataFrame(dept_test_encoded, columns=dept_columns, index=y_test.index)

# Convert Attrition column to binary format (1 for 'Yes', 0 for 'No')
# Work on copies so the assignment does not write into a slice of y_df
y_train = y_train.copy()
y_test = y_test.copy()
y_train['Attrition'] = y_train['Attrition'].map({"Yes": 1.0, "No": 0.0})
y_test['Attrition'] = y_test['Attrition'].map({"Yes": 1.0, "No": 0.0})

# Verify the encoding
print(dept_train_df.head())
print(dept_test_df.head())


      Department_Human Resources  Department_Research & Development  \
1097                         0.0                                1.0   
727                          0.0                                1.0   
254                          0.0                                0.0   
1175                         0.0                                1.0   
1341                         0.0                                1.0   

      Department_Sales  
1097               0.0  
727                0.0  
254                1.0  
1175               0.0  
1341               0.0  
      Department_Human Resources  Department_Research & Development  \
1041                         0.0                                0.0   
184                          0.0                                1.0   
1222                         1.0                                0.0   
67                           0.0                                1.0   
220                          0.0                                1.0 

In [11]:
# Drop original Department column and add encoded columns
y_train_encoded = y_train.drop(columns=['Department']).reset_index(drop=True)
y_test_encoded = y_test.drop(columns=['Department']).reset_index(drop=True)

y_train_final = pd.concat([y_train_encoded, dept_train_df.reset_index(drop=True)], axis=1)
y_test_final = pd.concat([y_test_encoded, dept_test_df.reset_index(drop=True)], axis=1)

# Check final output
print(y_train_final.head())
print(y_test_final.head())

# Convert to NumPy arrays with float32
y_train_final = y_train_final.values.astype(np.float32)
y_test_final = y_test_final.values.astype(np.float32)

y_train_final


  Attrition  Department_Human Resources  Department_Research & Development  \
0       0.0                         0.0                                1.0   
1       0.0                         0.0                                1.0   
2       0.0                         0.0                                0.0   
3       0.0                         0.0                                1.0   
4       0.0                         0.0                                1.0   

   Department_Sales  
0               0.0  
1               0.0  
2               1.0  
3               0.0  
4               0.0  
  Attrition  Department_Human Resources  Department_Research & Development  \
0       0.0                         0.0                                0.0   
1       0.0                         0.0                                1.0   
2       1.0                         1.0                                0.0   
3       0.0                         0.0                                1.0   
4       0

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       ...,
       [1., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)


## Part 2: Create, Compile, and Train the Model

In [13]:
# Find the number of columns in the X training data.
input_dim = X_train_scaled.shape[1]
print("Number of input features:", input_dim)

# Create the input layer
input_layer = layers.Input(shape=(input_dim,))

# Create at least two shared layers
# First hidden layer with 16 neurons and ReLU activation
hidden_layer1 = layers.Dense(16, activation="relu")(input_layer)

# Second hidden layer with 8 neurons and ReLU activation
hidden_layer2 = layers.Dense(8, activation="relu")(hidden_layer1)

Number of input features: 10


In [14]:
# Create a branch for Department with a hidden layer and an output layer
# Create the hidden layer
dept_hidden_layer = layers.Dense(8, activation="relu")(hidden_layer2)

# Create the output layer
dept_output_layer = layers.Dense(3, activation="softmax", name="Department")(dept_hidden_layer)

In [15]:
# Create a branch for Attrition with a hidden layer and an output layer
# Create the hidden layer
attrition_hidden_layer = layers.Dense(8, activation="relu")(hidden_layer2)

# Create the output layer
attrition_output_layer = layers.Dense(1, activation="sigmoid", name='Attrition')(attrition_hidden_layer)

In [16]:
# Create model
model = Model(inputs=input_layer, outputs=[attrition_output_layer, dept_output_layer])

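# Compile the model
# Each output gets its own loss and metric, keyed by output layer name:
# binary cross-entropy for the sigmoid Attrition head and categorical
# cross-entropy for the one-hot, 3-class softmax Department head.
# A flat metrics list is ambiguous for a multi-output model and, in some
# Keras versions, appears to be what triggers the ValueError recorded
# under the training cell.
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss={"Attrition": "binary_crossentropy", "Department": "categorical_crossentropy"},
    metrics={"Attrition": ["accuracy"], "Department": ["accuracy"]},
)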

# Summarize the model
model.summary()
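
The Attrition head emits one sigmoid probability and the Department head emits three softmax probabilities, so each is usually scored with a different cross-entropy. The sketch below works through the two formulas in plain Python; the helper names and the sample probabilities are assumptions for illustration, not values produced by this model.

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one sigmoid output: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(y_onehot, probs):
    """Loss for one softmax output: -sum(y_k * log(p_k)) over the classes."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))

# Hypothetical predictions for a single employee.
# Attrition: true label is "Yes" (1.0), predicted probability of "Yes" is 0.8.
attrition_loss = binary_crossentropy(1.0, 0.8)

# Department: true class is the second of three one-hot columns.
department_loss = categorical_crossentropy([0.0, 1.0, 0.0], [0.1, 0.7, 0.2])

print(round(attrition_loss, 4), round(department_loss, 4))
```

In both cases the loss is the negative log of the probability assigned to the true class; the categorical form simply extends that to more than two classes.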



In [19]:
# Train the model
# Set training parameters
epochs = 50  # Number of times the model sees the entire dataset
batch_size = 32  # Number of samples per training step

# Train the model
history = model.fit(
    X_train_scaled, {'Attrition': y_train_final[:, 0], 'Department': y_train_final[:, 1:]},  # Training data
    validation_data=(X_test_scaled, {'Attrition': y_test_final[:, 0], 'Department': y_test_final[:, 1:]}),  # Validation data
    epochs=epochs,
    batch_size=batch_size,
    verbose=1  # Show training progress
)


Epoch 1/50


ValueError: Attr 'Toutput_types' of 'OptionalFromValue' Op passed list of length 0 less than minimum 1.

In [None]:
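# Evaluate the model with the testing data
# return_dict=True keys the results by metric name, which avoids relying on
# the position of each value in the returned list
results = model.evaluate(
    X_test_scaled,
    {'Attrition': y_test_final[:, 0], 'Department': y_test_final[:, 1:]},
    return_dict=True,
)

In [None]:
# Print the accuracy for both department and attrition
print(f"Attrition predictions accuracy: {results['Attrition_accuracy']}")
print(f"Department predictions accuracy: {results['Department_accuracy']}")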

# Summary

In the provided space below, briefly answer the following questions.

1. Is accuracy the best metric to use on this data? Why or why not?

2. What activation functions did you choose for your output layers, and why?

3. Can you name a few ways that this model might be improved?


1. Not on its own. Accuracy is a common metric, but it is misleading when the classes are imbalanced or when false positives and false negatives carry different costs, and both are plausible for attrition data.

If one class has many more samples than another (for example, far more employees who stayed than left, or one department much larger than the others), a model that predicts the majority class most of the time can still score high accuracy while rarely identifying the minority class.
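
To make that concrete, the sketch below scores a "model" that always predicts the majority class. The 84/16 label split is a made-up example, not a count taken from attrition.csv.

```python
# Hypothetical labels: 84 employees stayed (0) and 16 left (1).
y_true = [0] * 84 + [1] * 16

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual leavers that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks respectable even though not a single leaver is identified, which is why accuracy alone says little on imbalanced labels.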

2. Sigmoid for Attrition (binary classification):
Sigmoid squashes the single output to a value between 0 and 1, which can be read as the probability of the "Yes" class (attrition). Thresholding that probability, typically at 0.5, gives the predicted label: 1 for "Yes" and 0 for "No".
Softmax for Department (multi-class classification):
Softmax produces one probability per class and forces those probabilities to sum to 1. Department has three mutually exclusive categories, so the three outputs can be read as the probability of each department, and the largest one gives the predicted class.
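
The two activations can be sketched in a few lines of plain Python. The function names and the input logits below are illustrative assumptions; Keras applies the same formulas inside the `Dense` layers.

```python
import math

def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs (logits) from the two heads.
attrition_prob = sigmoid(0.0)                 # a logit of 0 maps to 0.5
department_probs = softmax([2.0, 1.0, 0.1])   # one probability per department

print(attrition_prob)
print(department_probs, sum(department_probs))
```

Sigmoid treats its one output independently, while softmax couples the three outputs so that raising one probability lowers the others.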

3. Data Augmentation / Resampling
If the classes are imbalanced, over-sample the minority class or under-sample the majority class in the training split to balance them.
SMOTE (Synthetic Minority Over-sampling Technique) is a popular method for generating synthetic samples of under-represented classes.
Alternatively, use class weights during model training so that errors on minority classes count for more in the loss.
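
One common way to pick class weights is the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`; scikit-learn's `compute_class_weight` with `class_weight="balanced"` uses the same formula. The sketch below computes it by hand; the `balanced_class_weights` helper and the 84/16 label split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical Attrition labels: 84 "No" (0) and 16 "Yes" (1).
labels = [0] * 84 + [1] * 16
weights = balanced_class_weights(labels)

print(weights)
```

The rarer class receives the larger weight, so each misclassified leaver contributes more to the loss than a misclassified stayer.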

Model Complexity and Hyperparameter Tuning
Increase model capacity: Adding hidden layers or units might help the model capture more complex patterns, especially for the Department prediction task.
Early stopping: If the model is overfitting (e.g., training accuracy is much higher than validation accuracy), early stopping limits the damage by halting training once validation performance stops improving.
Optimize hyperparameters: Tuning the learning rate, batch size, optimizer type, number of layers, and number of units can improve performance. Grid search or random search can be used to explore these settings.
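
The early-stopping rule described above can be sketched as a patience counter over the validation loss. This is a simplified illustration, not Keras's implementation; in Keras the equivalent behavior comes from the `EarlyStopping` callback with its `monitor` and `patience` arguments. The `early_stopping_epoch` helper and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation-loss curve that bottoms out at epoch 4.
val_losses = [0.70, 0.62, 0.58, 0.55, 0.56, 0.57, 0.59, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=3)

print(stop_epoch)
```

With a patience of 3, training halts three epochs after the best validation loss instead of running through all remaining epochs.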

Evaluation Metrics Beyond Accuracy
Report precision, recall, F1-score, and AUC alongside accuracy, since these show how well the minority classes are predicted when the classes are imbalanced.
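
Precision, recall, and F1 follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand on made-up labels; the `precision_recall_f1` helper is an assumption for illustration. In practice `sklearn.metrics` provides `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score`. AUC is left out here because it needs predicted probabilities rather than hard labels.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 leavers (1) and 6 stayers (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here the accuracy is higher than the recall, because half of the actual leavers are missed even though most predictions are correct overall.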