# **Predicting Employee Retention**

### Objective

The objective of this assignment is to develop a Logistic Regression model. You will be using this model to analyse and predict binary outcomes based on the input data. This assignment aims to enhance understanding of logistic regression, including its assumptions, implementation, and evaluation, to effectively classify and interpret data.


### Business Objective

A mid-sized technology company wants to improve its understanding of employee retention to foster a loyal and committed workforce. While the organization has traditionally focused on addressing turnover, it recognises the value of proactively identifying employees likely to stay and understanding the factors contributing to their loyalty.


In this assignment, you will build a logistic regression model to predict the likelihood of employee retention from data such as demographic details, job satisfaction scores, performance metrics, and tenure. The aim is to provide the HR department with actionable insights to strengthen retention strategies, create a supportive work environment, and increase the overall stability and satisfaction of the workforce.

## Assignment Tasks

You need to perform the following steps to complete this assignment:
1. Data Understanding
2. Data Cleaning
3. Train-Validation Split
4. EDA on training data
5. EDA on validation data [Optional]
6. Feature Engineering
7. Model Building
8. Prediction and Model Evaluation




## Data Dictionary

The data has 24 columns and 74,610 rows. The following data dictionary describes each column in the dataset:<br>

<table>
  <thead>
    <tr>
      <th>Column Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Employee ID</td>
      <td>A unique identifier assigned to each employee.</td>
    </tr>
    <tr>
      <td>Age</td>
      <td>The age of the employee, ranging from 18 to 60 years.</td>
    </tr>
    <tr>
      <td>Gender</td>
      <td>The gender of the employee.</td>
    </tr>
    <tr>
      <td>Years at Company</td>
      <td>The number of years the employee has been working at the company.</td>
    </tr>
    <tr>
      <td>Monthly Income</td>
      <td>The monthly salary of the employee, in dollars.</td>
    </tr>
    <tr>
      <td>Job Role</td>
      <td>The department or role the employee works in, encoded into categories such as Finance, Healthcare, Technology, Education, and Media.</td>
    </tr>
    <tr>
      <td>Work-Life Balance</td>
      <td>The employee's perceived balance between work and personal life (Poor, Below Average, Good, Excellent).</td>
    </tr>
    <tr>
      <td>Job Satisfaction</td>
      <td>The employee's satisfaction with their job (Very Low, Low, Medium, High).</td>
    </tr>
    <tr>
      <td>Performance Rating</td>
      <td>The employee's performance rating (Low, Below Average, Average, High).</td>
    </tr>
    <tr>
      <td>Number of Promotions</td>
      <td>The total number of promotions the employee has received.</td>
    </tr>
    <tr>
      <td>Overtime</td>
      <td>Number of overtime hours.</td>
    </tr>
    <tr>
      <td>Distance from Home</td>
      <td>The distance between the employee's home and workplace, in miles.</td>
    </tr>
    <tr>
      <td>Education Level</td>
      <td>The highest education level attained by the employee (High School, Associate Degree, Bachelor’s Degree, Master’s Degree, PhD).</td>
    </tr>
    <tr>
      <td>Marital Status</td>
      <td>The marital status of the employee (Divorced, Married, Single).</td>
    </tr>
     <tr>
      <td>Number of Dependents</td>
      <td>Number of dependents the employee has.</td>
    </tr>
    <tr>
      <td>Job Level</td>
      <td>The job level of the employee (Entry, Mid, Senior).</td>
    </tr>
    <tr>
      <td>Company Size</td>
      <td>The size of the company the employee works for (Small, Medium, Large).</td>
    </tr>
    <tr>
      <td>Company Tenure (In Months)</td>
      <td>The total length of time the employee has been working in the industry, in months.</td>
    </tr>
    <tr>
      <td>Remote Work</td>
      <td>Whether the employee works remotely (Yes or No).</td>
    </tr>
    <tr>
      <td>Leadership Opportunities</td>
      <td>Whether the employee has leadership opportunities (Yes or No).</td>
    </tr>
    <tr>
      <td>Innovation Opportunities</td>
      <td>Whether the employee has opportunities for innovation (Yes or No).</td>
    </tr>
    <tr>
      <td>Company Reputation</td>
      <td>The employee's perception of the company's reputation (Very Poor, Poor, Good, Excellent).</td>
    </tr>
    <tr>
      <td>Employee Recognition</td>
      <td>The level of recognition the employee receives (Very Low, Low, Medium, High).</td>
    </tr>
    <tr>
      <td>Attrition</td>
      <td>Whether the employee has left the company.</td>
    </tr>
  </tbody>
</table>


## **1. Data Understanding**

In this step, load the dataset and check basic statistics of the data, including preview of data, dimension of data, column descriptions and data types.

### **1.0 Import Libraries**

In [None]:
# Suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import the libraries
import numpy as np
import pandas as pd

### **1.1 Load the Data**

In [None]:
# Load the dataset
employee_raw = pd.read_csv('Employee_data.csv')

# Normalize garbled characters in text columns
text_cols = employee_raw.select_dtypes(include="object").columns
for c in text_cols:
    employee_raw[c] = (
        employee_raw[c]
        .astype(str)
        .str.replace("â€™", "'", regex=False)
        .str.strip()
        # Keep missing values as NaN rather than the literal string "nan"
        .where(employee_raw[c].notna())
    )

In [None]:
# Check the first few entries
employee_raw.head()

In [None]:
# Inspect the shape of the dataset
employee_raw.shape

In [None]:
# Inspect the different columns in the dataset
employee_raw.columns

### **1.2 Check the basic statistics**

In [None]:
# Check the summary of the dataset
employee_raw.info()

### **1.3 Check the data type of columns**

In [None]:
# Check the data types of the feature variables
employee_raw.dtypes

## **2. Data Cleaning** <font color = red>[15 marks]</font>

### **2.1 Handle the missing values** <font color = red>[10 marks]</font>

2.1.1 Check the number of missing values <font color="red">[2 Marks]</font>

In [None]:
# Check the number of missing values in each column
missing_val_cols = employee_raw.isna().sum().sort_values(ascending=False)
missing_val_cols = missing_val_cols[missing_val_cols > 0]
missing_val_cols

2.1.2 Check the percentage of missing values <font color="red">[2 Marks]</font>

In [None]:
# Check the percentage of missing values in each column
missing_val_cols / employee_raw.shape[0] * 100

2.1.3 Handle rows with missing values <font color="red">[4 Marks]</font>

In [None]:
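# Handle the missing values by imputation:
# column mean for numeric columns, most frequent value otherwise
employee_clean = employee_raw.copy()
for col_name in missing_val_cols.index:
    if pd.api.types.is_numeric_dtype(employee_clean[col_name]):
        fill_value = employee_clean[col_name].mean()
    else:
        fill_value = employee_clean[col_name].mode().iloc[0]
    employee_clean[col_name] = employee_clean[col_name].fillna(fill_value)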

2.1.4 Check percentage of remaining data after missing values are removed <font color="red">[2 Marks]</font>

In [None]:
# Check the percentage of rows remaining after missing values are handled
# (imputation drops no rows, so this is expected to be 100%)
employee_clean.shape[0] / employee_raw.shape[0] * 100

### **2.2 Identify and handle redundant values within categorical columns (if any)** <font color = red>[3 marks]</font>

Examine the categorical columns to determine whether any value or column needs to be treated.

In [None]:
# Write a function to display the categorical columns with their unique values and check for redundant values

def show_categoricals(df: pd.DataFrame):
    """
    Display categorical columns with their unique values and flag values that
    differ only by case or surrounding whitespace.
    """
    # Identify categorical columns (object or category dtype)
    cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

    if not cat_cols:
        print("No categorical columns found.")
        return

    for col in cat_cols:
        s = df[col]
        print("=" * 60)
        print(f"Column: {col}")
        print(f"Unique values: {s.nunique(dropna=False)}")
        print("\nValue counts:")
        print(s.value_counts(dropna=False))

        # Compare raw and normalized (stripped, lower-cased) unique counts
        normalized = s.dropna().astype(str).str.strip().str.lower()
        if normalized.nunique() < s.nunique():
            print("Possible redundant values (differ only by case/whitespace).")

In [None]:
# Check the data
show_categoricals(employee_clean)


### **2.3 Drop redundant columns** <font color = red>[2 marks]</font>

In [None]:
# Drop redundant columns which are not required for modelling
employee_clean.drop(columns='Employee ID', inplace=True, errors='ignore')

In [None]:
# Check first few rows of data
employee_clean.head()

## **3. Train-Validation Split** <font color = red>[5 marks]</font>

### **3.1 Import required libraries**

In [None]:
# Import Train Test Split
from sklearn.model_selection import train_test_split


### **3.2 Define feature and target variables** <font color = red>[2 Marks]</font>

In [None]:
# Put all the feature variables in X
_df = employee_clean.copy()
X = _df.drop(columns=["Attrition"])
# Put the target variable in y
y = _df["Attrition"]

### **3.3 Split the data** <font color="red">[3 Marks]</font>

In [None]:
# Split the data into 70% train data and 30% validation data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)
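The `stratify=y` argument keeps the class proportions of the target the same in both splits. A minimal sketch on toy data (the `toy` frame and its 70/30 class mix are made up for illustration) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70% 'Stayed', 30% 'Left'
toy = pd.DataFrame({
    "feature": range(100),
    "Attrition": ["Stayed"] * 70 + ["Left"] * 30,
})
X_toy = toy.drop(columns=["Attrition"])
y_toy = toy["Attrition"]

# Same settings as the split above: 30% validation, stratified on the target
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

# Class shares in each split should both be close to 70/30
train_share = y_tr.value_counts(normalize=True)
val_share = y_val.value_counts(normalize=True)
print(train_share)
print(val_share)
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is rare.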

## **4. EDA on training data** <font color = red>[20 marks]</font>

### **4.1 Perform univariate analysis** <font color = red>[6 marks]</font>

Perform univariate analysis on training data for all the numerical columns.




4.1.1 Select numerical columns from training data <font color = "red">[1 Mark]</font>

In [None]:
# Select numerical columns
num_cols = X_train.select_dtypes(include=[np.number]).columns
num_cols

4.1.2 Plot distribution of numerical columns <font color = "red">[5 Marks]</font>

In [None]:
# Plot all the numerical columns to understand their distribution

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

fig_rows = int(np.ceil(len(num_cols)/3))
fig, axes = plt.subplots(fig_rows, 3, figsize=(16, 4*fig_rows))
axes = np.array(axes).reshape(-1)
for i, c in enumerate(num_cols):
    sns.histplot(X_train[c], kde=True, ax=axes[i])
    axes[i].set_title(f"Distribution: {c}")
for j in range(i+1, len(axes)):
    axes[j].axis("off")
plt.tight_layout()
plt.show()

### **4.2 Perform correlation analysis** <font color="red">[4 Marks]</font>

Check the correlation among different numerical variables.

In [None]:
# Create correlation matrix for numerical columns
corr = X_train[num_cols].corr(numeric_only=True)

# Plot Heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(corr, cmap="RdBu_r", annot=True, center=0)
plt.title("Correlation matrix (training numerics)")
plt.show()

### **4.3 Check class balance** <font color="red">[2 Marks]</font>

Check the distribution of the target variable in the training set to check class balance.

In [None]:
# Plot a bar chart to check class balance
palette = {'Stayed': "#58BE97", 'Left': "#C74D3D"}
ax = sns.countplot(x=y_train, palette=palette)
ax.set_title("Target distribution (train)")
# Label every bar; the number of containers depends on the seaborn version
for container in ax.containers:
    ax.bar_label(container)
plt.show()

y_train.value_counts()

### **4.4 Perform bivariate analysis** <font color="red">[8 Marks]</font>

Perform bivariate analysis on training data between all the categorical columns and the target variable to analyse how the categorical variables influence the target variable.

In [None]:
# Plot distribution for each categorical column with target variable
cat_cols_train = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
palette = {'Stayed': "#58BE97", 'Left': "#C74D3D"}

for c in cat_cols_train:
    plt.figure(figsize=(7,4))
    tmp = pd.concat([X_train[[c]].copy(), y_train.rename("Attrition")], axis=1)
    sns.countplot(data=tmp, x=c, hue="Attrition", palette=palette)
    plt.title(f"Attrition by {c} (train)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

## **5. EDA on validation data** <font color = red>[OPTIONAL]</font>

### **5.1 Perform univariate analysis**

Perform univariate analysis on validation data for all the numerical columns.


5.1.1 Select numerical columns from validation data

In [None]:
# Select numerical columns
num_cols = X_test.select_dtypes(include=[np.number]).columns
num_cols

5.1.2 Plot distribution of numerical columns

In [None]:
# Plot all the numerical columns to understand their distribution
fig_rows = int(np.ceil(len(num_cols)/3))
fig, axes = plt.subplots(fig_rows, 3, figsize=(16, 4*fig_rows))
axes = np.array(axes).reshape(-1)
for i, c in enumerate(num_cols):
    sns.histplot(X_test[c], kde=True, ax=axes[i])
    axes[i].set_title(f"Distribution: {c}")
for j in range(i+1, len(axes)):
    axes[j].axis("off")
plt.tight_layout()
plt.show()

### **5.2 Perform correlation analysis**

Check the correlation among different numerical variables.

In [None]:
# Create correlation matrix for numerical columns
corr_test = X_test[num_cols].corr(numeric_only=True)

# Plot Heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(corr_test, cmap="RdBu_r", annot=True, center=0)
plt.title("Correlation matrix (validation numerics)")
plt.show()

### **5.3 Check class balance**

Check the distribution of target variable in validation data to check class balance.

In [None]:
# Plot a bar chart to check class balance
palette = {'Stayed': "#58BE97", 'Left': "#C74D3D"}
ax = sns.countplot(x=y_test, palette=palette, order=["Stayed", "Left"])
ax.set_title("Target distribution (validation)")
# Label every bar; the number of containers depends on the seaborn version
for container in ax.containers:
    ax.bar_label(container)
plt.show()

y_test.value_counts()

### **5.4 Perform bivariate analysis**

Perform bivariate analysis on validation data between all the categorical columns and target variable to analyse how the categorical variables influence the target variable.

In [None]:
# Plot distribution for each categorical column with target variable
cat_cols_test = X_test.select_dtypes(include=["object", "category"]).columns.tolist()
palette = {'Stayed': "#58BE97", 'Left': "#C74D3D"}

for c in cat_cols_test:
    plt.figure(figsize=(7,4))
    tmp = pd.concat([X_test[[c]].copy(), y_test.rename("Attrition")], axis=1)
    sns.countplot(data=tmp, x=c, hue="Attrition", palette=palette)
    plt.title(f"Attrition by {c} (test)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

## **6. Feature Engineering** <font color = red>[20 marks]</font>

### **6.1 Dummy variable creation** <font color = red>[15 marks]</font>


The next step is to deal with the categorical variables present in the data.

6.1.1 Identify categorical columns where dummy variables are required <font color="red">[1 Mark]</font>

In [None]:
# Check the categorical columns
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
cat_cols

6.1.2 Create dummy variables for independent columns in training set <font color="red">[3 Marks]</font>

In [None]:
# Create dummy variables using the 'get_dummies' for independent columns
X_train_dummy = pd.get_dummies(X_train[cat_cols], drop_first=True).astype(int)
# Add the results to the master DataFrame
X_train = pd.concat([X_train, X_train_dummy], axis=1)

Now, drop the original categorical columns and check the DataFrame

In [None]:
# Drop the original categorical columns and check the DataFrame
X_train.drop(columns=cat_cols, inplace=True, errors='ignore')
X_train.head()

6.1.3 Create dummy variables for independent columns in validation set <font color="red">[3 Marks]</font>

In [None]:
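# Create dummy variables using 'get_dummies', aligned to the training dummy columns
# so that both sets share the same features even if a category is missing from one of them
X_test_dummy = (
    pd.get_dummies(X_test[cat_cols])
    .reindex(columns=X_train_dummy.columns, fill_value=0)
    .astype(int)
)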

# Add the results to the master DataFrame
X_test = pd.concat([X_test, X_test_dummy], axis=1)

Now, drop the original categorical columns and check the DataFrame

In [None]:
# Drop categorical columns and check the DataFrame
X_test.drop(columns=cat_cols, inplace=True, errors='ignore')
X_test.head()
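Encoding the training and validation sets separately can yield different dummy columns when a category is absent from one of them. The sketch below, on made-up `Job Level` values, shows one way to align the validation dummies to the training columns with `reindex`; the toy frames and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: 'Senior' never appears in the validation set
train_toy = pd.DataFrame({"Job Level": ["Entry", "Mid", "Senior", "Mid"]})
val_toy = pd.DataFrame({"Job Level": ["Entry", "Mid"]})

# Training dummies with the first level ('Entry') dropped as the reference
train_dummies = pd.get_dummies(train_toy, drop_first=True).astype(int)

# Encode validation data in full, then keep exactly the training columns:
# extra columns are discarded and missing ones are filled with 0
val_aligned = (
    pd.get_dummies(val_toy)
    .reindex(columns=train_dummies.columns, fill_value=0)
    .astype(int)
)

print(train_dummies.columns.tolist())
print(val_aligned)
```

Applying `drop_first=True` to the validation set on its own would drop whichever level happens to sort first there, which is not necessarily the reference level used in training.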

6.1.4 Create DataFrame for dependent column in both training and validation set <font color = "red">[1 Mark]</font>

In [None]:
# Convert y_train and y_validation to DataFrame to create dummy variables
y_train = pd.DataFrame({'Attrition':y_train})
y_test = pd.DataFrame({'Attrition':y_test})

In [None]:
y_test

6.1.5 Create dummy variables for dependent column in training set <font color="red">[3 Marks]</font>

In [None]:
# Create dummy variables using the 'get_dummies' for dependent column
y_train = pd.get_dummies(y_train).astype(int)
y_train.head()

6.1.6 Create dummy variable for dependent column in validation set <font color = "red">[3 Marks]</font>

In [None]:
# Create dummy variables using the 'get_dummies' for dependent column
y_test = pd.get_dummies(y_test).astype(int)
y_test.head()

6.1.7 Drop redundant columns <font color="red">[1 Mark]</font>

In [None]:
# Drop redundant columns from both train and validation
y_train.drop(columns='Attrition_Left', inplace=True, errors='ignore')
y_test.drop(columns='Attrition_Left', inplace=True, errors='ignore')

### **6.2 Feature scaling** <font color = red>[5 marks]</font>

Apply feature scaling to the numeric columns to bring them to a common range and ensure consistent scaling.

6.2.1 Import required libraries <font color="red">[1 Mark]</font>

In [None]:
# Import the necessary scaling tool from scikit-learn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()


6.2.2 Scale the numerical features <font color="red">[4 Marks]</font>

In [None]:
# Scale the numeric features present in the training set
X_train_scaled = X_train.copy()
X_train_scaled[num_cols] = scaler.fit_transform(X_train_scaled[num_cols])
display(X_train_scaled.head())

# Scale the numerical features present in the validation set
X_test_scaled = X_test.copy()
X_test_scaled[num_cols] = scaler.transform(X_test_scaled[num_cols])
display(X_test_scaled.head())
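The scaler is fitted on the training set only and then reused on the validation set, so validation values are mapped with the training minimum and maximum. A small sketch with made-up ages illustrates this; note that validation values outside the training range fall outside [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data
train_toy = pd.DataFrame({"Age": [20, 30, 40, 60]})
val_toy = pd.DataFrame({"Age": [18, 50]})

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns min=20 and max=60
val_scaled = toy_scaler.transform(val_toy)          # reuses the training min and max

print(np.round(train_scaled.ravel(), 3))
print(np.round(val_scaled.ravel(), 3))  # 18 is below the training minimum, so it maps below 0
```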

## **7. Model Building** <font color = red>[40 marks]</font>

### **7.1 Feature selection** <font color = red>[5 marks]</font>

As there are a lot of variables present in the data, Recursive Feature Elimination (RFE) will be used to select the most influential features for building the model.

7.1.1 Import required libraries <font color="red">[1 Mark]</font>

In [None]:
# Import 'LogisticRegression' and create a LogisticRegression object
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()


7.1.2 Import RFE and select 15 variables <font color="red">[3 Marks]</font>

In [None]:
# Import RFE and select 15 variables
from sklearn.feature_selection import RFE

rfe = RFE(estimator=log_reg, n_features_to_select=15)
# Pass the target as a 1-D array, which is the shape scikit-learn expects
rfe.fit(X_train_scaled, y_train.values.ravel())

In [None]:
# Display the features selected by RFE
X_train_scaled.columns[rfe.support_]

7.1.3 Store the selected features <font color="red">[1 Mark]</font>




In [None]:
# Put columns selected by RFE into variable 'col'
col = X_train_scaled.columns[rfe.support_]

### **7.2 Building Logistic Regression Model** <font color = red>[20 marks]</font>

Now that you have selected the variables through RFE, use these features to build a logistic regression model with statsmodels. This will allow you to assess the statistical aspects, such as p-values and VIFs, which are important for checking multicollinearity and ensuring that the predictors are not highly correlated with each other, as this could distort the model's coefficients.

7.2.1 Select relevant columns on training set <font color="red">[1 Mark]</font>

In [None]:
# Select only the columns selected by RFE
col

In [None]:
# View the training data
X_train_scaled[col]

7.2.2 Add constant to training set <font color = "red">[1 Mark]</font>

In [None]:
# Import statsmodels and add constant to training set

import statsmodels.api as sm

X_train_const = sm.add_constant(X_train_scaled[col])
X_train_const.head()

7.2.3 Fit logistic regression model <font color="red">[3 Marks]</font>

In [None]:
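# Fit a logistic regression model on X_train after adding a constant and output the summary
# statsmodels provides the coefficient table with p-values discussed below
logit_res = sm.GLM(y_train['Attrition_Stayed'], X_train_const, family=sm.families.Binomial()).fit()
print(logit_res.summary())

# Fit the scikit-learn model as well; the prediction cells below use it.
# It estimates its own intercept, so the 'const' column is redundant for it.
log_reg.fit(X_train_const, y_train.values.ravel())
print("Classes:", log_reg.classes_)
print("Intercept (bias):", log_reg.intercept_)
print("Coefficients:", log_reg.coef_)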


**Model Interpretation**

The output summary table lists the features used to build the model, along with each feature's coefficient and p-value. The p-value is used to assess the statistical significance of each coefficient: the smaller the p-value, the stronger the evidence that the feature contributes to the model.

A positive coefficient indicates that an increase in the feature's value increases the odds of the event occurring. A negative coefficient means the opposite, i.e., an increase in the feature's value decreases the odds of the event occurring.
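To make the link between a coefficient and the odds concrete: exponentiating a logistic regression coefficient gives the odds ratio, the factor by which the odds are multiplied for a one-unit increase in the feature. The sketch below uses made-up coefficient values; `odds_ratio` is a hypothetical helper, not part of any library.

```python
import numpy as np

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return float(np.exp(coef))

# Hypothetical coefficients, chosen only for illustration
print(round(odds_ratio(0.7), 3))   # positive coefficient: odds ratio above 1
print(round(odds_ratio(-0.7), 3))  # negative coefficient: odds ratio below 1
print(odds_ratio(0.0))             # zero coefficient: odds unchanged
```

Because the numeric features were min-max scaled, a one-unit increase in such a feature spans its whole training range, which should be kept in mind when reading these ratios.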



7.2.4 Evaluate VIF of features <font color="red">[3 Marks]</font>

In [None]:
# Import 'variance_inflation_factor'
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Make a VIF DataFrame for all the variables present
X_vif = X_train_const.drop(columns='const')

def make_vif_df(X):
    vif_values = []
    for i in range(X.shape[1]):
        vif = variance_inflation_factor(X.values, i)
        vif_values.append(vif)
    
    return (pd.DataFrame({
                "feature": X.columns,
                "VIF": vif_values
            })
            .sort_values("VIF", ascending=False)
            .reset_index(drop=True))

make_vif_df(X_vif)

Proceed to the next step if the p-values and VIFs are within acceptable ranges (a p-value below 0.05 and a VIF below 5 are common rules of thumb). If you observe high p-values or VIFs, create new cells to drop those features and retrain the model.

7.2.5 Make predictions on training set <font color = "red">[2 Marks]</font>

In [None]:
# Predict the probabilities on the training set
y_train_proba = log_reg.predict_proba(X_train_const)

7.2.6 Format the prediction output <font color="red">[1 Mark]</font>

In [None]:
# Reshape it into an array
y_train_proba = np.array(y_train_proba)
y_train_proba

7.2.7 Create a DataFrame with the actual stayed flag and the predicted probabilities <font color="red">[1 Mark]</font>

In [None]:
# Create a new DataFrame containing the actual stayed flag and the probabilities predicted by the model
y_train_pred_df = pd.concat([
    y_train, 
    pd.DataFrame({'Stayed_Proba':y_train_proba[:, 1]}, index=y_train.index)
    ], axis=1)
y_train_pred_df

7.2.8 Create a new column 'Prediction' with 1 if predicted probabilities are greater than 0.5 else 0 <font color = "red">[1 Mark]</font>

In [None]:
# Create a new column 'Prediction' with 1 if predicted probabilities are greater than 0.5 else 0
y_train_pred_df['Prediction'] = (y_train_pred_df['Stayed_Proba'] > 0.5).astype(int)
y_train_pred_df

**Evaluation of performance of Model**

Evaluate the performance of the model based on the predictions made on the training set.


7.2.9 Check the accuracy of the model based on the predictions made on the training set <font color = "red">[1 Mark]</font>

In [None]:
# Import metrics from sklearn for evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, auc

# Check the overall accuracy
accuracy = accuracy_score(y_train_pred_df['Attrition_Stayed'], y_train_pred_df['Prediction'])
accuracy

7.2.10 Create a confusion matrix based on the predictions made on the training set <font color="red">[1 Mark]</font>

In [None]:
# Create confusion matrix

confusion_matrix(y_train_pred_df['Attrition_Stayed'], y_train_pred_df['Prediction'])

In [None]:
print(classification_report(y_train_pred_df['Attrition_Stayed'], y_train_pred_df['Prediction']))

7.2.11 Create variables for true positive, true negative, false positive and false negative <font color="red">[1 Mark]</font>

In [None]:
# Create variables for true positive, true negative, false positive and false negative
tn, fp, fn, tp = confusion_matrix(y_train_pred_df['Attrition_Stayed'], y_train_pred_df['Prediction']).ravel().tolist()

7.2.12 Calculate sensitivity and specificity of model  <font color="red">[2 Marks]</font>

In [None]:
# Calculate sensitivity
tp / (tp + fn)

In [None]:
# Calculate specificity
tn / (tn + fp)

7.2.13 Calculate precision and recall of model <font color="red">[2 Marks]</font>

In [None]:
# Calculate precision
tp / (tp + fp)

In [None]:
# Calculate recall
tp / (tp + fn)
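The four metrics above all derive from the same confusion-matrix counts. A small worked example on made-up labels (1 = stayed, 0 = left) shows the arithmetic, and also that recall and sensitivity are the same quantity:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
toy_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels [0, 1], ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives that were caught
precision = tp / (tp + fp)    # share of predicted positives that were right
recall = tp / (tp + fn)       # identical to sensitivity

print(tn, fp, fn, tp)
print(round(sensitivity, 3), specificity, precision, round(recall, 3))
```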

### **7.3 Find the optimal cutoff** <font color = red>[15 marks]</font>

Find the optimal cutoff to improve model performance. While a default threshold of 0.5 was used for initial evaluation, optimising this threshold can enhance the model's performance.

First, plot the ROC curve and check AUC.



7.3.1 Plot ROC curve <font color="red">[3 Marks]</font>

In [None]:
# Compute the ROC curve points (false positive rate, true positive rate) and their thresholds
fpr, tpr, thresholds = roc_curve(y_train_pred_df['Attrition_Stayed'], y_train_pred_df['Stayed_Proba'])

In [None]:
# Compute the AUC and plot the ROC curve
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"LogReg ROC (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], "r--", label="Random")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()

In [None]:

# Pick the cutoff that maximises Youden's J statistic
J = tpr - fpr  # since J = TPR + (1 - FPR) - 1 = TPR - FPR
best_idx = np.argmax(J)
best_threshold = thresholds[best_idx]
best_threshold

**Sensitivity and Specificity tradeoff**

Check sensitivity and specificity tradeoff to find the optimal cutoff point.

7.3.2 Predict on training set at various probability cutoffs <font color="red">[1 Mark]</font>

In [None]:
# Predict on training data by creating columns with different probability cutoffs to explore the impact of cutoff on model performance


In [None]:
cutoffs = np.round(np.linspace(0.05, 0.95, 19), 2)

df = pd.DataFrame({'y_true': y_train_pred_df['Attrition_Stayed'], 'y_score': y_train_pred_df['Stayed_Proba']})
for c in cutoffs:
    df[f'pred_c{c:.2f}'] = (df['y_score'] >= c).astype(int)

rows = []
for c in cutoffs:
    y_pred = df[f'pred_c{c:.2f}']
    acc = accuracy_score(df['y_true'], y_pred)
    prec = precision_score(df['y_true'], y_pred, zero_division=0)
    rec = recall_score(df['y_true'], y_pred, zero_division=0)
    f1 = f1_score(df['y_true'], y_pred, zero_division=0)
    tn, fp, fn, tp = confusion_matrix(df['y_true'], y_pred).ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan

    rows.append({
        'cutoff': c,
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'specificity': specificity,
        'f1': f1,
        'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn
    })

metrics_df = pd.DataFrame(rows).sort_values('cutoff').reset_index(drop=True)

roc_auc = roc_auc_score(df['y_true'], df['y_score'])
fpr, tpr, thresholds = roc_curve(df['y_true'], df['y_score'])

print(f"ROC AUC: {roc_auc:.3f}")
metrics_df


7.3.3 Plot for accuracy, sensitivity, specificity at different probability cutoffs <font color="red">[2 Marks]</font>

In [None]:
# Create a DataFrame to see the values of accuracy, sensitivity, and specificity at different values of probability cutoffs


In [None]:
# Plot accuracy, sensitivity, and specificity at different values of probability cutoffs


7.3.4 Create a column for final prediction based on the optimal cutoff <font color="red">[2 Marks]</font>

In [None]:
# Create a column for final prediction based on the optimal cutoff


7.3.5 Calculate model's accuracy <font color="red">[1 Mark]</font>

In [None]:
# Calculate the accuracy


7.3.6 Create confusion matrix <font color="red">[1 Mark]</font>

In [None]:
# Create the confusion matrix once again


7.3.7 Create variables for true positive, true negative, false positive and false negative <font color="red">[1 Mark]</font>

In [None]:
# Create variables for true positive, true negative, false positive and false negative


7.3.8 Calculate sensitivity and specificity of the model <font color="red">[1 Mark]</font>

In [None]:
# Calculate Sensitivity


In [None]:
# Calculate Specificity


7.3.9 Calculate precision and recall of the model <font color="red">[1 Mark]</font>

In [None]:
# Calculate Precision


In [None]:
# Calculate Recall


**Precision and Recall tradeoff**

Check optimal cutoff value by plotting precision-recall curve, and adjust the cutoff based on the precision and recall tradeoff if required.

In [None]:
# Import precision-recall curve function
from sklearn.metrics import precision_recall_curve

In [None]:
# Check actual and predicted values from initial model


7.3.10 Plot precision-recall curve <font color="red">[2 Marks]</font>

In [None]:
# Plot precision-recall curve
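A minimal sketch of the precision-recall tradeoff on made-up scores follows. `precision_recall_curve` returns one more precision and recall value than thresholds (the final point is precision 1, recall 0), so the last element is dropped when plotting against the thresholds. The toy data and variable names are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_score = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]

precision, recall, pr_thresholds = precision_recall_curve(toy_true, toy_score)

# Plot both metrics against the probability cutoff
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(pr_thresholds, precision[:-1], label="Precision")
ax.plot(pr_thresholds, recall[:-1], label="Recall")
ax.set_xlabel("Probability cutoff")
ax.set_ylabel("Score")
ax.set_title("Precision and recall vs. cutoff")
ax.legend()
```

The cutoff near which the two curves meet balances precision and recall and can be compared with the cutoff suggested by the sensitivity-specificity plot.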


## **8. Prediction and Model Evaluation** <font color = red>[30 marks]</font>

Use the model from the previous step to make predictions on the validation set with the optimal cutoff. Then evaluate the model's performance using metrics such as accuracy, sensitivity, specificity, precision, and recall.

### **8.1 Make predictions over validation set** <font color = red>[15 marks]</font>

8.1.1 Select relevant features for validation set <font color="red">[2 Marks]</font>



In [None]:
# Select the relevant features for validation set


8.1.2 Add constant to X_validation <font color="red">[2 Marks]</font>

In [None]:
# Add constant to X_validation


8.1.3 Make predictions over validation set <font color="red">[3 Marks]</font>

In [None]:
# Make predictions on the validation set and store it in the variable 'y_validation_pred'

# View predictions


8.1.4 Create DataFrame with actual values and predicted values for validation set <font color="red">[5 Marks]</font>

In [None]:
# Convert 'y_validation_pred' to a DataFrame 'predicted_probability'

# Convert 'y_validation' to DataFrame 'actual'

# Remove index from both DataFrames 'actual' and 'predicted_probability' to append them side by side


8.1.5 Predict final prediction based on the cutoff value <font color="red">[3 Marks]</font>

In [None]:
# Make predictions on the validation set using the optimal cutoff and store it in a column 'final_prediction'

# Check the DataFrame
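The steps in this subsection can be sketched end to end on synthetic data. Everything here is a stand-in: the feature names, the `selected` list (in place of the RFE-selected columns), and the `cutoff` value (in place of the optimal cutoff found earlier). The sketch also uses scikit-learn without an explicit constant column, since `LogisticRegression` fits its own intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training and validation data (hypothetical)
rng = np.random.default_rng(0)
X_tr = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_tr = (X_tr["f1"] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_val = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_val = (X_val["f1"] + 0.5 * rng.normal(size=50) > 0).astype(int)

selected = ["f1", "f2"]  # stand-in for the RFE-selected columns
cutoff = 0.45            # stand-in for the optimal cutoff

# Fit on training data, then score the validation set on the same columns
model = LogisticRegression().fit(X_tr[selected], y_tr)
val_proba = model.predict_proba(X_val[selected])[:, 1]

# Actual values and predicted probabilities side by side, plus the final call
val_results = pd.DataFrame({
    "actual": y_val.to_numpy(),
    "predicted_probability": val_proba,
})
val_results["final_prediction"] = (val_results["predicted_probability"] >= cutoff).astype(int)
print(val_results.head())
```

The key point is that the validation set goes through exactly the same column selection and cutoff as the training set; nothing is re-fitted on validation data.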


### **8.2 Calculate accuracy of the model** <font color = red>[2 marks]</font>

In [None]:
# Calculate the overall accuracy


### **8.3 Create confusion matrix and create variables for true positive, true negative, false positive and false negative** <font color = red>[5 marks]</font>

In [None]:
# Create confusion matrix


In [None]:
# Create variables for true positive, true negative, false positive and false negative


### **8.4 Calculate sensitivity and specificity** <font color = red>[4 marks]</font>

In [None]:
# Calculate sensitivity


In [None]:
# Calculate specificity


### **8.5 Calculate precision and recall** <font color = red>[4 marks]</font>

In [None]:
# Calculate precision


In [None]:
# Calculate recall


## Conclusion

