# Understanding Overfitting and Underfitting a Model

**Author:** Manaranjan Pradhan<br>
**Email ID:** manaranjan@gmail.com<br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/<br>
**Website:** www.manaranjanp.com


## 1. HR - Attrition Analytics

Employees are a critical resource for any organization. Organizations spend a great deal of time and money hiring and nurturing their employees, so it is a significant loss when employees leave, especially key contributors. If HR can predict whether an employee is at risk of leaving the company, it can identify attrition risks early and either provide the support needed to retain those employees or hire preventively to minimize the impact on the organization.

## 2. Data Set

This dataset is taken from Kaggle: https://www.kaggle.com/datasets/jacksonchou/hr-data-for-analytics


### 2.1 Dependent variable

- **left** : 0 if the employee did not leave, 1 if the employee left the company

### 2.2 Independent variables

- **satisfaction_level** : how satisfied the employee is (0 = least satisfied, 1 = most satisfied)
- **last_evaluation** : the employee's evaluation for the last month (0 = bad, 1 = excellent)
- **number_project** : number of projects the employee has worked on
- **average_montly_hours** : average number of hours the employee spends at work per month
- **time_spend_company** : number of years the employee has spent at the company
- **Work_accident** : 0 if the employee has had no accident, 1 if at least one
- **promotion_last_5years** : 0 if the employee has had no promotion in the last 5 years, 1 if at least one
- **dept** : department in which the employee works
- **salary** : salary bracket (high, medium or low)

## 3. Loading Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

sn.set_palette("tab10")

In [None]:
hr_df = pd.read_csv('HR_comma_sep.csv')

In [None]:
hr_df.sample(10)

In [None]:
hr_df.info()

## 4. Encode Categorical Features

In [None]:
hr_encoded_df = pd.get_dummies(hr_df, columns=['dept', 'salary'])

In [None]:
hr_encoded_df.head(10)
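
To see what the encoding step does, here is a minimal sketch on a small made-up frame. The rows and the name `toy_df` are illustrative assumptions, not part of the HR dataset. Each distinct category value becomes its own indicator column named `<column>_<value>`.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the HR dataset.
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "low"],
})

# One indicator column is created per distinct value of 'salary';
# columns that are not listed in `columns` are passed through unchanged.
toy_encoded = pd.get_dummies(toy_df, columns=["salary"])
print(list(toy_encoded.columns))
```

The same thing happens to `dept` and `salary` in the HR data, which is why `hr_encoded_df` has more columns than `hr_df`.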

## 5. Building a Decision Tree


### 5.1. Setting X and y Values

In [None]:
X_features = list(hr_encoded_df.columns)
X_features.remove('left')

In [None]:
X_features

In [None]:
X = hr_encoded_df[X_features]
y = hr_encoded_df.left

### 5.2. Split Dataset into train and test

- Train: 80%
- Test: 20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, \
y_train, y_test = train_test_split( X,
                                    y,
                                    test_size = 0.2,
                                    random_state = 23 )

In [None]:
X_train.shape

In [None]:
X_test.shape

### 5.3. Build Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(max_depth=1, criterion='gini')

In [None]:
tree.fit( X_train, y_train )
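
A tree with `max_depth=1` makes a single split. With `criterion='gini'`, that split is chosen to minimize the size-weighted Gini impurity of the two child nodes. The following is a minimal sketch of that calculation in plain Python. The label lists are made up for illustration, and `gini_impurity` and `weighted_gini` are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left_labels) + len(right_labels)
    return ((len(left_labels) / n) * gini_impurity(left_labels)
            + (len(right_labels) / n) * gini_impurity(right_labels))

# Hypothetical labels (1 = employee left) for illustration only.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left_child, right_child = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini_impurity(parent))                   # 0.5
print(weighted_gini(left_child, right_child))  # 0.375
```

The split lowers the impurity from 0.5 to 0.375, so the children are purer than the parent. A deeper tree repeats this step inside each child node.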

### 5.4. Measuring Accuracy

In [None]:
from sklearn.metrics import roc_auc_score
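
Although this section talks about accuracy, the metric imported here is ROC AUC. One way to read it: ROC AUC is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative example, with ties counted as half. The sketch below computes that directly in plain Python on a tiny made-up example. `roc_auc_by_pairs` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def roc_auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_by_pairs(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. This pairwise loop is only meant to show the idea; it is quadratic in the number of samples, so `roc_auc_score` remains the practical choice.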

### 5.5. Predict on Train Data

In [None]:
y_train_pred_probs = tree.predict_proba(X_train)

In [None]:
roc_auc_score(y_train, y_train_pred_probs[:,1])

### 5.6. Predicting on Test Data

In [None]:
y_test_pred_probs = tree.predict_proba(X_test)

In [None]:
roc_auc_score(y_test, y_test_pred_probs[:,1])

## 6. Checking Underfitting and Overfitting

### Underfitting:

Underfitting occurs when a machine learning model is too simple or lacks the capacity to capture the underlying patterns in the data. It fails to learn the relevant relationships between the features and the target variable, resulting in poor performance on both the training and test data.


### Overfitting:

Overfitting occurs when a model is too complex and fits the noise in the training data along with the underlying patterns. An overfit model tends to score highly on the training data because it has effectively memorized the training examples. When evaluated on the test data, however, its performance drops significantly because it does not generalize well.
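
The two definitions above can be seen side by side on synthetic data. The sketch below is an illustration under stated assumptions: it uses `make_classification` with 10% label noise rather than the HR dataset, and compares a depth-1 tree with a tree that has no depth limit. All variable names here are chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data with 10% randomly flipped labels; not the HR dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

demo_scores = {}
for name, depth in [("shallow", 1), ("unrestricted", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    demo_scores[name] = (train_auc, test_auc)
    print(f"{name:>12}: train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```

The shallow tree should score modestly on both sets (underfitting), while the unrestricted tree fits the training set perfectly, noise included, and scores lower on the test set (overfitting).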

In [None]:
train_accuracy = []
test_accuracy = []

### 6.1. Max Depth Vs Accuracy

In [None]:
for depth in range(1, 20):
    print(f"Building a model with depth: {depth}")
    tree_model = DecisionTreeClassifier(max_depth = depth, criterion = 'gini')
    tree_model.fit(X_train, y_train)
    # Note: despite the list names, the metric recorded here is ROC AUC, not accuracy.
    train_probs = tree_model.predict_proba(X_train)
    train_accuracy.append(roc_auc_score(y_train, train_probs[:,1]))
    test_probs = tree_model.predict_proba(X_test)
    test_accuracy.append(roc_auc_score(y_test, test_probs[:,1]))

In [None]:
acc_df = pd.DataFrame({"depth" : list(range(1, 20)),
                       "train_accuracy": train_accuracy,
                       "test_accuracy": test_accuracy})

In [None]:
acc_df

### 6.2. Test Accuracy Vs Train Accuracy

In [None]:
# The plotted values are ROC AUC scores, so the labels say so.
sn.lineplot(data = acc_df,
            x = 'depth',
            y = 'train_accuracy',
            label = 'train ROC AUC')
sn.lineplot(data = acc_df,
            x = 'depth',
            y = 'test_accuracy',
            label = 'test ROC AUC')
plt.xlabel('Depth of the Decision Trees')
plt.ylabel('ROC AUC')
plt.legend();

## 7. Finding Optimal Complexity: Max Depth

### 7.1 Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {'max_depth': list(range(1, 20)),
          'criterion': ['gini', 'entropy']}

In [None]:
tuning_grid = GridSearchCV(DecisionTreeClassifier(),
                           param_grid = params,
                           cv = 10,
                           scoring = 'roc_auc')

In [None]:
tuning_grid.fit(X_train, y_train)
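
With `cv = 10`, every parameter combination is trained ten times, each time holding out a different tenth of the training data for validation. The grid above has 19 depths and 2 criteria, so that is 38 combinations and 380 fits before the final refit. The sketch below shows the fold bookkeeping in plain Python. It is a simplification: `k_fold_indices` is a hypothetical helper that produces plain, unshuffled folds, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# 25 samples is an arbitrary small number chosen for readability.
folds = list(k_fold_indices(n_samples=25, n_folds=10))
for train_idx, valid_idx in folds[:3]:
    print(len(train_idx), valid_idx)
```

Averaging the validation score over all folds gives the `mean_test_score` reported for each parameter combination, which is less sensitive to one lucky or unlucky split than a single train/test comparison.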

### 7.2 Checking Search Results

In [None]:
grid_results = pd.DataFrame(tuning_grid.cv_results_)

In [None]:
grid_results[grid_results.rank_test_score < 10]

In [None]:
grid_results[['param_criterion',
              'param_max_depth',
              'mean_test_score',
              'std_test_score',
              'rank_test_score']].sort_values('rank_test_score', ascending = True)[0:10]

### 7.3 Finding Best Params and Best Accuracy

In [None]:
tuning_grid.best_params_

In [None]:
tuning_grid.best_score_

## Ex1: Participant Exercise:

- Build a KNN classifier for the HR attrition problem using the 5 most important features listed below.
    - satisfaction_level, last_evaluation, number_project, average_montly_hours, time_spend_company
- n_neighbors (number of neighbors) is a hyperparameter of the KNN classifier
- n_neighbors controls the complexity of the KNN model
- Build models with n_neighbors varying from 2 to 20
- Measure train and test accuracy against n_neighbors
- Find the optimal value of n_neighbors using Grid Search