# Classification Problem


## 1. HR - Attrition Analytics

Human resources are critical to any organization. Organizations spend a great deal of time and money to hire and nurture their employees, so it is a significant loss when employees leave, especially key ones. If HR can predict whether an employee is at risk of leaving, it can identify attrition risks early and either provide the support needed to retain those employees or hire preventively to minimize the impact on the organization.

## 2. Data Set

This dataset is taken from Kaggle: https://www.kaggle.com/datasets/jacksonchou/hr-data-for-analytics


### 2.1 Dependent variable

**left** : 0 if the employee did not leave, 1 if the employee left the company

### 2.2 Independent variables

- **satisfaction_level** : how satisfied the employee is (0 least satisfied, 1 most satisfied)
- **last_evaluation** : the employee's evaluation for the last month (0 bad, 1 excellent)
- **number_project** : number of projects the employee worked on
- **average_montly_hours** : average number of hours the employee spends at work per month
- **time_spend_company** : number of years the employee has spent at the company
- **Work_accident** : 0 if the employee has not had an accident, 1 if at least one
- **promotion_last_5years** : 0 if the employee has not been promoted in the last 5 years, 1 if promoted at least once
- **dept** : department in which the employee works

## 3. Loading Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

sn.set_palette("tab10")

In [None]:
hr_df = pd.read_csv('HR_comma_sep.csv')

In [None]:
hr_df.sample(10)

In [None]:
hr_df.info()

## 4. EDA

**Question 1**: How does satisfaction level influence an employee's decision to leave?

In [None]:
plt.figure(figsize=(10, 5))
sn.histplot(data = hr_df, 
            x = 'satisfaction_level', 
            hue = 'left', 
            bins = np.linspace(0.0, 1.0, 11),  # 10 bins; includes the 0.9-1.0 bin that arange(0.0, 1.0, 0.1) drops
            multiple="stack");

**Question 2**: How does time spent at the company influence an employee's decision to leave?

In [None]:
plt.figure(figsize=(10, 4))
sn.countplot(data = hr_df,
             x = 'time_spend_company',
             hue = 'left');

**Question 3**: Attrition patterns across different departments.

In [None]:
pd.crosstab(hr_df.dept, 
            hr_df.left, 
            normalize='index')

### Ex1: Participant Exercise 

**Question:** How does the last evaluation influence an employee's decision to leave?

## 5. Building a Classification Model

First, we will build a model that predicts *left* from *satisfaction level* alone.

### 5.1 Setting X and y Values

In [None]:
X = hr_df[['satisfaction_level']]
y = hr_df.left

In [None]:
X[0:2]

In [None]:
y[0:2]

In [None]:
plt.figure(figsize=(10, 4))
sn.scatterplot(data = hr_df.sample(100, random_state = 78),
               x = 'satisfaction_level',
               y = 'left');

### 5.2 Logistic Function

Logistic Regression Model - Sigmoid function

<img src="Logistic.png" alt="ML Algorithms" width="500"/>

The probability that y = 1 (here, that the employee leaves) is given by the equation:

$$p(y) = \frac{1}{1+e^{-(\beta_0 + \beta_1x)}}$$


The class of y is then predicted by comparing the estimated probability $\hat p$ against a threshold value of 0.5:


$$
\hat y
= 
\begin{cases}
0 \text{ if } \hat p < 0.5\\
1 \text{ if } \hat p \geqslant 0.5
\end{cases}
$$
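The two formulas above can be sketched in a few lines of NumPy. This is only an illustration: the parameter values `b0_demo` and `b1_demo` are hypothetical, not values fitted to the HR dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration (not fitted values)
b0_demo, b1_demo = 1.0, -4.0

x_demo = np.array([0.1, 0.25, 0.5, 0.9])      # example satisfaction levels
p_hat = sigmoid(b0_demo + b1_demo * x_demo)   # estimated probability of y = 1
y_hat = (p_hat >= 0.5).astype(int)            # apply the 0.5 threshold

print(p_hat)
print(y_hat)
```

With a negative slope, the probability falls as x increases, so only the lower satisfaction levels end up in class 1.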

### 5.3 Split Dataset into train and test

- Train: 80%
- Test: 20%

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, \
y_train, y_test = train_test_split( X,
                                    y,
                                    test_size = 0.2,
                                    random_state = 100 )

In [None]:
X_train.shape

In [None]:
X_test.shape
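A purely random split does not guarantee that both parts contain the same share of leavers, so it can be worth comparing the class proportions after splitting. The sketch below uses a synthetic stand-in for the dataset (the values are made up; only the column names mirror `hr_df`). The `stratify` argument is an optional feature of `train_test_split` and is not used in the split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hr_df: values are made up for illustration only
rng = np.random.default_rng(100)
demo_df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.0, 1.0, size=1000),
    "left": rng.binomial(1, 0.25, size=1000),
})

X_demo = demo_df[["satisfaction_level"]]
y_demo = demo_df["left"]

# stratify=y_demo keeps the share of leavers nearly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # 800 training rows and 200 test rows
print(y_tr.mean(), y_te.mean())  # proportion of leavers in each part
```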

### 5.4 Build Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg_v1 = LogisticRegression(random_state=100)

In [None]:
logreg_v1.fit( X_train, y_train )

### 5.5 Finding Parameters

In [None]:
logreg_v1.intercept_

In [None]:
logreg_v1.coef_

**Note:** What does the negative coefficient mean here?
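One way to read a coefficient is through the odds. Since the log-odds of leaving are $\beta_0 + \beta_1 x$, increasing x by some amount multiplies the odds by $e^{\beta_1 \cdot \text{amount}}$, and a negative $\beta_1$ makes that multiplier smaller than 1. A small worked sketch, using a hypothetical coefficient rather than the fitted one:

```python
import numpy as np

# Hypothetical coefficient for satisfaction_level (not the fitted value)
b1_demo = -3.0

# log-odds = beta_0 + beta_1 * x, so raising x by delta multiplies
# the odds of leaving by exp(beta_1 * delta)
delta = 0.1
odds_multiplier = np.exp(b1_demo * delta)

print(f"Odds multiplier for a +{delta} change in satisfaction: {odds_multiplier:.3f}")
```

Under this assumed value, each 0.1 increase in satisfaction level would multiply the odds of leaving by roughly 0.74, that is, reduce them by about a quarter.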

### 5.6 Probability of leaving at different Satisfaction Scores

- Create a list of satisfaction scores from 0.0 to 1.0

In [None]:
sl_list = np.linspace(0.0, 1.0, 21)  # 0.0, 0.05, ..., 1.0 (arange would exclude the 1.0 endpoint)

In [None]:
sl_list

In [None]:
beta_0 = logreg_v1.intercept_[0]
beta_1 = logreg_v1.coef_[0][0]

In [None]:
beta_0, beta_1

- Calculate the probability values based on the logistic function

In [None]:
sl_probs = [(1.0 / (1.0 + np.exp(-(beta_0+beta_1*x)))) for x in sl_list]

In [None]:
sl_probs_df = pd.DataFrame({'SL': sl_list, 'Prob_Left': sl_probs })

In [None]:
sl_probs_df

In [None]:
plt.figure(figsize=(10, 4))
sn.lineplot(data=sl_probs_df, x="SL", y="Prob_Left");
plt.axhline(y=0.5, color = 'r');
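The manual calculation above should agree with what scikit-learn reports through `predict_proba`. The sketch below checks this on synthetic data (generated here only for illustration, with made-up parameters), using a vectorized form of the logistic function instead of a list comprehension. The names `model_demo`, `grid`, and the data arrays are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: lower satisfaction makes leaving more likely
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=500)
p_true = 1.0 / (1.0 + np.exp(-(1.0 - 4.0 * x_demo)))
y_demo = rng.binomial(1, p_true)

model_demo = LogisticRegression(random_state=100)
model_demo.fit(x_demo.reshape(-1, 1), y_demo)

b0 = model_demo.intercept_[0]
b1 = model_demo.coef_[0][0]

# Vectorized logistic function over a grid of satisfaction scores
grid = np.linspace(0.0, 1.0, 21)
manual_probs = 1.0 / (1.0 + np.exp(-(b0 + b1 * grid)))

# Column 1 of predict_proba holds the probability of class 1 (left = 1)
sklearn_probs = model_demo.predict_proba(grid.reshape(-1, 1))[:, 1]

# Compare the hand-computed probabilities with scikit-learn's
print(np.allclose(manual_probs, sklearn_probs))
```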

### 5.7 Predicting on Test Data

In [None]:
y_pred = logreg_v1.predict(X_test)

In [None]:
y_df = pd.DataFrame({"actual": y_test,
                     "predicted": y_pred})

In [None]:
y_df.sample(10, random_state=20)

### 5.8 Measuring Accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)
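Accuracy on its own can be hard to judge when one class is much more common than the other, so it may help to compare it with a majority-class baseline and to look at the confusion matrix. The sketch below uses small made-up label arrays, not the notebook's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 1 = left, 0 = stayed
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])

acc = accuracy_score(y_true_demo, y_pred_demo)

# Baseline: always predict the majority class (here, "stayed")
baseline_pred = np.zeros_like(y_true_demo)
baseline_acc = accuracy_score(y_true_demo, baseline_pred)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc, baseline_acc)
print(cm)
```

In this made-up example the predictions and the baseline reach the same accuracy, while the confusion matrix shows that only one of the three leavers was identified.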

### Ex2: Participant Exercise

- Build a logistic regression model that predicts left from average_montly_hours
- Predict the probability of leaving for average_montly_hours values from 50 to 400 hours, in steps of 10 hours
- Plot the logistic function