# Week 5 - Machine Learning

### Objective

Predict the outcomes in a data set using either Random Forest, Decision Tree or k-NN. Write a Jupyter Notebook report documenting your investigation. The dataset I chose is IBM HR Analytics Employee Attrition & Performance, and my goal is to predict employee attrition.


In [62]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier

In [63]:
df = pd.read_csv("employee-attrition.csv")
df = df.dropna()
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


### Preparing the target variable

In [64]:
df['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

Attrition is a categorical variable, so I will convert it to numbers (Yes = 1, No = 0).



In [65]:
# Map the two categories to integers and assign the result back to the column
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

### Selecting variables
To select variables, I will check which ones correlate most strongly with Attrition.

In [66]:
df[df.columns[1:]].corr()['Attrition'][:]

Attrition                   1.000000
DailyRate                  -0.056652
DistanceFromHome            0.077924
Education                  -0.031373
EmployeeCount                    NaN
EmployeeNumber             -0.010577
EnvironmentSatisfaction    -0.103369
HourlyRate                 -0.006846
JobInvolvement             -0.130016
JobLevel                   -0.169105
JobSatisfaction            -0.103481
MonthlyIncome              -0.159840
MonthlyRate                 0.015170
NumCompaniesWorked          0.043494
PercentSalaryHike          -0.013478
PerformanceRating           0.002889
RelationshipSatisfaction   -0.045872
StandardHours                    NaN
StockOptionLevel           -0.137145
TotalWorkingYears          -0.171063
TrainingTimesLastYear      -0.059478
WorkLifeBalance            -0.063939
YearsAtCompany             -0.134392
YearsInCurrentRole         -0.160545
YearsSinceLastPromotion    -0.033019
YearsWithCurrManager       -0.156199
Name: Attrition, dtype: float64

Based on these correlations, I chose the seven variables with the largest absolute correlation with Attrition:

- TotalWorkingYears -0.171063
- JobLevel -0.169105
- YearsInCurrentRole -0.160545
- MonthlyIncome -0.159840
- YearsWithCurrManager -0.156199
- StockOptionLevel -0.137145
- YearsAtCompany -0.134392
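
The ranking behind this choice can be reproduced by sorting the correlations by absolute value. The sketch below is illustrative only: it copies the coefficients printed above into a plain dictionary (the name `corr_with_attrition` is hypothetical) instead of recomputing them from the CSV.

```python
# Correlations with Attrition, copied from the output above
# (the target itself and the NaN entries are left out).
corr_with_attrition = {
    "DailyRate": -0.056652,
    "DistanceFromHome": 0.077924,
    "Education": -0.031373,
    "EmployeeNumber": -0.010577,
    "EnvironmentSatisfaction": -0.103369,
    "HourlyRate": -0.006846,
    "JobInvolvement": -0.130016,
    "JobLevel": -0.169105,
    "JobSatisfaction": -0.103481,
    "MonthlyIncome": -0.159840,
    "MonthlyRate": 0.015170,
    "NumCompaniesWorked": 0.043494,
    "PercentSalaryHike": -0.013478,
    "PerformanceRating": 0.002889,
    "RelationshipSatisfaction": -0.045872,
    "StockOptionLevel": -0.137145,
    "TotalWorkingYears": -0.171063,
    "TrainingTimesLastYear": -0.059478,
    "WorkLifeBalance": -0.063939,
    "YearsAtCompany": -0.134392,
    "YearsInCurrentRole": -0.160545,
    "YearsSinceLastPromotion": -0.033019,
    "YearsWithCurrManager": -0.156199,
}

# Rank by strength of the relationship, ignoring the sign.
ranked = sorted(corr_with_attrition.items(), key=lambda kv: abs(kv[1]), reverse=True)
top_features = [name for name, _ in ranked[:7]]

for name, value in ranked[:7]:
    print(f"{name:<25}{value: .6f}")
```

If the correlations are held in a pandas Series (called `corr` here as an assumption), `corr.drop('Attrition').abs().sort_values(ascending=False).head(7)` should give the same ordering, with the signs removed.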


In [67]:
df = df[['Attrition', 'TotalWorkingYears', 'JobLevel', 'YearsInCurrentRole', 'MonthlyIncome', 'YearsWithCurrManager', 'StockOptionLevel', 'YearsAtCompany' ]]
df = df.dropna()
df.head()

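| | Attrition | TotalWorkingYears | JobLevel | YearsInCurrentRole | MonthlyIncome | YearsWithCurrManager | StockOptionLevel | YearsAtCompany |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 8 | 2 | 4 | 5993 | 5 | 0 | 6 |
| 1 | 0 | 10 | 2 | 7 | 5130 | 7 | 1 | 10 |
| 2 | 1 | 7 | 1 | 0 | 2090 | 0 | 0 | 0 |
| 3 | 0 | 8 | 1 | 7 | 2909 | 0 | 0 | 8 |
| 4 | 0 | 6 | 1 | 2 | 3468 | 2 | 1 | 2 |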


### Selecting an algorithm
I will use the k-nearest neighbours (k-NN) algorithm. It classifies a data point by majority vote among its k nearest neighbours in the training data.
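
To make the idea concrete, here is a minimal from-scratch sketch of k-NN classification in plain Python. It is not how scikit-learn implements the algorithm internally; the function name `knn_predict` and the toy data are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # The most frequent label among those neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data, invented for illustration: two features, labels 0 and 1
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))
print(knn_predict(train_X, train_y, (9.0, 9.0), k=3))
```

Because the vote depends on raw distances, a feature with a large numeric range (such as MonthlyIncome in this dataset) weighs more heavily than small-range features unless the data is rescaled first.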

In [68]:
X = df[['TotalWorkingYears', 'JobLevel', 'YearsInCurrentRole', 'MonthlyIncome', 'YearsWithCurrManager', 'StockOptionLevel', 'YearsAtCompany']]
y = df['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1) 
X_train.head()

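| | TotalWorkingYears | JobLevel | YearsInCurrentRole | MonthlyIncome | YearsWithCurrManager | StockOptionLevel | YearsAtCompany |
|---|---|---|---|---|---|---|---|
| 99 | 17 | 2 | 2 | 2042 | 2 | 1 | 3 |
| 785 | 14 | 3 | 10 | 10322 | 1 | 1 | 11 |
| 918 | 31 | 5 | 10 | 19847 | 10 | 1 | 29 |
| 1335 | 7 | 2 | 2 | 3902 | 2 | 3 | 2 |
| 1182 | 4 | 2 | 2 | 4374 | 2 | 0 | 3 |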


In [69]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier() 
knn = knn.fit(X_train, y_train) 

### Model evaluation

I will start by calculating accuracy.

In [70]:
knn.score(X_test, y_test) 

0.8095238095238095

Accuracy is 80.95%. Because the classes are imbalanced, accuracy alone can be misleading, so I will also create a confusion matrix.

In [71]:
y_test_pred = knn.predict(X_test) 
cm = confusion_matrix(y_test, y_test_pred)
conf_matrix = pd.DataFrame(cm, index=['no - actual', 'yes - actual'], columns = ['no - predicted', 'yes - predicted']) 
conf_matrix

| | no - predicted | yes - predicted |
|---|---|---|
| no - actual | 350 | 14 |
| yes - actual | 70 | 7 |


In [72]:
print(classification_report(y_test, y_test_pred))


              precision    recall  f1-score   support

           0       0.83      0.96      0.89       364
           1       0.33      0.09      0.14        77

   micro avg       0.81      0.81      0.81       441
   macro avg       0.58      0.53      0.52       441
weighted avg       0.75      0.81      0.76       441



The model performs poorly on the attrition ("yes") class: it finds only 7 of the 77 employees who actually left.
- precision for no = 83%
- recall for no = 96%
- precision for yes = 33%
- recall for yes = 9%
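
The figures in the classification report can be recomputed by hand from the confusion matrix counts. The following is a small worked sketch; the variable names (`tn`, `fp`, `fn`, `tp`, `baseline_accuracy`) are chosen here for illustration and do not appear in the notebook.

```python
# Counts taken from the confusion matrix above
tn, fp = 350, 14   # actual "no":  predicted no / predicted yes
fn, tp = 70, 7     # actual "yes": predicted no / predicted yes
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_yes = tp / (tp + fp)   # of the predicted leavers, how many actually left
recall_yes = tp / (tp + fn)      # of the actual leavers, how many were found
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

# Accuracy of always predicting "no", for comparison
baseline_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (yes)   = {precision_yes:.2f}")
print(f"recall (yes)      = {recall_yes:.2f}")
print(f"precision (no)    = {precision_no:.2f}")
print(f"recall (no)       = {recall_no:.2f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```

On these counts, always predicting "no" (364 of 441 correct) scores slightly higher than the model's accuracy, which is why precision and recall for the "yes" class are the more informative figures here.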

In [73]:
for i in range(1, 11):
    knn_new = KNeighborsClassifier(n_neighbors=i)
    knn_new = knn_new.fit(X_train, y_train)
    y_test_pred_new = knn_new.predict(X_test)
    # Label each report with its n_neighbors value
    print("n_neighbors = %d" % i)
    print(classification_report(y_test, y_test_pred_new))

n_neighbors = 1
              precision    recall  f1-score   support

           0       0.84      0.85      0.84       364
           1       0.24      0.22      0.23        77

   micro avg       0.74      0.74      0.74       441
   macro avg       0.54      0.54      0.54       441
weighted avg       0.73      0.74      0.74       441

n_neighbors = 2
              precision    recall  f1-score   support

           0       0.83      0.96      0.89       364
           1       0.25      0.06      0.10        77

   micro avg       0.80      0.80      0.80       441
   macro avg       0.54      0.51      0.50       441
weighted avg       0.73      0.80      0.75       441

n_neighbors = 3
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       364
           1       0.24      0.13      0.17        77

   micro avg       0.78      0.78      0.78       441
   macro avg       0.54      0.52      0.52       441
weighted avg       0.73      0.78      0.75       441

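The situation is similar for the other values of n_neighbors, so changing this parameter alone does not fix the poor performance on the attrition class.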