# Classification with _k_-nearest neighbor
In this assignment, I apply what I have learned about machine learning to the Employee Attrition data set in order to predict the variable Attrition.
## Objective
Predict the outcomes in a data set using either Random Forest, Decision Tree or k-NN. Write a Jupyter Notebook report documenting your investigation.
## Tips
Cut the data set down to size. Though not strictly necessary, this is strongly recommended because it makes the assignment easier. Select 7 variables with strong predictive value, based on your knowledge of the topic (domain knowledge) and/or correlation. Remember to subset the data with df[['column1', 'column2', 'column3']]. Don't spend too much time on this step. It's supposed to make the assignment easier, not harder.
If you find the dataset is too big and calculations take too long, take a random sample with the Pandas method .sample() and run the analysis with the entire data at the end.
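
As a rough sketch of both tips, the snippet below subsets a data frame to a few columns and then draws a reproducible random sample. The toy frame `df_demo`, its values, and the choice of `frac=0.5` are made up for illustration; only `.sample()` and the double-bracket column selection come from the tips above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data set
df_demo = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No"],
    "DistanceFromHome": [1, 8, 2, 3, 2, 10],
    "Education": [2, 1, 2, 4, 1, 3],
})

# Keep only the selected columns (note the straight quotes)
df_small = df_demo[["Attrition", "DistanceFromHome"]]

# Draw a random sample of half the rows; random_state makes it reproducible
df_sample = df_small.sample(frac=0.5, random_state=1)
print(df_sample.shape)
```
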
## Included in your Jupyter Notebook
Please add sufficient comments: not just explaining what you are doing, but why you are doing it.
- Which dataset and variables you selected and why
- Your pre-processing steps (e.g., transformations of variables)
- The head() of the resulting data frame
## Classification
Choose one of the following: k-nearest neighbor, decision tree or random forest
- Explain briefly in your own words how the algorithm works
- Split the data set into a training and test set
- Train the model
- Evaluate the predictive performance of your model on the test set
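
To illustrate how k-nearest neighbor works: a new observation is given the label that is most common among the k training observations closest to it. The function below is a minimal sketch of that idea, assuming Euclidean distance and a simple majority vote; `knn_predict` and the toy data are hypothetical and are not the scikit-learn implementation used later in this notebook.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data: two clearly separated groups
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
y_toy = ["No", "No", "No", "Yes", "Yes"]

print(knn_predict(X_toy, y_toy, [2, 2], k=3))
print(knn_predict(X_toy, y_toy, [8, 9], k=3))
```
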

In [47]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

In [48]:
df = pd.read_csv('HR-Employee-Attrition.csv')
df = df.dropna() #first get rid of rows with empty cells
df.head()

,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [49]:
df_sub= df[['Attrition','DistanceFromHome','Education','RelationshipSatisfaction','WorkLifeBalance','YearsInCurrentRole']]
df_sub.head()

,Attrition,DistanceFromHome,Education,RelationshipSatisfaction,WorkLifeBalance,YearsInCurrentRole
0,Yes,1,2,1,1,4
1,No,8,1,4,3,7
2,Yes,2,2,2,3,0
3,No,3,4,3,3,7
4,No,2,1,4,3,2


In [50]:
df_sub['Attrition'].value_counts() #Let's have a look at the 'Attrition' variable

No     1233
Yes     237
Name: Attrition, dtype: int64

In [51]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df_sub[['DistanceFromHome','Education','RelationshipSatisfaction','WorkLifeBalance','YearsInCurrentRole']] #create the X matrix
X = normalize(X) #normalize rescales each row (one employee) to unit length; note that this does not put the columns on the same scale
y = df_sub['Attrition'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
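
One point worth checking here: `normalize` from `sklearn.preprocessing` rescales each row (one employee) to unit length by default, whereas putting the variables on a common scale means rescaling each column. The sketch below contrasts the two on made-up numbers, using `StandardScaler` as one possible column-wise alternative; it is not what the cell above does, and swapping it in would change the results reported below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Made-up rows: [DistanceFromHome, Education]
X_demo = np.array([[1.0, 2.0], [8.0, 1.0], [20.0, 4.0]])

# normalize() rescales each ROW to unit L2 length
X_rows = normalize(X_demo)
print(np.linalg.norm(X_rows, axis=1))

# StandardScaler rescales each COLUMN to mean 0 and standard deviation 1
X_cols = StandardScaler().fit_transform(X_demo)
print(X_cols.mean(axis=0), X_cols.std(axis=0))
```

Because k-NN relies on distances between rows, the choice between row-wise and column-wise scaling can change which neighbors count as nearest.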

In [55]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=3) #create a KNN-classifier with 3 neighbors (the scikit-learn default is 5)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the accuracy on the test data

0.782312925170068

The model classifies 78% of the employees in the test set correctly. So, is that good or bad?

Well, given that about 84% of the employees in the data set did not leave (1233 of 1470), we could actually get a better score than this by predicting that _everyone_ is 'Not attrited'. So, not good, but kind of expected given the variables. Let's look at the _confusion matrix_ to see how well the model tells the two attrition classes apart. A confusion matrix shows, for each actual class, how many cases were predicted as each class.
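
The baseline can be checked with a little arithmetic. This sketch uses the class counts from `value_counts()` above and the test-set row totals from the confusion matrix reported later in this notebook; the variable names are only for illustration.

```python
# Class counts in the whole data set (from value_counts())
total_no, total_yes = 1233, 237
# Class counts in the test set (row totals of the confusion matrix)
test_no, test_yes = 340 + 24, 72 + 5

# Accuracy of always predicting "No"
baseline_all = total_no / (total_no + total_yes)
baseline_test = test_no / (test_no + test_yes)

# Accuracy of the k-NN model: correct predictions / all test cases
knn_accuracy = (340 + 5) / (test_no + test_yes)

print(round(baseline_all, 3), round(baseline_test, 3), round(knn_accuracy, 3))
```
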

In [53]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[340,  24],
       [ 72,   5]], dtype=int64)

In [54]:
#To make it easier to read, let's turn it into a dataframe and add labels (rows = actual class, columns = predicted class).
conf_matrix = pd.DataFrame(cm, index=['No_Attrition', 'Attrition'], columns = ['No_Attrition_p', 'Attrition_p']) 
conf_matrix

,No_Attrition_p,Attrition_p
No_Attrition,340,24
Attrition,72,5


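The way to read this is that of the employees who stayed, 340 are correctly predicted as 'Not attrited' and 24 are incorrectly predicted as 'attrited'. Of the employees who left, 72 are predicted as 'Not attrited' and only 5 are correctly predicted as 'attrited'. The _recall_ and _precision_ for the category 'Not attrited' are: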

$recall = \frac{340}{340 + 24} = .93$

$precision = \frac{340}{340 + 72} = .83$
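
The same two formulas can be applied to both classes at once. The sketch below, which assumes only the confusion matrix values shown above (rows are actual classes, columns are predicted classes), computes recall and precision per class with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook: rows = actual, columns = predicted
cm = np.array([[340, 24],
               [72, 5]])
labels = ["No", "Yes"]

for i, label in enumerate(labels):
    true_pos = cm[i, i]
    recall = true_pos / cm[i, :].sum()      # share of actual class i that was found
    precision = true_pos / cm[:, i].sum()   # share of class-i predictions that were right
    print(f"{label}: recall={recall:.2f}, precision={precision:.2f}")
```

Running it shows that the 'Yes' class scores far lower on both measures than the 'No' class.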


## Finally
I cannot say with confidence that this is a good prediction. Because the proportions of 'Yes' and 'No' are so different, accuracy alone is misleading, and the prediction does not seem accurate enough. For example, the recall and precision for the attrited employees, computed from the confusion matrix above, are very low: recall is 5/77 ≈ .06 and precision is 5/29 ≈ .17.

## Would you please send me some comments?