# Data Mining and Machine Learning - Assignment 2

> Topics Covered: Classification (week 6 to week 7)

**Due: Sunday November 21, 11.59pm**


To complete the assignment you have to do ***both***:

1. Complete the exercises and [submit your Python notebook](https://moodle.unil.ch/mod/assign/view.php?id=1182268)
2. Answer the questions in the [quiz on Moodle](https://moodle.unil.ch/mod/quiz/view.php?id=1182280)
>Note: You can only complete the quiz *one time*. Have your notebook ready with your solutions for answering the quiz. 

The answers to the quiz should be supported by your code in the notebook. If they are not, you will not receive points for them.

**IMPORTANT!** You can discuss the questions with other students but **do not exchange code!** This is individual work. We will run your code and check for similarities.

You can post your questions in the [Slack channel #assignment_questions](https://app.slack.com/client/T02C4KVGVMX/C02BBA2TFQF).


If the questions need further clarification after the assignment is released, we will update this file on GitHub, so make sure you check the git repo for updates.

To get started, run the first few cells to load the dataset and then check out the questions.

Good luck!

In [295]:
# Import required packages

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import seaborn as sns
sns.set_style("whitegrid")

**Important!**

**For all the questions below, fix the seed of random generators to 72.**

In [296]:
# np.random.seed is a function: call it (assigning to it would overwrite it and seed nothing)
np.random.seed(72)
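Seeding only takes effect when `np.random.seed` is called as a function. The following is a minimal sketch on toy draws (not part of the assignment data) showing that re-seeding with the same value reproduces the same numbers:

```python
import numpy as np

# Calling the function seeds NumPy's global random generator.
np.random.seed(72)
first_draw = np.random.rand(3)

# Re-seeding with the same value reproduces the same sequence.
np.random.seed(72)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```

Note that the scikit-learn calls in the cells below take their own `random_state=72` argument, so they are seeded explicitly.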

# Classification

For the first part we use employee retention data from [Kaggle](https://www.kaggle.com/pankeshpatel/hrcommasep). The dataset contains HR information on a company's employees such as:
* satisfaction level, ranges from 0 to 1
* score of the last evaluation they received, ranges from 0 to 1
* number of projects in which the employee is involved 
* average number of hours worked per month
* years spent with the company
* whether they experienced a work accident (1 if yes, 0 if no)
* whether they left their job (1) or stayed with the company (0)
* whether they received a promotion in the last 5 years (1 if yes, 0 if no)
* the department in which they work
* whether their salary was low, medium or high.

### _Your task is to build a model that predicts whether an employee stays (0) or leaves the company (1)._

### 1. Load the data

In [297]:
# Load data
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/HR_comma_sep.csv')
df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


#### 1.1. How many rows and columns does the dataset have?

In [298]:
#your code here
df.shape

(14999, 10)

#### 1.2. Looking at the `left` column, which shows 1 if an employee left their job and 0 if they stayed, calculate and show the frequency of each class in the total dataset. 

In [299]:
df.left.value_counts() / len(df.left)


0    0.761917
1    0.238083
Name: left, dtype: float64


### 2. Encode categorical variables


For the following categorical features:
- encode `salary` with two different encoders, a label encoder and an ordinal encoder (keep both), and
- encode `department` with one-hot encoding.



In [300]:
# import some additional packages
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

#### 2.1. Use a LabelEncoder to encode `salary` as `le_salary`.

In [301]:
le = LabelEncoder()
df['le_salary'] = pd.Series(le.fit_transform(df.salary))
# LabelEncoder assigns codes in alphabetical order of the labels:
# high -> 0, low -> 1, medium -> 2
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary,le_salary
0,0.38,0.53,2,157,3,0,1,0,sales,low,1
1,0.80,0.86,5,262,6,0,1,0,sales,medium,2
2,0.11,0.88,7,272,4,0,1,0,sales,medium,2
3,0.72,0.87,5,223,5,0,1,0,sales,low,1
4,0.37,0.52,2,159,3,0,1,0,sales,low,1
...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low,1
14995,0.37,0.48,2,160,3,0,1,0,support,low,1
14996,0.37,0.53,2,143,3,0,1,0,support,low,1
14997,0.11,0.96,6,280,4,0,1,0,support,low,1
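To see which integer a `LabelEncoder` assigned to each label, one option is to inspect its `classes_` attribute. This is a small sketch on toy values (the list `toy_salary` and the names `toy_le`, `toy_codes` and `mapping` are illustrative, not part of the assignment):

```python
from sklearn.preprocessing import LabelEncoder

toy_salary = ["low", "medium", "high", "low"]  # illustrative values only

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_salary)

# classes_ lists the labels in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(mapping)             # {'high': 0, 'low': 1, 'medium': 2}
print(toy_codes.tolist())  # [1, 2, 0, 1]
```

Because the codes follow alphabetical order, they do not reflect the natural ordering low < medium < high.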


#### 2.2. Use an Ordinal Encoder to encode `salary` as `oe_salary`. For the ordinal encoding, define your own mapping such that `low` corresponds to 0, `medium` to 1 and `high` to 2.

In [302]:
df['oe_salary'] = [0 if x == 'low'
                  else 1 if x == 'medium'
                  else 2 for x in df.salary]
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary,le_salary,oe_salary
0,0.38,0.53,2,157,3,0,1,0,sales,low,1,0
1,0.80,0.86,5,262,6,0,1,0,sales,medium,2,1
2,0.11,0.88,7,272,4,0,1,0,sales,medium,2,1
3,0.72,0.87,5,223,5,0,1,0,sales,low,1,0
4,0.37,0.52,2,159,3,0,1,0,sales,low,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low,1,0
14995,0.37,0.48,2,160,3,0,1,0,support,low,1,0
14996,0.37,0.53,2,143,3,0,1,0,support,low,1,0
14997,0.11,0.96,6,280,4,0,1,0,support,low,1,0
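The same mapping can also be obtained with scikit-learn's `OrdinalEncoder` by passing the category order explicitly. This is a hedged sketch on a toy frame (the `toy` DataFrame is an assumption for illustration, not the assignment data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})  # toy data

# The position of each label in `categories` becomes its integer code.
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
toy["oe_salary"] = oe.fit_transform(toy[["salary"]]).ravel().astype(int)

print(toy["oe_salary"].tolist())  # [0, 1, 2, 0]
```

An explicit category list has the advantage that an unexpected label raises an error during fitting instead of silently receiving a code.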


#### 2.3. Encode the `department` column with one hot encoding.

In [303]:
#dpt_1hotencoding = pd.get_dummies(df, columns=['department'])
#df
# with dummies it did not work
# try to do 1hot encoding the other way (as seen in class)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
department_ohe = ohe.fit_transform(df[["department"]])
department_ohe

<14999x10 sparse matrix of type '<class 'numpy.float64'>'
	with 14999 stored elements in Compressed Sparse Row format>

In [304]:
#ohe.categories_

In [305]:
#department_ohe[0].todense()

In [306]:
department_ohe = pd.get_dummies(df[["department"]])
department_ohe.head()

Unnamed: 0,department_IT,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical
0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,1,0,0


In [307]:
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary,le_salary,oe_salary
0,0.38,0.53,2,157,3,0,1,0,sales,low,1,0
1,0.80,0.86,5,262,6,0,1,0,sales,medium,2,1
2,0.11,0.88,7,272,4,0,1,0,sales,medium,2,1
3,0.72,0.87,5,223,5,0,1,0,sales,low,1,0
4,0.37,0.52,2,159,3,0,1,0,sales,low,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low,1,0
14995,0.37,0.48,2,160,3,0,1,0,support,low,1,0
14996,0.37,0.53,2,143,3,0,1,0,support,low,1,0
14997,0.11,0.96,6,280,4,0,1,0,support,low,1,0


#### 2.4. Now concatenate all the features (`department` one hot-encoded, and the two versions of encoded `salary`) to the initial dataframe. You can use the `pd.concat` function. 
> Hint: You should have a total of 22 features in the concatenated dataset.

In [308]:
df = pd.concat([df, department_ohe], axis=1)
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary,le_salary,oe_salary,department_IT,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical
0,0.38,0.53,2,157,3,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0
1,0.80,0.86,5,262,6,0,1,0,sales,medium,2,1,0,0,0,0,0,0,0,1,0,0
2,0.11,0.88,7,272,4,0,1,0,sales,medium,2,1,0,0,0,0,0,0,0,1,0,0
3,0.72,0.87,5,223,5,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0
4,0.37,0.52,2,159,3,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0
14995,0.37,0.48,2,160,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0
14996,0.37,0.53,2,143,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0
14997,0.11,0.96,6,280,4,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0


#### 2.5. Create a new column, `eval_spent` equal to the product of two of the existing columns: the `evaluation score` and the `time spent` with the company.

In [309]:
df['eval_spent'] = df['last_evaluation'] * df['time_spent_company']
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,work_accident,left,promotion_last_5years,department,salary,le_salary,oe_salary,department_IT,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical,eval_spent
0,0.38,0.53,2,157,3,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0,1.59
1,0.80,0.86,5,262,6,0,1,0,sales,medium,2,1,0,0,0,0,0,0,0,1,0,0,5.16
2,0.11,0.88,7,272,4,0,1,0,sales,medium,2,1,0,0,0,0,0,0,0,1,0,0,3.52
3,0.72,0.87,5,223,5,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0,4.35
4,0.37,0.52,2,159,3,0,1,0,sales,low,1,0,0,0,0,0,0,0,0,1,0,0,1.56
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0,1.71
14995,0.37,0.48,2,160,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0,1.44
14996,0.37,0.53,2,143,3,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0,1.59
14997,0.11,0.96,6,280,4,0,1,0,support,low,1,0,0,0,0,0,0,0,0,0,1,0,3.84


### 3. Train a Logistic Regression model with Cross Validation (with Label-Encoded Salary)

For this section, train a logistic regression model with cross-validation on the employee retention dataset. Use all of the features (independent variables) available in the concatenated dataset, but use only one encoded `salary` column at a time. 

Your dependent variable (y) is the column named `left`.

You can then compare your logistic regression results when using the label-encoded salary and when using the ordinal-encoded salary.

#### 3.1. Set the y and X variables. This time using `le_salary`.

> Hint: X should have a total of 19 features, namely: 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'le_salary', 'eval_spent'.

In [310]:
X = df.drop(['left', 'department', 'salary', 'oe_salary'],axis=1)
y = df['left']

#### 3.2. Train/test splitting: Now split the data into 80% training and 20% test set. Remember to set the random seed to 72.


In [311]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

#### 3.3. What is the base rate of the classification problem (whether an employee stays or leaves)?
> Hint: calculate the frequency of the most common class on the test set for y

In [312]:
# method 1
from sklearn.dummy import DummyClassifier

# instantiate with the "most frequent" parameter
dummy = DummyClassifier(strategy='most_frequent')

# fit it as if we had no X features to train it on
dummy.fit(None, y_train)

#compute test baseline and store it for later
baseline = dummy.score(None, y_test)
baseline


0.7626666666666667

In [313]:
# method 2
y_test.value_counts(normalize=True)
y_test_most_common_class = max(y_test.value_counts(normalize=True))
y_test_most_common_class

0.7626666666666667

#### 3.4. Finally, train a Logistic Regression model with cross validation. Use the following parameters for Logistic Regression.

`LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)`

In [314]:
log_reg_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)
log_reg_cv.fit(X_train, y_train)


LogisticRegressionCV(cv=5, max_iter=1000, random_state=72)

#### 3.5. What is the test accuracy? What is the train accuracy?

In [315]:
print(log_reg_cv.score(X_test, y_test))
print(log_reg_cv.score(X_train, y_train))

0.8006666666666666
0.7927327277273106


In [316]:
# storing the first test-accuracy score for comparing it later
first_test_accuracy_score = log_reg_cv.score(X_test, y_test)
print(first_test_accuracy_score)

0.8006666666666666


#### 3.6. Calculate the precision and recall on the test set.

In [317]:
y_pred = log_reg_cv.predict(X_test)
y_pred


array([0, 0, 0, ..., 1, 0, 0])

In [318]:
def evaluate(test, pred):
    """Print the confusion matrix, accuracy, precision, recall and F1 score."""
    # precision, recall and F1 are computed for the positive class (1 = left)
    precision = precision_score(test, pred)
    recall = recall_score(test, pred)
    f1 = f1_score(test, pred)
    print(f'CONFUSION MATRIX:\n{confusion_matrix(test, pred)}')
    print(f"ACCURACY SCORE:\n{accuracy_score(test, pred) :.4f}")
    print(f'CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}')

In [319]:
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[2127  161]
 [ 437  275]]
ACCURACY SCORE:
0.8007
CLASSIFICATION REPORT:
	Precision: 0.6307
	Recall: 0.3862
	F1_Score: 0.4791
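The reported metrics can be checked by hand from the four counts in the confusion matrix above (rows are the true class, columns the predicted class). A short worked example in plain Python:

```python
# Counts copied from the confusion matrix above.
tn, fp = 2127, 161   # true class 0 (stayed)
fn, tp = 437, 275    # true class 1 (left)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted to leave, the share who left
recall = tp / (tp + fn)      # of those who left, the share that was detected
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.8007 0.6307 0.3862 0.4791
```

The low recall shows that the model misses most of the employees who actually left, even though the accuracy is above the base rate.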


#### 3.7. Plot the confusion matrix

In [320]:
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['predicted: 0', 'predicted: 1 (left)'], index= ['true: 0', 'true: 1 (left)'])


Unnamed: 0,predicted: 0,predicted: 1 (left)
true: 0,2127,161
true: 1 (left),437,275
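For a graphical version of the table, one possibility is a simple heatmap drawn with matplotlib. This is only a sketch: the counts are copied from the table above, and the axis labels and colour map are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Counts copied from the table above (rows: true class, columns: predicted class).
cm = np.array([[2127, 161],
               [437, 275]])
labels = ["0 (stayed)", "1 (left)"]

fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(cm, cmap="Blues")
ax.set_xticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")

# Write each count inside its cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")
```

In the notebook, `cm` could instead be taken from `confusion_matrix(y_test, y_pred)`.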


### 4. Logistic Regression with Cross Validation (with Ordinal-Encoded Salary)
#### 4.1. One more time, set the y and X variables, this time using `oe_salary` instead of `le_salary`.

> Hint: X should contain the following 19 features: 'satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'time_spent_company', 'eval_spent','work_accident',
       'promotion_last_5years', ('IT',), ('RandD',), ('accounting',),
       ('hr',), ('management',), ('marketing',), ('product_mng',),
       ('sales',), ('support',), ('technical',), 'oe_salary'.

In [321]:
X = df.drop(['left', 'department', 'salary', 'le_salary'],axis=1)
y = df['left']



#### 4.2. Train/test splitting: Now split the data into 80% training and 20% test set. Remember to set the random seed to 72.


In [322]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)


#### 4.3. Finally, train a Logistic Regression model with cross validation. Use the same parameters as before for `LogisticRegressionCV`. These are copied again below.

`LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)`

In [323]:
log_reg_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)
log_reg_cv.fit(X_train, y_train)


LogisticRegressionCV(cv=5, max_iter=1000, random_state=72)

#### 4.4. What are the accuracy, precision and recall on the test set?

In [324]:
print(log_reg_cv.score(X_test, y_test))

0.8206666666666667


In [325]:
y_pred = log_reg_cv.predict(X_test)
evaluate(y_test, y_pred)

CONFUSION MATRIX:
[[2110  178]
 [ 360  352]]
ACCURACY SCORE:
0.8207
CLASSIFICATION REPORT:
	Precision: 0.6642
	Recall: 0.4944
	F1_Score: 0.5668


In [326]:
# store your test accuracy for later
second_test_accuracy = log_reg_cv.score(X_test, y_test)

#### 4.5. Plot the confusion matrix.

In [327]:
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['predicted: 0', 'predicted: 1 (left)'], index= ['true: 0', 'true: 1 (left)'])


Unnamed: 0,predicted: 0,predicted: 1 (left)
true: 0,2110,178
true: 1 (left),360,352


### 5. Logistic Regression with Standardisation and Cross Validation

Try to improve the previous model under point 4 (with `oe_salary`) using standardisation. 

#### 5.1. Standardize only these numerical features: `satisfaction_level`, `last_evaluation`, `number_project`, `average_monthly_hours`, `time_spent_company`, `eval_spent`, so that each has a mean of zero and a standard deviation of 1. You can use the Scikit-learn `StandardScaler` class.
> Hint: remember to use only the training set for fitting the StandardScaler, then apply the scaler (transform) to both the train and test sets.
>
> Hint 2: X should contain the following 19 features: 'satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'time_spent_company', 'eval_spent','work_accident',
       'promotion_last_5years', ('IT',), ('RandD',), ('accounting',),
       ('hr',), ('management',), ('marketing',), ('product_mng',),
       ('sales',), ('support',), ('technical',), 'oe_salary'.


In [328]:
X = df.drop(['left', 'department', 'salary', 'le_salary'],axis=1)
y = df['left']

In [329]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

In [330]:
from sklearn.compose import ColumnTransformer
numerical_cols=["satisfaction_level", "last_evaluation", "number_project", "average_monthly_hours", "time_spent_company", "eval_spent"]
scaler=StandardScaler()
preprocessor = ColumnTransformer([('standardization', scaler, numerical_cols)], remainder='passthrough')
#encoded_X_train = preprocessor.fit_transform(X_train)
#encoded_X_test = preprocessor.transform(X_test)    
preprocessor.fit(X_train, y_train)
encoded_X_train = preprocessor.transform(X_train)
encoded_X_test = preprocessor.transform(X_test)                                                           
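An alternative to transforming the train and test sets by hand is to chain the preprocessor and the classifier in a scikit-learn `Pipeline`, so that the scaling statistics are always learned from the rows passed to `fit` and re-applied at prediction time. The sketch below uses synthetic data and hypothetical column names (`hours`, `flag`) and is not the assignment's required solution:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are illustrative only.
rng = np.random.default_rng(72)
toy_X = pd.DataFrame({
    "hours": rng.normal(200, 50, size=100),
    "flag": rng.integers(0, 2, size=100),
})
toy_y = (toy_X["hours"] > 200).astype(int)

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([("scale", StandardScaler(), ["hours"])],
                               remainder="passthrough")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics from the training rows only;
# predict() re-applies them to new rows before classifying.
toy_pipe.fit(toy_X.iloc[:80], toy_y.iloc[:80])
toy_pred = toy_pipe.predict(toy_X.iloc[80:])
print(toy_pred.shape)  # (20,)
```

With this design there is no separate `encoded_X_test` to keep track of, which removes one way of forgetting to scale new data.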

In [331]:
# instantiate
# scaler = StandardScaler()
# fit on train set only
# scaler.fit(X_train, y_train)
# apply to the train set and your test
# X_train_s = scaler.transform(X_train)
# X_test_s = scaler.transform(X_test)

#### 5.2. Training

Train a Logistic Regression model with Cross Validation on the pre-processed dataset to which you applied standardisation. Use the same parameters for LogisticRegressionCV as before. These are copied below.

`LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)`



In [332]:
log_reg_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state=72)
log_reg_cv.fit(encoded_X_train, y_train)


LogisticRegressionCV(cv=5, max_iter=1000, random_state=72)

#### 5.3. What is the test accuracy? What is the train accuracy?

In [333]:
print(log_reg_cv.score(encoded_X_test, y_test))
print(log_reg_cv.score(encoded_X_train, y_train))


0.8226666666666667
0.823151929327444


In [334]:
# store the test accuracy for later
third = log_reg_cv.score(encoded_X_test, y_test)

#### 5.4. Calculate the precision and recall on the test set.

In [335]:
y_pred = log_reg_cv.predict(encoded_X_test)
evaluate(y_test, y_pred)


CONFUSION MATRIX:
[[2104  184]
 [ 348  364]]
ACCURACY SCORE:
0.8227
CLASSIFICATION REPORT:
	Precision: 0.6642
	Recall: 0.5112
	F1_Score: 0.5778


#### 5.5. Show the confusion matrix for the test set

#### 5.6. Plot the confusion matrix

In [336]:
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['predicted: 0', 'predicted: 1 (left)'], index= ['true: 0', 'true: 1 (left)'])



Unnamed: 0,predicted: 0,predicted: 1 (left)
true: 0,2104,184
true: 1 (left),348,364


#### 5.7. Use the logistic regression model with standardisation and cross validation to predict whether an employee with the following characteristics will stay on the job or leave:

> An employee from the sales department, low salary, satisfaction 0.43, last evaluation 0.97, involved in 6 projects, working 284 monthly hours, 4 years spent with the company, 0 work accident and 0 promotions in the last 5 years.

In [337]:
# these are the X-values for which you should make the prediction
X_new = pd.DataFrame({'satisfaction_level': 0.43, 'last_evaluation': 0.97, 'number_project': 6,
       'average_monthly_hours': 284, 'time_spent_company': 4, 'eval_spent': 3.88, 
       'work_accident': 0, 'promotion_last_5years': 0, ('IT',): 0, ('RandD',): 0, ('accounting',): 0,
       ('hr',): 0, ('management',): 0, ('marketing',): 0, ('product_mng',): 0,
       ('sales',): 1, ('support',): 0, ('technical',): 0, 'oe_salary': 0}, index=[0])
X_new


Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,eval_spent,work_accident,promotion_last_5years,"(IT,)","(RandD,)","(accounting,)","(hr,)","(management,)","(marketing,)","(product_mng,)","(sales,)","(support,)","(technical,)",oe_salary
0,0.43,0.97,6,284,4,3.88,0,0,0,0,0,0,0,0,0,1,0,0,0


In [338]:
#X_new = preprocessor.transform(X_new)
#X_new = preprocessor.fit_transform(X_new)
y_new_pred = log_reg_cv.predict(X_new)
print(y_new_pred)

[1]
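A fitted preprocessor or model expects a new row to carry the same column names, in the same order, as the training data, and the numerical features still need to be standardised with the fitted preprocessor before calling `predict`. The sketch below covers only the alignment step, on a shortened, hypothetical column list (`train_columns` stands in for `X_train.columns`):

```python
import pandas as pd

# Hypothetical training column order (a shortened stand-in for X_train.columns).
train_columns = ["satisfaction_level", "oe_salary", "department_hr", "department_sales"]

# New row whose dummy columns use tuple names and a different order.
toy_new = pd.DataFrame(
    {"satisfaction_level": 0.43, ("hr",): 0, ("sales",): 1, "oe_salary": 0},
    index=[0],
)

# Rename tuple labels to the `department_*` style, then reorder to match training.
toy_new.columns = [
    f"department_{c[0]}" if isinstance(c, tuple) else c for c in toy_new.columns
]
toy_new = toy_new[train_columns]
print(list(toy_new.columns))
```

After alignment, the row can be passed through the fitted preprocessor's `transform` method and then to the model.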




### 6. KNN with standardisation

#### 6.1. Train a model using a K-Nearest Neighbours (KNN) algorithm, setting `knn = KNeighborsClassifier(n_neighbors=2)`. Use the same features as in the previous model (`oe_salary`) and standardisation.

> Hint: X should contain the following 19 features: 'satisfaction_level', 'last_evaluation', 'number_project',
       'average_monthly_hours', 'time_spent_company', 'eval_spent','work_accident',
       'promotion_last_5years', ('IT',), ('RandD',), ('accounting',),
       ('hr',), ('management',), ('marketing',), ('product_mng',),
       ('sales',), ('support',), ('technical',), 'oe_salary'.

In [339]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
X = df.drop(['left', 'department', 'salary', 'le_salary'],axis=1)
y = df['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)
from sklearn.compose import ColumnTransformer
numerical_cols=["satisfaction_level", "last_evaluation", "number_project", "average_monthly_hours", "time_spent_company", "eval_spent"]
scaler=StandardScaler()
preprocessor = ColumnTransformer([('standardization', scaler, numerical_cols)], remainder='passthrough')
#encoded_X_train = preprocessor.fit_transform(X_train)
#encoded_X_test = preprocessor.transform(X_test)
preprocessor.fit(X_train, y_train)
encoded_X_train = preprocessor.transform(X_train)
encoded_X_test = preprocessor.transform(X_test)
knn.fit(encoded_X_train, y_train)

KNeighborsClassifier(n_neighbors=2)

#### 6.2. What are the train and test accuracies?

In [340]:
print(knn.score(encoded_X_train, y_train))
print(knn.score(encoded_X_test, y_test))


0.9924993749479123
0.9703333333333334


In [341]:
# store the test-accuracy for later
fourth_test_score = knn.score(encoded_X_test, y_test)


#### 6.3. What are the precision and recall?

In [342]:
y_pred = knn.predict(encoded_X_test)
evaluate(y_test, y_pred)


CONFUSION MATRIX:
[[2248   40]
 [  49  663]]
ACCURACY SCORE:
0.9703
CLASSIFICATION REPORT:
	Precision: 0.9431
	Recall: 0.9312
	F1_Score: 0.9371


#### 6.4. Print and plot the confusion matrix

In [343]:
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['predicted: 0', 'predicted: 1 (left)'], index= ['true: 0', 'true: 1 (left)'])


Unnamed: 0,predicted: 0,predicted: 1 (left)
true: 0,2248,40
true: 1 (left),49,663


#### 6.5. Use the K-Nearest Neighbours (KNN) model trained above to predict whether an employee with the characteristics outlined below will stay on the job. This is the same employee as before under point 5.7.

> An employee from the sales department, low salary, satisfaction 0.43, last evaluation 0.97, involved in 6 projects, working 284 monthly hours, 4 years spent with the company, 0 work accident and 0 promotions in the last 5 years. (same as before)
>
> Hint: do not forget to apply standardisation to the data for the new employee for which you want to make the prediction.

In [344]:
# these are the X-values for which you should make the prediction
X_new = pd.DataFrame({'satisfaction_level': 0.43, 'last_evaluation': 0.97, 'number_project': 6,
       'average_monthly_hours': 284, 'time_spent_company': 4, 'eval_spent': 3.88, 
       'work_accident': 0, 'promotion_last_5years': 0, ('IT',): 0, ('RandD',): 0, ('accounting',): 0,
       ('hr',): 0, ('management',): 0, ('marketing',): 0, ('product_mng',): 0,
       ('sales',): 1, ('support',): 0, ('technical',): 0, 'salary_encoded': 0}, index=[0])
X_new

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spent_company,eval_spent,work_accident,promotion_last_5years,"(IT,)","(RandD,)","(accounting,)","(hr,)","(management,)","(marketing,)","(product_mng,)","(sales,)","(support,)","(technical,)",salary_encoded
0,0.43,0.97,6,284,4,3.88,0,0,0,0,0,0,0,0,0,1,0,0,0


In [345]:
#X_new = scaler.fit_transform(X_new)
#preprocessor.fit(X_new)
#X_new = scaler.transform(X_new)
X_new = preprocessor.fit_transform(X_new)
y_new_pred = knn.predict(X_new)
print(y_new_pred)

[0]
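When standardising a single new observation, the scaler fitted on the training set should be applied with `transform`. Calling `fit_transform` on the new row re-estimates the mean from that row alone, so every standardised value becomes 0. A minimal sketch on toy numbers (not the assignment data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[150.0], [200.0], [250.0]])  # toy training column
toy_row = np.array([[284.0]])                      # one new observation

fitted = StandardScaler().fit(toy_train)

# transform() reuses the training mean and standard deviation.
scaled_with_train_stats = fitted.transform(toy_row)

# fit_transform() on the single row re-estimates the mean from that row alone,
# so the value is centred on itself and becomes 0.
scaled_after_refit = StandardScaler().fit_transform(toy_row)

print(scaled_with_train_stats[0, 0], scaled_after_refit[0, 0])
```

A distance-based model such as KNN is particularly sensitive to this, because a row of zeros sits at the centre of the standardised training data regardless of the employee's actual values.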




### 7. Decision Tree Classifier and Standardisation

#### 7.1. Use the same features as before (with `oe_salary`) and standardisation to train your model of employee retention. Use the following parameters for your decision tree: `DecisionTreeClassifier(max_depth=7, random_state=72)`

> Hint: X should have a total of 19 features, namely: 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent'.


In [346]:
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier(max_depth=7, random_state=72)

In [347]:
X = df.drop(['left', 'department', 'salary', 'le_salary'],axis=1)
y = df['left']
#tree = DecisionTreeClassifier()
tree = DecisionTreeClassifier(max_depth=7, random_state=72)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)
from sklearn.compose import ColumnTransformer
numerical_cols=["satisfaction_level", "last_evaluation", "number_project", "average_monthly_hours", "time_spent_company", "eval_spent"]
scaler=StandardScaler()
preprocessor = ColumnTransformer([('standardization', scaler, numerical_cols)], remainder='passthrough')
encoded_X_train = preprocessor.fit_transform(X_train)
encoded_X_test = preprocessor.transform(X_test)
tree.fit(encoded_X_train, y_train)


DecisionTreeClassifier(max_depth=7, random_state=72)

#### 7.2. Calculate the test and train accuracy.

In [348]:
print(tree.score(encoded_X_test, y_test))
print(tree.score(encoded_X_train, y_train))


0.9826666666666667
0.9809984165347112


In [349]:
# store the test accuracy for later
fifth = tree.score(encoded_X_test, y_test)

#### 7.3. Calculate precision and recall on the test set

In [350]:
y_pred = tree.predict(encoded_X_test)
evaluate(y_test, y_pred)


CONFUSION MATRIX:
[[2283    5]
 [  47  665]]
ACCURACY SCORE:
0.9827
CLASSIFICATION REPORT:
	Precision: 0.9925
	Recall: 0.9340
	F1_Score: 0.9624


#### 7.4. Plot the confusion matrix

In [351]:
pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['predicted: 0', 'predicted: 1 (left)'], index= ['true: 0', 'true: 1 (left)'])


Unnamed: 0,predicted: 0,predicted: 1 (left)
true: 0,2283,5
true: 1 (left),47,665


### 8. Accuracy improvement

Generate a plot of the test accuracy scores obtained for the different models trained, including the base rate you calculated at the beginning.

In [352]:
# your code here
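One possible way to draw this comparison is a bar chart. The sketch below hard-codes the test accuracies printed in the outputs above (rounded to four decimals); the model labels and the y-axis range are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Test accuracies copied from the outputs above.
scores = {
    "Base rate": 0.7627,
    "LogReg (le_salary)": 0.8007,
    "LogReg (oe_salary)": 0.8207,
    "LogReg + scaling": 0.8227,
    "KNN + scaling": 0.9703,
    "Tree + scaling": 0.9827,
}

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylim(0.7, 1.0)  # zoom in so that the differences are visible
ax.set_ylabel("Test accuracy")
ax.set_title("Test accuracy by model")
ax.tick_params(axis="x", rotation=30)
```

In the notebook, the stored variables (`baseline`, `first_test_accuracy_score`, `second_test_accuracy`, `third`, `fourth_test_score`, `fifth`) could be used in place of the hard-coded numbers.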
