# Instructions

+ Rename this file to `<your nusnetid>.ipynb`. For example, `e0321286.ipynb`.
+ The completed assignment must be uploaded to LumiNUS (Files > Assignment Submissions > PA3)
+ Submission deadline: 11AM, 9th October
+ Late Submission Rules: (1) 10 points will be deducted if submitted after 11 AM, 9th October and before 11 AM, 10th October; (2) Submissions after 11 AM, 10th October will not be graded and will be given **zero** marks.

+ This assignment is worth 20 marks in total. There are two sections, I and II.
+ The  <font color='purple'> marks </font> associated with each question are shown at the end of the question in  <font color='purple'> purple </font>.
+ Questions in <font color='blue'> blue </font> are not graded, but you are encouraged to answer them for your own understanding.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, precision_score, recall_score, confusion_matrix

## I. Understanding Grid Search and Pipelines



**Q1) Reimplement lines 2-7 of the code below by making all of the following modifications.** <font color='purple'> (2) </font> 

(a) Use sklearn's KFold() and find the best estimator using AUC scores obtained in each cross-validation step.

(b) Do NOT use GridSearchCV(). You can use Pipeline(). 

Select the best combination of hyperparameters based on 3-fold cross-validation scores. 

Print the average AUC score over the cross-validation folds for each hyperparameter combination.

In [2]:
#1
X,Y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, n_classes=2, n_clusters_per_class=1, shift=2,scale=None, random_state=0)
#2
parameters = dict(pca__n_components=[5,8],lr__C=[1, 3])
#3
estA = Pipeline(steps=[('pca', PCA()), ('lr', LogisticRegression())])
#4
estimator_A = GridSearchCV(estA, parameters, cv=3, scoring = 'roc_auc')
#5
estimator_A.fit(X,Y)
#6
A = estimator_A.best_estimator_
#7
print(estimator_A.best_params_)

{'lr__C': 1, 'pca__n_components': 8}




In [3]:
A = Pipeline(steps=[('pca', PCA(n_components=5)), ('lr', LogisticRegression(C=1))])
B = Pipeline(steps=[('pca', PCA(n_components=8)), ('lr', LogisticRegression(C=1))])
C = Pipeline(steps=[('pca', PCA(n_components=5)), ('lr', LogisticRegression(C=3))])
D = Pipeline(steps=[('pca', PCA(n_components=8)), ('lr', LogisticRegression(C=3))])

ascores, bscores, cscores, dscores = [],[],[],[]
kf = KFold(n_splits=3)

#Grid search is a brute-force approach that trains and tests each estimator 
#on all folds and chooses the best scoring estimator's combination of parameters
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    
    A.fit(X_train, y_train)
    y_pred = A.predict_proba(X_test)[:,1]
    ascores.append(roc_auc_score(y_test,y_pred))
    
    B.fit(X_train, y_train)
    y_pred = B.predict_proba(X_test)[:,1]
    bscores.append(roc_auc_score(y_test,y_pred))
    
    C.fit(X_train, y_train)
    y_pred = C.predict_proba(X_test)[:,1]
    cscores.append(roc_auc_score(y_test,y_pred))
    
    D.fit(X_train, y_train)
    y_pred = D.predict_proba(X_test)[:,1]
    dscores.append(roc_auc_score(y_test,y_pred))
    
print([np.mean(x) for x in [ascores, bscores, cscores, dscores]])
#B (PCA with 8 components, LR with C=1) has the highest mean AUC
print('Best parameters: PCA 8 components, LR C=1')

[0.8908179012345679, 0.9488425925925926, 0.8908179012345679, 0.942746913580247]
Best parameters: PCA 8 components, LR C=1


**MS**
+ 1.5 marks for a correct implementation, i.e. with 4 different classifiers.
+ 0.5 marks for printing the correct best parameters.
+ 0 mark if grid search CV is not implemented correctly.
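
The four hand-written pipelines in the solution above can also be enumerated programmatically. The following is a minimal sketch, assuming that scikit-learn's `ParameterGrid` and `clone` helpers are acceptable under restriction (b); the function name `manual_grid_search` is hypothetical and not part of the assignment.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.pipeline import Pipeline


def manual_grid_search(pipeline, param_grid, X, y, n_splits=3):
    """Hypothetical helper: return (best_params, results), where results
    is a list of (params, mean AUC over the folds) pairs."""
    kf = KFold(n_splits=n_splits)
    results = []
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, test_idx in kf.split(X):
            # A fresh, unfitted copy of the pipeline for every fold
            est = clone(pipeline).set_params(**params)
            est.fit(X[train_idx], y[train_idx])
            proba = est.predict_proba(X[test_idx])[:, 1]
            fold_scores.append(roc_auc_score(y[test_idx], proba))
        results.append((params, float(np.mean(fold_scores))))
    best_params, _ = max(results, key=lambda item: item[1])
    return best_params, results


X, Y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_redundant=10, n_classes=2, n_clusters_per_class=1,
                           shift=2, scale=None, random_state=0)
pipe = Pipeline(steps=[('pca', PCA()), ('lr', LogisticRegression())])
grid = dict(pca__n_components=[5, 8], lr__C=[1, 3])

best_params, results = manual_grid_search(pipe, grid, X, Y)
for params, mean_auc in results:
    print(params, round(mean_auc, 4))
print('Best parameters:', best_params)
```

Picking the winner with `max` avoids hard-coding the best combination in a print statement. Exact scores may vary with the scikit-learn version, since default solver settings have changed between releases.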

**Q2) Reimplement lines 2 and 3 of the code below by making all of the following modifications.** <font color='purple'> (2) </font> 

(a) Use fit() and transform() for each operation in the pipeline

(b) Use KFold() and find the mean and standard deviation of AUC scores obtained over the 3 folds

(c) Do NOT use the Pipeline() object or cross_val_score() function

Ensure that there is no data leakage. 

Print the AUC score in each fold and the mean and standard deviation of AUC scores over the 3 folds.

In [4]:
#1
X,Y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, n_classes=2, n_clusters_per_class=1, shift=2,scale=None, random_state=0)
#2
estA = Pipeline(steps=[('pca', PCA()), ('lr', LogisticRegression())])
#3
scores = cross_val_score(estA, X, Y, cv=3, scoring='roc_auc')
#4
print(scores, np.mean(scores), np.std(scores))

[0.98269896 0.90441176 0.96323529] 0.9501153402537486 0.03327983578332014


In [5]:
scores = []
kf = KFold(n_splits=3)

#To obtain the cross-validation score of an estimator, the estimator's pipeline is
#evaluated by training and testing each step of the pipeline on all folds 
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    
    #fit PCA on training data, transform data, then fit LR
    #if pca.fit() is called on entire X, there is data leakage:
    #only training data (within each fold) should be used to fit PCA and LR 
    pca = PCA(n_components=8)
    pca.fit(X_train)
    X1 = pca.transform(X_train)
    LR = LogisticRegression(C=1)
    LR.fit(X1,y_train)
    
    #use PCA loadings from train data to transform test data, then predict using LR
    #if pca.transform() not used before LR.predict() -> Pipeline is incorrectly implemented
    y_pred = LR.predict_proba(pca.transform(X_test))[:,1]
    scores.append(roc_auc_score(y_test,y_pred))

print(scores, np.mean(scores), np.std(scores))

[0.96875, 0.9074074074074074, 0.9703703703703704] 0.9488425925925926 0.029306567279163854
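
The fold scores above differ from the `cross_val_score` output in In [4] for two reasons: In [4] uses default `PCA()` and `LogisticRegression()` settings, and an integer `cv` makes `cross_val_score` use stratified folds for a classifier, whereas `KFold` is not stratified. A minimal sketch of a consistency check, assuming the same hyperparameters (`n_components=8`, `C=1`) and the same `KFold` object are given to both approaches. It ranks test samples with `decision_function` instead of `predict_proba`; AUC depends only on the ranking, so either is a valid score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

X, Y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_redundant=10, n_classes=2, n_clusters_per_class=1,
                           shift=2, scale=None, random_state=0)
kf = KFold(n_splits=3)

# Manual loop: every step is fitted on the training fold only
manual_scores = []
for train_idx, test_idx in kf.split(X):
    pca = PCA(n_components=8)
    Z_train = pca.fit_transform(X[train_idx])  # same as fit() followed by transform()
    lr = LogisticRegression(C=1).fit(Z_train, Y[train_idx])
    scores_test = lr.decision_function(pca.transform(X[test_idx]))
    manual_scores.append(roc_auc_score(Y[test_idx], scores_test))

# Pipeline + cross_val_score on the same folds
pipe = Pipeline(steps=[('pca', PCA(n_components=8)),
                       ('lr', LogisticRegression(C=1))])
pipeline_scores = cross_val_score(pipe, X, Y, cv=kf, scoring='roc_auc')

print(np.round(manual_scores, 4))
print(np.round(pipeline_scores, 4))
```

If the two printed arrays agree, the manual loop is doing what `Pipeline` does internally: fit each transformer on the training fold, then reuse the fitted transformer on the held-out fold.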


**MS**
+ 0 marks if `fit` and `transform` are not used for both `pca` and `LR`.
+ Deduct 1.5 marks if data leakage happens, i.e. if the entire `X` is used for fitting `pca`.

**Q3) Evaluate the performance of the model from question 2 (trained on X,Y) on the test set (X2) generated below. Use the metrics precision, recall, sensitivity, specificity and AUC.** <font color='purple'> (0.5) </font> 

In [6]:
X2,Y2 = make_classification(n_samples=500, n_features=20, n_informative=10, n_redundant=10, n_classes=2, n_clusters_per_class=1, shift=2,scale=None, random_state=0)

In [7]:
#cross validation error is an estimate of test error and is typically used to 
#choose a model (and its hyperparameters) using training data 
est = Pipeline(steps=[('pca', PCA(n_components=8)), ('lr', LogisticRegression(C=1))])
est.fit(X,Y)

#After choosing the model, it is retrained on the entire training data and then
#used to predict on the "unseen" test data
ypred = est.predict_proba(X2)[:,1]
print(roc_auc_score(Y2, ypred))

ypred = est.predict(X2)
c = confusion_matrix(Y2,ypred) #[[TN FP],[FN,TP]]
tn, fp, fn, tp = c.ravel()
sens, spec = tp/(tp+fn), tn/(tn+fp) #c[1][1]/(c[1][1]+c[1][0]), c[0][0]/(c[0][0]+c[0][1])
print(spec, sens)
print(precision_score(Y2, ypred), recall_score(Y2, ypred))

0.5021441372247823
0.6854838709677419 0.30952380952380953
0.5 0.30952380952380953
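
The relationship between the four count-based metrics can be checked by hand. A minimal sketch on hypothetical labels (chosen only for illustration, not taken from the assignment data), showing that sensitivity and recall are the same quantity:

```python
import numpy as np

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

sensitivity = tp / (tp + fn)   # true positive rate, identical to recall
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # fraction of predicted positives that are correct
recall = sensitivity

print(tn, fp, fn, tp)
print(sensitivity, specificity, precision, recall)
```

This is why the second number in the `spec, sens` output above equals the `recall_score` value.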


**MS**
+ 0.1 mark for each metric implemented correctly.

## II. Kaggle Competition
In this section, we will attempt a relatively simple Kaggle competition: Porto Seguro's Safe Driver Prediction. The aim is to predict if a driver will file an insurance claim next year. Please read the description of the challenge on [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction). The competition is closed and we cannot participate in it. So, we'll just use the training data provided in this assignment.  

Create a Kaggle account (if you don't have one) and download `train.csv` from the Data tab on the competition webpage. Read the Data description carefully. We'll build and evaluate end-to-end machine learning pipelines using sklearn functions. 

**Q4) Data Loading** <font color='purple'> (2) </font> 

(i) Load the data into a dataframe `df`, and print the number of observations (rows). This is a big dataset, so sample only 1% of the observations using the code: `df = df.sample(frac=0.01, random_state=0)`. <font color='purple'> (0.5) </font> 

Use this sampled data for all the remaining questions in this section. **Make sure you set `random_state=0`, so your code can be reproduced exactly.** Then separate the target variable `Y` from the predictors `X`. Note that `id` is not a predictor. <font color='blue'> Why? </font>

   
(ii) Print the number of classes in the target variable, and number of observations available in each class. <font color='purple'> (0.5) </font> 
    
(iii) Split the data for training (80%) (`X_train, Y_train`) and testing (20%) (`X_test, Y_test`). Print the number of observations in the training set. <font color='purple'> (0.5) </font> 
    
(iv) The features in this dataset are heterogeneous. Look at the column names and make separate lists of names for categorical (nominal) features `cat_feat`, binary features `bin_feat`, and numerical features (continuous and ordinal features) `num_feat`. Note that, in this analysis, we are considering binary features separately, not as part of categorical or numerical. Print the number of features in each of these 3 lists. <font color='purple'> (0.5) </font> 

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [9]:
df = pd.read_csv('train.csv')
df = df.sample(frac=0.01, random_state=0)
df.head()

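| | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 495179 | 1237354 | 0 | 5 | 1 | 4 | 0 | 0 | 0 | 1 | 0 | ... | 7 | 2 | 8 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 210156 | 525008 | 0 | 5 | 1 | 5 | 0 | 0 | 1 | 0 | 0 | ... | 5 | 2 | 0 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
| 170340 | 425680 | 0 | 2 | 1 | 6 | 1 | 0 | 1 | 0 | 0 | ... | 4 | 2 | 3 | 10 | 0 | 0 | 0 | 1 | 1 | 1 |
| 462495 | 1156017 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 12 | 3 | 5 | 6 | 0 | 1 | 1 | 0 | 0 | 0 |
| 6892 | 17538 | 0 | 2 | 1 | 7 | 0 | 4 | 1 | 0 | 0 | ... | 5 | 1 | 3 | 6 | 0 | 1 | 1 | 0 | 1 | 1 |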


In [10]:
X = df.drop(columns=['id', 'target'])
Y = df['target']
Y.value_counts()

0    5740
1     212
Name: target, dtype: int64

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.2, 
                                                    random_state=0)
# To make some operations convenient, we keep X_train as a DataFrame
X_train = pd.DataFrame(data=X_train, columns=X.columns)
print(X_train.shape)

(4761, 57)


In [12]:
cat_features = [f for f in X.columns if f.endswith('cat')]
bin_features = [f for f in X.columns if f.endswith('bin')]
num_features = [f for f in X.columns if f not in cat_features+bin_features]
print('# categorical features =', len(cat_features))
print('# binary features =', len(bin_features))
print('# numerical features =', len(num_features))
print(bin_features)

# categorical features = 14
# binary features = 17
# numerical features = 26
['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']


**MS**  
(i) If `id` is part of `X`, assign 0 marks.  
(ii) 0.25 for each.  
(iii) Full marks if correct.  
(iv) For each wrong list deduct 0.2 marks; if all 3 are wrong, assign 0 marks.

**Q5) Preprocessing** <font color='purple'> (3) </font>  

Create a `preprocessor` pipeline with the following transformation steps.

(i) "Appropriate" *Imputers* for different types of data (categorical, binary and numerical). <font color='purple'> (1) </font> 

(ii) A *OneHotEncoder* for the categorical features to convert their integer representations to one-hot-vectors. <font color='purple'> (1) </font> 

(iii) A *Scaler* for standardising the resulting data from the above two steps. <font color='purple'> (0.5) </font>

(iv) Print the total number of features after transforming `X_train` with the above constructed `preprocessor`. <font color='purple'> (0.5) </font>


Hint: Use `ColumnTransformer` to consolidate the first two steps.

Note: Only the `preprocessor` Pipeline object should be used to `fit` and `transform` the training data. The constituent objects (Imputer, OneHotEncoder and Scaler) should not be separately used to `fit` and `transform`.

In [13]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [14]:
num_imp = SimpleImputer(missing_values=-1, strategy='median')
bin_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
cat_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')

In [15]:
# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
oh_enc = OneHotEncoder(categories='auto', sparse=False, handle_unknown='ignore')
cat_transformer = Pipeline(steps=[('imputer', cat_imp), 
                                  ('onehot', oh_enc)])

col_transformer = ColumnTransformer(
    transformers=[
        ('num', num_imp, num_features), 
        ('bin', bin_imp, bin_features),
        ('cat', cat_transformer, cat_features)])

scaler = StandardScaler(with_std=True)
preprocessor = Pipeline(steps=[('col_trans', col_transformer),
                               ('scaler', scaler)])
preprocessor.fit(X_train)
X_train_preprocessed = preprocessor.transform(X_train)
print("Total number of output features:", X_train_preprocessed.shape[1])

Total number of output features: 218
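
To see what each step of the `preprocessor` does, the same structure can be applied to a tiny hand-made frame. This is a minimal sketch: the column names `x_num`, `x_bin` and `x_cat` and all values are hypothetical, and `sparse_threshold=0` is used here (instead of the encoder's `sparse` argument) to force a dense output regardless of the scikit-learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; -1 marks a missing value, as in the Kaggle dataset
toy = pd.DataFrame({
    'x_num': [1.0, -1.0, 3.0, 5.0, 3.0],
    'x_bin': [0, 1, 1, -1, 1],
    'x_cat': [0, 2, 2, -1, 1],
})

toy_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=-1, strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
toy_cols = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(missing_values=-1, strategy='median'), ['x_num']),
        ('bin', SimpleImputer(missing_values=-1, strategy='most_frequent'), ['x_bin']),
        ('cat', toy_cat, ['x_cat']),
    ],
    sparse_threshold=0,  # always return a dense array
)
toy_pre = Pipeline(steps=[('col_trans', toy_cols), ('scaler', StandardScaler())])

out = toy_pre.fit_transform(toy)
imputed = toy_pre.named_steps['col_trans'].transform(toy)  # before scaling
print(imputed)
print(out.shape)
```

The output has one numerical column, one binary column and one column per category of `x_cat`, which is the same mechanism that expands the 57 input features to 218 above.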


**MS**  
(i) For binary and categorical variables, mode-based (most frequent) imputation should be used. Otherwise assign 0 marks.  
(ii) 0 for wrong implementation.  
(iii) 0 for wrong implementation.  
(iv) should be 218 features.

**Q6) Building Different Estimators**

Build four different estimators using the `preprocessor` from the previous question and the following pipelines of operations:

1. `est1`: Preprocessing -> Classification <font color='purple'> (1.5) </font> 
2. `est2`: Preprocessing -> Feature Selection -> Classification <font color='purple'> (1.75) </font> 
3. `est3`: Preprocessing -> Feature Selection -> PCA -> Classification <font color='purple'> (2) </font> 
4. `est4`: Preprocessing -> PCA -> Classification <font color='purple'> (1.75) </font> 

The four estimators should differ in one or more functions chosen for feature selection and/or classification (e.g. you could use [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) and LogisticRegression in `est2`, and [VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) and LogisticRegression in `est3`). You are free to use any classifier or [feature selection method](https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection) from sklearn (not just the ones listed above or discussed in class) that can be used in these pipelines.

Identify at least one hyperparameter per operation in each pipeline. For each selected hyperparameter, choose two values; they may or may not be the same across pipelines. Use grid search over the hyperparameter values with 5-fold cross-validation and `roc_auc` scoring. Evaluate the four best estimators with 5-fold cross-validation on the training data. Print the mean and standard deviation of the AUC over the 5 folds for each best estimator. 

<font color='blue'> Among the 4 best estimators found above, which has highest AUC score? </font>

In [16]:
#just to make the notebook look cleaner
import warnings
warnings.filterwarnings("ignore")

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

In [18]:
p1 = Pipeline(steps=[('preprocessor', preprocessor), ('lr', LogisticRegression(solver='lbfgs'))])
hyperparameters = {'preprocessor__col_trans__num__strategy':['mean', 'most_frequent']}
est1 = GridSearchCV(p1, hyperparameters, cv=5, scoring='roc_auc')
est1.fit(X_train,Y_train)
best_est1 = est1.best_estimator_
scores = cross_val_score(best_est1, X_train,Y_train, cv=5, scoring='roc_auc')
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))

AUC mean = 0.532; AUC std = 0.022


In [19]:
p2 = Pipeline(steps=[('preprocessor', preprocessor), ('fsel', SelectKBest()), ('knn', KNeighborsClassifier())])
hyperparameters = {'preprocessor__col_trans__num__strategy':['mean', 'most_frequent'],
                   'fsel__k': [30, 50], 
                   'knn__n_neighbors':[3,5]}
est2 = GridSearchCV(p2, hyperparameters, cv=5, scoring='roc_auc')
est2.fit(X_train,Y_train)
best_est2 = est2.best_estimator_
scores = cross_val_score(best_est2, X_train, Y_train, cv=5, scoring='roc_auc')
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))

AUC mean = 0.535; AUC std = 0.024


In [20]:
p3 = Pipeline(steps=[('preprocessor', preprocessor), 
                     ('fsel', SelectKBest()), 
                     ('pca', PCA()), 
                     ('knn', KNeighborsClassifier())])
hyperparameters = {'preprocessor__col_trans__num__strategy':['mean', 'most_frequent'],
                   'fsel__k': [30, 50], 
                   'pca__n_components': [10,20],
                   'knn__n_neighbors':[3,5]}
est3 = GridSearchCV(p3, hyperparameters, cv=5, scoring='roc_auc')
est3.fit(X_train,Y_train)
best_est3 = est3.best_estimator_
scores = cross_val_score(best_est3, X_train, Y_train, cv=5, scoring='roc_auc')
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))

AUC mean = 0.542; AUC std = 0.027


In [21]:
p4 = Pipeline(steps=[('preprocessor', preprocessor), 
                     ('pca', PCA()), 
                     ('knn', KNeighborsClassifier())])
hyperparameters = {'preprocessor__col_trans__num__strategy':['mean', 'most_frequent'],
                   'pca__n_components': [10,20],
                   'knn__n_neighbors':[3,5]}
est4 = GridSearchCV(p4, hyperparameters, cv=5, scoring='roc_auc')
est4.fit(X_train,Y_train)
best_est4 = est4.best_estimator_
scores = cross_val_score(best_est4, X_train, Y_train, cv=5, scoring='roc_auc')
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))

AUC mean = 0.515; AUC std = 0.012
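
The cells above rerun `cross_val_score` on each `best_estimator_`. Because the grid search already evaluates every combination on its own 5 folds, the fold statistics of the winner can also be read from `cv_results_`. A minimal sketch on synthetic data; the names `X_demo`, `y_demo` and the grid values are assumptions for illustration, not the assignment's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)

pipe = Pipeline(steps=[('pca', PCA()),
                       ('lr', LogisticRegression(max_iter=1000))])
grid = {'pca__n_components': [5, 10], 'lr__C': [0.1, 1.0]}

search = GridSearchCV(pipe, grid, cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

# Mean and standard deviation of the test AUC for the best combination
best = search.best_index_
mean_auc = search.cv_results_['mean_test_score'][best]
std_auc = search.cv_results_['std_test_score'][best]

print(search.best_params_)
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(mean_auc, std_auc))
```

With the same integer `cv` and deterministic estimators, these values should match a separate `cross_val_score` call on the best estimator, so the second pass mainly serves as a cross-check.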


**MS**  
For each `est`:
+ Correct Implementation of pipeline => 0.5 marks
+ Hyperparameter tuning => 1 mark

Deduct marks in the following cases:
+ If the classifiers of `est1` and `est4` are the same, then deduct 1 mark (0.5 for each question).
+ If the classifiers **and** feature selectors are the same in `est2` and `est3`, then deduct 1 mark (0.5 for each).

**Q7) Class Imbalance** <font color='purple'> (3.5) </font> 

Learn what class imbalance is by reading [this article](https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28). 

Additional links to packages for handling class imbalance in python: [Imbalanced-learn.org](http://imbalanced-learn.org/en/stable/index.html), [Over-sampling](http://imbalanced-learn.org/en/stable/over_sampling.html#a-practical-guide), [Under-sampling](http://imbalanced-learn.org/en/stable/under_sampling.html).

Install the `imbalanced-learn` package within Anaconda (See installation details in the link given above). We'll denote the estimator with the best average AUC in the previous question by `M`. 

+ Print the `accuracy_score`, `precision_score`, `recall_score` and `confusion_matrix` of `M` on the test dataset. <font color='purple'> (0.5) </font>  
<font color='blue'> In this case, does a high accuracy score reflect better performance? </font> 

+ Create two more estimators
    1. `M_over`
    2. `M_under`  
    
  by adding 
  
    1. SMOTE 
    2. under-sampler
    
    respectively to the preprocessing pipeline (just before scaling transformation) of M.
    
    Note that for this you may have to use the [Pipeline object](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.pipeline.Pipeline.html) in imbalanced-learn.

    Train both these estimators on the training data. 

    Print the mean and standard deviation of the AUC values, over the 5 folds, for each estimator on the training data.

    Print the accuracy_score, precision, recall and confusion_matrix of `M_over` and `M_under` on the test data. <font color='purple'> (3) </font>  

<font color='blue'> How does SMOTE or under-sampling impact (increase/decrease) the learning performance on training and test data as compared to `M`? </font>

In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

In [23]:
Y_pred = best_est3.predict(X_test)
print("test accuracy = {:.2%}".format(accuracy_score(Y_test, Y_pred)))
print("precision: {:.2%}; recall: {:.2%}".format(precision_score(Y_test, Y_pred), 
                                                 recall_score(Y_test, Y_pred)))
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

test accuracy = 96.22%
precision: 0.00%; recall: 0.00%
[[1146    1]
 [  44    0]]
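
The printed confusion matrix is enough to see why accuracy is misleading here. A short sketch that recomputes the metrics from the counts above and compares them with a baseline that always predicts class 0 (the baseline is an illustration, not part of the assignment):

```python
import numpy as np

# Confusion matrix of M on the test data, laid out as [[TN, FP], [FN, TP]]
cm = np.array([[1146, 1],
               [44, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

# A classifier that always predicts 0 gets every negative right
baseline_accuracy = (tn + fp) / cm.sum()

print("accuracy = {:.2%}".format(accuracy))
print("precision = {:.2%}; recall = {:.2%}".format(precision, recall))
print("always-0 baseline accuracy = {:.2%}".format(baseline_accuracy))
```

The always-0 baseline is slightly more accurate than `M`, and both find none of the 44 positive cases, so accuracy alone says little about performance on the minority class.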


In [24]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as Pipe

In [25]:
est3.best_params_

{'fsel__k': 30,
 'knn__n_neighbors': 5,
 'pca__n_components': 20,
 'preprocessor__col_trans__num__strategy': 'most_frequent'}

In [26]:
M_params = est3.best_params_.copy()
M_params['col_trans__num__strategy'] = M_params['preprocessor__col_trans__num__strategy']
del M_params['preprocessor__col_trans__num__strategy']
M_params

{'fsel__k': 30,
 'knn__n_neighbors': 5,
 'pca__n_components': 20,
 'col_trans__num__strategy': 'most_frequent'}

In [27]:
M_over = Pipe(steps=[('col_trans', col_transformer),
                     ('over', SMOTE()),
                     ('scaler', scaler),
                     ('fsel', SelectKBest()), 
                     ('pca', PCA()), 
                     ('knn', KNeighborsClassifier())])
M_over.set_params(**M_params)
M_over.fit(X_train, Y_train)
scores = cross_val_score(M_over, X_train, Y_train, cv=5, scoring='roc_auc')
print("Oversampling:\n-----------")
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))
Y_pred = M_over.predict(X_test)
print("test accuracy = {:.2%}".format(accuracy_score(Y_test, Y_pred)))
print("precision: {:.2%}; recall: {:.2%}".format(precision_score(Y_test, Y_pred), 
                                                 recall_score(Y_test, Y_pred)))
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

M_under = Pipe(steps=[('col_trans', col_transformer),
                      ('under', RandomUnderSampler(random_state=0, replacement=True)),
                      ('scaler', scaler),
                      ('fsel', SelectKBest()), 
                      ('pca', PCA()), 
                      ('knn', KNeighborsClassifier())])
M_under.set_params(**M_params)
M_under.fit(X_train, Y_train)
scores = cross_val_score(M_under, X_train, Y_train, cv=5, scoring='roc_auc')
print("\nUndersampling:\n-----------")
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))
Y_pred = M_under.predict(X_test)
print("test accuracy = {:.2%}".format(accuracy_score(Y_test, Y_pred)))
print("precision: {:.2%}; recall: {:.2%}".format(precision_score(Y_test, Y_pred), 
                                                 recall_score(Y_test, Y_pred)))
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

Oversampling:
-----------
AUC mean = 0.500; AUC std = 0.026
test accuracy = 82.96%
precision: 5.08%; recall: 20.45%
[[979 168]
 [ 35   9]]

Undersampling:
-----------
AUC mean = 0.543; AUC std = 0.028
test accuracy = 51.30%
precision: 3.79%; recall: 50.00%
[[589 558]
 [ 22  22]]
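
Conceptually, random under-sampling keeps every minority-class row and a random subset of the majority class of the same size. The following is a NumPy-only sketch of that idea; `random_undersample` is a hypothetical helper, it samples without replacement (unlike the `replacement=True` setting used above), and it is not how imbalanced-learn is implemented internally.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Hypothetical helper: for every class, keep a random subset whose
    size equals the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Hypothetical imbalanced data with roughly the class ratio seen in this section
y_demo = np.array([0] * 960 + [1] * 40)
X_demo = np.arange(len(y_demo), dtype=float).reshape(-1, 1)

X_res, y_res = random_undersample(X_demo, y_demo)
print(np.bincount(y_demo), '->', np.bincount(y_res))
```

Resampling is applied only while fitting. The test set keeps its original class distribution, which is why both confusion matrices above still sum to the full 1191 test observations.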


**MS**
+ 0.5 for printing the metrics on the test data. 0 marks if computed on training data. 
+ Except for AUC, all metrics should be computed on the test data. Otherwise deduct 1 mark for each of the estimators `M_over` and `M_under`.
+ The hyperparameters of `M_over` and `M_under` should be the same as those of `M`. Otherwise deduct 1 mark.

**Q8) Ensemble Method**  <font color='purple'> (1) </font> 

To answer this question, learn about Random Forest Classifier. You may follow these links:
[Understanding Random Forest](https://towardsdatascience.com/understanding-random-forest-58381e0602d2), and [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). 

+ Create a pipeline with the `preprocessor` object of Q5 and a RandomForest classifier having 100 estimators and a maximum depth of 3. Train it on the training set. 

    Print the `accuracy_score`, precision, recall and `confusion_matrix` of this model on the test data. <font color='purple'> (1) </font>

In [28]:
from sklearn.ensemble import RandomForestClassifier
rfclf = Pipeline(steps=[('preprocessor', preprocessor), 
                        ('rf', RandomForestClassifier(n_estimators=100, max_depth=3))])
rfclf.fit(X_train, Y_train)
scores = cross_val_score(rfclf, X_train, Y_train, cv=5, scoring='roc_auc')
print("AUC mean = {:.3f}; AUC std = {:.3f}".format(np.mean(scores), np.std(scores)))
Y_pred = rfclf.predict(X_test)
print("test accuracy = {:.2%}".format(accuracy_score(Y_test, Y_pred)))
print("precision: {:.2%}; recall: {:.2%}".format(precision_score(Y_test, Y_pred), 
                                                 recall_score(Y_test, Y_pred)))
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

AUC mean = 0.608; AUC std = 0.043
test accuracy = 96.31%
precision: 0.00%; recall: 0.00%
[[1147    0]
 [  44    0]]


**MS**
+ 0.25 marks for each metric.