In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# 6 - Model and hyperparameter selection

In this session, we will focus on the techniques for model choice and hyperparameter tuning for classification.
Similar techniques can be used for regression tasks as well.

We are going to use two datasets which we have already explored in previous sessions: 
- the dataset on fraud detection (the `payment_fraud.csv` file in the datasets folder)
- the KDD dataset on intrusion detection ([website](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)); you can either download it again from the website or copy it from the folder of one of the previous sessions. If you download it again, remember that you need the *kddcup.data.gz* file (or the *10 percent* version), which is a compressed archive that you have to extract.

---

# Index

- [0. imports](#0_imports)
- [1. fraud detection](#1_fraud_detection)
    - [1.1 load dataset](#1.1_load_dataset)
    - [1.2 analysis](#1.2_analysis)
    - [1.3 data preparation](#1.3_data_preparation)
    - [1.4 training and evaluation](#1.4_training_and_evaluation)
    - [1.5 analysis of different parameters](#1.5_analysis_of_different_parameters)
- [2. Introduction to GridSearchCV and RandomizedSearchCV](#2_Introduction_to_GridSearchCV_and_RandomizedSearchCV)
- [3. Intrusion detection](#3_intrusion_detection)
    - [3.1 load dataset](#3.1_load_dataset)
    - [3.2 map each attack to corresponding category](#3.2_mapping_attack_to_category)
    - [3.3 subsampling](#3.3_subsampling)
    - [3.4 data analysis](#3.4_data_analysis)
    - [3.5 data preparation](#3.5_data_preparation)
    - [3.6 training and evaluation](#3.6_training_and_evaluation)

## 0_imports

In [None]:
from collections import Counter
import numpy as np
import pandas as pd
import os
import time

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# methods for data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# evaluation metrics
from sklearn.metrics import (
    accuracy_score, 
    precision_score,
    recall_score,
    confusion_matrix,
    classification_report,
)

# CV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# distribution probabilities
from scipy.stats import uniform, randint

# classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

## 1_fraud_detection
[Index](#Index)

### 1.1_load_dataset
[Index](#Index)

In [None]:
data_dir = 'datasets'
filename_fraud = 'payment_fraud.csv'

### 1.2_analysis
[Index](#Index)

In [None]:
df_fraud = pd.read_csv(os.path.join(data_dir, filename_fraud))
df_fraud.sample(5)

<div class="alert alert-block alert-danger">
<b>Q: Print the number of rows.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Print the number of occurrences of all the possible values of paymentMethod and numItems.</b>
</div>

In [None]:
# paymentMethod

In [None]:
# numItems

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of the possible values of numItems.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of the accountAgeDays column.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of the localTime column.</b>
</div>

### 1.3_data_preparation
[Index](#Index)

<div class="alert alert-block alert-warning">
You might remember that we found that all the fraudulent transactions could be identified by the accountAgeDays attribute alone (see the plots below for a demonstration). Thus, for the sake of this session, we will drop that column in order to make the dataset more challenging.
</div>

In [None]:
# the only purpose of this cell and the next one is to show that accountAgeDays can be used to perfectly distinguish between standard and malicious transactions
fig, ax = plt.subplots(1, 2, figsize=(14, 4), sharex=True)

bins = np.arange(0, 2001, 50)
ax[0].hist(df_fraud[df_fraud['label']==0]['accountAgeDays'], bins=bins, color='g')
ax[1].hist(df_fraud[df_fraud['label']==1]['accountAgeDays'], bins=bins, color='r')

for idx in [0, 1]:
    ax[idx].set_title("Distribution of accountAgeDays for '%d' entries (i.e. %s)" % (idx, "'good'" if idx==0 else "'fraud'"))
    ax[idx].grid(axis='y')
    ax[idx].set_ylabel('Num. of transactions')
    ax[idx].set_xlabel('accountAgeDays')
    
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14, 4), sharex=True)

bins = np.arange(0, 10, 1)
ax[0].hist(df_fraud[(df_fraud['label']==0)&(df_fraud['accountAgeDays']<5)]['accountAgeDays'], bins=bins, color='g')
ax[1].hist(df_fraud[(df_fraud['label']==1)&(df_fraud['accountAgeDays']<5)]['accountAgeDays'], bins=bins, color='r')

for idx in [0, 1]:
    ax[idx].grid(axis='y')
    ax[idx].set_title("Focusing on the [0, 10] range")
    ax[idx].set_xticks(bins)
    ax[idx].set_ylabel('Num. of transactions')
    ax[idx].set_xlabel('accountAgeDays')

plt.show()

In [None]:
df_fraud = df_fraud.drop('accountAgeDays', axis=1)

<div class="alert alert-block alert-danger">
<b>Q: How unbalanced is the dataset?</b>
</div>

<div class="alert alert-block alert-info">
It is very unbalanced. There are several techniques to address this, and you will see them in the classroom. Now we will be using the one which is probably the simplest: undersampling.
</div>

In [None]:
df_1 = df_fraud[df_fraud['label']==1]
df_0 = df_fraud[df_fraud['label']==0].sample(len(df_1)*2)

df_fraud = pd.concat([df_0, df_1], axis=0)

<div class="alert alert-block alert-danger">
<b>Q: How unbalanced is the dataset now?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which are the numeric columns? Which are the categorical columns?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Define two variables (numeric_cols and nominal_cols) containing a list of numerical and nominal attributes, respectively.</b>
</div>

In [None]:
nominal_cols = # TODO
numeric_cols = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Perform one hot encoding.</b>
</div>

In [None]:
df_fraud = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Perform train-test split.</b>
</div>

In [None]:
X_train, X_test, y_train, y_test = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Perform scaling.</b>
</div>

In [None]:
standard_scaler = # TODO

X_train[numeric_cols] = # TODO
X_test[numeric_cols] = # TODO

### 1.4_training_and_evaluation
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Define, train, and evaluate a KNeighborsClassifier. Use K=1</b>
</div>

In [None]:
# define the classifier
clf = # TODO

# train the classifier
t0 = time.time()
# TODO
print("elapsed time = %.5f" % (time.time()-t0))

# perform the prediction
y_pred = # TODO

In [None]:
accuracy = # TODO
precision = # TODO
recall = # TODO
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

In [None]:
# TODO confusion matrix

<div class="alert alert-block alert-danger">
<b>Q: Now try with K=2.</b>
</div>

In [None]:
# define the classifier
clf = # TODO

# train the classifier
t0 = time.time()
# TODO
print("elapsed time = %.5f" % (time.time()-t0))

# perform the prediction
y_pred = # TODO

In [None]:
accuracy = # TODO
precision = # TODO
recall = # TODO
print("accuracy", accuracy)
print("precision", precision)
print("recall", recall)

In [None]:
# TODO confusion matrix

<div class="alert alert-block alert-danger">
<b>Q: How did the performance change? Can you infer anything from it?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

### 1.5_analysis_of_different_parameters
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Run the following cell. What does it do?</b>
</div>

In [None]:
list_n_neighbors = np.arange(1, 200, 1, dtype=np.int16)

list_accuracy = []
list_precision = []
list_recall = []
list_training_time = []

for n in list_n_neighbors:
    clf = KNeighborsClassifier(n_neighbors=n)

    t0 = time.time()
    clf.fit(X_train, y_train)
    list_training_time.append(time.time()-t0)

    y_pred = clf.predict(X_test)

    # sklearn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
    list_accuracy.append(accuracy_score(y_test, y_pred))
    list_precision.append(precision_score(y_test, y_pred))
    list_recall.append(recall_score(y_test, y_pred))

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20, 12))

ax[0][0].plot(list_n_neighbors, list_accuracy, c='tab:blue')
ax[0][0].set_title('KNeighborsClassifier: accuracy')
ax[0][0].set_ylabel('accuracy')
ax[0][0].set_xlabel('Num. of neighbors (K)')

ax[0][1].plot(list_n_neighbors, list_precision, c='tab:orange')
ax[0][1].set_title('KNeighborsClassifier: precision')
ax[0][1].set_ylabel('precision')
ax[0][1].set_xlabel('Num. of neighbors (K)')

ax[1][0].plot(list_n_neighbors, list_recall, c='tab:green')
ax[1][0].set_title('KNeighborsClassifier: recall')
ax[1][0].set_ylabel('recall')
ax[1][0].set_xlabel('Num. of neighbors (K)')

ax[1][1].plot(list_n_neighbors, list_training_time, c='tab:red')
ax[1][1].set_title('KNeighborsClassifier: training_time')
ax[1][1].set_ylabel('training_time')
ax[1][1].set_xlabel('Num. of neighbors (K)')

plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Try to analyse the plots above. Do you see any patterns? Is there a choice of parameters that looks particularly effective in your opinion?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

<div class="alert alert-block alert-info">
Let's repeat the same study, but considering SVM and its C parameter.
</div>

In [None]:
list_c = np.arange(0.1, 5.1, 0.1, dtype=float)  # np.float was removed from recent NumPy releases

list_accuracy = []
list_precision = []
list_recall = []
list_training_time = []

for c in list_c:
    clf = LinearSVC(C=c, max_iter=20000)

    t0 = time.time()
    clf.fit(X_train, y_train)
    list_training_time.append(time.time()-t0)

    y_pred = clf.predict(X_test)

    # sklearn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
    list_accuracy.append(accuracy_score(y_test, y_pred))
    list_precision.append(precision_score(y_test, y_pred))
    list_recall.append(recall_score(y_test, y_pred))

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(20, 12))

ax[0][0].plot(list_c, list_accuracy, c='tab:blue')
ax[0][0].set_title('LinearSVC: accuracy')
ax[0][0].set_ylabel('accuracy')
ax[0][0].set_xlabel('Regularization parameter (C)')

ax[0][1].plot(list_c, list_precision, c='tab:orange')
ax[0][1].set_title('LinearSVC: precision')
ax[0][1].set_ylabel('precision')
ax[0][1].set_xlabel('Regularization parameter (C)')

ax[1][0].plot(list_c, list_recall, c='tab:green')
ax[1][0].set_title('LinearSVC: recall')
ax[1][0].set_ylabel('recall')
ax[1][0].set_xlabel('Regularization parameter (C)')

ax[1][1].plot(list_c, list_training_time, c='tab:red')
ax[1][1].set_title('LinearSVC: training_time')
ax[1][1].set_ylabel('training_time')
ax[1][1].set_xlabel('Regularization parameter (C)')

plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Try to analyse the plots above. Do you see any patterns? Is there a choice of parameters that looks particularly effective in your opinion?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

## 2_Introduction_to_GridSearchCV_and_RandomizedSearchCV
[Index](#Index)

### 2.1_GridSearchCV
[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

`GridSearchCV` lets you define a grid of hyperparameters to evaluate with cross validation.
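
Before launching a search, it can help to estimate how much work a grid implies. The snippet below is a minimal sketch, not part of the original exercise: the grid values (including the `clf__loss` entry) are hypothetical and chosen only for illustration. It uses `ParameterGrid` to enumerate the combinations that a grid search would evaluate.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical grid, for illustration only: 3 values of C x 2 loss functions
param_grid = {
    'clf__C': [0.1, 1.0, 10.0],
    'clf__loss': ['hinge', 'squared_hinge'],
}

# every combination of the listed values is one candidate
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# with k-fold cross validation, each candidate is trained k times
n_folds = 3
n_fits = n_candidates * n_folds

for params in candidates:
    print(params)
print("candidates:", n_candidates, "- total fits:", n_fits)
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why large grids can take a long time to evaluate.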

In [None]:
# 1. First of all, you have to define the model to use, in this case LinearSVC but it could be any sklearn model (DecisionTreeClassifier, RandomForestClassifier, etc.)
classifier = LinearSVC()

# 2. define a Pipeline, in this case it is made of only one element
pipe = Pipeline(steps=[('clf', classifier)])

# 3. parameters of the model in the pipeline can be set using `'__'` separated parameter names; e.g. `'clf__C'` is a list of values for the C attribute of the classifier
param_grid = {
    'clf__C': np.arange(0.1, 5.1, 0.1, dtype=float),
    'clf__max_iter': [20000],
}

# 4. performs cross validation for each of the parameters set above. `cv=3` means that I perform 3-fold cross validation
search = GridSearchCV(pipe, param_grid, cv=3, verbose=True)  # the `iid` argument was removed from recent scikit-learn releases
search.fit(X_train, y_train)

# 5. once the training is done, you can show the parameters of the best performing model
print(search.best_params_)

# 6. and you can also retrieve it in order to perform the final test on the test set
trained_clf = search.best_estimator_.get_params()['clf']
trained_clf

# 7. prediction and evaluation is done as usual
y_pred = trained_clf.predict(X_test)

# sklearn metrics expect (y_true, y_pred)
print("accuracy", accuracy_score(y_test, y_pred))
print("precision", precision_score(y_test, y_pred))
print("recall", recall_score(y_test, y_pred))

### 2.2_RandomizedSearchCV
[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

`RandomizedSearchCV` lets you define distributions of hyperparameters; a fixed number of candidates (`n_iter`) is sampled from them and evaluated with cross validation.

It is used in the same way as `GridSearchCV`, but there is a difference in how the parameters are defined.
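
To get a feeling for what "sampling from distributions" means, the sketch below draws a few candidates with `ParameterSampler`, which takes the same kind of dictionary as `RandomizedSearchCV`. The choice of `n_iter=5` and `random_state=0` is an assumption made only for this illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# C is drawn from a uniform distribution over [0, 5];
# a plain list is sampled uniformly among its elements
distributions = {
    'clf__C': uniform(0, scale=5),
    'clf__max_iter': [20000],
}

# draw 5 candidate parameter settings (illustrative values only)
sampled = list(ParameterSampler(distributions, n_iter=5, random_state=0))

for params in sampled:
    print(params)
```

Each printed dictionary is one candidate that a randomized search would evaluate with cross validation.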

In [None]:
classifier = LinearSVC()

pipe = Pipeline(steps=[('clf', classifier)])

# 3. parameters are not a grid. They are probability distributions
distributions = {
    'clf__C': uniform(0, scale=5),
    'clf__max_iter': [20000],
}

# 4. performs cross validation, parameters are taken from the distributions defined above
search = RandomizedSearchCV(pipe, distributions, n_iter=50, cv=3, verbose=True)
search.fit(X_train, y_train)

print(search.best_params_)

trained_clf = search.best_estimator_.get_params()['clf']
trained_clf

y_pred = trained_clf.predict(X_test)

# sklearn metrics expect (y_true, y_pred)
print("accuracy", accuracy_score(y_test, y_pred))
print("precision", precision_score(y_test, y_pred))
print("recall", recall_score(y_test, y_pred))

## 3_intrusion_detection
[Index](#Index)

### 3.1_load_dataset
[Index](#Index)

In [None]:
filename = 'kddcup.data.corrected'

In [None]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [None]:
df = pd.read_csv(os.path.join(data_dir, filename), header=None, names=header_names, sep=',')

In [None]:
df[:5]

<div class="alert alert-block alert-info">
Try to run the cells of section 3.2 *without* performing the sampling at this point.
If your machine crashes due to memory issues, then come back here, uncomment the line in the next cell, and possibly change the fraction of the DataFrame to keep.
</div>

In [None]:
# df = df.sample(frac=0.50)

### 3.2_mapping_attack_to_category
[Index](#Index)

In [None]:
# remove final "." from the attack type
df['attack_type'] = df['attack_type'].str[:-1]

In [None]:
# print number of attacks
len(df['attack_type'].unique())

In [None]:
# show distribution of attacks
tmp_df = df.groupby('attack_type').size().reset_index().sort_values(0, ascending=False)

fig, ax = plt.subplots(figsize=(8, 8))
ax.barh(tmp_df['attack_type'], tmp_df[0])
ax.set_xscale('log')
ax.set_ylabel('attack_type')
ax.set_xlabel('N. of samples (log scale)')
ax.grid(axis='x')
plt.show()

In [None]:
# define dictionary to map attack types to attack categories
category = dict()
category['benign'] = ['normal']

with open(os.path.join(data_dir, 'training_attack_types.txt'), 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        if cat in category.keys():
            category[cat].append(attack)
        else:
            category[cat] = [attack]

attack_mapping = {v: k for k in category for v in category[k]}

In [None]:
# perform the mapping
# vectorized lookup (much faster than a row-wise apply); attack types missing from the mapping would become NaN
df['attack_category'] = df['attack_type'].map(attack_mapping)

In [None]:
# define dictionaries that map an integer to an attack category and vice versa
attack2int = {x: idx for idx, x in enumerate(df['attack_category'].unique())}
int2attack = {v: k for k, v in attack2int.items()}
print("attack2int:", attack2int)
print("int2attack:", int2attack)

In [None]:
# distribution of attack categories
tmp_df = df.groupby('attack_category').size().reset_index().sort_values(0, ascending=False)
display(tmp_df)

color = ['green' if category=='benign' else 'red' for category in tmp_df['attack_category']]
fig, ax = plt.subplots()
ax.barh(tmp_df['attack_category'], tmp_df[0], color=color)
ax.set_xscale('log')
ax.set_ylabel('attack_category')
ax.set_xlabel('N. of samples (log scale)')
plt.show()

### 3.3_subsampling
[Index](#Index)

<div class="alert alert-block alert-info">
In previous notebooks, we had always performed a random sampling, ignoring the labels. Now we are going to perform a "targeted sampling": that is, we are going to reduce the size of the dataset by removing only rows belonging to the most frequent categories, so that we do not discard entries for the least frequent categories.
</div>
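
The same idea can be generalised to any number of classes. The helper below is a hypothetical sketch (the name `cap_per_class` and the toy data are not part of the original notebook): it caps every class at a maximum number of rows and leaves the smaller classes untouched.

```python
import pandas as pd

def cap_per_class(df, label_col, max_per_class, random_state=0):
    """Keep at most `max_per_class` rows for each value of `label_col`."""
    parts = []
    for _, group in df.groupby(label_col):
        # only the over-represented classes are subsampled
        if len(group) > max_per_class:
            group = group.sample(n=max_per_class, random_state=random_state)
        parts.append(group)
    # concatenate and reshuffle the rows
    return pd.concat(parts, axis=0).sample(frac=1.0, random_state=random_state)

# toy data: one frequent class and one rare class
toy = pd.DataFrame({
    'value': range(105),
    'attack_category': ['dos'] * 100 + ['u2r'] * 5,
})

capped = cap_per_class(toy, 'attack_category', max_per_class=10)
counts = capped['attack_category'].value_counts()
print(counts)
```

In the toy example the frequent class is reduced to 10 rows while all 5 rows of the rare class are kept.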

In [None]:
# If you have problems running the notebook, you could further reduce the dataframe size
df_dos = df[df['attack_category']=='dos'].sample(n=10**6)
df_not_dos = df[df['attack_category']!='dos']

df = pd.concat([df_dos, df_not_dos], axis=0).sample(frac=1.)  # the final frac=1. is done to reshuffle the dataframe

In [None]:
tmp_df = df.groupby('attack_category').size().reset_index().sort_values(0, ascending=False)
display(tmp_df)

color = ['green' if category=='benign' else 'red' for category in tmp_df['attack_category']]
fig, ax = plt.subplots()
ax.barh(tmp_df['attack_category'], tmp_df[0], color=color)
ax.set_xscale('log')
ax.set_ylabel('attack_category')
ax.set_xlabel('N. of samples (log scale)')
plt.show()

### 3.4_data_analysis
[Index](#Index)

In [None]:
print("Number of rows = %d" % len(df.index))
print("Number of columns = %d" % len(df.columns))

In [None]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = sorted(set(range(41)).difference(nominal_idx).difference(binary_idx))  # sorted: sets have no guaranteed order

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

print("categorical attributes: \n", nominal_cols, "\n")
print("binary attributes: \n", binary_cols, "\n")
print("numeric attributes: \n", numeric_cols, "\n")

In [None]:
tmp_df = df.groupby('protocol_type').size().reset_index()

fig, ax = plt.subplots()
ax.barh(tmp_df['protocol_type'], tmp_df[0])

ax.set_title('Number of connections for each protocol_type')
ax.grid(axis='x')
ax.set_ylabel('protocol_type')
ax.set_xlabel('N. of samples')  # this plot uses a linear x axis
plt.show()

In [None]:
tmp_df = df.groupby('flag').size().reset_index()

fig, ax = plt.subplots()
ax.barh(tmp_df['flag'], tmp_df[0])
ax.set_xscale('log')
ax.grid(axis='x')
ax.set_title('Number of connections for each flag')
ax.set_ylabel('flag')
ax.set_xlabel('N. of samples (log scale)')
plt.show()

<div class="alert alert-block alert-info">
If you compare the two plots above with the corresponding ones from last week you can see that the subsampling heavily affected the distribution. This is not necessarily a problem, but you should be aware of it.
</div>

### 3.5_data_preparation
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>

In [None]:
df_one_hot = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Split in train and test set.</b>
</div>

In [None]:
X_train, X_test, y_train, y_test = # TODO

<div class="alert alert-block alert-danger">
<b>Q: Perform scaling.</b>
</div>

In [None]:
standard_scaler = # TODO

X_train[numeric_cols] = # TODO
X_test[numeric_cols] = # TODO

### 3.6_training_and_evaluation
[Index](#Index)

<div class="alert alert-block alert-danger">
<b>
Q: Use the GridSearchCV method seen above to train a DecisionTreeClassifier(), then evaluate the best performing model. Use the following parameters:

</b>
    
    - max_leaf_nodes: [10, None];
    - max_depth: [2, 5, 10]
</div>

In [None]:
classifier = # TODO

# TODO Pipeline

# TODO param_grid

search = # TODO

t0 = time.time()
# TODO fit
print("Elapsed time = %.2f" % (time.time()-t0))

print(search.best_params_)

trained_clf = # TODO get best estimator

In [None]:
y_pred = # TODO

In [None]:
print("ACCURACY:", # TODO

<div class="alert alert-block alert-danger">
<b>Q: Find the differences between this function and the corresponding one defined in the previous session.</b>
</div>

In [None]:
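def plot_heatmap(y_pred, y_test, vmax=None, vmin=None, cmap=None):
    # count how many times each (predicted, true) pair occurs
    c = Counter(zip(y_pred, y_test))
    # rows = true labels, columns = predicted labels
    dff = pd.DataFrame(0, columns=np.unique(y_pred), index=np.unique(y_test))
    for (pred, true), v in c.items():
        # .loc avoids chained assignment, which may silently fail in recent pandas
        dff.loc[true, pred] = v
    sns.heatmap(dff, annot=True, fmt="d", vmax=vmax, vmin=vmin, cmap=cmap)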

In [None]:
plot_heatmap(y_pred, y_test, vmax=2000, cmap='inferno')

In [None]:
clf_report = classification_report(y_test, y_pred, output_dict=True)
for attack_type in set(attack_mapping.values()):
    print("%10s -> " % attack_type, clf_report[attack_type])

<div class="alert alert-block alert-danger">
<b>
Q: Use the RandomizedSearchCV method seen above to train a DecisionTreeClassifier(), then evaluate the best performing model. Use the following parameters:

</b>
    
    - max_depth: randint(1, 20)
</div>

In [None]:
classifier = # TODO

# TODO Pipeline

# TODO distributions

search = # TODO RandomizedSearchCV
# TODO fit

print(search.best_params_)

trained_clf = # TODO get best estimator

In [None]:
y_pred = # TODO

In [None]:
print("ACCURACY:", # TODO

In [None]:
plot_heatmap(y_pred, y_test, vmax=2000, cmap='inferno')

In [None]:
clf_report = classification_report(y_test, y_pred, output_dict=True)
for attack_type in set(attack_mapping.values()):
    print("%10s -> " % attack_type, clf_report[attack_type])

<div class="alert alert-block alert-danger">
<b>
Q: Even though XGBClassifier is not part of sklearn, you can use it within GridSearchCV and RandomizedSearchCV. Train and evaluate XGBClassifier with GridSearchCV using the following parameters:

</b>
    
    - learning_rate: np.arange(0.1, 0.6, 0.2);
    - n_estimators: [50, 100]
</div>

In [None]:
classifier = # TODO

pipe = # TODO

param_grid = # TODO

search = # TODO
t0 = time.time()
# TODO fit
print("Elapsed time = %.2f" % (time.time()-t0))

print(search.best_params_)

# TODO get best estimator

In [None]:
y_pred = # TODO

In [None]:
print("ACCURACY:", # TODO

In [None]:
plot_heatmap(y_pred, y_test, vmax=2000, cmap='inferno')

In [None]:
clf_report = classification_report(y_test, y_pred, output_dict=True)
for attack_type in set(attack_mapping.values()):
    print("%10s -> " % attack_type, clf_report[attack_type])

---

<div class="alert alert-block alert-danger">
<b>Now it's your turn. Experiment with different models, hyperparameters and try to find the best performing model using cross validation with GridSearchCV and/or RandomizedSearchCV.</b>
    
Some things you might want to try:
    
- work on the features: are all of them important? How are they distributed? Does the model get better if you remove some of the features? etc.

- work on the data: is the StandardScaler the best scaler to use? Are the chosen parameters the ones leading to the best result?

- try different models and different hyperparameters for each model (you cannot try "every" possible combination, so try to pick wisely). For instance, you can first do a coarse-grained search and, after that, a fine-grained search "around" the best parameters. The search might even take several hours (or days) to run if you set too many parameters.
</div>
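
The coarse-to-fine strategy mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses a synthetic dataset from `make_classification` and a `LogisticRegression` classifier rather than the intrusion detection data, and the grid boundaries are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# synthetic data, used only to keep the example self-contained
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline(steps=[('clf', LogisticRegression(max_iter=1000))])

# 1. coarse search: few values spread over several orders of magnitude
coarse_values = np.logspace(-3, 3, 7)
coarse = GridSearchCV(pipe, {'clf__C': coarse_values}, cv=3)
coarse.fit(X, y)
best_c = coarse.best_params_['clf__C']

# 2. fine search: more values, one order of magnitude around the coarse optimum
fine_values = np.logspace(np.log10(best_c) - 1, np.log10(best_c) + 1, 9)
fine = GridSearchCV(pipe, {'clf__C': fine_values}, cv=3)
fine.fit(X, y)

print("coarse best:", coarse.best_params_, coarse.best_score_)
print("fine best:  ", fine.best_params_, fine.best_score_)
```

The two searches together evaluate 16 candidates, far fewer than a single dense grid covering the whole range at the fine resolution would require.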

Remember: 
- *one perfect model* to solve this problem does not exist, but by approaching the problem in the correct way, you can get better and better models.
- when performing the final evaluation, look at all the 5 classes and not only at the overall accuracy!
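
To see why the overall accuracy alone can be misleading, consider the toy sketch below. The labels are made up for illustration: a "classifier" that always predicts the most frequent class reaches a high accuracy while never detecting the rare class.

```python
from sklearn.metrics import accuracy_score, classification_report

# made-up labels: 95 samples of a frequent class, 5 of a rare one
y_true = ['dos'] * 95 + ['u2r'] * 5
# a trivial model that always predicts the frequent class
y_pred = ['dos'] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning for the class that is never predicted
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

print("accuracy:", accuracy)
for label in ['dos', 'u2r']:
    print("%5s -> recall = %.2f" % (label, report[label]['recall']))
```

The per-class recall exposes the problem that the accuracy hides: none of the rare-class samples is recognised.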

After you finish the notebook, take a look at this [link](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html).

In the second Section ("Face recognition with eigenfaces"), it presents an example of how to use the techniques that you have seen so far in a more challenging problem (face recognition), which is also very relevant from a security perspective, since face recognition is often used for authentication on hand-held devices.