<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# Worksheet 5.3: Tuning your Classifier - Answers
This worksheet covers concepts relating to tuning a classifier.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

## Import the Libraries
For this exercise, we will be using:
* Pandas (https://pandas.pydata.org/pandas-docs/stable/)
* NumPy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (https://matplotlib.org/)
* Scikit-learn (https://scikit-learn.org/stable/documentation.html)
* YellowBrick (https://www.scikit-yb.org/en/latest/)
* Seaborn (https://seaborn.pydata.org)
* Lime (https://github.com/marcotcr/lime)

In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import time
import pickle
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import uniform as sp_rand
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ConfusionMatrix
import matplotlib.pyplot as plt
import matplotlib
import lime
import warnings; warnings.simplefilter('ignore')
%matplotlib inline

## Prepare the Data
For this exercise, we are going to focus on building a pipeline and then tuning the resultant model, so we're going to use a simpler model with only a handful of features.

In [2]:
df_final = pd.read_csv('../data/dga_features_final_df.csv')
target = df_final['isDGA']
feature_matrix = df_final.drop(['isDGA'], axis=1)
feature_matrix.sample(5)

Unnamed: 0,length,digits,entropy,vowel-cons,firstDigitIndex,ngrams
477,9,0,2.947703,0.8,0,1652.575397
213,26,11,3.92103,0.071429,3,507.858974
617,26,5,4.238901,0.05,1,565.439145
464,13,0,3.392747,0.444444,0,1131.294289
701,13,0,3.238901,0.444444,0,816.691919


### Split the data into training and testing sets.
We're going to need a training and a testing dataset. You know the drill: split the data.

In [3]:
# Simple Cross-Validation: Split the data set into training and test data
feature_matrix_train, feature_matrix_test, target_train, target_test = train_test_split(feature_matrix, 
                                                                                        target, 
                                                                                        test_size=0.25)

## Build a Model
For this exercise, we're going to create a K-NN Classifier for the DGA data and tune it, but first, create a classifier with the default options and calculate the accuracy score for it. (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) 

The default parameters are shown below.
```python 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
```           

In [4]:
# Your code here ...
clf = KNeighborsClassifier()
clf.fit( feature_matrix_train, target_train )

In [5]:
#Store the predictions
default_predictions = clf.predict( feature_matrix_test)

In [6]:
accuracy_score( target_test, default_predictions)

0.858
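
Accuracy is a single summary number. The `classification_report` and `confusion_matrix` functions imported at the top of this worksheet break the result down by class. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, used as a stand-in for the DGA features so that it runs on its own); the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the DGA data (assumption: 6 numeric features, binary target)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Default K-NN classifier, as in the cell above
knn = KNeighborsClassifier().fit(X_train, y_train)
predictions = knn.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_test, predictions))
```

With the real data, you would pass `target_test` and `default_predictions` to the same two functions.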

In [16]:
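filename = '../data/dga_model.sav'
# Use a context manager so the file handle is closed after the model is written
with open(filename, 'wb') as model_file:
    pickle.dump(clf, model_file)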
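
To reuse a saved model, it can be read back with `pickle.load`. The following is a small round-trip sketch, assuming synthetic data and a hypothetical file name in a temporary directory so that nothing in `../data` is touched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the DGA data (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = KNeighborsClassifier().fit(X, y)

# Hypothetical path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "knn_demo.sav")

# Save the fitted model
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back and check that it makes the same predictions
with open(path, "rb") as fh:
    restored = pickle.load(fh)

round_trip_ok = np.array_equal(model.predict(X), restored.predict(X))
print(round_trip_ok)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.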

## Improving Performance 
Out of the box, the model achieves approximately 86% accuracy.  That is better than chance, but let's see if we can do better. 

**Note:  This notebook is written without using fixed random seeds, so you might get slightly different results.**
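
If you would like repeatable numbers, one option is to pass a fixed `random_state` to `train_test_split`. The sketch below assumes synthetic data and an arbitrary seed value; `stratify=y` is an optional extra that keeps the class balance the same in both splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the DGA data (assumption: 6 numeric features, binary target)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

SEED = 42  # arbitrary value; any fixed integer gives a repeatable split

# Two splits made with the same seed select exactly the same rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y)

same_split = np.array_equal(X_test_a, X_test_b)
print(same_split)
```

`RandomizedSearchCV` and `RandomForestClassifier` accept a `random_state` argument as well, so the same idea applies to the later cells.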

### Scaling the Features
K-NN is a distance-based classifier, so features measured on larger scales dominate the distance calculation; scaling the features before training the model is therefore important.  For this exercise, let's create a simple pipeline with two steps:

1.  StandardScaler
2.  Train the classifier

Once you've done that, calculate the accuracy and see if it has improved.

In [7]:
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', KNeighborsClassifier())
])

pipeline.fit(feature_matrix_train, target_train )

In [8]:
pipeline_predictions = pipeline.predict( feature_matrix_test)

In [9]:
accuracy_score( target_test, pipeline_predictions)

0.89

Scaling the features resulted in a small improvement: accuracy rose from about 0.86 to 0.89.  Let's see if we can do even better.
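
To see why scaling matters for a distance-based model, the sketch below takes the five sample rows displayed earlier in this worksheet and measures how much of the squared Euclidean distance between the first two rows comes from the `ngrams` column, before and after `StandardScaler`. The helper `ngrams_share` is a hypothetical function written for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five sample rows displayed earlier in this worksheet.
# Columns: length, digits, entropy, vowel-cons, firstDigitIndex, ngrams
rows = np.array([
    [9, 0, 2.947703, 0.8, 0, 1652.575397],
    [26, 11, 3.92103, 0.071429, 3, 507.858974],
    [26, 5, 4.238901, 0.05, 1, 565.439145],
    [13, 0, 3.392747, 0.444444, 0, 1131.294289],
    [13, 0, 3.238901, 0.444444, 0, 816.691919],
])


def ngrams_share(data):
    """Fraction of the squared distance between rows 0 and 1 due to the last column."""
    squared_diff = (data[0] - data[1]) ** 2
    return squared_diff[-1] / squared_diff.sum()


raw_share = ngrams_share(rows)
scaled_share = ngrams_share(StandardScaler().fit_transform(rows))

print("ngrams share of squared distance, raw:    {:.4f}".format(raw_share))
print("ngrams share of squared distance, scaled: {:.4f}".format(scaled_share))
```

On the raw values, `ngrams` accounts for more than 99% of the squared distance, so the other features barely influence which neighbors are chosen. After scaling, its share is far smaller and every feature contributes.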

### Using RandomizedSearchCV and GridSearchCV to tune Hyperparameters
Now that we've scaled the features and built a simple pipeline, let's try to tune the hyperparameters to see if we can improve the model performance.  Scikit-learn provides two methods for accomplishing this task: `RandomizedSearchCV` and `GridSearchCV`. 


* `GridSearchCV`:  GridSearch iterates through all possible combinations of tuning parameters to find the optimal combination. (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* `RandomizedSearchCV`:  RandomizedSearch iterates through random combinations of parameters to find the optimal combination.  While RandomizedSearch does not try every possible combination, it is considerably faster than GridSearch and has been shown to get very close to the optimal combination.  (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) 
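
To get a feel for the difference in cost, the sketch below counts how many candidate combinations each approach would evaluate. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers with the same value ranges as the answer cell later in this worksheet; `n_iter=100` and the seed are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    "clf__n_neighbors": np.arange(1, 50, 2),
    "clf__weights": ["uniform", "distance"],
    "clf__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "clf__leaf_size": np.arange(1, 80, 2),
    "clf__p": [1, 2],
    "clf__metric": ["euclidean", "manhattan"],
}

# An exhaustive grid search tries every combination
grid_candidates = len(ParameterGrid(params))

# A randomized search draws only n_iter combinations
random_candidates = len(ParameterSampler(params, n_iter=100, random_state=0))

print("grid search candidates:      ", grid_candidates)
print("randomized search candidates:", random_candidates)
```

That is 32,000 combinations for the exhaustive grid versus 100 for the randomized search, and each candidate is fit once per cross-validation fold.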

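As the sample results below show, the model was able to achieve **91.9%** accuracy with `RandomizedSearchCV`!  (The log line is labeled "grid search accuracy", but the figure comes from the randomized search.)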
```
[INFO] randomized search took 0.85 seconds
[INFO] grid search accuracy: 91.93%
[INFO] randomized search best parameters: {'clf__weights': 'uniform', 'clf__p': 1, 'clf__n_neighbors': 27, 'clf__metric': 'euclidean', 'clf__leaf_size': 25, 'clf__algorithm': 'kd_tree'}
```

Both `RandomizedSearchCV` and `GridSearchCV` require you to provide a grid of parameters.  You will need to refer to the documentation for the classifier you are using to get a list of parameters for that particular model.  Also, since we will be using the pipeline, you have to format the parameter names correctly: the name of each parameter must be preceded by the name of the step in your pipeline and two underscores.  For example, if the classifier in the pipeline is called `clf`, and you have tuning parameters called `n_neighbors` and `metric`, the parameter grid would be as follows:
```python
params = {
    "clf__n_neighbors": np.arange(1, 50, 2),
    "clf__metric": ["euclidean", "cityblock"] 
}
```
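
If you are unsure which names are valid, one way to check is to ask the pipeline itself: `get_params()` returns every tunable parameter, already in the `<step>__<parameter>` form. A minimal sketch, assuming the same two-step pipeline used in this worksheet:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])

# Keep only the parameters that belong to the 'clf' step
clf_param_names = sorted(
    name for name in pipeline.get_params() if name.startswith('clf__'))

for name in clf_param_names:
    print(name)
```

Any key printed here can be used in the parameter grid; a misspelled key raises an error when the search is fit.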

### Your Task
Using either GridSearchCV or RandomizedSearchCV, improve the performance of your model.

In [10]:
params = {"clf__n_neighbors": np.arange(1, 50, 2), 
         "clf__weights": ["uniform", "distance"],
         "clf__algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
         "clf__leaf_size": np.arange(1, 80, 2),
         "clf__p": [1,2],
         "clf__metric": ["euclidean", "manhattan"]}



grid = RandomizedSearchCV(pipeline, params, n_iter=100)
start = time.time()
grid.fit(feature_matrix_train, target_train)
 
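# Report how long the randomized search took
print("[INFO] randomized search took {:.2f} seconds".format(time.time() - start))

# best_score_ is the mean cross-validated accuracy of the best candidate on the
# training data; to score the held-out test data instead, use:
#acc = grid.score(feature_matrix_test, target_test)
acc = grid.best_score_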
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] randomized search best parameters: {}".format(grid.best_params_))

[INFO] randomized search took 2.50 seconds
[INFO] grid search accuracy: 88.73%
[INFO] randomized search best parameters: {'clf__weights': 'distance', 'clf__p': 2, 'clf__n_neighbors': 5, 'clf__metric': 'euclidean', 'clf__leaf_size': 19, 'clf__algorithm': 'ball_tree'}
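
The accuracy reported by `best_score_` is a cross-validation estimate computed on the training data. To compare it with performance on held-out data, the fitted search object can score the test set directly. The sketch below assumes synthetic data, a reduced parameter grid, and fixed seeds so that it runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the DGA data (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])

# Reduced grid for illustration only
params = {
    "clf__n_neighbors": np.arange(1, 50, 2),
    "clf__weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(pipeline, params, n_iter=10, random_state=0)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate (training data only)
cv_accuracy = search.best_score_

# Accuracy of the refitted best pipeline on data it has never seen
test_accuracy = search.score(X_test, y_test)

print("cross-validated accuracy: {:.2f}%".format(cv_accuracy * 100))
print("held-out test accuracy:   {:.2f}%".format(test_accuracy * 100))
print("best parameters: {}".format(search.best_params_))
```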


## Model Comparison
Your final task is to:
1.  Using RandomForest, create a classifier for the DGA dataset
2.  Use either GridSearchCV or RandomizedSearchCV to find the optimal parameters for this model.

How does this model compare with the first K-NN classifier for this data?

In [11]:
rf_clf = RandomForestClassifier()
params = {
    "n_estimators": np.arange(1, 400, 50),
    # Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
    "max_features": ['auto', 'sqrt','log2' ],
    "max_depth": np.arange(1, 20, 2),
    "criterion": ['gini','entropy']
} 

rf_grid = RandomizedSearchCV(rf_clf, params )
start = time.time()
rf_grid.fit(feature_matrix_train, target_train)
 
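# Report how long the randomized search took
print("[INFO] randomized search took {:.2f} seconds".format(time.time() - start))

# best_score_ is the mean cross-validated accuracy of the best candidate on the
# training data; to score the held-out test data instead, use:
#acc = rf_grid.score(feature_matrix_test, target_test)
acc = rf_grid.best_score_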
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] randomized search best parameters: {}".format(rf_grid.best_params_))

[INFO] randomized search took 6.13 seconds
[INFO] grid search accuracy: 90.47%
[INFO] randomized search best parameters: {'n_estimators': 201, 'max_features': 'auto', 'max_depth': 5, 'criterion': 'gini'}
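
To answer the comparison question on equal terms, it helps to score every model on the same held-out test set. The sketch below is one way to lay that out, assuming synthetic data and untuned models with illustrative names; with the real data you would substitute the DGA split and your tuned estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the DGA data (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidate models, keyed by an illustrative label
models = {
    "knn_default": KNeighborsClassifier(),
    "knn_scaled": Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier())
    ]),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training data and score it on the same test data
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in test_scores.items():
    print("{:<15} {:.3f}".format(name, score))
```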
