<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# Worksheet 5.3: Tuning your Classifier - Answers
This worksheet covers concepts relating to tuning a classifier.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

## Import the Libraries
For this exercise, we will be using:
* Pandas (https://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Scikit-learn (https://scikit-learn.org/stable/documentation.html)

In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import time
import pickle
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import uniform as sp_rand

import warnings; warnings.simplefilter('ignore')


## Load the Data
For this exercise, we are going to focus on building a pipeline and then tuning the resultant model.

In [2]:
df_final = pd.read_csv('../data/dga_features_final_df.csv')
target = df_final['isDGA']
feature_matrix = df_final.drop(['isDGA'], axis='columns')
feature_matrix.sample(5)

Unnamed: 0,length,digits,entropy,vowel-cons,firstDigitIndex,ngrams
954,26,11,3.931209,0.25,3,627.833376
578,27,11,3.912114,0.230769,1,656.416619
183,14,0,3.521641,0.076923,0,827.46337
614,15,0,3.006239,0.071429,0,628.512821
917,13,0,3.392747,0.3,0,923.229021


### Split the data into training and testing sets.
We're going to need a training and a testing dataset, so you know the drill: split the data.

In [3]:
feature_matrix_train, feature_matrix_test, target_train, target_test = train_test_split(feature_matrix, 
                                                                                        target, 
                                                                                        test_size=0.25)
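
The split above is reshuffled on every run and does not guarantee the same class balance in both sets. A minimal sketch, assuming a small synthetic stand-in for the DGA features (the `demo_*` names, the column values, and the seed 42 are illustrative and not part of the worksheet data), shows how `random_state` and `stratify` could make the split repeatable and class-balanced:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DGA feature matrix: 200 rows, balanced labels.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({
    "length": rng.integers(5, 30, size=200),
    "entropy": rng.uniform(2.0, 4.5, size=200),
})
demo_target = pd.Series([0, 1] * 100, name="isDGA")

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    demo_features,
    demo_target,
    test_size=0.25,
    random_state=42,
    stratify=demo_target,
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().to_dict())
```

The worksheet deliberately leaves the seed unset, so this is an optional variation rather than a required change.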

## Build a Model
For this exercise, we're going to create a K-NN Classifier for the DGA data and tune it.   (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) 


1. Create a classifier with the default options (leave the arguments blank and the defaults will be used).
2. Train the classifier on the training data
3. Calculate the accuracy score for the model based on the test data.


The sklearn default values for the KNeighborsClassifier hyperparameters are shown below.
```python 
KNeighborsClassifier(algorithm='auto', 
                     leaf_size=30, 
                     metric='minkowski',
                     metric_params=None, 
                     n_jobs=None, 
                     n_neighbors=5, 
                     p=2,
                     weights='uniform')
```           

In [4]:
#create
knn_model = KNeighborsClassifier()
#train
knn_model.fit( feature_matrix_train, target_train )

0,1,2
,n_neighbors,5
,weights,'uniform'
,algorithm,'auto'
,leaf_size,30
,p,2
,metric,'minkowski'
,metric_params,
,n_jobs,


In [5]:
#predict
default_predictions = knn_model.predict(feature_matrix_test)

In [6]:
#metric
accuracy_score(target_test, default_predictions)

0.848
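
Accuracy alone does not show which class the errors fall in. The sketch below, which assumes synthetic two-class data from `make_classification` in place of the DGA features (all `demo_*` names are illustrative), uses the `confusion_matrix` and `classification_report` functions imported at the top of the worksheet to break the score down per class:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the DGA features (an assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_knn = KNeighborsClassifier().fit(X_train, y_train)
demo_predictions = demo_knn.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
demo_cm = confusion_matrix(y_test, demo_predictions)
print(demo_cm)

# Precision, recall and F1 for each class.
print(classification_report(y_test, demo_predictions))

# Accuracy equals the diagonal of the confusion matrix divided by the total.
demo_accuracy = accuracy_score(y_test, demo_predictions)
print(demo_accuracy)
```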

In [7]:
# save the model
filename = '../data/dga_model.sav'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(knn_model, model_file)
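
A saved model is only useful if it can be loaded again. The following sketch assumes a throwaway model trained on synthetic data and a temporary file (the `demo_*` names and path are hypothetical, chosen so nothing in `../data` is touched) and checks that the restored model predicts the same labels as the original:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a throwaway model, used only for this illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = KNeighborsClassifier().fit(X, y)

# Write to a temporary directory rather than the worksheet's data folder.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.sav")
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)

# Only unpickle files you trust: loading a pickle can execute arbitrary code.
with open(demo_path, "rb") as f:
    restored_model = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored_model.predict(X) == demo_model.predict(X)).all())
print(same_predictions)
```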

## Improving Performance 
The model achieves approximately 85% accuracy when we use the default parameters.  That is well above chance (50% if the two classes are balanced), but let's see if we can do better. 

**Note:  This notebook is written without using fixed random seeds, so you might get slightly different results.**

### Scaling the Features (preprocessing)
K-NN is a distance-based classifier, so features with large numeric ranges (such as `ngrams`) dominate the distance calculation unless the features are scaled before the model is trained.  
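
The effect can be seen on a toy example. The sketch below assumes five made-up rows with two columns on very different scales (the values are invented for illustration and are not taken from the DGA data) and compares the Euclidean distances from the first row before and after `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: column 0 is a small-range feature, column 1 a large-range one.
demo = np.array([
    [10.0, 600.0],   # query row
    [30.0, 610.0],   # very different in column 0, close in column 1
    [11.0, 700.0],   # close in column 0, somewhat different in column 1
    [5.0, 1000.0],
    [25.0, 500.0],
])

def distances_from_first(rows):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(rows[1:] - rows[0], axis=1)

raw = distances_from_first(demo)
scaled = distances_from_first(StandardScaler().fit_transform(demo))

# Index (within rows 1..4) of the nearest neighbour of the query row.
nearest_raw = int(np.argmin(raw))
nearest_scaled = int(np.argmin(scaled))
print("raw distances:   ", raw.round(2), "nearest:", nearest_raw)
print("scaled distances:", scaled.round(2), "nearest:", nearest_scaled)
```

On the raw values the large-range column decides which row is nearest; after scaling both columns contribute comparably and a different row becomes the nearest neighbour.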

### Create Pipeline
Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.  Let's create a simple pipeline with two steps:

1.  StandardScaler
2.  Train the classifier

Once you've done that, calculate the accuracy and see if it has improved.

In [8]:
pipeline_knn = Pipeline([
    ('scaler',StandardScaler()),
    ('knn_model', KNeighborsClassifier())
])

pipeline_knn.fit(feature_matrix_train, target_train ).score(feature_matrix_test, target_test)

0.894

Scaling the features did result in a small improvement: accuracy rose from roughly .85 to .89 in the run shown above.  But let's see if we can't do even better.

### Using RandomizedSearchCV and GridSearchCV to tune Hyperparameters
Now that we've scaled the features and built a simple pipeline, let's try to tune the hyperparameters to see if we can improve the model performance.  Scikit-learn provides two methods for accomplishing this task: `RandomizedSearchCV` and `GridSearchCV`. 


* `GridSearchCV`:  GridSearch iterates through all possible combinations of tuning parameters to find the optimal combination. (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* `RandomizedSearchCV`:  RandomizedSearch iterates through random combinations of parameters to find a good combination.  While RandomizedSearch does not try every possible combination, it has been shown to get very close to the optimal combination in considerably less time than GridSearch.  (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) 
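
The difference in cost can be estimated before running anything. This sketch assumes an illustrative grid for a K-NN classifier (the value ranges and `n_iter = 100` are chosen for the example) and uses scikit-learn's `ParameterGrid` to count how many model fits each strategy would need:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Illustrative search space for a K-NN classifier.
demo_params = {
    "n_neighbors": np.arange(1, 50, 2),    # 25 values
    "weights": ["uniform", "distance"],    # 2 values
    "leaf_size": np.arange(1, 80, 2),      # 40 values
}

# ParameterGrid enumerates every combination, which is what GridSearchCV tries.
n_combinations = len(ParameterGrid(demo_params))

n_folds = 5    # scikit-learn's default number of cross-validation folds
n_iter = 100   # number of random candidates, chosen for illustration

grid_fits = n_combinations * n_folds
random_fits = n_iter * n_folds
print("combinations in the grid:", n_combinations)
print("grid search fits:        ", grid_fits)
print("randomized search fits:  ", random_fits)
```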

Both `RandomizedSearchCV` and `GridSearchCV` require you to provide a grid of parameters.  You will need to refer to the documentation for the classifier you are using to get a list of parameters for that particular model.  Also, since we will be using the pipeline, you have to format the parameter names correctly:  the name of the parameter must be preceded by the name of the step in your pipeline and two underscores.  For example, if the classifier in the pipeline is called `knn_model` and you have tuning parameters called `n_neighbors` and `metric`, the parameter grid would be as follows:
```python
params = {
    "knn_model__n_neighbors": np.arange(1, 50, 2),
    "knn_model__metric": ["euclidean", "cityblock"] 
}
```
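
If you are unsure which names are valid, the pipeline can list them for you. A minimal sketch, assuming a pipeline with the same two step names used in this worksheet (`demo_pipeline` and `tunable` are illustrative names):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn_model", KNeighborsClassifier()),
])

# Keys of the form "<step>__<parameter>" are the names a parameter grid may use.
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
for name in tunable:
    print(name)

# set_params accepts the same names, so a single value can be tried by hand.
demo_pipeline.set_params(knn_model__n_neighbors=7)
print(demo_pipeline.named_steps["knn_model"].n_neighbors)
```

A misspelled step or parameter name raises an error when the search is fitted, so checking the names first can save a failed run.
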
### Your Task
Using RandomizedSearchCV, improve the performance of your model.

In [9]:
params = {"knn_model__n_neighbors": np.arange(1, 50, 2), 
         "knn_model__weights": ["uniform", "distance"],
         "knn_model__algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
         "knn_model__leaf_size": np.arange(1, 80, 2),
         "knn_model__p": [1,2],
         "knn_model__metric": ["euclidean", "manhattan"]}


grid = RandomizedSearchCV(pipeline_knn, params, n_iter=100)
start = time.time()
grid.fit(feature_matrix_train, target_train)
 
# report the search time and the best mean cross-validated accuracy
# (best_score_ is computed on the training folds, not on the test data)
print("[INFO] randomized search took {:.2f} seconds".format(time.time() - start))

acc = grid.best_score_
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] randomized search best parameters: {}".format(grid.best_params_))

[INFO] randomized search took 2.67 seconds
[INFO] grid search accuracy: 90.60%
[INFO] randomized search best parameters: {'knn_model__weights': 'uniform', 'knn_model__p': 1, 'knn_model__n_neighbors': 29, 'knn_model__metric': 'euclidean', 'knn_model__leaf_size': 25, 'knn_model__algorithm': 'auto'}


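The model achieved a slightly higher accuracy with `RandomizedSearchCV`: about 90.6% in this run.  Keep in mind that `best_score_` is the mean cross-validated accuracy on the training data, so it is not directly comparable to the test-set scores above.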
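
To put the tuned pipeline on the same footing as the earlier test-set scores, the refit best estimator can be scored on the held-out test data. The sketch below assumes synthetic data and a deliberately small search space (all `demo_*` names, the seeds, and `n_iter=10` are illustrative) and reports both numbers side by side:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the DGA features (an assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn_model", KNeighborsClassifier()),
])
demo_params = {
    "knn_model__n_neighbors": np.arange(1, 30, 2),
    "knn_model__weights": ["uniform", "distance"],
}

demo_search = RandomizedSearchCV(
    demo_pipeline, demo_params, n_iter=10, random_state=0
)
demo_search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate (training folds only).
cv_accuracy = demo_search.best_score_
# Accuracy of the refit best pipeline on data the search never saw.
test_accuracy = demo_search.score(X_test, y_test)

print("cross-validated accuracy: {:.3f}".format(cv_accuracy))
print("held-out test accuracy:   {:.3f}".format(test_accuracy))
```

In the worksheet itself the equivalent call would be `grid.score(feature_matrix_test, target_test)`.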


## Model Comparison
Your final task is to:
1.  Using RandomForest, create a classifier for the DGA dataset
2.  Use either GridSearchCV or RandomizedSearchCV to find the optimal parameters for this model.

How does this model compare with the first K-NN classifier for this data?

In [10]:
rf_clf = RandomForestClassifier()
params = {
    "n_estimators": np.arange(1, 400, 50),
    "max_features": ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    "max_depth": np.arange(1, 20, 2),
    "criterion": ['gini','entropy']
} 

rf_grid = RandomizedSearchCV(rf_clf, params )
start = time.time()
rf_grid.fit(feature_matrix_train, target_train)
 
# report the search time and the best mean cross-validated accuracy
# (best_score_ is computed on the training folds, not on the test data)
print("[INFO] randomized search took {:.2f} seconds".format(time.time() - start))

# to score on the held-out test data instead, use:
# acc = rf_grid.score(feature_matrix_test, target_test)
acc = rf_grid.best_score_
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] randomized search best parameters: {}".format(rf_grid.best_params_))

[INFO] randomized search took 2.70 seconds
[INFO] grid search accuracy: 90.87%
[INFO] randomized search best parameters: {'n_estimators': 251, 'max_features': 'log2', 'max_depth': 5, 'criterion': 'entropy'}
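
One way to answer the comparison question is to score every candidate on the same held-out test set. This is a sketch under stated assumptions: synthetic data replaces the DGA features, the models use default hyperparameters rather than the tuned ones, and the `candidates` and `demo_scores` names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the DGA features (an assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "knn_default": KNeighborsClassifier(),
    "knn_scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("knn_model", KNeighborsClassifier()),
    ]),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training data and score it on the same test data.
demo_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    demo_scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in demo_scores.items():
    print("{:<15} {:.3f}".format(name, score))
```

With the worksheet's objects, the same loop could be run over `knn_model`, `grid.best_estimator_` and `rf_grid.best_estimator_`.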
