<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# Worksheet 5.2: DGA Detection - Answers

# Machine Learning - Supervised Learning
 
This worksheet covers concepts from Module 5 - Supervised Learning. You will train and evaluate a classification model using [sklearn](http://scikit-learn.org/stable/). It should take 40-60 minutes to complete. Please raise your hand if you get stuck.

## Import the Libraries
For this exercise, we will be using:
* Pandas (http://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (http://matplotlib.org/api/pyplot_api.html)
* Scikit-learn (http://scikit-learn.org/stable/documentation.html)
* Lime (https://github.com/marcotcr/lime)

In [None]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
from sklearn import feature_extraction, tree, model_selection, metrics
import matplotlib.pyplot as plt
import matplotlib
import lime
import io
import pickle
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# 5.2.0 Load Features and Labels

If you got stuck in the Feature Engineering section, uncomment the second ```read_csv``` line in the cell below to load the feature matrix we prepared for you, so you can move on to training a Decision Tree Classifier.

In [None]:
#load full dataset
df_final = pd.read_csv('../data/dga_features_final_df.csv')

#If you didn't get a working dataset, uncomment this line
#df_final = pd.read_csv('../data/our_data_dga_features_final_df.csv')


print(df_final['isDGA'].value_counts())
df_final.head()

# 5.2.1 Prepare the ```feature_matrix``` and ```target``` 

- In statistics and machine learning, the ```feature_matrix``` is often referred to as ```X```
- The target vector that contains the labels for each row is called ```y``` 
- In sklearn, the features are typically a pandas DataFrame or a 2-D numpy array, and the targets a pandas Series or a 1-D numpy array (plain Python lists are accepted too, but DataFrames and arrays are easier to work with)

Tasks:
- 5.2.1.1 Create a vector that contains the **target**s
- 5.2.1.2 Create the **feature_matrix** that has only the features and not the targets 

## 5.2.1.1 Create a vector named 'target' 

Assign the **isDGA** column to a pandas Series named **target**. The ```target``` variable should be a vector (1 dimension) of the correct (ground truth) answer for each row of the dataset. For this DGA use case, each item will be a string that indicates whether the domain was **dga** or **legit**. 

In [None]:
target = df_final['isDGA']
target.head()

## 5.2.1.2 Create the Feature Matrix

In order to train a model you have to separate the features from the targets. Create the ```feature_matrix``` (pandas dataframe) by dropping the **isDGA** column from ```df_final```.

In [None]:
feature_matrix = df_final.drop(['isDGA'], axis='columns')
feature_matrix.head()
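
The same separation can be sketched on a tiny, made-up DataFrame. The names and values below (`toy_df`, `length`, `entropy`) are illustrative assumptions, not the real worksheet features.

```python
import pandas as pd

# Hypothetical toy data; the real worksheet features differ.
toy_df = pd.DataFrame({
    "isDGA": ["dga", "legit", "dga", "legit"],
    "length": [22, 6, 18, 8],
    "entropy": [3.9, 2.1, 3.7, 2.5],
})

# y: one-dimensional Series holding the label for each row
toy_target = toy_df["isDGA"]

# X: every column except the label column
toy_features = toy_df.drop(columns=["isDGA"])

print(toy_target.shape)    # (4,)
print(toy_features.shape)  # (4, 2)
```

The target has one dimension (one label per row), while the feature matrix keeps two dimensions (rows by features).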

Create a list of our feature names. We will use it for plotting later, and it lets us pull the features from the full dataframe again if needed.


In [None]:
feature_names = feature_matrix.columns.to_list()
print(feature_names)

# 5.2.2 Test-Train split

Split your ```feature_matrix``` and ```target``` into **train** and **test** subsets using sklearn's [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

The output of the split should be two complete sets of data (**train** and **test**), each still separated into features and labels:
 - **feature_matrix_train**: 75% of the feature matrix (data)
 - **feature_matrix_test**: the remaining 25% of the feature matrix
 - **target_train**: the labels for the train features
 - **target_test**: the labels for the test features

In [None]:
feature_matrix_train, feature_matrix_test, target_train, target_test = model_selection.train_test_split(
    feature_matrix,
    target,
    test_size=0.25,
    random_state=33,
)
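
As a rough illustration of how `test_size=0.25` divides the rows, the sketch below splits eight made-up samples. The toy arrays are assumptions; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy data: 8 rows, 2 features
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.array(["dga", "legit"] * 4)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = model_selection.train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=33
)

# 25% of 8 rows go to the test set, the rest to training
print(toy_X_train.shape, toy_X_test.shape)  # (6, 2) (2, 2)
```

Fixing `random_state` makes the shuffle, and therefore the split, reproducible across runs.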

In [None]:
feature_matrix_train.shape

In [None]:
feature_matrix_test.shape

In [None]:
target_train[1:5]

In [None]:
target_test[1:5]

# 5.2.3 Train the model and make a prediction

Finally, we have prepared and split the data. Let's start classifying!!   

Tasks:
- Using sklearn's [tree.DecisionTreeClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), instantiate (create) a decision tree model with default parameters (we will tune these in the next lab).
- Train this model using the ```feature_matrix_train``` and ```target_train``` data (you will need to call the **.fit()** method on the model to do this).
-  Next, pull a few random rows from the data to spot check the predictions of the model against the true labels.

In [None]:
d_tree_model = tree.DecisionTreeClassifier()  
d_tree_model = d_tree_model.fit(feature_matrix_train, target_train)

That's it! You trained the model. Now extract a row from the test set to see if the model can predict the correct answer by comparing it to the test target (ground truth). 

In [None]:
# Extract a row from the test data

row_number = 14
row_feature = feature_matrix_test[row_number:row_number+1]

# Make the prediction
row_pred = d_tree_model.predict(row_feature)

# pull out the ground truth for this row
row_target = target_test[row_number:row_number+1]

                                                    
# print the results and the ground truth
print('Predicted class:', row_pred)

print('Ground truth class:', row_target)

print('Accurate prediction?', row_pred == row_target)
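
A spot check can also select the row with `.iloc`, which makes the positional intent explicit. The following is a minimal sketch on made-up data; the feature names, values, and `toy_` variables are assumptions.

```python
import pandas as pd
from sklearn import tree

# Hypothetical, cleanly separable toy data
toy_X = pd.DataFrame({"length": [22, 6, 18, 8, 25, 5],
                      "entropy": [3.9, 2.1, 3.7, 2.5, 4.0, 1.9]})
toy_y = pd.Series(["dga", "legit", "dga", "legit", "dga", "legit"])

toy_model = tree.DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Double brackets keep the row two-dimensional, as predict() expects.
# This row comes from the training data, so a match is expected here.
row_position = 2
toy_row = toy_X.iloc[[row_position]]
toy_pred = toy_model.predict(toy_row)[0]
toy_truth = toy_y.iloc[row_position]

print("Predicted:", toy_pred, "| Ground truth:", toy_truth)
print("Match:", toy_pred == toy_truth)
```

Comparing two scalars gives a single `True` or `False` instead of a one-element Series.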

# 5.2.4 Make predictions on test set

Make predictions for all your **test** data. This will be data that the model has not 'seen' before so we will use these predictions to evaluate how well the model can predict the correct answer on new data by calling a few different metrics functions.

- Call the ```.predict()``` method on the model ```d_tree_model``` with your test data ```feature_matrix_test``` and store the results in a variable called ```test_predictions```. 
  
- Then calculate the **accuracy** (and several other metrics) using ```target_test``` (the true labels/ground truth) and your model's predictions on the test portion, ```test_predictions```, as inputs. 

In [None]:
# make predictions on all of the test data
test_predictions = d_tree_model.predict(feature_matrix_test)

# print a sample of the predictions
print(test_predictions[0:5])

#### Save the model as a pickle file

In [None]:
filename = '../data/dga_decision_tree.sav'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as model_file:
    pickle.dump(d_tree_model, model_file)
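
To reuse a saved model later, it can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in model through a temporary file; the toy data and the temporary path are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import tree

# Hypothetical stand-in for the trained worksheet model
toy_X = np.array([[22, 3.9], [6, 2.1], [18, 3.7], [8, 2.5]])
toy_y = np.array(["dga", "legit", "dga", "legit"])
toy_model = tree.DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Save to a temporary path; the context manager closes the file
toy_path = os.path.join(tempfile.mkdtemp(), "toy_tree.sav")
with open(toy_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load the model back and confirm it predicts the same labels
with open(toy_path, "rb") as f:
    loaded_model = pickle.load(f)

print((loaded_model.predict(toy_X) == toy_model.predict(toy_X)).all())
```

Only unpickle files from sources you trust, and where possible load a model with the same scikit-learn version that saved it.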

# 5.2.5 Evaluate the model performance with metrics and visualizations

## 5.2.5.1 Print metrics
Use sklearn [metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to calculate the model accuracy. 

In [None]:
print("Accuracy:", metrics.accuracy_score(target_test, test_predictions))

In [None]:
print(metrics.classification_report(target_test, test_predictions))

In [None]:
conf_matrix = metrics.confusion_matrix(target_test, test_predictions, labels=d_tree_model.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix,
                              display_labels=d_tree_model.classes_)

disp.plot(cmap='summer');
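
To see how these numbers relate, accuracy can be recomputed from the confusion matrix, because the diagonal holds the correct predictions. The labels below are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical ground truth and predictions
toy_true = np.array(["dga", "dga", "dga", "legit", "legit", "legit", "legit", "dga"])
toy_pred = np.array(["dga", "dga", "legit", "legit", "legit", "dga", "legit", "dga"])

# Rows are true classes, columns are predicted classes
toy_cm = metrics.confusion_matrix(toy_true, toy_pred, labels=["dga", "legit"])
print(toy_cm)

# Correct predictions sit on the diagonal
manual_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(manual_accuracy, metrics.accuracy_score(toy_true, toy_pred))
```

Here 6 of the 8 predictions are correct, so both calculations give 0.75.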

## 5.2.5.2 (Optional) Visualizing your Tree
As an optional step, you can actually visualize your tree.  The following code will generate a graph of your decision tree.  You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work.

In [None]:
# These libraries are used to visualize the decision tree and require that you have GraphViz
# and pydot or pydotplus installed on your computer.

from IPython.display import Image
import pydotplus as pydot


dot_data = io.StringIO() 
tree.export_graphviz(d_tree_model, out_file=dot_data, 
                     feature_names=feature_names,
                    filled=True, rounded=True,  
                    special_characters=True) 

graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())
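
If graphviz is not available, scikit-learn's `tree.export_text` prints the tree's rules as plain text without extra dependencies. This is a minimal sketch on made-up data; in the worksheet you would pass `d_tree_model` and `feature_names` instead of the toy objects.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data with two features
toy_X = np.array([[22, 3.9], [6, 2.1], [18, 3.7], [8, 2.5]])
toy_y = np.array(["dga", "legit", "dga", "legit"])
toy_model = tree.DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Render the learned split rules as indented text
toy_rules = tree.export_text(toy_model, feature_names=["length", "entropy"])
print(toy_rules)
```

A text dump is less visual than the graph, but it is quick to scan for small trees.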


# 5.2.6 Explain a Prediction
In the example below, you can use LIME to explain how a classifier arrived at its prediction. Try running LIME with the classifier you've created on various rows to see how it works.

In [None]:
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(feature_matrix_train, 
                                                   feature_names=feature_names, 
                                                  class_names=d_tree_model.classes_, 
                                                   discretize_continuous=False)

Let's look at the explanation for one data point (row) from the test set. 

In [None]:
sample_number = 12

exp = explainer.explain_instance(feature_matrix_test.iloc[sample_number], 
                                 d_tree_model.predict_proba, 
                                 num_features=6)

In [None]:
exp.show_in_notebook(show_table=True, show_all=True)

In [None]:
feature_matrix_test.iloc[sample_number]

# 5.2.7 Train and evaluate more models
Now that you've built a Decision Tree, let's try out three other classifiers and see how they perform on this data.  For this next exercise, create classifiers using:

* Support Vector Machine
* Random Forest
* K-Nearest Neighbors (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)  

Once you've done that, run the various performance metrics to determine which classifier works best.

## 5.2.7.1 Create the Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(n_estimators=10, 
                             max_depth=None, 
                             min_samples_split=2, 
                             random_state=0)

random_forest_clf = random_forest_clf.fit(feature_matrix_train, target_train)

#### Make predictions

In [None]:
random_forest_test_predictions = random_forest_clf.predict(feature_matrix_test)

#### Metrics

In [None]:
print("Accuracy:", metrics.accuracy_score(target_test, random_forest_test_predictions))

In [None]:
print(metrics.classification_report(target_test, random_forest_test_predictions))

In [None]:
rf_conf_matrix = metrics.confusion_matrix(target_test, random_forest_test_predictions, labels=random_forest_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=rf_conf_matrix,
                              display_labels=random_forest_clf.classes_)

disp.plot(cmap='cividis');

## 5.2.7.2 Train an SVM Classifier


In [None]:
from sklearn import svm

svm_classifier = svm.SVC()
svm_classifier = svm_classifier.fit(feature_matrix_train, target_train)  

#### Make predictions on the test set

In [None]:
svm_test_predictions = svm_classifier.predict(feature_matrix_test)

#### Metrics

In [None]:
print("SVM Accuracy:", metrics.accuracy_score(target_test, svm_test_predictions))

In [None]:
print(metrics.classification_report(target_test, svm_test_predictions))

In [None]:
svm_conf_matrix = metrics.confusion_matrix(target_test, svm_test_predictions, labels=svm_classifier.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=svm_conf_matrix,
                              display_labels=svm_classifier.classes_)

disp.plot(cmap='magma');

## 5.2.7.3 Train a K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(feature_matrix_train, target_train) 

#### Make predictions

In [None]:
knn_test_predictions = knn_clf.predict(feature_matrix_test)

#### Metrics

In [None]:
print("KNN Accuracy:", metrics.accuracy_score(target_test, knn_test_predictions))

In [None]:
print(metrics.classification_report(target_test, knn_test_predictions))

In [None]:
knn_conf_matrix = metrics.confusion_matrix(target_test, knn_test_predictions, labels=knn_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=knn_conf_matrix,
                                       display_labels=knn_clf.classes_)

disp.plot(cmap='ocean');
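
To compare the classifiers side by side, the same fit, predict, and score steps can be run in a loop. The sketch below uses synthetic data from `make_classification` as a stand-in for the DGA features, so its scores say nothing about the real dataset; swap in `feature_matrix_train`, `feature_matrix_test`, `target_train`, and `target_test` to compare on the worksheet data.

```python
from sklearn import metrics, model_selection, svm, tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the DGA features)
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=33
)

toy_models = {
    "Decision Tree": tree.DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "SVM": svm.SVC(),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the same split and record its test accuracy
toy_scores = {}
for name, model in toy_models.items():
    model.fit(X_tr, y_tr)
    toy_scores[name] = metrics.accuracy_score(y_te, model.predict(X_te))

# Print the models from highest to lowest accuracy
for name, score in sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Accuracy alone can mislead when the classes are imbalanced, so the classification report and confusion matrix above remain useful for the final comparison.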