<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

# Worksheet 5.2: DGA Detection
This worksheet covers concepts from Module 5 on Supervised Learning.  It should take 40-60 minutes to complete.  Please raise your hand if you get stuck.  

## Import the Libraries
For this exercise, we will be using:
* Pandas (https://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (https://matplotlib.org/)
* Scikit-learn (https://scikit-learn.org/stable/documentation.html)
* YellowBrick (https://www.scikit-yb.org/en/latest/)
* Seaborn (https://seaborn.pydata.org)
* Lime (https://github.com/marcotcr/lime)

In [2]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn import feature_extraction, tree, model_selection, metrics
#from yellowbrick.classifier import ClassificationReport
#from yellowbrick.classifier import ConfusionMatrix
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import LabelEncoder
import pickle

import lime
import io
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## Worksheet - DGA Detection using Machine Learning

This worksheet is a step-by-step guide to detecting domains that were generated by a "Domain Generation Algorithm" (DGA). We will walk you through transforming raw domain strings into Machine Learning features and creating a decision tree classifier, which you will use to determine whether a given domain is legit or not. Once you have implemented the classifier, the worksheet will walk you through evaluating your model.  

Overview of the 2 main steps:

1. **Feature Engineering** - from raw domain strings to numeric Machine Learning features using DataFrame manipulations
2. **Machine Learning Classification** - predict whether a domain is legit or not using a Decision Tree Classifier
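
To make the feature engineering step concrete, here is a minimal sketch of how a raw domain string could be turned into numeric features such as `length`, `digits`, and `entropy`. The helper names `shannon_entropy` and `domain_features` are hypothetical, and the exact definitions used to build the prepared feature file are not shown in this worksheet, so treat this as an illustration only.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain):
    """Hypothetical helper: map one domain string to a few numeric features."""
    return {
        "length": len(domain),                          # number of characters
        "digits": sum(ch.isdigit() for ch in domain),   # number of digit characters
        "entropy": shannon_entropy(domain),             # randomness of the characters
    }


features = domain_features("abc123")
print(features)
```

Random-looking DGA domains tend to use many different characters, which is why character entropy is a plausible feature for this task.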


  

**DGA - Background**

"Various families of malware use domain generation
algorithms (DGAs) to generate a large number of pseudo-random
domain names to connect to a command and control (C2) server.
In order to block DGA C2 traffic, security organizations must
first discover the algorithm by reverse engineering malware
samples, then generate a list of domains for a given seed. The
domains are then either preregistered, sink-holed or published
in a DNS blacklist. This process is not only tedious, but can
be readily circumvented by malware authors. An alternative
approach to stop malware from using DGAs is to intercept DNS
queries on a network and predict whether domains are DGA
generated. Much of the previous work in DGA detection is based
on finding groupings of like domains and using their statistical
properties to determine if they are DGA generated. However,
these techniques are run over large time windows and cannot be
used for real-time detection and prevention. In addition, many of
these techniques also use contextual information such as passive
DNS and aggregations of all NXDomains throughout a network.
Such requirements are not only costly to integrate, they may not
be possible due to real-world constraints of many systems (such
as endpoint detection). An alternative to these systems is a much
harder problem: detect DGA generation on a per domain basis
with no information except for the domain name. Previous work
to solve this harder problem exhibits poor performance and many
of these systems rely heavily on manual creation of features;
a time consuming process that can easily be circumvented by
malware authors..."    
[Citation: Woodbridge et al. 2016: "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"]

A better alternative for real-world deployment would be to use "featureless deep learning" - We have a separate notebook where you can see how this can be implemented!

**However, let's learn the basics first!!!**


## Import Feature Engineered Data and Labels

#### Load Features and Labels

If you got stuck in Part 1, please simply load the feature matrix we prepared for you, so you can move on to Part 2 and train a Decision Tree Classifier.

In [3]:
#load full dataset
df_final = pd.read_csv('../data/dga_features_final_df.csv')

#If you didn't get a working dataset, uncomment this line
#df_final = pd.read_csv('../data/our_data_dga_features_final_df.csv')

print(df_final['isDGA'].value_counts())
df_final.head()

isDGA
dga      1000
legit    1000
Name: count, dtype: int64


Unnamed: 0,isDGA,length,digits,entropy,vowel-cons,firstDigitIndex,ngrams
0,dga,13,0,3.546594,0.083333,0,744.67094
1,dga,26,10,4.132944,0.333333,1,715.217265
2,dga,8,0,2.5,0.333333,0,1918.797619
3,dga,26,7,4.180833,0.357143,1,682.269402
4,dga,24,9,3.834963,0.666667,2,544.17814


## Machine Learning - Supervised Classification

To learn simple classification procedures using [sklearn](http://scikit-learn.org/stable/) we have split the workflow into several steps.

### Step 1: Prepare Feature matrix and ```target``` vector containing the domain labels

- In statistics, the feature matrix is often referred to as ```X```
- target is a vector containing the label for each domain (often also called ```y``` in statistics)
- In sklearn, the input and target are typically a pandas DataFrame/Series or a numpy array/vector, respectively

Tasks:
- 1.1 assign 'isDGA' column to a pandas Series named 'target'
- 1.2 encode the strings as digits using sklearn ```LabelEncoder``` 
- 1.3 drop 'isDGA' column from ```df_final``` DataFrame and name the resulting pandas DataFrame ```feature_matrix```

## 1.1 Create a variable named 'target'
The variable should contain whether each row was dga or legit.

In [None]:
target = # Your code here ...

## 1.2 Encode strings as digits in the target variable

LabelEncoder is an encoder from sklearn that will transform a target vector (not the features) into integers.

[label encoder doc](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
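
As a small illustration of how `LabelEncoder` behaves, the sketch below encodes a made-up label Series (not the worksheet data) and then decodes it again. The variable names `toy_labels` and `encoder` are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only
toy_labels = pd.Series(["legit", "dga", "dga", "legit"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is an attribute holding the class names in sorted order;
# each label's integer code is its position in that array
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Because the classes are stored in sorted order, `dga` receives code 0 and `legit` receives code 1 in this toy example.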


In [None]:
label_encoder_model = LabelEncoder()
encoded_target = label_encoder_model.fit_transform(#your code here)

# print our classes with the .classes_ attribute
# your code here

In [None]:
#look at encoded targets
encoded_target

The reason we like to encode targets using sklearn is that once we are finished modeling, we can use the same encoder to turn them back into strings. This is really useful when plotting metrics or looking at the results (especially when you have several classes).

In [None]:
# transform our encoded targets back into strings (we will use this when we are evaluating the model later)
label_encoder_model.inverse_transform(encoded_target)

## 1.3 Create the Feature Matrix

In order to train a model you have to separate the features from the target into separate objects. Drop the **isDGA** column from the ```df_final``` DataFrame and name the resulting pandas DataFrame ```feature_matrix```.
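
The sketch below shows the general pattern of dropping a label column with pandas, using a tiny made-up DataFrame (`toy_df` is an assumption for illustration, not the worksheet data).

```python
import pandas as pd

# Tiny made-up frame with a label column and two feature columns
toy_df = pd.DataFrame({
    "isDGA": ["dga", "legit", "dga"],
    "length": [13, 6, 26],
    "digits": [0, 0, 10],
})

# drop returns a new DataFrame; toy_df itself is left unchanged
toy_features = toy_df.drop(columns=["isDGA"])
print(toy_features.columns.to_list())
```

Note that `drop` does not modify the original DataFrame unless you reassign the result.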

In [None]:
feature_matrix = # YOUR CODE HERE

In [None]:
# Create a list of our feature names for plotting later and in case we need to pull the features again from the full dataframe.
feature_names = feature_matrix.columns.to_list()
print(feature_names)

### Step 2: Test-Train split

Tasks:
- split your `feature_matrix` and `encoded_target` vector into **train** and **test** subsets using sklearn [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

Output of the split should be 2 complete sets of data separated by features and labels: 
 - feature_matrix_train: 75% of the feature matrix (data)
 - feature_matrix_test: the remaining 25% of the feature matrix
 - target_train: the labels for the train features
 - target_test: the labels for the test features
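
As a rough sketch of what a 75/25 split looks like, the example below splits made-up arrays (`X_toy` and `y_toy` are assumptions, not the worksheet data). The `random_state` and `stratify` arguments are optional choices made here for reproducibility and balanced classes.

```python
import numpy as np
from sklearn import model_selection

# Made-up data: 20 samples with 2 features each, and alternating labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,    # hold out 25% of the rows for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y_toy,    # keep the class proportions similar in both subsets
)
print(X_train.shape, X_test.shape)
```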

In [None]:
# Split the data set into training and test data
feature_matrix_train, feature_matrix_test, target_train, target_test = (model_selection.train_test_split(#YOUR CODE HERE))

### Step 3: Train the model and make a prediction

Finally, we have prepared and segmented the data. Let's start classifying!!   

Tasks:
-  Use the sklearn [tree.DecisionTreeClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), create a decision tree with default parameters, and train it using the ```.fit()``` function with ```feature_matrix_train``` and ```target_train``` data.
-  Next, pull a few random rows from the data and see if your classifier got it correct.
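
To show the fit/predict pattern in isolation, here is a minimal sketch on a handful of made-up rows. The arrays and the name `toy_tree` are assumptions for illustration; the worksheet data has more features and rows.

```python
import numpy as np
from sklearn import tree

# Made-up training data: [length, digits] per domain; 0 = dga, 1 = legit
X_toy = np.array([[26, 10], [24, 9], [25, 8], [6, 0], [8, 0], [7, 1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Default parameters, with a fixed seed for reproducibility
toy_tree = tree.DecisionTreeClassifier(random_state=0)
toy_tree.fit(X_toy, y_toy)

# Predict two unseen rows: one long with many digits, one short with none
toy_pred = toy_tree.predict(np.array([[27, 11], [5, 0]]))
print(toy_pred)
```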

In [None]:
# Create the decision tree and train it on the training data
d_tree_model = #YOUR CODE HERE

That's it! You trained the model. Now extract a row from the test set to see if the model can predict the correct answer by comparing it to the test target (ground truth). 

In [None]:
# Extract a row from the test data

row_number = 14
test_feature = feature_matrix_test[row_number:row_number+1]

# Make the prediction
pred = d_tree_model.predict(test_feature)

# Let's go back to having strings in the targets to compare
# because it's easier to read. We can do this using the inverse_transform from the label encoder model
pred_string = label_encoder_model.inverse_transform(pred)

# pull out the ground truth for this row
test_target = target_test[row_number:row_number+1]
# transform to a string
test_target_string = label_encoder_model.inverse_transform(test_target)

                                                    
# print the results and the ground truth
print('Predicted class:', pred_string)
print('Ground truth class:', test_target_string)
print('Accurate prediction?', pred_string == test_target_string)



In [None]:
# Make the prediction

### Step 4: Assess model accuracy with multiple reports of the confusion matrix and accuracy

Tasks:
- Make predictions for all your data. Call the ```.predict()``` method on your classifier with your training data ```feature_matrix_train``` and store the results in a variable called ```target_pred```.
- Use sklearn [metrics.accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to determine your model's accuracy. Detailed instructions:
    - Use your trained model to predict the labels of your test data ```feature_matrix_test```. Run the ```.predict()``` method on your classifier with your test data ```feature_matrix_test``` and store the results in a variable called ```target_pred```.
    - Then calculate the accuracy using ```target_test``` (the true labels/ground truth) and your model's predictions on the test portion ```target_pred``` as inputs. The advantage here is that you see how your model performs on new data it has not seen during the training phase. This fair approach is a simple hold-out validation, the most basic relative of **cross-validation**!
    
- Print out the confusion matrix using [metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
- Use Yellowbrick to visualize the classification report and confusion matrix. (https://www.scikit-yb.org/en/latest/)
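
The metric functions named in the tasks above can be sketched on made-up label arrays as follows. The arrays `y_true` and `y_pred` are assumptions for illustration; in the worksheet they would correspond to the test labels and your model's predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predictions; 0 = dga, 1 = legit
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Fraction of predictions that match the ground truth
accuracy = metrics.accuracy_score(y_true, y_pred)
print(accuracy)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1 in one text summary
print(metrics.classification_report(y_true, y_pred, target_names=["dga", "legit"]))
```

In this toy case six of the eight predictions are correct, so the accuracy is 0.75, and each class has one misclassified row.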

In [None]:
# fair approach: make prediction on test data portion


In [None]:
# Classification Report...neat summary


In [None]:
# Your code here ...

In [None]:
# Your code here...

In [None]:
# Your code here...
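
Step 4 mentions cross-validation. A single train/test split gives only one accuracy estimate, whereas k-fold cross-validation repeats the split several times and averages the results. The sketch below uses synthetic data from `make_classification` as a stand-in for the worksheet features; the sample count and the choice of 5 folds are assumptions.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Synthetic stand-in data with 6 numeric features
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

# Train and score the tree on 5 different train/test folds
cv_scores = model_selection.cross_val_score(
    tree.DecisionTreeClassifier(random_state=0), X_syn, y_syn, cv=5
)
print(cv_scores.mean(), cv_scores.std())
```

The spread of the fold scores gives a rough sense of how much the accuracy estimate depends on the particular split.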

### (Optional) Visualizing your Tree
As an optional step, you can actually visualize your tree.  The following code will generate a graph of your decision tree.  You will need graphviz (http://www.graphviz.org) and pydotplus (or pydot) installed for this to work.
The Griffon VM has this installed already, but if you try this on a Mac or Linux machine you will need to install graphviz.
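
If graphviz is not available, one alternative is scikit-learn's `tree.export_text`, which prints the learned rules as plain text and needs no extra packages. The sketch below uses a tiny made-up dataset; `toy_tree` and its data are assumptions for illustration.

```python
import numpy as np
from sklearn import tree

# Made-up data: [length, digits] per domain; 0 = dga, 1 = legit
X_toy = np.array([[26, 10], [24, 9], [6, 0], [8, 0]])
y_toy = np.array([0, 0, 1, 1])
toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Plain-text rendering of the decision rules
rules = tree.export_text(toy_tree, feature_names=["length", "digits"])
print(rules)
```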

In [None]:
# These libraries are used to visualize the decision tree and require that you have GraphViz
# and pydot or pydotplus installed on your computer.

import io
from IPython.display import Image
import pydotplus as pydot


dot_data = io.StringIO() 
# d_tree_model is the decision tree trained in Step 3
tree.export_graphviz(d_tree_model, out_file=dot_data, 
                     feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'firstDigitIndex','ngrams'],
                    filled=True, rounded=True,  
                    special_characters=True) 

graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

## Other Models
Now that you've built a Decision Tree, let's try out three other classifiers and see how they perform on this data.  For this next exercise, create classifiers using:

* Support Vector Machine
* Random Forest
* K-Nearest Neighbors (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)  

Once you've done that, run the various performance metrics to determine which classifier works best.
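
One way to compare several classifiers is to fit each on the same training split and score each on the same test split. The sketch below does this on synthetic data from `make_classification`; the data, the default hyperparameters, and the dictionary names are assumptions for illustration, not a tuned solution.

```python
from sklearn import model_selection, svm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 6 numeric features
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": svm.SVC(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the same training data and record its test accuracy
model_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    model_scores[name] = model.score(X_te, y_te)

print(model_scores)
```

SVM and K-Nearest Neighbors are sensitive to feature scale, so on real data it may be worth standardizing the features before comparing them with tree-based models.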

In [None]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
#Create the Random Forest Classifier


In [None]:
#Next, create the SVM classifier


In [None]:
#Finally the knn


## Explain a Prediction
In the example below, you can use LIME to explain how a classifier arrived at its prediction.  Try running LIME with the various classifiers you've created and various rows to see how it functions. 

In [None]:
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(feature_matrix_train, 
                                                   feature_names=['length', 'digits', 'entropy', 'vowel-cons', 'firstDigitIndex','ngrams'], 
                                                   class_names=['dga', 'legit'],  # same order as label_encoder_model.classes_
                                                   discretize_continuous=False)

In [None]:
# this code assumes your model is named random_forest_clf. Change that variable to whatever you named your model
exp = explainer.explain_instance(feature_matrix_test.iloc[5], 
                                 random_forest_clf.predict_proba, 
                                 num_features=6)

In [None]:
exp.show_in_notebook(show_table=True, show_all=True)