# Comparison of various classifier types for detecting DNS exfiltration

This notebook demonstrates the performance of different machine learning models
for the task of classifying DNS requests as either legitimate or exfiltration.

Before running the notebook, download the CSV files with the data,
available [here](https://data.mendeley.com/datasets/c4n7fckkz3).

See the docstrings in `exfiltration_classifier.py` and `exfiltration_dataset.py`
for details on the usage of the `ExfiltrationClassifier` and `ExfiltrationDataset`
classes.

In [4]:
import numpy as np
from pathlib import Path
from exfiltration_dataset import ExfiltrationDataset
from exfiltration_classifier import ExfiltrationClassifier
import pprint

Define the paths to the CSV files here:

In [5]:
root = Path('..')
csv_path_modified = root / 'data' / 'dataset_modified.csv'
csv_path = root / 'data' / 'dataset.csv'

Choose the type of features to train the classifiers on:
* `'individual'` features are derived from single requests
* `'aggregated'` features are derived from sequences of 10 consecutive requests with the same user id and domain
* `'all'` means use both individual and aggregated features
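
To illustrate what "aggregated" could mean in practice, here is a minimal sketch that groups requests by user id and domain and emits one value per window of 10 requests. The tuple layout, the non-overlapping windows, and the `mean_query_length` feature are assumptions made for illustration. They are not taken from `exfiltration_dataset.py`, which may compute different features.

```python
from collections import defaultdict

WINDOW = 10  # sequence length stated in the text above


def windows_by_key(requests, window=WINDOW):
    """Yield ((user_id, domain), queries) for every full window of requests.

    `requests` is an iterable of hypothetical (user_id, domain, query) tuples.
    Windows are non-overlapping here, which is an assumption.
    """
    buffers = defaultdict(list)
    for user_id, domain, query in requests:
        buf = buffers[(user_id, domain)]
        buf.append(query)
        if len(buf) == window:
            yield (user_id, domain), list(buf)
            buf.clear()


def mean_query_length(queries):
    """Hypothetical aggregated feature: average query length in a window."""
    return sum(len(q) for q in queries) / len(queries)


# Ten requests from one user and domain form a full window;
# three requests from another user do not.
requests = [('u1', 'example.com', 'a' * (i + 1) + '.example.com') for i in range(10)]
requests += [('u2', 'example.org', 'x.example.org')] * 3

result = [(key, mean_query_length(qs)) for key, qs in windows_by_key(requests)]
print(result)
```

Incomplete windows are simply dropped in this sketch; a real pipeline would have to decide how to handle them.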

In [6]:
feature_type = 'all' 

We print the following metrics, evaluated on the original examples,
the modified exfiltrations, and both combined:
* accuracy, the fraction of correctly classified examples
* F1-score
    * for legitimate examples (class `False`)
    * for exfiltrations (class `True`)
    * macro average of the two

The modified exfiltrations all belong to class `True`, so only accuracy is
reported for them.
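
As a reference for how these numbers relate to each other, the following sketch computes them from scratch for a toy set of labels. The helper names `f1_for_class` and `summarize` are made up for this example. The dictionary layout mimics the printed reports, but the actual implementation of `ExfiltrationClassifier.evaluate` may differ.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, y_pred):
    """Accuracy, per-class F1 and their unweighted (macro) average."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1_false = f1_for_class(y_true, y_pred, False)
    f1_true = f1_for_class(y_true, y_pred, True)
    return {
        'accuracy': accuracy,
        'f1-score': {
            'False': f1_false,
            'True': f1_true,
            'macro avg': (f1_false + f1_true) / 2,
        },
    }


# Toy labels: three legitimate requests and two exfiltrations
y_true = [False, False, False, True, True]
y_pred = [False, False, True, True, False]
report = summarize(y_true, y_pred)
print(report)
```

Because the macro average weights both classes equally, a low F1-score on the rare exfiltration class pulls it down even when accuracy is high.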

In [7]:
def evaluate(clf, pp):
    report = clf.evaluate('test', 'unmodified')
    print('Evaluation on original examples:')
    pp.pprint(report)
    report = clf.evaluate('test', 'modified')
    print('Evaluation on modified exfiltrations:')
    pp.pprint(report)
    report = clf.evaluate('test', 'both')
    print('Evaluation on all examples:')
    pp.pprint(report)

The CSV file with the original examples contains more than 30M rows, so
training the classifiers on the entire dataset could take a long time.
You can specify the size of the train, validation and test sets for both
the original examples and the modified exfiltrations. For example, specifying
`train_size=0.1`, `val_size=0.1` and `test_size=0.1` means each of the splits
will have around 3M examples. The splits are stratified so that each set has
approximately the same ratio of legitimate requests to DNS exfiltrations.
Fix the `random_state` parameter for reproducibility.
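
The idea behind a stratified split can be sketched in a few lines: shuffle the indices of each class separately with a seeded generator, then take the requested fraction from each class. The function `stratified_split` below is a hypothetical stand-in and not the logic of `ExfiltrationDataset.create_splits`, which is not shown in this notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size, test_size, random_state=0):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(random_state)  # seeded for reproducibility
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_size * len(indices))
        n_test = round(test_size * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:n_train + n_test])
    return train, test


# Toy labels: 10% exfiltrations (True), 90% legitimate requests (False)
labels = [True] * 100 + [False] * 900
train_idx, test_idx = stratified_split(labels, train_size=0.1, test_size=0.1)

print(len(train_idx), sum(labels[i] for i in train_idx))
print(len(test_idx), sum(labels[i] for i in test_idx))
```

Calling the function again with the same `random_state` yields the same indices, which is the point of fixing the seed.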

Note that the CSV files are read one line at a time, which can take a while.

In [9]:
# Get objects that assign csv rows across splits
splitter, splitter_modified = ExfiltrationDataset.create_splits(
    csv_path, csv_path_modified, val_size=0, test_size=0.01,
    val_size_modified=0, test_size_modified=0.6,
    train_size=0.01, train_size_modified=0.4,
    random_state=0
)

# Make datasets
ds = {}
for split in ['train', 'test']:
    ds[split] = ExfiltrationDataset(
        feature_type, csv_path, csv_path_modified, 
        split, splitter, splitter_modified, verbose=True
    )
    print(f'{split}: X.shape = {ds[split].X.shape}, y.shape={ds[split].y.shape}')

# Standardize the features; the test split reuses the scaler from the train split
ds['train'].standardize()
ds['test'].standardize(ds['train'].scaler)

pp = pprint.PrettyPrinter()

Reading original examples:
Reading row   350000 of   350741, 99.79% completed
Reading features of modified examples:
train: X.shape = (367661, 17), y.shape=(367661,)
Reading original examples:
Reading row   350000 of   350741, 99.79% completed
Reading features of modified examples:
test: X.shape = (376121, 17), y.shape=(376121,)


Place the types of classifiers you wish to train in the `classifier_types` list.
Allowed options are:
* `'logistic regression'`
* `'svm'` (support vector machine with a Gaussian kernel)
* `'sgd'` (support vector machine with a linear kernel trained with stochastic gradient descent)
* `'naive Bayes'`
* `'decision tree'`
* `'random forest'`
* `'extra trees'`
* `'adaboost'`
* `'hgb'` (histogram-based gradient boosting)
* `'mlp'` (multi-layer perceptron)
* `'xgb'` (XGBoost)

To train only on original examples, and not on exfiltrations obtained with
the modified exfiltrator, pass `include_modified=False` to the `train` method.

In [10]:
# Instantiate classifier
classifier_types = ['naive Bayes']
for clf_type in classifier_types:
    clf = ExfiltrationClassifier(
        clf_type, ds, verbose=False
    )
    # Set include_modified=False to train only on unmodified examples
    clf.train(include_modified=True)

    # Evaluate
    print(clf_type)
    evaluate(clf, pp)

naive Bayes
Evaluation on original examples:
{'accuracy': 0.9734533459162173,
 'f1-score': {'False': 0.9864872905646138,
              'True': 0.2507443469864006,
              'macro avg': 0.6186158187755072}}
Evaluation on modified exfiltrations:
{'accuracy': 0.9630023640661939}
Evaluation on all examples:
{'accuracy': 0.9727481315853143,
 'f1-score': {'False': 0.9851447983605655,
              'True': 0.8353360750546203,
              'macro avg': 0.9102404367075929}}
