# Lab 4, Exercise 3

In [1]:
import numpy as np
import sys
import os

## Load data 

The data is separated into three folders: Attack_Data_Master, Training_Data_Master, and Validation_Data_Master.
These can be found here:
data/exercise3/Training_Data_Master
data/exercise3/Validation_Data_Master
data/exercise3/Attack_Data_Master

All of the data in Training_Data_Master and Validation_Data_Master is normal, 
and all of the data in Attack_Data_Master is malicious.

For the purpose of this exercise, ignore the predefined training/validation split and simply use Training_Data_Master
and Validation_Data_Master as a single pool of normal data.

As mentioned, each system call trace is stored as a single file.  Treat each system call trace as a separate datapoint.

In [2]:
# Load all the normal system call traces (i.e., everything in Training_Data_Master and Validation_Data_Master)

# CODE HERE
normal_data = []
paths = ['data/exercise3/Training_Data_Master', 'data/exercise3/Validation_Data_Master']
for p in paths:
    files = [os.path.join(p, f) for f in os.listdir(p)]
    for f in files:
        # Skip anything that is not a trace file
        if not f.endswith('.txt'):
            continue

        with open(f, 'r') as data_file:
            # Each file holds one trace; strip trailing whitespace
            normal_data.append(data_file.read().rstrip())

# Load all the malicious system call traces (i.e., everything in Attack_Data_Master)

# CODE HERE
mal_data = []
path = 'data/exercise3/Attack_Data_Master'
# Attack traces are stored one level down, in subdirectories of Attack_Data_Master
dirs = [os.path.join(path, d) for d in os.listdir(path)]
for d in dirs:
    files = [os.path.join(d, f) for f in os.listdir(d)]
    for f in files:
        # Skip anything that is not a trace file
        if not f.endswith('.txt'):
            continue

        with open(f, 'r') as data_file:
            mal_data.append(data_file.read().rstrip())

# Hint: A useful way to load this is as one or two Python lists, where each entry in the list corresponds to the text string
#       of system calls ids; feel free to use a single list for all the data, or separate lists for malicious versus normal
#       data

## Feature extraction

Tokenize and create a dataset where each datapoint corresponds to (normalized) counts of 
system call n-grams. Try various sizes of ngrams.

Reminder: A sequence of system call IDs that looks like this:
'6 6 63 6 42'

contains the following 3-grams:
'6 6 63'
'6 63 6'
'63 6 42'

Note: There are a number of ways you could code this up, but if you loaded the data
as lists of strings, you could consider using some of the feature extraction methods in 
sklearn.feature_extraction.text
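
As a minimal sketch of the n-gram definition in the reminder above, the helper below splits a trace on whitespace and slides a window of size n over the tokens. The name `ngrams` is illustrative and is not part of the lab code.

```python
def ngrams(trace, n):
    """Return the list of n-grams in a whitespace-separated trace string."""
    tokens = trace.split()
    # Slide a window of length n across the token list.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams('6 6 63 6 42', 3))
# ['6 6 63', '6 63 6', '63 6 42']
```

A trace shorter than n yields an empty list, so very short traces contribute no features at larger n.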

In [56]:
# Look at the classdemo notebook for an example of doing this
# CODE HERE
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer(analyzer='word', ngram_range=(3,3))
tf_transformer = TfidfTransformer(use_idf=False)

raw_train_counts = count_vect.fit_transform(normal_data + mal_data)
all_data = tf_transformer.fit_transform(raw_train_counts)

all_labels = [0]*len(normal_data) + [1]*len(mal_data)
all_labels = np.asarray(all_labels)
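
One detail worth checking in the cell above: with `analyzer='word'`, `CountVectorizer` tokenizes using its default `token_pattern`, `(?u)\b\w\w+\b`, which only matches tokens of two or more characters. Single-digit system call IDs such as `6` would therefore be dropped before the n-grams are formed. The sketch below reproduces that regex with Python's `re` module (it does not call scikit-learn). The alternative pattern is a suggestion, not what produced the results reported in this notebook.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # CountVectorizer's default token_pattern
KEEP_ALL_PATTERN = r"(?u)\b\w+\b"    # assumption: a pattern that also keeps 1-character tokens

trace = '6 6 63 6 42'

dropped = re.findall(DEFAULT_PATTERN, trace)
kept = re.findall(KEEP_ALL_PATTERN, trace)

print(dropped)  # ['63', '42']
print(kept)     # ['6', '6', '63', '6', '42']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` would keep every ID. Doing so changes the features, so the metrics would likely differ from the ones printed in this notebook.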

## Create train/test split

In [57]:
# Use 50% of the data for the training set and the rest for the test set
# CODE HERE
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(all_data, all_labels, test_size=0.5)

## Train a classifier

In [58]:
# Please use Logistic Regression for this exercise
# Feel free to experiment with the various hyperparameters available to you in sklearn
# CODE HERE
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression().fit(x_train, y_train)

# from sklearn.linear_model import SGDClassifier
# classifier = SGDClassifier(loss='log_loss', penalty='none', random_state=0).fit(x_train, y_train)

## Inference and results

In [59]:
# Run inference on the test data and predict labels for each data point in the test data
# CODE HERE
# Predicted labels for the test set
y_pred = classifier.predict(x_test)

# Calculate and print the following metrics: precision, recall, f1-measure, and accuracy
# CODE HERE
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96      2607
           1       0.78      0.62      0.69       369

    accuracy                           0.93      2976
   macro avg       0.86      0.80      0.83      2976
weighted avg       0.93      0.93      0.93      2976
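
For reference, the per-class figures in a report like the one above follow from a handful of counts. The sketch below computes them in plain Python for one class treated as positive. `binary_metrics` and the toy labels are illustrative and are not taken from the lab data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, f1 and accuracy with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}


toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]
print(binary_metrics(toy_true, toy_pred))
```

Calling the function with `positive=0` gives the row for the normal class instead. Accuracy is the same in both cases because it does not depend on which class is treated as positive.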



# Part 2: Varying class priors

Create several new test datasets where you have randomly subsampled the number of 
attack datapoints.

In particular, create the following datasets:
- 10 datasets where 25% of the attack datapoints are removed from the original test set
- 10 datasets where 50% of the attack datapoints are removed from the original test set
- 10 datasets where 75% of the attack datapoints are removed from the original test set
- 10 datasets where 90% of the attack datapoints are removed from the original test set
- 10 datasets where 95% of the attack datapoints are removed from the original test set

Report five sets of precision, recall, f1-measure, and accuracy corresponding to the following:
- Average precision, recall, f1-measure, accuracy for datasets where 25% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 50% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 75% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 90% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 95% of attack datapoints removed

Note: You will use the same model trained in part 1 for all of these datasets.  
All you are varying is the class priors during the inference stage.
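
The subsampling described above can be sketched with the standard library alone. The helper below keeps every normal index and a random subset of the attack indices. The returned indices could then be used to select rows of the test features and labels. `drop_attack_fraction` is a hypothetical name, and this is one possible approach rather than the required one.

```python
import random


def drop_attack_fraction(labels, fraction, seed=None):
    """Return sorted indices that keep all label-0 points and drop
    `fraction` of the label-1 points at random."""
    rng = random.Random(seed)
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    attack_idx = [i for i, y in enumerate(labels) if y == 1]

    # Number of attack points to keep, rounded down.
    n_keep = int((1 - fraction) * len(attack_idx))
    kept_attacks = rng.sample(attack_idx, n_keep)
    return sorted(normal_idx + kept_attacks)


toy_labels = [0] * 20 + [1] * 8
kept = drop_attack_fraction(toy_labels, 0.75, seed=0)
print(len(kept))  # 22: all 20 normal points plus 2 of the 8 attack points
```

Using a different seed (or `seed=None`) for each of the 10 repetitions gives 10 different random subsets at the same removal fraction.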

In [62]:
# Create subsets of the test set by randomly discarding X% of points with label +1
# CODE HERE
from imblearn.datasets import make_imbalance
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from pprint import pprint
import pandas as pd

percents = [0.25, .5, .75, .9, .95]

for p in percents:
    class_rep = {'precision': 0, 'recall': 0, 'f1': 0, 'accuracy': 0}
    for i in range(10):
        sample_strat = {0: len(y_test) - sum(y_test), 1: int((1 - p)*sum(y_test))}
        imbal_x, imbal_y = make_imbalance(x_test, y_test, sampling_strategy=sample_strat)

        y_pred = classifier.predict(imbal_x)
        # average=None returns one score per class (index 0: normal, index 1: attack)
        class_rep['precision'] += precision_score(imbal_y, y_pred, average=None)
        class_rep['recall'] += recall_score(imbal_y, y_pred, average=None)
        class_rep['f1'] += f1_score(imbal_y, y_pred, average=None)
        class_rep['accuracy'] += accuracy_score(imbal_y, y_pred)

    # p is a fraction, so format it as a percentage (0.25 -> 25%)
    print(f'Dataset with {p:.0%} of attackers removed')
    # pprint({k: class_rep[k] / 10 for k in class_rep.keys()})
    # Average each metric over the 10 subsampled datasets
    results_df = pd.DataFrame({k: class_rep[k] / 10 for k in class_rep.keys()})
    print(results_df)
    print('')



Dataset with 25% of attackers removed
   precision    recall        f1  accuracy
0   0.960349  0.975451  0.967840  0.941381
1   0.727613  0.619565  0.669228  0.941381

Dataset with 50% of attackers removed
   precision    recall        f1  accuracy
0   0.973063  0.975451  0.974255  0.951845
1   0.639545  0.617391  0.628232  0.951845

Dataset with 75% of attackers removed
   precision    recall        f1  accuracy
0   0.986884  0.975451  0.981134  0.963764
1   0.475909  0.632609  0.543093  0.963764

Dataset with 90% of attackers removed
   precision    recall        f1  accuracy
0   0.994565  0.975451  0.984915  0.970526
1   0.256111  0.613889  0.361376  0.970526

Dataset with 95% of attackers removed
   precision    recall        f1  accuracy
0   0.997725  0.975451  0.986462   0.97341
1   0.159621  0.677778  0.258366   0.97341



# Questions

1) In Part 1, what size of ngrams gives the best performance? What are the tradeoffs as you change the size?

The best-performing n-gram size is n = 3. The metrics for normal data change little with n, apart from small increases in precision and f1 score, whereas every metric for attack data increases from n = 1 to n = 3 and then decreases for larger n. The tradeoff is cost: as n increases, generating the n-grams and training the model both take longer.

2) In Part 1, how does performance change if we use simple counts as features (i.e., 1-grams) as opposed to counts of 2-grams? What does this tell you about the role of sequences in prediction for this dataset?

Using 1-grams

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      2596
           1       0.68      0.25      0.37       380

    accuracy                           0.89      2976

Using 2-grams

              precision    recall  f1-score   support

           0       0.94      0.98      0.96      2599
           1       0.80      0.56      0.66       377

    accuracy                           0.93      2976

From the above results, nearly every metric improves when moving from 1-grams to 2-grams. The metrics for normal data are largely unchanged (precision rises from 0.90 to 0.94 and recall stays at 0.98), but training and testing with 2-grams markedly improves the model's predictions for malicious traces: attack recall rises from 0.25 to 0.56 and f1 score from 0.37 to 0.66.

Sequences play an important role in this dataset, since it is difficult to tell from a single system call whether a trace is normal or malicious. Counting short sequences of calls lets the model distinguish a call that is merely unusual in isolation from one that appears alongside other calls in a pattern typical of malicious traces.

3) How does performance change as a function of class prior in Part 2?

For normal data, the metrics remain mostly the same, with slight increases in precision and f1 score. This is expected: with fewer attack datapoints in the test set, fewer of them can be misclassified as normal, which raises precision for the normal class and, by extension, its f1 score. Recall for the normal class is identical across all five settings (0.975451) because the normal datapoints are never subsampled.

For malicious data, precision decreases steadily, from about 0.73 with 25% of attack datapoints removed to about 0.16 with 95% removed. The number of normal datapoints misclassified as attacks stays fixed while the number of true positives shrinks with each test dataset, so false positives make up a growing share of the predicted attacks. Recall remains mostly unaffected (between roughly 0.61 and 0.68), with some variation between test datasets caused by the random subsampling. Overall, the more attack datapoints are removed from the test set, the worse the precision and f1 score for the attacker class become, even though the model itself is unchanged.

Accuracy nevertheless improves, from about 0.94 to about 0.97, as attack datapoints are removed. Misclassified attacks become a smaller share of the test set, so accuracy is increasingly dominated by the normal class, which the model classifies correctly about 97.5% of the time.
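
The trend in Part 2 can be approximated with a back-of-the-envelope calculation. The sketch below assumes the model's per-class recall stays fixed while the number of attack datapoints shrinks. The function name and the rounded inputs (recall of about 0.62 for attacks and 0.975 for normal data, with 2607 normal and 369 attack test datapoints, read off the reports above) are illustrative, and the measured averages vary with the random subsampling.

```python
def expected_scores(n_normal, n_attack, normal_recall, attack_recall):
    """Expected attack-class precision and overall accuracy for a test set
    with the given class counts, assuming fixed per-class recall."""
    true_pos = attack_recall * n_attack    # attacks correctly flagged
    true_neg = normal_recall * n_normal    # normal traces correctly passed
    false_pos = n_normal - true_neg        # normal traces flagged as attacks

    precision = true_pos / (true_pos + false_pos)
    accuracy = (true_pos + true_neg) / (n_normal + n_attack)
    return precision, accuracy


for removed in [0.25, 0.5, 0.75, 0.9, 0.95]:
    # Same rounding as the subsampling code: keep int((1 - p) * n) attacks
    n_attack = int((1 - removed) * 369)
    precision, accuracy = expected_scores(2607, n_attack, 0.975, 0.62)
    print(f'{removed:.0%} removed: precision={precision:.2f}, accuracy={accuracy:.3f}')
```

In this approximation the false-positive count depends only on the normal data, so it is the same at every removal level while the true-positive count falls.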
