# 2 - Why do we need Machine Learning in Security?

This notebook presents examples that illustrate why Artificial Intelligence (here, Machine Learning techniques in particular) is used in cyber security.

---

# SPAM DETECTION

We will focus on the same SPAM detection example seen during the theoretical class.
To do this, we will use a real-world dataset of emails (both SPAM and regular emails) and try to implement a model capable of detecting SPAM.

## Preparation

- Download the 2007 TREC Public Spam Corpus from https://plg.uwaterloo.ca/~gvcormac/treccorpus07/ (255MB)
- Read the "Agreement for use"
- Set up the `datasets` directory
- Untar the corpus in the `datasets` directory

<div class="alert alert-block alert-warning">
<b>WARNING</b>:
    
Please be careful about where you place the dataset and where you run the notebook from.
In this notebook I have initialized the variables to work with the same directory structure as the one on GitHub.

Within the folder `session_02` there are: 
- the notebook, 
- the `datasets` folder containing the TREC Public Spam Corpus (which is the `trec07p` folder; remember, you have to extract it!). 

If you have organized your files in a different way you might have to change the value of the variables.
</div>

To double check the current working directory of the notebook, you can run the following cell:

In [None]:
import os
os.getcwd()

## Let's start with the code...

First of all, we have to create some constants for the paths of the folders containing the data

<div class="alert alert-block alert-warning">
<b>WARNING</b>:

The path in the following cell should work on Linux and Mac. If you are on a Windows machine, you might have to modify the path (using a double backslash `\\` instead of a single `/`).
</div>

In [None]:
# Note: these are relative paths. 
# This works because the `datasets` folder is located in the same directory as this notebook.
DATA_DIR = 'datasets/trec07p/data/'
LABELS_FILE = 'datasets/trec07p/full/index'
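
One way to sidestep the path separator issue is to build the paths with `os.path.join`, which inserts the right separator for the current operating system. The following is only a minimal sketch: the `_PORTABLE` names are hypothetical and simply mirror the constants above.

```python
import os

# Hypothetical, OS-independent equivalents of DATA_DIR and LABELS_FILE.
BASE_DIR_PORTABLE = os.path.join('datasets', 'trec07p')
DATA_DIR_PORTABLE = os.path.join(BASE_DIR_PORTABLE, 'data')
LABELS_FILE_PORTABLE = os.path.join(BASE_DIR_PORTABLE, 'full', 'index')

print(DATA_DIR_PORTABLE)
print(LABELS_FILE_PORTABLE)
```

The rest of the notebook already combines directory and file names with `os.path.join`, so paths built this way should behave the same as the string constants.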

Import **nltk** (the "Natural Language Toolkit"), a suite of libraries and programs for natural language processing (NLP) in Python, and download the required packages.

For detailed info, you can have a look at their [website](https://www.nltk.org/).

If you are using Anaconda, nltk is most likely already installed. If it is not, you can install it by running:
```
conda install nltk 
```
If you manage your packages with pip instead of conda, you can install nltk with:
```
pip install nltk
```

In [None]:
# the `import` statement lets you gain access to code in another module
import nltk

In [None]:
nltk.download('words')
nltk.download('stopwords')
nltk.download('punkt')

Define three functions that will be used to manage the email data:
- `flatten_to_string`
- `extract_email_text`
- `load`

For now, you don't really have to look at this code: you can just run it as a black box.
Later, you can come back here and try to understand how the functions work and what they really do.

In [None]:
def flatten_to_string(parts):
    """
    Combine the different parts of the email into a flat list of strings.
    """
    ret = []
    if type(parts) == str:
        ret.append(parts)
    elif type(parts) == list:
        for part in parts:
            ret += flatten_to_string(part)  # Recursion
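    elif parts.get_content_type() == 'text/plain':  # get_content_type is a method, so it must be called
        ret.append(parts.get_payload())  # append the text as one string, not character by character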
    return ret

In [None]:
def extract_email_text(path):
    """
    Extract subject and body text from a single email file.
    """
    # Load a single email from an input file
    with open(path, errors='ignore') as f:
        msg = email.message_from_file(f)
    if not msg:
        return ""

    # Read the email subject
    subject = msg['Subject']
    if not subject:
        subject = ""

    # Read the email body
    body = ' '.join(m for m in flatten_to_string(msg.get_payload()) if type(m) == str)
    if not body:
        body = ""

    return subject + ' ' + body
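
To get a feeling for what these helpers deal with, here is a small sketch that parses a hand-written multipart email with the standard `email` module and keeps only its `text/plain` parts. The message content is made up, and the sketch uses `walk()` rather than the helper defined above, so treat it as an illustration only.

```python
import email

# A made-up multipart message with a plain-text part and an HTML part.
raw_message = (
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Cheap pills here\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Cheap pills here</p>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(raw_message)

# walk() visits the message itself and every nested part.
plain_parts = [
    part.get_payload()
    for part in msg.walk()
    if part.get_content_type() == 'text/plain'
]

print(msg['Subject'])
print(plain_parts)
```

For a multipart message, `get_payload()` returns a list of sub-messages instead of a string, which is why the helper above needs to flatten the parts before joining them.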

In [None]:
def load(path):
    """
    Process a single email file into stemmed tokens.
    """
    email_text = extract_email_text(path)
    if not email_text:
        return []

    # Tokenize the message
    tokens = nltk.word_tokenize(email_text)

    # Remove punctuation from tokens
    tokens = [i.strip("".join(punctuations)) for i in tokens if i not in punctuations]

    # Remove stopwords and stem tokens
    if len(tokens) > 2:
        return [stemmer.stem(w) for w in tokens if w not in stopwords]
    return []
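
The preprocessing steps in `load` (tokenize, strip punctuation, drop stopwords) can be tried on a single sentence without nltk. The sketch below is a simplified stand-in: it splits on whitespace instead of using `nltk.word_tokenize`, skips stemming, and uses a tiny hypothetical stopword set.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration.
TOY_STOPWORDS = {"the", "a", "is", "to", "and"}

def toy_preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = text.lower().split()
    tokens = [t.strip(string.punctuation) for t in tokens]
    # Discard tokens that became empty and tokens that are stopwords.
    return [t for t in tokens if t and t not in TOY_STOPWORDS]

toy_tokens = toy_preprocess("The offer is FREE, click to win!!!")
print(toy_tokens)
```

Only the words that carry some meaning survive, which is the point of the preprocessing step.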


Import other required modules:
- `string`: module containing common string operations
- `email`: module for managing email messages
- `os`: module providing functions to navigate, create, delete and modify files and folders.
- `pickle`: implements binary protocols for serializing and de-serializing a Python object structure (basically, can be used to save variables in memory)

In [None]:
import string
import email
import os
import pickle

Define a list containing punctuation symbols (the conversion to `list` is needed because `string.punctuation` is a single `str`)

In [None]:
punctuations = list(string.punctuation)

Let's see which punctuation symbols are considered:

In [None]:
print(punctuations)

Define the set of stopwords (e.g. "and", "or", etc.)

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))

Let's print them

In [None]:
print(stopwords)

Define a stemmer to be used for preprocessing text

In [None]:
stemmer = nltk.PorterStemmer()

What is a stemmer? Let's try to use it:

In [None]:
stemmer.stem("speaking")

In [None]:
stemmer.stem("speaks")

In [None]:
stemmer.stem("speaker")
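
The idea behind stemming is to map different forms of a word to a common root. The sketch below is a deliberately naive suffix stripper, written only to convey that idea: it is not the Porter algorithm, the suffix list is an assumption, and the real stemmer applies far more careful rules, so its outputs can differ.

```python
def naive_stem(word, suffixes=("ing", "er", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ["speaking", "speaks", "speaker", "is"]:
    print(w, "->", naive_stem(w))
```

Compare these results with those of `stemmer.stem` in the cells above to see where a rule-based stemmer is more conservative.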

Collect the labels (i.e. the **real** categories) of the emails from the dataset. 
- *ham* is mapped to 1 
- *spam* is mapped to 0

In [None]:
labels = {}
with open(LABELS_FILE) as f:
    for line in f:
        line = line.strip()
        label, key = line.split()
        labels[key.split('/')[-1]] = 1 if label.lower() == 'ham' else 0
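
To see what the parsing loop does, it can be run on a few hand-written lines. The sample lines below assume that each line of the index file has the form `<label> <relative path>`; they are made up for illustration and not copied from the corpus.

```python
# Assumed format of the index file: "<label> <relative path to the email>".
sample_index = [
    "spam ../data/inmail.1",
    "ham ../data/inmail.2",
    "spam ../data/inmail.3",
]

toy_labels = {}
for line in sample_index:
    label, key = line.strip().split()
    # Keep only the file name as key; as in the cell above, ham -> 1 and spam -> 0.
    toy_labels[key.split('/')[-1]] = 1 if label.lower() == 'ham' else 0

print(toy_labels)
print(len(toy_labels), set(toy_labels.values()))
```

The last line shows one way to count the key-value pairs and the distinct values of a dictionary.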

Let's check the type of the variable `labels`:

In [None]:
type(labels)

<div class="alert alert-block alert-danger">
Q: How many key-value pairs are in the dictionary?
</div>

<div class="alert alert-block alert-danger">
Q: How many distinct values are in the dictionary?
</div>

<div class="alert alert-block alert-danger">
Q: How many distinct keys are in the dictionary?
</div>

Let's split the corpus into a training set and a test set:

In [None]:
filelist = os.listdir(DATA_DIR)

TRAINING_SET_RATIO = 0.7
X_train = filelist[:int(len(filelist)*TRAINING_SET_RATIO)]
X_test = filelist[int(len(filelist)*TRAINING_SET_RATIO):]
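
Note that `os.listdir` returns entries in arbitrary order, so the slices above depend on whatever order the file system happens to provide. A possible variation, sketched here on a made-up list of file names, is to shuffle with a fixed seed before slicing so that the split is reproducible.

```python
import random

# Made-up file names standing in for the content of the data directory.
toy_files = ['inmail.%d' % i for i in range(1, 11)]

# Shuffle a copy with a fixed seed, then slice with the same 70/30 ratio.
rng = random.Random(0)
shuffled = toy_files[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.7)
toy_train, toy_test = shuffled[:cut], shuffled[cut:]

print(len(toy_train), len(toy_test))
```

Every file ends up in exactly one of the two lists, so the test set contains only emails that were never used for training.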

<div class="alert alert-block alert-danger">
Q: Why do we split the data?
</div>

<div class="alert alert-block alert-danger">
Q: How many elements are in `X_train`?
</div>

<div class="alert alert-block alert-danger">
Q: What is the type of the elements?
</div>

---

## First approach: spam detection with blacklisted words

The first model that we're going to implement is a very simple one.
Given a set of blacklisted words, a new email is classified as *not spam* only if it does not contain any blacklisted word. 
Otherwise, it is classified as SPAM.

The first thing to do is to *train* the model: basically, we have to tell the system which words appear in *spam* and which appear in *ham*.

In [None]:
# this cell might take a while, the first time you run it

spam_words = set()
ham_words = set()

if not os.path.exists('blacklist.pkl'):  # os.path.exists returns True if the file already exists
    for filename in X_train:  # note that we are using the train set
        path = os.path.join(DATA_DIR, filename)
        if filename in labels:
            label = labels[filename]
            stems = load(path)
            if not stems:
                continue
            if label == 1:
                ham_words.update(stems)
            elif label == 0:
                spam_words.update(stems)
            else:
                continue
    blacklist = spam_words - ham_words
    with open('blacklist.pkl', 'wb') as f:  # the context manager closes the file after writing
        pickle.dump(blacklist, f)
else:
    with open('blacklist.pkl', 'rb') as f:
        blacklist = pickle.load(f)

print('Blacklist successfully built/loaded')
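
The logic of the cell above can be reproduced on a handful of made-up emails that are already reduced to stems. This is a toy sketch: the word lists and the `is_spam` helper are hypothetical and only illustrate the set operations involved.

```python
# Made-up, already preprocessed emails.
toy_spam = [["cheap", "pill", "offer"], ["win", "offer", "now"]]
toy_ham = [["meet", "now"], ["project", "offer"]]

toy_spam_words = set().union(*toy_spam)
toy_ham_words = set().union(*toy_ham)

# Blacklisted words are those seen in spam but never in ham.
toy_blacklist = toy_spam_words - toy_ham_words

def is_spam(stems):
    """Flag an email as spam if it shares at least one word with the blacklist."""
    return bool(set(stems) & toy_blacklist)

print(sorted(toy_blacklist))
print(is_spam(["win", "prize"]), is_spam(["project", "meet"]))
```

Words such as "offer" and "now" appear in both classes, so the set difference removes them from the blacklist.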

Let's see some of them...

In [None]:
blacklist

<div class="alert alert-block alert-danger">
Q: How many elements in the set?
</div>

<div class="alert alert-block alert-danger">
Q: Is "spam" in the set?
</div>

<div class="alert alert-block alert-danger">
Q: Is "fibonacci" in the set?
</div>

<div class="alert alert-block alert-danger">
Q: How long is the longest word in the set?
</div>

<div class="alert alert-block alert-danger">
Q: Which word is it?
</div>

But these are not really "words"... Let's look only at the actual words

In [None]:
from nltk.corpus import words
word_set = set(words.words())
word_blacklist = word_set.intersection(blacklist)

In [None]:
word_blacklist

<div class="alert alert-block alert-danger">
Q: How many elements in the set?
</div>

<div class="alert alert-block alert-danger">
Q: How long is the longest word in the set?
</div>

<div class="alert alert-block alert-danger">
Q: Which word is it?
</div>

## Let's try and use this model

#### Metrics for classification

These are common metrics used to evaluate a classification model, and will be discussed later in the course.

- Building blocks for the metrics
    - TP = True Positive
    - TN = True Negative
    - FP = False Positive
    - FN = False Negative


- The actual metrics:
    - $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
    - $\text{precision} = \frac{TP}{TP + FP}$ (a.k.a. positive predictive value)
    - $\text{recall} = \frac{TP}{TP + FN}$ (a.k.a. sensitivity, hit rate, true positive rate)
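
As a worked example of the formulas above, the following sketch computes the three metrics from hypothetical counts. The numbers are made up, and returning 0.0 when a denominator is zero is a choice made for this illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from the four basic counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts: 30 TP, 50 TN, 10 FP, 10 FN.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print(metrics)
```

With these counts, accuracy is 80/100 = 0.8, while precision and recall are both 30/40 = 0.75.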

Let's run the model on the test set.

In [None]:
# this cell might take some time
fp = 0
tp = 0
fn = 0
tn = 0

for filename in X_test:
    path = os.path.join(DATA_DIR, filename)
    if filename in labels:
        label = labels[filename]
        stems = load(path)
        if not stems:
            continue
        stems_set = set(stems)
        if stems_set & blacklist:  # INTERSECTION BETWEEN SETS. Checks whether the intersection is not empty
            if label == 1:
                fp = fp + 1
            else:
                tp = tp + 1
        else:
            if label == 1:
                tn = tn + 1
            else:
                fn = fn + 1
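
In the loop above, spam is the positive class, and a label equal to 1 means ham. The same bookkeeping can be followed on a few made-up examples, where `toy_flagged` plays the role of "the email contains a blacklisted word".

```python
# Made-up ground truth (0 = spam, 1 = ham) and blacklist decisions.
toy_truth = [0, 0, 1, 1, 1]
toy_flagged = [True, False, True, False, False]

counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
for label, flagged in zip(toy_truth, toy_flagged):
    if flagged:
        # Flagged as spam: wrong if the email is ham, right if it is spam.
        counts["fp" if label == 1 else "tp"] += 1
    else:
        # Not flagged: right if the email is ham, wrong if it is spam.
        counts["tn" if label == 1 else "fn"] += 1

print(counts)
```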

In [None]:
print("TN %d" % tn)
print("FP %d" % fp)
print("FN %d" % fn)
print("TP %d" % tp)

In [None]:
print("Confusion matrix:\n")
print("| TN %5d | FP %5d |" % (tn, fp))
print("| FN %5d | TP %5d |" % (fn, tp))

In [None]:
count = tp + fp + tn + fn

print("Accuracy: %.5f" % ((tp+tn)/count))
print("Precision: %.5f" % (tp/(tp+fp)))
print("Recall: %.5f" % (tp/(tp+fn)))

---

## Logistic Regression

Let's now try a different model: logistic regression.

In [None]:
def read_email_files():
    X = []
    y = [] 
    for i in range(len(labels)):
        filename = 'inmail.' + str(i+1)
        email_str = extract_email_text(
            os.path.join(DATA_DIR, filename))
        X.append(email_str)
        y.append(labels[filename])
    return X, y

"Read" the emails

In [None]:
X, y = read_email_files()

Split into a training set and a test set

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test, idx_train, idx_test = \
    train_test_split(X, y, range(len(y)), train_size=TRAINING_SET_RATIO, random_state=2)


<div class="alert alert-block alert-danger">
Q: What does the input data (i.e. X) look like? (types, content, ...)
</div>

<div class="alert alert-block alert-danger">
Q: And the target label?
</div>

As input, we have strings (emails). We **have** to convert them into numbers.

<div class="alert alert-block alert-danger">
Q: Any ideas on how to convert strings into numbers? (hint: think about the stemmer we have seen...)
</div>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vector = vectorizer.fit_transform(X_train)
X_test_vector = vectorizer.transform(X_test)
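
Conceptually, a bag-of-words vectorizer builds a vocabulary from the training documents and then represents each document as a vector of word counts. The following hand-rolled sketch shows the idea on a made-up corpus; it is only an illustration, since `CountVectorizer` tokenizes the text differently and returns a sparse matrix.

```python
from collections import Counter

# Made-up corpus of three tiny "emails".
docs = ["free offer free", "meeting at noon", "free meeting"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a list of counts, one per vocabulary word.
vectors = []
for doc in docs:
    word_counts = Counter(doc.split())
    vectors.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each column corresponds to one vocabulary word, so every document is mapped to a vector of the same length regardless of how long the text is.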

<div class="alert alert-block alert-danger">
Q: Let's look at the input data now. How is it different from before?
</div>

Let's define and train the model!

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train_vector, y_train)
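
Conceptually, a trained logistic regression model computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability for class 1. The sketch below illustrates this with made-up weights and bias; they are not the values learned by `clf`.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def toy_predict_proba(weights, bias, x):
    """Probability of class 1 for the feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical weights for two word-count features.
toy_weights = [1.5, -2.0]
toy_bias = 0.0

p_first = toy_predict_proba(toy_weights, toy_bias, [2, 0])
p_second = toy_predict_proba(toy_weights, toy_bias, [0, 2])
print(p_first, p_second)
```

A probability above 0.5 leads to predicting class 1, otherwise class 0. Training consists of finding the weights and the bias that best fit the training data.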

And check its predictions

In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score
)

y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

You can play with the hyperparameters of the model, looking for the best configuration.

The complete list of parameters can be found in the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
# `multi_class` is omitted here: the task is binary, and recent scikit-learn releases deprecate that parameter
clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=250)
clf.fit(X_train_vector, y_train)

y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

---

## Decision Tree Classifier

Let's now try a different type of classifier: a decision tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Define the model
clf = DecisionTreeClassifier()

# Train the model (this might take quite some time)
clf.fit(X_train_vector, y_train)

In [None]:
y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

Let's try different parameters!

In [None]:
# Define the model
clf = DecisionTreeClassifier(max_leaf_nodes=2)

# Train the model (this might take quite some time)
clf.fit(X_train_vector, y_train)
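
A tree with only two leaves contains a single split: it looks at one feature, compares it with a threshold, and returns one of two labels. Such a model is often called a decision stump. The sketch below shows the idea in plain Python; the feature index, threshold and labels are made up and not those learned by `clf`.

```python
def stump_predict(x, feature_index, threshold, left_label, right_label):
    """Predict with a single split on one feature."""
    if x[feature_index] <= threshold:
        return left_label
    return right_label

# Hypothetical stump: look at feature 1 and split at 0.5.
toy_samples = [[3, 0, 1], [0, 2, 0], [1, 1, 4]]
toy_predictions = [
    stump_predict(x, feature_index=1, threshold=0.5, left_label=1, right_label=0)
    for x in toy_samples
]
print(toy_predictions)
```

With word counts as features, this amounts to classifying every email based on how often a single word occurs, which is a very coarse rule.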

In [None]:
y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

That's not very good... Let's check how it performed on the training data

In [None]:
y_pred = clf.predict(X_train_vector)

print("Accuracy:  %.5f" % accuracy_score(y_train, y_pred))
print("Precision: %.5f" % precision_score(y_train, y_pred))
print("Recall:    %.5f" % recall_score(y_train, y_pred))

- if the performance is poor both on the training set and the test set, it is a case of **UNDERFITTING**.
- if the performance is poor on the test set but good on the training set, it is a case of **OVERFITTING**. (Basically, the model is not able to generalize).
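
The two rules above can be written as a small helper. This is only a rough sketch: the `good` threshold that separates "poor" from "good" performance is a hypothetical value, and what counts as good depends on the problem.

```python
def diagnose(train_score, test_score, good=0.9):
    """Apply the two rules of thumb to a pair of scores."""
    if train_score < good and test_score < good:
        return "underfitting"
    if train_score >= good and test_score < good:
        return "overfitting"
    return "no obvious problem"

print(diagnose(0.70, 0.68))
print(diagnose(0.99, 0.75))
print(diagnose(0.97, 0.96))
```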

---

### Challenge for those who finish the notebook early.

1. Have a look at the parameters of the `CountVectorizer` class that was used to generate the dataset for training the Logistic Regression classifier ([here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is the documentation);
2. Change the parameters of the vectorizer to:
    1. use an English stop word list (i.e. `stop_words='english'`), or
    2. set the vocabulary size to have a maximum of 1000 words (hint: use `max_features`), or
    3. set the minimum frequency of a vocabulary word across the documents to be 10 (hint: use `min_df`)
3. For each of the above, create a new train/test dataset and train and evaluate the Logistic Regression classifier on it.

Which one performs the best? Can you explain why? 
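
Before changing the parameters, it may help to see what a minimum document frequency and a maximum vocabulary size do to a vocabulary. The sketch below reproduces both ideas by hand on a made-up corpus; it does not use `CountVectorizer`, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up corpus of three tiny "emails".
toy_docs = ["free free free offer", "free meeting offer", "meeting noon offer"]
tokenized = [doc.split() for doc in toy_docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
# Term frequency: how many times does each word appear in the whole corpus?
term_freq = Counter(word for tokens in tokenized for word in tokens)

# Keep only words that appear in at least 2 documents (the idea behind `min_df`).
min_df_vocab = sorted(w for w, df in doc_freq.items() if df >= 2)
# Keep only the 2 most frequent words (the idea behind `max_features`).
max_features_vocab = sorted(w for w, _ in term_freq.most_common(2))

print(min_df_vocab)
print(max_features_vocab)
```

Both options shrink the vocabulary, and therefore the number of features the classifier has to learn from.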

In [None]:
vectorizer =  # TODO
X_train_vector =  # TODO
X_test_vector =  # TODO

In [None]:
clf = LogisticRegression()
clf.fit(X_train_vector, y_train)
y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

---