# 2 - Why Use Machine Learning in Security?

This notebook walks through a few examples that illustrate why Artificial Intelligence, and Machine Learning techniques in particular, came to be used in cyber security.

---

## SPAM DETECTION

### Preparation

- Download the 2007 TREC Public Spam Corpus from https://plg.uwaterloo.ca/~gvcormac/treccorpus07/ (255MB)
- Read the "Agreement for use"
- Set up the `datasets` directory
- Untar the corpus in the `datasets` directory

### Starting with the code...
- create constants with the path of the folders containing the data

In [None]:
DATA_DIR = 'datasets/trec07p/data/'
LABELS_FILE = 'datasets/trec07p/full/index'

- import nltk ("Natural Language ToolKit") and download the required packages.
    - the `import` statement lets you gain access to code in another module
    - `nltk` is a suite of libraries and programs for natural language processing (NLP)

In [None]:
import nltk

In [None]:
nltk.download('words')
nltk.download('stopwords')
nltk.download('punkt')

- define functions to manage the email data (for now you don't need to study this code; you can run it as a black box)

In [None]:
def flatten_to_string(parts):
    """
    Combine the different parts of the email into a flat list of strings.
    """
    ret = []
    if isinstance(parts, str):
        ret.append(parts)
    elif isinstance(parts, list):
        for part in parts:
            ret += flatten_to_string(part)  # Recursion
    elif parts.get_content_type() == 'text/plain':
        # get_content_type is a method and must be called;
        # recursing on the payload appends the text as one string
        ret += flatten_to_string(parts.get_payload())
    return ret

In [None]:
def extract_email_text(path):
    """
    Extract subject and body text from a single email file.
    """
    # Load a single email from an input file
    with open(path, errors='ignore') as f:
        msg = email.message_from_file(f)
    if not msg:
        return ""

    # Read the email subject
    subject = msg['Subject']
    if not subject:
        subject = ""

    # Read the email body
    body = ' '.join(m for m in flatten_to_string(msg.get_payload()) if type(m) == str)
    if not body:
        body = ""

    return subject + ' ' + body
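
To see what kind of object these helpers receive, here is a minimal sketch that parses a hand-written multipart message with the standard `email` module. The message text is made up for illustration and does not come from the corpus.

```python
import email

# A hand-written multipart message (invented for illustration)
raw = (
    "Subject: Cheap watches\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Buy now!\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<b>Buy now!</b>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(raw)

subject = msg["Subject"]
# For a multipart message, get_payload() returns a list of sub-messages
parts = msg.get_payload()
# Keep only the plain-text parts; get_content_type() is a method call
plain_parts = [p.get_payload().strip() for p in parts
               if p.get_content_type() == "text/plain"]

print(subject)
print(len(parts))
print(plain_parts)
```

For a non-multipart message, `get_payload()` returns a plain string instead of a list, which is why `flatten_to_string` has to handle both cases.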

In [None]:
def load(path):
    """
    Process a single email file into stemmed tokens.
    """
    email_text = extract_email_text(path)
    if not email_text:
        return []

    # Tokenize the message
    tokens = nltk.word_tokenize(email_text)

    # Remove punctuation from tokens
    tokens = [i.strip("".join(punctuations)) for i in tokens if i not in punctuations]

    # Remove stopwords and stem tokens
    if len(tokens) > 2:
        return [stemmer.stem(w) for w in tokens if w not in stopwords]
    return []


- import other required modules
    - `string`: module containing common string operations
    - `email`: module for managing email messages
    - `os`: module providing functions to navigate, create, delete and modify files and folders.
    - `pickle`: implements binary protocols for serializing and de-serializing a Python object structure (basically, can be used to store variables)

In [None]:
import string
import email
import os
import pickle

- define a list containing punctuation symbols (cast to `list` is required because `string.punctuation` returns a `str`)

In [None]:
punctuations = list(string.punctuation)
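
As a small illustration of how the punctuation filter in `load` behaves, the sketch below applies the same list comprehension to a handful of made-up tokens.

```python
import string

punctuations = list(string.punctuation)

# Made-up tokens, similar to what a tokenizer might produce
tokens = ["Hello", ",", "world", "!", "(free)", "$$$", "e-mail"]

# Drop tokens that are a single punctuation symbol,
# then strip punctuation from both ends of the remaining tokens
cleaned = [t.strip("".join(punctuations)) for t in tokens if t not in punctuations]

print(cleaned)  # ['Hello', 'world', 'free', '', 'e-mail']
```

Note that a token made of several punctuation characters (such as `$$$`) is not removed by the `not in` test and ends up as an empty string, while punctuation inside a token (`e-mail`) is left untouched.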

- define the set of stopwords (e.g. "and", "or", etc.)

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))

- define a stemmer to be used for preprocessing text

In [None]:
stemmer = nltk.PorterStemmer()

In [None]:
stemmer.stem("speaking")

- collect the labels (i.e. the **real** categories) of the emails from the dataset. Note the encoding used by the code below:
    - *ham* is mapped to 1
    - *spam* is mapped to 0

In [None]:
labels = {}
with open(LABELS_FILE) as f:
    for line in f:
        line = line.strip()
        label, key = line.split()
        labels[key.split('/')[-1]] = 1 if label.lower() == 'ham' else 0
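
The same parsing logic can be tried on a few lines without touching the corpus. This is only a sketch: the sample lines below are made up, and they assume the `<label> <path>` layout that the loop above expects.

```python
# Made-up index lines in the "<label> <path>" layout assumed by the loop above
sample_index = [
    "spam ../data/inmail.1",
    "ham ../data/inmail.2",
    "spam ../data/inmail.3",
]

toy_labels = {}
for line in sample_index:
    label, key = line.strip().split()
    # Keep only the file name; ham is encoded as 1 and spam as 0, as in the cell above
    toy_labels[key.split('/')[-1]] = 1 if label.lower() == 'ham' else 0

print(toy_labels)  # {'inmail.1': 0, 'inmail.2': 1, 'inmail.3': 0}
```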

In [None]:
type(labels)

#### Q: How many key-value pairs are in the dictionary?

#### Q: How many distinct values are in the dictionary?

#### Q: How many distinct keys are in the dictionary?

- split the corpus into a training set and a test set

In [None]:
filelist = os.listdir(DATA_DIR)

TRAINING_SET_RATIO = 0.7
X_train = filelist[:int(len(filelist)*TRAINING_SET_RATIO)]
X_test = filelist[int(len(filelist)*TRAINING_SET_RATIO):]
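
The slicing above keeps the files in whatever order `os.listdir` returns them, and that order is arbitrary. One possible variation, sketched here on hypothetical file names, is to shuffle with a fixed seed before slicing so that the split is random but reproducible.

```python
import random

# Hypothetical file names standing in for os.listdir(DATA_DIR)
toy_files = ['inmail.%d' % i for i in range(1, 11)]

# Shuffle a copy with a fixed seed so the split is reproducible
shuffled = toy_files[:]
random.Random(0).shuffle(shuffled)

ratio = 0.7
cut = int(len(shuffled) * ratio)
toy_train, toy_test = shuffled[:cut], shuffled[cut:]

print(len(toy_train), len(toy_test))
```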

#### Q: Why do we split the data?

#### Q: How many elements are in `X_train`?

#### Q: What is the type of the elements?

---

## SPAM DETECTION WITH BLACKLISTED WORDS

- We have to tell the system which words are *spam* and which are *ham*

In [None]:
spam_words = set()
ham_words = set()

In [None]:
# this cell might take a while, the first time you run it
if not os.path.exists('blacklist.pkl'):
    for filename in X_train:
        path = os.path.join(DATA_DIR, filename)
        if filename in labels:
            label = labels[filename]
            stems = load(path)
            if not stems:
                continue
            if label == 1:
                ham_words.update(stems)
            elif label == 0:
                spam_words.update(stems)
            else:
                continue
    blacklist = spam_words - ham_words
    with open('blacklist.pkl', 'wb') as f:
        pickle.dump(blacklist, f)
else:
    with open('blacklist.pkl', 'rb') as f:
        blacklist = pickle.load(f)

print('Blacklist successfully built/loaded')
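
The core of the cell above is a set difference. Here is a minimal sketch of the same idea on a few invented stem lists, using the label encoding of the code above (1 = ham, 0 = spam).

```python
# Toy stems, invented for illustration (label 1 = ham, 0 = spam)
toy_emails = [
    (["meet", "tomorrow", "offic"], 1),
    (["free", "prize", "click"], 0),
    (["click", "meet", "link"], 1),
    (["free", "pill", "click"], 0),
]

toy_spam_words, toy_ham_words = set(), set()
for stems, label in toy_emails:
    if label == 1:
        toy_ham_words.update(stems)
    else:
        toy_spam_words.update(stems)

# Words seen in spam but never in ham
toy_blacklist = toy_spam_words - toy_ham_words

print(sorted(toy_blacklist))  # ['free', 'pill', 'prize']
```

"click" appears in spam but also in one ham email, so it does not make it into the blacklist.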

- Let's see some of them...

In [None]:
blacklist

#### Q: How many elements in the set?

#### Q: Is "spam" in the set?

#### Q: Is "fibonacci" in the set?

#### Q: How long is the longest word in the set?

#### Q: Which word is it?

- But these are not really "words"... Let's look only at the actual words

In [None]:
from nltk.corpus import words
word_set = set(words.words())
word_blacklist = word_set.intersection(blacklist)

In [None]:
word_blacklist

#### Q: How many elements in the set?

#### Q: How long is the longest word in the set?

#### Q: Which word is it?

## Let's use this model!

#### Metrics for classification

- Building blocks for the metrics
    - TP = True Positive
    - TN = True Negative
    - FP = False Positive
    - FN = False Negative


- The actual metrics:
    - $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
    - $\text{precision} = \frac{TP}{TP + FP}$ (a.k.a. positive predictive value)
    - $\text{recall} = \frac{TP}{TP + FN}$ (a.k.a. sensitivity, hit rate, true positive rate)
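
The three formulas can be sketched as a small helper. The function name and the example counts are made up for illustration, and returning `0.0` when a denominator is zero is just one possible convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumed convention: report 0.0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts: 100 emails in total
acc, prec, rec = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print("accuracy=%.2f precision=%.3f recall=%.2f" % (acc, prec, rec))
```

With these counts, accuracy is 85/100, precision is 40/45 and recall is 40/50.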

In [None]:
# this cell might take quite some time
fp = 0
tp = 0
fn = 0
tn = 0

for filename in X_test:
    path = os.path.join(DATA_DIR, filename)
    if filename in labels:
        label = labels[filename]
        stems = load(path)
        if not stems:
            continue
        stems_set = set(stems)
        if stems_set & blacklist:  # INTERSECTION BETWEEN SETS
            if label == 1:
                fp = fp + 1
            else:
                tp = tp + 1
        else:
            if label == 1:
                tn = tn + 1
            else:
                fn = fn + 1
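
To make the four branches of the loop above concrete, here is a sketch with one invented email per outcome. "Positive" means "flagged as spam", and the labels follow the encoding of the code above (1 = ham, 0 = spam).

```python
toy_blacklist = {"free", "prize", "pill"}

# (stems, label) pairs made up for illustration; label 1 = ham, 0 = spam
toy_test = [
    ({"free", "pill"}, 0),    # spam, hits the blacklist  -> TP
    ({"meet", "prize"}, 1),   # ham, hits the blacklist   -> FP
    ({"click", "link"}, 0),   # spam, no blacklisted stem -> FN
    ({"meet", "offic"}, 1),   # ham, no blacklisted stem  -> TN
]

counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
for stems, label in toy_test:
    flagged = bool(stems & toy_blacklist)  # is any blacklisted stem present?
    if flagged:
        counts["fp" if label == 1 else "tp"] += 1
    else:
        counts["tn" if label == 1 else "fn"] += 1

print(counts)  # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```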

In [None]:
print("TN %d" % tn)
print("FP %d" % fp)
print("FN %d" % fn)
print("TP %d" % tp)

#### Q: Print the TP, FP, TN, FN as percentage (or fraction).

In [None]:
count = tp + tn + fp + fn  # total number of classified emails

print("Confusion matrix:\n")
print("| TN %.2f | FP %.2f |" % (tn/count, fp/count))
print("| FN %.2f | TP %.2f |" % (fn/count, tp/count))

In [None]:
print("Accuracy: %.5f" % ((tp+tn)/count))
print("Precision: %.5f" % (tp/(tp+fp)))
print("Recall: %.5f" % (tp/(tp+fn)))

---

## Logistic Regression

Let's now try a different model.

In [None]:
def read_email_files():
    X = []
    y = [] 
    for i in range(len(labels)):
        filename = 'inmail.' + str(i+1)
        email_str = extract_email_text(
            os.path.join(DATA_DIR, filename))
        X.append(email_str)
        y.append(labels[filename])
    return X, y

- "Read" the emails

In [None]:
X, y = read_email_files()

- split into a training set and a test set

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test, idx_train, idx_test = \
    train_test_split(X, y, range(len(y)), 
    train_size=TRAINING_SET_RATIO, random_state=2)


#### Q: What does the input data (i.e. X) look like? (types, content, ...)

#### Q: And the target label?

- As input, we have strings (emails). We **have** to convert them into numbers.

#### Q: Any ideas on how to convert strings into numbers? (hint: think about the stemmer we have seen...)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vector = vectorizer.fit_transform(X_train)
X_test_vector = vectorizer.transform(X_test)
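
The idea behind a count-based vectorizer can be sketched in plain Python. This is a conceptual simplification rather than scikit-learn's implementation: the `tokenize` and `transform` helpers and the toy documents are made up, and the result is a dense list instead of a sparse matrix.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters (a simplification)
    return re.findall(r"\b\w\w+\b", text.lower())

train_docs = ["Free prize, click now", "Meeting moved to now"]
test_docs = ["Click click for a free gift"]

# "fit": learn a sorted vocabulary from the training documents only
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(doc):
    counts = Counter(tokenize(doc))
    # One count per vocabulary word; words unseen during fit are ignored
    return [counts[word] for word in vocabulary]

print(vocabulary)
print([transform(d) for d in train_docs])
print(transform(test_docs[0]))
```

Each email becomes a vector with one entry per vocabulary word. The vocabulary comes from the training data only, which is why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.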

#### Q: Let's look at the input data now. How is it different from before?

- let's define and train the model!

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train_vector, y_train)
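
Conceptually, a trained logistic regression model scores an input by taking a weighted sum of its features plus an intercept and passing the result through the sigmoid function. The sketch below uses hypothetical weights on a three-word vocabulary. They are not values learned from the corpus, and the output is read as the probability of the class encoded as 1 (ham, in the labels cell of this notebook).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a toy vocabulary: ["click", "free", "meeting"]
weights = [-0.8, -1.2, 1.5]
bias = 0.5

def predict_proba(counts):
    # Weighted sum of the word counts plus the intercept, squashed into (0, 1)
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

p = predict_proba([2, 1, 0])       # "click" twice, "free" once
predicted = 1 if p >= 0.5 else 0   # 1 = ham, 0 = spam

print(round(p, 3), predicted)
```

Training consists of finding the weights and intercept that best fit the training labels; `clf.fit` does this for every word in the vocabulary.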

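- and check its predictions (by default, scikit-learn's `precision_score` and `recall_score` treat label 1 as the positive class, so here they are computed for whichever class `labels` encodes as 1)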

In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score
)

y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

- You can play with the parameters of the model, looking for the best configuration

In [None]:
# `multi_class` is left at its default: the argument is deprecated in recent scikit-learn releases
clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=250)
clf.fit(X_train_vector, y_train)

y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

---

## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Define the model
clf = DecisionTreeClassifier()

# Train the model (this might take quite some time)
clf.fit(X_train_vector, y_train)

In [None]:
y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

- Let's try different parameters!

In [None]:
# Define the model
clf = DecisionTreeClassifier(max_leaf_nodes=2)

# Train the model (this might take quite some time)
clf.fit(X_train_vector, y_train)
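
With `max_leaf_nodes=2` the tree can make a single split, so it boils down to one threshold test on one feature. Here is a minimal sketch of such a one-split rule; the helper name, the chosen feature and the threshold are hypothetical, not what the fitted tree actually learns.

```python
def stump_predict(counts, feature_index, threshold, left_label, right_label):
    # One split, two leaves: compare a single feature against a threshold
    return left_label if counts[feature_index] <= threshold else right_label

# Hypothetical rule on a toy vocabulary ["click", "free", "meeting"]:
# if "free" never appears predict 1 (ham), otherwise predict 0 (spam)
toy_vectors = [[2, 1, 0], [0, 0, 1], [1, 0, 0]]
preds = [
    stump_predict(v, feature_index=1, threshold=0.5, left_label=1, right_label=0)
    for v in toy_vectors
]

print(preds)  # [0, 1, 1]
```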

In [None]:
y_pred = clf.predict(X_test_vector)

print("Accuracy:  %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("Recall:    %.5f" % recall_score(y_test, y_pred))

- that's not very good... let's check how it performed on the training data

In [None]:
y_pred = clf.predict(X_train_vector)

print("Accuracy:  %.5f" % accuracy_score(y_train, y_pred))
print("Precision: %.5f" % precision_score(y_train, y_pred))
print("Recall:    %.5f" % recall_score(y_train, y_pred))

- if the performance is poor both on the training set and the test set, it is a case of **UNDERFITTING**.
- if the performance is poor on the test set but good on the training set, it is a case of **OVERFITTING**. (Basically, the model is not able to generalize).
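
The two rules above can be written as a rough helper. This is a sketch only: the function name and the `good` and `gap` thresholds are arbitrary illustrations, not standard values.

```python
def diagnose(train_score, test_score, good=0.9, gap=0.1):
    # Thresholds are arbitrary illustrations, not standard values
    if train_score < good and test_score < good:
        return "underfitting"   # poor on both sets
    if train_score - test_score > gap:
        return "overfitting"    # good on training data, much worse on test data
    return "ok"

print(diagnose(0.62, 0.60))  # underfitting
print(diagnose(0.99, 0.75))  # overfitting
print(diagnose(0.97, 0.95))  # ok
```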

---

## Random Forest Classifier

#### Q: Now it's your turn! Try to build a Random Forest classifier.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the model

# Train the model

# Perform the predictions

# Compute the metrics (Accuracy, precision, recall)


---