## This is an introduction to Machine Learning with scikit-learn
### We will build a simple spam-filter trained on some example E-Mails

First let's import all the necessary libraries <br/>
We will use pandas as well ;-)

<span style="color:red">Note: If any of the imports fail, install the missing packages with pip or conda.</span>

In [2]:
import os
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

### Data preprocessing
Next we will read the spam and non-spam (i.e. ham) examples from the data directory. <br/>
Make sure the examples are in the <i>data</i> directory.

In [3]:
NEWLINE = '\n'

HAM = 'ham'
SPAM = 'spam'

SOURCES = [
    ('data/ham/beck-s',      HAM),
    ('data/ham/farmer-d',    HAM),
    ('data/ham/kaminski-v',  HAM),
    ('data/ham/kitchen-l',   HAM),
    ('data/ham/lokay-m',     HAM),
    ('data/ham/williams-w3', HAM),
    ('data/spam/BG',          SPAM),
    ('data/spam/GP',          SPAM),
    ('data/spam/SH',          SPAM)
]

SKIP_FILES = {'cmds'}

def read_files(path):
    '''Walk all files below path and yield (file_path, email_body) pairs.'''
    # os.walk already descends into subdirectories, so no explicit recursion is needed
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    # the context manager closes the file even if reading fails
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                # the first blank line separates the headers from the body
                                past_header = True
                    content = NEWLINE.join(lines)
                    yield file_path, content
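
The header-skipping logic can be tried in isolation on an in-memory message. This is a minimal sketch: the function name `extract_body` and the sample message are made up for illustration, and the body is joined with an empty string here because every line already ends with its own newline.

```python
import io

NEWLINE = '\n'

def extract_body(lines):
    """Return everything after the first blank line, which separates headers from body."""
    past_header, body = False, []
    for line in lines:
        if past_header:
            body.append(line)
        elif line == NEWLINE:
            past_header = True
    return ''.join(body)

# Hypothetical message, not taken from the data set
raw = "Subject: Meeting\nFrom: alice@example.com\n\nSee you at ten.\nBring the slides.\n"
body = extract_body(io.StringIO(raw))
print(repr(body))
```

Only the two lines after the blank separator end up in `body`; the `Subject` and `From` headers are dropped.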

Next we build a pandas DataFrame from the files, with one row per e-mail, indexed by file path.

In [4]:
def build_data_frame(path, classification):
    rows = []
    index = []
    for file_name, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)

    data_frame = DataFrame(rows, index=index)
    return data_frame

Now we will concatenate the DataFrames with pandas' concat function. <br/>
<span style="color:red">Note: This may take some time. Keep calm and let it finish ;-) </span>

In [5]:
from pandas import concat

# DataFrame.append was removed in pandas 2.0, so build one frame per source
# and concatenate them in a single call
data = concat([build_data_frame(path, classification) for path, classification in SOURCES])

data = data.reindex(numpy.random.permutation(data.index))

### Feature extraction
We start with the basic CountVectorizer, i.e. a bag-of-words approach. <br/>
<span style="color:red">Note: This may take some time. Keep calm and let it finish ;-) </span>

In [10]:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(data['text'].values)
labels = data['class'].values

# split data into test and training set - hold 20% out for testing
X_train, X_test, y_train, y_test = train_test_split(counts, labels, test_size=0.2, random_state=1)
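
To get a feeling for what a bag-of-words matrix looks like, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation; the two documents are made up, and the `tokenize` helper is a hypothetical stand-in that only approximates CountVectorizer's defaults (lowercasing, tokens of at least two word characters).

```python
import re
from collections import Counter

# Hypothetical mini-corpus, made up for illustration
docs = ["Free Viagra call today", "Call me today"]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens; each one becomes a column
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# Each row counts how often every vocabulary term occurs in one document
matrix = [[Counter(tokenize(doc))[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is lost: each document is reduced to one count per vocabulary term. On the real data set most entries are zero, which is presumably why CountVectorizer returns a sparse matrix.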

### Classification with Naive Bayes Classifier
Train the classifier on the training set.

In [8]:
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train);
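
What happens inside the fit call can be approximated with a few lines of plain Python. The sketch below follows the textbook multinomial Naive Bayes formulation with additive (Laplace) smoothing. It is an illustration under that assumption, not scikit-learn's code, and the function names `train` and `predict` as well as the four tiny documents are made up.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocabulary = {token for tokens in docs for token in tokens}
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_docs[label] / len(labels))
        # Additive smoothing keeps unseen words from getting probability zero
        log_likelihood = {
            token: math.log((word_counts[label][token] + alpha)
                            / (total + alpha * len(vocabulary)))
            for token in vocabulary
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Tokens outside the training vocabulary are ignored
        scores[label] = log_prior + sum(
            log_likelihood[t] for t in tokens if t in log_likelihood)
    return max(scores, key=scores.get)

# Hypothetical, already tokenized training documents
docs = [["free", "viagra", "today"], ["free", "offer", "call"],
        ["meeting", "today"], ["project", "meeting", "call"]]
labels = ["spam", "spam", "ham", "ham"]

model = train(docs, labels)
print(predict(model, ["free", "viagra"]))
print(predict(model, ["project", "meeting"]))
```

Working in log space avoids multiplying many small probabilities, which would underflow on long e-mails.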

### Validate predictions

In [9]:
print('Test accuracy: %.3f' % nb_clf.score(X_test, y_test))

Test accuracy: 0.992


### Test some other scores like precision, recall and f1-score

In [39]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = nb_clf.predict(X_test)
# average="macro" computes each metric per class and takes the unweighted mean, so ham and spam count equally
print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred, average="macro"))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred, average="macro"))
print('F1-Score: %.3f' % f1_score(y_true=y_test, y_pred=y_pred, average="macro"))

Precision: 0.954
Recall: 0.960
F1-Score: 0.955
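
The macro average can also be worked out by hand. The sketch below uses ten made-up labels (not the notebook's data) and a hypothetical helper `precision_per_class` to show how per-class precision is computed and then averaged with equal weight per class.

```python
def precision_per_class(y_true, y_pred, labels):
    """Precision for each label: correct predictions / all predictions of that label."""
    result = {}
    for label in labels:
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        result[label] = correct / predicted if predicted else 0.0
    return result

# Hypothetical, imbalanced toy data: 8 ham and 2 spam messages
y_true = ['ham'] * 8 + ['spam'] * 2
# One ham is flagged as spam and one spam slips through as ham
y_pred = ['ham'] * 7 + ['spam'] + ['spam', 'ham']

per_class = precision_per_class(y_true, y_pred, ['ham', 'spam'])
macro_precision = sum(per_class.values()) / len(per_class)

print(per_class)
print(macro_precision)
```

Here 8 of 10 predictions are correct, yet the macro precision is lower than that because the weak result on the small spam class weighs as much as the result on the large ham class.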


##### Now see what happens when you switch the scores to 'micro' averaging
How do these results differ from the previous ones, and why?

In [11]:
# TODO fill in the averaging mode 'micro'
print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred, average=" "))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred, average=" "))
print('F1-Score: %.3f' % f1_score(y_true=y_test, y_pred=y_pred, average=" "))


### Validate prediction with an example
Now that we have seen the score, let's test it on an example. <br/>
Is the prediction correct?

In [42]:
example = ["Free Viagra call today"]
example_counts = count_vectorizer.transform(example)
prediction = nb_clf.predict(example_counts);
prediction[0]

'spam'

A few more examples. Are these predictions correct as well?

In [44]:
examples = ["Free Viagra call today!", 
            "I'm going to attend the Python Learning group tomorrow.",
            "Free Viagra Free Viagra Free Viagra",
            "Today we will learn about Machine Learning"]
example_counts = count_vectorizer.transform(examples)
predictions = nb_clf.predict(example_counts)
for i, predicted_label in enumerate(predictions):
    print("E-Mail %i: %s" % (i, predicted_label))

E-Mail 0: spam
E-Mail 1: ham
E-Mail 2: spam
E-Mail 3: ham


Now try your own example. Type some e-mail text inside the quotes and check whether the prediction is correct.

In [55]:
your_test_mail = [" ... "]
test_mail_counts = count_vectorizer.transform(your_test_mail)
print(nb_clf.predict(test_mail_counts))

['spam']


### Classification with Support Vector Machine
Tasks:<br/>
1. Instantiate the classifier with default parameters (empty parentheses)
2. Train the classifier with the training data

In [None]:
from sklearn.svm import SVC
svm_clf =  #TODO
svm_clf.fit(X_, y_) #TODO texts, labels from training set

Let's see how the accuracy compares to that of the Naive Bayes classifier. <br/>
Is it better or worse?

In [None]:
print('Test accuracy: %.3f' % svm_clf.score(X_test, y_test))

Don't forget to look at the other scores as well.

In [None]:
y_pred = svm_clf.predict(X_test)
# average="macro" computes each metric per class and takes the unweighted mean, so ham and spam count equally
print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred, average="macro"))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred, average="macro"))
print('F1-Score: %.3f' % f1_score(y_true=y_test, y_pred=y_pred, average="macro"))

And test it with an example.

In [None]:
example = ["Free Viagra call today"]
example_counts = count_vectorizer.transform(example)
prediction = svm_clf.predict(example_counts);
prediction[0]

### Classification with K-Nearest-Neighbour
Now it's your turn to build a classifier using scikit-learn's K-Nearest-Neighbour implementation.<br/>
Validate the classifier using the same approaches as above. <br/>
How does K-Nearest-Neighbour perform on this data set?

In [2]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = #TODO
knn_clf.fit(#TODO)

## What if we use TfidfVectorizer instead of CountVectorizer?