## Q2.2 Spam Dataset

In this task we explore an email dataset using the spam classifier introduced in the tutorial. 
First, we reuse the code that reads the messages from the files and builds a data frame:

In [1]:
import os
import sys
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score

def progress(i, end_val, bar_length=50):
    '''
    Print a progress bar of the form: Percent: [#####      ]
    i is the current progress value expected in a range [0..end_val]
    bar_length is the width of the progress bar on the screen.
    '''
    percent = float(i) / end_val
    hashes = '#' * int(round(percent * bar_length))
    spaces = ' ' * (bar_length - len(hashes))
    sys.stdout.write("\rPercent: [{0}] {1}%".format(hashes + spaces, int(round(percent * 100))))
    sys.stdout.flush()

NEWLINE = '\n'

HAM = 'ham'
SPAM = 'spam'

SOURCES = [
    ('data/spam', SPAM),
    ('data/easy_ham', HAM),
    ('data/hard_ham', HAM),
    ('data/beck-s', HAM),
    ('data/farmer-d', HAM),
    ('data/kaminski-v', HAM),
    ('data/kitchen-l', HAM),
    ('data/lokay-m', HAM),
    ('data/williams-w3', HAM),
    ('data/BG', SPAM),
    ('data/GP', SPAM),
    ('data/SH', SPAM)
]

SKIP_FILES = {'cmds'}


def read_files(path):
    '''
    Generator of pairs (filename, filecontent)
    for all files below path whose name is not in SKIP_FILES.
    The content of the file is of the form:
        header....
        <emptyline>
        body...
    This skips the headers and returns body only.
    '''
    # os.walk already descends into every subdirectory, so no explicit recursion is needed.
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    # latin-1 can decode any byte sequence, so odd encodings do not raise.
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                past_header = True
                    # Each line already ends with a newline, so join without a separator.
                    content = ''.join(lines)
                    yield file_path, content


def build_data_frame(l, path, classification):
    rows = []
    index = []
    for i, (file_name, text) in enumerate(read_files(path)):
        if ((i + l) % 100 == 0):
            progress(i + l, 58910, 50)
        rows.append({'text': text, 'class': classification})
        index.append(file_name)

    data_frame = DataFrame(rows, index=index)
    return data_frame, len(rows)


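def load_data():
    # DataFrame.append was removed in pandas 2.0; collect the parts and concatenate once.
    from pandas import concat
    frames = []
    l = 0
    for path, classification in SOURCES:
        data_frame, nrows = build_data_frame(l, path, classification)
        frames.append(data_frame)
        l += nrows
    data = concat(frames, sort=False)
    # Shuffle the rows so that ham and spam messages are mixed.
    data = data.reindex(numpy.random.permutation(data.index))
    return data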
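
The header-skipping rule described in the read_files docstring can be tried on an in-memory message without touching the data directory. The following is only a minimal sketch: the helper name extract_body and the sample message are hypothetical and not part of the tutorial code.

```python
import io


def extract_body(stream):
    """Return everything after the first empty line of a file-like object."""
    past_header, lines = False, []
    for line in stream:
        if past_header:
            lines.append(line)
        elif line == "\n":
            # The first empty line separates the headers from the body.
            past_header = True
    # Lines keep their trailing newline, so they are joined without a separator.
    return "".join(lines)


# Hypothetical message: two header lines, an empty line, then the body.
raw = "From: a@example.com\nSubject: hi\n\nHello there\nSecond line\n"
body = extract_body(io.StringIO(raw))
print(repr(body))
```

Only the two body lines survive; the From and Subject headers are dropped.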

### 2.2.1 

Now we want to collect some statistics. We start by counting the number of distinct unigrams and bigrams that the CountVectorizer extracts from the dataset, i.e. the size of its vocabulary. 

In [2]:
data = load_data()

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(data["text"].values)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out().
features = cv.get_feature_names_out()
n = len(features)
print("number of unigrams and bigrams: " + str(n))

Percent: [##################################################] 100%-----------------------------------------------
number of unigrams and bigrams: 4015950
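
The count above mixes unigrams and bigrams. As a small sketch of how the two could be counted separately, the example below uses a hypothetical three-message toy corpus (toy_texts is an assumption, standing in for data["text"].values) and relies on the fact that a bigram in the vocabulary contains exactly one space.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, standing in for data["text"].values.
toy_texts = ["free money now", "meeting now", "free money free"]

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(toy_texts)

# vocabulary_ maps every n-gram to its column index.
# With ngram_range=(1, 2), a term containing a space is a bigram.
n_unigrams = sum(1 for term in cv.vocabulary_ if " " not in term)
n_bigrams = sum(1 for term in cv.vocabulary_ if " " in term)
print(n_unigrams, n_bigrams, len(cv.vocabulary_))
```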


### 2.2.2
Next, we look at the 50 most frequent unigrams and bigrams in the dataset. We pass the max_features argument to the CountVectorizer, so that it builds a vocabulary containing only the top max_features terms ordered by term frequency across the dataset. Note that the features are printed in alphabetical order, not in order of frequency.

In [3]:
cv = CountVectorizer(ngram_range=(1, 2), max_features=50)
cv.fit(data["text"].values)
# get_feature_names_out() returns a NumPy array; convert it to a plain list for printing.
features = cv.get_feature_names_out().tolist()
print(features)

['20', '3d', '3d http', 'align', 'and', 'arial', 'be', 'border', 'br', 'br br', 'color', 'com', 'content', 'div', 'face', 'font', 'font face', 'font size', 'for', 'height', 'href', 'html', 'http', 'http www', 'in', 'is', 'it', 'nbsp', 'nbsp nbsp', 'of', 'on', 'size', 'span', 'style', 'style 3d', 'table', 'td', 'td td', 'td tr', 'that', 'the', 'this', 'to', 'tr', 'tr td', 'width', 'with', 'www', 'you', 'your']
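
The printed list does not show how often each term occurs. A minimal sketch of recovering the counts is given below, again on a hypothetical toy corpus (toy_texts is an assumption): the columns of the count matrix are summed to obtain the corpus-wide frequency of every kept term.

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, chosen so that the three most frequent terms are unambiguous.
toy_texts = ["free money free money free", "free money now", "now meeting"]

cv = CountVectorizer(ngram_range=(1, 2), max_features=3)
counts = cv.fit_transform(toy_texts)

# Summing each column gives the total number of occurrences of that term.
totals = numpy.asarray(counts.sum(axis=0)).ravel()
term_counts = {term: int(totals[idx]) for term, idx in cv.vocabulary_.items()}

# Sort by decreasing frequency for display.
for term, count in sorted(term_counts.items(), key=lambda kv: -kv[1]):
    print(term, count)
```

Sorting the same dictionary on the real data would give the top 50 terms in order of frequency instead of alphabetically.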


### 2.2.3

Now we do the same, but this time we check the 50 most frequent unigrams and bigrams per class. First, we create a filtered list of texts for each class, and then we pass it to the CountVectorizer.

In [6]:
def filter_texts_by_class(data, class_name):
    # Boolean indexing keeps only the texts whose label equals class_name.
    return data.loc[data["class"] == class_name, "text"].tolist()

print("The 50 most frequent unigrams and bigrams per 'ham' class")
filt_list = filter_texts_by_class(data, "ham")
cv = CountVectorizer(ngram_range=(1, 2), max_features=50)
cv.fit(filt_list)
# get_feature_names() was removed in recent scikit-learn releases.
features = cv.get_feature_names_out().tolist()
print(features)
print('-' * 100)
print("The 50 most frequent unigrams and bigrams per 'spam' class")
filt_list = filter_texts_by_class(data, "spam")
cv = CountVectorizer(ngram_range=(1, 2), max_features=50)
cv.fit(filt_list)
features = cv.get_feature_names_out().tolist()
print(features)

The 50 most frequent unigrams and bigrams per 'ham' class
['09', '10', '20', '3d', 'an', 'and', 'are', 'as', 'at', 'be', 'br', 'by', 'com', 'ect', 'enron', 'font', 'for', 'from', 'gif', 'has', 'have', 'height', 'hou', 'http', 'http www', 'if', 'img', 'in', 'in the', 'is', 'it', 'not', 'of', 'of the', 'on', 'or', 'src', 'td', 'that', 'the', 'this', 'to', 'tr', 'we', 'width', 'will', 'with', 'www', 'you', 'your']
----------------------------------------------------------------------------------------------------
The 50 most frequent unigrams and bigrams per 'spam' class
['20', '3d', '3d http', 'align', 'and', 'arial', 'border', 'br', 'br br', 'center', 'color', 'com', 'content', 'div', 'face', 'font', 'font face', 'font size', 'font td', 'for', 'height', 'href', 'href 3d', 'html', 'http', 'http www', 'in', 'is', 'nbsp', 'nbsp nbsp', 'of', 'size', 'span', 'style', 'style 3d', 'table', 'td', 'td td', 'td tr', 'text', 'the', 'this', 'to', 'tr', 'tr td', 'tr tr', 'width', 'www', 'you', 'your']

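We can see that there are some differences between the two sets. The 'spam' set, for example, contains more "HTML words" such as style, span and size, while the 'ham' set contains more ordinary English words and corpus-specific terms such as enron and ect. 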
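
The comparison can be made explicit with set operations on the two lists printed above. This is a small sketch; the variable names ham_top and spam_top are assumptions, and the terms are copied from the output of this cell.

```python
# Top-50 terms per class, copied from the output above.
ham_top = set((
    "09, 10, 20, 3d, an, and, are, as, at, be, br, by, com, ect, enron, font, "
    "for, from, gif, has, have, height, hou, http, http www, if, img, in, "
    "in the, is, it, not, of, of the, on, or, src, td, that, the, this, to, "
    "tr, we, width, will, with, www, you, your"
).split(", "))

spam_top = set((
    "20, 3d, 3d http, align, and, arial, border, br, br br, center, color, "
    "com, content, div, face, font, font face, font size, font td, for, "
    "height, href, href 3d, html, http, http www, in, is, nbsp, nbsp nbsp, "
    "of, size, span, style, style 3d, table, td, td td, td tr, text, the, "
    "this, to, tr, tr td, tr tr, width, www, you, your"
).split(", "))

shared = ham_top & spam_top       # frequent in both classes
ham_only = ham_top - spam_top     # frequent only in ham
spam_only = spam_top - ham_top    # frequent only in spam

print("shared:", sorted(shared))
print("ham only:", sorted(ham_only))
print("spam only:", sorted(spam_only))
```

The terms that appear only in the spam list are almost all HTML tags and attributes, which supports the observation above.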

### 2.2.4

Now we want to find the 20 most useful features in the Naive Bayes classifier for distinguishing between the two classes. First, we train the MultinomialNB classifier on the data (with the transformed text). Then we take the 20 largest Naive Bayes coefficients. The coefficients mirror feature_log_prob_ so that MultinomialNB can be interpreted as a linear model; for a binary problem they are the row of feature_log_prob_ that belongs to the second class, here "spam". feature_log_prob_ is the empirical log probability of the features given a class, $\log Pr(f_i \mid Class = c)$.

In [8]:
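cv = CountVectorizer(ngram_range=(1, 2))
transformed_data = cv.fit_transform(data["text"].values)
clf = MultinomialNB()
clf.fit(transformed_data, data["class"].values)

# clf.classes_ is sorted, so row 1 of feature_log_prob_ belongs to "spam".
# coef_[0] exposed the same row for binary problems, but coef_ is no longer
# available on MultinomialNB in recent scikit-learn releases.
spam_log_prob = clf.feature_log_prob_[1]
top20 = numpy.argsort(spam_log_prob)[-20:]
feature_names = cv.get_feature_names_out()
topFeatures = [feature_names[j] for j in top20]
print(", ".join(topFeatures))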


span, style, border, face, br br, of, width, nbsp nbsp, and, http, 20, to, tr, size, the, nbsp, td, br, 3d, font


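These are the 20 features with the highest log probability under the "spam" label. For example, the word 'nbsp' is likely to appear when the class is spam. This fits 2.2.3, where we saw that nbsp is frequent in the spam texts but is not among the most frequent terms of the ham texts. Note, however, that several of these features (such as 'the', 'to' and 'and') are also frequent in ham, so a high probability under spam does not by itself mean that a feature separates the two classes. 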