<h1> CS4618 Assignment - Spam Filter</h1>

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
import os
import glob
import re

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer



class DataFrameSelector(BaseEstimator,TransformerMixin):
    def __init__(self, attribute_names, dtype = None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y = None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values
    
class FeatureBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, features_values):
        self.features_values = features_values
        self.num_features = len(features_values)
        self.labelencodings = [LabelEncoder().fit(feature_values) for feature_values in features_values]
        self.onehotencoder = OneHotEncoder(sparse=False,n_values=[len(feature_values) for feature_values in features_values])
        self.last_indexes = np.cumsum([len(feature_values) - 1 for feature_values in self.features_values])
    def fit(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        return self.onehotencoder.fit(X)
    def transform(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        onehotencoded = self.onehotencoder.transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def fit_transform(self, X, y=None):
        onehotencoded = self.fit(X).transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def get_params(self, deep=True):
        return {"features_values" : self.features_values}
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

<h1>1. Reading the mails</h1>
<p>The method takes in a <b>path</b> ('spam' or 'ham') and returns a <b>pandas DataFrame object</b>.</p>
<p>The DataFrame has two columns: the <b>content</b> and the <b>type</b> of each mail. The content is the unmodified text of the mail, and the type is determined from the path, so it is either spam or ham.</p>

In [5]:
def readMails(path):
    # Collect one record per file and build the DataFrame once
    # (DataFrame.append was removed in pandas 2.0).
    records = []
    for file in glob.glob('./' + path + '/*.txt'):
        with open(file) as f:
            records.append({'content': f.read(), 'type': path})
    return pd.DataFrame(records, columns=['content', 'type'], dtype=object)

In [6]:
spams = readMails('spam')
hams = readMails('ham')

In [7]:
mails = pd.concat([spams, hams])  # DataFrame.append was removed in pandas 2.0

<h1>2. Explore the mails</h1>

In [8]:
(rows, cols) = mails.shape
(rows, cols)

(2898, 2)

In [9]:
mails.describe(include='all')

                                                 content  type
count                                               2898  2898
unique                                              2898     2
top     From fork-admin@xent.com Mon Jul 22 18:13:54 ...   ham
freq                                                   1  1650


In [10]:
hams_count = (mails['type'] == 'ham').sum();
spams_count = (mails['type'] == 'spam').sum();
print('Total number of hams = ', hams_count, '(',hams_count / rows * 100,'%)')
print('Total number of spam = ', spams_count, '(',spams_count / rows * 100,'%)')

Total number of hams =  1650 ( 56.9358178054 %)
Total number of spam =  1248 ( 43.0641821946 %)


<p> The emails contain:</p>
<ul>
    <li><b>Header and Body</b></li>
    <li><b>Uppercase and lowercase words</b></li>
    <li><b>HTML tags</b>: e.g. &lt;font color="blue"&gt;</li>
    <li><b>Numbers</b>: e.g. 123, 300,000</li>
    <li><b>Money amounts</b>: e.g. $15.00</li>
    <li><b>Underscores</b>: e.g. _______</li>
</ul>

<h1>3. CountVectorizer & TFIDF </h1>

<p><b>3.1 Uppercase words</b></p>
<p>Spam mails often contain runs of uppercase words (e.g. PROTECT YOUR LOVED ONES AND YOURSELF), so it could be useful for a spam filter to distinguish between lowercase and uppercase words. The decision was to <b>set lowercase to False</b> in both CountVectorizer and TFIDF.</p>

<p><b>3.2 Stop-words</b></p>
<p>Stop-words are very frequent in both spam and ham emails. The decision was to discard them, as they do not add anything to the content, so we <b>set stop_words to 'english'</b>.</p>
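<p>One interaction between 3.1 and 3.2 is worth checking: the built-in English stop-word list is lowercase, so with lowercase=False only exact lowercase matches are dropped. The sketch below is illustrative only; it uses a made-up sentence and the default token pattern rather than the custom tokenizer defined later.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-line "mail", used only for illustration.
docs = ["The offer and the prize"]

vec = CountVectorizer(stop_words='english', lowercase=False)
vec.fit(docs)

# 'the' and 'and' are removed as stop-words, but capitalised 'The' survives
# because the comparison against the stop-word list is case-sensitive.
print(sorted(vec.vocabulary_))
```

<p>Capitalised stop-words such as "The" therefore remain as features in the pipelines that keep case.</p>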

<p><b>3.3 HTML tags</b></p>
<p>HTML tags can be useful in detecting spam. HTML is used to make a mail colorful by setting the font, color and size of the text, and spam mails usually contain a lot of HTML tags and bigger fonts. It would therefore be useful to have each <b>HTML tag as a feature</b> after running the pipeline. The regular expression below defines an HTML tag as a '<' followed by one or more lowercase letters, then any characters up to the first '>'. This expression will be used when making the tokenizer.</p>

In [11]:
html = r'\<[a-z]{1,}.*?\>' #'<' + one/more letters + arbitrary sequence of characters + ending in '>'
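<p>To see what this expression does and does not capture, here is a small check on a made-up HTML fragment (the sample string is an assumption, not taken from the dataset).</p>

```python
import re

html = r'\<[a-z]{1,}.*?\>'  # same expression as above

# Hypothetical fragment mixing opening, closing and uppercase tags.
sample = '<font color="blue">BUY NOW</font> <BR> <b>free</b>'

html_matches = re.findall(html, sample)
print(html_matches)
```

<p>Only opening tags whose name starts with a lowercase letter are matched; closing tags such as &lt;/font&gt; and uppercase tags such as &lt;BR&gt; are skipped.</p>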

<p><b>3.4 Additional features</b></p>
<p>Many spam mails contain money amounts (e.g. $12.50), numbers, and underscores for filling out a form (e.g. ___). These could also be distinct features, as their presence could be a clue whether the mail is spam or ham. The regular expressions below define patterns for money amounts, alphanumeric strings and underscores. The <b>pattern</b> combines these three with the HTML expression into a single one.</p>

In [12]:
money = r'\$[0-9]{1,}?\.?[0-9]{1,}' #$ + one/more digits + optional '.' + one/more digits (at least two digits in total, so a lone "$5" is not matched)
alpha = r'[a-zA-Z0-9]{2,}' #two or more lowercase,uppercase letters and/or numbers
underscores = r'\_{1,}' #one or more consecutive underscores
pattern = re.compile(alpha + '|' + money + '|' + html + '|' + underscores) #combining them

<p><b>3.5 Building the tokenizer</b></p>
<p>The <b>tokenize</b> method takes in a <b>text</b>, uses the regular expression <b>pattern</b> and the string <b>discard_chars</b>, and returns the <b>list of tokens</b>.</p><p>In the first step, every character listed in discard_chars is replaced by a space. The second step performs the tokenization: it creates a list containing every token that matches the pattern regular expression.</p>

In [13]:
discard_chars = '(),!.,?'

In [14]:
def tokenize(text):
    text = re.sub('[' + discard_chars + ']', ' ', text)
    return re.findall(pattern, text)
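<p>A quick way to check the tokenizer is to run it on a short made-up string. The block below repeats the definitions from the cells above so that it is self-contained; the sample sentence is an assumption used only for illustration.</p>

```python
import re

# Definitions repeated from the cells above.
html = r'\<[a-z]{1,}.*?\>'
money = r'\$[0-9]{1,}?\.?[0-9]{1,}'
alpha = r'[a-zA-Z0-9]{2,}'
underscores = r'\_{1,}'
pattern = re.compile(alpha + '|' + money + '|' + html + '|' + underscores)
discard_chars = '(),!.,?'

def tokenize(text):
    # Step 1: replace the discarded characters with spaces.
    text = re.sub('[' + discard_chars + ']', ' ', text)
    # Step 2: keep every substring that matches the combined pattern.
    return re.findall(pattern, text)

tokens = tokenize('WIN $15.00 now! Sign here: ____ <font color="blue">')
print(tokens)
```

<p>Because '.' is one of the discarded characters, the amount $15.00 is split into the tokens '$15' and '00'. This matches the feature list in the test output, where money amounts appear without decimals.</p>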

<p><b>3.6 Testing the new additions</b></p>
<p>Below we test the functionality of TfidfVectorizer using the parameters <b>stop_words</b>, <b>lowercase</b> and <b>tokenizer</b>, on the first 3 mails.</p>

In [15]:
test_pipeline = Pipeline([
    ("selector", DataFrameSelector('content')),
    ("vectorizer", TfidfVectorizer(stop_words='english', 
                                   lowercase=False,
                                   tokenizer=tokenize))
])
sample = mails[0:3]
test_pipeline.fit(sample)
X = test_pipeline.transform(sample)
test_pipeline.named_steps["vectorizer"].get_feature_names()

['$10',
 '$100',
 '$20',
 '$250',
 '$29',
 '$49',
 '$499',
 '$5844',
 '$715',
 '$795',
 '00',
 '000',
 '000000',
 '0000EE',
 '0100',
 '0102',
 '01C20036',
 '025e46c53a6a',
 '04',
 '0400',
 '046G9H4890',
 '05',
 '0500',
 '05602',
 '0700',
 '08',
 '0899',
 '09',
 '0E47D440E7',
 '10',
 '100',
 '101',
 '1027zbAU4',
 '103',
 '10391',
 '105',
 '11',
 '1100',
 '12',
 '127',
 '13',
 '130',
 '14',
 '140',
 '1493',
 '15',
 '1533',
 '1600',
 '161',
 '168',
 '17',
 '18',
 '180',
 '19',
 '192',
 '193',
 '1999',
 '20',
 '200',
 '2002',
 '2003marketing',
 '21',
 '212',
 '212580644306344345450286',
 '213',
 '2195',
 '22',
 '221',
 '222',
 '23',
 '24',
 '242',
 '25',
 '253',
 '254sWzM249l35',
 '28',
 '31',
 '32',
 '33CCFF',
 '35',
 '36',
 '38',
 '39',
 '3D',
 '3Dwindow',
 '3db56bb7',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '4NAK2L08LHSQ274W7VU',
 '50',
 '500',
 '52f',
 '55',
 '551A8B',
 '600',
 '64',
 '68',
 '731',
 '744CWZI4637aGiQ5',
 '760',
 '77',
 '7785',
 '800',
 '8424',
 '86',
 '8859',


<h1>4. Preparation for Classification</h1>

<p><b>4.1 Shuffle the dataset.</b> The emails were read in order, with all spam mails first and all ham mails after them. This can cause problems for K-Fold, since it does not shuffle the data by default. The code below puts the mails in a random order.</p>

In [16]:
mails = mails.take(np.random.permutation(len(mails)))
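<p>Since the reported accuracies depend on the shuffle, a seeded variant can make the runs repeatable. This is only a sketch: the seed value and the toy DataFrame are assumptions, not part of the original experiment.</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mails DataFrame (hypothetical contents).
toy = pd.DataFrame({'content': ['a', 'b', 'c', 'd'],
                    'type': ['spam', 'spam', 'ham', 'ham']})

def shuffle_rows(frame, seed):
    # A dedicated RandomState gives the same permutation for the same seed.
    rng = np.random.RandomState(seed)
    return frame.take(rng.permutation(len(frame)))

first = shuffle_rows(toy, seed=0)
second = shuffle_rows(toy, seed=0)
print(first.equals(second))
```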

<p><b>4.2 Encode the labels.</b> Currently, the types are either spam or ham, but they must be encoded as 0 and 1 using the LabelEncoder.</p>

In [17]:
mails['type'].unique()

array(['ham', 'spam'], dtype=object)

In [18]:
y = mails['type'].values
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_encoded

array([0, 0, 0, ..., 0, 0, 1])
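<p>To confirm which class gets which number, the encoder can be inspected on a tiny example. LabelEncoder sorts the class names, so ham becomes 0 and spam becomes 1. The four labels below are made up for illustration.</p>

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(['spam', 'ham', 'ham', 'spam'])

print(list(demo_encoder.classes_))  # position in this list = encoded value
print(list(codes))

# inverse_transform maps the numbers back to the original labels.
decoded = list(demo_encoder.inverse_transform([0, 1]))
print(decoded)
```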

<p><b>4.3 Dummy Classifier</b></p>
<p>The Dummy Classifier always predicts the most frequent class (ham). It serves as a baseline: its accuracy equals the share of the majority class, and any real classifier should do better than that.</p>

In [19]:
pipeline_dummy = Pipeline([
    ("selector", DataFrameSelector('content')),
    ("vectorizer", CountVectorizer(stop_words='english', 
                                   lowercase=False,
                                   tokenizer=tokenize)),
    ("estimator", DummyClassifier(strategy = "most_frequent"))
])
np.mean(cross_val_score(pipeline_dummy, mails, y_encoded, scoring="accuracy", cv=10))

0.56935926500417611
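<p>The baseline can also be checked by hand from the class counts found in section 2, since a most-frequent classifier is right exactly when the mail is ham.</p>

```python
# Class counts reported in section 2.
hams_count = 1650
spams_count = 1248

# Accuracy of always predicting the majority class.
baseline = max(hams_count, spams_count) / (hams_count + spams_count)
print(round(baseline, 4))
```

<p>The small difference from the cross-validated value comes from averaging over folds of slightly different sizes.</p>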

<h1>5 Classifiers</h1>
<h1>5.1 Classifier #1</h1>
<font size="4">Uses <b>CountVectorizer</b> and <b>StratifiedKFold</b>.</font>
<p>1. The content of each mail is selected by the DataFrameSelector.</p>
<p>2. CountVectorizer is used to tokenize the mails, each token will be a feature.</p>
<p>3. StratifiedKFold is used.</p>
    <ul>
        <li>Stratification is used so that each fold has the same proportion of spam and ham as the whole dataset; this keeps the estimate fair instead of depending on a lucky split.</li>
        <li>With Holdout, the training set might not be representative (i.e. some mails contain many clues that they are spam, while others sit between the two classes). Training only on mails that are obviously spam or obviously ham and testing on the borderline cases would not give a good result. Because of this, I decided to use StratifiedKFold to estimate the accuracy.</li>
        <li>10 folds. There are 2898 emails, which is a relatively large dataset. Each fold contains around 290 emails, which is enough; with a smaller dataset this would be a problem.</li>
    </ul>
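<p>The stratification argument can be illustrated with synthetic labels that have the same class counts as the dataset (1650 ham, 1248 spam). The feature matrix here is a placeholder of zeros, since only the labels matter for the split.</p>

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the dataset's class counts: 0 = ham, 1 = spam.
y_demo = np.array([0] * 1650 + [1] * 1248)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

skf = StratifiedKFold(n_splits=10)

# Share of spam in each test fold.
fractions = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X_demo, y_demo)]

print([round(f, 3) for f in fractions])
print(fold_sizes)
```

<p>Every fold keeps roughly the overall spam share of 43%, even though the synthetic labels are fully sorted by class.</p>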

In [20]:
pipeline1 = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", CountVectorizer(stop_words='english', 
                                   lowercase=False,
                                   tokenizer=tokenize)),
    ("estimator", LogisticRegression())
])

<b>Accuracy:</b> around 97-98%, depends on the shuffle

In [21]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline1, mails, y_encoded, scoring="accuracy", cv=kf))

0.97999045459968992

<b>Confusion Matrix:</b> The numbers of false positives (28 ham mails classified as spam) and false negatives (30 spam mails classified as ham) are balanced.

In [22]:
y_predicted_1 = cross_val_predict(pipeline1, mails, y_encoded, cv=10)
confusion_matrix(y_encoded, y_predicted_1)

array([[1622,   28],
       [  30, 1218]])
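<p>Accuracy, precision and recall for the spam class can be read off this matrix directly. The sketch below hard-codes the matrix printed above (rows are the true classes, columns the predicted ones, with ham = 0 and spam = 1).</p>

```python
import numpy as np

# Confusion matrix of Classifier #1, copied from the output above.
cm = np.array([[1622, 28],
               [30, 1218]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the mails flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the spam mails, how many were caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

<p>For a spam filter, false positives (ham lost to the spam folder) are usually the more costly error, so precision is worth tracking next to accuracy.</p>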

<h1>5.2 Classifier #2</h1>
<font size="4">Uses <b>TfidfVectorizer</b> and <b>StratifiedKFold</b>.</font>
<p>Exactly the same parameters and evaluation as in classifier #1.</p>

In [23]:
pipeline2 = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", TfidfVectorizer(stop_words='english', 
                                   lowercase=False,
                                   tokenizer=tokenize)),
    ("estimator", LogisticRegression())
])

<b>Accuracy:</b> around 95-96%, depending on the shuffle; the pipeline with CountVectorizer seems to perform slightly better.

In [24]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline2, mails, y_encoded, scoring="accuracy", cv=kf))

0.95825796444338385

In [25]:
y_predicted_2 = cross_val_predict(pipeline2, mails, y_encoded, cv=10)
confusion_matrix(y_encoded, y_predicted_2)

array([[1599,   51],
       [  70, 1178]])

<h1> 5.3. Classifier #3</h1>
<font size="4">Uses <b>CountVectorizer</b> and <b>Bigrams</b>.</font>
<p>Using word N-grams can be useful. Spam mails contain phrases like "BUY NOW!", "make more", "more money". Treating these n-grams as features might make detecting spam mails easier.</p>

<p>In this approach we still keep the previous patterns (i.e. HTML tags, money and underscores), but in addition we take into consideration <b>word bigrams (two consecutive words)</b>. The regular expression letters defines words of two or more characters containing only letters.</p>

In [26]:
letters = r'[a-zA-Z]{2,}'

<p>The method <b>nGramAnalyzer</b> takes the raw text, <b>finds all words defined by letters</b>, creates a list containing all the <b>consecutive words</b> separated by a whitespace, and adds to this list the html tags, money values and underscores. The <b>nGramAnalyzer_only_words</b> takes into account ONLY bigrams.</p>

In [27]:
# Non-word features kept alongside the bigrams; compiled once instead of per call.
extras_pattern = re.compile(money + '|' + html + '|' + underscores)

def nGramAnalyzer_only_words(text):
    # Word bigrams: every pair of consecutive letter-only words, joined by a space.
    words = re.findall(letters, text)
    return [words[i] + ' ' + words[i+1] for i in range(len(words) - 1)]

def nGramAnalyzer(text):
    # Word bigrams followed by the money amounts, HTML tags and underscore runs.
    return nGramAnalyzer_only_words(text) + re.findall(extras_pattern, text)

In [28]:
nGramAnalyzer("This, is just $30 a test __ to see that the analyzer works")

['This is',
 'is just',
 'just test',
 'test to',
 'to see',
 'see that',
 'that the',
 'the analyzer',
 'analyzer works',
 '$30',
 '__']

<p>The following pipeline tests that the analyzer works and we get html tags, money, underscores and consecutive words as features. For this test, I am testing only the first 3 emails from the dataset.</p>

In [29]:
test_pipeline_analyzer = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", CountVectorizer(lowercase=False, analyzer=nGramAnalyzer))
])
sample = mails[0:3]
test_pipeline_analyzer.fit(sample)
X = test_pipeline_analyzer.transform(sample)
test_pipeline_analyzer.named_steps["vectorizer"].get_feature_names()

['$40',
 '<a class="url" href="http://photoshopplugins.tripod.com/wallpampered.htm">',
 '<a class="url" href="http://www.madshrimps.com/index-e.php?action=webnews-e&l1=0&l2=30">',
 '<a class="url" href="http://www.moviegear.com/gmgdet.htm">',
 '<a class="url" href="http://www.yesiknow.com/lte/">',
 '<a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#102;&#101;&#101;&#100;&#98;&#97;&#99;&#107;&#64;&#108;&#111;&#99;&#107;&#101;&#114;&#103;&#110;&#111;&#109;&#101;&#46;&#99;&#111;&#109;?subject=Lockergnome Site Feedback">',
 '<a href="http://applecore.lockergnome.com/">',
 '<a href="http://digitalmedia.lockergnome.com/">',
 '<a href="http://lockergnome.pricegrabber.com/search_getprod.php/masterid=102088204">',
 '<a href="http://mp3.lockergnome.com">',
 '<a href="http://penguinshell.lockergnome.com/">',
 '<a href="http://techspecialist.lockergnome.com/">',
 '<a href="http://updates.lockergnome.com/">',
 '<a href="http://webmasterweekly.lockergnome.com/">',
 '<a href="http://windowsdaily.loc

<p><b>The pipelines for this Classifier</b></p>
<ul>
    <li>Selects the content of each mail</li>
    <li>Uses CountVectorizer, distinguishing between uppercase and lowercase words, and uses the newly created analyzer nGramAnalyzer or nGramAnalyzer_only_words.</li>
    <li>Performs Logistic Regression </li>
</ul>
<p>There are two pipelines. The first one takes into account html tags, money and underscores besides the bigrams, while the other one only focuses on bigrams</p>

In [30]:
pipeline3 = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", CountVectorizer(lowercase=False, analyzer=nGramAnalyzer)),
    ("estimator", LogisticRegression())
])
pipeline3_only_bigrams = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", CountVectorizer(lowercase=False, analyzer=nGramAnalyzer_only_words)),
    ("estimator", LogisticRegression())
])

<b>Accuracy - in case of nGramAnalyzer</b>

In [31]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline3, mails, y_encoded, scoring="accuracy", cv=kf))

0.97792148908244836

<b>Accuracy - in case of nGramAnalyzer_only_words</b>

In [32]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline3_only_bigrams, mails, y_encoded, scoring="accuracy", cv=kf))

0.97895716501610797

<h1> 5.4 Classifier #4</h1>
<font size="4">Uses <b>TfidfVectorizer</b>(stop_words, lowercase and tokenizer) and <b>SGDClassifier</b>.</font>
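<p>Another estimator we can use is the Stochastic Gradient Descent Classifier. With its default settings, SGDClassifier fits a linear model (a linear SVM, using the hinge loss) by updating the weights one training example at a time.</p>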
<p>The pipeline for this classifier:</p>
<ul>
    <li>Select the content of the mails using DataFrameSelector</li>
    <li>TFIDF is used because SGDClassifier works better if the features are scaled. With CountVectorizer the features are word frequencies, which are non-negative integers (not scaled). With Tfidf the feature values are in the range 0 to 1. With CountVectorizer and no feature scaling, the accuracy was lower.</li>
    <li>SGDClassifier: it stops after max_iter iterations, or when the improvement in the loss is smaller than the tolerance tol. The learning rate is left at its default value.</li>
</ul>

In [33]:
pipeline4 = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", TfidfVectorizer(stop_words='english', 
                                   lowercase=False,
                                   tokenizer=tokenize)),
    ("estimator", SGDClassifier(max_iter=500, tol=1e-4))
])
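<p>The scaling argument for choosing TF-IDF here can be checked on a toy corpus. The three short documents below are made up, and both vectorizers are used with their default settings rather than the custom tokenizer.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, only for comparing value ranges.
docs = ["free money free money free",
        "meeting notes attached",
        "free meeting"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

max_count = counts.max()   # raw frequencies grow with document length
max_tfidf = tfidf.max()    # TF-IDF values stay within [0, 1]

# With the default L2 normalisation every document vector has length 1.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

print(max_count, round(float(max_tfidf), 3), row_norms)
```

<p>Raw counts have no upper bound, while the TF-IDF rows are normalised, which is the kind of input gradient-based estimators such as SGDClassifier tend to handle better.</p>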

In [34]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline4, mails, y_encoded, scoring="accuracy", cv=kf))

0.98206299964204735

In [35]:
y_predicted_4 = cross_val_predict(pipeline4, mails, y_encoded, cv=10)
confusion_matrix(y_encoded, y_predicted_4)

array([[1630,   20],
       [  30, 1218]])

<h1> 5.5 Classifier #5</h1>
<font size="4">Uses <b>TfidfVectorizer, Bigrams</b> and <b>SGDClassifier</b>.</font>

In [36]:
pipeline5 = Pipeline([
    ("selector", DataFrameSelector("content")),
    ("vectorizer", TfidfVectorizer(lowercase=False, analyzer=nGramAnalyzer_only_words)),
    ("estimator", SGDClassifier(max_iter=200, tol=1e-4))
])

In [37]:
kf = StratifiedKFold(n_splits = 10)
np.mean(cross_val_score(pipeline5, mails, y_encoded, scoring="accuracy", cv=kf))

0.98171817205584055

In [38]:
y_predicted_5 = cross_val_predict(pipeline5, mails, y_encoded, cv=10)
confusion_matrix(y_encoded, y_predicted_5)

array([[1623,   27],
       [  28, 1220]])

<h1> 6 References</h1>

<p>[1] Reading all the text files from directory, stackoverflow.com: https://stackoverflow.com/questions/37534141/reading-all-the-text-files-from-directory</p>
<p>[2] Python Regular Expressions tutorial, tutorialspoint.com: https://www.tutorialspoint.com/python/python_reg_expressions.htm</p>
<p>[3] CountVectorizer, scikit-learn.org: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html</p>
<p>[4] TfidfVectorizer, scikit-learn.org: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html</p>
<p>[5] Stochastic Gradient Descent Classifier, scikit-learn.org: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html</p>
<p>[6] 4.2.3.3. Common Vectorizer usage, scikit-learn.org: http://scikit-learn.org/stable/modules/feature_extraction.html</p>
<p>[7] pandas.DataFrame, pandas.pydata.org: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html </p>