<h1> SPAM FILTER </h1>
<p> The purpose of this assignment is to construct and evaluate one or more spam filters (classifiers). </p>

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

<h1> Imports </h1>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import KFold

from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_val_predict

from sklearn.metrics import confusion_matrix

<h1> Data exploration and preparation </h1>
<p> Our emails are split across two folders, spam and ham, so the dataset is already labeled. All that remains is to load it into a convenient structure, but first we should explore the data a little.</p>


In [4]:
import os

#create two paths from where the emails will be read
path_spam = os.getcwd() + '/spam'
path_ham = os.getcwd() + '/ham'

#create a list of tuples where a tuple will contain an email and a label
#emails=[(email,label)]
emails = []
emails_original = []

<p>Looking through the emails, we notice that most of them start with a header that carries metadata about the email: when it was received, from whom, and so on. I built two data structures: one holds the original emails, and the other holds only the content of the emails without the header information. I did this because header words appear in most of the documents and do little to separate spam from ham. We are more interested in the content of the emails. </p>

<h1>Reading and cleaning the dataset </h1>

<p>Here I build the two data structures: one holding the original emails and one holding only the content of each email.</p>

In [5]:
#a line is dropped if it contains any of these markers (mostly header field names)
words_to_delete = ['From','Received','Return-Path','Delivered-To','Message-Id','Date','To','MIME-Version','Sender',\
                   'Errors-To','Reply-To','References','List-','Precedence','X-','Content-','\t','Cc','User-Agent',\
                  'Message-ID','    ','Irish Linux Users','List maintainer','Organization','Thread-','Sent', 'Importance']

#reading the spam emails from the directory
#I label the spam emails with one, because they represent the positive class
#The positive class is the class which we try to identify
for filename in sorted(os.listdir(path_spam)):
    #the with-block closes the file even if reading fails
    with open(os.path.join(path_spam, filename), "r") as file:
        lines = file.readlines()
    email_original = ""
    email = ""
    #keep only the lines which do not contain any of the markers listed above
    for line in lines:
        if not any(word_to_delete in line for word_to_delete in words_to_delete):
            email+=line
        email_original+=line
    #append the new email as a tuple to the list
    emails.append((email,1))
    #append the original email as a tuple to the list
    emails_original.append((email_original,1))
    
#reading the ham emails from the directory
#I label the ham emails with zero, because they represent the negative class
for filename in sorted(os.listdir(path_ham)):
    #the with-block closes the file even if reading fails
    with open(os.path.join(path_ham, filename), "r") as file:
        lines = file.readlines()
    email_original = ""
    email = ""
    #keep only the lines which do not contain any of the markers listed above
    for line in lines:
        if not any(word_to_delete in line for word_to_delete in words_to_delete):
            email+=line
        email_original+=line
    #append the new email as a tuple to the list
    emails.append((email,0))
    #append the original email as a tuple to the list
    emails_original.append((email_original,0))
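
<p>The filter above drops a line when a marker occurs anywhere in it. A stricter variant, sketched below, drops a line only when it starts with a marker. The function name strip_header_lines and the sample text are hypothetical and only illustrate the idea; this is not the code that produced the results in this notebook.</p>

```python
def strip_header_lines(text, markers):
    """Return text without the lines that start with one of the markers."""
    prefixes = tuple(markers)  # str.startswith accepts a tuple of prefixes
    kept = [line for line in text.splitlines(keepends=True)
            if not line.startswith(prefixes)]
    return "".join(kept)


# Hypothetical sample email, made up for illustration
sample = ("From: someone@example.com\n"
          "Subject: Hello\n"
          "\n"
          "To be honest, this body line mentions To and Date.\n")

cleaned = strip_header_lines(sample, ["From", "Date", "Received"])
print(cleaned)
```

<p>With the substring test, the last line of the sample would be removed as well, because it contains the word Date. The prefix test keeps it.</p>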


In [6]:
#Let's see what the cleaned emails look like now
print(emails[0])


('Subject: [ILUG] STOP THE MLM INSANITY\n\nGreetings!\n\nYou are receiving this letter because you have expressed an interest in \nreceiving information about online business opportunities. If this is \nerroneous then please accept my most sincere apology. This is a one-time \nmailing, so no removal is necessary.\n\nIf you\'ve been burned, betrayed, and back-stabbed by multi-level marketing, \nMLM, then please read this letter. It could be the most important one that \nhas ever landed in your Inbox.\n\nMULTI-LEVEL MARKETING IS A HUGE MISTAKE FOR MOST PEOPLE\n\nMLM has failed to deliver on its promises for the past 50 years. The pursuit \nof the "MLM Dream" has cost hundreds of thousands of people their friends, \ntheir fortunes and their sacred honor. The fact is that MLM is fatally \nflawed, meaning that it CANNOT work for most people.\n\nThe companies and the few who earn the big money in MLM are NOT going to \ntell you the real story. FINALLY, there is someone who has the courage to

In [7]:
#Let's have a look at the original mail
print(emails_original[0])



In [8]:
#Build a dataframe of emails 
#The purpose is to work with numpy arrays
labels = ['email','label']
df = pd.DataFrame.from_records(emails, columns=labels)
df

Unnamed: 0,email,label
0,Subject: [ILUG] STOP THE MLM INSANITY\n\nGreet...,1
1,"Subject: Real Protection, Stun Guns! Free Shi...",1
2,"Subject: New Improved Fat Burners, Now With TV...",1
3,"Subject: New Improved Fat Burners, Now With TV...",1
4,"Subject: Never Repay Cash Grants, $500 - $50,0...",1
5,"(SMTPD32-7.10) id A2E8640144; Tue, 23 Jul 20...",1
6,Subject: New Product Announcement\n\nNEW PRODU...,1
7,Subject: FW:\n\n\n<HTML>\n<BODY bgColor=3D#C0C...,1
8,Subject: [SA] URGENT HELP..............\n\n--=...,1
9,Subject: Your Agent in Saudi Arabia.\n\n \n\nC...,1


<p>As you can see, I kept the subject line of each email because it carries information about the content.</p>

In [9]:
#Build a dataframe of original emails
labels = ['email','label']
df_original = pd.DataFrame.from_records(emails_original, columns=labels)
df_original

Unnamed: 0,email,label
0,From ilug-admin@linux.ie Tue Aug 6 11:51:02 ...,1
1,From lmrn@mailexcite.com Mon Jun 24 17:03:24 ...,1
2,From amknight@mailexcite.com Mon Jun 24 17:03...,1
3,From jordan23@mailexcite.com Mon Jun 24 17:04...,1
4,From merchantsworld2001@juno.com Tue Aug 6 1...,1
5,Received: from hq.pro-ns.net (localhost [127.0...,1
6,From sales@outsrc-em.com Mon Jun 24 17:53:15 ...,1
7,From ormlh@imail.ru Sun Jul 15 04:56:31 2001\...,1
8,From spamassassin-sightings-admin@lists.source...,1
9,From sathar@amtelsa.com Mon Jun 24 17:40:14 2...,1


In [10]:
#Let's look at the emails in more detail
df.describe(include='all')

Unnamed: 0,email,label
count,2898,2898.0
unique,2825,
top,\n\nHello I am your hot lil horny toy.\n\n\n\n...,
freq,7,
mean,,0.430642
std,,0.495252
min,,0.0
25%,,0.0
50%,,0.0
75%,,1.0


In [11]:
#The dataframes were built with all spam emails first and all ham emails after them, so I shuffle the rows.
df = df.take(np.random.permutation(len(df)))
df.reset_index(inplace=True, drop=True)

df_original = df_original.take(np.random.permutation(len(df_original)))
df_original.reset_index(inplace=True, drop=True)

In [85]:
#Prepare the actual values
y = df['label'].values
y_original = df_original['label'].values

In [86]:
#Prepare the data which will be processed
X = df['email'].values
X_original = df_original['email'].values

<h1>Classifiers </h1>
<p>Here I begin to create different pipelines. All I try to do is to play with different parameters and compare between these pipelines.</p>
<p> I decided to use stratified K-fold cross-validation because it keeps the class proportions roughly the same in every fold. Passing cv=10 to scikit-learn's cross-validation helpers together with a classifier produces stratified folds. For K I use 10: there are 2898 emails, so each fold holds approximately 290 of them, which leaves plenty of examples for both training and testing. </p>
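
<p>As a rough illustration of what stratification does, the sketch below splits a label vector with the class sizes of this dataset (1650 ham, 1248 spam) into 10 folds and counts the classes in each test fold. The names X_demo and y_demo are placeholders introduced for this example; the features are dummy zeros because only the labels influence the split.</p>

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Label vector with the class sizes of this dataset: 1650 ham (0) and 1248 spam (1)
y_demo = np.array([0] * 1650 + [1] * 1248)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=10)
fold_counts = []
for _, test_idx in skf.split(X_demo, y_demo):
    ham = int(np.sum(y_demo[test_idx] == 0))
    spam = int(np.sum(y_demo[test_idx] == 1))
    fold_counts.append((ham, spam))

for i, (ham, spam) in enumerate(fold_counts):
    print("fold", i, "ham:", ham, "spam:", spam)
```

<p>Every test fold should contain nearly the same number of ham and spam emails, so no fold is dominated by one class.</p>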
<p>To compare them I used also different performance measures.</p>
<ul>
    <li>Accuracy: the ratio of the number of correct predictions to the number of predictions made. I used it because it measures the overall correctness of the classifier.</li>
    <li>Precision: the fraction of positive predictions that are correct. In our case, the fraction of messages classified as spam that are actually spam.</li>
    <li>Recall: the ability of the classifier to find all the positive examples. In our case, the fraction of spam messages that were correctly classified as spam.</li>
    <li>F1: the harmonic mean of the precision and recall scores. It penalizes classifiers with imbalanced precision and recall scores.</li>
    <li>Confusion matrix: I used it to get a clearer view of the number of correct and incorrect predictions made.</li>
</ul>
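
<p>The measures listed above can all be derived from the four counts of a confusion matrix. The following is a minimal sketch with spam as the positive class; the helper name classification_metrics is hypothetical, and the counts are taken from one of the confusion matrices reported later in this notebook.</p>

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything flagged as spam, how much is spam
    recall = tp / (tp + fn)      # of all spam, how much was flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Counts with spam as the positive class
accuracy, precision, recall, f1 = classification_metrics(tn=1594, fp=56, fn=70, tp=1178)
print("accuracy:", round(accuracy, 4))
print("precision:", round(precision, 4))
print("recall:", round(recall, 4))
print("f1:", round(f1, 4))
```

<p>Note that these pooled values can differ slightly from the mean of the per-fold scores that cross_val_score returns.</p>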
<p> Also, I have decided to use only these combinations for pipelines: TfidfVectorizer + LogisticRegression and
CountVectorizer + LogisticRegression because I want to demonstrate that they can be improved by playing only with their parameters. </p>

<p> Source for the performance measures: Mastering Machine Learning with scikit-learn, author Gavin Hackeling and <a href = "http://scikit-learn.org/stable/modules/model_evaluation.html"> sklearn documentation. </a> </p>

<p> Also, we can consider the Majority-Class Classifier which always predicts the majority class. </p>
<p> I used it as a baseline that I can compare against when evaluating a classifier. </p>

In [14]:
#Create the majority classifier 
emails_pipeline_dummy = Pipeline([("vectorizer", TfidfVectorizer(stop_words = 'english')),
                           ("estimator", DummyClassifier(strategy="most_frequent"))])

In [15]:
np.mean(cross_val_score(emails_pipeline_dummy, X, y, scoring='accuracy', cv=10))

0.56935926500417611
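
<p>The baseline score is simply the share of the majority class. A small sketch using the class sizes of this dataset (1650 ham, 1248 spam); the variable names are introduced here for illustration only.</p>

```python
from collections import Counter

# Label counts of this dataset: 1650 ham (0) and 1248 spam (1)
labels_demo = [0] * 1650 + [1] * 1248

# The most frequent label and how often it occurs
majority_label, majority_count = Counter(labels_demo).most_common(1)[0]
baseline_accuracy = majority_count / len(labels_demo)
print("majority label:", majority_label)
print("baseline accuracy:", baseline_accuracy)
```

<p>Any useful classifier has to beat this value, which is about 0.569 here.</p>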

<h1>TfidfVectorizer pipelines</h1>

In [16]:
#Create the pipeline for the original emails
emails_pipeline_original = Pipeline([("vectorizer", TfidfVectorizer()),
                           ("estimator", LogisticRegression())])

In [17]:
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_original, X_original, y_original, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_original, X_original, y_original, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_original, X_original, y_original, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_original, X_original, y_original, scoring='f1', cv=10)))

Accuracy score: 0.956531440162
Precision score: 0.955034819542
Recall score: 0.943948387097
F1 score: 0.949229720177


In [18]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_original, X_original, y_original, cv=10)
confusion_matrix(y_original, y_predicted)

array([[1594,   56],
       [  70, 1178]])
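
<p>The evaluation cells in this notebook call cross_val_score once per measure, which refits the pipeline for each of them. scikit-learn's cross_validate accepts several scorers in a single run. The sketch below shows the idea on a tiny made-up corpus (toy_emails, toy_labels and toy_pipeline are placeholders, not the notebook's data), so the printed numbers are not meaningful in themselves.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch self-contained
toy_emails = ["win free money now", "free prize claim now",
              "cheap pills free offer", "win cash prize today",
              "meeting moved to monday", "lunch at noon tomorrow",
              "project report attached", "see you at the meeting"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                         ("estimator", LogisticRegression())])

# One cross-validation run that records all four measures for every fold
metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(toy_pipeline, toy_emails, toy_labels,
                        scoring=metric_names, cv=2)

for name in metric_names:
    print(name, scores["test_" + name].mean())
```

<p>With the real data, the same call with cv=10 would replace the four separate cross_val_score calls of each evaluation cell.</p>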

In [19]:
#Create the pipeline for the emails with the header lines removed
emails_pipeline_simple = Pipeline([("vectorizer", TfidfVectorizer()),
                           ("estimator", LogisticRegression())])

In [20]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_simple, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_simple, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_simple, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_simple, X, y, scoring='f1', cv=10)))


Accuracy score: 0.966525474287
Precision score: 0.963241687227
Recall score: 0.959129032258
F1 score: 0.961068666175


In [21]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_simple, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1604,   46],
       [  51, 1197]])

<p> As we can see, deleting the header information gives a better result. </p>

<p>I took a look at the TfidfVectorizer and LogisticRegression parameters.
   Here are the links:
   <a href ="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">LogisticRegression</a>
   <a href ="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TfidfVectorizer</a>
</p>
<p> I tried to understand the role of each of these parameters, and I found some interesting things that make my classifier better. </p>

In [22]:
#discard stop-words: common words such as "a", "as", "the" etc., which don't help us with spam detection
emails_pipeline_sw = Pipeline([("vectorizer", TfidfVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression())])

In [23]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw, X, y, scoring='f1', cv=10)))


Accuracy score: 0.958933301515
Precision score: 0.963231946437
Recall score: 0.940696774194
F1 score: 0.951752306589


In [24]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_sw, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1605,   45],
       [  74, 1174]])

<p>As we can see, the number of true negatives barely changes from the previous pipeline, but the number of true positives decreases and the number of false negatives increases. This means that the classifier now predicts more spams as hams, so more spam gets through the filter. The number of false positives, the costlier error for a spam filter because an important email could be predicted as spam, stays about the same.</p>
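
<p>To read the confusion matrices in this notebook: scikit-learn puts the true classes on the rows and the predicted classes on the columns, so with ham = 0 and spam = 1 the layout is [[TN, FP], [FN, TP]]. A short sketch with made-up labels (y_true_demo and y_pred_demo are placeholders):</p>

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = spam (positive class), 0 = ham (negative class)
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_demo, y_pred_demo).ravel())
print("ham predicted as ham (TN):", tn)
print("ham predicted as spam (FP):", fp)
print("spam predicted as ham (FN):", fn)
print("spam predicted as spam (TP):", tp)
```

<p>A false positive is a ham email sent to the spam folder, and a false negative is a spam email that reaches the inbox.</p>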
<p>Now, let's try to improve this.</p>
<p>Let's look at the other parameters.</p>

In [25]:
#let's change the solver (the optimization algorithm). The default in the scikit-learn version used here is liblinear (newer releases default to lbfgs).
emails_pipeline_ncg = Pipeline([("vectorizer", TfidfVectorizer(stop_words = 'english')),
                           ("estimator", LogisticRegression(solver="newton-cg"))])

In [26]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg, X, y, scoring='f1', cv=10)))


Accuracy score: 0.958933301515
Precision score: 0.963231946437
Recall score: 0.940696774194
F1 score: 0.951752306589


In [27]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_ncg, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1605,   45],
       [  74, 1174]])

<p>It doesn't change anything.</p>

In [28]:

emails_pipeline_lbfgs = Pipeline([("vectorizer", TfidfVectorizer(stop_words = 'english')),
                           ("estimator", LogisticRegression(solver="lbfgs"))])

In [29]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_lbfgs, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_lbfgs, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_lbfgs, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_lbfgs, X, y, scoring='f1', cv=10)))

Accuracy score: 0.958933301515
Precision score: 0.963231946437
Recall score: 0.940696774194
F1 score: 0.951752306589


In [30]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_lbfgs, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1605,   45],
       [  74, 1174]])

<p>Same, it doesn't change anything.</p>

In [31]:
emails_pipeline_sag = Pipeline([("vectorizer", TfidfVectorizer(stop_words = 'english')),
                           ("estimator", LogisticRegression(solver="sag"))])

In [32]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_sag, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_sag, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_sag, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_sag, X, y, scoring='f1', cv=10)))

Accuracy score: 0.958933301515
Precision score: 0.963231946437
Recall score: 0.940696774194
F1 score: 0.951752306589


In [33]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_sag, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1605,   45],
       [  74, 1174]])

<p>Same, it doesn't change anything.</p>
<p>As we can see, changing the solver does not help much.</p>

In [34]:
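#Let's try to change the penalty (the norm used for regularization).
#http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
#the l1 penalty needs a solver that supports it, so liblinear is set explicitly
emails_pipeline_penalty = Pipeline([("vectorizer", TfidfVectorizer(stop_words = 'english')),
                           ("estimator", LogisticRegression(penalty="l1", solver="liblinear"))])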

In [35]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_penalty, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_penalty, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_penalty, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_penalty, X, y, scoring='f1', cv=10)))

Accuracy score: 0.939954659349
Precision score: 0.932476186381
Recall score: 0.927870967742
F1 score: 0.929712958964


In [36]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_penalty, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1565,   85],
       [  90, 1158]])

<p>It doesn't perform better, and it doesn't decrease the number of false negatives either. In this case it does worse. </p>
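
<p>One practical difference between the two penalties is that l1 tends to push many coefficients to exactly zero, while l2 only shrinks them. The sketch below counts the non-zero coefficients on synthetic data; make_classification, the value C=0.1 and the variable names are choices made for this illustration and are unrelated to the email data.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration: 50 features, 5 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=50,
                                     n_informative=5, random_state=0)

# liblinear supports both penalties, so only the penalty differs between the models
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_demo, y_demo)

l1_nonzero = int(np.count_nonzero(l1_model.coef_))
l2_nonzero = int(np.count_nonzero(l2_model.coef_))
print("non-zero coefficients with l1:", l1_nonzero)
print("non-zero coefficients with l2:", l2_nonzero)
```

<p>Discarding features in this way may be one reason the l1 pipeline scores lower here, although that is a guess and not something this notebook tests.</p>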

In [37]:
#max_df - discard terms that appear in more than 50% of the documents
#ngram_range - extract different n-grams, in our case unigrams and bigrams
#for example, "to be" is a bigram
#I combined these two parameters because I want to see what happens if I do not discard the stop-words
#and instead let them carry meaning together with the neighbouring words.
emails_pipeline_params =  Pipeline([("vectorizer", TfidfVectorizer(max_df=0.5, ngram_range=(1,2))),
                           ("estimator", LogisticRegression())])

In [38]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_params, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_params, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_params, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_params, X, y, scoring='f1', cv=10)))

Accuracy score: 0.962386350078
Precision score: 0.964455755184
Recall score: 0.947922580645
F1 score: 0.955955195202


In [39]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_params, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1606,   44],
       [  65, 1183]])

<p>The conclusion is that this combination of parameters doesn't make a big difference in the score.</p>
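
<p>To see what max_df and ngram_range do to the vocabulary, here is a small sketch on three made-up documents (toy_docs and toy_vectorizer are placeholders). CountVectorizer is used because it shares these two parameters with TfidfVectorizer.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; "free" occurs in two of the three
toy_docs = ["free money now", "free offer today", "meeting at noon"]

# Keep unigrams and bigrams, drop terms found in more than half of the documents
toy_vectorizer = CountVectorizer(max_df=0.5, ngram_range=(1, 2))
toy_vectorizer.fit(toy_docs)

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
```

<p>The unigram "free" is dropped because it appears in more than half of the documents, while the bigram "free money" stays because it appears in only one.</p>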

<p>Now let's look at the C parameter of LogisticRegression.</p>
<p>The C parameter is the inverse of the regularization strength: the smaller C is, the stronger the regularization.</p>
<p>But what does that mean? I found sources that explain it better than I can: <a href = "https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel"> here </a> and <a href ="https://www.quora.com/What-is-regularization-in-machine-learning">  here </a>. What I understand is that it decides how much freedom the model has.</p>
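
<p>One way to make the "freedom" visible is to look at the size of the learned coefficients: with a small C they are shrunk towards zero, with a large C they are allowed to grow. A minimal sketch on synthetic data (the dataset, the chosen C values and the names X_demo, y_demo and coef_norms are assumptions made for this illustration):</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector learned for this C
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C =", C, "coefficient norm =", round(coef_norms[C], 3))
```

<p>Larger coefficients let the model fit the training data more closely, which can help, as in the cells that follow, but also raises the risk of overfitting.</p>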

In [40]:
emails_pipeline_c1 =  Pipeline([("vectorizer", TfidfVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression(C=10))])

In [41]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1, X, y, scoring='f1', cv=10)))

Accuracy score: 0.976533826512
Precision score: 0.978170193793
Recall score: 0.967141935484
F1 score: 0.972585894784


In [42]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c1, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1623,   27],
       [  41, 1207]])

<p>As we can see, the predictions are getting better.</p>
<p>Why? Compared with emails_pipeline_sw, the number of correct predictions increases because a larger C weakens the regularization, so the model tries to separate as many training instances as possible. Imagine two teams, the reds and the blues. Some of the reds
are in the blue team by mistake, and the same goes for the blues. Now I am trying to find a line that separates these teams better.</p>

In [43]:
emails_pipeline_c2 =  Pipeline([("vectorizer", TfidfVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression(C=100))])

In [44]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='f1', cv=10)))

Accuracy score: 0.979642047488
Precision score: 0.982260746495
Recall score: 0.97035483871
F1 score: 0.976212968561


In [45]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c2, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1628,   22],
       [  37, 1211]])

<p>Increasing C gives a better classifier.</p>

In [46]:
#Let's try a plain TfidfVectorizer() with the same value of C
emails_pipeline_c3 =  Pipeline([("vectorizer", TfidfVectorizer()),
                           ("estimator", LogisticRegression(C=100))])

In [47]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c3, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c3, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c3, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c3, X, y, scoring='f1', cv=10)))

Accuracy score: 0.981022551008
Precision score: 0.979274972871
Recall score: 0.976767741935
F1 score: 0.977951068263


In [48]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c3, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1624,   26],
       [  29, 1219]])

<p>It does a bit better when the stop-words are not discarded.</p>

In [49]:
#Let's introduce another parameter
#Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf) 
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
emails_pipeline_c2 =  Pipeline([("vectorizer", TfidfVectorizer(sublinear_tf=True)),
                           ("estimator", LogisticRegression(C=100))])

In [50]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c2, X, y, scoring='f1', cv=10)))

Accuracy score: 0.985164061568
Precision score: 0.982587468614
Recall score: 0.983174193548
F1 score: 0.982779762222


In [51]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c2, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1628,   22],
       [  21, 1227]])

<p>The classifier is getting better. The numbers of false negatives and false positives are now almost equal (in my run).</p>
<p>This parameter probably helps because the log function dampens the weight of terms that are repeated many times within a single email.</p>
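
<p>To get a feeling for the effect of the log function, the sketch below applies the sublinear scaling 1 + log(tf) to a few raw term counts. The helper name sublinear_tf is hypothetical; it only mirrors the formula quoted in the comment of the cell above.</p>

```python
import math


def sublinear_tf(tf):
    """Sublinear scaling: 1 + log(tf) for tf > 0, and 0 for absent terms."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


# Raw counts of a term within one email, chosen for illustration
for raw_count in [1, 2, 10, 100]:
    print(raw_count, "->", round(sublinear_tf(raw_count), 3))
```

<p>A word repeated 100 times no longer weighs 100 times as much as a word that occurs once, which limits the influence of emails that repeat the same term over and over.</p>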

In [52]:
#increase the value of C
emails_pipeline_c4 =  Pipeline([("vectorizer", TfidfVectorizer(sublinear_tf=True)),
                           ("estimator", LogisticRegression(C=100000))])

In [53]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c4, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c4, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c4, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c4, X, y, scoring='f1', cv=10)))

Accuracy score: 0.987577854671
Precision score: 0.988144451713
Recall score: 0.983174193548
F1 score: 0.985539501354


In [54]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c4, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1635,   15],
       [  21, 1227]])

<p> The new classifier decreases the number of false positives. This tells us more about our dataset: some hams are very similar to spams. We would need to do something else to handle those cases.</p>

In [55]:
#increase the value of C again
emails_pipeline_c5 =  Pipeline([("vectorizer", TfidfVectorizer(sublinear_tf=True)),
                           ("estimator", LogisticRegression(C=10000000))])

In [56]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c5, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c5, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c5, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c5, X, y, scoring='f1', cv=10)))

Accuracy score: 0.987577854671
Precision score: 0.988144451713
Recall score: 0.983174193548
F1 score: 0.985539501354


In [57]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c5, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1635,   15],
       [  21, 1227]])

<p>I think we should stop increasing C here, since the scores no longer change.</p>

<h1>CountVectorizer pipelines</h1>
<p> Now we will look at some pipelines that are similar to those seen so far, but use CountVectorizer instead of TfidfVectorizer.</p>
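
<p>The main difference between the two vectorizers is that CountVectorizer keeps raw term counts, while TfidfVectorizer also multiplies each count by an inverse document frequency (idf) weight and normalizes each row. The sketch below evaluates the smoothed idf formula that scikit-learn documents as its default; the helper name smooth_idf and the document frequencies are made up for illustration.</p>

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_docs = 2898  # number of emails in this dataset
for df_t in [1, 29, 290, 2898]:  # made-up document frequencies
    print("df =", df_t, "idf =", round(smooth_idf(n_docs, df_t), 3))
```

<p>Terms that occur in almost every email get a weight close to 1, while rare terms are boosted. CountVectorizer applies no such weighting.</p>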

In [58]:
emails_pipeline_cv_original = Pipeline([("vectorizer", CountVectorizer()),
                           ("estimator", LogisticRegression())])

In [59]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv_original, X_original, y_original, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv_original, X_original, y_original, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv_original, X_original, y_original, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv_original, X_original, y_original, scoring='f1', cv=10)))

Accuracy score: 0.979646820189
Precision score: 0.976299491394
Recall score: 0.976787096774
F1 score: 0.976401048836


In [60]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_cv_original, X_original, y_original, cv=10)
confusion_matrix(y_original, y_predicted)

array([[1620,   30],
       [  29, 1219]])

<p> Without deleting the header information, it does better than the corresponding TfidfVectorizer pipeline. </p>

In [61]:
#simple pipeline, without parameters
emails_pipeline_cv = Pipeline([("vectorizer", CountVectorizer()),
                           ("estimator", LogisticRegression())])

In [62]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.97999284095
Precision score: 0.976315096997
Recall score: 0.977574193548
F1 score: 0.976834748508


In [63]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1620,   30],
       [  28, 1220]])

<p>As we can see, it performs better than the default TfidfVectorizer pipeline, and the numbers of false negatives and false positives are approximately the same. It also does slightly better than the previous pipeline.</p>

In [64]:
#let's drop the stop-words
emails_pipeline_sw1 = Pipeline([("vectorizer", CountVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression())])

In [65]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw1, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw1, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw1, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_sw1, X, y, scoring='f1', cv=10)))

Accuracy score: 0.974467247345
Precision score: 0.968116092839
Recall score: 0.973548387097
F1 score: 0.970541904852


In [66]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_sw1, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1609,   41],
       [  33, 1215]])

<p>The scores drop only slightly, but the number of false negatives (33) is now lower than the number of false positives (41). It predicts more spams as hams, which is not so bad.</p>
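<p>To make the link between the confusion matrix and the scores explicit, here is a small worked example using the matrix printed above. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes, with class 1 treated as the positive class. Which of spam or ham is encoded as 1 is not visible in this part of the notebook, so the code only speaks of "positive" and "negative".</p>

```python
import numpy as np

# Confusion matrix of the stop-words pipeline, copied from the output above.
# Rows: true class 0, true class 1. Columns: predicted class 0, predicted class 1.
cm = np.array([[1609, 41],
               [33, 1215]])

# For a 2x2 matrix, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of true positives that are found
accuracy = (tp + tn) / cm.sum() # share of all emails classified correctly
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall:", recall)
print("accuracy:", accuracy)
print("f1:", f1)
```

<p>These pooled values can differ slightly from the scores printed by <code>cross_val_score</code>, because those are means of ten per-fold scores rather than one score over all predictions.</p>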

In [67]:
#let's try another solver (newton-cg)
emails_pipeline_ncg_cv = Pipeline([("vectorizer", CountVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression(solver='newton-cg'))])

In [68]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_ncg_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.973084357475
Precision score: 0.967336145529
Recall score: 0.971129032258
F1 score: 0.968884328241


In [69]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_ncg_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1608,   42],
       [  36, 1212]])

<p>As we saw with TfidfVectorizer, changing the solver makes practically no difference.</p>

In [70]:
#pipeline using count vectorizer with max_df and unigrams + bigrams (stop words are kept)
emails_pipeline_params_cv = Pipeline([("vectorizer", CountVectorizer(max_df=0.5, ngram_range=(1,2))),
                           ("estimator", LogisticRegression())])

In [71]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.977575468321
Precision score: 0.969372849003
Recall score: 0.979174193548
F1 score: 0.974134290102


In [72]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_params_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1611,   39],
       [  26, 1222]])

<p>It does a bit better than when I discard the stop words. Probably some of these words carry meaning that helps us separate spam from ham.</p>
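<p>To see what <code>max_df=0.5</code> and <code>ngram_range=(1,2)</code> do to the vocabulary, here is a minimal sketch on a made-up four-document corpus (the documents are an assumption for illustration only). A term that occurs in more than half of the documents is dropped, and word pairs are added as extra features.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "hello" occurs in 3 of the 4 documents.
toy_docs = [
    "hello free money",
    "hello meeting today",
    "hello money transfer",
    "meeting notes today",
]

# Same vectorizer parameters as in the pipeline above.
vectorizer = CountVectorizer(max_df=0.5, ngram_range=(1, 2))
vectorizer.fit(toy_docs)

# vocabulary_ maps each kept term (unigram or bigram) to a column index.
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

<p>The unigram "hello" is removed because it appears in more than 50% of the documents, while "money" (exactly 50%) and bigrams such as "free money" are kept.</p>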

In [73]:
#pipeline using count vectorizer with weaker regularization (C=10)
emails_pipeline_c_cv = Pipeline([("vectorizer", CountVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression(C=10))])

In [74]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.973429185061
Precision score: 0.967167618228
Recall score: 0.971935483871
F1 score: 0.969299193884


In [75]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1608,   42],
       [  35, 1213]])

<p>Increasing C does not help here as much as it did with TfidfVectorizer.</p>
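<p>In scikit-learn's <code>LogisticRegression</code>, C is the inverse of the regularization strength, so a larger C means a weaker penalty on the coefficients. The sketch below illustrates this on a synthetic dataset (the data and the chosen C values are assumptions, not the email data): the norm of the learned coefficients grows as C grows.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy two-class data used only for illustration (assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, flip_y=0.1,
                                   random_state=0)

coef_norms = {}
for C in (0.001, 0.1, 10.0):
    # Larger C = weaker L2 penalty, so the coefficients are shrunk less.
    model = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print("C =", C, "-> coefficient norm:", norm)
```

<p>With a very large C the model is almost unregularized, which can make it fit the training folds more closely without improving the cross-validated scores.</p>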

In [76]:

#pipeline using count vectorizer with very weak regularization (C=10000)
emails_pipeline_c1_cv = Pipeline([("vectorizer", CountVectorizer(stop_words='english')),
                           ("estimator", LogisticRegression(C=10000))])

In [77]:
#Performance measures
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_c1_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.970329316311
Precision score: 0.964671594346
Recall score: 0.96715483871
F1 score: 0.965659830653


In [78]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_c1_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1605,   45],
       [  41, 1207]])

In [79]:
emails_pipeline_params_c_cv = Pipeline([("vectorizer", CountVectorizer(max_df=0.5, ngram_range=(1,2))),
                           ("estimator", LogisticRegression(C=10000))])

In [80]:
print("Accuracy score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_c_cv, X, y, scoring='accuracy', cv=10)))
print("Precision score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_c_cv, X, y, scoring='precision', cv=10)))
print("Recall score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_c_cv, X, y, scoring='recall', cv=10)))
print("F1 score:",np.mean(cross_val_score\
                                 (emails_pipeline_params_c_cv, X, y, scoring='f1', cv=10)))

Accuracy score: 0.972051067892
Precision score: 0.959411051386
Recall score: 0.976767741935
F1 score: 0.967868544657


In [81]:
#Confusion matrix
y_predicted = cross_val_predict(emails_pipeline_params_c_cv, X, y, cv=10)
confusion_matrix(y, y_predicted)

array([[1598,   52],
       [  29, 1219]])

In [82]:
#without stratification
#I also tried K-Fold without stratification to better justify my decision to work only with stratified K-Fold.
kf = KFold(n_splits = 10)

np.mean(cross_val_score(emails_pipeline_simple, X, y, scoring='accuracy', cv=kf))

0.96687149504832348
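<p>When <code>cv=10</code> is passed to <code>cross_val_score</code> together with a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same class proportions as the whole dataset. A plain <code>KFold</code> without shuffling just cuts the data in order. The sketch below shows the difference on made-up labels that are sorted by class; whether the notebook's emails are ordered like this is an assumption.</p>

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by class: 60 of class 0 followed by 40 of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself


def positive_fraction_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y_toy[test_idx].mean())
            for _, test_idx in splitter.split(X_toy, y_toy)]


plain = positive_fraction_per_fold(KFold(n_splits=5))
stratified = positive_fraction_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

<p>With plain K-Fold some test folds contain a single class only, while the stratified folds all keep the overall 40% share of class 1.</p>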

<h1>Results </h1>
<ul>
    <li> With a simple pipeline (TfidfVectorizer or CountVectorizer + LogisticRegression), CountVectorizer does better than TfidfVectorizer. Probably not penalizing the words that appear in many documents helps us separate spam from ham.</li>
    <li> Also, with CountVectorizer the number of false negatives is lower than with TfidfVectorizer. I think some of the hams are similar to some of the spams: they probably share common words, which TfidfVectorizer penalizes.</li>
    <li> However, as we have seen, adding some parameters improved the TfidfVectorizer pipelines: they reach a score of 98%, which is better, and their numbers of false positives and false negatives are almost equal.</li>
    <li> Another idea was to delete the header lines that contain information about the email and keep only the content. In this case the scores are better than when these lines are kept. Why? I think we should compare only the content, because it is more relevant for spam detection.</li>
    <li> Overall, compared with the majority class classifier, all the classifiers have good precision.</li>
</ul>
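<p>For reference, the majority class baseline can be sketched from the class sizes alone. The row sums of the confusion matrices above give 1650 emails in class 0 and 1248 in class 1; the labels below are rebuilt from these counts, and the all-zero feature matrix is a placeholder, since <code>DummyClassifier</code> ignores the features.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Labels rebuilt from the confusion-matrix row sums (1650 and 1248).
y_counts = np.array([0] * 1650 + [1] * 1248)
X_placeholder = np.zeros((len(y_counts), 1))  # placeholder features (assumption)

# Always predicts the most frequent class, here class 0.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_predictions = baseline.predict(X_placeholder)

baseline_accuracy = baseline.score(X_placeholder, y_counts)
baseline_recall = recall_score(y_counts, baseline_predictions)

print("baseline accuracy:", baseline_accuracy)
print("baseline recall for class 1:", baseline_recall)
```

<p>The baseline is right for about 57% of the emails but never predicts class 1, which is why every pipeline above is a clear improvement over it.</p>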

<h1>Conclusions</h1>
<p>In conclusion, the last TfidfVectorizer pipeline does better than everything else I have tried so far.</p>
<p>I showed that tuning some parameter values and adding them to the pipeline helps us get a better pipeline.</p>