In this notebook I use the TREC 2007 Public Corpus, [available online](http://plg.uwaterloo.ca/~gvcormac/treccorpus07/). My goal is to classify e-mails as spam or non-spam (ham).

##First
First, I obtain the TREC07 corpus and extract it to a local folder. Then I read the index file, which lists each e-mail file together with its label ('class').

In [1]:
import pandas as pd
target = pd.read_csv('./trec07Data/data/index', sep=' ', names = ['class', 'file'])
print("How many e-mails are available? - %i e-mails"%len(target))

How many e-mails are available? - 75419 e-mails


In [3]:
import numpy as np
def printSpamHamStat():
    # Count both the lower-case and the capitalised spelling of each label.
    spam = target['class'].isin(['spam', 'Spam']).sum()
    ham = target['class'].isin(['ham', 'Ham']).sum()

    print("How many spam e-mails?  - %i "% spam)
    print("How many non-spam e-mails?  - %i "% ham)
    
printSpamHamStat()

How many spam e-mails?  - 50199 
How many non-spam e-mails?  - 25220 


##Second

I build the e-mail data set by reading every file listed in the index from the specified folder. The raw content of each e-mail is stored in __emails__. Files that raise a __UnicodeDecodeError__ are skipped, and their positions are recorded so that the corresponding rows can be dropped from __target__ before further processing.

TODO:  
- revise the code by finding the special symbols that cause the UnicodeDecodeError. 
- is there a less time-consuming way to read the files?

In [4]:
emails = []
indexL = []  # positions of e-mails that could not be decoded as UTF-8

for index, f in enumerate(target.file):
    # f[3:] drops the first three characters of the path stored in the index
    fileLocation = './trec07Data/' + f[3:]
    try:
        # the context manager closes the file even if reading fails
        with open(fileLocation, 'r', encoding='utf8') as emailFile:
            emails.append(emailFile.read())
    except UnicodeDecodeError:
        indexL.append(index)

count = len(indexL)

target.drop(target.index[indexL], inplace=True)

print("What is the amount of dropped e-mails?  - %i"%count)
print("What is the current amount of e-mails in our dataset?  - %i"% len(emails))

printSpamHamStat()

What is the amount of dropped e-mails?  - 11877
What is the current amount of e-mails in our dataset?  - 63542
How many spam e-mails?  - 40450 
How many non-spam e-mails?  - 23092 
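
One possible answer to the first TODO item, as a sketch rather than the method used in this notebook: opening a file in binary mode and parsing it with `email.message_from_bytes` means the file is never decoded as UTF-8 as a whole, so no `UnicodeDecodeError` is raised while reading. The helper name `read_message` and the sample message are hypothetical and only serve as an illustration.

```python
import email
import os
import tempfile

def read_message(path):
    """Parse an e-mail file from raw bytes (no text decoding at read time)."""
    with open(path, 'rb') as fh:
        return email.message_from_bytes(fh.read())

# Made-up message whose body contains a Latin-1 byte (0xE9) that is invalid UTF-8.
raw = b"Subject: demo\nContent-Type: text/plain; charset=iso-8859-1\n\ncaf\xe9\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)

demo_msg = read_message(tmp.name)
os.remove(tmp.name)

# Decode the body with the charset declared in the message, falling back to UTF-8.
charset = demo_msg.get_content_charset() or 'utf8'
demo_text = demo_msg.get_payload(decode=True).decode(charset, errors='replace')
print(demo_text)
```

Whether the e-mails recovered this way actually help the classifier would still have to be checked on the corpus.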


##Third 

Using the __email__ library, I extract the text/plain content of each e-mail and omit e-mails whose extracted content is shorter than two characters.

TODO: Is there a less time-consuming way to read the content?

In [5]:
import email

allContent = []
indexL = []  # positions of e-mails with (almost) no plain-text content
for index, em in enumerate(emails):
    msg_out = email.message_from_string(em)
    content = ''
    for part in msg_out.walk():
        # only plain-text parts are used; text/html parts are ignored
        if part.get_content_type() == 'text/plain':
            content = content + part.get_payload()
    if len(content) < 2:
        indexL.append(index)
    else:
        allContent.append(content)

target.drop(target.index[indexL], inplace=True)

print("How many e-mails are omitted? - %i "%len(indexL))

data = pd.DataFrame(pd.Series(allContent), columns=['content'])

print("How many e-mails are in the data set? - %i "%data.shape[0])

printSpamHamStat()

How many e-mails are omitted? - 14186 
How many e-mails are in the data set? - 49356 
How many spam e-mails?  - 28153 
How many non-spam e-mails?  - 21203 
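
To illustrate what the loop over `msg_out.walk()` does, here is a minimal sketch on a hand-written multipart message. The message text and the helper name `plain_text` are made up for this example; only the `text/plain` part is kept, exactly as in the cell above.

```python
import email

# Hypothetical two-part message: one plain-text part and one HTML part.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello in plain text\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Hello in HTML</p>\n"
    "--XYZ--\n"
)

def plain_text(message_source):
    """Concatenate the payloads of all text/plain parts of a message."""
    msg = email.message_from_string(message_source)
    parts = [part.get_payload() for part in msg.walk()
             if part.get_content_type() == 'text/plain']
    return ''.join(parts)

demo_content = plain_text(raw)
print(repr(demo_content))
```

The container part itself has the type `multipart/alternative`, so it is skipped by the same condition and only leaf parts contribute text.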


##Fourth

Now I replace the labels in __target__ with numbers using a dictionary: 'spam' becomes 0 and 'ham' becomes 1. The index of __target__ also has to be reset to 0..len(target)-1 so that it lines up with the newly created __data__ object.


In [6]:
# named label_map so that the built-in map() is not shadowed
label_map = dict(zip(['spam','ham', 'Spam', 'Ham'], [0,1, 0,1]))

target = target.set_index(keys = data.index.values)

target.replace({'class':label_map}, inplace=True)

data = data.assign(hamOrSpam = target['class'])


In [7]:
data.head(12)

|   | content | hamOrSpam |
|---|---|---|
| 0 | Hi, i've just updated from the gulus and I che... | 1 |
| 1 | Mega authenticV I A G R A $ DISCOUNT priceC... | 0 |
| 2 | \nHey Billy, \n\nit was really fun going out t... | 0 |
| 3 | \nsystem" of the home. It will have the capab... | 0 |
| 4 | \nthe program and the creative abilities of th... | 0 |
| 5 | HoodiaLife - Start Losing Weight Now!\n\nHoo... | 0 |
| 6 | \nHi...\n\nI have to use R to find out the 90%... | 1 |
| 7 | Good day!Visit our new online drug store and s... | 0 |
| 8 | \nAnatrim =96 The latest and most delighting p... | 0 |
| 9 | \nmovement on the tablet. I could even select... | 0 |
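
The relabelling and re-indexing described in this step can be sketched on a toy frame. This is an illustration under assumptions, not the cell used above: the frame `toy_target` is made up, and `reset_index` and `Series.map` are used here as alternatives to `set_index` and `replace`.

```python
import pandas as pd

# Hypothetical labels whose index has gaps, as after dropping rows.
toy_target = pd.DataFrame({'class': ['spam', 'Ham', 'ham', 'Spam']},
                          index=[0, 3, 7, 9])

label_map = {'spam': 0, 'Spam': 0, 'ham': 1, 'Ham': 1}

# Renumber the rows 0..n-1 and translate the text labels to integers.
toy_target = toy_target.reset_index(drop=True)
toy_target['class'] = toy_target['class'].map(label_map)

print(toy_target)
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected spellings easy to spot.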


##Fifth

Now I can prepare the data for different ML models. I split the e-mail contents into __data_train__ and __data_test__, and the labels into __y_train__ and __y_test__. The models learn from __data_train__ and __y_train__; the trained models then predict labels for __data_test__, and these predictions are compared with __y_test__. We use *random_state=11* to fix the seed of the random number generator so that the same split can be reproduced. 

In [8]:
from sklearn.model_selection import train_test_split
import numpy as np

data_train, data_test, y_train, y_test = train_test_split(data['content'], data['hamOrSpam'], test_size=0.2, 
                                                          random_state=11)

print("How many samples are included into train data set? - %i"%data_train.shape[0])
print("How many samples are included into test data set? - %i"%data_test.shape[0])




How many samples are included into train data set? - 39484
How many samples are included into test data set? - 9872
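
A small sketch of what the fixed seed buys us: two calls to `train_test_split` with the same `random_state` return identical splits. The toy texts and labels are made up for this illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical miniature data set: ten "e-mails" with alternating labels.
toy_texts = ['mail %i' % i for i in range(10)]
toy_labels = [0, 1] * 5

first = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=11)
second = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=11)

# Each result is [X_train, X_test, y_train, y_test]; lists compare element-wise.
same_split = first == second
print(same_split)
print(len(first[0]), len(first[1]))
```

With a different seed, or without one, the membership of the train and test sets would generally change from run to run.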


##Sixth

Now we check the class distribution by counting the spam and non-spam e-mails in the training set. If the classes were extremely imbalanced (for example, only 10 of 100 samples in the minority class), we would need to consider other strategies, such as a different way of separating the classes, generating additional samples of the minority class, or reducing the number of samples of the majority class.

In [9]:
np.bincount(np.array(y_train).astype('int'))

array([22513, 16971])
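
To read the two counts as proportions, a short calculation is enough. The numbers are copied from the output above; index 0 is spam and index 1 is ham, following the label mapping.

```python
# Counts taken from the np.bincount output above.
train_counts = {'spam (0)': 22513, 'ham (1)': 16971}

total = sum(train_counts.values())
shares = {label: n / total for label, n in train_counts.items()}

for label, share in shares.items():
    print("%s: %.1f%%" % (label, 100 * share))
```

Roughly 57% spam to 43% ham is far from the 10-in-100 situation described above, so no resampling is applied here.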

##Seventh

Now I prepare the text data for the models. In the first phase I use __CountVectorizer__, which counts how often each word appears in each e-mail. It can also ignore *stop_words*: words that appear often in texts but carry little meaning (e.g. the, a, and, or). The conversion of the train and test data to lists is commented out below, since the vectorizer is fitted directly on the pandas Series.

In [10]:
#data_train = data_train.tolist()
#y_train = y_train.tolist()
#data_test = data_test.tolist()
#y_test = y_test.tolist()

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer(stop_words='english', analyzer='word',
                             tokenizer=lambda doc: 
                                doc.lower().replace('\n', ' ').
                                replace('\t', ' ').split(' '), lowercase=False)

X_train_counts = count_vect.fit_transform(data_train)
X_train_counts.shape

(39484, 476006)

##Eighth## 

Let us check what is inside the vocabulary. 

In [24]:
i = 1
for k in count_vect.vocabulary_.items():
    if i%10==0:
        print(k)
    i+=1
    if i==150:
        break
    

('50mg', 105489)
('$141.02', 16797)
('$2.95', 17057)
('$76.68', 18056)
('$55', 17792)
('$6', 17858)
('$1', 16480)
('day', 197607)
('term!!!', 432338)
("masters'", 316865)
('life', 305666)
('assured', 148194)
('distribution.', 206806)
('rife', 385009)


##Ninth##

Now I transform the word counts of each e-mail into TF-IDF weights (term frequency times inverse document frequency), which emphasize words that are important for an e-mail relative to the whole collection of documents. 

In [25]:
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tfidf = tf_transformer.transform(X_train_counts)
X_train_tfidf.shape  


(39484, 476006)
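
As a worked example of the weighting, the sketch below recomputes TF-IDF by hand for a tiny made-up count matrix. It assumes scikit-learn's default settings for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), where the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and every row is scaled to unit length.

```python
import math

# Hypothetical counts: 3 documents (rows) x 2 terms (columns).
# Term 0 occurs in every document, term 1 only in the first one.
toy_counts = [[1, 1], [1, 0], [2, 0]]

n_docs = len(toy_counts)
n_terms = len(toy_counts[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in toy_counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to Euclidean length 1."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

toy_tfidf = [tfidf_row(row) for row in toy_counts]
print(idf)
print(toy_tfidf[0])
```

In the first document both terms occur once, yet the rarer term 1 receives the larger weight, which is the emphasis described above.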

##Tenth##

Now I can try different models on the output of __TfidfTransformer__. Each e-mail is represented as a row that holds the TF-IDF weight of every word present in the e-mail and zeros in all other columns. The whole training set is thus a matrix whose rows are the training samples and whose columns are the words of the vocabulary.

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# The test set is vectorized once, with the vocabulary and IDF weights learned on the training set
data_test_counts = count_vect.transform(data_test)
data_test_tfidf = tf_transformer.transform(data_test_counts)

predictedValues = []
maxMean = 0
bestClass = None
for clf in [MultinomialNB(), LogisticRegression(),
            SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5,
                          random_state=42, max_iter=5, tol=None),
            LinearSVC()]:
    clf = clf.fit(X_train_tfidf, y_train)
    predicted = clf.predict(data_test_tfidf)

    # accuracy: share of test e-mails whose predicted label equals the true label
    currentMean = np.mean(predicted == np.array(y_test))
    if currentMean > maxMean:
        maxMean = currentMean
        bestClass = clf
        predictedValues = predicted


print(bestClass)
print(maxMean)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
0.99290923825
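
The three stages used in this notebook (counting, TF-IDF weighting, classification) can also be chained in a scikit-learn `Pipeline`, so that `fit` and `predict` apply every stage in order. This is an alternative arrangement sketched on made-up toy texts, not the code that produced the result above; the step names and the default tokenizer are choices made for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical miniature training set: 0 = spam, 1 = ham.
toy_train = ['cheap pills discount', 'meeting agenda tomorrow',
             'discount pills now', 'project meeting notes']
toy_y = [0, 1, 0, 1]

text_clf = Pipeline([
    ('counts', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])

# fit() learns the vocabulary, the IDF weights and the classifier in one call.
text_clf.fit(toy_train, toy_y)

# predict() reuses the fitted vocabulary and weights on unseen text.
toy_pred = text_clf.predict(['discount pills', 'meeting notes'])
print(toy_pred)
```

A pipeline keeps the test data from being vectorized with anything other than the training vocabulary, because the transform steps are tied to the fitted object.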


##Eleventh##

In the code above I trained four models that are commonly used for text classification. Here __LinearSVC__ wins with the best accuracy, computed simply as the share of test e-mails for which the predicted label (0 or 1) equals the actual label in __y_test__. 

__LinearSVC__ (Linear Support Vector Classification) is the special case of a Support Vector Machine in which classification is done with a linear function, so no kernel (no non-linear transformation of the features) is needed.
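
A minimal sketch of what "a linear function" means here: the score of an e-mail is a weighted sum of its features plus a bias, and the sign of the score decides the class. The weights, bias and two-feature e-mails below are hypothetical and are not taken from the trained model.

```python
def linear_decision(weights, bias, features):
    """Score of a linear classifier: dot product of weights and features, plus bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict_label(score):
    """Positive scores map to class 1 (ham), all others to class 0 (spam)."""
    return 1 if score > 0 else 0

# Hypothetical weights for two words: 'discount' pulls towards spam, 'meeting' towards ham.
toy_weights = [-1.5, 2.0]
toy_bias = 0.1

score_spammy = linear_decision(toy_weights, toy_bias, [0.9, 0.0])
score_hammy = linear_decision(toy_weights, toy_bias, [0.0, 0.8])

print(score_spammy, predict_label(score_spammy))
print(score_hammy, predict_label(score_hammy))
```

Training the model amounts to choosing the weights and the bias; prediction is just this one weighted sum per e-mail.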

##Finally##
I want to look at the results of the winning model in more detail.

In [30]:
from sklearn import metrics
print(metrics.classification_report(y_test, predictedValues, target_names = ['spam', 'ham']))

             precision    recall  f1-score   support

       spam       0.99      1.00      0.99      5640
        ham       1.00      0.99      0.99      4232

avg / total       0.99      0.99      0.99      9872
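
For readers who want to see how the columns of the report are defined, the sketch below computes precision, recall and F1 by hand on made-up labels. The helper `precision_recall_f1` is hypothetical and assumes the chosen class occurs at least once in both the true and the predicted labels.

```python
# Hypothetical true and predicted labels: 0 = spam, 1 = ham.
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

def precision_recall_f1(y_true, y_pred, positive):
    """Per-class scores, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)   # how many predicted positives are correct
    recall = tp / (tp + fn)      # how many actual positives are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

spam_scores = precision_recall_f1(toy_true, toy_pred, positive=0)
ham_scores = precision_recall_f1(toy_true, toy_pred, positive=1)
print(spam_scores)
print(ham_scores)
```

The support column of the report is simply the number of test e-mails that truly belong to each class.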

