# 02-email_spam_classification [enron]

* __Status__ : OK
* __Dataset__ : /thetidinbox_1004/notebooks/Marine/enron1.tar.gz [as an example]
* __Source__ : Enron Email Dataset [https://www.cs.cmu.edu/~enron/]
* __Labeled__ : Yes. Spam/Ham


## 📚 **Import libraries**

In [3]:
import numpy as np
import pandas as pd
from math import sqrt

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [6]:
emails_df = pd.read_csv('email_df.csv')

In [21]:
emails_df

| Unnamed: 0 | message | class |
|---|---|---|
| 0 | subject christma tree farm pictur | 0 |
| 1 | subject vastar resourc inc gari product high i... | 0 |
| 2 | subject calpin daili gas nomin calpin daili ga... | 0 |
| 3 | subject issu fyi see note alreadi done stella ... | 0 |
| 4 | subject meter nov alloc fyi forward lauri alle... | 0 |
| ... | ... | ... |
| 4989 | subject pro forma invoic attach divid cover ga... | 1 |
| 4990 | subject str rndlen extra time word bodyhtml | 1 |
| 4991 | subject check bb hey derm bbbbb check pari man... | 1 |
| 4992 | subject hot job global market specialti po box... | 1 |


## Model setting

The dataset is split into training and test data in a 70%/30% ratio.
The splitting function, `train_test_split`, is imported from the library 'sklearn'. Besides the features and the target, it takes parameters such as `test_size` (the fraction of the data held out as the test set), `shuffle` (whether the data is shuffled before splitting), `random_state` (the seed used by the random number generator) and `stratify` (the labels whose class proportions each subset should preserve).

In [7]:
# Define the independent variables as X
X = emails_df['message'].values

# Define the target as Y
y = emails_df['class'].values

In [9]:
# Create a train/test split using 30% test size.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, shuffle=True, random_state=123, stratify=y)
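
The effect of `stratify` can be checked on a small toy example. The sketch below uses made-up labels (80 ham, 20 spam) and placeholder features, both assumptions for illustration only, and prints the spam fraction in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 ham (0) and 20 spam (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=True, random_state=123, stratify=y_toy
)

print(len(y_tr), len(y_te))      # sizes of the two subsets
print(y_tr.mean(), y_te.mean())  # spam fraction in each subset
```

With stratification, both subsets keep the spam fraction of the full toy dataset; without it, the fractions can drift apart, which matters most when one class is rare.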

### Feature Extraction Process :

Text data requires special preparation before you can start using it for predictive modeling. The text must first be parsed to split it into words, a step called tokenization. The words then need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).

A simple and effective model for representing text documents in machine learning is the Bag-of-Words model, or BoW. This model first tokenizes each document into words, assigns each word a unique integer index, and then describes the document by how often each word occurs, ignoring word order. The `fit()` function learns the vocabulary from one or more documents.

Here, we use `CountVectorizer` to convert the emails into a Bag of Words. It provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.

In [10]:
# Count Vectorizer/ Bag of Words

# Create an instance of the CountVectorizer class (English stop words are dropped)
cv = CountVectorizer(stop_words='english')
# fit_transform() learns the vocabulary from the training documents and encodes them in one step.
bag_of_words_train = cv.fit_transform(X_train)
# the transform() function is used to encode each document as a vector.
bag_of_words_test = cv.transform(X_test)
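
To make the encoding concrete, here is a minimal sketch on three made-up documents (the texts and the names `toy_cv`, `docs_train`, `docs_test` are illustrative assumptions, not part of the Enron data). It shows the learned vocabulary and the count vectors, including what happens to a word that only appears at test time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
docs_train = ["free money free prize", "meeting about gas prices"]
docs_test = ["free gas lottery"]  # "lottery" never appears in training

toy_cv = CountVectorizer()
counts_train = toy_cv.fit_transform(docs_train)  # learn vocabulary, then encode
counts_test = toy_cv.transform(docs_test)        # encode with the learned vocabulary only

# Word -> column index, ordered by column.
print(sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1]))
print(counts_train.toarray())
print(counts_test.toarray())
```

Words that were not seen during `fit` have no column, so "lottery" is silently dropped from the test vector. This is why the vocabulary is learned on the training split and only applied to the test split.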

### Model Selection:

We apply three different classification models:
1. Multinomial Naive Bayes (MultinomialNB)
2. Logistic Regression
3. K-Nearest Neighbors (KNN)

In [11]:
nb = MultinomialNB(alpha=0.05)

# Cross Validation using Multinomial NB
scores_nb = cross_val_score(nb, bag_of_words_train, y_train, scoring= 'accuracy',cv=10)

# Mean score +/- 2 standard deviations across the folds (a rough 95% interval if the fold scores are roughly normal)
print("MultinomialNB Accuracy: %0.3f (+/- %0.3f)" % (scores_nb.mean(), scores_nb.std() * 2))

MultinomialNB Accuracy: 0.977 (+/- 0.015)


In [12]:
accuracy_dict = {}
accuracy_dict.update({'MultinomialNB':scores_nb})
print(accuracy_dict)

{'MultinomialNB': array([0.98      , 0.96857143, 0.96285714, 0.98      , 0.97714286,
       0.97421203, 0.97421203, 0.99140401, 0.98567335, 0.97421203])}
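
As a worked check of the summary line, the sketch below recomputes the mean and the "+/- 2 standard deviations" spread directly from the ten fold accuracies printed above. The variable names are illustrative. Note that this spread describes fold-to-fold variation and is only a rough stand-in for a formal confidence interval.

```python
import numpy as np

# Fold accuracies copied from the MultinomialNB output above.
fold_scores = np.array([0.98, 0.96857143, 0.96285714, 0.98, 0.97714286,
                        0.97421203, 0.97421203, 0.99140401, 0.98567335, 0.97421203])

mean = fold_scores.mean()
spread = 2 * fold_scores.std()  # NumPy's default population std (ddof=0)

print("%0.3f (+/- %0.3f)" % (mean, spread))  # prints: 0.977 (+/- 0.015)
```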


In [13]:
logreg = LogisticRegression()

# Cross Validation using Logistic Regression
scores_logreg = cross_val_score(logreg, bag_of_words_train, y_train, scoring= 'accuracy',cv=10)

# Mean score +/- 2 standard deviations across the folds (a rough 95% interval if the fold scores are roughly normal)
print("LogisticRegression Accuracy: %0.3f (+/- %0.3f)" % (scores_logreg.mean(), scores_logreg.std() * 2))

LogisticRegression Accuracy: 0.976 (+/- 0.020)


In [14]:
accuracy_dict.update({'LogisticRegression':scores_logreg})
print(accuracy_dict)

{'MultinomialNB': array([0.98      , 0.96857143, 0.96285714, 0.98      , 0.97714286,
       0.97421203, 0.97421203, 0.99140401, 0.98567335, 0.97421203]), 'LogisticRegression': array([0.98      , 0.95714286, 0.97714286, 0.97142857, 0.98285714,
       0.96275072, 0.97994269, 0.98853868, 0.98853868, 0.96848138])}


In [15]:
# 10-fold cross-validation for KNN with k = 3 (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=3)

# Use cross_val_score function
# cv=10 for 10 folds
scores_knn = cross_val_score(knn, bag_of_words_train, y_train, cv=10, scoring='accuracy')

# Mean score +/- 2 standard deviations across the folds (a rough 95% interval if the fold scores are roughly normal)
print("KNN Accuracy: %0.2f (+/- %0.2f)" % (scores_knn.mean(), scores_knn.std() * 2))

predicted = cross_val_predict(knn, bag_of_words_train, y_train, cv=10)

KNN Accuracy: 0.87 (+/- 0.03)


In [16]:
accuracy_dict.update({'KNN':scores_knn})
print(accuracy_dict)

{'MultinomialNB': array([0.98      , 0.96857143, 0.96285714, 0.98      , 0.97714286,
       0.97421203, 0.97421203, 0.99140401, 0.98567335, 0.97421203]), 'LogisticRegression': array([0.98      , 0.95714286, 0.97714286, 0.97142857, 0.98285714,
       0.96275072, 0.97994269, 0.98853868, 0.98853868, 0.96848138]), 'KNN': array([0.87428571, 0.85714286, 0.89142857, 0.84571429, 0.88571429,
       0.86819484, 0.86246418, 0.86532951, 0.89684814, 0.85673352])}


In [17]:
accuracy_df = pd.DataFrame.from_dict(accuracy_dict,orient='columns')
print(accuracy_df)

   MultinomialNB  LogisticRegression       KNN
0       0.980000            0.980000  0.874286
1       0.968571            0.957143  0.857143
2       0.962857            0.977143  0.891429
3       0.980000            0.971429  0.845714
4       0.977143            0.982857  0.885714
5       0.974212            0.962751  0.868195
6       0.974212            0.979943  0.862464
7       0.991404            0.988539  0.865330
8       0.985673            0.988539  0.896848
9       0.974212            0.968481  0.856734


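From the above dataframe we can infer that:
* KNN has the lowest accuracy, about 87% on average.
* Multinomial Naive Bayes has the highest mean accuracy, 97.7%, narrowly ahead of Logistic Regression at 97.6%. The gap is smaller than the fold-to-fold variation of either model.
* We therefore use the Multinomial Naive Bayes classifier for the further evaluation.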
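
The comparison can be summarised per model rather than read off fold by fold. The following sketch rebuilds the table from the printed values (the names `fold_accuracy`, `summary` and `gap` are illustrative) and compares the gap between the two best models with their spread across folds.

```python
import pandas as pd

# Per-fold accuracies copied from the dataframe printed above.
fold_accuracy = pd.DataFrame({
    "MultinomialNB": [0.980000, 0.968571, 0.962857, 0.980000, 0.977143,
                      0.974212, 0.974212, 0.991404, 0.985673, 0.974212],
    "LogisticRegression": [0.980000, 0.957143, 0.977143, 0.971429, 0.982857,
                           0.962751, 0.979943, 0.988539, 0.988539, 0.968481],
    "KNN": [0.874286, 0.857143, 0.891429, 0.845714, 0.885714,
            0.868195, 0.862464, 0.865330, 0.896848, 0.856734],
})

# One row per model: mean and sample standard deviation (pandas default, ddof=1).
summary = fold_accuracy.agg(["mean", "std"]).T
print(summary.round(4))

gap = summary.loc["MultinomialNB", "mean"] - summary.loc["LogisticRegression", "mean"]
print("NB minus LogReg mean accuracy: %0.4f" % gap)
```

Because the gap between Naive Bayes and Logistic Regression is well inside one standard deviation, the choice of Naive Bayes is a reasonable default rather than a statistically clear win.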

### Model Evaluation :

Proceeding further with Multinomial Naive Bayes Classifier.

**Word Frequencies with TfidfVectorizer**: 
One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.
An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word. TF-IDF scores try to highlight words that are more interesting: a word scores high when it is frequent in a document but rare across the collection. The same create, fit, and transform process is used as with the CountVectorizer.
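
A small sketch of this reweighting, using three made-up sentences (an assumption for illustration). `TfidfVectorizer` combines the counting and the TF-IDF step in one object; the word "the" occurs in every document, so it should receive a lower weight than words that occur in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; "the" occurs in all of them.
docs = ["the cat sat", "the dog barked", "the bird sang"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# TF-IDF weights of every vocabulary word in the first document.
first_doc = {word: float(weights[0, idx]) for word, idx in tfidf.vocabulary_.items()}
print({word: round(w, 3) for word, w in first_doc.items() if w > 0})
```

In the first document, "cat" and "sat" end up with a higher weight than "the", even though each word occurs exactly once there.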

In [18]:
pipe = Pipeline([('vectorize', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('classify', MultinomialNB())])

parameters = {'vectorize__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'classify__alpha':(1e-2, 1e-3)}

gs_clf = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

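# Predict on the held-out test set
predict_MNB = gs_clf.predict(X_test)

# Confusion matrix of the test set (not the last expression in the cell, so it is not displayed)
metrics.confusion_matrix(y_test, predict_MNB)

# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("MultinomialNB Accuracy: ",gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))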

MultinomialNB Accuracy:  0.9874105865522175
classify__alpha: 0.001
tfidf__use_idf: True
vectorize__ngram_range: (1, 2)


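Displayed above is the accuracy of the Naive Bayes classifier after converting the texts into a bag of words, reweighting the counts with TF-IDF, and finally fitting the model.

The accuracy is 98.74%. This value is `best_score_`, the mean 5-fold cross-validated accuracy on the training data for the best parameter combination, not the accuracy on the test set.
Also displayed are the best parameters found by the grid search, which produced this accuracy.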
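
The difference between the cross-validated score and the test-set score can be illustrated on a tiny, made-up corpus. Everything in the sketch below (the texts, the names `toy_pipe` and `toy_search`, the reduced parameter grid) is an assumption for illustration; it is not the Enron data and its numbers say nothing about the results above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = spam, 0 = ham.
spam = [
    "win free money now", "free prize claim now", "cheap pills free offer",
    "claim your free money", "win a free prize today", "free offer win cash",
    "cheap money offer now", "claim cash prize now", "free pills cheap price",
    "win cash free today",
]
ham = [
    "meeting at noon tomorrow", "gas nomination for tomorrow",
    "see attached meter report", "please review the contract",
    "daily gas report attached", "schedule the meeting today",
    "contract review at noon", "meter reading for november",
    "please see the schedule", "forward the gas contract",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=123, stratify=labels
)

toy_pipe = Pipeline([("vectorize", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("classify", MultinomialNB())])
toy_search = GridSearchCV(toy_pipe, {"classify__alpha": [1e-2, 1e-3]}, cv=3)
toy_search.fit(X_tr, y_tr)

y_pred = toy_search.predict(X_te)
cv_accuracy = toy_search.best_score_           # mean accuracy over the CV folds (training data only)
test_accuracy = accuracy_score(y_te, y_pred)   # accuracy on the held-out test set
cm = confusion_matrix(y_te, y_pred)            # rows: true class, columns: predicted class

print("cross-validated accuracy:", cv_accuracy)
print("test accuracy:", test_accuracy)
print(cm)
```

Reporting both numbers, and printing the confusion matrix explicitly, keeps the model-selection score and the final out-of-sample score clearly apart.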

*Now calculating the out-of-sample error:*

Root Mean Square Error (RMSE) is the square root of the mean squared residual (prediction error). In regression, residuals measure how far data points are from the regression line, and RMSE measures how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.

R-squared is a relative measure of fit; RMSE is an absolute measure of fit.

Lower values of RMSE indicate a better fit.

In [19]:
# Out-of-sample (test-set) error of the already fitted model
rmse = sqrt(mean_squared_error(y_test, predict_MNB))
print(rmse)


0.10003335000926467
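
For 0/1 class labels, each squared residual is 1 for a misclassified sample and 0 otherwise, so the mean squared error equals the misclassification rate and RMSE is its square root. The sketch below checks this identity on made-up labels (an assumption for illustration) and then applies it to the RMSE value printed above.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical labels: one wrong prediction out of ten.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

rmse_toy = sqrt(mean_squared_error(y_true, y_pred))
error_rate = 1 - accuracy_score(y_true, y_pred)
print(rmse_toy, sqrt(error_rate))  # the two values agree

# Squaring the RMSE reported above gives the implied test error rate.
implied_error_rate = 0.10003335000926467 ** 2
print(implied_error_rate)
```

Read this way, the reported RMSE corresponds to roughly 1% of test emails being misclassified, which is easier to interpret for a classifier than the RMSE itself.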


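The final model has a cross-validated accuracy of 98.74% and a test-set RMSE of 0.1. Because the labels are 0/1, the squared RMSE is the misclassification rate, about 1.0% of the test emails. The confusion matrix of the test set is computed in cell 18 with `metrics.confusion_matrix`, but its output is not displayed.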

In [20]:
# Calculate AUC from the hard 0/1 predictions (not from predicted probabilities)
metrics.roc_auc_score(y_test, predict_MNB)

0.9875864958954743

* AUC is useful as a single number summary of classifier performance
* Higher value = better classifier
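
AUC is usually computed from continuous scores such as predicted probabilities, because hard 0/1 predictions collapse the ROC curve to a single threshold. The sketch below uses made-up labels and scores (assumptions for illustration) to show that the two inputs can give different values. With a fitted pipeline, such scores would typically come from something like `predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted spam probabilities.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard predictions obtained by thresholding the scores at 0.5.
hard = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

print(auc_from_scores)  # uses the full ranking of the scores
print(auc_from_labels)  # uses only the thresholded labels
```

In this toy case the score-based AUC is 8/9 while the label-based AUC is 2/3, so the value reported above is best read as a summary of the thresholded predictions.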