# Case Study 3 - Email Spam

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Team Strategy

Emails are read in.  Kept: From, Subject, Body.
- Remove \n with regex
- Remove other non-alphabetic characters - keep the counts?

Feature Extraction:
1. Vectorization - TFIDF (removes stop words)
If we have time:
2. Created 'trusted' and 'spam' email address book to filter spam - IS THIS OUT OF SCOPE PER SCOPE?

Model:
Classification with Naive Bayes - Amber, Jorge
1. Subject Only = Baseline
2. Body (+Subject?)

Clustering:
kNN with cosine distance for NLP - Andrew, Paritosh

# Content
* [Business Understanding](#business-understanding)
    - [Scope](#scope)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Feature Removal](#feature-removal)
    - [Exploratory Data Analysis (EDA)](#eda)
    - [Assumptions](#assumptions)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-metrics)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Sampling Methodology](#sampling-methodology)
    - [Model](#model)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations and Model Enhancements](#model-enhancements)
    - [Alternative Modeling Approaches](#alternative-modeling-approaches)

# Business Understanding & Executive Summary <a id='business-understanding'/>

What are we trying to solve for and why is it important?


### Scope <a id='scope'/>


### Introduction <a id='introduction'/>


### Methods <a id='methods'/>
 
 
### Results <a id='results'/>
 

# Data Evaluation <a id='data-evaluation'>
    

Summarize data being used?

Are there missing values?

Which variables are needed and which are not?

What assumptions or conclusions are you drawing about your data?

In [1]:
# standard libraries
import pandas as pd
import numpy as np
import os
from IPython.display import Image

# email
#from email.message import EmailMessage
from email import policy
from email.parser import BytesParser

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate

# data pre-processing
from sklearn.model_selection import train_test_split

# prediction models


# import warnings filter
'''import warnings
warnings.filterwarnings('ignore')
from warnings import simplefilter 
simplefilter(action='ignore', category=FutureWarning)'''



## Loading Data <a id='loading-data'>

In [2]:
# Specify your local directory
email_dir = 'C:\\Paritosh\\SMU\\7333 Quantifying the World\\Proj\\CS3\\Data'
os.chdir(email_dir)

In [3]:
# Get the list of folder names
folders = os.listdir()
folders

['easy_ham', 'easy_ham_2', 'hard_ham', 'spam', 'spam_2']

In [4]:
# Get the file names in each folder (list of lists)
files = [ os.listdir('.\\' + i) for i in folders] 

# Create a list of dataframes for all of the folders
emails = [ pd.DataFrame({'folder' : [], 'from' : [], 'subject' : [], 'body': []}) ]*len(folders)

# Add folder path to file names
for i in range(0,len(folders)):
    for j in range(0, len(files[i])):
        files[i][j] = str(folders[i] + '\\' + files[i][j]) 
        
        # Parse and extract email 'subject' and 'from'
        with open(files[i][j], 'rb') as fp:
            msg = BytesParser(policy=policy.default).parse(fp)
            
            # Error checking when reading in body for some html-based emails from spam folders
            try:
                simplest = msg.get_body(preferencelist=('plain', 'html'))
                try:
                    new_row = {'folder': folders[i], 'from': msg['from'], 'subject': msg['subject'], 'body': simplest.get_content()}
                    emails[i] = emails[i].append(new_row, ignore_index=True)
                except:
                    new_row = {'folder': folders[i], 'from': msg['from'], 'subject':msg['subject'], 'body':'Error(html)'}
                    emails[i] = emails[i].append(new_row, ignore_index=True)
            except:
                new_row = {'folder': folders[i], 'from': msg['from'], 'subject':msg['subject'], 'body':'Error(html)'}
                emails[i] = emails[i].append(new_row, ignore_index=True)

In [5]:
# Emails per folder
print("# files in folders:", [len(i) for i in files])
print("# emails read in  :", [i.shape[0] for i in emails])

# Total emails
print( "\n# total emails =", sum([len(i) for i in files]) )

# files in folders: [5052, 1401, 500, 1001, 1398]
# emails read in  : [5052, 1401, 500, 1001, 1398]

# total emails = 9352


In [6]:
# Create single dataframe from all folders
df = pd.concat( [emails[i] for i in range(0, len(emails))], axis=0)

# create response column from folder names
df['spam'] = (df['folder']=='spam|spam_2').astype(int)
#df['spam'].astype(int)

#  Keep the indices from the folders
df = df.reset_index() 
df.columns = ['folder_idx', 'folder', 'from', 'subject', 'body', 'spam']

df.shape

(9352, 6)

In [7]:
df.head()

Unnamed: 0,folder_idx,folder,from,subject,body,spam
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0


In [8]:
df = df.dropna()

In [9]:
subject=df.subject
subject.head()

0                          Re: New Sequences Window
1                         [zzzzteana] RE: Alexander
2                         [zzzzteana] Moscow bomber
3             [IRR] Klez: The Virus That  Won't Die
4    Re: [zzzzteana] Nothing like mama used to make
Name: subject, dtype: object

In [10]:
## Cleaning the ham text
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
stop_words = set(stopwords.words('english')) ## import stopwords
from nltk.stem import PorterStemmer ## imported the porter stemmer
from nltk.stem import WordNetLemmatizer ## imported wordnet lemmatizer
ps = PorterStemmer()  
wrlm = WordNetLemmatizer()
indx = list(subject.index.values)

subject_corp = []
for i in indx:
    updated = re.sub('[^a-zA-Z]', ' ', str(subject[i])) ## if not A-Za_Z replace with empty space
    updated = updated.lower()   ## change to lower case
    updated = updated.split()    ## splts at space
    updated = [w for w in updated if not w.lower() in stop_words] ## remove stop words
    updated = [ps.stem(w) for w in updated]                       ## stemmed the words
    updated = [wrlm.lemmatize(w) for w in updated]                ## keep words that have meaning
#     print(updated)
#     review = ' '.join(updated)
    subject_corp.append(updated)

In [11]:
subject_corp

[['new', 'sequenc', 'window'],
 ['zzzzteana', 'alexand'],
 ['zzzzteana', 'moscow', 'bomber'],
 ['irr', 'klez', 'viru', 'die'],
 ['zzzzteana', 'noth', 'like', 'mama', 'use', 'make'],
 ['zzzzteana', 'noth', 'like', 'mama', 'use', 'make'],
 ['zzzzteana', 'playboy', 'want', 'go', 'bang'],
 ['zzzzteana', 'noth', 'like', 'mama', 'use', 'make'],
 ['zzzzteana', 'meaning', 'sentenc'],
 ['new', 'sequenc', 'window'],
 ['satalk', 'sa', 'cgi', 'configur', 'script'],
 ['sadev', 'interest', 'approach', 'spam', 'handl'],
 ['sadev', 'live', 'rule', 'updat', 'releas'],
 ['ilug', 'problem', 'raid', 'cobalt', 'raq'],
 ['new', 'sequenc', 'window'],
 ['case', 'spam'],
 ['iiu', 'eircom', 'adsl', 'nat', 'ing'],
 ['zzzzteana', 'australian', 'cathol', 'kiddi', 'perv', 'step', 'asid'],
 ['ilug', 'sun', 'solari'],
 ['zzzzteana', 'muppet'],
 ['zzzzteana', 'alexand'],
 ['ilug', 'sun', 'solari'],
 ['zzzzteana', 'muppet'],
 ['ilug', 'sun', 'solari'],
 ['ilug', 'sun', 'solari'],
 ['zzzzteana', 'muppet'],
 ['ilug', 'su

In [12]:
X=(df.subject)

In [13]:
#X

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer


# tfidf vectorizer of scikit learn
vectorizer = TfidfVectorizer(stop_words=stop_words,max_features=10000, max_df = 0.5, use_idf = True, ngram_range=(1,3))
X = vectorizer.fit_transform(X)
print(X.shape) # check shape of the document-term matrix
terms = vectorizer.get_feature_names()

(9340, 10000)




In [15]:
#X
#print(X)

In [16]:
from sklearn.cluster import KMeans
num_clusters = 2
km = KMeans(n_clusters=num_clusters)
km.fit(X)
sub_clusters = km.labels_.tolist()

In [17]:
df['sub_cluster'] = sub_clusters

In [18]:
df

Unnamed: 0,folder_idx,folder,from,subject,body,spam,sub_cluster
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0,0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,0
...,...,...,...,...,...,...,...
9346,1392,spam_2,Professional_Career_Development_Institute@Frug...,Busy? Home Study Makes Sense!,"<html>\n<head>\n<meta http-equiv=""content-type...",0,0
9347,1393,spam_2,IQ - TBA <tba@insiq.us>,Preferred Non-Smoker Rates for Smokers,\t Preferred Non-Smoker\n \t\n Just what the ...,0,0
9348,1394,spam_2,Mike <raye@yahoo.lv>,"How to get 10,000 FREE hits per day to any web...","Dear Subscriber,\n\nIf I could show you a way ...",0,0
9349,1395,spam_2,"""Mr. Clean"" <cweqx@dialix.oz.au>",Cannabis Difference,****Mid-Summer Customer Appreciation SALE!****...,0,0


In [19]:
df['SPAM'] = df['folder']

In [20]:
# creating a dict file 
SPAM_d = {'easy_ham': 0,'easy_ham_2': 0, 'hard_ham': 0, 'spam' : 1, 'spam_2':1}

In [21]:
df.SPAM = [SPAM_d[item] for item in df.SPAM]
df.head()

Unnamed: 0,folder_idx,folder,from,subject,body,spam,sub_cluster,SPAM
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0,0,0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,0,0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,0,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,0,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,0,0


In [22]:
df.groupby(['SPAM'])['SPAM'].count()

SPAM
0    6943
1    2397
Name: SPAM, dtype: int64

In [24]:
df.groupby(['folder'])['folder'].count()


folder
easy_ham      5051
easy_ham_2    1395
hard_ham       497
spam          1000
spam_2        1397
Name: folder, dtype: int64

In [25]:
df.groupby(['sub_cluster'])['sub_cluster'].count()

sub_cluster
0    9233
1     107
Name: sub_cluster, dtype: int64

In [26]:
from sklearn.metrics import accuracy_score
accuracy_score(df.SPAM, df.sub_cluster)

0.7319057815845824

In [27]:
# with subject text
import pandas as pd

y_actual = pd.Series(df.SPAM, name='Actual')
y_predicted = pd.Series(df.sub_cluster, name='Predicted')

from sklearn import metrics
metrics.confusion_matrix(df.SPAM, df.sub_cluster)

array([[6836,  107],
       [2397,    0]], dtype=int64)

In [28]:
# with subject text
import pandas as pd

y_actual = pd.Series(df.SPAM, name='Actual')
y_predicted = pd.Series(df.sub_cluster, name='Predicted')


metrics.confusion_matrix(df.SPAM, df.sub_cluster)

y_actual = pd.Series(df.SPAM, name='Actual')
y_predicted = pd.Series(df.sub_cluster, name='Predicted')

print(pd.crosstab(y_actual, y_predicted))
#print accuracy of model
print("Accuracy")
print(metrics.accuracy_score(y_actual, y_predicted))


#print precision value of model
print("Precision")
print(metrics.precision_score(y_actual, y_predicted))



#print recall value of model
print("Recall")
print(metrics.recall_score(y_actual, y_predicted))

Predicted     0    1
Actual              
0          6836  107
1          2397    0
Accuracy
0.7319057815845824
Precision
0.0
Recall
0.0


In [29]:
from sklearn.metrics import classification_report
print(classification_report(df.SPAM,df.sub_cluster))

              precision    recall  f1-score   support

           0       0.74      0.98      0.85      6943
           1       0.00      0.00      0.00      2397

    accuracy                           0.73      9340
   macro avg       0.37      0.49      0.42      9340
weighted avg       0.55      0.73      0.63      9340



# Using Body

In [30]:
df.head()

Unnamed: 0,folder_idx,folder,from,subject,body,spam,sub_cluster,SPAM
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0,0,0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,0,0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,0,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,0,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,0,0


In [50]:
body=df.body
body.head()

0        Date:        Wed, 21 Aug 2002 10:54:46 -05...
1    Martin A posted:\nTassos Papadopoulos, the Gre...
2    Man Threatens Explosion In Moscow \n\nThursday...
3    Klez: The Virus That Won't Die\n \nAlready the...
4    >  in adding cream to spaghetti carbonara, whi...
Name: body, dtype: object

In [51]:
## Cleaning the ham text
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
stop_words = set(stopwords.words('english')) ## import stopwords
from nltk.stem import PorterStemmer ## imported the porter stemmer
from nltk.stem import WordNetLemmatizer ## imported wordnet lemmatizer
ps = PorterStemmer()  
wrlm = WordNetLemmatizer()
indx = list(subject.index.values)

body_corp = []
for i in indx:
    updated = re.sub('[^a-zA-Z]', ' ', str(body[i])) ## if not A-Za_Z replace with empty space
    updated = updated.lower()   ## change to lower case
    updated = updated.split()    ## splts at space
    updated = [w for w in updated if not w.lower() in stop_words] ## remove stop words
    updated = [ps.stem(w) for w in updated]                       ## stemmed the words
    updated = [wrlm.lemmatize(w) for w in updated]                ## keep words that have meaning
#     print(updated)
#     review = ' '.join(updated)
    body_corp.append(updated)

In [52]:
X_b=(df.body)

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer


# tfidf vectorizer of scikit learn
vectorizer = TfidfVectorizer(stop_words=stop_words,max_features=10000, max_df = 0.5, use_idf = True, ngram_range=(1,3))
X_b = vectorizer.fit_transform(X_b)
print(X_b.shape) # check shape of the document-term matrix
terms = vectorizer.get_feature_names()

(9340, 10000)




In [54]:
from sklearn.cluster import KMeans
num_clusters = 2
km = KMeans(n_clusters=num_clusters)
km.fit(X_b)
body_clusters = km.labels_.tolist()

In [55]:
df['body_cluster'] = body_clusters

In [56]:
df.groupby(['SPAM'])['SPAM'].count()

SPAM
0    6943
1    2397
Name: SPAM, dtype: int64

In [57]:
df.groupby(['body_cluster'])['body_cluster'].count()

body_cluster
0    8107
1    1233
Name: body_cluster, dtype: int64

In [58]:
df['body_cluster_mod'] = df['body_cluster']

In [59]:
df.head()

Unnamed: 0,folder_idx,folder,from,subject,body,spam,sub_cluster,SPAM,body_cluster,body_cluster_mod
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0,0,0,0,0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,0,0,0,0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,0,0,0,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,0,0,0,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,0,0,0,0


In [60]:
#df['body_cluster_mod'] = np.where((df.body_cluster_mod ==1), 10, df.body_cluster_mod)


In [61]:
#df['body_cluster_mod'] = np.where((df.body_cluster_mod == 0), 1, df.body_cluster_mod)

In [62]:
#df['body_cluster_mod'] = np.where((df.body_cluster_mod == 10), 0, df.body_cluster_mod)

In [63]:
from sklearn.metrics import accuracy_score
accuracy_score(df.SPAM, df.body_cluster_mod)

0.823982869379015

In [64]:
# with body text
import pandas as pd

y_actual = pd.Series(df.SPAM, name='Actual')
y_predicted = pd.Series(df.body_cluster_mod, name='Predicted')

from sklearn import metrics
metrics.confusion_matrix(df.SPAM, df.body_cluster_mod)

array([[6703,  240],
       [1404,  993]], dtype=int64)

In [65]:
# with body text
import pandas as pd

y_actual = pd.Series(df.SPAM, name='Actual')
y_predicted = pd.Series(df.body_cluster_mod, name='Predicted')

print(pd.crosstab(y_actual, y_predicted))
#print accuracy of model
print("Accuracy")
print(metrics.accuracy_score(y_actual, y_predicted))


#print precision value of model
print("Precision")
print(metrics.precision_score(y_actual, y_predicted))



#print recall value of model
print("Recall")
print(metrics.recall_score(y_actual, y_predicted))


Predicted     0    1
Actual              
0          6703  240
1          1404  993
Accuracy
0.823982869379015
Precision
0.805352798053528
Recall
0.4142678347934919


In [66]:
from sklearn.metrics import classification_report
print(classification_report(df.SPAM,df.body_cluster_mod))

              precision    recall  f1-score   support

           0       0.83      0.97      0.89      6943
           1       0.81      0.41      0.55      2397

    accuracy                           0.82      9340
   macro avg       0.82      0.69      0.72      9340
weighted avg       0.82      0.82      0.80      9340



## Data Summary <a id='data-summary'>

## Missing Values <a id='missing-values'>



In [None]:
# Rows where body couldn't be read in = 'Error(html)'
df.loc[df['body']=='Error(html)']

# All spam emails

In [None]:
# Count of body read Errors
df.loc[df['body']=='Error(html)'].shape[0]

In [None]:
# Look at file example with Error(html)
with open(files[4][1], 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
print(msg)

## Feature Removal <a id='feature-removal'>

## Exploratory Data Analysis (EDA) <a id='eda'>

### 

### Feature Collinearity <a id='feature-collinearity'>


### Feature Outliers 
 

## Assumptions <a id='assumptions'>

# Model Preparations <a id='model-preparations'/>

What methods did you use (or not) to solve the problem?

Why are the methods you chose appropriate given the business objective?

How did you decide your approach was useful?  If more than one method, which one was better or why are each better or not?

What evaluation smetrics are most useful given the problem is a binary classification (ex. accuracy, f1-score, precision, recall AUC, etc)?



## Sampling & Scaling Data <a id='sampling-scaling-data' />

## Proposed Method <a id='proposed-metrics' />

## Evaluation Metrics <a id='evaluation-metrics' />

### Baseline Model

## Feature Selection <a id='feature-selection' />

# Model Building & Evaluations <a id='model-building'/>

Primary task is buiding a logistic regression to predict hospital readmittances.

How did you handle missing values?

Specify your sampling methodology

Set up your models - highlights of any important parameters

Analysis of your models performance

## Sampling Methodology <a id='sampling-methodology'/>

#### Per the code above we used a 70/30 train test sample split

## Model's Performance Analysis <a id='performance-analysis'/>

# Model Interpretability & Explainability <a id='model-explanation'>

Which variables were more important and why?

How did you come to the conclusion these variables were important how how should the audience interpret this?

## Examining Feature Importance <a id='examining-feature-importance'/>

# Conclusion <a id='conclusion'>

What are you proposing to the audience with your models and why?

How should your audience interpret your conclusion and whwere should they go moving forward on the topic?

What other approaches do you recommend exploring?

Bring it all home!

### Final Model Proposal <a id='final-model-proposal'/>

### Future Considerations and Model Enhancements <a id='model-enhancements'/>

### Alternative Modeling Approaches <a id='alternative-modeling-approaches'>