# Case Study 3 - Email Spam

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Team Strategy

Emails are read in.  Kept: From, Subject, Body.
- Remove \n with regex
- Remove other non-alphabetic characters - keep the counts?
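
The cleaning steps listed above could be sketched as follows. This is a minimal illustration, not the team's final implementation: `clean_text` is a hypothetical helper, and returning the number of removed characters is one assumed way to "keep the counts".

```python
import re

def clean_text(text):
    """Replace newlines/tabs with spaces and strip non-alphabetic characters.

    Returns the cleaned text and the number of characters removed, in case
    that count is kept as a feature.
    """
    # Newlines and tabs become single spaces
    no_breaks = re.sub(r'[\n\t]+', ' ', text)
    # Drop everything that is not a letter or a space, counting the removals
    cleaned, n_removed = re.subn(r'[^A-Za-z ]', '', no_breaks)
    # Collapse the runs of spaces left behind
    cleaned = re.sub(r' +', ' ', cleaned).strip()
    return cleaned, n_removed

print(clean_text("Win $100,000\nNOW!!!"))
```

Whether digits and punctuation should be dropped entirely or summarized as counts is still an open question in the list above.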

Feature Extraction:
1. Vectorization - TF-IDF (removes stop words)

If we have time:

2. Create 'trusted' and 'spam' email address books to filter spam - is this out of scope?

Model:
Classification with Naive Bayes - Amber, Jorge
1. Subject Only = Baseline
2. Body (+Subject?)

Clustering:
DBSCAN with cosine distance for NLP - Andrew, Paritosh

# Content
* [Business Understanding](#business-understanding)
    - [Scope](#scope)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Feature Removal](#feature-removal)
    - [Exploratory Data Analysis (EDA)](#eda)
    - [Assumptions](#assumptions)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-metrics)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Sampling Methodology](#sampling-methodology)
    - [Model](#model)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations and Model Enhancements](#model-enhancements)
    - [Alternative Modeling Approaches](#alternative-modeling-approaches)

# Business Understanding & Executive Summary <a id='business-understanding'/>

What are we trying to solve for and why is it important?


### Scope <a id='scope'/>


### Introduction <a id='introduction'/>


### Methods <a id='methods'/>
 
 
### Results <a id='results'/>
 

# Data Evaluation <a id='data-evaluation'/>
    

Summarize data being used?

Are there missing values?

Which variables are needed and which are not?

What assumptions or conclusions are you drawing about your data?

In [2]:
# standard libraries
import pandas as pd
import numpy as np
import os
from IPython.display import Image

# email
from email import policy
from email.parser import BytesParser

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate

# data pre-processing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer

# prediction models
from sklearn.cluster import DBSCAN

# import warnings filter
'''import warnings
warnings.filterwarnings('ignore')
from warnings import simplefilter 
simplefilter(action='ignore', category=FutureWarning)'''



## Loading Data <a id='loading-data'/>

In [3]:
# Specify your local directory
email_dir = 'C:\\Users\\allep\\QTW_Projects\\SpamAssasinMessages (CS3)'
os.chdir(email_dir)

In [4]:
# Get the list of folder names
folders = os.listdir()
folders

['easy_ham', 'easy_ham_2', 'hard_ham', 'spam', 'spam_2']

In [5]:
# Get the file names in each folder (list of lists)
files = [os.listdir(folder) for folder in folders]

# Collect one list of row dicts per folder and build the dataframes at the end
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = [[] for _ in folders]

for i, folder in enumerate(folders):
    for j, name in enumerate(files[i]):
        # Add folder path to file name
        files[i][j] = os.path.join(folder, name)

        # Parse the email to extract 'from', 'subject' and 'body'
        with open(files[i][j], 'rb') as fp:
            msg = BytesParser(policy=policy.default).parse(fp)

        # Error checking when reading in 'body' for some html-based emails from spam folders:
        # flag the body as 'Error(html)' instead of stopping the loop
        try:
            body = msg.get_body(preferencelist=('plain', 'html')).get_content()
        except Exception:
            body = 'Error(html)'

        rows[i].append({'folder': folder, 'from': msg['from'], 'subject': msg['subject'], 'body': body})

# Create a list of dataframes for all of the folders
emails = [pd.DataFrame(r, columns=['folder', 'from', 'subject', 'body']) for r in rows]

In [6]:
# Emails per folder
print("# files in folders:", [len(i) for i in files])
print("# emails read in  :", [i.shape[0] for i in emails])

# Total emails
print( "\n# total emails =", sum([len(i) for i in files]) )

# files in folders: [5052, 1401, 501, 1001, 1398]
# emails read in  : [5052, 1401, 501, 1001, 1398]

# total emails = 9353


In [7]:
# Create single dataframe from all folders
df = pd.concat( [emails[i] for i in range(0, len(emails))], axis=0)

#  Keep the indices from the folders
df = df.reset_index() 

# create response column from folder names
spam = [(i=='spam' or i=='spam_2') for i in df['folder']]
df = pd.concat([df, pd.Series(spam).astype(int)], axis=1)

df.columns = ['folder_idx', 'folder', 'from', 'subject', 'body','spam']

df.shape

(9353, 6)

In [8]:
df.head()

Unnamed: 0,folder_idx,folder,from,subject,body,spam
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05...",0
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0


## Data Summary <a id='data-summary'/>

## Missing Values <a id='missing-values'/>



In [9]:
# Rows where body couldn't be read in = 'Error(html)'
df.loc[df['body']=='Error(html)']

# All spam emails

Unnamed: 0,folder_idx,folder,from,subject,body,spam
7191,237,spam,eb@via.ecp.fr,"Over $100,000 Per Year Possible On The Net! N...",Error(html),1
7303,349,spam,"""Books@Books""@BlackRealityPublishing.com","Free Excerpt; Baby Makers, Loser Choosers, & ...",Error(html),1
7379,425,spam,zzzz@netscape.net,Collect Your Money! Time:1:30:33 AM,Error(html),1
7466,512,spam,Affordable Computer Supply <InkjetDeals@acsmsu...,Printer Cartridges as low as $1.21 each!,Error(html),1
7679,725,spam,eb@via.ecp.fr,"Over $100,000 Per Year Possible On The Net! N...",Error(html),1
7782,828,spam,"""Books@Books""@BlackRealityPublishing.com","Free Excerpt; Baby Makers, Loser Choosers, & ...",Error(html),1
7852,898,spam,zzzz@netscape.net,Collect Your Money! Time:1:30:33 AM,Error(html),1
7931,977,spam,Affordable Computer Supply <InkjetDeals@acsmsu...,Printer Cartridges as low as $1.21 each!,Error(html),1
7956,1,spam_2,lmrn@mailexcite.com,"Real Protection, Stun Guns! Free Shipping! Ti...",Error(html),1
7957,2,spam_2,amknight@mailexcite.com,"New Improved Fat Burners, Now With TV Fat Abso...",Error(html),1


In [10]:
# Count of body read Errors
df.loc[df['body']=='Error(html)'].shape[0]

33

In [11]:
# Look at file example with Error(html)
with open(files[4][1], 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
print(msg)

Return-Path: merchantsworld2001@juno.com
Delivery-Date: Mon May 13 04:46:13 2002
Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4D3kCe15097 for
    <jm@jmason.org>; Mon, 13 May 2002 04:46:12 +0100
Received: from 203.129.205.5.205.129.203.in-addr.arpa ([203.129.205.5]) by
    mandark.labs.netnoteinc.com (8.11.2/8.11.2) with SMTP id g4D3k2D12605 for
    <jm@netnoteinc.com>; Mon, 13 May 2002 04:46:04 +0100
Received: from html (unverified [207.95.174.49]) by
    203.129.205.5.205.129.203.in-addr.arpa (EMWAC SMTPRS 0.83) with SMTP id
    <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>; Mon, 13 May 2002
    09:04:46 +0530
Message-Id: <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>
From: lmrn@mailexcite.com
To: ranmoore@cybertime.net
Subject: Real Protection, Stun Guns!  Free Shipping! Time:2:01:35 PM
Date: Mon, 28 Jul 1980 14:01:35
MIME-Version: 1.0
X-Keywords: 
Content-Type: text/html; charset="DEFAULT"

<ht
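
The example header above declares `charset="DEFAULT"`, which is not a codec Python recognizes, so `get_content()` may fail while decoding such a body. The sketch below shows one possible fallback. It assumes that an unknown charset is the cause for at least some of the 33 errors, which has not been verified against the files. `read_body` is a hypothetical helper and Latin-1 is an assumed fallback encoding.

```python
from email import policy
from email.parser import BytesParser

def read_body(msg, fallback_encoding='latin-1'):
    """Return the plain/html body text, decoding raw bytes if get_content() fails."""
    part = msg.get_body(preferencelist=('plain', 'html'))
    if part is None:
        return 'Error(html)'  # no text body found at all
    try:
        return part.get_content()
    except (LookupError, UnicodeError):
        # Unknown or wrong charset: decode the raw payload with an assumed encoding
        raw = part.get_payload(decode=True)
        if raw is None:
            return 'Error(html)'
        return raw.decode(fallback_encoding, errors='replace')

# Hypothetical message that mimics the header shown above
raw_email = (b'From: sender@example.com\n'
             b'Subject: test\n'
             b'MIME-Version: 1.0\n'
             b'Content-Type: text/html; charset="DEFAULT"\n'
             b'\n'
             b'<html><body>Free Shipping!</body></html>\n')
msg = BytesParser(policy=policy.default).parsebytes(raw_email)
print(read_body(msg))
```

If this recovers the bodies, the resulting text would still contain HTML tags that need stripping before vectorization.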

### Subject is missing (None) or an empty string for some rows.  Either case throws an error during vectorization, so both are replaced with the string 'No Subject'.

In [12]:
df.tail()

Unnamed: 0,folder_idx,folder,from,subject,body,spam
9348,1393,spam_2,IQ - TBA <tba@insiq.us>,Preferred Non-Smoker Rates for Smokers,\t Preferred Non-Smoker\n \t\n Just what the ...,1
9349,1394,spam_2,Mike <raye@yahoo.lv>,"How to get 10,000 FREE hits per day to any web...","Dear Subscriber,\n\nIf I could show you a way ...",1
9350,1395,spam_2,"""Mr. Clean"" <cweqx@dialix.oz.au>",Cannabis Difference,****Mid-Summer Customer Appreciation SALE!****...,1
9351,1396,spam_2,"""wilsonkamela400@netscape.net"" <wilsonkamela50...",[ILUG] WILSON KAMELA,ATTN:SIR/MADAN \n\n ...,1
9352,1397,spam_2,,,mv 00001.317e78fa8ee2f54cd4890fdc09ba8176 0000...,1


In [13]:
subject_none = [ x is None for x in df['subject'] ]
pd.Series(subject_none).value_counts()

False    9341
True       12
dtype: int64

In [14]:
df.loc[ df['subject']=='', :]

Unnamed: 0,folder_idx,folder,from,subject,body,spam
7102,148,spam,Donald Bae <donaldbae@purplehotel.com>,,"\n������ Worldwide*\nGreat Restaurants, Shop...",1
7500,546,spam,Mary's Store <removeme@marysstore.com>,,<TABLE border=0 cellPadding=0 cellSpacing=10 h...,1
8015,60,spam_2,"DONT@cpprimaonline.com, PAY@cpprimaonline.com,...",,"\n<html>\n\n<head>\n\n<body background=""glabkg...",1
8052,97,spam_2,winafreevacation@hotmail.com,,\nYou have been specially selected to qualify ...,1
8091,136,spam_2,jasper5645@yahoo.com,,\nDATA\nMessage-ID: <MnAtWBN0tB8txc.Hc6wd_JBTg...,1
8100,145,spam_2,travelincentives@aol.com,,\nYou have been specially selected to qualify ...,1
8163,208,spam_2,Danny Creech <southerngent23@hotmail.com>,,\nTo: \nFrom: \nSubject: Hello\nMy name is Dan...,1
8248,293,spam_2,JOSEPH EDWARD <jos_ed@mail.com>,,\nATTN:\n\nI PRESUME THIS MAIL WILL NOT BE A S...,1
8454,499,spam_2,Margaret <gyyyyy@public.ayptt.ha.cn>,,"Dear Sirs,\nWe know your esteemed company in b...",1
8455,500,spam_2,Margaret <gyyyyy@public.ayptt.ha.cn>,,"Dear Sirs,\nWe know your esteemed company in b...",1


In [15]:
df.loc[ df['subject']=='', 'subject'] = 'No Subject'

In [16]:
df.loc[ df['subject']=='', :]

Unnamed: 0,folder_idx,folder,from,subject,body,spam


In [17]:
df.loc[ df['subject'].isna(), :]

Unnamed: 0,folder_idx,folder,from,subject,body,spam
5051,5051,easy_ham,,,mv 00001.7c53336b37003a9286aba55d2945844c 0000...,0
6329,1277,easy_ham_2,mail <mail@dogma.slashnull.org>,,Problem with spamtrap\n/home/yyyy/lib/spamtrap...,0
6330,1278,easy_ham_2,mail <mail@dogma.slashnull.org>,,Problem with spamtrap\n/home/yyyy/lib/spamtrap...,0
6331,1279,easy_ham_2,mail <mail@dogma.slashnull.org>,,Problem with spamtrap\n/home/yyyy/lib/spamtrap...,0
6383,1331,easy_ham_2,nobody@sonic.spamtraps.taint.org,,Problem with spamtrap\nCould not lock /home/yy...,0
6407,1355,easy_ham_2,nobody@sonic.spamtraps.taint.org,,Problem with spamtrap\nCould not lock /home/yy...,0
6452,1400,easy_ham_2,,,mv 00001.d4365609129eef855bd5da583c90552b 0000...,0
6644,191,hard_ham,Vincent Chin <nukiez@hotmail.com>,,Ripped from\n\nhttp://www.hpl.hp.com/personal/...,0
6861,408,hard_ham,Vincent Chin <nukiez@hotmail.com>,,Ripped from\n\nhttp://www.hpl.hp.com/personal/...,0
6953,500,hard_ham,,,mv 00001.7c7d6921e671bbe18ebb5f893cd9bb35 0000...,0


In [18]:
df.loc[ df['subject'].isna(), 'subject'] = 'No Subject'

In [19]:
df.loc[ df['subject'].isna(), :]

Unnamed: 0,folder_idx,folder,from,subject,body,spam
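
The two replacement steps above (empty strings, then missing values) can also be written as one chained expression. A minimal sketch on a toy Series, where the example subjects are hypothetical stand-ins for `df['subject']`:

```python
import pandas as pd

# Toy subjects: one normal, one empty string, one missing
subjects = pd.Series(['Re: New Sequences Window', '', None])

# Fill missing values first, then replace exact empty strings
cleaned = subjects.fillna('No Subject').replace('', 'No Subject')
print(cleaned.tolist())
```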


## Feature Removal <a id='feature-removal'/>

Replace new lines (`\n`) and tabs (`\t`) in the email body with spaces, and drop apostrophes.

In [20]:
df.loc[0, 'body']

'    Date:        Wed, 21 Aug 2002 10:54:46 -0500\n    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>\n    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>\n\n\n  | I can\'t reproduce this error.\n\nFor me it is very repeatable... (like every time, without fail).\n\nThis is the debug log of the pick happening ...\n\n18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}\n18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury\n18:19:04 Ftoc_PickMsgs {{1 hit}}\n18:19:04 Marking 1 hits\n18:19:04 tkerror: syntax error in expression "int ...\n\nNote, if I run the pick command by hand ...\n\ndelta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury\n1 hit\n\nThat\'s where the "1 hit" comes from (obviously).  The version of nmh I\'m\nusing is ...\n\ndelta$ pick -version\npick -- nmh-1.0.4 [compiled on fuchsia.c

In [21]:
df['body'] = [ x.replace("\n"," ") for x in df['body'] ]
df['body'] = [ x.replace("\t"," ") for x in df['body'] ]
df['body'] = [ x.replace("\'","") for x in df['body'] ]

In [22]:
df.loc[0, 'body']

'    Date:        Wed, 21 Aug 2002 10:54:46 -0500     From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>     Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>     | I cant reproduce this error.  For me it is very repeatable... (like every time, without fail).  This is the debug log of the pick happening ...  18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury} 18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury 18:19:04 Ftoc_PickMsgs {{1 hit}} 18:19:04 Marking 1 hits 18:19:04 tkerror: syntax error in expression "int ...  Note, if I run the pick command by hand ...  delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury 1 hit  Thats where the "1 hit" comes from (obviously).  The version of nmh Im using is ...  delta$ pick -version pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:5

## Exploratory Data Analysis (EDA) <a id='eda'/>

### Class Balance

In [23]:
df['spam'].value_counts()

0    6954
1    2399
Name: spam, dtype: int64
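
The counts above translate into class shares as sketched below. The counts are copied from the output. A model that always predicts ham would be right about as often as the ham share, so that share is a floor any classifier should beat.

```python
import pandas as pd

# Class counts copied from the value_counts() output above (0 = ham, 1 = spam)
counts = pd.Series({0: 6954, 1: 2399}, name='spam')

# Share of each class; the ham share equals the accuracy of always predicting ham
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly three quarters of the emails are ham and one quarter are spam, so the classes are imbalanced but not severely.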

### Feature Collinearity <a id='feature-collinearity'/>


### Feature Outliers 
 

## Assumptions <a id='assumptions'/>

# Model Preparations <a id='model-preparations'/>

What methods did you use (or not) to solve the problem?

Why are the methods you chose appropriate given the business objective?

How did you decide your approach was useful?  If more than one method, which one was better or why are each better or not?

What evaluation metrics are most useful given the problem is a binary classification (e.g., accuracy, F1-score, precision, recall, AUC)?
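
As a reference for the metrics question above, here is a minimal sketch of how accuracy, precision, recall and F1 are computed with spam (1) as the positive class. `binary_metrics` is a hypothetical helper and the labels are toy values, not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels (hypothetical): 3 spam and 5 ham emails
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

For a spam filter, precision reflects how often flagged mail is really spam and recall reflects how much spam gets caught. Which one matters more depends on how costly it is to lose a legitimate email.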



## Sampling & Scaling Data <a id='sampling-scaling-data' />

## Proposed Method <a id='proposed-metrics' />

70/30 training/test split

In [24]:
def split_dependant_and_independant_variables(df: pd.DataFrame, y_var: str):
    X = df.copy()
    y = X[y_var]
    X = X.drop([y_var], axis=1)
    return X, y

In [25]:
X, y = split_dependant_and_independant_variables(df, 'spam')

In [26]:
X.head()

Unnamed: 0,folder_idx,folder,from,subject,body
0,0,easy_ham,Robert Elz <kre@munnari.OZ.AU>,Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -05..."
1,1,easy_ham,Steve Burt <Steve_Burt@cursor-system.com>,[zzzzteana] RE: Alexander,"Martin A posted: Tassos Papadopoulos, the Gree..."
2,2,easy_ham,Tim Chapman <timc@2ubh.com>,[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow Thursday A...
3,3,easy_ham,Monty Solomon <monty@roscom.com>,[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Wont Die Already the mo...
4,4,easy_ham,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi..."


In [27]:
X = X.drop('folder_idx', axis=1)

In [28]:
X.index

RangeIndex(start=0, stop=9353, step=1)

In [29]:
y.index

RangeIndex(start=0, stop=9353, step=1)

In [101]:
def shuffle_split(X, y, test_size, random_state):
    # Single stratified shuffle split so train and test keep the same spam/ham ratio
    stratified_shuffle_split = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_index, test_index = next(stratified_shuffle_split.split(X, y))
    # The split returns positions, so index both X and y with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test

def vectorize_data(X_train, X_test):
    # Fit TF-IDF on the training text only, then apply the same vocabulary to the test text
    tfidf = TfidfVectorizer(stop_words='english', strip_accents='unicode')
    X_train_vect = tfidf.fit_transform(X_train)
    X_test_vect = tfidf.transform(X_test)
    return X_train_vect, X_test_vect

In [102]:
X_train, X_test, y_train, y_test = shuffle_split(X['subject'], y, test_size=0.3, random_state=12343)
X_train_vect, X_test_vect = vectorize_data(X_train, X_test)

## Clustering
Cluster the TF-IDF subject vectors using DBSCAN with cosine distance, a common choice for NLP applications.  Then look for descriptors for the clusters, like business vs. personal emails, IT emails, etc.

In [103]:
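# Sweep epsilon downward from 0.9 in steps of 0.05 and count the labels found
dbscan = DBSCAN(metric='cosine', min_samples=100)
eps_clusters = []
dbscan.eps = 0.9 

for i in range(0,5):
    clustering = dbscan.fit_predict(X_train_vect)
    # Note: this count includes the noise label (-1) whenever noise points are present
    num_clusters = len( pd.Series( clustering ).unique() )
    eps_clusters.append([dbscan.eps, num_clusters])
    dbscan.eps = dbscan.eps - 0.05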

pd.DataFrame(eps_clusters, columns = ['epsilon', '# clusters'])

Unnamed: 0,epsilon,# clusters
0,0.9,2
1,0.85,4
2,0.8,6
3,0.75,6
4,0.7,7


In [129]:
dbscan.eps = 0.8
clusters = dbscan.fit_predict(X_train_vect)
pd.Series( clusters ).value_counts()

-1    4319
 0    1865
 4     135
 1     105
 2     100
 3      23
dtype: int64
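
DBSCAN labels noise points as -1, so the label counts above contain one entry that is not a cluster. A minimal sketch that separates the two, with the label array rebuilt from the counts shown above:

```python
import numpy as np

# Labels rebuilt from the value_counts() output above (-1 is DBSCAN's noise label)
label_counts = {-1: 4319, 0: 1865, 4: 135, 1: 105, 2: 100, 3: 23}
labels = np.concatenate([np.full(n, label) for label, n in label_counts.items()])

# Number of real clusters, excluding noise
n_clusters = len(set(labels.tolist()) - {-1})
# Fraction of training subjects left unclustered
noise_share = float(np.mean(labels == -1))
print(n_clusters, round(noise_share, 3))
```

At eps = 0.8 this gives five clusters, with roughly two thirds of the training subjects treated as noise.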

In [130]:
clusters = pd.Series(clusters)
clusters.index = X_train.index
subject_clusters = pd.concat([clusters, X_train], axis=1)
subject_clusters.columns = ['cluster','subject']

In [140]:
for i in range( 0, max(clusters.unique())+1 ):
    print( subject_clusters.loc[subject_clusters['cluster']==i].head(10) )

# Cluster 0 = IT & Tech
# Cluster 1 = Pain (ouch, hurts)
# Cluster 2 = Money & Free
# Cluster 3 = CDs
# cluster 4 = [zzzzteana]

      cluster                                            subject
1371        0                     Re: problems with apt/synaptic
2075        0                                P2P slogans for EFF
44          0                              Re: The case for spam
8520        0  Secretly Record all internet activity on any c...
1968        0                  [use Perl] Stories for 2002-09-17
6760        0         What's facing FBI's new CIO? (Tech Update)
5214        0                    Re: [ILUG] Fwd: Linux Beer Hike
327         0                              Re: The case for spam
5269        0                  Re: [ILUG] stupid pics of the day
4383        0                [use Perl] Headlines for 2002-09-10
      cluster                                            subject
574         1      Re[3]: Selling Wedded Bliss (was Re: Ouch...)
492         1         Re: Selling Wedded Bliss (was Re: Ouch...)
5780        1  Re: "Ouch. Ouch. Ouch. Ouch. Ouch...."(was Re:...
512         1         Re:
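
To look for cluster descriptors more systematically than reading the first ten subjects, one option is to list the terms with the highest mean TF-IDF weight in each cluster. The sketch below assumes that approach. `top_terms_per_cluster` is a hypothetical helper, and the subjects and labels are toy values rather than the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_per_cluster(texts, labels, n_terms=3):
    """Return the n_terms highest mean TF-IDF terms for each non-noise cluster."""
    tfidf = TfidfVectorizer(stop_words='english')
    matrix = tfidf.fit_transform(texts)
    # Column index -> term, built from vocabulary_ so it works across scikit-learn versions
    terms = np.array(sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get))
    labels = np.asarray(labels)
    result = {}
    for label in sorted(set(labels.tolist()) - {-1}):  # skip DBSCAN noise (-1)
        mean_weights = np.asarray(matrix[labels == label].mean(axis=0)).ravel()
        top = mean_weights.argsort()[::-1][:n_terms]
        result[label] = terms[top].tolist()
    return result

# Toy subjects and cluster labels (hypothetical)
toy_subjects = ["linux kernel patch", "linux kernel update",
                "free money offer", "free money now", "hello"]
toy_labels = [0, 0, 1, 1, -1]
toy_terms = top_terms_per_cluster(toy_subjects, toy_labels, n_terms=2)
print(toy_terms)
```

In the notebook, the same idea would be applied to `X_train` and `clusters`, ideally reusing the fitted vectorizer rather than fitting a new one.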

## Evaluation Metrics <a id='evaluation-metrics' />

### Baseline Model

Try to predict spam/ham using only the email subjects.
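
A minimal sketch of what the subject-only baseline could look like, following the team strategy of Naive Bayes on TF-IDF features. The choice of `MultinomialNB` as the Naive Bayes variant is an assumption, and the subjects below are toy examples rather than the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy subjects and labels (hypothetical); 1 = spam, 0 = ham
train_subjects = ["free money now", "win free prize", "cheap money offer",
                  "meeting agenda tomorrow", "project status update", "lunch tomorrow"]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF settings mirror vectorize_data above; the pipeline fits the vectorizer on training text only
baseline = make_pipeline(
    TfidfVectorizer(stop_words='english', strip_accents='unicode'),
    MultinomialNB(),
)
baseline.fit(train_subjects, train_labels)

predictions = baseline.predict(["free prize offer", "project meeting"])
print(predictions.tolist())
```

With the real data, the pipeline would be fit on `X_train` and `y_train` and scored on `X_test` and `y_test` from the split above.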

## Feature Selection <a id='feature-selection' />

# Model Building & Evaluations <a id='model-building'/>

The primary task is building a classifier (Naive Bayes, per the team strategy) to predict whether an email is spam or ham.

How did you handle missing values?

Specify your sampling methodology

Set up your models - highlights of any important parameters

Analysis of your models performance

## Sampling Methodology <a id='sampling-methodology'/>

#### Per the code above, we used a stratified 70/30 train/test split

## Model's Performance Analysis <a id='performance-analysis'/>

# Model Interpretability & Explainability <a id='model-explanation'/>

Which variables were more important and why?

How did you come to the conclusion these variables were important, and how should the audience interpret this?

## Examining Feature Importance <a id='examining-feature-importance'/>

# Conclusion <a id='conclusion'/>

What are you proposing to the audience with your models and why?

How should your audience interpret your conclusion, and where should they go moving forward on the topic?

What other approaches do you recommend exploring?

Bring it all home!

### Final Model Proposal <a id='final-model-proposal'/>

### Future Considerations and Model Enhancements <a id='model-enhancements'/>

### Alternative Modeling Approaches <a id='alternative-modeling-approaches'/>