In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Overview  
This notebook sourcecode is a reproduction of Aurelien Geron's sourcecode for a Spam filter. He is the author of Hands-On Machnine Learning with Scikit-Learn, Keras, & Tensorflow, the book I'm using to study ML. This notebook contains my notes and commentary explaining the sourcecode.
    The spam filter works by creating a vocabulary of the most commonly occuring words and creates a vector of counts of how many times those words occur in each email. That vector of counts is then fed into a logistic regression model.  
    
1. Fetch and Load Data
2. Preview Data
3. Split the Data into Train and Test Sets
4. Feed Data into a Preparation Pipeline
5. Train the Model
6. Predict the Test Data and Score

# Fetch and Load Data  
Specify the local download folders and server url containing the datasets. The download folders are created if they dont exist. The dataset files are checked for existence and downloaded if not. The dataset file is a compressed archive .tar.bz2. The files are extracted and their filenames are loaded into separate lists according to whether they're spam or ham. The contents are then parsed from each file into an email structure.


In [4]:
'''
The tarfile module allows us to decompress and unpack the dataset. The .tar file is an archive, the name comess from
"tape archive". The .tar file is compressed with Bzip2, hence the .bz2 extension. There are many compression algorithms. Gzip is
another common program for compressing files. 

six.moves: Six provides simple utilities for wrapping over differences between Python 2 and Python 3. It is intended to support codebases
that work on both Python 2 and 3 without modification. six consists of only one Python file, so it is painless to copy into a project.
'''
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")  
# os.path.join joins the paths intelligently according to the OS. SPAM_PATH is the directory where the dataset is held locally

In [5]:
#SPAM_PATH is the local parent directory for our dataset

def fetch_spam_data(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    #check if directory exists, create if doesn't
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    #check if dataset files exist. if does not exist, download and extract as "ham.tar.bz2" and "spam.tar.bz2"
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", SPAM_URL)):
        #path is the local filename with its path
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=SPAM_PATH)
        tar_bz2_file.close()
        
fetch_spam_data()

In [6]:
#specify the directories containing spam's and ham's archive contents
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")

#get file with filename of greater that 20 character length.
ham_filenames = [filename for filename in sorted(os.listdir(HAM_DIR)) if len(filename) > 20]
spam_filenames = [filename for filename in sorted(os.listdir(SPAM_DIR)) if len(filename) > 20]

print("Sample filenames:\n", ham_filenames[:5], "\n", spam_filenames[:5])
print("There are", len(spam_filenames) ,"spam files and",  len(ham_filenames) ,"ham files.\n")

Sample filenames:
 ['00001.7c53336b37003a9286aba55d2945844c', '00002.9c4069e25e1ef370c078db7ee85ff9ac', '00003.860e3c3cee1b42ead714c5c874fe25f7', '00004.864220c5b6930b209cc287c361c99af1', '00005.bf27cdeaf0b8c4647ecd61b1d09da613'] 
 ['00001.7848dde101aa985090474a91ec93fcf0', '00002.d94f1b97e48ed3b553b3508d116e6a09', '00003.2ee33bc6eacdb11f38d052c44819ba6c', '00004.eac8de8d759b7e74154f142194282724', '00005.57696a39d7d84318ce497886896bf90d']
There are 500 spam files and 2500 ham files.



## Load the Data

Use Python's **email** module to parse email files. According to documentation, the **email policy** should be specified when creating a BytesParse instance. The 'email policy' seems to specify the structure of the email so that the parsing method maps the file contents to some certain object properties.  

Custom functions to load emails and parse emails are defined. Since spam and ham emails are in different folder, one of the parameters has to specify
which folder to look into. The two other parameters the function takes is a specific filename and the parent folder of "/spam/" and "/easy_ham".  

The spam_emails and ham_emails consists of EmailMessage objects returned by the load_email function. Below are two of EmailMessage's methods.
* get_content()  
returns a string of the Message minus the header
* get_payload()  
Return the current payload, which will be a list of Message objects when is_multipart() is True, or a string when is_multipart() is False. If the payload is a list and you mutate the list object, you modify the message’s payload in place.
* get_content_type()  
Return the message’s content type. The returned string is coerced to lower case of the form maintype/subtype. If there was no Content-Type header in the message the default type as given by get_default_type() will be returned. Since according to RFC 2045, messages always have a default type, get_content_type() will always return a value.

In [7]:
import email
import email.policy

def load_email(is_spam, filename, parent_dir=SPAM_PATH): #the last parameter is defaulted SPAM_PATH
    #determine which folder to look in, spam or easy_ham
    folder = "spam" if is_spam else "easy_ham"
    #open the file specified by filename, using 'with' to free resources after returning. flag open() to read-only and binary modes
    with open(os.path.join(parent_dir, folder, filename), 'rb') as f:
        #specify email policy
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
    
#load emails
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
        

# Preview Data  
Look at a sample email in its entirety and another sample email's contents stripped of leading and trailing whitespace characters. Some emails are multipart, with images and attachments (which can have their own attachments). We write a recursive function to get the email structure. 

In [8]:
display(type(ham_emails[1]))
print(ham_emails[1])

email.message.EmailMessage

Return-Path: <Steve_Burt@cursor-system.com>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id BE12E43C34
	for <zzzz@localhost>; Thu, 22 Aug 2002 07:46:38 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:46:38 +0100 (IST)
Received: from n20.grp.scd.yahoo.com (n20.grp.scd.yahoo.com    [66.218.66.76])
 by dogma.slashnull.org (8.11.6/8.11.6) with SMTP id    g7MBkTZ05087 for
 <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 12:46:29 +0100
X-Egroups-Return: =?utf-8?q?sentto-2242572-52726-1030016790-zzzz=3Dspamassas?=
 =?utf-8?q?sin=2Etaint=2Eorg=40returns=2Egroups=2Eyahoo=2Ecom?=
Received: from [66.218.67.196] by n20.grp.scd.yahoo.com with NNFMP;
    22 Aug 2002 11:46:30 -0000
X-Sender: steve.burt@cursor-system.com
X-Apparently-To: zzzzteana@yahoogroups.com
Received: (EGP: mail-8_1_0_1); 22 Aug 2002 11:4

In [9]:
print(spam_emails[50].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4592gPjt0-916msGW0934HwlS5-965Tqzv4189Rjvx0-174yaja0756SEjNl56


## Inspect Email Structure  
The function handles a string, an EmailMessage object, or list of EmailMessage objects. A string argument results in that string being returned. An EmailMessage argument results in the EmailMessage's content type being returned. A list will lead to recursion. 

In [10]:
'''
isinstance(object, classinfo)
Return True if the object argument is an instance of the classinfo argument, or of a (direct, indirect or virtual) subclass thereof. 
If object is not an object of the given type, the function always returns False. If classinfo is a tuple of type objects (or recursively, other such tuples), 
return True if object is an instance of any of the types. If classinfo is not a type or tuple of types and such tuples, a TypeError exception is raised.
'''

def get_email_structure(email):
    #check if 'email' is a string. 
    if isinstance(email, str):
        return email
    #otherwise get the message payload
    content = email.get_payload()
    if isinstance(content, list):
        return "multipart({})".format(",".join([get_email_structure(sub_email) for sub_email in content]))
    else:
        return email.get_content_type()
    
#example email from the dataset
get_email_structure(spam_emails[91])

'multipart(text/html)'

In [11]:
'''
A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as 
dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other 
languages
'''
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [12]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain,text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain,image/jpeg)', 3),
 ('multipart(text/html,application/octet-stream)', 2),
 ('multipart(text/plain,application/octet-stream)', 1),
 ('multipart(text/html,text/plain)', 1),
 ('multipart(multipart(text/html),application/octet-stream,image/jpeg)', 1),
 ('multipart(multipart(text/plain,text/html),image/gif)', 1),
 ('multipart/alternative', 1)]

# Split Data into Train and Test Sets  
Merge the spam and ham email into a single numpy array. Create labels by constructing a vector of 0's and 1's.

In [13]:
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails)
y = np.array([0] * len(spam_emails) + [1] * len(ham_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#specifying a "random state" seeds the random number generator with the same number for each run so that we'll get the same results each time the script is ran


# Build a Preprocessing Pipeline  
The preprocessing pipeline consists of two main parts, one counts the words in each email and the other turns the word counts into vectors. The following processes are part of the pipeline and are turned on/off by parameters flags, which are set to True by default.  
1. Strip Headers
2. Convert to Lowercase
3. Remove Punctuations
4. Replace URLs
5. Replace Numbers
6. Stem words (e.g. running -> run)  

The pipeline requires support functions, a email-to-text function which uses an html-to-text function, that we'll write before constructing the pipeline.

## Support Functions for Pipeline  

### HTML to Text and Email to Text  
This function gets rid of HTML tags, whitespace characters, newline characters, and unescapes html code into its characters (e.g. &lt; -> '<'). Hyperlinks tags are replaced with ' HYPERLINK '.

In [14]:
import re  #regular expression package
from html import unescape 

def html_to_text(html):
    text = re.sub('<head.*?>.*?</head>', '', html, flags= re.M | re.S | re.I) 
    text = re.sub('<a.*?>.*?</a>', ' HYPERLINK ', text, flags= re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text,  flags= re.M | re.S | re.I)
    text = re.sub(r'(\s*\n)+', '\n', text,  flags= re.M | re.S | re.I)
    return unescape(text)

In [15]:
'''
Examples of get_content_type() return values
         'multipart/signed': 68,
         'multipart/alternative': 9,
         'multipart/mixed': 10,
         'multipart/related': 3,
         'multipart/report': 2
'''

def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if not ctype in ("text/plain", "text/html"):
            continue
        try:
            #this part will result in a KeyError if 'part' is multipart 
            content = part.get_content()    
        except:
            content = str(part.get_payload())     
            #print(content)
        if ctype == 'text/plain':
            return content
        else:
            html = content
        
    if html:
        return html_to_text(html)

In [16]:
html_spam_email = [email for email in X_train[y_train == 1] if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_email[7]
print(html_to_text(sample_html_spam.get_content())[:1000], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi

### Porter Stemmer and URL Extractor

In [17]:
try:
    import nltk
    
    stemmer = nltk.PorterStemmer()
except ImportError:
    print("Error: Stemming requires the NLTK module.")
    stemmer = None

In [18]:
try:
    !pip install -q -U urlextract #issue a command to operating system
except ImportError:
    print("Couldn't install urlextract")


You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.[0m


In [19]:
try:
    import urlextract
    url_extractor = urlextract.URLExtract()
except ImportError:
    print("Error: replacing URLs requires the urlextract module.")
    url_extractor = None
        

## Constructing the Pipeline  
Construct the two transformers to be used to in the pipeline then put the pipeline together.

### Email to Word Counter  
Iterate through emails passed to the EmailToWordCounterTransformer. Convert each email into text. Process the text according to flags. Count the stemmed words. Put the counts into an array.

In [20]:
from sklearn.base import TransformerMixin, BaseEstimator

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_header=True, lower_case=True, remove_punctuation=True, replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_header=strip_header
        self.lower_case=lower_case
        self.remove_punctuation=remove_punctuation
        self.replace_urls=replace_urls
        self.replace_numbers=replace_numbers
        self.stemming=stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = [] #list to store word counts for each email
        for email in X:
            #convert email to text
            text = email_to_text(email) or "" 
            #process text accoring to flags. some flags aren't used :(
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                #create a list of URLs in the email and sort them according to length. To eliminate redundant URLs, we turn them into a set first.
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, ' URL ')
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))', ' NUMBER ', text) #replaces scientific formatted numbers
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            
            #create a Counter object initialized with the emails words
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

### Word Counter to Vector Transformer  
For each email, create a vector of counts of the n-most occuring words. The n most occuring words are found by sorting and slicing a dictionary of all words occuring in all emails. We construct and return a sparse matrix of the counts. Each row corresponds to an email, each column corresponds to the occurences of a vocubalry word in that email. The 0th element is the non-vocabulary words.


In [21]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocab_size=1000):
        self.vocab_size=vocab_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X: #X is a list of Counters
            for word, count in word_count.items():
                total_count[word] += min(count, 10) #cap the increment to 10
        most_common = total_count.most_common()[:self.vocab_size] #get the n most common words
        
        #create instance variables for inspection
        self.most_common_ = most_common
        #self.vocabulary_ is offset by 1 for assisting in creating the a matrix. The values of these key:value pairs
        #will correspond to the column indices in the upcoming matrix produced in transform()
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, counts in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(counts)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocab_size+1))
        

In [22]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)

vocab_transformer = WordCounterToVectorTransformer(vocab_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)

## Construct the Pipeline

In [23]:
from sklearn.pipeline import Pipeline

preprocessing_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer())
])

In [28]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train)

<html><xbody>
<hr width = "100%">
<center><font size = "+1" color =
"blue"><b>Over $100,000 The First Year, Most Of That While I Was Sleeping!  Will Work For Anyone, Anywhere!</font></b><p>
<table><Tr><td>

      <p align="center"><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b><font color="#000000" size="4" face="Arial">Imagine 
        The Perfect Business</font></b></font></p>
      <ul>
        <li><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>You 
          Can Run It From Home...Or From Anywhere With A Telephone Connection</b></font></li>
      </ul>
      <ul>
        <li><font face="Verdana, Arial, Helvetica, sans-serif" size="2" color="#000099"><b><font color="#000000">There 
          Is No Large Investment To Get Started</font></b></font></li>
      </ul>
      <ul>
        <li><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>You 
          Can Put Everything On Auto-Pilot</b></font></li>
      </ul>
      <ul>
        <li><font 

# Feed into a Logitistic Regression Model

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)
print(score, '\n', score.mean())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.892, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.879, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.904, total=   0.2s
[0.8925  0.87875 0.90375] 
 0.8916666666666666


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s finished


In [34]:
from sklearn.metrics import precision_score, recall_score

X_test_transformed = preprocessing_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
log_clf.fit(X_train_preprocessed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

PROFESSIONAL, EFFECTIVE DEBT COLLECTION SERVICES AVAILABLE

For the last seventeen years, National Credit Systems, Inc. has been providing
top flight debt collection services to over 15,000 businesses, institutions, and 
healthcare providers.

We charge only a low-flat fee (less than $20) per account, and all proceeds are 
forwarded to you directly -- not to your collections agency.

If you wish, we will report unpaid accounts to Experian (formerly TRW), 
TRANSUNION, and Equifax. There is no charge for this important service.

PLEASE LET US KNOW IF WE CAN BE OF SERVICE TO YOU.

Simply reply to debt_collectors@chmailnet.com with the following instructions 
in the Subject field - 

REMOVE  --  Please remove me from your mailing list.
EMAIL   --  Please email more information.
FAX     --  Please fax more information.
MAIL    --  Please snailmail more information.
CALL    --  Please have a representative call.

Indicate the best time to telephone and any necessary addresses and 
telephone/

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
