# Machine-Learning SPAM Classifier
- Labelled SPAM/HAM email instances taken from 'https://spamassassin.apache.org/old/publiccorpus/'
- Inspired by the contents of Aurelien Geron's Hands-On Machine Learning with Scikit-Learn & TensorFlow book.

#### Plan:
1. __Parse the emails__ using Python's email package & Beautiful Soup.
2. Use __NLP library spaCy__ to lemmatize the contents of the email & __create a most common lemma vocabulary__.
3. __Create vectors__ from the lemma vocabulary Counters.
4. Use __Scikit-Learn to train a selection of classification models__.
5. __Select the most promising models__ for __GridSearchCV__ (Logistic Regression & Random Forest Classifier).
6. __Evaluate the best model on the test set__.

## 1 Prepare Email Filenames

In [1]:
import os

HAM_PATH = '/home/keir/it/1_DS/projects/002_ML_SPAM_FILTER/datasets/20030228_ham0'
SPAM_PATH = '/home/keir/it/1_DS/projects/002_ML_SPAM_FILTER/datasets/20030228_spam0'

def generate_filenames(ham_path=HAM_PATH, spam_path=SPAM_PATH):
    # Return the sorted filenames found in each directory
    ham_filenames = sorted(os.listdir(ham_path))
    spam_filenames = sorted(os.listdir(spam_path))
    return ham_filenames, spam_filenames

ham_filenames, spam_filenames = generate_filenames()
print('Number of HAM:', len(ham_filenames))
print('Number of SPAM:', len(spam_filenames))

Number of HAM: 2500
Number of SPAM: 500


In [2]:
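import email
import email.parser  # imported explicitly so that email.parser.BytesParser is guaranteed to resolve
import email.policy as empol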

# Function to load & parse a specified email name in a given path
def load_email(filename, email_path):
    with open(os.path.join(email_path, filename), 'rb') as f:
        # Return parsed emails
        return email.parser.BytesParser(policy=empol.default).parse(f)
    
    
# Create lists of all parsed HAM & SPAM emails
ham_emails = [load_email(filename, HAM_PATH) for filename in ham_filenames]
spam_emails = [load_email(filename, SPAM_PATH) for filename in spam_filenames]
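
As a quick check of what the parser returns, here is a minimal sketch that parses a message from memory instead of from the corpus. The raw message is made up for illustration, and `parsebytes()` is used in place of `parse()` only because there is no file to open.

```python
from email import policy
from email.parser import BytesParser

# Hypothetical raw message, written by hand for illustration (not taken from the corpus)
raw = (
    b"From: alice@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Lunch at noon?\r\n"
)

# parsebytes() is the in-memory counterpart of parse(), which reads from a file object
msg = BytesParser(policy=policy.default).parsebytes(raw)

print(msg["Subject"])              # header lookup by name
print(msg.get_content_type())      # MIME type of the message
print(msg.get_content().strip())   # decoded body text
```

With `policy=empol.default` the parser builds `EmailMessage` objects, which is what makes `get_content()` available in the preprocessing step.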

## 2 Create a Test Set

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split

# Create an object array of all the email objects (dtype=object keeps each message as a single element)
x = np.array(ham_emails + spam_emails, dtype=object)

# Create an array containing binary labels in the same order as x (0 for ham_emails & 1 for spam_emails)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

# Create a test set & training set using scikit-learn
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
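
The split above is purely random, so the spam share of each subset can drift a little from the corpus's 500 in 3000. The sketch below shows the `stratify` option of `train_test_split`, which keeps the class ratio fixed in both subsets. This is an optional variation, not what the notebook does, and `x_demo` / `y_demo` are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the corpus's class balance: 2500 ham (0), 500 spam (1)
y_demo = np.array([0] * 2500 + [1] * 500)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo asks for the same class proportions in the training & test subsets
_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Spam share in train:", y_tr.mean())
print("Spam share in test: ", y_te.mean())
```

For comparison, the confusion matrices later in the notebook show 405 spam emails in the training set, where an exactly proportional split would give 400.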

## 3 Preprocessing
### 3.1 Detect Email HTML & Convert to Text using BeautifulSoup

In [4]:
from bs4 import BeautifulSoup
import lxml

def html_to_plain_text(html):
    # Create a BS object containing email object contents
    soup = BeautifulSoup(html, 'lxml')
    # Return the plain text extracted from the HTML soup, with newlines & spaces removed
    # (note: stripping every space also joins adjacent words together)
    return soup.get_text().replace('\n', '').replace(' ', '')

def email_to_text(email):
    # Set html variable to None for Pythonic if statement at function end
    html = None
    
    # Iterate over all parts & subparts of the email object tree
    for part in email.walk():
        # Get the email part's or subpart's content type
        ctype = part.get_content_type()
        # If ctype is not "text/plain" or "text/html", skip this part & move on to the next one
        if ctype not in ("text/plain", "text/html"):
            continue

        # Parse the part's contents with get_content()
        try:
            content = part.get_content()
        # get_content() failed (e.g. an encoding problem) - fall back to the raw payload as a string
        except Exception:
            content = str(part.get_payload())
        
        # if there is no HTML in the part, return content (just plain text)
        if ctype == "text/plain":
            return content
        # if there is html, set the variable HTML to it
        else:
            html = content
            
    # If html has been assigned from the 'for part in email.walk()', convert HTML to plain text & return
    if html:
        return html_to_plain_text(html)
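
To make the traversal order concrete, the sketch below builds a small multipart message in memory and lists the content types that `walk()` yields. The message is hypothetical and is constructed with the standard library's `EmailMessage` instead of being read from the corpus.

```python
from email.message import EmailMessage

# Hypothetical message with a plain-text body and an HTML alternative
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Plain body")
msg.add_alternative("<html><body><p>HTML body</p></body></html>", subtype="html")

# walk() yields the container first, then each part in document order
ctypes = [part.get_content_type() for part in msg.walk()]
print(ctypes)  # ['multipart/alternative', 'text/plain', 'text/html']
```

Because the `text/plain` part comes before the `text/html` part here, `email_to_text` would return the plain body and never reach the HTML branch. The HTML conversion only runs for emails that have no `text/plain` part.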

### 3.2 Clean Email Transformer

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
import urlextract
import re
# Create an instance of the URL extractor
url_extractor = urlextract.URLExtract()


class EmailCleaningTransformer(BaseEstimator, TransformerMixin):
    # Initialise all variables
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True, 
                         replace_urls=True, replace_numbers=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        
    def fit(self, x, y=None):
        return self
    
    def transform(self, x, y=None):
        # Create an array with the cleaned emails for all emails passed in as x
        emails_transformed = []
        for email in x:
            # Convert emails to plain text
            text = email_to_text(email) or ""

            # If lower_case=True (default): convert text to lowercase
            if self.lower_case:
                text = text.lower()

            # Convert all URLs present in the plain text to the word ' URL '
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                # Replace the longest URLs first so a shorter URL cannot break up a longer one
                urls.sort(key=len, reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")

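            # Use re module (regex) to replace numbers in the text with the word 'NUMBER'
            # (as written, a decimal part is only consumed when an exponent follows, so '9.99' becomes 'NUMBER.NUMBER')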
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)

            # Use regex to remove punctuation from text & replace with a ' ' (space)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            
            # Append the transformed email 
            emails_transformed.append(text)
        return np.array(emails_transformed)
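
The two regex steps can be tried in isolation. The sketch below applies the transformer's number pattern and punctuation pattern to a made-up sample string, and compares it with an alternative number pattern that treats the fractional part and the exponent as independently optional. The alternative is an assumption offered for comparison only and is not part of the notebook.

```python
import re

sample = "call 555 now!!! only $9.99 or 1.5e3 units"

# Pattern used in the transformer above
notebook_pattern = r'\d+(?:\.\d*(?:[eE]\d+))?'
# Alternative (an assumption, not from the notebook): optional fraction, optional exponent
alt_pattern = r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?'

def clean(text, number_pattern):
    # Replace numbers first, then collapse every run of non-word characters into one space
    text = re.sub(number_pattern, 'NUMBER', text)
    return re.sub(r'\W+', ' ', text)

print(clean(sample, notebook_pattern))  # call NUMBER now only NUMBER NUMBER or NUMBER units
print(clean(sample, alt_pattern))       # call NUMBER now only NUMBER or NUMBER units
```

The only difference is how a plain decimal such as `9.99` is tokenized: as two `NUMBER` tokens with the notebook's pattern, or as one with the alternative.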

### 3.3 Email Lemmatization To Counter Transformer

In [6]:
from collections import Counter
import spacy
from spacy.lang.en import English
nlp = spacy.load('en_core_web_sm')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

class EmailToLemmaCounterTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, lemmatization=True):
        self.lemmatization = lemmatization
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, x, y=None):
        x_transformed = []
        
        # Disable the tagger, parser & NER within the with block - only tokens & their lemmas are needed
        with nlp.disable_pipes('tagger', 'parser', 'ner'):
            for doc in nlp.pipe(x.tolist()):
                # Make a counter for each email, skipping stop words
                lemma_counter = Counter()
                for token in doc:
                    if not token.is_stop:
                        # Count lemmas by default; count the raw token text if lemmatization=False
                        key = token.lemma_ if self.lemmatization else token.text
                        lemma_counter[key] += 1
                # Add each email's counter to the new dataset
                x_transformed.append(lemma_counter)

        return np.array(x_transformed)
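
The counting step itself does not depend on spaCy. The sketch below shows the same idea (one `Counter` per email, with stop words skipped) using a whitespace tokenizer and a tiny hand-picked stop-word set as stand-ins. Both stand-ins are assumptions for illustration, and no lemmatization happens here.

```python
from collections import Counter

# Tiny hand-picked stop-word set, a stand-in for spaCy's token.is_stop
STOP_WORDS = {"the", "a", "to", "and", "is"}

def text_to_counter(text):
    # Whitespace tokenization only: words are counted as written, not lemmatized
    return Counter(tok for tok in text.split() if tok not in STOP_WORDS)

counter = text_to_counter("click the URL to claim the prize and click again")
print(counter.most_common(2))  # [('click', 2), ('URL', 1)]
```

In the transformer, `token.lemma_` takes the place of the raw word, so inflected forms of the same word share one counter entry.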

### 3.4 General Counter To Vector Transformer

In [7]:
from scipy.sparse import csr_matrix

class CounterToVectorTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, vocab_size=15):
        self.vocab_size = vocab_size
        
    def fit(self, x, y=None):
        total_counter = Counter()
        for counter in x:
            for word, count in counter.items():
                # Cap each email's contribution at 10 so that one outlier email cannot dominate the totals
                total_counter[word] += min(count, 10)
        # Find the most common lemmas - number of them determined by vocab_size
        most_common = total_counter.most_common(self.vocab_size)
        self.most_common = most_common
        # Map each vocabulary word to a column index starting at 1 (column 0 is kept for out-of-vocabulary words)
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    
    def transform(self, x, y=None):
        rows = []
        cols = []
        data = []
        
        for row, counter in enumerate(x):
            for word, count in counter.items():
                rows.append(row)
                # Words outside the vocabulary all land in column 0
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(x), self.vocab_size + 1))
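
A worked example on two tiny counters shows how the vocabulary and the sparse matrix fit together. The sketch repeats the `fit` and `transform` logic inline with a hypothetical `vocab_size` of 2, and the counters are made up.

```python
from collections import Counter
from scipy.sparse import csr_matrix

counters = [Counter({"url": 2, "free": 1}), Counter({"free": 3, "meeting": 1})]
vocab_size = 2

# Build the vocabulary from capped counts, as in fit()
total = Counter()
for counter in counters:
    for word, count in counter.items():
        total[word] += min(count, 10)
vocabulary = {word: idx + 1 for idx, (word, _) in enumerate(total.most_common(vocab_size))}

# Build the (row, column, value) triplets, as in transform()
rows, cols, data = [], [], []
for row, counter in enumerate(counters):
    for word, count in counter.items():
        rows.append(row)
        cols.append(vocabulary.get(word, 0))  # column 0 collects out-of-vocabulary words
        data.append(count)

matrix = csr_matrix((data, (rows, cols)), shape=(len(counters), vocab_size + 1))
print(vocabulary)        # {'free': 1, 'url': 2}
print(matrix.toarray())  # [[0 1 2]
                         #  [1 3 0]]
```

`"meeting"` is not in the vocabulary, so its count goes to column 0 of the second row. When an email contains several out-of-vocabulary words, `csr_matrix` sums the duplicate `(row, 0)` entries into a single value.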

## 4 Preprocessing Data Pipeline

In [8]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_clean", EmailCleaningTransformer()),
    ("email_to_counter", EmailToLemmaCounterTransformer()),
    ("counter_to_vector", CounterToVectorTransformer(vocab_size=4500))])

In [9]:
x_train_transformed = preprocess_pipeline.fit_transform(x_train)

## 5 Train a Selection of Models

In [10]:
# Suppress scikit-learn warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

### 5.1 Logistic Regression Classifier 

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, x_train_transformed, y_train, cv=5, verbose=3)
print("Mean Accuracy: ", score.mean())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.979, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.979, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.988, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.990, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
Mean Accuracy:  0.9841666666666666


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished


### 5.2 Random Forest Classifier

In [12]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

score = cross_val_score(rfc, x_train_transformed, y_train, cv=5, verbose=3)
score.mean()

[CV]  ................................................................
[CV] .................................... , score=0.956, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.965, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.975, total=   0.0s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................................... , score=0.973, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.973, total=   0.0s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished


0.9683333333333334

### 5.3 Gaussian Naive Bayes

In [13]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

score = cross_val_score(gnb, x_train_transformed.toarray(), y_train, cv=5, verbose=3)
score.mean()

[CV]  ................................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] .................................... , score=0.952, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.942, total=   0.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s


[CV] .................................... , score=0.927, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.919, total=   0.2s
[CV]  ................................................................
[CV] .................................... , score=0.929, total=   0.2s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished


0.9337499999999999

### 5.4 K-Neighbors Classifier

In [14]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)

score = cross_val_score(knn, x_train_transformed, y_train, cv=5, verbose=3)
score.mean()

[CV]  ................................................................
[CV] .................................... , score=0.725, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.725, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.700, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.721, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.710, total=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished


0.7162499999999999
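
Since ham outnumbers spam roughly five to one, accuracy is easier to read next to a majority-class baseline. The sketch below computes the accuracy of always predicting ham, using the training-set class counts that can be read off the confusion matrix rows in section 6. The scores in the dictionary are the mean accuracies reported above, rounded to four decimals.

```python
# Training-set class counts, taken from the confusion matrix rows in section 6
n_ham, n_spam = 1972 + 23, 24 + 381

# Accuracy of a classifier that labels every email as ham
baseline_accuracy = n_ham / (n_ham + n_spam)
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")

# Mean cross-validation accuracies reported in this section (rounded)
scores = {
    "LogisticRegression": 0.9842,
    "RandomForest": 0.9683,
    "GaussianNB": 0.9337,
    "KNeighbors": 0.7162,
}
for name, score in scores.items():
    print(f"{name}: {score - baseline_accuracy:+.3f} vs baseline")
```

On these numbers the K-Neighbors classifier scores below the baseline, while the other three models clear it.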

## 6 GridSearchCV Hyperparameter Tuning
- Run GridSearchCV on the most promising models: LogisticRegression & RandomForestClassifier


### 6.1 Logistic Regression Classifier 

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, precision_score, accuracy_score

In [16]:
param_grid_log = [
    {'penalty': ['l1', 'l2'],
     'C': [9, 10, 11],
     'solver': ['liblinear'],
    }]

log_clf = LogisticRegression(random_state=42)
grid_search_log = GridSearchCV(log_clf, param_grid=param_grid_log, cv = 5, verbose=True, n_jobs=-1)
grid_search_log.fit(x_train_transformed, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    9.2s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=42, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid=[{'C': [9, 10, 11], 'penalty': ['l1', 'l2'],
                          'solver': ['liblinear']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=True)

In [17]:
print(grid_search_log.best_params_)

{'C': 9, 'penalty': 'l2', 'solver': 'liblinear'}


In [18]:
y_train_pred = cross_val_predict(grid_search_log.best_estimator_, x_train_transformed, y_train, cv=5)
confusion_matrix(y_train, y_train_pred)

array([[1972,   23],
       [  24,  381]])

In [19]:
print(f'Precision: {precision_score(y_train, y_train_pred)}')
print(f'Recall: {recall_score(y_train, y_train_pred)}')

Precision: 0.943069306930693
Recall: 0.9407407407407408
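
Both scores follow directly from the confusion matrix. As a check, the sketch below recomputes them by hand from the four cells printed above.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[1972, 23],
               [24, 381]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 381 / 404: share of predicted spam that really is spam
recall = tp / (tp + fn)     # 381 / 405: share of actual spam that was caught
print(f"Precision: {precision:.4f}")  # Precision: 0.9431
print(f"Recall: {recall:.4f}")        # Recall: 0.9407
```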


### 6.2 Random Forest Classifier

In [20]:
param_grid_rfc = [
    {'n_estimators' : [10, 1000, 1500],
     'max_features' : [25, 50, 100],
     'bootstrap' : [True, False] 
    }]

rf_clf = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(rf_clf, param_grid=param_grid_rfc, cv = 5, n_jobs=-1)
grid_search_rf.fit(x_train_transformed, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid=[{'b

In [21]:
print(grid_search_rf.best_params_)

{'bootstrap': False, 'max_features': 100, 'n_estimators': 1500}


In [22]:
y_train_pred = cross_val_predict(grid_search_rf.best_estimator_, x_train_transformed, y_train, cv=5)
confusion_matrix(y_train, y_train_pred)

array([[1981,   14],
       [  37,  368]])

In [23]:
print(f'Precision: {precision_score(y_train, y_train_pred)}')
print(f'Recall: {recall_score(y_train, y_train_pred)}')

Precision: 0.9633507853403142
Recall: 0.908641975308642
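
The two tuned models trade precision against recall in opposite directions, so a single combined score can help compare them. The sketch below computes F1 from the two confusion matrices above. Using F1 as the tie-breaker is an assumption here, since the notebook does not state how the final model was chosen.

```python
# (fp, fn, tp) read from the two cross-validated confusion matrices above
cells = {
    "logistic": (23, 24, 381),
    "forest": (14, 37, 368),
}

f1 = {}
for name, (fp, fn, tp) in cells.items():
    # F1 is the harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn)
    f1[name] = 2 * tp / (2 * tp + fp + fn)
    print(f"{name}: F1 = {f1[name]:.3f}")
```

On these numbers Logistic Regression comes out slightly ahead (about 0.942 against 0.935), which is consistent with it being the model evaluated in section 7.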


## 7 Final Model & Test Set Evaluation
- Evaluate the final model (Logistic Regression Classifier) on the test set.

In [24]:
final_model = grid_search_log.best_estimator_

In [25]:
x_test_transformed = preprocess_pipeline.transform(x_test)

In [26]:
final_predictions = final_model.predict(x_test_transformed)

In [28]:
print('Accuracy Score: {:.0f}%'.format(100*accuracy_score(y_test, final_predictions)))
print()
print("Precision: {:.0f}%".format(100 * precision_score(y_test, final_predictions)))
print("Recall: {:.0f}%".format(100 * recall_score(y_test, final_predictions)))

Accuracy Score: 98%

Precision: 92%
Recall: 97%


# Notes
- __spaCy's lemmatization is not as effective as NLTK stemming__ for this particular dataset (in both recall & precision); however, spaCy was more fun!
- Adding email length analysis, structure analysis, and spaCy entity analysis diluted the effectiveness of stemming & lemmatization.
- The __decision function's threshold could be adjusted to balance precision & recall__ if desired.
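
The threshold idea in the last note can be sketched as follows. The data is synthetic (`make_classification` with a class balance similar to the corpus) and stands in for the vectorized emails, and the three threshold values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data, a stand-in for the vectorized emails
X, y = make_classification(n_samples=2000, weights=[0.83], random_state=42)
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X, y)

# predict() is equivalent to thresholding decision_function() at 0
scores = clf.decision_function(X)

results = {}
for threshold in (-1.0, 0.0, 1.0):
    y_pred = (scores > threshold).astype(int)
    results[threshold] = (precision_score(y, y_pred), recall_score(y, y_pred))
    print(threshold, [round(v, 3) for v in results[threshold]])
```

Raising the threshold can only shrink the set of emails flagged as spam, so recall never increases, while precision usually (though not always) improves. The scores here come from the training data for brevity. For an honest estimate, `cross_val_predict(..., method="decision_function")` would supply out-of-fold scores instead.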