# Chapter 3, Exercise: Email Spam / Ham Classification

Build a simple email classification system.

1. Frame the problem and look at the big picture
2. Get the data
3. Explore the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models and combine them into a great solution
7. Present your solution (skip for this exercise)
8. Launch, monitor, and maintain your system (skip for this exercise)

## Frame the Problem

The dataset is a collection of plain-text email files stored in two folders, spam and ham. Spam is unwanted email; ham is legitimate (non-spam) email. Because the emails are essentially raw text, most of the work lies in building a feature dataframe / matrix. That requires some form of text feature extraction, for example TF-IDF or something similar.
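To make the feature-extraction idea concrete, here is a minimal sketch of TF-IDF on three made-up sentences. The toy emails and the variable names are assumptions for illustration only; they are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for email bodies (not taken from the dataset)
toy_emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside, click now",
]

tfidf = TfidfVectorizer()
X_toy = tfidf.fit_transform(toy_emails)  # sparse matrix, one row per email

print(X_toy.shape)                # (number of emails, number of distinct tokens)
print(sorted(tfidf.vocabulary_))  # the tokens that became columns
```

Each row is one email and each column one token. Note that the default tokenizer only keeps tokens of two or more characters, so the word "a" does not get a column.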

In [24]:
# required imports
import numpy as np
import pandas as pd
import re
import os
import logging
from time import time
from pprint import pprint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

## Get the Data

In [2]:
DATA_FOLDER = "../datasets/emails/"
SPAM_FOLDER = DATA_FOLDER + "spam/"
HAM_FOLDER = DATA_FOLDER + "ham/"

In [3]:
# list the file names in each folder (os.listdir returns them in arbitrary order)
spam_filenames = os.listdir(SPAM_FOLDER)
ham_filenames = os.listdir(HAM_FOLDER)

In [26]:
# read contents of all SPAM and HAM files
# TAG_RE could be used to strip HTML tags; it is unused for now because the raw text is kept
TAG_RE = re.compile(r'<[^>]+>')

def read_corpus(folder, filenames):
    """Return the raw text of each file in folder, skipping undecodable bytes."""
    texts = []
    for filename in filenames:
        with open(os.path.join(folder, filename), "r", errors='ignore') as f:
            texts.append(f.read())
    return texts

spam_corpus = read_corpus(SPAM_FOLDER, spam_filenames)
ham_corpus = read_corpus(HAM_FOLDER, ham_filenames)

# targets: 1 = spam, 0 = ham
spam_target = np.ones(len(spam_corpus))
ham_target = np.zeros(len(ham_corpus))

# Combine Spam / Ham Corpus (spam first, then ham)
corpus = spam_corpus + ham_corpus
print(ham_corpus[0])

From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002
Return-Path: <exmh-workers-admin@spamassassin.taint.org>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36
	for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)
Received: from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for
    <zzzz-exmh@spamassassin.taint.org>; Thu, 22 Aug 2002 12:34:53 +0100
Received: from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by
    listman.redhat.com (Postfix) with ESMTP id 8386540858; Thu, 22 Aug 2002
    07:35:02 -0400 (EDT)
Delivered-To: exmh-workers@listman.spamassassin.taint.org
Received: from int-mx1.corp

## Explore data to gain insights
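
The first ham message printed earlier consists almost entirely of routing headers, so one thing worth exploring is how much of each file is header versus body. The following is a minimal sketch using Python's standard `email` package on a made-up message; the sample text and variable names are assumptions, and real messages in the dataset may be multipart, in which case `get_payload()` returns a list of parts rather than a string.

```python
from email import message_from_string

# Hypothetical raw message: headers, a blank line, then the body
raw = (
    "From: sender@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

msg = message_from_string(raw)

subject = msg["Subject"]          # header lookup by name
body = msg.get_payload()          # a string for simple, non-multipart messages
is_multipart = msg.is_multipart()

print(subject)
print(body)
```

Separating headers from the body this way would make it possible to check whether a classifier is learning from the message text or from header details such as mailing-list host names.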

## Prepare data for ML

In [19]:
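# first, experiment with vectorizers
# min_df=3 drops tokens that occur in fewer than 3 emails; max_df=1.0 (the default) drops none
vectorizer = CountVectorizer(stop_words='english', max_df=1.0, min_df=3)
print(vectorizer)
# fit() returns the vectorizer itself, so `fit` is the fitted vectorizer
fit = vectorizer.fit(corpus)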

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
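
To see what `stop_words` and `min_df` do to the vocabulary, here is a small sketch on four made-up documents. The documents and the choice of `min_df=2` are assumptions chosen so the effect is visible on tiny data; the notebook itself uses `min_df=3` on the full corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = [
    "free prize now",
    "free prize inside",
    "free lunch on monday",
    "the monday meeting",
]

# Default settings: every token of two or more characters is kept
all_tokens = CountVectorizer().fit(docs)

# English stop words removed, and tokens must appear in at least 2 documents
pruned = CountVectorizer(stop_words="english", min_df=2).fit(docs)

print(sorted(all_tokens.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Only tokens that survive both filters become feature columns, which is why the vocabulary shrinks so much on the pruned vectorizer.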


In [20]:
# transform spam
X_spam = fit.transform(spam_corpus)
print(X_spam.shape)
print(X_spam.toarray())

(1898, 22277)
[[ 0  0  0 ...,  0  0  0]
 [ 1  0  1 ...,  0  0  0]
 [ 2  8  0 ...,  0  0  0]
 ..., 
 [70  0  0 ...,  0  0  0]
 [ 0  0  4 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]]


In [21]:
# transform ham
X_ham = fit.transform(ham_corpus)
print(X_ham.shape)
print(X_ham.toarray())

(2501, 22277)
[[0 0 0 ..., 0 0 0]
 [0 0 5 ..., 0 0 0]
 [2 0 5 ..., 0 0 0]
 ..., 
 [3 0 4 ..., 0 0 0]
 [1 0 7 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


In [31]:
# combine features and targets
# HAM
ham = pd.DataFrame(data=X_ham.toarray(), columns=['x'+str(i) for i in range(X_ham.shape[1])])
ham = ham.assign(Spam = pd.Series(ham_target, index=ham.index))

# SPAM
spam = pd.DataFrame(data=X_spam.toarray(), columns=['x'+str(i) for i in range(X_spam.shape[1])])
spam = spam.assign(Spam = pd.Series(spam_target, index=spam.index))

# Combine them
train = pd.concat([spam, ham], ignore_index=True)
print(train.head())

   x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  ...   x22268  x22269  x22270  \
0   0   0   0   0   0   0   0   0   0   0  ...        0       0       0   
1   1   0   1   0   1   0   0   0   0   0  ...        0       0       0   
2   2   8   0   0   0   0   0   0   0   0  ...        0       0       0   
3   1   0   1   0   0   0   0   0   0   0  ...        0       0       0   
4   1   0   1   0   0   0   0   0   0   0  ...        0       0       0   

   x22271  x22272  x22273  x22274  x22275  x22276  Spam  
0       0       0       0       0       0       0   1.0  
1       0       0       0       0       0       0   1.0  
2       0       0       0       0       0       0   1.0  
3       0       0       0       0       0       0   1.0  
4       0       0       0       0       0       0   1.0  

[5 rows x 22278 columns]
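
Converting the count matrices with `toarray()` produces a dense frame of roughly 98 million cells (4399 rows by 22277 columns), almost all of them zero. One possible alternative, sketched here on toy data, is to keep the matrices sparse and stack them with `scipy.sparse.vstack`. The toy texts and variable names are assumptions, not the notebook's data.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora
spam_toy = ["free prize now", "click now for free prize"]
ham_toy = ["meeting moved to monday", "lunch on monday", "free for lunch"]

vec = CountVectorizer().fit(spam_toy + ham_toy)
X_spam_toy = vec.transform(spam_toy)  # sparse count matrix
X_ham_toy = vec.transform(ham_toy)

# Stack the rows without ever creating a dense array
X = vstack([X_spam_toy, X_ham_toy]).tocsr()

# Targets in the same order: 1 = spam, 0 = ham
y = np.concatenate([np.ones(X_spam_toy.shape[0]), np.zeros(X_ham_toy.shape[0])])

print(X.shape, y)
```

scikit-learn estimators such as `SGDClassifier` accept sparse input directly, so `X` and `y` could be passed to `fit` as they are.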


In [32]:
# shuffle the rows so spam and ham are interleaved, then reset the index
train = train.sample(frac=1).reset_index(drop=True)
train.head()

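| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x22268 | x22269 | x22270 | x22271 | x22272 | x22273 | x22274 | x22275 | x22276 | Spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |
| 3 | 3 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |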
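
Shuffling mixes the two classes, but the notebook does not set any data aside for evaluation. A common next step, sketched here as an assumption rather than something the notebook does, is a stratified split with `train_test_split`, which shuffles and preserves the spam/ham ratio in both parts. The toy arrays and the `random_state` value are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, 4 spam (1) and 6 ham (0)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# stratify keeps the class proportions equal in train and test;
# random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=42
)

print(y_train, y_test)
```

Fixing `random_state` also makes reruns comparable, which `train.sample(frac=1)` without a seed does not.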


In [48]:
# first row, feature columns only (everything except the Spam target)
train.iloc[:1,:-1]

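| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x22267 | x22268 | x22269 | x22270 | x22271 | x22272 | x22273 | x22274 | x22275 | x22276 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |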


## Explore several models to create a short list

In [50]:
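# try a simple SGD classifier (the default hinge loss gives a linear SVM)
clf = SGDClassifier()
# features: every column except the last; target: the last column (Spam)
clf.fit(train.iloc[:,:-1], train.iloc[:,-1])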



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
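
Fitting alone says nothing about how well the classifier separates spam from ham. One way to get an estimate, sketched here on toy data, is to collect out-of-fold predictions with `cross_val_predict` and compute precision and recall from them. The toy emails, `cv=3`, and the explicit `max_iter`, `tol`, and `random_state` values are assumptions for illustration; no scores from the real dataset are implied.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Hypothetical toy emails: 6 spam (label 1) followed by 6 ham (label 0)
texts = [
    "win a free prize today", "free money click here",
    "cheap pills online order today", "claim your free prize money",
    "click here to win cash", "cheap loans approved online",
    "meeting agenda for monday", "please review the attached patch",
    "lunch on friday with the team", "notes from the monday meeting",
    "patch applied and tests pass", "team agenda for friday",
]
y = np.array([1] * 6 + [0] * 6)

X = CountVectorizer().fit_transform(texts)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)

# Each prediction comes from a model that did not see that sample in training
y_pred = cross_val_predict(clf, X, y, cv=3)

precision = precision_score(y, y_pred)  # of predicted spam, how much is spam
recall = recall_score(y, y_pred)        # of actual spam, how much was caught
print(precision, recall)
```

For a spam filter the two metrics pull in different directions: low precision means legitimate mail gets flagged, low recall means spam gets through.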

## Fine-tune your models and combine them
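
The imports at the top of the notebook (`Pipeline`, `GridSearchCV`, `TfidfVectorizer`) suggest a grid search over a text pipeline. The following is a minimal sketch of what that could look like on toy data; the step names, the parameter grid, `cv=3`, and the `f1` scoring choice are all assumptions, not the author's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy emails: 6 spam (label 1) followed by 6 ham (label 0)
texts = [
    "win a free prize today", "free money click here",
    "cheap pills online order today", "claim your free prize money",
    "click here to win cash", "cheap loans approved online",
    "meeting agenda for monday", "please review the attached patch",
    "lunch on friday with the team", "notes from the monday meeting",
    "patch applied and tests pass", "team agenda for friday",
]
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier in one estimator, so the vectorizer is
# re-fit on the training part of every cross-validation fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the pipeline takes raw text, the same `search.fit` call would work on `corpus` with a matching target array, without building the dense dataframe first.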

## Present solution