# Spamtastic

#### The goal of this notebook is to use machine learning to classify the observations in a dataset of legitimate and spam email, so that the trained model can accurately determine whether out-of-sample data is spam or ham (non-spam).  Data was obtained from https://archive.ics.uci.edu/ml/datasets/Spambase

#### Import all the things

In [1]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB

#### Read in all the data

In [2]:
data = pd.read_csv('spambase.data', names=list(range(0,58)))


In [3]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


#### Split the data into a dataframe of attributes and a series of results

In [4]:
attr = data[data.columns[0:57]]

In [5]:
attr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [6]:
spamdex = data[data.columns[57]]

In [7]:
type(spamdex)

pandas.core.series.Series

#### Split the data and results into testing and training data

In [8]:
X_train, X_test, y_train, y_test = tts(attr, spamdex, test_size=0.4, random_state=1)

#### Create an instance of the Multinomial Naive Bayes model

In [9]:
nb = MultinomialNB()

#### Fit the training data to the model

In [10]:
nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### Create a prediction object based on the test data

In [11]:
y_pred = nb.predict(X_test)

#### Measure the accuracy of your predicted outcomes against your test data outcomes

In [12]:
metrics.accuracy_score(y_test, y_pred)

0.77838131450298753
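To make the number above concrete: an accuracy score is the fraction of test rows whose predicted label matches the true label. The sketch below recomputes that by hand on a small made-up pair of label lists, and also counts the four outcome types for a binary classifier. The helper names `accuracy` and `confusion_counts` and the toy labels are hypothetical, used only for illustration; they are not part of the notebook or of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, treating `positive` as spam."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


# Made-up labels for illustration only: 1 = spam, 0 = ham.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_accuracy)
print(toy_counts)
```

Accuracy alone does not say which kind of mistake the model makes. For a spam filter, ham flagged as spam (`fp`) and spam let through (`fn`) may matter differently, so looking at the counts alongside the score can be useful.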

A score of about 0.78 is not especially good, but it comes from a single train/test split, so it depends on which rows happened to land in the test set.  Let's try using K-fold cross-validation to get one accuracy score per fold, then take the mean of those scores.

#### K-Fold cross-validation

In [13]:
from sklearn.model_selection import cross_val_score

In [14]:
scores = cross_val_score(nb, attr, spamdex, cv=10, scoring='accuracy')
print(scores)

[ 0.79175705  0.79392625  0.80911063  0.83478261  0.82826087  0.77608696
  0.77826087  0.81521739  0.69281046  0.74291939]


In [15]:
scores.mean()

0.78631324693940163
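As a rough picture of what `cv=10` does: the rows are partitioned into K folds, each fold serves once as the test set while the remaining rows are used for training, and that yields K scores. The sketch below only builds the index partitions, in plain Python. The function name `kfold_indices` is hypothetical, and this is a simplification: it uses contiguous, unshuffled folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds that keep the class proportions similar in each fold.

```python
def kfold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for plain K-fold.

    Folds are contiguous and unshuffled. When n_samples is not divisible by k,
    the first (n_samples % k) folds get one extra row.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current test fold is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train_idx, test_idx))
        start = stop
    return folds


# Small illustrative example: 23 rows split into 5 folds.
folds = kfold_indices(n_samples=23, k=5)
fold_test_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_test_sizes)
```

Every row appears in exactly one test fold, which is why the mean of the fold scores is less sensitive to one lucky or unlucky split than a single `train_test_split` score.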

Not much better, really: the mean is about 0.79, versus about 0.78 for the single split.  That might lead us to believe that any one or a combination of the following is happening:

1. There is a weak correlation between the data provided and the measured outcomes
2. There are features in our data that are causing our model to overfit
3. Our model is not particularly effective, at least not with its current parameters
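One way to start probing the third possibility is to repeat the cross-validation for several values of the model's smoothing parameter `alpha` and compare the mean accuracies. The sketch below is a minimal illustration under stated assumptions: the helper `mean_cv_accuracy_by_alpha`, the candidate `alpha` values, and the synthetic count data are all hypothetical and stand in for the notebook's `attr` and `spamdex`, so the numbers it prints say nothing about Spambase itself.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB


def mean_cv_accuracy_by_alpha(X, y, alphas, cv=5):
    """Return {alpha: mean cross-validated accuracy} for MultinomialNB."""
    results = {}
    for alpha in alphas:
        model = MultinomialNB(alpha=alpha)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[alpha] = scores.mean()
    return results


# Synthetic stand-in data (an assumption, NOT the Spambase features):
# non-negative counts whose per-feature rates differ between the two classes.
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)
rates_ham = np.array([4.0, 1.0, 1.0, 1.0, 1.0])
rates_spam = np.array([1.0, 1.0, 1.0, 1.0, 4.0])
lam = np.where(toy_y[:, None] == 1, rates_spam, rates_ham)  # shape (200, 5)
toy_X = rng.poisson(lam)

alpha_results = mean_cv_accuracy_by_alpha(toy_X, toy_y, alphas=[0.01, 0.1, 1.0])
for alpha, mean_acc in alpha_results.items():
    print(alpha, round(mean_acc, 3))
```

On the notebook's data the call would presumably look like `mean_cv_accuracy_by_alpha(attr, spamdex, alphas=[0.01, 0.1, 1.0], cv=10)`. If the scores barely move across `alpha` values, that points away from parameter tuning and toward the other two explanations, or toward trying a different model.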