# Exercise 6 | Spam Classification with SVMs



In [27]:
import matplotlib.pyplot as plt
import numpy as np  # used below for np.mean
import scipy.io as scio
from sklearn import svm
# Initialization
from ex6_spamfunc import *
%matplotlib inline
plt.rcParams['figure.figsize'] = (12.0, 9.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Part 1: Email Preprocessing
To classify emails as spam or non-spam with an SVM, you first need
to convert each email into a feature vector. In this part, you will
implement the preprocessing steps for each email. You should
complete the processEmail function (imported above from ex6_spamfunc)
so that it produces a vector of word indices for a given email.

In [28]:
print('\nPreprocessing sample email (emailSample1.txt)\n')
with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
print(file_contents)
word_indices = processEmail(file_contents)


Preprocessing sample email (emailSample1.txt)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com



==== Processed Email ====


['anyon', 'know', 'how', 'much', 'it', 'cost', 'to', 'host', 'a', 'web', 'portal', 'well', 'it', 'depend', 'on', 'how', 'mani', 'visitor', 'you', 're', 'expect', 'thi', 'can', 'be', 'anywher', 'from', 'less', 'than', 'number', 'buck', 'a', 'month', 'to', 'a', 'coupl', 'of', 'dollarnumb', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhap', 'amazon', 'ecnumb', 'if', 'your', 'run', 'someth', 'big', 'to', 'unsubscrib', 'yourself', 'from', 'thi', 'mail', 'list', 'send', 'an', 'email', 'to', 'emailaddr']
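
The processed output above contains placeholder tokens such as `httpaddr`, `emailaddr`, `number` and `dollarnumb`. The following is a minimal sketch of a normalization step that could produce tokens like these. It is not the `processEmail` implementation from `ex6_spamfunc`: the function name `normalize_email` and the regular expressions are assumptions made for illustration, and stemming and vocabulary lookup are left out, so it yields `dollarnumber` rather than `dollarnumb`.

```python
import re

def normalize_email(text):
    """Hypothetical helper: lowercase the text and replace URLs, email
    addresses, digit runs and dollar signs with placeholder tokens."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # email addresses
    text = re.sub(r'[0-9]+', 'number', text)                   # runs of digits
    text = re.sub(r'[$]+', 'dollar', text)                     # dollar signs
    # Split on anything that is not a letter or digit and drop empty tokens.
    return [tok for tok in re.split(r'[^a-z0-9]+', text) if tok]

tokens = normalize_email('Checkout http://www.rackspace.com/ or Amazon EC2 for $100.')
print(tokens)
```

URLs and addresses are replaced before digits so that numbers inside a URL do not split it into several tokens.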


## Part 2: Feature Extraction
Now, you will convert each email into a vector of features in R^n.
You should complete the emailFeatures function (imported from
ex6_spamfunc) so that it produces a feature vector for a given email.

In [29]:
print('\nExtracting features from sample email (emailSample1.txt)\n')
features = emailFeatures(word_indices)


Extracting features from sample email (emailSample1.txt)
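
As a rough illustration of what a feature vector in R^n can look like, the sketch below builds a binary vector with a 1 at every vocabulary position that occurs in the email. The name `email_features_sketch`, the 0-based indexing and the column-vector shape are assumptions for this example. The actual `emailFeatures` in `ex6_spamfunc` may differ, although the `x.T` used in Part 5 suggests it also returns a column vector.

```python
import numpy as np

def email_features_sketch(word_indices, n_words):
    """Hypothetical helper: return an (n_words, 1) column vector holding 1
    at each listed vocabulary index and 0 everywhere else."""
    x = np.zeros((n_words, 1))
    # Assumes 0-based indices; a repeated index still produces a single 1.
    x[np.asarray(word_indices, dtype=int), 0] = 1
    return x

x = email_features_sketch([0, 3, 3, 5], n_words=8)
print(x.ravel())
print(x.T.shape)  # (1, 8): one row per email, the layout model.predict expects
```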



## Part 3: Train Linear SVM for Spam Classification
In this section, you will train a linear SVM classifier to determine
whether an email is spam or not spam.

In [30]:
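data = scio.loadmat('spamTrain.mat')
X, y = data['X'], data['y'][:, 0]  # flatten labels from (m, 1) to (m,)
print('\nTraining Linear SVM (Spam Classification)\n')
print('(this may take 1 to 2 minutes) ...\n')
c = 0.1  # penalty on margin violations (smaller C means stronger regularization)
model = svm.LinearSVC(C=c, loss='hinge')
model.fit(X, y)
p = model.predict(X)
# Percentage of training emails classified correctly
print('Training Accuracy: %f\n' % (np.mean(p == y) * 100,))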


Training Linear SVM (Spam Classification)

(this may take 1 to 2 minutes) ...

Training Accuracy: 99.850000
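
To make the roles of `C` and `loss='hinge'` more concrete, here is a small sketch of the textbook soft-margin objective, 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b))), evaluated for a given weight vector. The function `hinge_objective` is a hypothetical helper, not part of scikit-learn, and it only approximates what `LinearSVC` optimizes because the library handles the intercept differently in detail.

```python
import numpy as np

def hinge_objective(w, b, X, y01, C):
    """Hypothetical helper: textbook soft-margin SVM objective.
    Labels in {0, 1} are mapped to {-1, +1} before computing margins."""
    y = 2 * np.asarray(y01) - 1
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero for points outside the margin
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

X_toy = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
y_toy = np.array([1, 1, 0])
w_toy = np.array([1.0, 0.0])
print(hinge_objective(w_toy, 0.0, X_toy, y_toy, C=0.1))
```

In this toy case only the second point violates the margin, so raising `C` would increase the cost of leaving it there.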



## Part 4: Test Spam Classification
After training the classifier, we can evaluate it on a test set, which
is included in spamTest.mat.

In [31]:
data = scio.loadmat('spamTest.mat')
Xtest, ytest = data['Xtest'], data['ytest'][:, 0]
p = model.predict(Xtest)
print('Test Accuracy: %f\n' % (np.mean(p == ytest) * 100,))

Test Accuracy: 98.900000
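
Accuracy alone does not show whether the errors are missed spam or legitimate mail flagged as spam. As an optional sketch, precision and recall for the spam class can be computed from the same predictions. The helper `spam_metrics` and the toy labels below are hypothetical and are not taken from spamTest.mat.

```python
import numpy as np

def spam_metrics(y_true, y_pred):
    """Hypothetical helper: accuracy, precision and recall, with spam (1)
    treated as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam correctly flagged
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # legitimate mail flagged
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that was missed
    accuracy = float(np.mean(y_pred == y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only.
print(spam_metrics([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1]))
```

With the notebook's variables, the call would be `spam_metrics(ytest, p)`.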



## Part 5: Try Your Own Emails
Now that you've trained the spam classifier, you can use it on your own
emails! The starter code includes spamSample1.txt, spamSample2.txt,
emailSample1.txt and emailSample2.txt as examples. The following code
reads in one of these emails and then uses your learned SVM classifier
to determine whether the email is spam or not spam.

In [32]:
filename = 'spamSample2.txt'
with open(filename, 'r') as f:
    file_contents = f.read()
print(file_contents)
word_indices = processEmail(file_contents)
x = emailFeatures(word_indices)
# x.T presents the features as a single row; predict returns a length-1 array
p = model.predict(x.T)
print('\nProcessed %s\n\nSpam Classification: %d\n' % (filename, p[0]))
print('(1 indicates spam, 0 indicates not spam)\n\n')

Best Buy Viagra Generic Online

Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!

We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
http://medphysitcstech.ru




==== Processed Email ====


['best', 'buy', 'viagra', 'gener', 'onlin', 'viagra', 'numbermg', 'x', 'number', 'pill', 'dollarnumb', 'free', 'pill', 'reorder', 'discount', 'top', 'sell', 'number', 'qualiti', 'satisfact', 'guarante', 'we', 'accept', 'visa', 'master', 'e', 'check', 'payment', 'number', 'satisfi', 'custom', 'httpaddr']

Processed spamSample2.txt

Spam Classification: 1

(1 indicates spam, 0 indicates not spam)


