# Jonathan Halverson
# Tuesday, May 10, 2016
# Spam classification

In this notebook we build a spam classifier for emails using logistic regression on hashed word counts.

In [1]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

Below, the raw data is read into RDDs:

In [2]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the running context, or create one if none exists
ham = sc.textFile('ham.txt')
spam = sc.textFile('spam.txt')

In [3]:
print(ham.count())
print(ham.first())

7
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...


In [4]:
print(spam.count())
print(spam.first())

5
Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...


The text is converted to word-count vectors (bag of words) using feature hashing:

In [6]:
tf = HashingTF(numFeatures=10000)
hamFeatures = ham.map(lambda email: tf.transform(email.split()))
spamFeatures = spam.map(lambda email: tf.transform(email.split()))

In [7]:
hamFeatures.first()

SparseVector(10000, {543: 1.0, 773: 1.0, 1034: 2.0, 1704: 1.0, 1957: 1.0, 1962: 2.0, 2837: 1.0, 2916: 1.0, 3683: 1.0, 3731: 1.0, 3921: 1.0, 5057: 1.0, 5325: 1.0, 6292: 1.0, 6902: 1.0, 7928: 1.0, 8297: 1.0, 8787: 1.0, 9382: 1.0, 9683: 1.0})
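
To make the hashing step concrete, here is a minimal plain-Python sketch of the idea behind a hashed term-frequency vector: each token is hashed to an index in a fixed range and the counts are accumulated per index. The function name `hashing_tf` and the use of `zlib.crc32` are assumptions made for illustration; `HashingTF` uses its own hash function, so the indices produced here will not match the output above.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=10000):
    """Map tokens to a sparse {index: count} dict via the hashing trick."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash chosen for determinism (an assumption);
        # it is not the hash function used by Spark's HashingTF.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)

features = hashing_tf("send money send now".split())
print(sum(features.values()))  # 4.0: the counts always sum to the number of tokens
```

Two different words can land on the same index (a collision), which is the price paid for never having to build a vocabulary. A larger `num_features` makes collisions less likely.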

The labels are assigned to the appropriate records, with spam as the positive class (1) and ham as the negative class (0):

In [9]:
positiveClass = spamFeatures.map(lambda record: LabeledPoint(1, record))
negativeClass = hamFeatures.map(lambda record: LabeledPoint(0, record))

In [11]:
positiveClass.first()

LabeledPoint(1.0, (10000,[438,451,620,1175,1801,1882,1948,2916,3542,3921,3937,5011,5245,6416,7009,7344,7928,8072,8104,8318,8350,8729,9308,9416],[1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0]))

In [14]:
trainingData = positiveClass.union(negativeClass).cache()
model = LogisticRegressionWithLBFGS.train(trainingData)
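
As a rough picture of what a trained logistic regression model does at prediction time, the sketch below scores a sparse feature vector with a weight vector and an intercept, passes the result through the sigmoid, and compares it with a threshold. The helper names, the toy weights, and the threshold of 0.5 are assumptions for illustration only; none of the numbers come from the model trained above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_sparse(weights, intercept, features, threshold=0.5):
    """Return 1 if sigmoid(w . x + b) exceeds the threshold, else 0."""
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if sigmoid(margin) > threshold else 0

# Hypothetical weights keyed by feature index; not taken from the trained model.
toy_weights = {10: 0.2, 20: 1.5, 30: -1.1}
print(predict_sparse(toy_weights, 0.0, {20: 2.0, 10: 1.0}))  # 1 (margin 3.2)
print(predict_sparse(toy_weights, 0.0, {30: 2.0, 10: 1.0}))  # 0 (margin -2.0)
```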

Let's try the model out on the training data:

In [17]:
[model.predict(item) for item in hamFeatures.collect()]

[0, 0, 0, 0, 0, 0, 0]

In [18]:
[model.predict(item) for item in spamFeatures.collect()]

[1, 1, 1, 1, 1]
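
The two prediction lists can be summarized as a single training accuracy. The short sketch below copies the predictions shown above into plain Python lists (the variable names are made up for this example) and counts how many match their true labels.

```python
# Predictions copied from the two cells above.
ham_predictions = [0, 0, 0, 0, 0, 0, 0]
spam_predictions = [1, 1, 1, 1, 1]

# Pair each prediction with its true label: 0 for ham, 1 for spam.
pairs = [(0, p) for p in ham_predictions] + [(1, p) for p in spam_predictions]
correct = sum(1 for label, predicted in pairs if label == predicted)
accuracy = float(correct) / len(pairs)
print(accuracy)  # 1.0
```

A perfect score on 12 training emails mostly shows that the model can fit the data it was trained on; it says little about how well it generalizes.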

Let's try two out-of-sample emails and see if they are correctly classified:

In [19]:
model.predict(tf.transform("Get a free mansion by sending 1 million dollars to me.".split()))

1

In [20]:
model.predict(tf.transform("Hi Mark, Let's meet at the coffee shop at 3 pm.".split()))

0

Both predictions are correct. This example could be extended with more pre-processing of the emails (for example, lowercasing and stripping punctuation) and with far more data than the 12 emails used here. The model should also be evaluated on held-out data, for instance by looking at an ROC curve.
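
As a sketch of the ROC evaluation mentioned above, the plain-Python code below sweeps a threshold over predicted scores and computes the area under the resulting curve. The function names, labels, and scores are hypothetical and only illustrate the calculation. With the MLlib model, raw scores rather than 0/1 labels would be needed, which `model.clearThreshold()` should provide, and the scores would have to come from emails not used in training.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points obtained by sweeping the threshold over scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / float(negatives), tp / float(positives)))
    return points

def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical held-out labels (1 = spam) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auc(roc_points(labels, scores)))  # about 0.889 (8 of 9 spam/ham pairs ranked correctly)
```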