# Spam Filtering Using [Enron's Dataset][1] - PART 1
[1]: http://www.aueb.gr/users/ion/data/enron-spam/

In [1]:
from pymldb import Connection
mldb = Connection('http://localhost/')

First let's load the first of Enron's datasets (there are 6) into MLDB, using a separate script.

In [2]:
mldb.put('/v1/datasets/enron_data', {'type': 'sparse.mutable'})
%run -n load_enron.py
add_enron_file_to_dataset(mldb, '/v1/datasets/enron_data', 1)
mldb.post('/v1/datasets/enron_data/commit')

This is what the dataset looks like.

*index*: order in which the emails arrived in the user's inbox  
*msg*: actual content of the email  
*label*: whether the email was legitimate (*ham*) or not (*spam*)  

In [3]:
mldb.query('select index, msg, label from enron_data order by index limit 10')

_rowName,index,msg,label
enron_1_mail_0,0,Subject: dobmeos with hgh my energy level has ...,spam
enron_1_mail_1,1,Subject: christmas tree farm pictures\n,ham
enron_1_mail_2,2,"Subject: vastar resources , inc .\ngary , prod...",ham
enron_1_mail_3,3,Subject: calpine daily gas nomination\n- calpi...,ham
enron_1_mail_4,4,Subject: re : issue\nfyi - see note below - al...,ham
enron_1_mail_5,5,Subject: meter 7268 nov allocation\nfyi .\n- -...,ham
enron_1_mail_6,6,Subject: your prescription is ready . . oxwq s...,spam
enron_1_mail_7,7,"Subject: mcmullen gas for 11 / 99\njackie ,\ns...",ham
enron_1_mail_8,8,"Subject: meter 1517 - jan 1999\ngeorge ,\ni ne...",ham
enron_1_mail_9,9,Subject: duns number changes\nfyi\n- - - - - -...,ham


Let's create a *sql.expression* function that simply tokenizes the emails into a bag of words. These words will be the features on which we train a classifier.

In [4]:
print mldb.put('/v1/functions/bow', {
    'type': 'sql.expression',
    'params': {
        'expression': """
            tokenize(msg, {splitchars: ' \n', quotechar: ''}) as bow
            """
    }
})

<Response [201]>
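To make the bag-of-words idea concrete, here is a rough pure-Python analogue of the `bow` function. It is only a sketch: it assumes that `tokenize` splits on the given characters and returns one count per distinct token, and the choice to drop empty tokens is an assumption of this example. The helper name `bag_of_words` is made up for illustration and is not part of MLDB.

```python
import re
from collections import Counter


def bag_of_words(msg):
    """Split a message on spaces and newlines, then count each token."""
    # Assumption: empty tokens (from consecutive separators) are dropped.
    tokens = [t for t in re.split(r"[ \n]", msg) if t]
    return Counter(tokens)


# Toy message in the same style as the emails shown earlier.
bow = bag_of_words("Subject: re : issue\nfyi - see note below - issue")
```

Each distinct token becomes one feature, and its count becomes the feature value, so `bow["issue"]` is 2 for this toy message.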


Then we can generate the features for the whole dataset, and write them into a new dataset, using the *transform* procedure.

In [5]:
print mldb.put('/v1/procedures/generate_feats', {
    'type': 'transform',
    'params': {
        'inputData': """
            select bow({msg:msg}) as features, label = 'spam' as label
            from enron_data
            """,
        'outputDataset': 'enron_features',
        'runOnCreation': True
    }
})

<Response [201]>


Finally, let's train a very simple classifier, training on the first half of the messages and testing on the second half. The classifier assigns a score to every email; we can then choose a threshold so that everything above it is classified as spam and everything below it as ham.

In [6]:
n = mldb.get('/v1/query', q='select count(*) as n from enron_features',
             format='aos').json()[0]['n']
res = mldb.put('/v1/procedures/experiment', {
    'type': 'classifier.experiment',
    'params': {
        'experimentName': 'enron_experiment1',
        'trainingData': 'select {features.*} as features, label from enron_features',
        # for now 50/50 split in time, but we might do something more
        # fancy later!
        'datasetFolds': [{
            'training_limit': n // 2,
            'testing_offset': n // 2,
            'orderBy': 'index',
        }],
        'modelFileUrlPattern': 'file://enron_model_$runid.cls',
        'algorithm': 'dt',
        'runOnCreation': True
    }
})
print res

<Response [201]>
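The 50/50 split in time used above can be pictured with plain Python lists. This sketch assumes that `training_limit` and `testing_offset` behave like SQL `LIMIT` and `OFFSET` applied after ordering by `index`; the toy `rows` list stands in for `enron_features` and is not real data.

```python
# Toy stand-in for the dataset rows, deliberately out of order.
rows = [{"index": i, "label": i % 3 == 0} for i in reversed(range(10))]

n = len(rows)
training_limit = n // 2   # train on the first n // 2 rows
testing_offset = n // 2   # test on everything after the first n // 2 rows

# Order by arrival time first, so that training data always precedes test data.
ordered = sorted(rows, key=lambda r: r["index"])
train = ordered[:training_limit]
test = ordered[testing_offset:]
```

Splitting by arrival order rather than at random mimics how a spam filter is used in practice: it is trained on past mail and judged on mail that arrives later.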


In [7]:
print 'AUC =', res.json()['status']['firstRun']['status']['aggregated']['auc']['mean']

AUC = 0.9575486565


Not a bad AUC for a model that simple. But [the AUC score of a classifier is only a very generic measure of performance][1]. For a specific problem like spam filtering, we're better off using a performance metric that matches our intuition about what a good spam filter is. Namely, a good spam filtering algorithm should almost never flag a legitimate email as spam, while keeping your inbox as spam-free as possible. This is what should be used to choose the threshold for the classifier, and then to measure its performance.
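For reference, one common way to read the AUC is as the probability that a randomly chosen spam message gets a higher score than a randomly chosen ham message. The sketch below computes it that way on made-up scores; it illustrates the definition only, is not how MLDB computes the value, and the names `auc`, `toy_scores` and `toy_labels` are hypothetical.

```python
def auc(scores, labels):
    """Fraction of (spam, ham) pairs where the spam scores higher; ties count half."""
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in spam
        for q in ham
    )
    return wins / (len(spam) * len(ham))


# Made-up scores: True marks spam, False marks ham.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [True, False, True, False, False]
toy_auc = auc(toy_scores, toy_labels)
```

No threshold appears anywhere in this computation, which is why the AUC cannot tell us how many legitimate emails a given threshold would send to the spam folder.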

So instead of the AUC (that doesn't pick a specific threshold but uses all of them), let's use as our performance metric the best [$F_{0.05}$ score][2], which gives 20 times more importance to precision than recall. In other words, this metric represents the fact that classifying as spam **only** what is really spam is 20 times more important than finding all the spam.
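Written out, with $P$ for precision and $R$ for recall (shorthand introduced here), the general $F_\beta$ score is

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$$

Setting $\beta = 0.05$ gives $\beta^2 = 0.0025$. In the usual reading of $F_\beta$, recall is treated as $\beta$ times as important as precision, so $\beta = 1/20$ corresponds to the "20 times" statement above. With $\beta = 1$ the formula reduces to the familiar $F_1$ score, the plain harmonic mean of precision and recall.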

Let's see how we are doing with that metric.

[1]: http://mldb.ai/blog/posts/2016/01/ml-meets-economics/
[2]: https://en.wikipedia.org/wiki/F1_score

In [8]:
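# F_0.05 metric; the function is named enron_cost and its output column "cost",
# which are the names the next query uses.
print mldb.put('/v1/functions/enron_cost', {
    'type': 'sql.expression',
    'params': {
        'expression': """
            (1 + pow(.05, 2)) * (precision * recall) / (precision * pow(.05, 2) + recall) as cost
            """
    }
})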

<Response [201]>


In [9]:
mldb.query("""
    select "truePositives", "trueNegatives", "falsePositives", "falseNegatives", precision, recall, score,
           enron_cost({precision, recall}) as *
    from enron_experiment1_results_0
    order by cost desc
""")

_rowName,truePositives,trueNegatives,falsePositives,falseNegatives,precision,recall,score,cost
enron_1_mail_5044,155,3672,0,1345,1.0,0.103333,1.0,0.978819
enron_1_mail_4637,447,3661,11,1053,0.975983,0.298,0.976489,0.970476
enron_1_mail_531,492,3657,15,1008,0.970414,0.328,0.946242,0.965698
enron_1_mail_3366,501,3656,16,999,0.969052,0.334,0.933693,0.964479
enron_1_mail_4542,643,3628,44,857,0.935953,0.428667,0.888078,0.933199
enron_1_mail_5043,711,3614,58,789,0.924577,0.474,0.883715,0.922391
enron_1_mail_3270,749,3594,78,751,0.905683,0.499333,0.748286,0.903849
enron_1_mail_2068,1454,3191,481,46,0.751421,0.969333,0.732412,0.751843
enron_1_mail_3105,1482,3172,500,18,0.74773,0.988,0.697496,0.748183
enron_1_mail_3463,1490,3162,510,10,0.745,0.993333,0.555889,0.745465
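The `cost` column can be reproduced by hand from the confusion counts. The sketch below recomputes it for a few rows copied from the table above and picks the row with the highest value; the helper name `f_beta` is made up for this example.

```python
BETA = 0.05

# (rowName, truePositives, falsePositives, falseNegatives), copied from the table above.
rows = [
    ("enron_1_mail_5044", 155, 0, 1345),
    ("enron_1_mail_4637", 447, 11, 1053),
    ("enron_1_mail_3270", 749, 78, 751),
    ("enron_1_mail_3463", 1490, 510, 10),
]


def f_beta(tp, fp, fn, beta=BETA):
    """F-beta score computed from true positives, false positives and false negatives."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


scored = [(name, f_beta(tp, fp, fn)) for name, tp, fp, fn in rows]
best_name, best_value = max(scored, key=lambda pair: pair[1])
```

Because precision dominates the metric, the row with zero false positives comes out on top even though its recall is only about 10%.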


As you can see, the best threshold is the most conservative one: in case of doubt, everything is classified as "ham". This leaves 1345 spam messages in the inbox, but no ham is wrongly filtered as spam. Clearly this can be improved!

# TBC...