# Assignment : Spam Filter
## Description
This assignment is near-final modulo some small adjustments (6 Nov '16)
In this assignment, you will discover that in many practical machine learning problems, implementing the learning algorithm is only a small part of the overall system. Thus, to get a high mark for this assignment, you need to implement any of the more advanced classification techniques or clever pre-processing methods. You will find plenty of them on the internet. If you do not know where to look for them, ask Google ;-).

Here your task is to build the standard (i.e. multinomial) Naive Bayes text classifier described during the lectures. You should test your program using the automatic marking software (described below), so it is critically important that it follows the specifications in detail.
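
As a rough illustration of what a multinomial Naive Bayes classifier does, here is a minimal sketch in plain Python 3. It is not the lecture's reference implementation: the function names, the toy documents and the choice of add-one (Laplace) smoothing via `alpha` are assumptions made for this example.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed per-class word log-probabilities."""
    classes = sorted(set(labels))
    vocab = {word for doc in docs for word in doc.split()}
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for d in class_docs for word in d.split())
        # alpha is added to every word count so unseen words never get probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_multinomial_nb(model, doc):
    """Return the class with the highest log-posterior; out-of-vocabulary words are skipped."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c])
    return max(scores, key=scores.get)

# Made-up toy data, only to show the call pattern
toy_docs = ["win money now", "cheap money offer", "meeting at noon", "lunch at noon tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(toy_docs, toy_labels)
print(predict_multinomial_nb(model, "cheap money"))
print(predict_multinomial_nb(model, "noon meeting"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied for a long e-mail.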

You will train your classifier on real-world e-mails, which you can download from here. Each training e-mail is stored in a separate file. The names of spam training e-mails start with spam, while the names of ham e-mails start with ham.
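
The naming convention above is enough to derive the labels. A small sketch, assuming the 0 = ham, 1 = spam encoding used later in this notebook; the helper name `label_from_filename` and the file name `spam001.txt` are hypothetical.

```python
import os

def label_from_filename(path):
    """Map a training file to 0 (ham) or 1 (spam) based on its name prefix."""
    name = os.path.basename(path)
    if name.startswith("spam"):
        return 1
    if name.startswith("ham"):
        return 0
    raise ValueError("unexpected training file name: %s" % name)

print(label_from_filename("training_data/ham000.txt"))
print(label_from_filename("training_data/spam001.txt"))  # hypothetical file name
```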

## Marking criteria

Part 1 (40%):
    - Your program classifies the testing set with an accuracy significantly higher than random within 30 minutes
    - Use very simple data preprocessing so that the emails can be read into the Naive Bayes (remove everything else other than words from emails)
    - Write simple Naive Bayes multinomial classifier or use an implementation from a library of your choice
    - Classify the data
    - Report your results with a metric (e.g. accuracy) and method (e.g. cross validation) of your choice
    - Choose a baseline and compare your classifier against it
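
One possible baseline is to always predict the most common class. The sketch below estimates its accuracy with k-fold cross validation in plain Python 3; the helper names and the toy labels are made up for illustration, and other baselines or metrics are equally valid choices.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def majority_baseline_accuracy(labels, k=5):
    """Cross-validated accuracy of always predicting the most common training label."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if labels[i] == majority)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

toy_labels = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # 0 = ham, 1 = spam (made-up data)
baseline = majority_baseline_accuracy(toy_labels, k=5)
print("majority-class baseline accuracy: %.2f" % baseline)
```

The folds here are contiguous, so the data should be shuffled first if it is loaded class by class. With scikit-learn, `sklearn.dummy.DummyClassifier` and `sklearn.model_selection.cross_val_score` cover the same ground.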

Part 2 (30%):
    - Use some smart feature processing techniques to improve the classification results
    - Compare the classification results with and without these techniques
    - Analyse how the classification results depend on the parameters (if available) of chosen techniques
    - Compare (statistically) your results against any other algorithm of your choice (you can use any library); compare and contrast results, ensure fair comparison
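
For the statistical comparison, one option (among several) is an exact McNemar test on the e-mails where the two classifiers disagree, evaluated on the same test set. The sketch below is plain Python 3.8+ (it needs `math.comb`); the function name and the toy predictions are invented for the example.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the items where exactly one classifier is right."""
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the classifiers never disagree
    k = min(only_a, only_b)
    # Under the null hypothesis each disagreement favours A or B with probability 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Made-up predictions: A makes one mistake, B makes five different ones
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_b = [1, 1, 0, 0, 1, 1, 1, 0, 1, 0]
only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print("A-only correct: %d, B-only correct: %d, p = %.4f" % (only_a, only_b, p_value))
```

Because both classifiers are scored on identical e-mails, the test is paired, which is one way to keep the comparison fair.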

Part 3 (30%):
    - Calibration (15%): calibrate Naive Bayes probabilities, such that they result in low mean squared error
    - Naive Bayes extension (15%): modify the algorithm in some interesting way (e.g. weighted Naive Bayes)
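
To make the calibration item concrete: Naive Bayes tends to output probabilities very close to 0 or 1, and the mean squared error between predicted probability and true label (the Brier score) penalises that overconfidence. The sketch below uses histogram binning, which is only one of several calibration methods; all names and numbers are made up, and in practice the bins should be fitted on held-out data rather than on the data being scored.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted spam probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def fit_histogram_binning(probs, labels, n_bins=5):
    """For each equal-width bin, store the observed fraction of positive labels."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the bin midpoint
    return [sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]

def apply_histogram_binning(bin_values, probs):
    """Replace each probability by the calibrated value of its bin."""
    n_bins = len(bin_values)
    return [bin_values[min(int(p * n_bins), n_bins - 1)] for p in probs]

# Made-up overconfident scores; fitting and scoring on the same toy data for brevity
raw_probs = [0.99, 0.98, 0.97, 0.99, 0.01, 0.02, 0.03, 0.01]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
bin_values = fit_histogram_binning(raw_probs, labels)
calibrated = apply_histogram_binning(bin_values, raw_probs)
print("Brier score before: %.4f" % brier_score(raw_probs, labels))
print("Brier score after:  %.4f" % brier_score(calibrated, labels))
```

scikit-learn offers `sklearn.calibration.CalibratedClassifierCV` for sigmoid and isotonic calibration if a library solution is preferred.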
    

**Convert from .ipynb to .py:** $ jupyter nbconvert --to python filter.ipynb

**BeautifulSoup:** $ pip install beautifulsoup4




In [232]:
# Import all necessary packages
import math
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import email
import nltk



import os, re




**Part 1 Filtering**

In [470]:

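# Removes header from email if present
def get_body(msg):
    # For multipart messages, follow the first part until a non-multipart
    # part is reached, so a string is returned even for nested multiparts
    while msg.is_multipart():
        parts = msg.get_payload()
        if not parts:
            return ''
        msg = parts[0]
    return msg.get_payload()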
    
    
# Returns text containing only words (lowercase)
def remove_extras(text):
    # Raw strings keep the regex escapes (\w, \d, \s) intact
    email_address = r"([\w.-]+)@([\w.-]+)"
    web_address = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    numbers = r"\d+"
    non_alphanumeric = r"([^\s\w]|_)+"  # punctuation and underscores
    whitespace = r"\s+"
    # Stop words are removed separately in filter_stop_words

    text = re.sub(email_address, '', text)
    text = re.sub(web_address, '', text)
    text = re.sub(numbers, '', text)
    text = re.sub(non_alphanumeric, '', text)
    text = re.sub(whitespace, ' ', text).strip()
    text = text.lower() # Lowercase
    return text

def simple_html_filter(html_doc):

    soup = BeautifulSoup(str(html_doc), "html.parser")

    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    
    return text

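# Build the stop-word set once; reloading the list for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def filter_stop_words(text):
    text = ' '.join([word for word in text.split() if word not in STOP_WORDS])
    return text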
    
    
def filter_part1(msg):
    # The parameter is named msg so it does not shadow the email module
    message = get_body(msg)
    message = simple_html_filter(message)
    message = remove_extras(message)
    message = filter_stop_words(message)
    return message


# ---------------------

# Example

with open('training_data/ham000.txt') as f:
    text_test = f.read()

msg = email.message_from_string(text_test)

print(filter_part1(msg))



you are receiving this email because you signed up to receive one of our free reports if you would prefer not to receive messages of this type please unsubscribe by following the instructions at the bottom of this message dear investor thank you again for requesting our free special report the one stock that keeps wall street buzzing we began the motley fool in with the idea that investors like you deserved better better than wall streets alltoooften biased research better than analysts who speak in secret codes allowing them to hedge or spin any recommendation and better than what passes for full financial disclosure in big business today given a level playing field we believe that regular folks like us and you can do quite well in the stock market why put trust in conflicted information from others when you could count on your own abilities and potentially blow the pros away more than two million people visit our foolcom web site each month we spend a great deal of time at foolcom in

**Part 2 Filtering** - ignore for now

In [471]:
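from email.header import decode_header
def getheader(header_text, default="ascii"):
    """Decode the specified header"""

    headers = decode_header(header_text)
    # decode_header may return bytes or already-decoded text for each section;
    # only bytes need decoding (this avoids the Python 2-only unicode() call)
    header_sections = [text.decode(charset or default) if isinstance(text, bytes) else text
                       for text, charset in headers]
    return u"".join(header_sections)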


getheader(msg["from"])

u'"David&TomGardner@fooladvisor.com"<Subscriber@fooladvisor.com>'

Read in all documents and convert them to a bag of words

In [472]:
import glob
data = []
training_labels = []
training_label_names = ['ham', 'spam']
        
# ham
for filename in glob.glob('training_data/ham*.txt'):
    with open(filename, 'r') as f:  # the file is closed once it has been read
        data.append(f.read())
    training_labels.append(0)

# spam
for filename in glob.glob('training_data/spam*.txt'):
    with open(filename, 'r') as f:
        data.append(f.read())
    training_labels.append(1)


In [473]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

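class PreProcessor(object):
    """Pipeline step that turns raw e-mail text into filtered body text."""

    def transform(self, X):
        # Parse each raw e-mail and apply the Part 1 filtering
        return [filter_part1(email.message_from_string(item)) for item in X]

    def fit(self, X, y=None):
        # Nothing to learn; defined so the class can be used as a Pipeline step
        return self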

text_clf = Pipeline([('preprocess', PreProcessor()),
                     ('vect', CountVectorizer(decode_error='ignore')),
#                      ('tfidf', TfidfTransformer(use_idf=False)),
                     ('clf', MultinomialNB()),
])




Now we can actually use a naive Bayes classifier!

In [474]:
# we're using multinomial (MultinomialNB, imported above) because it's most relevant for word counts
text_clf = text_clf.fit(data, training_labels)


Done :) Now we can test

In [502]:
import random

def pickTestData(count=1):
    # Each row of testdata.label is read as "<label> <filename>"
    with open('testdata.label', 'r') as labelsFile:
        labeling = labelsFile.readlines()
    random.shuffle(labeling)
    sample = labeling[0:count]
    test_set = []
    for row in sample:
        label, filename = row.split()[:2]
        with open('test_data/' + filename, 'r') as f:
            # 1 - label maps the file's labels onto training_label_names (0 = ham, 1 = spam)
            test_set.append((f.read(), 1 - int(label)))
    return test_set

test_data = pickTestData(10)

docs_new = [i[0] for i in test_data]

# The pipeline vectorizes the documents itself, so the raw text is passed directly
predicted = text_clf.predict(docs_new)

tot_correct = 0

print(predicted)

for original_data, category in zip(test_data, predicted):
    print('%r => %s' % (training_label_names[original_data[1]], training_label_names[category]))
    if original_data[1] == category:
        tot_correct += 1

print('%d correctly classified out of %d' % (tot_correct, len(test_data)))

# np.mean(predicted == training_labels)
    
    
    

[0 0 1 0 0 1 0 0 0 0]
'ham' => ham
'ham' => ham
'spam' => spam
'ham' => ham
'spam' => ham
'spam' => spam
'ham' => ham
'ham' => ham
'ham' => ham
'ham' => ham
9 correctly classified out of 10
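
Accuracy on ten e-mails hides which kind of mistake was made. The sketch below recomputes the counts for the run printed above (labels read off that output, 0 = ham, 1 = spam) and adds precision and recall for the spam class; the helper name `confusion_counts` is made up for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` treated as the spam class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

# True and predicted labels from the printed run above
y_true = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print("tp=%d fp=%d fn=%d tn=%d" % (tp, fp, fn, tn))
print("accuracy=%.2f precision=%.2f recall=%.2f" % (accuracy, precision, recall))
```

In this sample the single error is a spam e-mail classified as ham, so recall for spam is lower than the overall accuracy suggests. A sample of ten is far too small for a reliable estimate.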
