# Spam Classifier

One of the most common problems on social media, especially on texting and communication platforms, is an unwanted stream of messages sent by bots. These unwanted messages are called __spam__. They clog a user's inbox and ruin the user experience.  
To curb the influx of spam and sustain a good user experience, we need ways to classify these messages and filter them accordingly.  
There are several ways to achieve this goal. This notebook takes a __Natural Language Processing (NLP)__ approach: each message is converted into word counts, and a __Naive Bayes__ classifier is trained on those counts.

In [11]:
# Import necessary packages. 
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import warnings


# Ignore unnecessary warning.
warnings.filterwarnings('ignore')

Now, I'll define a function that loads the messages from a directory on my hard disk.

In [None]:
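def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for each file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            inBody = False
            lines = []
            # latin1 maps every byte to a character, so decoding never fails.
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line ends the headers; the body follows.
                        inBody = True
            message = '\n'.join(lines)
            yield file_path, message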

The __yield__ statement makes the above function a generator function: calling it returns a generator object that produces its results one at a time.  
Each item it yields is the path of a message file together with the body of that message.  
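
To make the generator behaviour and the header skipping concrete, here is a minimal sketch that runs on in-memory text instead of files. The name `iter_bodies` and the sample messages are hypothetical and exist only for this illustration. Unlike the function above, the sketch joins lines with an empty string, because each line already ends with its own newline.

```python
import io


def iter_bodies(raw_messages):
    """Lazily yield (name, body) pairs, skipping each message's header block."""
    for name, raw in raw_messages.items():
        in_body = False
        lines = []
        for line in io.StringIO(raw):
            if in_body:
                lines.append(line)
            elif line == '\n':
                # The first blank line separates the headers from the body.
                in_body = True
        yield name, ''.join(lines)


# Hypothetical in-memory messages standing in for files on disk.
samples = {
    'a.txt': 'From: x@example.com\nSubject: hi\n\nHello there\nBye\n',
    'b.txt': 'Subject: offer\n\nFree prize\n',
}

gen = iter_bodies(samples)  # nothing has been processed yet
first = next(gen)           # runs the function body up to the first yield
rest = list(gen)            # consumes whatever is left
print(first)
print(rest)
```

Because the work happens only when an item is requested, a generator like this can walk a large directory without holding every message in memory at once.
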
I'll be working with a __pandas__ dataframe, so every message produced by the `readFiles` generator will become a row in it. The next function uses the generator's output to build that __pandas dataframe__.

In [None]:
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)
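
As a rough illustration of what the function above builds, the following sketch applies the same rows-plus-index pattern to two made-up messages. The file names and message texts are hypothetical; in the notebook they would come from `readFiles`.

```python
from pandas import DataFrame

# Hypothetical (filename, message) pairs standing in for readFiles output.
pairs = [
    ('emails/spam/0001', 'Win a free prize now'),
    ('emails/spam/0002', 'Cheap loans, apply today'),
]

rows = []
index = []
for filename, message in pairs:
    rows.append({'message': message, 'class': 'spam'})
    index.append(filename)

# One row per message, labelled by its file name.
toy = DataFrame(rows, index=index)
print(toy)
print(toy.loc['emails/spam/0001', 'message'])
```

Using the file name as the index makes it easy to trace any row back to the message it came from.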

Now with the above functions, I can easily load all the messages from the directory. 

In [None]:
from pandas import concat

# Load the spam and the authentic (ham) messages, then combine them.
# DataFrame.append was removed in pandas 2.0, so concat is used instead.
data = concat([
    dataFrameFromDirectory('emails/spam', 'spam'),
    dataFrameFromDirectory('emails/ham', 'ham'),
])


In [5]:
data.head()

| | message | class |
|---|---|---|
| emails/spam\00001.7848dde101aa985090474a91ec93fcf0 | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` | spam |
| emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09 | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| emails/spam\00004.eac8de8d759b7e74154f142194282724 | `##############################################...` | spam |
| emails/spam\00005.57696a39d7d84318ce497886896bf90d | `I thought you might like these:\n\n1) Slim Dow...` | spam |


Now we will use a `CountVectorizer` to split each message into words and count how often each word occurs, then feed those counts into a `MultinomialNB` classifier. Call `fit()` and we've got a trained spam filter ready to go! It's just that easy.

In [8]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB()
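
To see what the vectorization step produces, here is a small sketch on three made-up messages. The `toy_` names and the messages are assumptions for illustration and are not part of the email data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, chosen so the counts are easy to check by hand.
toy_messages = ['free money free', 'golf tomorrow', 'free golf']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column in the count matrix.
print(toy_vectorizer.vocabulary_)

# Each row is one message; each column holds the count of one word.
print(toy_counts.toarray())
```

The classifier never sees the raw text, only this matrix of counts, which is why new messages must go through the same fitted vectorizer before prediction.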

Let's try it out:

In [9]:
examples = ['Free free free!!!', "Hi Bob, how about a game of golf tomorrow"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')
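
`predict` returns only the winning label. If you also want to see how strongly the model leans towards each class, `predict_proba` may help. The sketch below trains on a tiny made-up data set, so the `toy_` names and messages are hypothetical and the probabilities it prints say nothing about the real emails.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: three spam and three ham messages.
toy_messages = [
    'free prize claim now', 'win free money', 'free offer click now',
    'golf game tomorrow', 'lunch meeting tomorrow', 'see you at the game',
]
toy_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

new_messages = ['free money now', 'game tomorrow']
new_counts = toy_vectorizer.transform(new_messages)
toy_predictions = toy_classifier.predict(new_counts)
probabilities = toy_classifier.predict_proba(new_counts)

# classes_ gives the column order of the probability matrix.
print(toy_classifier.classes_)
for text, label, row in zip(new_messages, toy_predictions, probabilities):
    print(text, '->', label, row)
```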

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying train/test to this spam classifier - see how well it can predict some subset of the ham and spam emails.
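
One possible way to approach the train/test challenge is sketched below with scikit-learn's `train_test_split`. The messages here are a hypothetical stand-in; in the notebook you would pass `data['message'].values` and `data['class'].values` instead. The split ratio and the `random_state` are arbitrary choices, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data: four spam and four ham messages.
messages = [
    'free prize claim now', 'win free money', 'free offer click now',
    'cheap loans apply today', 'golf game tomorrow', 'lunch meeting tomorrow',
    'see you at the game', 'notes from the meeting',
]
labels = ['spam'] * 4 + ['ham'] * 4

# Hold out 25% of the messages; stratify keeps the class balance in both parts.
train_msgs, test_msgs, train_labels, test_labels = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the training messages only, so the held-out
# messages do not influence the model in any way.
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(train_msgs)
test_counts = split_vectorizer.transform(test_msgs)

split_classifier = MultinomialNB()
split_classifier.fit(train_counts, train_labels)

# score() reports the fraction of held-out messages labelled correctly.
accuracy = split_classifier.score(test_counts, test_labels)
print('held-out accuracy:', accuracy)
```

With a data set this small the accuracy figure is very noisy, so it is worth rerunning with different `random_state` values on the real emails before drawing conclusions.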
