# Naive Bayes
Naive Bayes is a probabilistic classifier built on conditional probability (Bayes' theorem). It is easy to implement and very fast to train.
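
For reference, the rule behind the classifier can be written out as follows. This is the standard textbook formulation rather than anything specific to this notebook; the symbols $c$ (a class such as spam or ham) and $w_1, \dots, w_n$ (the words of one email) are notation introduced here for illustration.

$$
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The "naive" part is the assumption that words are conditionally independent given the class, which lets the joint likelihood factor into the product on the right. The denominator is the same for every class, so it can be dropped when only the most likely class is needed.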

## Building an email filter
We will use scikit-learn's [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), which implements the naive Bayes algorithm for multinomially distributed data. It is one of the two classic naive Bayes variants used in text classification, where documents are typically represented as word-count vectors (tf-idf vectors are also known to work well in practice). This makes MultinomialNB a good fit for classification with discrete features such as word counts.
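
To make the idea concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier on a toy corpus. It is not scikit-learn's implementation: the function names, the toy documents, and the choice to ignore unseen words at prediction time are assumptions made for illustration. It uses add-one (Laplace) smoothing, `alpha=1.0`, which is also MultinomialNB's default.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)

    # Total occurrences of each word, per class
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_likelihood = {}
    for c in docs_per_class:
        # Smoothing adds alpha to every vocabulary word's count
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Assumption: words never seen during training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus: 1 = spam, 0 = ham
toy_docs = [
    ["free", "money", "now"],
    ["free", "offer", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
toy_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
toy_prediction_spam = predict(["free", "money"], log_prior, log_likelihood)
toy_prediction_ham = predict(["lunch", "meeting"], log_prior, log_likelihood)
print(toy_prediction_spam, toy_prediction_ham)  # 1 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which matters for long emails.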

### Load Data

In [35]:
import pandas as pd
import swifter
import os

X = []
y = []

def load_files(path, isSpam):
    for file_name in os.listdir(path):
        content = read_email(os.path.join(path, file_name))
        X.append(content)
        y.append(isSpam)
        
def read_email(file_name):
    try:
        with open(file_name, 'r', encoding='latin1') as file:
            return file.read()
    except IOError:
        print("can't open file", file_name)
        return ""

spam_path = "./datasets/emails/spam/"
ham_path = "./datasets/emails/ham/"

# Label spam as 1 and ham as 0
load_files(spam_path, 1)
load_files(ham_path, 0)

df = pd.DataFrame({
    'X': X,
    'y': y
})
df.head()

|   | X | y |
|---|---|---|
| 0 | From 12a1mailbot1@web.de Thu Aug 22 13:17:22 ... | 1 |
| 1 | From ilug-admin@linux.ie Thu Aug 22 13:27:39 ... | 1 |
| 2 | From sabrina@mx3.1premio.com Thu Aug 22 14:44... | 1 |
| 3 | From wsup@playful.com Thu Aug 22 16:17:00 200... | 1 |
| 4 | From social-admin@linux.ie Thu Aug 22 16:37:3... | 1 |


### Preprocess Data

First, we need to convert each email into a vector of features.

1. Tokenize the email's content using [NLTK's word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize)
2. Stem each word with [NLTK's PorterStemmer](http://www.nltk.org/howto/stem.html), which removes morphological affixes and leaves only the word stem
3. Count the words (tokens) in each email using sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
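
The counting step can be sketched with the standard library alone. This is a simplified illustration of a bag-of-words representation, not a reimplementation of CountVectorizer: the helper names and sample sentences are hypothetical, a whitespace split stands in for `word_tokenize`, stemming is skipped, and the result is a dense list rather than the sparse matrix sklearn returns.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(tokens)
    # Iterating the dict yields tokens in insertion (sorted) order
    return [counts[tok] for tok in vocabulary]


# Hypothetical sample documents
docs = ["get free money get rich", "meeting moved to monday"]
token_lists = [doc.split() for doc in docs]  # stand-in for word_tokenize

vocabulary = build_vocabulary(token_lists)
vectors = [to_count_vector(tokens, vocabulary) for tokens in token_lists]

print(list(vocabulary))
print(vectors)
```

Every document is mapped onto the same columns, so emails of different lengths become vectors of equal size, which is what the classifier needs as input.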


In [10]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re

# Uncomment to download (or update) the punkt tokenizer data used by word_tokenize
# import nltk
# nltk.download('punkt')

def tokenize_and_stem_email(email_body, stemmer):
    """Normalize an email, then return its stemmed tokens joined by spaces."""
    email_contents = preprocess_email(email_body)
    words = []
    for token in word_tokenize(email_contents):
        # Remove any non-alphanumeric characters
        word = regexprep(token.strip(), '[^a-zA-Z0-9]', '')
        if not word:
            continue  # nothing left of this token

        # Stem the word
        words.append(stemmer.stem(word))

    return ' '.join(words)

def regexprep(contents, regex, replace_value):
    return re.sub(regex, replace_value, contents)

def preprocess_email(email_body):
    
    email_contents = email_body.lower()

    # Strip all HTML
    email_contents = regexprep(email_contents, r'<[^<>]+>', ' ')

    # Handle Numbers
    email_contents = regexprep(email_contents, r'[0-9]+', 'number')

    # Handle URLS
    email_contents = regexprep(email_contents, r'(http|https)://[^\s]*', 'httpaddr')

    # Handle Email Addresses
    email_contents = regexprep(email_contents, r'[^\s]+@[^\s]+', 'emailaddr')

    # Handle $ sign
    email_contents = regexprep(email_contents, r'[$]+', 'dollar')

    # Get rid of any remaining punctuation
    email_contents = regexprep(email_contents, r'[^\w\s]', '')

    # Remove newlines (adjacent lines are concatenated without a separating space)
    email_contents = regexprep(email_contents, r'\n', '')
    
    return email_contents
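
To see what these substitutions do, the sketch below applies the same patterns, in the same order, to a single made-up line. The `RULES` list, the `normalize` helper, and the sample text are illustrative names and data, not part of the notebook.

```python
import re

# The same substitutions as preprocess_email, in the same order
RULES = [
    (r'<[^<>]+>', ' '),                      # strip HTML tags
    (r'[0-9]+', 'number'),                   # numbers
    (r'(http|https)://[^\s]*', 'httpaddr'),  # URLs
    (r'[^\s]+@[^\s]+', 'emailaddr'),         # email addresses
    (r'[$]+', 'dollar'),                     # dollar signs
    (r'[^\w\s]', ''),                        # remaining punctuation
    (r'\n', ''),                             # newlines
]


def normalize(text):
    """Lowercase the text, then apply each substitution in turn."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


# Hypothetical sample line
sample = "Win $500 now! Visit http://example.com or mail bob@example.com"
normalized = normalize(sample)
print(normalized)
# win dollarnumber now visit httpaddr or mail emailaddr
```

Note that `$500` collapses into the single token `dollarnumber`, because the replacement strings are inserted without surrounding spaces. The same effect produces the `numbernumbernum...` runs for timestamps such as `13:17:22` once the colons are stripped.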

**Performance issue**

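Running the preprocessing function over every email with a plain `apply` takes a long time. We'll use [swifter](https://github.com/jmcarpenter2/swifter), which tries to apply a function to a pandas Series in the fastest way available, parallelising the work where it can.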

In [36]:
print('processing e-mails...')
stemmer = PorterStemmer()
df['X'] = df['X'].swifter.apply(lambda email_body: tokenize_and_stem_email(email_body, stemmer))
print('done!')

processing e-mails...


Pandas Apply: 100%|████████████████████████████████████████████████████████████████| 2500/2500 [01:10<00:00, 35.50it/s]


done!


In [37]:
df.head()

|   | X | y |
|---|---|---|
| 0 | from emailaddr thu aug number numbernumbernum... | 1 |
| 1 | from emailaddr thu aug number numbernumbernum... | 1 |
| 2 | from emailaddr thu aug number numbernumbernum... | 1 |
| 3 | from emailaddr thu aug number numbernumbernum... | 1 |
| 4 | from emailaddr thu aug number numbernumbernum... | 1 |


In [38]:
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer()
counts = countVectorizer.fit_transform(df['X'].values)

In [39]:
from sklearn.naive_bayes import MultinomialNB
classifierNB = MultinomialNB().fit(counts, df['y'].values)

In [40]:
samples = ['you want 100000 $??! Get it now here for free http://fishing.com', "Hello, where are we going ?"]
samples_counts = countVectorizer.transform([tokenize_and_stem_email(sample, stemmer) for sample in samples])
predictions = classifierNB.predict(samples_counts)
print(predictions, classifierNB.predict_proba(samples_counts))

[1 0] [[0.30411967 0.69588033]
 [0.97048709 0.02951291]]
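
The output can be read as follows: each row of `predict_proba` belongs to one sample, and the columns follow the classifier's `classes_` attribute, which scikit-learn keeps in sorted order (here `[0, 1]`, i.e. ham then spam). The short sketch below copies the probabilities printed above and recovers the predicted labels by taking the most probable column. The variable names are illustrative.

```python
# Probabilities copied from the output above; columns follow classes_ = [0, 1]
classes = [0, 1]  # 0 = ham, 1 = spam, as assigned in load_files
probabilities = [
    [0.30411967, 0.69588033],  # 'you want 100000 $??! Get it now here for free ...'
    [0.97048709, 0.02951291],  # 'Hello, where are we going ?'
]

labels = []
for row in probabilities:
    # Index of the largest probability in this row
    best_column = max(range(len(row)), key=row.__getitem__)
    labels.append(classes[best_column])

# Probability assigned to the spam class for each sample
spam_confidence = [row[1] for row in probabilities]

print(labels)           # [1, 0]
print(spam_confidence)  # [0.69588033, 0.02951291]
```

The first sample is flagged as spam with a probability of about 0.70, while the second is classified as ham with about 0.97, so the model is noticeably less certain about the spam example.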
