# Naive Bayes Classifier
###### Bayes' theorem
![bayes__theorem.svg](attachment:bayes__theorem.svg)
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem, with a strong (naive) assumption that the features are independent of one another.<br>
More precisely, it assumes conditional independence: given the class, each input feature is treated as independent of every other feature. That is why it is called __naive__.<br>  
For instance, consider these two messages:  
"Free! Free!! Free!!! Free iPhone 12 pro max for you."__,__ <br> "Hey Max, I've got some free tickets to the game, would you like to come?"  
Clearly, the first message is spam, whereas the second seems authentic.  
Because of conditional independence, each word is weighed on its own, so there's a likelihood that the __NB classifier__ might classify the latter message as spam just because of the word __free__.
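
To make the effect of the word __free__ concrete, here is a minimal hand-rolled sketch of the naive posterior calculation. The word likelihoods and priors below are made-up numbers chosen only for illustration; they are not estimated from the email dataset used in this notebook.

```python
# Hypothetical values of P(word | class) and P(class), for illustration only.
likelihood = {
    "spam": {"free": 0.30, "tickets": 0.05, "game": 0.02},
    "ham": {"free": 0.02, "tickets": 0.04, "game": 0.06},
}
prior = {"spam": 0.5, "ham": 0.5}


def posterior(words):
    """Return P(class | words), multiplying per-word likelihoods independently."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


with_free = posterior(["free", "tickets", "game"])
without_free = posterior(["tickets", "game"])
print(with_free)
print(without_free)
```

With these assumed numbers, the same message flips from mostly ham to mostly spam once __free__ is included, because the classifier never considers the context in which the word appears.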

In [1]:
# Import necessary packages
import numpy as np
import pickle
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

Load the DataFrame from the previous notebook...

In [2]:
%store -r data
data.head()

Unnamed: 0,message,label
../emails/spam\00001.7848dde101aa985090474a91ec93fcf0,"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Tr...",spam
../emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09,1) Fight The Risk of Cancer!\n\nhttp://www.adc...,spam
../emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c,1) Fight The Risk of Cancer!\n\nhttp://www.adc...,spam
../emails/spam\00004.eac8de8d759b7e74154f142194282724,##############################################...,spam
../emails/spam\00005.57696a39d7d84318ce497886896bf90d,I thought you might like these:\n\n1) Slim Dow...,spam


We'll split the data into training and testing chunks.

In [3]:
# Define features(X) and targets(Y).
X = data['message'].values
Y = data['label'].values

X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=0.1, random_state=12345, shuffle=True)
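
As a rough illustration of what `test_size=0.1` does, the sketch below splits a toy dataset of 20 made-up messages. The `stratify` argument is an optional extra that is not used in this notebook; it keeps the spam/ham ratio the same in both chunks, which may help when one class is rare.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical), 10 messages per class.
toy_X = [f"message {i}" for i in range(20)]
toy_Y = ["spam"] * 10 + ["ham"] * 10

toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.1, random_state=12345, shuffle=True, stratify=toy_Y
)

# 10% of 20 samples go to the test chunk, the rest to the training chunk.
print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_Y_test))
```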

In [4]:
# Initialise a Naive Bayes Classifier
classifier = MultinomialNB()

We cannot just throw raw text into the classifier. We must first extract numeric features from the text. 
A count vectorizer is a simple way to do this. It builds a matrix with one row per message and one column per unique word in the dataset, where each entry counts how often that word occurs in that message.<br>  
If you wish to gain a better understanding of how CountVectorizer works, this [article](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/) is a good place to start.

In [5]:
# Initialise a count vectorizer.
vectorizer = CountVectorizer()

# Transform the messages.
X_train = vectorizer.fit_transform(X_train)

Ok!<br>
Now, we have a sparse matrix of features extracted from the messages.  
Let's go ahead and train the classifier by calling `.fit`.

In [6]:
classifier.fit(X_train, Y_train)

MultinomialNB()

Time to evaluate the classifier using the test chunk.

In [7]:
X_test = vectorizer.transform(X_test)

accuracy = classifier.score(X_test, Y_test)
print(f"Accuracy of the spam classifier: {round(100*accuracy)}%")

Accuracy of the spam classifier: 95%
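
Accuracy on its own does not show how the errors are spread across the two classes. As a hedged sketch, a confusion matrix can reveal this. The labels below are made up to show an extreme case and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 ham and 2 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 2
# A hypothetical model that never flags anything as spam.
y_pred = ["ham"] * 10

acc = accuracy_score(y_true, y_pred)
# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

print(f"Accuracy: {acc:.0%}")
print(cm)
```

In this toy case the accuracy looks respectable even though every spam message slips through, which is why it can be worth checking the per-class counts as well.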


Amazing!!!<br>
The spam classifier achieved 95% accuracy on unseen data.<br>  
Now, I can apply this classifier in a function for __spam filtering__.  
Let's do that...

In [24]:
def spamFilterer(message):
    """
    This function takes an incoming message from the server as a one-element list of strings.
    If the message is predicted to be spam, the function filters it out and returns None.
    If the message is predicted to be authentic, the function returns the message string.
    """
    text = vectorizer.transform(message)
    prediction = classifier.predict(text)[0]
    
    if prediction == 'spam':
        return None
    else:
        return message[0]

In [25]:
spamFilterer(["Amanda, you were supposed to come around yesterday. Why didn't you?"])

"Amanda, you were supposed to come around yesterday. Why didn't you?"

In [26]:
spamFilterer(["Colect gifts here!!!"])

As expected, the function returns `None` for this message, so the cell shows no output.  
Finally, I'll export this trained classifier to a pickle file.<br>  
__NB:__ The count vectorizer's vocabulary is limited to the unique words in this dataset, which is about 60600 words. Also, the classifier was trained specifically on the feature matrix produced by this count vectorizer. Therefore, the model expects input with the same columns, in the same order, as this count vectorizer produces.  
Hence, there is a limitation to using this classifier: it only works together with this fitted vectorizer, and any word outside the vocabulary is ignored at prediction time.
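
One possible way to work within this limitation is to save the fitted vectorizer alongside the classifier, so that both can be restored together. The sketch below uses toy data and an in-memory pickle; the dictionary keys and the toy messages are assumptions for illustration, and this is not what the export cell in this notebook does.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the notebook's data (hypothetical).
toy_messages = [
    "free free iphone",
    "free prize claim now",
    "see you at the game",
    "lunch tomorrow at noon",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Serialise both fitted objects together so they cannot drift apart.
blob = pickle.dumps({"vectorizer": toy_vectorizer, "classifier": toy_classifier})

# Later (for example, in another process): restore and predict.
restored = pickle.loads(blob)
features = restored["vectorizer"].transform(["claim your free iphone"])
prediction = restored["classifier"].predict(features)[0]
print(prediction)
```

The word "your" never appeared in the toy training messages, so the restored vectorizer simply drops it and the feature matrix keeps the column layout the classifier was trained on.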

In [9]:
# Export the model as a pickle file.
with open('../spam_models/naive_bayes.pickle', 'wb') as f:
    pickle.dump(classifier, f)

In [10]:
# ifunanyaScript