## Learning Objectives

By the end of today's class, you should be able to...

- Review Bayes' formula for conditional probability

- Apply Bayes' rule for text classification

- Write a Python function for text classification with Naive Bayes 

## Text Classification

Text classification is the process of attaching labels to bodies of text (e.g., "tax document" or "medical form") based on the content of the text itself.

Think of your spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)?

### Question: How do _you_ tell if an email is spam or ham? What are the signs?

#### Followup: How does your process differ from a text classifier's?

## Review of conditional probability and its application to text

- Assume this small dataset is given:

<img src="Images/spam_ham_data_set.png" width="600" height="600">

## Question: What is the probability that an email is spam? What is the probability that an email is ham?

- $P(spam) = ?$

- $P(ham) = ?$

## Activity: Create spam and ham dictionary

- Create two dictionaries for spam and ham where keys are unique words and values are the frequency of each word
    - Example: if the word "password" shows up 4 times in the text, then in the dictionary, the key would be "password" and the value would be 4
- Create the dictionaries programmatically using `for` loops
- Use the below text to create your dictionaries:
    - `spam_text= ['Send us your password', 'review us', 'Send your password', 'Send us your account']`
    - `ham_text= ['Send us your review', 'review your password']`

In [2]:
spam_text= ['Send us your password', 'review us', 'Send your password', 'Send us your account']

ham_text= ['Send us your review', 'review your password']

spam = {}
for i in spam_text:
    for j in i.lower().split(' '):
        if j not in spam:
            spam[j] = 1
        else:
            spam[j] += 1

print("Spam Dictionary:")            
print(spam)
print("\n")

ham = {}
for i in ham_text:
    for j in i.lower().split(' '):
        if j not in ham:
            ham[j] = 1
        else:
            ham[j] += 1

print("Ham Dictionary:")
print(ham)

Spam Dictionary:
{'send': 3, 'us': 3, 'your': 3, 'password': 2, 'review': 1, 'account': 1}


Ham Dictionary:
{'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}
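
The same dictionaries can also be built with `collections.Counter` from the standard library. This is a minimal sketch of an alternative to the nested `for` loops, not a replacement for the exercise; the helper name `count_words` is hypothetical.

```python
from collections import Counter

spam_text = ['Send us your password', 'review us', 'Send your password', 'Send us your account']
ham_text = ['Send us your review', 'review your password']

def count_words(sentences):
    """Count how often each lowercased word appears across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # split() with no argument splits on any run of whitespace
        counts.update(sentence.lower().split())
    return dict(counts)

spam = count_words(spam_text)
ham = count_words(ham_text)
print(spam)
print(ham)
```

`Counter` treats a missing key as a count of zero, which removes the need for the `if j not in spam` check.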


## Question: Given our dictionaries from the last activity, if we know an email is spam, what is the probability that the word "password" is in the email? 

What is the frequency of "password" in a spam email?

- Answer:

 $P(password \mid spam) = 2/(3+3+3+2+1+1) = 2/13 \approx 15.38\%$ 

In [4]:
# or 
p_password_given_spam = spam['password']/sum(spam.values())
print(p_password_given_spam)

0.15384615384615385


## Question: Given our dictionaries from the last activity, if we know an email is ham, what is the probability that the word "password" is in the email? 

What is the frequency of "password" in a ham email?

- Answer:

$P(password \mid ham) = 1/(1+1+2+2+1) = 1/7 \approx 14.29\%$ 

In [5]:
# or 
p_password_given_ham = ham['password']/sum(ham.values())
print(p_password_given_ham)

0.14285714285714285


## Question: Assume we have seen the word "password" in an email, what is the probability that the email is spam?

- $P(spam \mid password) = ?$
- Hint: Use Bayes' rule and Law of Total Probability (LOTP):
    - Bayes' Rule: $P(spam \mid password) = (P(password \mid spam) P(spam))/ P(password)$ 
    - LOTP: $P(password) = P(password \mid spam) P(spam) + P(password \mid ham) P(ham)$

In [6]:
# Calculated by viewing our dataset
p_spam = 4/6
p_ham = 2/6

# LOTP
p_password = p_password_given_spam*p_spam + p_password_given_ham*p_ham 
print("Probability of Password:")
print(p_password)
print("\n")

# Bayes Rule
p_spam_given_password = p_password_given_spam*p_spam/p_password
print("Probability of spam given password:")
print(p_spam_given_password)

Probability of Password:
0.15018315018315018


Probability of spam given password:
0.6829268292682927
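
As a check on the floating-point result above, the same calculation can be done with exact fractions. This sketch assumes the word counts and the 4 spam / 2 ham split from the earlier cells; the function name `posterior_spam` is hypothetical.

```python
from fractions import Fraction

def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' rule and the Law of Total Probability."""
    p_ham = 1 - p_spam
    # LOTP: P(word) = P(word | spam) P(spam) + P(word | ham) P(ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    # Bayes' rule: P(spam | word) = P(word | spam) P(spam) / P(word)
    return p_word_given_spam * p_spam / p_word

result = posterior_spam(Fraction(2, 13), Fraction(1, 7), Fraction(4, 6))
print(result, float(result))
```

The exact answer is $28/41$, which agrees with the decimal value printed by the cell above.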


## Naive Bayes Classifier (Math)

Bayes' theorem: $P(spam | w_1, w_2, ..., w_n) = {P(w_1, w_2, ..., w_n | spam)P(spam)}/{P(w_1, w_2, ..., w_n)}$

The Naive Bayes assumption is that each word is independent of all other words, given the class (spam or ham). In reality, this is not true! But let's try it out for our spam/ham examples:

Applying Bayes' Rule, the above relationship becomes simple for both spam and ham with the Naive Bayes assumption: 

$P(spam | w_1, w_2, ..., w_n) = {P(w_1| spam)P(w_2| spam) ... P(w_n| spam)P(spam)}/{P(w_1, w_2, ..., w_n)}$

$P(ham | w_1, w_2, ..., w_n) = {P(w_1| ham)P(w_2| ham) ... P(w_n| ham)P(ham)}/{P(w_1, w_2, ..., w_n)}$

The denominator $P(w_1, w_2, ..., w_n)$ is independent of spam and ham, so we can remove it to simplify our equations, as we only care about labeling, and proportional relationships:

$P(spam | w_1, w_2, ..., w_n) \propto {P(w_1| spam)P(w_2| spam) ... P(w_n| spam)P(spam)}$

$P(ham | w_1, w_2, ..., w_n) \propto {P(w_1| ham)P(w_2| ham) ... P(w_n| ham)P(ham)}$
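
To see why dropping the shared denominator is safe, here is a small sketch using the toy word counts from the first activity. It computes the unnormalized scores for the sentence "send us your password" and then normalizes them. The sentence, the function name `unnormalized_score`, and the variable names are chosen for illustration only.

```python
import math

spam = {'send': 3, 'us': 3, 'your': 3, 'password': 2, 'review': 1, 'account': 1}
ham = {'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}
priors = {'spam': 4 / 6, 'ham': 2 / 6}
words = 'send us your password'.split()

def unnormalized_score(words, counts, prior):
    """Multiply P(w_i | class) over all words, then multiply by the class prior."""
    total = sum(counts.values())
    return prior * math.prod(counts[w] / total for w in words)

score_spam = unnormalized_score(words, spam, priors['spam'])
score_ham = unnormalized_score(words, ham, priors['ham'])

# Dividing both scores by the same positive number (here their sum) rescales
# them so they add up to 1, but cannot change which one is larger.
evidence = score_spam + score_ham
post_spam = score_spam / evidence
post_ham = score_ham / evidence
print(score_spam > score_ham, post_spam > post_ham)
```

Both comparisons give the same answer, so the label can be chosen from the unnormalized scores alone.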

This is easier to work with if we can write it as a summation. To do so, we can take the log of both sides, because the log of a product is the sum of the logs. The denominator we dropped becomes an additive constant $C = -log P(w_1, w_2, ..., w_n)$ that is the same for spam and ham, so it does not affect the comparison:

$logP(spam | w_1, w_2, ..., w_n) = {\sum_{i=1}^{n}log P(w_i| spam)+ log P(spam)} + C$

$logP(ham | w_1, w_2, ..., w_n) = {\sum_{i=1}^{n}log P(w_i| ham)+ log P(ham)} + C$

**Given the above, we can therefore say that if:**

${\sum_{i=1}^{n}log P(w_i| spam)+ log P(spam)} > {\sum_{i=1}^{n}log P(w_i| ham)+ log P(ham)}$

**then that sentence is spam. Otherwise, the sentence is ham!**
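
The decision rule above can be sketched in a few lines. This example reuses the toy word counts from the first activity and assumes every word in the sentence appears in both dictionaries, since $log(0)$ is undefined. The function name `log_score` is hypothetical.

```python
import math

spam = {'send': 3, 'us': 3, 'your': 3, 'password': 2, 'review': 1, 'account': 1}
ham = {'send': 1, 'us': 1, 'your': 2, 'review': 2, 'password': 1}

def log_score(words, counts, prior):
    """Sum of log P(w_i | class) over the words, plus log P(class)."""
    total = sum(counts.values())
    return sum(math.log(counts[w] / total) for w in words) + math.log(prior)

words = 'send us your password'.split()
spam_score = log_score(words, spam, 4 / 6)
ham_score = log_score(words, ham, 2 / 6)
label = 'spam' if spam_score > ham_score else 'ham'
print(spam_score, ham_score, label)

# Why logs help in practice: a product of many small probabilities
# underflows to 0.0, while the equivalent sum of logs stays finite.
print(0.01 ** 200, 200 * math.log(0.01))
```

With long messages the raw products for both classes can underflow to zero and become impossible to compare, which is a practical reason to work in log space.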

## Pseudo-code for Naive Bayes for spam/ham dataset:

- Assume the following small dataset is given

- The first column is the labels of received emails

- The second column is the body of the email (sentences)

<img src="Images/spam_minidataset.png" width="700">

1- Based on the given dataset above, create the following two dictionaries:

     Ham -> D_ham = {'Jos': 1, 'ask': 1, 'you': 1, ... }
    
     Spam -> D_spam = {'Did': 1, 'you': 3, ... }
    
Each dictionary represents all the words in the spam or ham emails, with each word's frequency as its value.

2- For any new given sentences, having $w_1$, $w_2$, ... $w_n$ words, assuming the sentence is ham, calculate the following:

     $P(w_1| ham)$, $P(w_2| ham)$, ..., $P(w_n| ham)$
     $log(P(w_1| ham))$, $log(P(w_2| ham))$, ..., $log(P(w_n| ham))$
 
then add them all together to create one value


3- Calculate what percentage of labels are ham -> $P(ham)$ -> then take the log -> $log(P(ham))$

4- Add the values from steps (2) and (3)

5- Do Steps (2) - (4) again, but assume the given new sentence is spam

6- Compare the two values. The greater value indicates which label (class) the sentence should be given
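
The six steps can be sketched as a pair of small functions. Because the dataset in the figure is not reproduced here, this sketch trains on the toy sentences from the first activity. It also adds one to every word count (Laplace smoothing), which is an assumption not stated in the steps, so that a word missing from one class does not produce $log(0)$. The names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

def train(labeled_sentences):
    """Step 1: build per-class word-count dictionaries and count the labels."""
    word_counts = {'spam': Counter(), 'ham': Counter()}
    label_counts = Counter()
    for label, sentence in labeled_sentences:
        label_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    return word_counts, label_counts

def classify(sentence, word_counts, label_counts):
    """Steps 2-6: score the sentence under each class and return the larger."""
    vocab = set(word_counts['spam']) | set(word_counts['ham'])
    n_labels = sum(label_counts.values())
    scores = {}
    for label in ('spam', 'ham'):
        total = sum(word_counts[label].values())
        # Step 2: sum of log P(w_i | class), with add-one smoothing (assumption)
        score = sum(
            math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in sentence.lower().split()
        )
        # Steps 3-4: add log P(class)
        scores[label] = score + math.log(label_counts[label] / n_labels)
    # Step 6: the greater value decides the label
    return max(scores, key=scores.get)

data = [('spam', s) for s in ['Send us your password', 'review us',
                              'Send your password', 'Send us your account']]
data += [('ham', s) for s in ['Send us your review', 'review your password']]
word_counts, label_counts = train(data)
print(classify('send us your account', word_counts, label_counts))
print(classify('review your password', word_counts, label_counts))
```

Step 5 is handled by the loop over both labels, so the same scoring code runs once for spam and once for ham.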
  

## Activity: Apply Naive Bayes to the spam/ham email dataset

**In groups of 3, complete the following activity**

1. Please read this article, starting at the **Naive Bayes Assumption** section: https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/
1. We will use the [Spam Dataset](Datasets/spam.csv)
1. In the article, for the code block of the `fit` method, which line(s) calculate the probability of ham and spam?
1. For the same `fit` method, which line(s) build the spam and ham dictionaries?
1. In the article, for the code block of the `predict` method, which line(s) compare the ham and spam scores based on log probabilities?

We will discuss as a class after working in groups.

## Activity: Find the Naive Bayes core parts in the SpamDetector Class

**In groups of 3, complete the following activity**

Assume we have written the `SpamDetector` class from the article. Train this model from the given [Spam Dataset](Datasets/spam.csv), and use it to make a prediction!

Use the starter code below, and then fill in the TODOs in the `main`.

**Hints:**

- You will need to use [train_test_split from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to obtain your training and test (prediction) data
- You will need to instantiate your `SpamDetector`, fit it to the training data, predict using the test values, and then measure the accuracy
- To calculate accuracy, divide the number of correct predictions by the total number of predictions
- Use the following code to get your data ready for transforming/manipulating:
```
data = pd.read_csv('Datasets/spam.csv',encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":'label', "v2":'text'})
print(data.head())
tags = data["label"]
texts = data["text"]
X, y = texts, tags
```
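
The accuracy hint can be written as a short helper. This is a sketch with made-up labels, only to show the arithmetic; the function name `accuracy` is hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up labels for illustration: 3 of the 4 predictions match
print(accuracy(['spam', 'ham', 'ham', 'spam'], ['spam', 'ham', 'spam', 'spam']))  # 0.75
```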

In [11]:
import os
import re
import string
import math
import pandas as pd

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""

    # clean up our string by removing punctuation
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    #  tokenize our string into words
    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    # count up how many of each word appears in a list of words.
    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

    def fit(self, X, Y):
        """Fit our classifier
        Arguments:
            X {list} -- list of document contents
            y {list} -- correct labels
        """
        self.num_messages = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()

        # Compute log class priors (the probability that any given message is spam/ham),
        # by counting how many messages are spam/ham, 
        # dividing by the total number of messages, and taking the log.
        n = len(X)
        self.num_messages['spam'] = sum(1 for label in Y if label == 'spam')
        self.num_messages['ham'] = sum(1 for label in Y if label == 'ham')
        self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n )
        self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n )
        self.word_counts['spam'] = {}
        self.word_counts['ham'] = {}

        # for each (document, label) pair, tokenize the document into words.
        for x, y in zip(X, Y):
            c = 'spam' if y == 'spam' else 'ham'
            counts = self.get_word_counts(self.tokenize(x))
            # For each word, either add it to the vocabulary for spam/ham, 
            # if it isn’t already there, and update the number of counts. 
            for word, count in counts.items():
                # Add that word to the global vocabulary.
                if word not in self.vocab:
                    self.vocab.add(word)
                if word not in self.word_counts[c]:
                    self.word_counts[c][word] = 0.0

                self.word_counts[c][word] += count

    # function to actually output the class label for new data.
    def predict(self, X):
        result = []
        # Given a document...
        for x in X:
            counts = self.get_word_counts(self.tokenize(x))
            spam_score = 0
            ham_score = 0
            # We iterate through each of the words...
            for word, _ in counts.items():
                if word not in self.vocab: continue
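                # ... and compute log p(w_i|Spam), and sum them all up. The same will happen for Ham
                # add Laplace (add-one) smoothing so that unseen words do not give log(0)
                # Note: the denominator below uses the number of messages per class;
                # the textbook multinomial form uses the total word count of the class instead.
                # https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf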
                log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (self.num_messages['spam'] + len(self.vocab)) )
                log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (self.num_messages['ham'] + len(self.vocab)) )

                spam_score += log_w_given_spam
                ham_score += log_w_given_ham
            
            # Then we add the log class priors...
            spam_score += self.log_class_priors['spam']
            ham_score += self.log_class_priors['ham']

            # ... and check to see which score is bigger for that document.
            # Whichever is larger, that is the predicted label!
            if spam_score > ham_score:
                result.append('spam')
            else:
                result.append('ham')
        return result
        

# TODO: Fill in the below function to make a prediction, 
# your answer should match the final number in the below output (0.9641)
if __name__ == '__main__':
    pass

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5572
{'spam': 567, 'ham': 3612}
0.9641


### Solution

In [None]:
if __name__ == '__main__':
    from sklearn.model_selection import train_test_split

    # import/clean/label your data
    data = pd.read_csv('Datasets/spam.csv',encoding='latin-1')
    data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
    data = data.rename(columns={"v1":'label', "v2":'text'})
    print(data.head())
    tags = data["label"]
    texts = data["text"]

    # create texts and tags
    X, y = texts, tags
    print(len(X))
    
    # split the data into train vs test
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # instantiate your SpamDetector
    MNB = SpamDetector()
    # fit the model on the training part of the dataset
    MNB.fit(X_train.values, y_train.values)
    print(MNB.num_messages)
    # print(MNB.word_counts)
    # make predictions
    pred = MNB.predict(X_test.values)
    true = y_test.values
    # test for accuracy
    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))

## Activity: Apply sklearn's CountVectorizer and MultinomialNB to the spam email dataset

As we've seen with previous topics, sklearn has a lot of built-in functionality that can save us from writing the code from scratch. We are going to solve the same problem as in the previous activity, but using sklearn!

For example, the `SpamDetector` class in the previous activity is an example of a **Multinomial Naive Bayes (MNB) model**. In an MNB model, the word counts of a message given its class (i.e. the likelihood $P(w_1, w_2, ..., w_n | spam)$) are modeled with a multinomial distribution, the generalization of the binomial distribution to more than two outcomes, rather than another type of distribution.

**In groups of 3, complete the activity by using the provided starter code and following the steps below:**

1 - Split the dataset

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)`

2 - Vectorize the dataset: `vectorizer = CountVectorizer()`

3 - Transform training data into a document-term matrix (bag of words, BoW): `X_train_dtm = vectorizer.fit_transform(X_train)`

4 - Build and evaluate the model

**Hints:**

- Remember how you prepared/cleaned/labeled the dataset, created texts and tags, and split the data into train vs test in the previous activity. You'll need to do so again here
- Review the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to see how you can transform text into numerical vectors
- Need more help? Check out this [MNB Vectorization](https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/) article and see what you can use from it.
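
To get a feel for what a document-term matrix looks like before working on the full dataset, here is a small sketch on the toy sentences from the first activity. It assumes scikit-learn is installed and uses `CountVectorizer` and `MultinomialNB` with their default settings; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['Send us your password', 'review us', 'Send your password',
         'Send us your account', 'Send us your review', 'review your password']
labels = ['spam'] * 4 + ['ham'] * 2

vectorizer = CountVectorizer()
# Learn the vocabulary and count words: one row per document, one column per word
X_dtm = vectorizer.fit_transform(texts)
print(X_dtm.shape)
# vocabulary_ maps each word to its column index
print(sorted(vectorizer.vocabulary_))
print(X_dtm.toarray())

nb = MultinomialNB()
nb.fit(X_dtm, labels)
# New text goes through transform (not fit_transform) so it reuses the same columns
prediction = nb.predict(vectorizer.transform(['Send us your password']))
print(prediction)
```

Calling `transform` rather than `fit_transform` on new text matters: refitting would build a different vocabulary, and the columns would no longer line up with what the model was trained on.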

In [None]:
## starter code:

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# TODO: Prepare the dataset

# TODO: create texts and tags

# TODO: split the data into train vs test

# TODO: transform text into numerical vectors

# instantiate Multinomial Naive Bayes model
nb = MultinomialNB()
# fit the model on the training part of the dataset
nb.fit(X_train_dtm, y_train)
# transform the test text with the vectorizer fitted on the training data
X_test_dtm = vectorizer.transform(X_test)
# make prediction
y_pred_class = nb.predict(X_test_dtm)
# test accuracy of prediction
metrics.accuracy_score(y_test, y_pred_class)

### Solution

In [14]:
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Prepare the dataset
data = pd.read_csv('Datasets/spam.csv',encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":'label', "v2":'text'})
print(data.head())
tags = data["label"]
texts = data["text"]

# create texts and tags
X, y = texts, tags

# split the data into train vs test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# transform text into numerical vectors
vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)
print(X_train_dtm)

# instantiate Multinomial Naive Bayes model
nb = MultinomialNB()
# fit the model on the training part of the dataset
nb.fit(X_train_dtm, y_train)
# transform the test text with the vectorizer fitted on the training data
X_test_dtm = vectorizer.transform(X_test)
# make prediction
y_pred_class = nb.predict(X_test_dtm)
# test accuracy of prediction
metrics.accuracy_score(y_test, y_pred_class)

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
  (0, 4527)	1
  (0, 3636)	1
  (0, 3647)	1
  (0, 3493)	1
  (0, 5683)	1
  (0, 3263)	1
  (0, 4469)	1
  (0, 2236)	1
  (0, 4798)	1
  (0, 1488)	1
  (0, 4854)	1
  (0, 3451)	1
  (0, 2218)	1
  (0, 6326)	2
  (0, 6604)	1
  (0, 1535)	2
  (0, 4176)	2
  (0, 7373)	1
  (0, 5065)	2
  (0, 2112)	1
  (0, 6727)	1
  (0, 5712)	1
  (0, 819)	1
  (0, 802)	1
  (0, 919)	1
  :	:
  (4176, 4894)	1
  (4176, 4833)	1
  (4176, 3439)	1
  (4176, 1590)	1
  (4176, 4219)	1
  (4176, 7163)	1
  (4176, 4450)	1
  (4176, 6638)	1
  (4176, 2304)	1
  (4176, 3416)	1
  (4176, 3252)	1
  (4176, 4747)	1
  (4177, 6953)	1
  (4177, 5232)	1
  (4177, 1848)	1
  (4177, 4766)	1
  (4177, 3162)	1
  (4

0.9856424982053122

## Resources

- [MNB Vectorization](https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/)
