# Naive Bayes in Python
This notebook demonstrates a simple implementation of Naive Bayes that does not need any ML libraries. Counts and lookup tables (i.e., Python dictionaries, or a small pandas DataFrame as used below) are enough to fit the model and make inferences.

In [2]:
import numpy as np
import pandas as pd

### The Corpus

In [3]:
documents = [
    ("I ate dinner early", "HAM"),
    ("free money today", "SPAM"),
    ("I had a blast", "HAM"),
    ("sign up free early today", "HAM"),
    ("only free today", "SPAM")
]

In [5]:
# Build the vocabulary: the set of unique words across all documents.
# (This notebook stores it in a variable named `corpus`.)
corpus = set()
for text, _label in documents:
    for word in text.split():
        corpus.add(word)
corpus

{'I',
 'a',
 'ate',
 'blast',
 'dinner',
 'early',
 'free',
 'had',
 'money',
 'only',
 'sign',
 'today',
 'up'}

# Generate Conditional Probabilities
We first need to estimate $P(x|y)$, the likelihood of seeing word $x$ given class $y$. For instance, the likelihood of finding the word `free` in a document that we know is `HAM` is written `P(x="free"|y="HAM")`.
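
As a quick illustration of this quantity, the sketch below estimates it by direct counting. It assumes the likelihood is estimated as the fraction of documents in a class that contain the word, which is how the rest of this notebook does it. The `documents` list is repeated so the snippet runs on its own, and `word_likelihood` is a hypothetical helper name, not part of the notebook.

```python
documents = [
    ("I ate dinner early", "HAM"),
    ("free money today", "SPAM"),
    ("I had a blast", "HAM"),
    ("sign up free early today", "HAM"),
    ("only free today", "SPAM"),
]


def word_likelihood(word, label, documents):
    """Fraction of the documents with `label` that contain `word`."""
    texts = [text for text, doc_label in documents if doc_label == label]
    containing = sum(1 for text in texts if word in text.split())
    return containing / len(texts)


# "free" appears in 1 of the 3 HAM documents and in 2 of the 2 SPAM documents
p_free_given_ham = word_likelihood("free", "HAM", documents)
p_free_given_spam = word_likelihood("free", "SPAM", documents)
print(p_free_given_ham, p_free_given_spam)
```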

In [6]:
conditional_probabilities = pd.DataFrame(index=list(corpus), 
                                         columns=["likelihood_given_ham", "likelihood_given_spam"])

### Compute Our Priors
We first want to obtain the values of $P(y=ham)$ and $P(y=spam)$, that is, the fraction of training documents that belong to each class.

In [7]:
spam_documents = 0
ham_documents = 0
for doc, label in documents:
    if label == "SPAM":
        spam_documents += 1
    else:
        ham_documents += 1

    print(f"{doc}, {label}")
    print(f"Spam documents: {spam_documents}")
    print(f"Ham documents: {ham_documents} \n\n")
    
p_ham = ham_documents / (spam_documents + ham_documents)
p_spam = spam_documents / (spam_documents + ham_documents)

I ate dinner early, HAM
Spam documents: 0
Ham documents: 1 


free money today, SPAM
Spam documents: 1
Ham documents: 1 


I had a blast, HAM
Spam documents: 1
Ham documents: 2 


sign up free early today, HAM
Spam documents: 1
Ham documents: 3 


only free today, SPAM
Spam documents: 2
Ham documents: 3 




### Compute the Conditional Likelihoods
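We next want to compute the values of $P(x|y=ham)$ and $P(x|y=spam)$ for every word $x$ in the vocabulary. Each one is estimated as the fraction of documents in that class that contain the word.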

In [6]:
for word in corpus:
    ham_documents_with_word = 0
    spam_documents_with_word = 0

    # Count how many documents of each class contain the word at least once
    for text, label in documents:
        if word in text.split():
            if label == "HAM":
                ham_documents_with_word += 1
            else:
                spam_documents_with_word += 1

    print(f"For word {word}, {ham_documents_with_word} ham out of {ham_documents} ham documents.")
    print(f"For word {word}, {spam_documents_with_word} spam out of {spam_documents} spam documents.\n")
    # Likelihood estimate: fraction of the class's documents that contain the word
    conditional_probabilities.loc[word, "likelihood_given_ham"] = ham_documents_with_word / ham_documents
    conditional_probabilities.loc[word, "likelihood_given_spam"] = spam_documents_with_word / spam_documents

For word sign, 1 ham out of 3 ham documents.
For word sign, 0 spam out of 2 spam documents.

For word early, 2 ham out of 3 ham documents.
For word early, 0 spam out of 2 spam documents.

For word free, 1 ham out of 3 ham documents.
For word free, 2 spam out of 2 spam documents.

For word a, 1 ham out of 3 ham documents.
For word a, 0 spam out of 2 spam documents.

For word I, 2 ham out of 3 ham documents.
For word I, 0 spam out of 2 spam documents.

For word money, 0 ham out of 3 ham documents.
For word money, 1 spam out of 2 spam documents.

For word blast, 1 ham out of 3 ham documents.
For word blast, 0 spam out of 2 spam documents.

For word ate, 1 ham out of 3 ham documents.
For word ate, 0 spam out of 2 spam documents.

For word only, 0 ham out of 3 ham documents.
For word only, 1 spam out of 2 spam documents.

For word today, 1 ham out of 3 ham documents.
For word today, 2 spam out of 2 spam documents.

For word dinner, 1 ham out of 3 ham documents.
For word dinner, 0 spam out of 2 spam documents.

In [7]:
conditional_probabilities

       likelihood_given_ham likelihood_given_spam
sign               0.333333                   0.0
early              0.666667                   0.0
free               0.333333                   1.0
a                  0.333333                   0.0
I                  0.666667                   0.0
money                   0.0                   0.5
blast              0.333333                   0.0
ate                0.333333                   0.0
only                    0.0                   0.5
today              0.333333                   1.0
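
The same priors and likelihoods can also be kept in plain dictionaries, with no DataFrame involved. The following is a minimal sketch of that idea, not the notebook's own code: the names `class_counts`, `word_doc_counts`, `likelihoods` and `priors` are made up for illustration, and the `documents` list is repeated so the snippet is self-contained.

```python
from collections import Counter

documents = [
    ("I ate dinner early", "HAM"),
    ("free money today", "SPAM"),
    ("I had a blast", "HAM"),
    ("sign up free early today", "HAM"),
    ("only free today", "SPAM"),
]

# Number of documents per class
class_counts = Counter(label for _, label in documents)

# For each class, the number of documents that contain each word
word_doc_counts = {label: Counter() for label in class_counts}
for text, label in documents:
    # set() so a word repeated within one document is counted only once
    word_doc_counts[label].update(set(text.split()))

vocabulary = {word for text, _ in documents for word in text.split()}

# Lookup tables: priors[label] and likelihoods[word][label]
priors = {label: count / len(documents) for label, count in class_counts.items()}
likelihoods = {
    word: {
        label: word_doc_counts[label][word] / class_counts[label]
        for label in class_counts
    }
    for word in vocabulary
}

print(priors)
print(likelihoods["free"])
```

A `Counter` returns 0 for a missing key, so words that never occur in a class get a likelihood of 0 without any special handling.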


### Get the Posterior Probability of a Test Document
Now that we have our priors and our likelihoods, we actually have everything we need to test out our Naive Bayes algorithm. We'll use a test document, `free today`, and compute its posterior probability $P(y=spam|x)$.

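Remember that, using Bayes' rule, we can rewrite this probability as

$$
P(y=spam|x) = \frac{P(x|y=spam)P(y=spam)}{P(x)}
$$
Under the naive independence assumption, the likelihood in the numerator factorizes over the words of the document:
$$
P(x|y=spam) = P(x = "free"|y=spam)\times P(x = "today" | y=spam)
$$
The denominator, or the evidence, is the total probability of the document across both classes:
$$
P(x) = P(x|y=spam)P(y=spam) + P(x|y=ham)P(y=ham)
$$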
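
As a worked check, here are the numbers for `free today`, taken from the likelihood table and the document counts above (3 ham and 2 spam documents). This assumes the evidence $P(x)$ is expanded as the sum of the joint probabilities over both classes, which is what the `get_posterior` function later in this notebook computes.

$$
\begin{aligned}
P(x|y=spam) &= 1 \times 1 = 1 \\
P(x|y=ham) &= \tfrac{1}{3} \times \tfrac{1}{3} = \tfrac{1}{9} \\
P(x) &= 1 \times \tfrac{2}{5} + \tfrac{1}{9} \times \tfrac{3}{5} = \tfrac{7}{15} \\
P(y=spam|x) &= \frac{1 \times \tfrac{2}{5}}{\tfrac{7}{15}} = \tfrac{6}{7} \approx 0.857 \\
P(y=ham|x) &= \frac{\tfrac{1}{9} \times \tfrac{3}{5}}{\tfrac{7}{15}} = \tfrac{1}{7} \approx 0.143
\end{aligned}
$$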

In [8]:
test_document = "free today"

#### Define a Function to Calculate the Likelihood

In [9]:
from typing import Tuple

def get_likelihood(test_document: str, conditional_probabilities: pd.DataFrame) -> Tuple[float, float]:
    # Naive independence assumption: the likelihood of the document is the
    # product of the per-word likelihoods.
    # Note: a word that is not in the training vocabulary raises a KeyError here.
    likelihood_ham = 1
    likelihood_spam = 1
    for word in test_document.split():
        likelihood_ham = likelihood_ham * conditional_probabilities.loc[word, "likelihood_given_ham"]
        likelihood_spam = likelihood_spam * conditional_probabilities.loc[word, "likelihood_given_spam"]

    return likelihood_ham, likelihood_spam

In [10]:
likelihood_ham, likelihood_spam = get_likelihood(test_document, conditional_probabilities)

#### Use the Likelihoods and Priors to Calculate the Posterior

In [11]:
def get_posterior(likelihood_ham: float, likelihood_spam: float, p_ham: float, p_spam: float) -> Tuple[float, float]:
    # Evidence P(x): total probability of the document across both classes
    evidence = likelihood_ham * p_ham + likelihood_spam * p_spam
    posterior_ham = likelihood_ham * p_ham / evidence
    posterior_spam = likelihood_spam * p_spam / evidence
    return posterior_ham, posterior_spam

In [12]:
get_posterior(likelihood_ham, likelihood_spam, p_ham, p_spam)

(0.14285714285714285, 0.8571428571428572)

### Define the End to End Algorithm for Training a Naive Bayes Classifier

In [13]:
def fit_naive_bayes(documents):
    """Fit the model: return the per-word likelihood table and the class priors."""
    # Build the vocabulary (unique words across all documents)
    corpus = set()
    for text, _label in documents:
        for word in text.split():
            corpus.add(word)

    conditional_probabilities = pd.DataFrame(index=list(corpus),
                                             columns=["likelihood_given_ham", "likelihood_given_spam"])

    # Class priors: fraction of documents in each class
    spam_documents = 0
    ham_documents = 0
    for _text, label in documents:
        if label == "SPAM":
            spam_documents += 1
        else:
            ham_documents += 1
    p_ham = ham_documents / (spam_documents + ham_documents)
    p_spam = spam_documents / (spam_documents + ham_documents)

    # Conditional likelihoods: fraction of each class's documents that contain the word
    for word in corpus:
        ham_documents_with_word = 0
        spam_documents_with_word = 0

        for text, label in documents:
            if word in text.split():
                if label == "HAM":
                    ham_documents_with_word += 1
                else:
                    spam_documents_with_word += 1

        conditional_probabilities.loc[word, "likelihood_given_ham"] = ham_documents_with_word / ham_documents
        conditional_probabilities.loc[word, "likelihood_given_spam"] = spam_documents_with_word / spam_documents

    return conditional_probabilities, p_ham, p_spam

In [14]:
fit_naive_bayes(documents)

(       likelihood_given_ham likelihood_given_spam
 sign               0.333333                   0.0
 early              0.666667                   0.0
 free               0.333333                   1.0
 a                  0.333333                   0.0
 I                  0.666667                   0.0
 money                   0.0                   0.5
 blast              0.333333                   0.0
 ate                0.333333                   0.0
 only                    0.0                   0.5
 today              0.333333                   1.0
 dinner             0.333333                   0.0
 had                0.333333                   0.0
 up                 0.333333                   0.0,
 0.6,
 0.4)
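
The fitted table and priors are all that is needed to classify a new document. The following is a minimal sketch, not part of the original notebook: `predict_naive_bayes` is a hypothetical helper, and it assumes the likelihood table has been converted to a nested dictionary (for example with `conditional_probabilities.to_dict("index")`). A few rows are copied in by hand so the snippet runs on its own. Skipping words that are missing from the table is also an assumption made here for illustration.

```python
# A few rows of the fitted table, keyed by word (values copied from the output above)
likelihood_table = {
    "free": {"likelihood_given_ham": 1 / 3, "likelihood_given_spam": 1.0},
    "today": {"likelihood_given_ham": 1 / 3, "likelihood_given_spam": 1.0},
    "early": {"likelihood_given_ham": 2 / 3, "likelihood_given_spam": 0.0},
    "money": {"likelihood_given_ham": 0.0, "likelihood_given_spam": 0.5},
}
p_ham, p_spam = 0.6, 0.4


def predict_naive_bayes(text, likelihood_table, p_ham, p_spam):
    """Return (label, posterior_ham, posterior_spam) for a whitespace-tokenized document."""
    # Start from the priors and multiply in one likelihood per word
    joint_ham, joint_spam = p_ham, p_spam
    for word in text.split():
        if word not in likelihood_table:
            continue  # assumption: ignore words never seen during training
        joint_ham *= likelihood_table[word]["likelihood_given_ham"]
        joint_spam *= likelihood_table[word]["likelihood_given_spam"]

    evidence = joint_ham + joint_spam
    if evidence == 0:
        # Both classes were ruled out by a zero likelihood: no posterior is defined
        return None, 0.0, 0.0

    posterior_ham = joint_ham / evidence
    posterior_spam = joint_spam / evidence
    label = "HAM" if posterior_ham > posterior_spam else "SPAM"
    return label, posterior_ham, posterior_spam


label, posterior_ham, posterior_spam = predict_naive_bayes(
    "free today", likelihood_table, p_ham, p_spam
)
print(label, posterior_ham, posterior_spam)
```

A document such as `early money` drives both joint probabilities to zero, because `early` never appears in spam and `money` never appears in ham. That is the problem the next section addresses.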

## Dealing with Non-Existent Words

From [Sebastian Raschka, Python Machine Learning](https://arxiv.org/pdf/1410.5329.pdf)
![Smoothing](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/smoothing.png "Smoothing for words that do not appear in the training data")
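
One common remedy for zero likelihoods is additive (Laplace) smoothing. The sketch below adapts it to the document-presence counts used in this notebook, where each word is either present in a document or absent, so the denominator grows by `2 * alpha`. That adaptation is an assumption made here and may differ from the formula in the figure; multinomial word-count models typically add `alpha` times the vocabulary size instead. The name `smoothed_likelihood` is hypothetical.

```python
def smoothed_likelihood(docs_with_word, docs_in_class, alpha=1.0):
    """Additive smoothing for a present/absent word feature.

    The 2 in the denominator is the number of outcomes (word present or absent).
    """
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)


# "early" appears in 0 of the 2 spam documents: the estimate is no longer exactly 0
early_given_spam = smoothed_likelihood(0, 2)
# "free" appears in 2 of the 2 spam documents: the estimate is no longer exactly 1
free_given_spam = smoothed_likelihood(2, 2)
# A word never seen in training has a count of 0 in every class (3 ham documents here)
unseen_given_ham = smoothed_likelihood(0, 3)
print(early_given_spam, free_given_spam, unseen_given_ham)
```

With `alpha=1.0` no likelihood is ever exactly 0 or 1, so a single unseen or class-exclusive word can no longer zero out the whole product.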
