# Naive Bayes

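Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Bayes’ theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive conditional independence assumption that

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

for all $i$, this relationship is simplified to

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.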
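
As a rough illustration of the classification rule (pick the class that maximises the prior times the product of the per-feature conditionals), the sketch below scores two classes for a single observation. The class names, feature names, and probability tables are made up for the example; they are not estimated from any dataset used in this notebook.

```python
# Hypothetical priors P(y) and per-feature conditionals P(x_i | y).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"has_link": 0.7, "has_greeting": 0.2},
    "ham": {"has_link": 0.1, "has_greeting": 0.6},
}

def posterior_scores(observed_features):
    """Return the unnormalised score P(y) * prod_i P(x_i | y) for each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed_features:
            score *= likelihoods[label][feature]
        scores[label] = score
    return scores

scores = posterior_scores(["has_link", "has_greeting"])
prediction = max(scores, key=scores.get)  # the arg max over classes
print(scores, prediction)
```

The scores are not normalised, which is fine here: the denominator is the same for every class, so it never changes which class wins.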

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.

On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
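
One way to see why the probability outputs can be unreliable is to feed the model the same evidence more than once. The sketch below is a hand-rolled toy, not scikit-learn code: it assumes equal priors and two unit-variance Gaussian classes with means 0 and 1, and it counts the same feature value either once or three times. Because the model treats the copies as independent, the posterior drifts toward certainty even though no new information was added.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def posterior_class1(x, copies):
    """P(y=1 | x) when the same feature value x is counted `copies` times."""
    # Hypothetical setup: equal priors, class 0 ~ N(0, 1), class 1 ~ N(1, 1).
    log_score0 = math.log(0.5) + copies * gaussian_log_pdf(x, 0.0, 1.0)
    log_score1 = math.log(0.5) + copies * gaussian_log_pdf(x, 1.0, 1.0)
    return 1.0 / (1.0 + math.exp(log_score0 - log_score1))

p_single = posterior_class1(0.8, copies=1)
p_triple = posterior_class1(0.8, copies=3)
print(round(p_single, 3), round(p_triple, 3))
```

The predicted class is the same in both cases; only the reported confidence changes, which is why the ranking of classes is more trustworthy than the probabilities themselves.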

### Gaussian Naive Bayes 

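GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.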
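
To make the estimation step concrete, here is a minimal pure-Python sketch with a single feature. The training values and the class labels "a" and "b" are invented for illustration, and the code is a simplified stand-in for what a Gaussian naive Bayes model does, not a reproduction of GaussianNB's internals. The maximum-likelihood estimates are simply the per-class sample mean and the (biased) population variance.

```python
import math
from statistics import mean, pvariance

# Hypothetical 1-D training data: one feature, two classes.
samples = {
    "a": [1.0, 1.2, 0.8, 1.1],
    "b": [3.0, 2.7, 3.3, 3.1],
}

# Maximum-likelihood estimates per class: (mean, variance).
params = {label: (mean(xs), pvariance(xs)) for label, xs in samples.items()}

# Class priors as relative frequencies in the training set.
total = sum(len(xs) for xs in samples.values())
priors = {label: len(xs) / total for label, xs in samples.items()}

def gaussian_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x):
    """Return the class with the largest prior * likelihood."""
    scores = {label: priors[label] * gaussian_pdf(x, *params[label]) for label in samples}
    return max(scores, key=scores.get)

print(params)
print(predict(1.5), predict(2.6))
```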

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points: %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))


Number of mislabeled points out of a total 75 points: 4


# -------------------------------------

# Naive Bayes in Python

### Tokenizing and counting
Create a bag of words by counting every word in the text, ignoring case and punctuation. A regex splits the text into tokens.

In [7]:
import re
import string

def remove_punctuation(s):
    # the third argument of str.maketrans lists the characters to delete
    table = str.maketrans("", "", string.punctuation)
    return s.translate(table)

def tokenize(text):
    text = remove_punctuation(text)
    text = text.lower()
    # drop the empty strings re.split can leave at the edges
    return [w for w in re.split(r"\W+", text) if w]

def count_words(words):
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0.0) + 1.0
    return wc

s = "Hello my name is Ivan. My favorite food is sushi"
count_words(tokenize(s))

### Counting our probabilities
So now that we can count words, let's get cooking. The code below is going to do the following:
 * open each document
 * label it as either "crypto" or "dino" and keep track of how many of each label there are (priors)
 * count the words for the document
 * add those counts to vocab, the corpus-level word count
 * add those counts to word_counts, the category-level word count

In [6]:
from pathlib import Path

# set up some structures to store our data
vocab = {}
word_counts = {
    "crypto": {},
    "dino": {}
}
priors = {
    "crypto": 0.,
    "dino": 0.
}
docs = []
# pathlib stands in for the `find` command here; rglob walks the folder
# recursively and only yields .txt files
for path in sorted(Path("nb_files/sample-data").rglob("*.txt")):
    f = str(path)
    if "cryptid" in f:
        category = "crypto"
    else:
        category = "dino"
    docs.append((category, f))
    # ok, time to start counting stuff
    priors[category] += 1
    with open(f) as fh:
        text = fh.read()
    words = tokenize(text)
    counts = count_words(words)
    for word, count in counts.items():
        # if we haven't seen a word yet, add it to our dictionaries with a count of 0
        if word not in vocab:
            vocab[word] = 0.0  # use 0.0 here so python does "correct" math
        if word not in word_counts[category]:
            word_counts[category][word] = 0.0
        vocab[word] += count
        word_counts[category][word] += count

###  Classifying a new page

In [31]:
with open("nb_files/examples/Yeti.txt") as fh:
    new_doc = fh.read()
words = tokenize(new_doc)
counts = count_words(words)

Alright, we've got our counts. Now we'll calculate P(word|category) for each word and multiply these conditional probabilities together to calculate P(category|set of words). Multiplying many small probabilities quickly underflows floating-point precision, so we're going to perform the operations in logspace: we add up log(probability) values instead of multiplying the probabilities themselves.
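
To see why logspace matters, here is a small sketch with made-up numbers: a thousand word probabilities of 0.01 each. The values are purely illustrative and do not come from the sample data.

```python
import math

probabilities = [0.01] * 1000  # hypothetical per-word probabilities

# Multiplying directly: the running product becomes too small for a float.
product = 1.0
for p in probabilities:
    product *= p

# Adding logs instead keeps the value comfortably in range.
log_sum = sum(math.log(p) for p in probabilities)

print(product)
print(log_sum)
```

The direct product collapses to zero, so every category would tie, while the sum of logs stays a perfectly ordinary negative number that can still be compared across categories.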

In [9]:
import math

prior_dino = priors["dino"] / sum(priors.values())
prior_crypto = priors["crypto"] / sum(priors.values())

log_prob_crypto = 0.0
log_prob_dino = 0.0

for w, cnt in counts.items():
    # skip words that we haven't seen before, or words of 3 letters or fewer
    if w not in vocab or len(w) <= 3:
        continue
    # calculate the probability that the word occurs at all
    p_word = vocab[w] / sum(vocab.values())
    # for both categories, calculate P(word|category), or the probability a
    #   word will appear, given that we know that the document is <category>
    p_w_given_dino = word_counts["dino"].get(w, 0.0) / sum(word_counts["dino"].values())
    p_w_given_crypto = word_counts["crypto"].get(w, 0.0) / sum(word_counts["crypto"].values())
    # add the new probability to our running total, log_prob_<category>; if the
    # probability is 0 (i.e. the word never appears for the category), skip it
    if p_w_given_dino > 0:
        log_prob_dino += math.log(cnt * p_w_given_dino / p_word)
    if p_w_given_crypto > 0:
        log_prob_crypto += math.log(cnt * p_w_given_crypto / p_word)

# print out the results; we need to go from logspace back to "regular" space,
# so we take the EXP of the log_prob
print("Score(dino):", math.exp(log_prob_dino + math.log(prior_dino)))
print("Score(crypto):", math.exp(log_prob_crypto + math.log(prior_crypto)))
# dino: 2601.766
# crypto: 25239.089

Since we're slightly bending the rules of Bayes' Theorem, the results are not actual probabilities, but rather are "scores". All you really need to know is which one is bigger. So our suspicions are confirmed, the "Yeti.txt" file is being classified overwhelmingly in favor of crypto (as we would hope).
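
Since only the ordering matters, the comparison can also be done directly on the log scores, without converting back to "regular" space. A minimal sketch, using the two scores recorded in the code comments above as stand-in inputs:

```python
import math

# Log scores per category; the raw values are the ones noted in the comments above.
log_scores = {
    "dino": math.log(2601.766),
    "crypto": math.log(25239.089),
}

# log is monotonic, so the largest log score is also the largest score.
best = max(log_scores, key=log_scores.get)
print(best)
```

Skipping the exp step also sidesteps overflow when the log scores get large on longer documents.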

# --------------------------------

## Another tutorial on the web
https://www.youtube.com/watch?v=99MN-rl8jGY

In [15]:
import numpy as np
import pandas as pd

import urllib.request
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score

#python 3
#with urllib.request.urlopen("http://www.python.org") as url:
#    s = url.read()

In [17]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
raw_data = urllib.request.urlopen(url)  # python 2.x: urllib.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset[0])

[  0.      0.64    0.64    0.      0.32    0.      0.      0.      0.
   0.      0.      0.64    0.      0.      0.      0.32    0.      1.29
   1.93    0.      0.96    0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.778   0.      0.
   3.756  61.    278.      1.   ]


In [18]:
# columns 0-47 hold the word-frequency features; the last column is the spam label
X = dataset[:, 0:48]
y = dataset[:, -1]


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=17)

### Bernoulli

In [22]:
BernNB = BernoulliNB(binarize=True)
BernNB.fit(X_train, y_train)
print(BernNB)
# expected labels
y_expect = y_test
y_pred = BernNB.predict(X_test)
print(accuracy_score(y_expect, y_pred))

BernoulliNB(alpha=1.0, binarize=True, class_prior=None, fit_prior=True)
0.8558262014483212
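
The `binarize` argument is a threshold: feature values strictly above it become 1 and everything else becomes 0 before the Bernoulli model is fit. The sketch below imitates that step in plain Python on the first five values of the Spambase row printed earlier. The helper `binarize_row` is a hypothetical stand-in written for this example, not a scikit-learn function, and it assumes that `binarize=True` is read as a threshold of 1.0.

```python
def binarize_row(row, threshold):
    """Map each value to 1 if it is strictly greater than threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

# First five feature values of the first Spambase row printed earlier.
row = [0.0, 0.64, 0.64, 0.0, 0.32]

print(binarize_row(row, 1.0))  # threshold assumed for binarize=True
print(binarize_row(row, 0.1))  # a lower threshold keeps more of the signal
```

With a threshold of 1.0 all five frequencies are flattened to zero, so the model sees very little of the word information in this row; a smaller threshold preserves which words actually occurred.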


### Multinomial

In [24]:
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
print(MultiNB)

y_pred=MultiNB.predict(X_test)
print(accuracy_score(y_expect,y_pred))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
0.8736010533245556


### Gaussian

In [25]:
GausNB = GaussianNB()
GausNB.fit(X_train, y_train)
print(GausNB)

y_pred=GausNB.predict(X_test)
print(accuracy_score(y_expect,y_pred))

GaussianNB(priors=None, var_smoothing=1e-09)
0.8130348913759052


#### improved versions

In [30]:
BernNB = BernoulliNB(binarize=0.1)
# fit on the training split so the accuracy is measured on unseen data
BernNB.fit(X_train, y_train)
print(BernNB)
y_expect = y_test
y_pred = BernNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))


BernoulliNB(alpha=1.0, binarize=0.1, class_prior=None, fit_prior=True)
