# Naïve Bayes from Scratch!

## A Quick Intro to the Naïve Bayes Algorithm

Naive Bayes is one of the most common machine learning (ML) algorithms for text classification. If you have just stepped into ML, it is one of the easiest classification algorithms to start with. Naive Bayes is a probabilistic classifier: it uses probabilities to predict the class of an example.

## Training Phase of The Naïve Bayes Model

Let’s say there is a restaurant review, “Very good food and service!!!”, and you want to predict whether it expresses a positive or a negative sentiment. To do this, we first need to train a model (which here essentially means counting the words of each category) on a relevant labelled training dataset. The trained model can then automatically classify such reviews into one of the sentiments it was trained on. Assume that you are given a training dataset that looks like the one below (a review and its corresponding sentiment):

</br>


<table>
  <tr><td><b>Training Examples</b></td><td><b>Labels</b></td></tr>
  <tr><td>Simply loved it!</td><td>Positive</td></tr>
  <tr><td>Most disgusting food I have ever had</td><td>Negative</td></tr>
  <tr><td>Stay away, very disgusting food</td><td>Negative</td></tr>
  <tr><td>Menu is absolutely perfect, loved it!</td><td>Positive</td></tr>
  <tr><td>A really good value for money</td><td>Positive</td></tr>
  <tr><td>This is a very good restaurant</td><td>Positive</td></tr>
  <tr><td>Terrible experience!</td><td>Negative</td></tr>
  <tr><td>This place has best food</td><td>Positive</td></tr>
  <tr><td>This place has most pathetic serving food!</td><td>Negative</td></tr>
</table>


</br>


The Naive Bayes classifier is a supervised machine learning algorithm: it learns from labelled examples like the ones above.

### Step # 1 : Data Preprocessing

As part of the preprocessing phase, all words in the training corpus/training dataset are converted to lowercase, and everything apart from letters, such as punctuation, is excluded from the training examples.

</br>

<b><i>A Quick Side Note :</i></b>  A common pitfall is feeding a test example directly into the trained model without preprocessing it in the same way as the training dataset was preprocessed. As a result, the trained model performs badly on a test example on which it was supposed to perform quite well!

</br>

Preprocessed Training Dataset:

</br>

<table>
  <tr><td><b>Training Examples</b></td><td><b>Labels</b></td></tr>
  <tr><td>simply loved it</td><td>Positive</td></tr>
  <tr><td>most disgusting food i have ever had</td><td>Negative</td></tr>
  <tr><td>stay away very disgusting food</td><td>Negative</td></tr>
  <tr><td>menu is absolutely perfect loved it</td><td>Positive</td></tr>
  <tr><td>a really good value for money</td><td>Positive</td></tr>
  <tr><td>this is a very good restaurant</td><td>Positive</td></tr>
  <tr><td>terrible experience</td><td>Negative</td></tr>
  <tr><td>this place has best food</td><td>Positive</td></tr>
  <tr><td>this place has most pathetic serving food</td><td>Negative</td></tr>
</table>
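
The preprocessing step described above can be sketched in a few lines. This is only a minimal illustration, assuming a hypothetical helper named `preprocess` and Python's standard `re` module; it is not the function used later in this post.

```python
import re

def preprocess(text):
    """Keep letters only, collapse whitespace, and lowercase (hypothetical helper)."""
    text = re.sub(r"[^a-zA-Z\s]+", " ", text)   # replace punctuation, digits, etc. with a space
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace and trim the ends
    return text.lower()

raw_reviews = ["Simply loved it!", "Stay away, very disgusting food"]
cleaned_reviews = [preprocess(review) for review in raw_reviews]
print(cleaned_reviews)
```

Applying the same helper to test examples avoids the pitfall mentioned in the side note above.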


</br>


### Step # 2 : Training Your Naïve Bayes Model

Simply make two bags of words (BoW), one for each category; each contains words and their corresponding counts. All words belonging to the “Positive” sentiment/label go into one BoW, and all words belonging to the “Negative” sentiment go into their own BoW. Every sentence in the training set is split into words (using space as the tokenizer/separator), and this is how the word-count pairs are constructed, as demonstrated below:


</br>

<table align=left>
  <th>Positive BoW</th>
  <tr><td><b>Words</b></td><td><b>Counts</b></td></tr>
  <tr><td>this</td><td>2</td></tr>
  <tr><td>loved</td><td>2</td></tr>
  <tr><td>it</td><td>2</td></tr>
  <tr><td>is</td><td>2</td></tr>
  <tr><td>good</td><td>2</td></tr>
  <tr><td>a</td><td>2</td></tr>
  <tr><td>very</td><td>1</td></tr>
  <tr><td>value</td><td>1</td></tr>
  <tr><td>simply</td><td>1</td></tr>
  <tr><td>restaurant</td><td>1</td></tr>
  <tr><td>really</td><td>1</td></tr>
  <tr><td>for</td><td>1</td></tr>
  <tr><td>perfect</td><td>1</td></tr>
  <tr><td>money</td><td>1</td></tr>
  <tr><td>menu</td><td>1</td></tr>
  <tr><td>absolutely</td><td>1</td></tr>
  <tr><td>place</td><td>1</td></tr>
  <tr><td>food</td><td>1</td></tr>
  <tr><td>best</td><td>1</td></tr>
  <tr><td>has</td><td>1</td></tr>
</table>



<table>
  <th>Negative BoW</th>
  <tr><td><b>Words</b></td><td><b>Counts</b></td></tr>
  <tr><td>food</td><td>3</td></tr>
  <tr><td>most</td><td>2</td></tr>
  <tr><td>disgusting</td><td>2</td></tr>
  <tr><td>very</td><td>1</td></tr>
  <tr><td>this</td><td>1</td></tr>
  <tr><td>terrible</td><td>1</td></tr>
  <tr><td>stay</td><td>1</td></tr>
  <tr><td>serving</td><td>1</td></tr>
  <tr><td>place</td><td>1</td></tr>
  <tr><td>pathetic</td><td>1</td></tr>
  <tr><td>i</td><td>1</td></tr>
  <tr><td>have</td><td>1</td></tr>
  <tr><td>has</td><td>1</td></tr>
  <tr><td>experience</td><td>1</td></tr>
  <tr><td>ever</td><td>1</td></tr>
  <tr><td>away</td><td>1</td></tr>
  <tr><td>had</td><td>1</td></tr>
  <tr><td></td><td></td></tr>
  <tr><td></td><td></td></tr>
  <tr><td></td><td></td></tr>
</table>
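
The two tables above can be reproduced with a short sketch. It assumes the preprocessed training set from earlier and uses `collections.Counter` as the bag of words; the variable names `train` and `bow` are illustrative only.

```python
from collections import Counter

# Preprocessed training examples and their labels (copied from the table above)
train = [
    ("simply loved it", "Positive"),
    ("most disgusting food i have ever had", "Negative"),
    ("stay away very disgusting food", "Negative"),
    ("menu is absolutely perfect loved it", "Positive"),
    ("a really good value for money", "Positive"),
    ("this is a very good restaurant", "Positive"),
    ("terrible experience", "Negative"),
    ("this place has best food", "Positive"),
    ("this place has most pathetic serving food", "Negative"),
]

# One bag of words (word -> count) per class
bow = {"Positive": Counter(), "Negative": Counter()}
for sentence, label in train:
    bow[label].update(sentence.split())   # split on whitespace and count each token

print(bow["Positive"].most_common(3))
print(bow["Negative"].most_common(3))
```

Summing the counts gives 26 words in the positive BoW and 21 in the negative one, which are the class totals needed later for the word probabilities.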


</br>

## The Testing Phase: Where Prediction Comes into Play!

Consider that your model is now given a restaurant review, “Very good food and service!!!”, and it needs to classify which category it belongs to: a positive review or a negative one? We need to find the probability of this review belonging to each category, and then assign it a positive or a negative label depending on which category gives the test example the higher probability.

## Finding Probability of a Given Test Example

### Step # 1 : Preprocessing of Test Example

Preprocess the test example in the same way as the training examples were preprocessed, i.e., convert it to lowercase and exclude everything apart from letters.

</br>

<table>
  <tr><td><b>Raw Test Example</b></td><td><b>Preprocessed Test Example</b></td></tr>
  <tr><td>Very good food and service!!!</td><td>very good food and service</td></tr>
</table>


</br>

### Step # 2 : Tokenization of Preprocessed Test Example

Tokenize the test example, i.e., split it into individual words.

</br>

<table>
  <tr><td>very</td><td>good</td><td>food</td><td>and</td><td>service</td></tr>
</table>


</br>


<b><i>A Quick Side Note</i></b> : You must already be familiar with the term “feature” in machine learning. Here, in Naive Bayes, each word in the vocabulary of each class of the training dataset constitutes a categorical feature. This implies that the counts of all the unique words (i.e., the vocabulary/vocab) of each class are basically a set of features for that particular class. And why do we need “counts”? Because we need a numeric representation of the categorical word features, as the Naive Bayes model/algorithm requires numeric features to compute the probabilistic scores!

</br>

### Step # 3 : Using Probability to Predict Label for Tokenized Test Example

</br>

<!-- The not so intimidating mathematical form of finding probability -->

The probability that a given <b>test example i</b> belongs to class <b>c</b>:

<b><i>p</i></b> (<b><i>i</i></b> belonging to class <b><i>c</i></b>) = product of two terms: <b>product</b> ( <b>p</b> of a test word <b>j</b> in class <b>c</b> ), taken over all test words <b>j</b>, and <b>p</b> of class <b>c</b>

</br>

*   let i = test example = “Very good food and service!!!”

*   Total number of words in i = 5, so values of j (representing feature number) vary from 1 to 5.

</br>

Let’s map the above scenario to the given test example to make it clearer!

</br>

## Let’s start calculating values for these product terms.

</br>

### Step # 1 : Finding Value of the Term : p of class c

This is simply the fraction of each category/class in the training set:



<img src="./full_probability.png" width="60%">

p of class c for Positive & Negative categories:

<img src="./img/probabil_pos_neg.png" width="50%">
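
As a quick check of the class fractions, here is a minimal sketch that counts the labels of the nine training examples. The names `labels` and `priors` are illustrative, and `fractions.Fraction` is used only to keep the result exact.

```python
from collections import Counter
from fractions import Fraction

# Labels of the nine training examples, in the order of the training table
labels = ["Positive", "Negative", "Negative", "Positive", "Positive",
          "Positive", "Negative", "Positive", "Negative"]

label_counts = Counter(labels)
# p(c) = number of training examples with label c / total number of training examples
priors = {c: Fraction(n, len(labels)) for c, n in label_counts.items()}

for c, p in priors.items():
    print(c, p, float(p))
```

With five positive and four negative reviews, the priors come out as 5/9 and 4/9.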

### Step # 2 : Finding value of term : product (p of a test word j in class c)

Before we start deducing the probability of a test word j in a specific class c, let’s quickly get familiar with some easy-peasy notation that is used in the coming lines of this blog post:

<img src="./img/1.png" width="80%">

Since we have only one example in our test set at the moment (to keep things simple), i = 1.

<img src="./img/2.png" width="60%">

<b><i>A Quick Side Note:</i></b> During test time/prediction time, we map every word of the test example to its count that was found during the training phase. So, in this case, we are looking up 5 word counts in total for this test example.

### Finding Probability of a Test Word “ j ” in class c

Before we start calculating product ( p of a test word “ j ” in class c ), we obviously first need to determine p of a test word “ j ” in class c. There are two ways of doing this, as specified below. Which one should be followed, and is used in practice, will be discovered in just a few minutes…

<img src="./img/3.png" width="60%">

### Let’s first try finding probabilities using method number 1:

<img src="./img/4.png" width="60%">

Now we can multiply the probabilities of individual words ( as found above ) in order to find the numerical value of the term : 
<b>product</b> ( <b>p</b> of a test word <b>“ j ”</b> in class <b>c</b> )


<b>The Common Pitfall of Zero Probabilities!</b>

<img src="./img/5.png" width="60%">

By now, we have numerical values for both terms, i.e., <b>p</b> of class <b>c</b> and <b>product</b> ( <b>p</b> of a test word <b>“ j ”</b> in class <b>c</b> ), for both classes. So we can multiply these two terms to determine <b>p</b> ( <b>i</b> belonging to class <b>c</b> ) for both categories. This is demonstrated below:

<img src="./img/6.png" width="60%">

The <b>p</b> ( <b>i</b> belonging to class <b>c</b> ) turns out to be <b>zero for both the categories!!!</b> But clearly the test example “Very good food and service!!!” belongs to the positive class! This happened because the <b>product</b> ( <b>p</b> of a test word <b>“ j ”</b> in class <b>c</b> ) <b>was zero for both the categories</b>, and this in turn was zero because <b>a few words in the given test example (highlighted in orange) NEVER EVER appeared in our training dataset and hence their probability was zero! Clearly, they have caused all the destruction!</b>
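
To make the zero-probability problem concrete, here is a minimal sketch. It assumes that method number 1 is the plain relative frequency, count of the word in class c divided by the total number of words in class c; the counts and totals are copied from the two BoW tables, and only the test words that actually occur in training are listed.

```python
from math import prod

test_tokens = ["very", "good", "food", "and", "service"]

# Counts of the test words in each BoW (words missing from a dict have count 0)
counts = {
    "Positive": {"very": 1, "good": 2, "food": 1},
    "Negative": {"very": 1, "food": 3},
}
total_words = {"Positive": 26, "Negative": 21}   # sum of all counts in each BoW

# Assumed method 1: p(word | c) = count(word, c) / total words in c, with no smoothing
unsmoothed = {
    c: prod(counts[c].get(w, 0) / total_words[c] for w in test_tokens)
    for c in counts
}
print(unsmoothed)
```

A single unseen word contributes a factor of zero, so the whole product collapses to zero for both classes no matter how strongly the other words point to one of them.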

So does this imply that any word that appears in the test example but never occurred in the training dataset will always cause such destruction? That in such a case our trained model will never be able to predict the correct sentiment, and will just randomly pick the positive or negative category since both have the same zero probability? The answer is NO! This is where the second method (numbered 2) comes into play, and in fact this is the formula that is actually used to deduce <b>p</b> ( <b>i</b> belonging to class <b>c</b> ). But before we move on to method number 2, we should first get familiar with its mathematical brainy stuff!

<img src="./img/7.png" width="60%">

So now, <b>after adding pseudocounts of 1, the probability p of a test word that never appeared in the training dataset will never be zero</b>, and therefore the numerical value of the term <b>product</b> ( <b>p</b> of a test word <b>“ j ”</b> in class <b>c</b> ) will never end up as zero, which in turn implies that <b>p</b> ( <b>i</b> belonging to class <b>c</b> ) will never be zero either! So all is well and no more destruction by zero probabilities!

<b>So the numerator term of method number 2 has an added 1, as we have added a count of one for every word in the vocabulary, and so it becomes:</b>

<img src="./img/8.png" width="60%">

Similarly the denominator becomes :

<img src="./img/9.png" width="60%">

And so the complete formula :

<img src="./img/10.png" width="60%">

<img src="./img/11.png" width="60%">
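
For readers who prefer text to images, the smoothed word probability can also be written out. The form below follows the test-time formula quoted in the comments of the code later in this post; the symbols count(w_j, c), count(c) and |V| are the count of word w_j in class c, the total number of words in class c, and the vocabulary size of the whole training set. Note that the textbook add-one (Laplace) estimate uses count(c) + |V| in the denominator, without the extra 1.

```latex
p(w_j \mid c) = \frac{\operatorname{count}(w_j, c) + 1}{\operatorname{count}(c) + |V| + 1},
\qquad
p(i \text{ belonging to class } c) = p(c) \prod_{j} p(w_j \mid c)
```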

Probabilities of Positive & Negative Class:

<img src="./img/12.png" width="60%">


## Now Finding Probabilities Using Method Number 2

Handling of zero probabilities: the added pseudocounts act like failsafe probabilities!

<img src="./img/13.png" width="60%">

<img src="./img/14.png" width="60%">

<img src="./img/15.png" width="60%">

Since the probability of the test example “Very good food and service!!!” is higher for the positive class (9.33E-09) than for the negative class (7.74E-09), we predict it as a Positive sentiment! And that is how we predict a label for a test/unseen example.
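
The two scores can be re-derived with a short sketch. It assumes the smoothed formula (count + 1) / (total words in class + |V| + 1) from the code comments later in this post, with |V| = 32 unique words across both BoW tables (20 + 17, minus 5 shared words); all variable names are illustrative.

```python
from math import prod

test_tokens = ["very", "good", "food", "and", "service"]

# Counts of the test words in each BoW (words missing from a dict have count 0)
counts = {
    "Positive": {"very": 1, "good": 2, "food": 1},
    "Negative": {"very": 1, "food": 3},
}
total_words = {"Positive": 26, "Negative": 21}
priors = {"Positive": 5 / 9, "Negative": 4 / 9}
vocab_size = 32   # unique words over the whole training set

scores = {}
for c in counts:
    denom = total_words[c] + vocab_size + 1                      # assumed denominator
    likelihood = prod((counts[c].get(w, 0) + 1) / denom for w in test_tokens)
    scores[c] = likelihood * priors[c]                           # likelihood times prior

prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions the scores come out at roughly 9.3e-09 for the positive class and 7.7e-09 for the negative class, in line with the figures above up to rounding of intermediate values, and the predicted label is Positive.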

<b><i>A Quick Side Note :</i></b> Like every other machine learning algorithm, Naive Bayes too needs a validation set to assess the trained model’s effectiveness. But since this post focuses on the algorithmic insights, I deliberately skipped it and jumped directly to the testing part.

## Digging Deeper into the Mathematics of Probability

Now that you have built a basic understanding of the probabilistic calculations needed to train the Naive Bayes model and use it to predict the probability for a given test sentence, I will dig deeper into the probabilistic details.
While calculating the probability of the given test sentence in the above section, we did nothing but apply the following probabilistic formula for our prediction at test time:

<img src="./img/16.png" width="60%">

### Decoding the above mathematical equation :

“|” = refers to a state which has already been given / or some filtering criteria

“c” = class/category

“x” = test example/test sentence

<b>p(c|x) = given test example x</b>, what is its <b>probability of belonging to class c</b>. This is also known as the posterior probability. It is the <b>conditional probability that is to be found for the given test example x for each of the given training classes</b>.

<b>p(x|c) = given class c</b>, what is the <b>probability of observing example x</b>. This is also known as the likelihood, as it expresses how likely example <b>x</b> is under class <b>c</b>. This is a <b>conditional probability</b> too, as we are finding the probability of <b>x</b> among the instances of class <b>c</b> only, i.e., we have <b>restricted/conditioned our search space to class c while finding the probability of x. We calculate this probability using the word counts determined during the training phase.</b>

Here “ j ” represents a class and k represents a feature

<img src="./img/17.png" width="60%">

We implicitly used this formula twice above in the calculations sections as we had two classes. Remember finding the numerical value of <b>product</b> ( <b>p</b> of a test word <b>“ j ”</b> in class <b>c</b> ) ?

<img src="./img/18.png" width="60%">

<b>p(c)</b> = the <b>probability of class c</b>. This is also known as the prior probability or unconditional probability. We calculated it earlier in the probability calculations section (in Step # 1, which was finding the value of the term <b>p</b> of class <b>c</b>).

<b>p(x)</b> = This is also known as the <b>normalizing constant, which ensures that the probabilities p(c|x) sum to 1 across all classes</b>. If you remove it, the values you get are unnormalized scores rather than true posterior probabilities. Intuitively, it is the probability of example <b>x</b> irrespective of its class label, i.e., whether positive or negative.
This is also reflected in the total probability theorem, which is used to <b>calculate p(x)</b>: to find <b>p(x)</b>, we compute <b>p(x|c) p(c)</b> for every class (because <b>p(x)</b> is an unconditional probability) and simply add the results:

Total Probability Theorem
<img src="./img/19.png" width="60%">

This implies that if we have two classes then we would have two terms, so in our particular case of positive and negative sentiments:

Total Probability Theorem for Two Classes
<img src="./img/20.png" width="60%">

<b>Did we use it in the above calculations? No, we did not.</b> Why? Because we are comparing the scores of the positive and negative classes, and the denominator is the same for both, so omitting it does not affect the prediction of our trained model. It simply cancels out for both classes. We could include it, but there is no need to when all we want is the most probable class. <b>Keep in mind that, having dropped the normalizing constant, the values we compare are unnormalized scores: they no longer sum to 1 across classes.</b>
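
If actual posterior probabilities are wanted, the unnormalized scores can be divided by their sum, which is exactly what the total probability theorem prescribes for two classes. The sketch below is only an illustration; it starts from the rounded scores reported earlier, and the names `scores`, `evidence` and `posteriors` are illustrative.

```python
# Unnormalized scores p(x|c) * p(c), as reported earlier in this post (rounded)
scores = {"Positive": 9.33e-09, "Negative": 7.74e-09}

# Total probability theorem: p(x) is the sum of p(x|c) * p(c) over all classes
evidence = sum(scores.values())

# Posterior p(c|x) = p(x|c) * p(c) / p(x)
posteriors = {c: s / evidence for c, s in scores.items()}

print(posteriors)
print(max(posteriors, key=posteriors.get))
```

The normalized values sum to 1, and the winning class is the same as before, which is why the division can be skipped when only the predicted label matters.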

</br>

## Avoiding the Common Pitfall of The Underflow Error!

* If you noticed, the numerical values of the word probabilities (i.e., p of a test word “ j ” in class c) were quite small. Multiplying all these tiny probabilities to find product ( p of a test word “ j ” in class c ) yields an even smaller value, which for long documents often results in floating-point underflow: the product becomes exactly zero for every class, and the trained model can no longer predict the category/sentiment of that test sentence. To avoid this underflow error, we take the help of the mathematical log as follows:

<img src="./img/21.png" width="60%">

* So now, instead of multiplying the tiny individual word probabilities, we simply add their logs. And why log? Why not any other function? Because log increases monotonically, which means that it does not affect the order of the probabilities. Probabilities that were smaller will still be smaller after the log has been applied to them, and vice versa. Let’s say the test word “is” has a smaller probability than the test word “happy”: after passing both through log, the values become negative, but “is” still has the smaller value. Therefore, without affecting the predictions of our trained model, we can effectively avoid the common pitfall of the underflow error.
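
The underflow problem and the log fix can be demonstrated with a small sketch. The 100-word document with a probability of 1e-5 per word is a made-up example chosen only to force underflow in 64-bit floats; the two scores at the end are the ones reported earlier in this post.

```python
import math

# Hypothetical: a 100-word test document where every word has probability 1e-5
word_probs = [1e-5] * 100

naive_product = math.prod(word_probs)               # true value 1e-500 is below the float64 range
log_score = sum(math.log(p) for p in word_probs)    # a finite negative number

print(naive_product)   # the product has underflowed
print(log_score)       # the sum of logs is still usable for comparison

# log is monotonic, so comparing logs picks the same winner as comparing raw scores
pos, neg = 9.33e-09, 7.74e-09
same_winner = (pos > neg) == (math.log(pos) > math.log(neg))
print(same_winner)
```

Because only the ordering of the class scores matters for the prediction, working with sums of logs loses nothing while keeping every intermediate value well within floating-point range.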

</br>

# Concluding Notes

* We live in an age of APIs and rarely code from scratch, but understanding the algorithmic theory in depth is vital to developing a sound understanding of how machine learning algorithms actually work. It is this understanding that differentiates a true data scientist from a naive one, and it is what actually matters when training a really good model. So before moving to APIs, I personally believe that a true data scientist should code from scratch to see behind the numbers and understand why a particular algorithm is better than another.


* One of the best characteristics of the Naive Bayes model is that you can improve its accuracy by simply updating it with new vocabulary words instead of always retraining it. You just need to add the words to the vocabulary and update the word counts accordingly. That’s it!
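
The incremental update mentioned in the last point can be sketched as follows. This is a hypothetical illustration, not part of the class implemented below: the helper `update`, the tiny starting counts and the new review are all made up for the example.

```python
from collections import Counter

# A tiny, made-up model state: one bag of words per class plus per-class example counts
bow = {"Positive": Counter({"good": 2, "food": 1}), "Negative": Counter({"food": 3})}
class_counts = Counter({"Positive": 5, "Negative": 4})

def update(bow, class_counts, sentence, label):
    """Fold one new preprocessed, labelled example into the existing counts (hypothetical helper)."""
    bow[label].update(sentence.split())   # new words are added, known words are incremented
    class_counts[label] += 1              # the priors change too, so track examples per class

update(bow, class_counts, "great service and good food", "Positive")

# Derived quantities must be refreshed after an update
vocab = set(bow["Positive"]) | set(bow["Negative"])
print(bow["Positive"], class_counts, len(vocab))
```

Only the counts are touched; the priors, the vocabulary size and the per-class denominators are cheap to recompute from them afterwards.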

In [17]:
# !pip install numpy pandas scikit-learn

In [8]:
import pandas as pd 
import numpy as np 
from collections import defaultdict
import re

In [9]:
def preprocess_string(str_arg):
    
    """
        Parameters:
        ----------
        str_arg: example string to be preprocessed
        
        What the function does?
        -----------------------
        Preprocess the string argument - str_arg - such that :
        1. everything apart from letters is excluded
        2. multiple spaces are replaced by single space
        3. str_arg is converted to lower case 
        
        Example:
        --------
        Input :  Menu is absolutely perfect,loved it!
        Output:  'menu is absolutely perfect loved it ' (punctuation becomes a space, so a trailing space can remain)
        
        Returns:
        ---------
        Preprocessed string (tokenization happens later via str.split)
        
    """
    
    cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE) #every char except letters and whitespace is replaced by a space
    cleaned_str=re.sub(r'(\s+)',' ',cleaned_str) #multiple spaces are replaced by single space
    cleaned_str=cleaned_str.lower() #converting the cleaned string to lower case
    
    return cleaned_str # returning the preprocessed string 

In [10]:
class NaiveBayes:
    
    def __init__(self,unique_classes):
        
        self.classes=unique_classes # Constructor is simply passed the unique classes of the training set
        

    def addToBow(self,example,dict_index):
        
        '''
            Parameters:
            1. example 
            2. dict_index - implies to which BoW category this example belongs to
            What the function does?
            -----------------------
            It simply splits the example on the basis of space as a tokenizer and adds every tokenized word to
            its corresponding dictionary/BoW
            Returns:
            ---------
            Nothing
        
       '''
        
        if isinstance(example,np.ndarray): example=example[0]
     
        for token_word in example.split(): #for every word in preprocessed example
          
            self.bow_dicts[dict_index][token_word]+=1 #increment in its count
            
    def train(self,dataset,labels):
        
        '''
            Parameters:
            1. dataset - shape = (m X d)
            2. labels - shape = (m,)
            What the function does?
            -----------------------
            This is the training function which will train the Naive Bayes Model i.e compute a BoW for each
            category/class. 
            Returns:
            ---------
            Nothing
        
        '''
    
        self.examples=dataset
        self.labels=labels
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        
        #only convert to numpy arrays if initially not passed as numpy arrays - else its a useless recomputation
        
        if not isinstance(self.examples,np.ndarray): self.examples=np.array(self.examples)
        if not isinstance(self.labels,np.ndarray): self.labels=np.array(self.labels)
            
        #constructing BoW for each category
        for cat_index,cat in enumerate(self.classes):
          
            all_cat_examples=self.examples[self.labels==cat] #filter all examples of category == cat
            
            #get examples preprocessed
            
            cleaned_examples=[preprocess_string(cat_example) for cat_example in all_cat_examples]
            
            cleaned_examples=pd.DataFrame(data=cleaned_examples)
            
            #now construct BoW of this particular category
            np.apply_along_axis(self.addToBow,1,cleaned_examples,cat_index)
            
                
        ###################################################################################################
        
        '''
            Although we are done with the training of Naive Bayes Model BUT!!!!!!
            ------------------------------------------------------------------------------------
            Remember The Test Time Formula ? : {for each word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ] } * p(c)
            ------------------------------------------------------------------------------------
            
            We are done with constructing of BoW for each category. But we need to precompute a few 
            other calculations at training time too:
            1. prior probability of each class - p(c)
            2. vocabulary |V| 
            3. denominator value of each class - [ count(c) + |V| + 1 ] 
            
            Reason for doing this precomputing calculations stuff ???
            ---------------------
            We can do all these 3 calculations at test time too BUT doing so means to re-compute these 
            again and again every time the test function will be called - this would significantly
            increase the computation time especially when we have a lot of test examples to classify!!!).  
            And moreover, it does not make sense to repeatedly compute the same thing - 
            why do extra computations ???
            So we will precompute all of them & use them during test time to speed up predictions.
            
        '''
        
        ###################################################################################################
      
        prob_classes=np.empty(self.classes.shape[0])
        all_words=[]
        cat_word_counts=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
           
            #Calculating prior probability p(c) for each class
            prob_classes[cat_index]=np.sum(self.labels==cat)/float(self.labels.shape[0]) 
            
            #Calculating total counts of all the words of each class 
            count=list(self.bow_dicts[cat_index].values())
            #NOTE: this +1 is added once more in denoms below, so the effective denominator is count(c) + |V| + 2
            cat_word_counts[cat_index]=np.sum(np.array(count))+1 # |v| is remaining to be added
            
            #get all words of this category                                
            all_words+=self.bow_dicts[cat_index].keys()
                                                     
        
        #combine all words of every category & make them unique to get vocabulary -V- of entire training set
        
        self.vocab=np.unique(np.array(all_words))
        self.vocab_length=self.vocab.shape[0]
                                  
        #computing denominator value                                      
        denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])                                                                          
      
        '''
            Now that we have everything precomputed as well, its better to organize everything in a tuple 
            rather than to have a separate list for every thing.
            
            Every element of self.cats_info has a tuple of values
            Each tuple has a dict at index 0, prior probability at index 1, denominator value at index 2
        '''
        
        self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,cat in enumerate(self.classes)]                               
        self.cats_info=np.array(self.cats_info)                                 
                                              
                                              
    def getExampleProb(self,test_example):                                
        
        '''
            Parameters:
            -----------
            1. a single test example 
            What the function does?
            -----------------------
            Function that estimates posterior probability of the given test example
            Returns:
            ---------
            probability of test example in ALL CLASSES
        '''                                      
                                              
        likelihood_prob=np.zeros(self.classes.shape[0]) #to store probability w.r.t each class
        
        #finding probability w.r.t each class of the given test example
        for cat_index,cat in enumerate(self.classes): 
                             
            for test_token in test_example.split(): #split the test example and get p of each test word
                
                ####################################################################################
                                              
                #This loop computes : for each word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ]                               
                                              
                ####################################################################################                              
                
                #get total count of this test token from it's respective training dict to get numerator value                           
                test_token_counts=self.cats_info[cat_index][0].get(test_token,0)+1
                
                #now get likelihood of this test_token word                              
                test_token_prob=test_token_counts/float(self.cats_info[cat_index][2])                              
                
                #remember why taking log? To prevent underflow!
                likelihood_prob[cat_index]+=np.log(test_token_prob)
                                              
        # we have likelihood estimate of the given example against every class but we need posterior probability
        post_prob=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
            post_prob[cat_index]=likelihood_prob[cat_index]+np.log(self.cats_info[cat_index][1])                                  
      
        return post_prob
    
   
    def test(self,test_set):
      
        '''
            Parameters:
            -----------
            1. A complete test set of shape (m,)
            
            What the function does?
            -----------------------
            Determines probability of each test example against all classes and predicts the label
            against which the class probability is maximum
            Returns:
            ---------
            Predictions of test examples - A single prediction against every test example
        '''       
       
        predictions=[] #to store prediction of each test example
        for example in test_set: 
                                              
            #preprocess the test example the same way we did for training set examples
            cleaned_example=preprocess_string(example) 
             
            #simply get the posterior probability of every example                                  
            post_prob=self.getExampleProb(cleaned_example) #get prob of this example for both classes
            
            #simply pick the max value and map against self.classes!
            predictions.append(self.classes[np.argmax(post_prob)])
                
        return np.array(predictions) 

In [11]:
from sklearn.datasets import fetch_20newsgroups
""" 
fetch_20newsgroups is a dataset that has 20 categories but we will restrict
the categories to 4 for the time being 
"""
categories=['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med'] 
newsgroups_train=fetch_20newsgroups(subset='train',categories=categories)

"""
    Some training data is being loaded where training examples are saved
    in train_data and train labels are saved in train_labels
"""

train_data=newsgroups_train.data #getting all training examples
train_labels=newsgroups_train.target #getting training labels
#print ("Total Number of Training Examples: ",len(train_data)) # Outputs -> Total Number of Training Examples:  2257
#print ("Total Number of Training Labels: ",len(train_labels)) # Outputs -> #Total Number of Training Labels:  2257


In [12]:
nb=NaiveBayes(np.unique(train_labels)) #instantiate a NB class object
print ("---------------- Training In Progress --------------------")
 
nb.train(train_data,train_labels) #start training by calling the train function
print ('----------------- Training Completed ---------------------')

---------------- Training In Progress --------------------
----------------- Training Completed ---------------------


In [13]:
"""
    Some test data is being loaded where test examples are saved in test_data
    and test labels are saved in test_labels
"""
newsgroups_test=fetch_20newsgroups(subset='test',categories=categories) #loading test data
test_data=newsgroups_test.data #get test set examples
test_labels=newsgroups_test.target #get test set labels

print ("Number of Test Examples: ",len(test_data)) # Output : Number of Test Examples:  1502
print ("Number of Test Labels: ",len(test_labels)) # Output : Number of Test Labels:  1502

Number of Test Examples:  1502
Number of Test Labels:  1502


In [14]:
pclasses=nb.test(test_data) #get predictions for test set

#check how many predictions actually match original test labels
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0]) 

print ("Test Set Examples: ",test_labels.shape[0]) # Outputs : Test Set Examples:  1502
print ("Test Set Accuracy: ",test_acc*100,"%") # Outputs : Test Set Accuracy:  93.8748335553 %


Test Set Examples:  1502
Test Set Accuracy:  93.87483355525966 %
