# Text Classification using Naive Bayes

**Outline**

* [Introduction](#intro)
* [Simple Example](#example)
* [Bayes Theorem](#bayes)
* [Naive Bayes classifier](#nb)
* [Vector Space Model](#vc)
    * [Bernoulli Naive Bayes](#bernoulli)
    * [Multinomial Naive Bayes](#multinomial)
    * [Laplace Smoothing](#laplace)
    * [Bernoulli vs Multinomial Naive Bayes](#bernoullevsmultinomial)  
* [Implementation using Sklearn](#implement)
* [Pros and Cons of Naive Bayes](#procon)
* [Areas for Improvement](#improve)
* [Reference](#refer)


---

In [1]:
%load_ext watermark

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

%watermark -a 'Johnny' -d -t -v -p pandas,sklearn

Johnny 2018-04-08 13:32:42 

CPython 3.6.3
IPython 6.1.0

pandas 0.20.3
sklearn 0.19.1


## <a id='intro'>Introduction</a>

As a newcomer to text analytics, I have often heard that Naive Bayes is one of the simplest ways to build a text classification model. The goal of this notebook is to learn, and to summarize, how to apply a simple Naive Bayes model to a binary text classification problem.

## <a id='example'>Simple Example</a>

Let's use the news topic classification example from the blog post [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/#advanced-techniques) to see how Naive Bayes can help us solve the problem:

In [2]:
df = pd.DataFrame({'text': ['A great game',
                            'The election was over',
                            'Very clean match',
                            'A clean but forgettable game',
                            'It was a close election'],
                   'category': ['Sports', 'Not sports', 'Sports', 'Sports', 'Not sports']})                 
df

| | category | text |
|---|---|---|
| 0 | Sports | A great game |
| 1 | Not sports | The election was over |
| 2 | Sports | Very clean match |
| 3 | Sports | A clean but forgettable game |
| 4 | Not sports | It was a close election |


We want to know whether a news article is sports related based on its text, i.e. the article content. Suppose we have a new article with the content `a very close game`. How do we know whether this article is sports related or not?

What we really want to know is $P(\text{Sports | a very close game})$, the probability that the article is sports related given its content. If this probability is over some threshold, say 0.5, then we say that the new article is sports related. However, how do we get this probability?

## <a id='bayes'>Bayes Theorem</a>

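From Bayes' theorem, we know that

$P(\text{Sports | a very close game}) = \frac{P(\text{a very close game | Sports}) \times P(\text{Sports})}{P(\text{a very close game})} \propto P(\text{a very close game | Sports}) \times P(\text{Sports})$

The denominator $P(\text{a very close game})$ is the same for both classes, so it can be dropped when we only compare Sports against Not sports.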
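The prior $P(\text{Sports})$ in the formula above can be estimated as the share of training documents that carry each label. The following is a minimal sketch of that estimate, assuming the five labels from the example; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

# Labels of the five training documents from the example above.
labels = ['Sports', 'Not sports', 'Sports', 'Sports', 'Not sports']

# Prior P(class) = (number of documents in the class) / (number of documents).
label_counts = Counter(labels)
priors = {label: Fraction(n, len(labels)) for label, n in label_counts.items()}

print(priors['Sports'])      # 3/5
print(priors['Not sports'])  # 2/5
```

Exact fractions are used here only to keep the arithmetic readable; floats would work equally well.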

This raises another question: the training data will rarely contain another article with exactly the same content, i.e., `a very close game`, so how do we calculate $P(\text{a very close game | Sports})$? This is where the **Naive** part of Naive Bayes comes in.

## <a id='nb'>Naive Bayes classifier</a>

According to the chain rule of conditional probability, we know that 

$P(\text{a very close game | Sports}) = P(\text{a | Sports}) * P(\text{very | Sports, a}) * P(\text{close | Sports, a, very}) * P(\text{game | Sports, a, very, close})$ 

The Naive Bayes classifier assumes that, once we know whether the article is sports related or not, knowing whether the word "a" appears in the article does not affect the probability of "very" appearing in the article. Under this conditional independence assumption, the above equation becomes:

$P(\text{a very close game | Sports}) = P(\text{a | Sports}) * P(\text{very | Sports}) * P(\text{close | Sports}) * P(\text{game | Sports})$ 

In other words, this means that we’re no longer looking at entire sentences, but rather at individual words. So for our purposes, “a very close game” is the same as “very game a close” and “game a very close”.

Originally, $P(\text{very | Sports, a})$ is the probability that the word "very" shows up in the article, given that the article is sports related and contains the word "a". With the conditional independence assumption, this probability is the same as the probability that "very" shows up given only that the article is sports related. In other words, we now look only at the probability of each single word rather than at the whole sentence.
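The consequence that word order no longer matters can be checked directly. This is a small sketch, assuming a plain whitespace tokenizer (the helper `bag_of_words` is a hypothetical name, not part of any library):

```python
from collections import Counter

def bag_of_words(sentence):
    """Lower-case the sentence, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())

original = bag_of_words('a very close game')
shuffled = bag_of_words('game a very close')

# Both sentences reduce to the same word counts, so a bag-of-words
# model cannot tell them apart.
print(original == shuffled)  # True
```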

With the above assumption, we can then start to see whether the sentence "a very close game" is sports related or not.

## <a id='vc'>Vector Space Model to represent document</a>

Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors.

Documents are represented as vectors.

$d_{j}=(w_{1,j},w_{2,j},\dots ,w_{t,j})$

where
* j: the index of the document; in our case, we have 5 documents in our training data
* t: the dimensionality of the vector, i.e. the number of terms.

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting.

The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

In our example, the dimensionality of our vectors is 14, since there are 14 distinct words in our corpus, as shown below.

In [3]:
corpus = ['a','great','game','the','election','was','over', 'very', 'clean', 'match', 'but', 'forgettable', 'it', 'close']
corpus

['a',
 'great',
 'game',
 'the',
 'election',
 'was',
 'over',
 'very',
 'clean',
 'match',
 'but',
 'forgettable',
 'it',
 'close']

For Naive Bayes, the two most commonly used ways to calculate term weights are the Bernoulli and Multinomial document models, both of which represent documents as a bag of words, using the Naive Bayes assumption. Both models represent documents using feature vectors whose components correspond to word types. 

* **Bernoulli document model**: a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if the word is not present.
* **Multinomial document model**: a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.
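The two document models can be sketched as two small functions over the 14-word vocabulary. This is an illustration under the assumption of whitespace tokenization; `count_vector` and `binary_vector` are hypothetical helper names, and the document with a repeated word is made up to show where the two models differ.

```python
from collections import Counter

# The 14 distinct words of the training data, in the order used in this notebook.
vocabulary = ['a', 'great', 'game', 'the', 'election', 'was', 'over',
              'very', 'clean', 'match', 'but', 'forgettable', 'it', 'close']

def count_vector(text, vocabulary):
    """Multinomial document model: how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def binary_vector(text, vocabulary):
    """Bernoulli document model: 1 if the word occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(text, vocabulary)]

doc = 'a great great game'  # made-up document with a repeated word
print(count_vector(doc, vocabulary))   # [1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(binary_vector(doc, vocabulary))  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

For documents in which no word repeats, such as the query `a very close game`, the two vectors are identical.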

### <a id='bernoulli'>Bernoulli Naive Bayes</a>

Remember that the goal is to know $P(\text{a very close game | Sports})$. More generally, we want to know

$p(x_1, x_2, \dots, x_{14} | y)= p(\text{1,0,1,0,0,0,0,1,0,0,0,0,0,1 | y}) = \prod_{i=1}^{14}p(x_i | y)$

since the query `A very close game` corresponds to the following vector (using the word order of `corpus` above):

In [4]:
q1=[1,0,1,0,0,0,0,1,0,0,0,0,0,1]
q1

[1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

For Bernoulli Naive Bayes, the term probability is based on 

$P(x_i \text{ | y}) = P(\text{i | y}) x_i + (1-P(\text{i | y}))(1-x_i)$

where
* i: indicate a particular term in the corpus
* $x_i$: whether this term is in the query or not. If the term appears in the query, it is 1; otherwise it is 0.


For i=1 (the term `a`, which appears in the query, so $x_1 = 1$):
$P(x_1 \text{ | y}) = P(\text{a | y}) \times 1 + (1-P(\text{a | y}))(1-1) = P(\text{a | y}) = P(\text{a | Sports})$

As for the word likelihood $P(\text{i | y})$, we can learn (estimate) these parameters from a training set of documents labelled with class D=y.

$p(i∣D=y)=\frac{n_y(i)}{N_y} = \frac{\text{number of docs of the class containing the word}}{\text{number of docs of the class}}$

Where:
* $n_y(i)$ is the number of documents of class D=y in which i is observed.
* $N_y$ is the number of documents that belong to class y.

Hence, we know that $P(\text{a | Sports})$ equals 2/3, since the term `a` appears in 2 out of the 3 documents labeled Sports.

Therefore, the likelihood of the new query under the Sports class is given by the following equation, where a word that is absent from the query (such as `great`) contributes $1 - p(\text{word | Sports})$

$p(x_1, x_2, \dots, x_{14} | \text{Sports})= \prod_{i=1}^{14}p(x_i | \text{Sports}) = p(\text{a | Sports}) \times (1-p(\text{great | Sports})) \times p(\text{game | Sports}) \dots \times p(\text{close | Sports}) = \frac{2}{3} \times (1-\frac{1}{3}) \times \frac{2}{3} \dots \times \frac{0}{3} $
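The Bernoulli estimate and the product over all 14 words can be sketched as follows. This is an unsmoothed illustration under the assumption of whitespace tokenization; the function names are hypothetical.

```python
from fractions import Fraction

sports_docs = ['A great game', 'Very clean match', 'A clean but forgettable game']
vocabulary = ['a', 'great', 'game', 'the', 'election', 'was', 'over',
              'very', 'clean', 'match', 'but', 'forgettable', 'it', 'close']

def bernoulli_word_prob(word, docs):
    """Share of the class's documents that contain the word at least once."""
    n_docs_with_word = sum(1 for d in docs if word in d.lower().split())
    return Fraction(n_docs_with_word, len(docs))

def bernoulli_likelihood(query, docs, vocabulary):
    """Product over the whole vocabulary: p if the word is in the query, 1 - p otherwise."""
    present = set(query.lower().split())
    likelihood = Fraction(1)
    for word in vocabulary:
        p = bernoulli_word_prob(word, docs)
        likelihood *= p if word in present else 1 - p
    return likelihood

p_a = bernoulli_word_prob('a', sports_docs)
likelihood = bernoulli_likelihood('a very close game', sports_docs, vocabulary)
print(p_a)         # 2/3
print(likelihood)  # 0
```

The likelihood collapses to 0 because `close` never occurs in a Sports document, which is the problem that smoothing addresses.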

### <a id='multinomial'>Multinomial Naive Bayes</a>

The multinomial distribution can be used to compute probabilities in situations with more than two possible outcomes. For example, suppose we run 10 trials, each with 3 possible outcomes (say tie, head, or tail). The multinomial distribution gives the probability of observing 3 ties, 3 heads, and 4 tails.
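As a worked illustration of the 10-trial example, the sketch below evaluates the multinomial probability mass function $\frac{n!}{k_1! \, k_2! \, k_3!} p_1^{k_1} p_2^{k_2} p_3^{k_3}$. The text does not give the outcome probabilities, so equal probabilities of 1/3 are an assumption made here purely for illustration.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing exactly `counts` over sum(counts) independent trials."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # exact: each partial quotient is an integer
    prob = float(coefficient)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

# 3 ties, 3 heads, 4 tails in 10 trials, assuming each outcome has probability 1/3.
p = multinomial_pmf([3, 3, 4], [1 / 3, 1 / 3, 1 / 3])
print(round(p, 4))  # 0.0711
```

The coefficient here is 10! / (3! 3! 4!) = 4200, the number of orderings that produce those counts.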

For Multinomial Naive Bayes, instead of using whether a term appears in a document as the weight, we use the number of times the term appears in the document as the weight in the vector. More specifically, the term probability is based on

$P(x_i \text{ | y}) = \frac{N_{yi}}{N_y} = \frac{\text{number of occurrences of the word in the class}}{\text{total number of words in the class}}$

where
* $N_{yi}$ is the number of times feature i appears in the documents of class y in the training set
* $N_y$ is the total count of all features for class y

Therefore, the likelihood of the new query under the Sports class uses only the words that occur in the query, each raised to the power of its count $x_i$ (words with $x_i = 0$ contribute a factor of 1). The three Sports documents contain 11 words in total, so

$p(x_1, x_2, \dots, x_{14} | \text{Sports}) \propto \prod_{i=1}^{14}p(i | \text{Sports})^{x_i} = p(\text{a | Sports}) \times p(\text{very | Sports}) \times p(\text{close | Sports}) \times p(\text{game | Sports}) = \frac{2}{11} \times \frac{1}{11} \times \frac{0}{11} \times \frac{2}{11} $
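The multinomial word probabilities can be estimated by pooling all words of a class and counting. The following is an unsmoothed sketch, assuming whitespace tokenization; `multinomial_word_probs` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

sports_docs = ['A great game', 'Very clean match', 'A clean but forgettable game']

def multinomial_word_probs(docs):
    """Return P(word | class) = word count / total word count, plus that total."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}, total

probs, total = multinomial_word_probs(sports_docs)
print(total)       # 11
print(probs['a'])  # 2/11

# Only the words that occur in the query enter the product.
likelihood = Fraction(1)
for word in 'a very close game'.split():
    likelihood *= probs.get(word, Fraction(0))
print(likelihood)  # 0
```

As with the Bernoulli model, the unseen word `close` forces the whole product to 0.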

### <a id='laplace'>Laplace Smoothing</a>

As we can see from above, some of the terms will be 0 with either the Bernoulli or the Multinomial method. For example, the term "close" doesn't appear in any Sports article, which means that P(close | Sports) = 0. This is rather inconvenient since we multiply it with the other probabilities, so we’ll end up with $P(\text{a | Sports}) \times P(\text{very | Sports}) \times 0 \times P(\text{game | Sports})$. This equals 0: if one factor of a product is zero, the whole product is zero. Doing things this way simply doesn’t give us any information at all, so we have to find a way around it.

A commonly used method is called [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing). With add-one smoothing, which is a special case of additive smoothing, we add 1 to every count so that it is never zero. Therefore, the term probability becomes

**Bernoulli**

$p(i∣y)=\frac{n_y(i)+1}{N_y+d}$
where 
* **d**: the number of possible outcomes for each term, which is 2 (present or absent)

**Multinomial**

$P(x_i \text{ | y}) = \frac{N_{yi}+1}{N_y+d}$

* **d**: the dimensionality of our corpus, in our case, it equals 14

Therefore, using Laplace smoothing, the likelihood of the query `A very close game` under the Sports class equals

**Bernoulli** $p(x_1, x_2, \dots, x_{14} | \text{Sports})= \prod_{i=1}^{14}p(x_i | \text{Sports}) = p(\text{a | Sports}) \times (1-p(\text{great | Sports})) \times p(\text{game | Sports}) \dots \times p(\text{close | Sports}) = \frac{3}{5} \times (1-\frac{2}{5}) \times \frac{3}{5} \dots \times \frac{1}{5} $

**Multinomial** $p(x_1, x_2, \dots, x_{14} | \text{Sports}) \propto \prod_{i=1}^{14}p(i | \text{Sports})^{x_i} = p(\text{a | Sports}) \times p(\text{very | Sports}) \times p(\text{close | Sports}) \times p(\text{game | Sports}) = \frac{3}{25} \times \frac{2}{25} \times \frac{1}{25} \times \frac{3}{25} $

Note that, to decide whether the new query belongs to the class Sports or Not Sports with the Multinomial model, we don't need the term probabilities of terms that don't show up in the query: their count $x_i$ is 0, so they contribute a factor of 1 for both classes and can be skipped. With the Bernoulli model, absent terms contribute $1 - p(i | y)$, which generally differs between the two classes, so they cannot be skipped.
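Putting the pieces together, the sketch below scores the query against both classes with add-one smoothing. It is an illustration rather than a reference implementation: it assumes whitespace tokenization, a denominator of $N_y + 2$ for the Bernoulli model (two outcomes per word) and $N_y + 14$ for the Multinomial model, and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

train = [('A great game', 'Sports'),
         ('The election was over', 'Not sports'),
         ('Very clean match', 'Sports'),
         ('A clean but forgettable game', 'Sports'),
         ('It was a close election', 'Not sports')]

def tokenize(text):
    return text.lower().split()

vocabulary = sorted({w for text, _ in train for w in tokenize(text)})  # 14 words

def multinomial_score(query, label):
    """Class prior times the add-one-smoothed multinomial likelihood."""
    docs = [tokenize(t) for t, y in train if y == label]
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    score = Fraction(len(docs), len(train))
    for w in tokenize(query):  # only words in the query contribute
        score *= Fraction(counts[w] + 1, total + len(vocabulary))
    return score

def bernoulli_score(query, label):
    """Class prior times the add-one-smoothed Bernoulli likelihood."""
    docs = [set(tokenize(t)) for t, y in train if y == label]
    present = set(tokenize(query))
    score = Fraction(len(docs), len(train))
    for w in vocabulary:  # every vocabulary word contributes, present or not
        p = Fraction(sum(w in d for d in docs) + 1, len(docs) + 2)
        score *= p if w in present else 1 - p
    return score

query = 'a very close game'
for score_fn in (multinomial_score, bernoulli_score):
    s, n = score_fn(query, 'Sports'), score_fn(query, 'Not sports')
    # Normalizing the two scores gives the posterior probability of Sports.
    print(score_fn.__name__, 'Sports' if s > n else 'Not sports', float(s / (s + n)))
```

Under these assumptions both variants assign the query to Sports, since the smoothed estimate for `close` is small but no longer zero.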

### <a id='bernoullevsmultinomial'>Bernoulli vs Multinomial Naive Bayes</a>

Given that there are different variants of the Naive Bayes classifier, how do we choose which one to use?

We have seen that Bernoulli models the presence or absence of a feature, while Multinomial models the number of occurrences of a feature. Therefore, the variant of Naive Bayes we use depends on the data. If our data consists of counts, the multinomial distribution may be an appropriate distribution for the likelihood, and thus multinomial Naive Bayes is appropriate.

One thing to [note](https://datascience.stackexchange.com/questions/27624/difference-between-bernoulli-and-multinomial-naive-bayes) is that whereas the binomial distribution generalises the Bernoulli distribution across the number of trials, the multinoulli distribution generalises it across the number of outcomes, that is, rolling a die instead of tossing a coin. In other words, the denominator used to calculate the term probability differs between the two methods: the number of documents in the class for Bernoulli, and the total number of words in the class for Multinomial.

If our term weights are continuous values, another common method is [Gaussian Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB), where we assume that the features follow a normal distribution.

## <a id='implement'>Implementation using Sklearn</a>

In [5]:
def multiNB_fit(df, x_colname, y_colname):
    """
    fit multinomial naive bayes model.
    
    Args:
        df (pd.DataFrame): a dataframe having the document and label
        x_colname (str): the colname for the document column
        y_colname (str): the colname for the label column       
    
    Returns:
       nb: the fitted Sklearn MultinomialNB object
       vect: the CountVectorizer fitted on the training documents
    """
    
    ### get document and label from the requested columns
    X_train = df[x_colname]
    y_train = df[y_colname]
    
    ### vectorize the training documents
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)

    ### See the result of the vectorization
    ### (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
    print('feature name: ', vect.get_feature_names())

    # convert to dense array for better visualize representation
    print('training:')
    print(X_train_dtm.toarray())

    ### Fit Multinomial NB model and predict the final probability
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
            
    return nb, vect

In [6]:
def multiNB_predict(nb_model, vect, x_test, predict_class=True):
    """
    predict the classification result using the input multinomial naive bayes model.
    
    Args:
        nb_model (sklearn.naive_bayes.MultinomialNB): Sklearn MultinomialNB object
        vect (CountVectorizer): the CountVectorizer fitted on the training data
        x_test (pd.Series): a pd.Series containing the documents for testing
        predict_class (bool): True to return the predicted class, False to return the class probabilities
    
    Returns:
        array contains predicted class or probabilities
    """
    
    ### vectorize the document for test

    X_test_dtm = vect.transform(x_test)
    
    # convert to dense array for better visualize representation
    print('\ntesting:')
    print(X_test_dtm.toarray()) 
    
    ### predict result with the model passed in (not a global variable)
    if predict_class:
        pred = nb_model.predict(X_test_dtm)
    else:
        pred = nb_model.predict_proba(X_test_dtm)
                    
    print('library implementation', pred)
    
    return pred    

In [7]:
# take a look at our training data
df.head()

| | category | text |
|---|---|---|
| 0 | Sports | A great game |
| 1 | Not sports | The election was over |
| 2 | Sports | Very clean match |
| 3 | Sports | A clean but forgettable game |
| 4 | Not sports | It was a close election |


In [8]:
# manually set our testing data
x_test = pd.DataFrame({'text': ['A very close game',
                               'A clean election was over']})
x_test

| | text |
|---|---|
| 0 | A very close game |
| 1 | A clean election was over |


In [9]:
# fit Multinomial Naive Bayes Model
nb, vect = multiNB_fit(df, x_colname='text', y_colname='category')

feature name:  ['but', 'clean', 'close', 'election', 'forgettable', 'game', 'great', 'it', 'match', 'over', 'the', 'very', 'was']
training:
[[0 0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 1 1 0 1]
 [0 1 0 0 0 0 0 0 1 0 0 1 0]
 [1 1 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 0 1 0 0 0 0 1]]


In [10]:
# Get the final classification label for testing data
multiNB_predict(nb, vect, x_test['text'], predict_class=True)


testing:
[[0 0 1 0 0 1 0 0 0 0 0 1 0]
 [0 1 0 1 0 0 0 0 0 1 0 0 1]]
library implementation ['Sports' 'Not sports']


array(['Sports', 'Not sports'],
      dtype='<U10')

In [11]:
# Get the final classification probabilities for testing data
multiNB_predict(nb, vect, x_test['text'], predict_class=False)


testing:
[[0 0 1 0 0 1 0 0 0 0 0 1 0]
 [0 1 0 1 0 0 0 0 0 1 0 0 1]]
library implementation [[ 0.2035071   0.7964929 ]
 [ 0.82812184  0.17187816]]


array([[ 0.2035071 ,  0.7964929 ],
       [ 0.82812184,  0.17187816]])
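The feature names printed above contain 13 words rather than 14 because `CountVectorizer`'s default token pattern only keeps tokens of two or more characters, so the word `a` is dropped. The sketch below tries to reproduce the first row of the probabilities by hand under that tokenization, assuming add-one smoothing and priors estimated from the class frequencies (the defaults of `MultinomialNB`). The helper names are hypothetical and the regular expression only approximates the library's tokenizer.

```python
import re
from collections import Counter
from fractions import Fraction

train = [('A great game', 'Sports'),
         ('The election was over', 'Not sports'),
         ('Very clean match', 'Sports'),
         ('A clean but forgettable game', 'Sports'),
         ('It was a close election', 'Not sports')]

def tokenize(text):
    # Lower-case and keep tokens of at least two word characters, so 'a' is dropped.
    return re.findall(r'\b\w\w+\b', text.lower())

vocabulary = sorted({w for text, _ in train for w in tokenize(text)})

def score(query, label):
    """Class prior times the add-one-smoothed multinomial likelihood."""
    docs = [tokenize(t) for t, y in train if y == label]
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    s = Fraction(len(docs), len(train))
    for w in tokenize(query):
        if w in vocabulary:  # words outside the vocabulary are ignored
            s *= Fraction(counts[w] + 1, total + len(vocabulary))
    return s

sports = score('A very close game', 'Sports')
other = score('A very close game', 'Not sports')
p_sports = float(sports / (sports + other))

print(len(vocabulary))     # 13
print(round(p_sports, 4))  # 0.7965
```

The result agrees with the second column of the first row in the output above; the columns follow the sorted class labels, `Not sports` then `Sports`.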

## <a id='procon'>Pros and Cons of Naive Bayes</a>

**Pros**
* Famously good at text classification, e.g. spam filtering, and in domains where you have many equally important features, which tends to be a problem for other kinds of classifiers, in particular tree-based algorithms.
* Little parameter tuning is required (mainly the smoothing parameter).
* Very simple, easy to implement and fast.
* If the NB conditional independence assumption holds, it will converge more quickly than discriminative models like logistic regression. Even if the NB assumption doesn’t hold, it often works well in practice.
* Needs less training data.
* Highly scalable. It scales linearly with the number of predictors and data points.
* Can be used for both binary and multi-class classification problems.
* Can make probabilistic predictions.
* Handles continuous and discrete data.
* Not sensitive to irrelevant features.

**Cons**
* Conditional independence is not always a valid assumption, thus can be outperformed by other methods.
* Predicted probabilities are not well-calibrated.

## <a id='improve'>Areas for Improvement</a>

According to the blog: [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/#advanced-techniques), there are many things that can be done to improve this basic model. These techniques allow Naive Bayes to perform at the same level as more advanced methods. Some of these techniques are:

* **Removing stopwords**. These are common words that don’t really add anything to the categorization, such as a, able, either, else, ever and so on. So for our purposes, The election was over would be election over and a very close game would be very close game.
* **Lemmatizing words**. This is grouping together different inflections of the same word. So election, elections, elected, and so on would be grouped together and counted as more appearances of the same word.
* **Using n-grams**. Instead of counting single words like we did here, we could count sequences of words, like “clean match” and “close election”.
* **Using TF-IDF**. Instead of just counting frequency we could do something more advanced like also penalizing words that appear frequently in most of the samples.
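Three of these ideas can be sketched in plain Python on the toy data. In scikit-learn, `CountVectorizer` exposes `stop_words` and `ngram_range` parameters and `TfidfVectorizer` provides a TF-IDF weighting, but the code below deliberately avoids the library to show the mechanics. The stopword list is a tiny made-up one, the TF-IDF formula is the plain `count * log(N / df)` variant (an assumption; libraries often use smoothed variants), and lemmatization is left out because it needs a linguistic resource.

```python
import math
from collections import Counter

docs = ['A great game', 'The election was over', 'Very clean match',
        'A clean but forgettable game', 'It was a close election']

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {'a', 'the', 'was', 'it', 'but', 'very'}

def tokenize(text, remove_stopwords=False):
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def bigrams(tokens):
    """Sequences of two neighbouring words, joined with a space."""
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf(docs):
    """Per-document weights: term count times log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(docs)
    return [{w: c * math.log(n_docs / df[w]) for w, c in Counter(tokens).items()}
            for tokens in tokenized]

print(tokenize('The election was over', remove_stopwords=True))  # ['election', 'over']
print(bigrams(tokenize('Very clean match')))                     # ['very clean', 'clean match']

weights = tfidf(docs)
# 'great' occurs in one document, 'a' in three, so 'great' gets the larger weight.
print(weights[0]['great'] > weights[0]['a'])  # True
```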

## <a id='refer'>Reference</a>

* [A practical explanation of a Naive Bayes classifier (Multinomial Naive Bayes Example)](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/#advanced-techniques)
* [scikit learn: Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html)

* [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
* [Pros of Naive Bayes](https://www.quora.com/What-are-the-advantages-of-using-a-naive-Bayes-for-classification)

* [Difference between Bernoulli and Multinomial Naive Bayes](https://datascience.stackexchange.com/questions/27624/difference-between-bernoulli-and-multinomial-naive-bayes)
* [Bernoulli and Multinomial Naive Bayes from scratch](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/text_classification/naive_bayes/naive_bayes.ipynb)
* [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model)
* [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
* [Naive Bayes and Text Classification](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html#3_3_multivariate)