## Text Classification using Naive Bayes Classifier

**Outline**

* [Introduction and what is Naive Bayes](#intro)
* [Simple example](#exp)
* [Bayes theorem](#bayes)
* [Being naive](#naive)
* [Apply smoothing for unknown words](#smooth)
* [Implementing Multinomial Naive Bayes using sklearn](#mlb)
* [Naive Bayes pros and cons](#pc)
* [Additional techniques to improve model](#add)
* [References](#ref)

In [43]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

## <a id="intro">Introduction</a>

I started learning text classification by training a Naive Bayes classifier to predict the category of a text. This notebook summarizes the basic algorithm and shows how to implement a simple Naive Bayes model.

**What is Naive Bayes?**

Naive Bayes is a probabilistic, linear classifier based on Bayes' theorem. The word *naive* comes from the assumption that the features in a dataset are mutually independent given the class, which is often violated in practice. Naive Bayes classifiers still tend to perform well under this unrealistic assumption, especially for small sample sizes.

## <a id="exp">Simple example</a>

Here I follow the example in the blog post [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/#advanced-techniques) to show how Naive Bayes can be applied to predict the tag of a text. The training data is shown below:

In [19]:
df = pd.DataFrame({'text': ['A great game',
                            'The election was over',
                            'Very clean match',
                            'A clean but forgettable game',
                            'It was a close election'],
                   'category':['Sports', 'Not Sports', 'Sports', 'Sports', 'Not Sports']})

df['vocab size'] = df['text'].str.split().apply(len)
df

| | category | text | vocab size |
|---|---|---|---|
| 0 | Sports | A great game | 3 |
| 1 | Not Sports | The election was over | 4 |
| 2 | Sports | Very clean match | 3 |
| 3 | Sports | A clean but forgettable game | 5 |
| 4 | Not Sports | It was a close election | 5 |


Given this training data, let's train a classifier to predict the category of the sentence 'a very close game'. Since Naive Bayes is a probabilistic model, we want to compute P(Sports|a very close game) and P(Not Sports|a very close game) and check which one is larger, i.e. more likely.

## <a id="bayes">Bayes theorem</a>

$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$

In our case, we want to know: $$P(\text{Sports} | \text{a very close game}) = \frac{P(\text{a very close game} | \text{Sports}) P(\text{Sports})}{P(\text{a very close game})}$$

and: $$P(\text{Not Sports} | \text{a very close game}) = \frac{P(\text{a very close game} | \text{Not Sports}) P(\text{Not Sports})}{P(\text{a very close game})}$$

Since we are only interested in which one is larger, and both share the same denominator, we can focus on calculating the numerators and compare their values.

We first calculate the prior probability of each tag from the training corpus. Three of the five sentences are tagged Sports and two are tagged Not Sports, so $P(\text{Sports}) = 3/5$ and $P(\text{Not Sports}) = 2/5$.
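As a small sketch of this step, the priors can be counted directly from the labels of the training data. The names `labels` and `priors` are illustrative and do not appear elsewhere in this notebook.

```python
from collections import Counter
from fractions import Fraction

# Labels of the five training sentences, in the same order as the training data
labels = ['Sports', 'Not Sports', 'Sports', 'Sports', 'Not Sports']

# Prior of a class = (sentences with that label) / (total number of sentences)
label_counts = Counter(labels)
priors = {label: Fraction(n, len(labels)) for label, n in label_counts.items()}

for label, p in priors.items():
    print(label, p)
```

Using `Fraction` keeps the values exact, which makes them easy to compare with the hand calculation.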

## <a id="naive">Being Naive</a>
We assume each word in a sentence is **independent** of the other ones, so we can look at each word separately and ignore word order. Under this assumption, "the party was fun" is treated the same as "fun party was the".
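The order-invariance can be illustrated with a bag-of-words count. This is only a sketch: `bag_of_words` is a hypothetical helper, and it tokenizes by lowercasing and splitting on whitespace, which is simpler than what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase a sentence, split it on whitespace, and count each word."""
    return Counter(sentence.lower().split())

first = bag_of_words("the party was fun")
second = bag_of_words("fun party was the")

# Both sentences contain the same words the same number of times,
# so a model that only sees word counts cannot tell them apart.
print(first == second)
```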

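We can rewrite $P(\text{a very close game})$ as $P(\text{a}) \times P(\text{very}) \times P(\text{close}) \times P(\text{game})$, and thus
$$P(\text{a very close game}|\text{Sports})=P(\text{a}|\text{Sports}) \times P(\text{very}|\text{Sports}) \times P(\text{close}|\text{Sports}) \times P(\text{game}|\text{Sports})$$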

Now we can go ahead and calculate these probabilities, which just amounts to counting how often each word occurs in each category of our training corpus.

A problem arises for $P(\text{close}|\text{Sports})$ because the word 'close' does not appear in any Sports sentence of our training corpus. If we take this probability to be 0, the whole product becomes 0 and tells us nothing about $P(\text{a very close game}|\text{Sports})$.
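The zero-probability problem can be reproduced with a few lines of counting. This is a sketch under simple assumptions: words are lowercased and split on whitespace, and `word_counts` and `mle` are illustrative names, not part of the notebook's own code.

```python
from collections import Counter
from fractions import Fraction

train = [
    ('A great game', 'Sports'),
    ('The election was over', 'Not Sports'),
    ('Very clean match', 'Sports'),
    ('A clean but forgettable game', 'Sports'),
    ('It was a close election', 'Not Sports'),
]

# Count how often each lowercased word occurs in each class
word_counts = {'Sports': Counter(), 'Not Sports': Counter()}
for text, label in train:
    word_counts[label].update(text.lower().split())

def mle(word, label):
    """Unsmoothed estimate: count of the word in the class / total words in the class."""
    counts = word_counts[label]
    return Fraction(counts[word], sum(counts.values()))

likelihood = Fraction(1)
for word in 'a very close game'.split():
    p = mle(word, 'Sports')
    print(word, p)
    likelihood *= p

# A single zero factor wipes out the whole product
print('P(a very close game | Sports) =', likelihood)
```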

## <a id="smooth">Apply smoothing for unknown words</a>

Smoothing is used when a word doesn't appear in training but appears in testing. 

**Add-one (Laplace) smoothing** is an additive smoothing method that adds one to the count of every word in the vocabulary, which is essentially pretending that we saw each word one more time than we did. To compensate, the vocabulary size of the entire training corpus (V) is added to the denominator, so that the smoothed probabilities still sum to 1.

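For a word $w_i$ and a class $y$, let $c(y, w_i)$ be the number of times $w_i$ occurs in the training documents of class $y$, and $c(y)$ the total number of words in those documents:

$$ P_{\text{MLE}}(w_i | y) = \frac{c(y, w_i)}{c(y)}$$
$$ P_{\text{Add-1}}(w_i | y) = \frac{c(y, w_i) + 1}{c(y) + V}$$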

**Issues with add-one smoothing**: When the number of zero counts is huge, too much of the total probability mass is moved to unseen events. Improvements include Good-Turing smoothing and lambda smoothing.
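One common generalization adds a constant `k` smaller than 1 instead of 1 (often called add-k or Lidstone smoothing). I am assuming this is what "lambda smoothing" refers to here. The sketch below uses a hypothetical helper `add_k_probability`. The example numbers are this notebook's Sports word total (11) and corpus vocabulary size (14).

```python
from fractions import Fraction

def add_k_probability(word_count, class_total, vocab_size, k=1):
    """Smoothed P(word | class) = (count + k) / (class_total + k * vocab_size)."""
    k = Fraction(k)
    return (word_count + k) / (class_total + k * vocab_size)

# k = 1 is add-one (Laplace) smoothing for a word never seen in the class
print(add_k_probability(0, 11, 14))                     # 1/25

# A smaller k moves less probability mass to unseen words
print(add_k_probability(0, 11, 14, k=Fraction(1, 10)))  # 1/124
```

With `k=0` the function reduces to the unsmoothed estimate, so the amount of smoothing can be tuned between the two extremes.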

In [23]:
# Total number of words in each category
size = df.groupby('category')["vocab size"].sum()
size

category
Not Sports     9
Sports        11
Name: vocab size, dtype: int64

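The totals above are the word counts per category. The vocabulary of the whole training corpus contains $V = 14$ distinct words. 'close' never occurs in Sports and occurs once in Not Sports.
Therefore, $$P(\text{close}|\text{Sports}) = \frac{0+1}{11 + 14} = \frac{1}{25}$$
$$P(\text{close}|\text{Not Sports}) = \frac{1+1}{9 + 14} = \frac{2}{23}$$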

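Final result: the numerator for Sports, $P(\text{a very close game}|\text{Sports}) P(\text{Sports}) = 0.0000276$, is larger than the numerator for Not Sports, $P(\text{a very close game}|\text{Not Sports}) P(\text{Not Sports}) = 0.00000572$, so our classifier gives "a very close game" the **Sports** tag! Note that these values are the unnormalized numerators from Bayes' theorem, not the posterior probabilities themselves.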
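The whole hand calculation can be checked with a short script. This is a minimal sketch, assuming lowercased whitespace tokenization and add-one smoothing. The helper `numerator` and the other names are illustrative only.

```python
from collections import Counter

train = [
    ('A great game', 'Sports'),
    ('The election was over', 'Not Sports'),
    ('Very clean match', 'Sports'),
    ('A clean but forgettable game', 'Sports'),
    ('It was a close election', 'Not Sports'),
]

# Per-class word counts and per-class document counts
word_counts = {}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.lower().split())

# Vocabulary of the whole training corpus (14 distinct words)
vocab = {word for counts in word_counts.values() for word in counts}

def numerator(sentence, label):
    """P(sentence | label) * P(label), with add-one smoothing on each word."""
    counts = word_counts[label]
    total = sum(counts.values())
    score = doc_counts[label] / len(train)  # prior P(label)
    for word in sentence.lower().split():
        score *= (counts[word] + 1) / (total + len(vocab))
    return score

numerators = {label: numerator('a very close game', label) for label in word_counts}

# Dividing by the sum of the numerators gives the normalized posteriors
evidence = sum(numerators.values())
for label, value in numerators.items():
    print(f'{label}: numerator={value:.3g}, posterior={value / evidence:.3f}')
```

Normalizing does not change which class wins, which is why comparing the numerators is enough for classification.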

## <a id="mlb">Implementing Multinomial Naive Bayes using Sklearn</a>

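`MultinomialNB` applies additive smoothing through its `alpha` parameter. The default is `alpha=1`, which corresponds to add-one smoothing.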

In [35]:
def multiNB_fit(df, x_colname, y_colname):
    """
    fit multinomial naive bayes model.
    
    Args:
        df (pd.DataFrame): a dataframe having the document and label
        x_colname (str): the colname for the document column
        y_colname (str): the colname for the label column       
    
    Returns:
       nb_clf: A Sklearn MultinomialNB object
       vect: the CountVectorizer fitted on the training data
    """
    
    # get document and label
    X_train = df[x_colname]
    y_train = df[y_colname]
    
    # fit the vectorizer on the training documents and build count features
    vect = CountVectorizer()
    X_train_feats = vect.fit_transform(X_train)

    # See the result of the vectorization
    print('feature name: ', vect.get_feature_names_out().tolist())

    # convert to dense array for better visualize representation
    print('training:')
    print(X_train_feats.toarray())

    # Fit Multinomial NB model and predict the final probability
    nb_clf = MultinomialNB()
    nb_clf.fit(X_train_feats, y_train)
            
    return nb_clf, vect

In [36]:
# fit Multinomial Naive Bayes Model
nb_clf, vect = multiNB_fit(df, x_colname='text', y_colname='category')

feature name:  ['but', 'clean', 'close', 'election', 'forgettable', 'game', 'great', 'it', 'match', 'over', 'the', 'very', 'was']
training:
[[0 0 0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 1 1 0 1]
 [0 1 0 0 0 0 0 0 1 0 0 1 0]
 [1 1 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 0 1 0 0 0 0 1]]


In [37]:
def multiNB_predict(nb_clf, vect, x_test, predict_class=True):
    """
    predict the classification result using the trained nb_clf.
    
    Args:
        nb_clf (sklearn.naive_bayes.MultinomialNB): Sklearn MultinomialNB object
        vect (CountVectorizer): the CountVectorizer fitted on the training data
        x_test (pd.Series): a pd.Series containing the documents for testing
        predict_class (bool): whether to return the predicted class or probability
    Returns:
        array contains predicted class or probabilities
    """
    
    # vectorize the test document 
    X_test_feats = vect.transform(x_test)
    
    # convert to dense array for better visualize representation
    print('\ntesting:')
    print(X_test_feats.toarray()) 
    
    # predict the class labels or the class probabilities
    if predict_class:
        pred = nb_clf.predict(X_test_feats)
    else:
        pred = nb_clf.predict_proba(X_test_feats)
                    
    print('Predicted results:', pred)
    
    return pred  

In [38]:
# manually input test data
x_test = pd.DataFrame({'text': ['A close game',
                               'It was a forgettable election']})
x_test

| | text |
|---|---|
| 0 | A close game |
| 1 | It was a forgettable election |


In [39]:
# Get the final classification label for testing data
multiNB_predict(nb_clf, vect, x_test['text'], predict_class=True)


testing:
[[0 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 1]]
Predicted results: ['Sports' 'Not Sports']


array(['Sports', 'Not Sports'], dtype='<U10')

In [40]:
# Get the predicted class probabilities for testing data
multiNB_predict(nb_clf, vect, x_test['text'], predict_class=False)


testing:
[[0 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 1 0 0 0 0 1]]
Predicted results: [[0.32785775 0.67214225]
 [0.87845067 0.12154933]]


array([[0.32785775, 0.67214225],
       [0.87845067, 0.12154933]])
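Two things help when reading these numbers. The columns of `predict_proba` follow the order of `classes_`, so the first column is Not Sports and the second is Sports. Also, the default `CountVectorizer` only keeps tokens of two or more characters, which is why 'a' is missing from the 13 feature names printed above. The sketch below assumes a custom `token_pattern` that keeps one-letter words, so that sklearn uses the same 14-word vocabulary as the hand calculation. The names `vect_all` and `clf` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['A great game',
         'The election was over',
         'Very clean match',
         'A clean but forgettable game',
         'It was a close election']
labels = ['Sports', 'Not Sports', 'Sports', 'Sports', 'Not Sports']

# Keep one-letter words such as "a" (the default pattern drops them)
vect_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vect_all.fit_transform(texts)

# alpha=1.0 is add-one smoothing, the same as in the hand calculation
clf = MultinomialNB(alpha=1.0).fit(X, labels)
proba = clf.predict_proba(vect_all.transform(['a very close game']))[0]

# Pair each probability with its class label
for label, p in zip(clf.classes_, proba):
    print(f'{label}: {p:.4f}')
```

With this tokenization the Sports probability should come out at about 0.83, which is the Sports numerator from the hand calculation divided by the sum of both numerators.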

## <a id="pc">Naive Bayes pros and cons</a>

**pros:**
* Little parameter tuning is required (the main hyperparameter is the smoothing parameter alpha).
* Simple to implement and computationally inexpensive.
* Highly scalable. It scales linearly with the number of predictors and data points.
* Can be used for both binary and multiclass classification problems.
* Not sensitive to irrelevant features.

**cons:**
* Strong assumption of feature independence.
* Data scarcity: words that were never seen with a class get zero probability, which requires smoothing.

## <a id="add">Additional techniques to improve model</a>

**Remove stop words:** remove common words that don't add anything meaningful to the classification such as: the, a, was, it.

**Lemmatize words:** group together different inflections of the same word. For example, *election* and *elections*, or *elect* and *elected*, would be grouped and counted together.

**Use n-grams:** tokenize into phrases of more than one word and count these phrases in addition to, or instead of, single words.

**Use TF-IDF:** Instead of just counting frequency, we could do something more advanced like penalizing words that appear frequently in most of the documents.
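Three of these techniques can be combined in a single sklearn pipeline. This is a sketch, not a tuned model: it assumes sklearn's built-in English stop word list and unigrams plus bigrams, and it leaves out lemmatization because that needs an extra library such as NLTK or spaCy. The step names `tfidf` and `nb` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ['A great game',
         'The election was over',
         'Very clean match',
         'A clean but forgettable game',
         'It was a close election']
labels = ['Sports', 'Not Sports', 'Sports', 'Sports', 'Not Sports']

pipeline = Pipeline([
    # Remove English stop words, use unigrams and bigrams, weight by TF-IDF
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    # Multinomial Naive Bayes also accepts the fractional TF-IDF weights
    ('nb', MultinomialNB()),
])
pipeline.fit(texts, labels)

# Inspect which unigrams and bigrams survived the preprocessing
vocabulary = pipeline.named_steps['tfidf'].vocabulary_
print(sorted(vocabulary))

predicted = pipeline.predict(['a very close game'])
print(predicted)
```

On a corpus this small the effect of each option is hard to judge, so treat the settings as a starting point for experiments on real data.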

## <a id="ref">References</a>
* [Text Classification using Naive Bayes from Johnny Chiu](https://nbviewer.jupyter.org/github/johnnychiuchiu/Machine-Learning/blob/master/TextAnalytics/naiveBayesTextClassification.ipynb#laplace)
* [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/#advanced-techniques)
* [scikit learn: Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)