# Applying the Multinomial Naive Bayes Algorithm to Classify the Sentiment of Yelp Review Comments

This project aims to build a sentiment classifier with accuracy above 80% using the multinomial Naive Bayes algorithm. For each word, we estimate its probability under the positive and the negative category; a word is more indicative of the positive category when its positive-category probability is the larger of the two. By combining the probabilities of all words in a sentence/comment, we obtain a score for the whole comment under each category and assign the comment to the category (positive or negative) with the higher score.

The data used in this study is from https://www.kaggle.com/rahulin05/sentiment-labelled-sentences-data-set . 

Note that this project is inspired by one of the Guided Projects of DataQuest.

## Dataset

In [60]:
import numpy as np
import pandas as pd

In [61]:
# read the file 
yp = pd.read_csv('yelp_labelled.txt', delimiter = "\t", header=None)
yp.columns = ['comment','category']
yp.head()

Unnamed: 0,comment,category
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


**Category/Label**
- A negative comment: 0
- A positive comment: 1

In [62]:
# Percentages of positive and negative
yp['category'].value_counts(normalize=True)*100

1    50.0
0    50.0
Name: category, dtype: float64

In [63]:
yp.shape[0] # number of review comments in this dataset

1000

## Splitting Train and Test Sets

In [64]:
randomized_yp = yp.sample(frac=1, random_state=1)
idx = round((randomized_yp.shape[0])*0.8) # 80% train, 20% test
train = randomized_yp[:idx].reset_index(drop=True)
test = randomized_yp[idx:].reset_index(drop=True)

print(train.shape)
print(test.shape)

(800, 2)
(200, 2)


In [65]:
train['category'].value_counts(normalize=True)*100

1    50.0
0    50.0
Name: category, dtype: float64

In [66]:
test['category'].value_counts(normalize=True)*100

1    50.0
0    50.0
Name: category, dtype: float64

The outputs above confirm that the train and test sets have the same proportions of positive and negative comments.
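The shuffle-and-cut split used above can be wrapped in a small helper. This is only a sketch on toy data: `split_train_test`, `toy`, `toy_train` and `toy_test` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

def split_train_test(df, frac_train=0.8, seed=1):
    """Shuffle df, then cut it into train and test parts (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = round(len(shuffled) * frac_train)
    train_part = shuffled.iloc[:cut].reset_index(drop=True)
    test_part = shuffled.iloc[cut:].reset_index(drop=True)
    return train_part, test_part

# toy data with the same column names as the Yelp DataFrame
toy = pd.DataFrame({
    "comment": [f"review {i}" for i in range(10)],
    "category": [0, 1] * 5,
})
toy_train, toy_test = split_train_test(toy)
print(toy_train.shape, toy_test.shape)  # (8, 2) (2, 2)
```

Note that a plain random split does not guarantee equal class proportions in both parts, which is why checking `value_counts` afterwards is useful.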

## Data Cleaning

### Lowercase and Remove Punctuation

To simplify further analysis, we replace punctuation with spaces and convert all words to lowercase.

In [67]:
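# regex=True is required in recent pandas versions for '\W' to be treated as a pattern
train['comment'] = train['comment'].str.replace(r'\W', ' ', regex=True)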
train['comment'] = train['comment'].str.lower()
train.head()

Unnamed: 0,comment,category
0,my gyro was basically lettuce only,0
1,it kept getting worse and worse so now i m off...,0
2,i am far from a sushi connoisseur but i can de...,0
3,the staff are great the ambiance is great,1
4,by this time our side of the restaurant was al...,0
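To see what the cleaning step does to individual comments, here is a minimal sketch on two of the raw Yelp sentences shown earlier. The names `sample` and `cleaned` are illustrative only.

```python
import pandas as pd

sample = pd.Series(["Wow... Loved this place.", "Crust is not good."])

# every non-word character (punctuation, whitespace) becomes a space,
# then the text is lowercased
cleaned = sample.str.replace(r"\W", " ", regex=True).str.lower()

# splitting on whitespace collapses the runs of spaces left by punctuation
print(cleaned.str.split().tolist())
# [['wow', 'loved', 'this', 'place'], ['crust', 'is', 'not', 'good']]
```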


### Create the Vocabulary List

In [68]:
train['comment'] = train['comment'].str.split() # split the text to get the list of words
train.head()

Unnamed: 0,comment,category
0,"[my, gyro, was, basically, lettuce, only]",0
1,"[it, kept, getting, worse, and, worse, so, now...",0
2,"[i, am, far, from, a, sushi, connoisseur, but,...",0
3,"[the, staff, are, great, the, ambiance, is, gr...",1
4,"[by, this, time, our, side, of, the, restauran...",0


In [69]:
# store all the words in the vocab list
vocab = []
for comment in train['comment']:
    for word in comment:
        vocab.append(word)

# remove duplicates
vocab = set(vocab)
vocabulary = list(vocab)

In [70]:
vocabulary[:5]

['six', 'crab', 'attached', 'see', 'did']
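The vocabulary can also be built in one step with a set comprehension. The sketch below uses a toy list of tokenized comments (`tokenized` and `vocab_sketch` are illustrative names); sorting the result is an optional extra that makes the word order reproducible, since the iteration order of a set of strings can differ between Python sessions.

```python
# toy tokenized comments, standing in for train['comment']
tokenized = [["the", "staff", "are", "great"], ["the", "food", "was", "great"]]

# collect unique words across all comments, then sort for a stable order
vocab_sketch = sorted({word for comment in tokenized for word in comment})
print(vocab_sketch)  # ['are', 'food', 'great', 'staff', 'the', 'was']
```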

### Storing Word Counts in Dictionary

In [71]:
word_counts_per_comment = {unique_word: [0]*len(train['comment']) for unique_word in vocabulary}

for idx, comment in enumerate(train['comment']):
    for word in comment:
        word_counts_per_comment[word][idx] += 1

In [72]:
word_counts_per_comment = pd.DataFrame(word_counts_per_comment)
word_counts_per_comment.head(3)

Unnamed: 0,six,crab,attached,see,did,dark,return,lover,crowd,happened,...,old,famous,meats,diverse,likes,huevos,restaurant,traditional,filling,cranberry
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [73]:
# combine the word_counts_per_comment dataframe with the train set
train = pd.concat([train, word_counts_per_comment], axis=1)
train.head(3)

Unnamed: 0,comment,category,six,crab,attached,see,did,dark,return,lover,...,old,famous,meats,diverse,likes,huevos,restaurant,traditional,filling,cranberry
0,"[my, gyro, was, basically, lettuce, only]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[it, kept, getting, worse, and, worse, so, now...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[i, am, far, from, a, sushi, connoisseur, but,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
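The wide word-count table is mostly zeros. As an alternative sketch (not the approach used in this notebook), per-category word counts can be accumulated directly with `collections.Counter`. The toy data and the names `toy_comments` and `counts_by_category` are assumptions for illustration.

```python
from collections import Counter

# (tokenized comment, category) pairs; 1 = positive, 0 = negative
toy_comments = [
    (["great", "food", "great", "staff"], 1),
    (["bad", "food"], 0),
]

counts_by_category = {0: Counter(), 1: Counter()}
for words, category in toy_comments:
    counts_by_category[category].update(words)

print(counts_by_category[1]["great"])  # 2
print(counts_by_category[0]["great"])  # 0 (a Counter returns 0 for unseen keys)
```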


## Probability Calculations

The probability equation that we can use to classify new comments is Bayes' theorem (eqn.1):

$$P(Sentimental \space Category \space A|New \space Comment) = \frac{P(Sentimental \space Category \space A)\cdot P(New \space Comment|Sentimental \space Category \space A)}{P(New \space Comment)}$$      

Since we only want to compare the results of eqn.1 for the two categories, we can drop the denominator, which is the same for both, to simplify the calculation without affecting the final classification decision. The equations for the two categories reduce to:

$$ P(Positive|New \space Comment) \space \propto \space  P(Positive)\cdot P(New \space Comment|Positive)$$
$$ P(Negative|New \space Comment) \space \propto \space P(Negative)\cdot P(New \space Comment|Negative)$$

Note that after dropping the denominator, the calculated values are no longer true probabilities (they do not sum to 1 across categories). However, we can still compare them to classify the sentiment category of new comments, which is our main goal.
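A small numeric check makes this concrete. The priors and likelihoods below are hypothetical numbers chosen only for illustration; the point is that dividing both scores by the same $P(New \space Comment)$ does not change which category wins.

```python
# hypothetical values, for illustration only
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {"pos": 0.002, "neg": 0.0005}  # P(New Comment | category)

# numerators of eqn.1
unnormalized = {c: prior[c] * likelihood[c] for c in prior}

# P(New Comment) is the sum of the numerators over both categories
evidence = sum(unnormalized.values())
posterior = {c: unnormalized[c] / evidence for c in prior}

# the winning category is the same with or without the division
print(max(unnormalized, key=unnormalized.get))  # pos
print(max(posterior, key=posterior.get))        # pos
```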

We would like to find the probability of each sentiment category given a new comment, which consists of many words. This can be expressed as follows (eqn.2):

$$ P(Category \space A| w_1,w_2,...,w_n) = \frac{P(Category \space A \space \cap \space (w_1,w_2,...,w_n))}{P(w_1,w_2,...,w_n)}$$

where $w_1, w_2, w_3,...,w_n$ are words within that comment.


To make the calculation easier, we drop the denominator and reduce eqn.2 to (eqn.3):

$$ P(Category \space A| w_1,w_2,...,w_n) \space \propto \space P(Category \space A \cap (w_1,w_2,...,w_n))$$


Here we assume that all the words are conditionally independent given the category. This is why the algorithm is called Naive Bayes: conditional independence is a simplifying ("naive") assumption that keeps the calculation manageable. In reality it does not always hold, and words are likely to depend on one another.

In this study, we assume conditional independence between words. Since $P(Category \space A \cap (w_1,w_2,...,w_n)) = P(Category \space A)\cdot P(w_1,w_2,...,w_n|Category \space A)$, eqn.3 can be written as follows (eqn.4):

$$ P(Category \space A| w_1,w_2,...,w_n) \space \propto \space P(Category \space A)\cdot P(w_1|Category \space A)\cdot P(w_2|Category \space A) \cdot P(w_3|Category \space A) \cdot \space \dots  \space \cdot P(w_n|Category \space A)$$


Eqn.4 can also be written in product form (eqn.5):

$$ P(Category \space A| w_1,w_2,...,w_n) \space \propto \space P(Category \space A)\cdot \prod_{i=1}^{n} P(w_i|Category \space A) $$

To calculate $P(w_i|Category \space A)$, we will use the following equation (eqn.6):

$$ P(w_i|Category \space A) = \frac{N_{w_i|Category \space A}+\alpha}{N_{Category \space A} + \alpha \cdot N_{Vocabulary}}$$

where 
- $N_{w_i|Category \space A}$ = number of times that the word ($w_i$) occurs in Category A comments
- $N_{Category \space A}$ = total number of words in Category A comments
- $N_{Vocabulary}$ = number of unique words in the vocabulary (the list of unique words found in the train set)
- $\alpha$ = the smoothing parameter (Laplace/additive smoothing). We set it to 1. It solves the problem that arises when a word is not found in the comments of a certain sentiment category. For example, suppose we want to calculate P(Positive|'I am not a bad person') and "bad" does not occur in the Positive comments. Without smoothing, P('bad'|Positive) is zero, and because of that, using eqn.4, P(Positive|'I am not a bad person') becomes zero as well. This shows that a single word missing from the comments of a category can drive the calculated probability to zero and lead to a wrong classification.
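The effect of $\alpha$ in eqn.6 can be checked with a few hypothetical counts. In this sketch, `smoothed_prob` and `toy_counts` are illustrative names, and the numbers are made up.

```python
def smoothed_prob(n_word_in_cat, n_words_in_cat, n_vocab, alpha=1):
    """Eqn.6: smoothed estimate of P(word | category)."""
    return (n_word_in_cat + alpha) / (n_words_in_cat + alpha * n_vocab)

# a word never seen in the category still gets a small non-zero probability
p_unseen = smoothed_prob(0, n_words_in_cat=100, n_vocab=50)  # 1/150
p_seen = smoothed_prob(5, n_words_in_cat=100, n_vocab=50)    # 6/150 = 0.04
print(p_unseen > 0, p_seen)

# over a whole vocabulary, the smoothed probabilities still sum to 1
toy_counts = {"good": 3, "food": 1, "bad": 0}
n_total = sum(toy_counts.values())
total_prob = sum(smoothed_prob(n, n_total, len(toy_counts)) for n in toy_counts.values())
print(round(total_prob, 10))  # 1.0
```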

In [74]:
# extract individual sentimental comments
pos_comment = train[train['category']==1]
neg_comment = train[train['category']==0]

# calculate probabilities of the individuals
p_pos = len(pos_comment)/len(train)
p_neg = len(neg_comment)/len(train)

In [75]:
# calculate the number of all words (not just unique) in both groups, plus the vocabulary size
n_pos = pos_comment['comment'].apply(len).sum()
n_neg = neg_comment['comment'].apply(len).sum()
n_vocabulary = len(vocabulary)

## use Laplace smoothing and assign alpha = 1
alpha = 1

In [76]:
pos_probs = {unique_word:0 for unique_word in vocabulary}
neg_probs = {unique_word:0 for unique_word in vocabulary}

# using eqn.6
for word in vocabulary:
    n_word_given_pos = pos_comment[word].sum()
    p_word_given_pos = (n_word_given_pos + alpha)/(n_pos + alpha*n_vocabulary)
    pos_probs[word] = p_word_given_pos

    n_word_given_neg = neg_comment[word].sum()
    p_word_given_neg = (n_word_given_neg + alpha)/(n_neg + alpha*n_vocabulary)
    neg_probs[word] = p_word_given_neg

## Classification

In [77]:
import re
def classify(comment):
    comment = re.sub(r'\W', ' ', comment)  # raw string avoids an invalid-escape warning
    comment = comment.lower()
    comment = comment.split()
    
    p_pos_given_comment = p_pos
    p_neg_given_comment = p_neg
    for word in comment:
        if word in pos_probs:
            p_pos_given_comment *= pos_probs[word]
        if word in neg_probs:
            p_neg_given_comment *= neg_probs[word]
    
    if p_pos_given_comment > p_neg_given_comment:
        return 1
    elif p_neg_given_comment > p_pos_given_comment:
        return 0
    else: 
        return -1 # cannot classify
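Multiplying many small probabilities can underflow to zero for long comments. A common remedy is to sum logarithms instead; because the logarithm is monotonic, the comparison between categories is unchanged. The sketch below is a variant, not the function used in this notebook: `classify_log` and the toy dictionaries are hypothetical, and the fitted values are passed in as arguments rather than read from module-level variables.

```python
import math
import re

def classify_log(comment, p_pos, p_neg, pos_probs, neg_probs):
    """Log-space version of eqn.5; returns 1, 0, or -1 when the scores tie."""
    words = re.sub(r"\W", " ", comment).lower().split()
    log_pos = math.log(p_pos)
    log_neg = math.log(p_neg)
    for word in words:
        # words outside the vocabulary are skipped, as in classify()
        if word in pos_probs:
            log_pos += math.log(pos_probs[word])
        if word in neg_probs:
            log_neg += math.log(neg_probs[word])
    if log_pos > log_neg:
        return 1
    if log_neg > log_pos:
        return 0
    return -1

# hypothetical smoothed word probabilities
toy_pos = {"great": 0.3, "bad": 0.05, "food": 0.2}
toy_neg = {"great": 0.05, "bad": 0.3, "food": 0.2}
print(classify_log("Great food!", 0.5, 0.5, toy_pos, toy_neg))  # 1
print(classify_log("Bad food.", 0.5, 0.5, toy_pos, toy_neg))    # 0
```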

## Sentimental Classifier Accuracy

In [78]:
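def classify_accuracy(df):
    # percentage of rows whose predicted label equals the true label
    correct = 0
    total = df.shape[0]
    
    for _, row in df.iterrows():
        if row['category'] == row['predicted']:
            correct += 1

    accuracy = round(correct*100/total, 2)
    print("Accuracy: ", accuracy, "%")
    return accuracy  # returned so that callers can store the value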
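The same accuracy can be computed without an explicit loop by comparing the two columns directly. This is a minimal sketch, assuming a DataFrame with `category` and `predicted` columns as above; `accuracy_percent` and `toy_results` are illustrative names.

```python
import pandas as pd

def accuracy_percent(df):
    """Percentage of rows where the predicted label matches the true label."""
    matches = df["category"] == df["predicted"]  # boolean Series
    return round(float(matches.mean()) * 100, 2)

toy_results = pd.DataFrame({
    "category":  [1, 0, 1, 0],
    "predicted": [1, 0, 0, 0],
})
print(accuracy_percent(toy_results))  # 75.0
```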

In [79]:
test['predicted'] = test['comment'].apply(classify)
test.head()

Unnamed: 0,comment,category,predicted
0,Tonight I had the Elk Filet special...and it s...,0,0
1,What a mistake.,0,0
2,I asked multiple times for the wine list and a...,0,0
3,My friend loved the salmon tartar.,1,1
4,If someone orders two tacos don't' you think i...,0,0


In [80]:
classify_accuracy(test)

Accuracy:  84.0 %


## Function for Other Dataset Classification

In [81]:
def sentiment_classify(df, frac_train_test_split):
    randomized_df = df.sample(frac=1, random_state=1)
    idx = round((randomized_df.shape[0])*frac_train_test_split) 
    train = randomized_df[:idx].reset_index(drop=True)
    test = randomized_df[idx:].reset_index(drop=True)
    
    train['comment'] = train['comment'].str.replace(r'\W', ' ', regex=True)
    train['comment'] = train['comment'].str.lower()
    train['comment'] = train['comment'].str.split()
    
    # create vocabulary
    vocab = []
    for comment in train['comment']:
        for word in comment:
            vocab.append(word)
    vocab = set(vocab)
    vocabulary = list(vocab)
    
    # store word counts in dictionary 
    word_counts_per_comment = {unique_word: [0]*len(train['comment']) for unique_word in vocabulary}
    for idx, comment in enumerate(train['comment']):
        for word in comment:
            word_counts_per_comment[word][idx] += 1

    # concatenate the word-count columns to the dataframe
    word_counts_per_comment = pd.DataFrame(word_counts_per_comment)
    train = pd.concat([train, word_counts_per_comment], axis=1)
    
    ## Probability calculations
    # extract individual sentimental comments
    pos_comment = train[train['category']==1]
    neg_comment = train[train['category']==0]

    # calculate probabilities of the individuals
    p_pos = len(pos_comment)/len(train)
    p_neg = len(neg_comment)/len(train)
    
    # calculate the number of all words (not just unique) in both groups, plus the vocabulary size
    n_pos = pos_comment['comment'].apply(len).sum()
    n_neg = neg_comment['comment'].apply(len).sum()
    n_vocabulary = len(vocabulary)

    ## use Laplace smoothing and assign alpha = 1
    alpha = 1
    
    pos_probs = {unique_word:0 for unique_word in vocabulary}
    neg_probs = {unique_word:0 for unique_word in vocabulary}

    # using eqn.6
    for word in vocabulary:
        n_word_given_pos = pos_comment[word].sum()
        p_word_given_pos = (n_word_given_pos + alpha)/(n_pos + alpha*n_vocabulary)
        pos_probs[word] = p_word_given_pos

        n_word_given_neg = neg_comment[word].sum()
        p_word_given_neg = (n_word_given_neg + alpha)/(n_neg + alpha*n_vocabulary)
        neg_probs[word] = p_word_given_neg
    
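    # predict the category
    # NOTE: classify() reads the module-level p_pos, p_neg, pos_probs and neg_probs
    # (fitted on the Yelp train set), not the local values computed above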
    test['predicted'] = test['comment'].apply(classify)
    
    # calculate and return the accuracy
    return classify_accuracy(test)
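One point to be aware of: `sentiment_classify` calls the module-level `classify`, which looks up `p_pos`, `p_neg`, `pos_probs` and `neg_probs` as global names, so the local values fitted inside the function are not the ones used for prediction. A possible restructuring, sketched here under the assumption that comments are already tokenized, is to return the fitted values and pass them explicitly. `fit_naive_bayes` and `predict` are hypothetical helpers, not part of the original notebook.

```python
from collections import Counter

def fit_naive_bayes(tokenized_comments, categories, alpha=1):
    """Estimate priors and smoothed word probabilities (eqn.6) for categories 0 and 1."""
    vocabulary = {w for words in tokenized_comments for w in words}
    counts = {0: Counter(), 1: Counter()}
    for words, category in zip(tokenized_comments, categories):
        counts[category].update(words)
    n_docs = Counter(categories)
    priors = {c: n_docs[c] / len(categories) for c in (0, 1)}
    word_probs = {}
    for c in (0, 1):
        denom = sum(counts[c].values()) + alpha * len(vocabulary)
        word_probs[c] = {w: (counts[c][w] + alpha) / denom for w in vocabulary}
    return priors, word_probs

def predict(words, priors, word_probs):
    """Apply eqn.5 with explicitly passed parameters; -1 means the scores tie."""
    scores = dict(priors)
    for c in scores:
        for w in words:
            if w in word_probs[c]:  # skip words outside the vocabulary
                scores[c] *= word_probs[c][w]
    if scores[1] > scores[0]:
        return 1
    if scores[0] > scores[1]:
        return 0
    return -1

# toy training data
toy_x = [["great", "food"], ["loved", "it"], ["bad", "food"], ["not", "good"]]
toy_y = [1, 1, 0, 0]
priors, word_probs = fit_naive_bayes(toy_x, toy_y)
print(predict(["great", "place"], priors, word_probs))  # 1
print(predict(["bad"], priors, word_probs))             # 0
```

With this shape, each dataset gets its own fitted parameters, and nothing depends on what happens to be defined at module level.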

#### Now, let's try our function with the Amazon dataset.

In [82]:
# import and change the column names
amz = pd.read_csv('amazon_cells_labelled.txt', delimiter = "\t", header=None)
amz.columns = ['comment','category']

# run the classification function
frac_train_test_split = [0.7, 0.75, 0.8, 0.85, 0.9]

for r in frac_train_test_split:
    accuracy = sentiment_classify(amz, r)

Accuracy:  74.0 %
Accuracy:  74.8 %
Accuracy:  76.5 %
Accuracy:  77.33 %
Accuracy:  78.0 %


## Summary and Future Work

- Although we reached our goal of exceeding 80% accuracy on the Yelp review dataset (84%), the accuracy drops to 74-78% when we apply our function to a different dataset, i.e. Amazon reviews. One possible explanation is that the vocabulary list does not contain enough words, since the accuracy increases as the train set grows. More data may therefore help improve the model's performance.
- One caveat on the Amazon figures: `sentiment_classify` predicts with the module-level `classify` function, which uses the probabilities fitted on the Yelp train set. The explanation above should be re-checked once the function predicts with its own fitted probabilities.
- For future work, we can examine the individual comments whose actual categories differ from our predictions, and we may be able to improve the model based on those observations.
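Pulling out the misclassified comments takes a single boolean filter. This is a sketch on toy data: `toy_test` and `misclassified` are illustrative names, and the predictions shown are invented.

```python
import pandas as pd

# toy stand-in for the test DataFrame after prediction
toy_test = pd.DataFrame({
    "comment": ["Not good at all.", "Loved it!", "What a mistake."],
    "category": [0, 1, 0],
    "predicted": [1, 1, 0],
})

# keep only the rows where the prediction disagrees with the label
misclassified = toy_test[toy_test["category"] != toy_test["predicted"]]
print(misclassified["comment"].tolist())  # ['Not good at all.']
```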