<a href="https://colab.research.google.com/github/phanmanhtung/Natural-Language-Processing-Course/blob/master/Naive_Bayes_for_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes for Sentiment Classification

In this notebook, we will implement the Naive Bayes algorithm for text classification and apply it to sentiment classification data.

Student's name: Phan Manh Tung
<br />
ID: USHTBi8-160


## Data

We will use the sentiment analysis corpus [polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) created by Bo Pang and Lillian Lee. The task is to classify reviews as having positive or negative polarity.

The dataset contains 10,662 movie reviews, of which 50% have positive sentiment and 50% have negative sentiment. The data is stored in the file `sentiment.txt`, where each line is one review preceded by its label (+1 or -1). All reviews are tokenized. For instance:

```
+1 if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 
-1 enigma is well-made , but it's just too dry and too placid .
```

We need to download data first.

In [0]:
!rm -f sentiment.txt
!wget https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt

--2020-03-24 12:03:59--  https://raw.githubusercontent.com/minhpqn/nlp_100_drill_exercises/master/data/sentiment.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270444 (1.2M) [text/plain]
Saving to: ‘sentiment.txt’


2020-03-24 12:04:00 (27.4 MB/s) - ‘sentiment.txt’ saved [1270444/1270444]



### Loading data

We will load data into a list of sentences with their labels.

In [0]:
import re


def load_data(file_path):
    data = []
    # Regular expression to get the label and the text
    regx = re.compile(r'^(\+1|-1)\s+(.+)$')
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            match = regx.match(line)
            if match:
                lb = match.group(1)
                text = match.group(2)
                data.append((text, lb))
    return data
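
As a quick check of the parsing rule, the sketch below applies the same regular expression to single lines. The helper name `parse_line` is hypothetical and only serves this illustration; it shows that a line without a leading `+1` or `-1` label produces no match, which is why `load_data` silently skips such lines.

```python
import re

# Same pattern as in load_data: a "+1" or "-1" label, whitespace, then the review text.
LINE_RE = re.compile(r'^(\+1|-1)\s+(.+)$')


def parse_line(line):
    """Return (text, label) for a well-formed line, or None otherwise (hypothetical helper)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(2), match.group(1)


print(parse_line("-1 enigma is well-made , but it's just too dry and too placid ."))
print(parse_line("0 a line with an unexpected label"))
```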

In [0]:
data = load_data('./sentiment.txt')

In [0]:
print(data[0])
print(data[-1])

('the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', '+1')
("enigma is well-made , but it's just too dry and too placid .", '-1')


## Train/test split

We will split the data into training and test sets with an 80/20 ratio. The split is random, so the label distributions of the two sets are similar but not necessarily identical.

We use the [scikit-learn](https://scikit-learn.org) library to do the train/test split.

In [0]:
from sklearn.model_selection import train_test_split

texts, labels = zip(*data)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [0]:
print(train_texts)



Let's check labels on the training data and test data.


In [0]:
from collections import Counter

print(Counter(train_labels))
print(Counter(test_labels))

Counter({'+1': 4269, '-1': 4260})
Counter({'-1': 1071, '+1': 1062})
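
The counts above are close to balanced but not exact, because the split is purely random. If exact proportions matter, scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The sketch below illustrates the idea in plain Python on made-up data; `stratified_split` is a hypothetical helper, not part of this notebook or of scikit-learn.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        # Take the same fraction of every class for the test set.
        n_test = round(len(items) * test_size)
        test.extend((text, label) for text in items[:n_test])
        train.extend((text, label) for text in items[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data (assumption): 10 positive and 10 negative placeholder reviews.
toy_texts = ['review %d' % i for i in range(20)]
toy_labels = ['+1'] * 10 + ['-1'] * 10
train, test = stratified_split(toy_texts, toy_labels)
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```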


## Multinomial Naive Bayes Model

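In this section, we will implement the Multinomial Naive Bayes (MNB) model. The implementation follows the pseudocode in Figure 4.2 of Chapter 4, "Naive Bayes and Sentiment Analysis" (SLP book). Note that the training function in this section counts each word at most once per document, which makes it the binarized (Boolean) variant of MNB.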
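
For reference, the quantities computed by the code in this section can be written as follows. The symbols are notation chosen here to mirror the variable names in the code: $N_c$ and $N_{doc}$ correspond to `nc[c]` and `ndoc`, $V$ to `vocab`, $\alpha$ to `alpha`, and $\mathrm{count}(w, c)$ to `count[(w,c)]`.

$$
\log P(c) = \log \frac{N_c}{N_{doc}}, \qquad
\log P(w \mid c) = \log \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha \, |V|}
$$

$$
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i \,:\, w_i \in V} \log P(w_i \mid c) \Big]
$$

The sum in the decision rule runs over the tokens $w_i$ of the test document that appear in the vocabulary; unknown words are ignored.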

### Training Multinomial Naive Bayes Model

We first extract a vocabulary from the training dataset, which is a list of sentences. For the sake of simplicity, we keep all words except punctuation tokens.

In [0]:
import string

def build_vocab(texts):
    """Build vocabulary from dataset

    Args:
        texts (list): list of tokenized sentences
    
    Returns:
        vocab (dict): map from word to index
    """
    vocab = {}
    for s in texts:
        for word in s.split():
            # Check if word is a punctuation
            if word in string.punctuation:
                continue
            if word not in vocab:
                idx = len(vocab)
                vocab[word] = idx
    return vocab

Let's check how the function `build_vocab` works.

In [0]:
vocab = build_vocab(train_texts)

In [0]:
print(vocab)
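
One detail worth knowing: `word in string.punctuation` is a substring test against the punctuation string, so it filters single characters such as `,` but not multi-character tokens such as `--` or `...`. The sketch below contrasts it with a per-character check. The helper `is_punct_token` is hypothetical, and whether such tokens matter for this corpus has not been checked here; switching to it would change the vocabulary and therefore the results reported in this notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punct_token(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return token != '' and all(ch in PUNCT_CHARS for ch in token)


# Columns: token, substring test used in build_vocab, per-character test.
for token in [',', '--', '...', "it's", 'well-made']:
    print(repr(token), token in string.punctuation, is_punct_token(token))
```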



In [0]:
from collections import defaultdict
import math

# For each document, we only need to count each word once.
# dict.fromkeys removes duplicates while preserving first-seen order.
def count_one_time(d):
    return list(dict.fromkeys(d))

def train_naive_bayes(texts, labels, target_classes, alpha=1):
    """Train a binarized Naive Bayes model
    """
    ndoc = 0
    nc = defaultdict(int)   # map from a class label to number of documents in the class
    logprior = dict()
    loglikelihood = dict()
    count = defaultdict(int)  # count the occurrences of w in documents of class c

    vocab = build_vocab(texts)
    # Training
    for s, c in zip(texts, labels):
        ndoc += 1
        nc[c] += 1
        # Binarize: each distinct word is counted once per document
        for w in count_one_time(s.split()):
            if w in vocab:
                count[(w,c)] += 1
    
    vocab_size = len(vocab)
    for c in target_classes:
        logprior[c] = math.log(nc[c]/ndoc)
        sum_ = 0
        for w in vocab.keys():
            if (w,c) not in count: count[(w,c)] = 0
            sum_ += count[(w,c)]
        
        for w in vocab.keys():
            loglikelihood[(w,c)] = math.log( (count[(w,c)] + alpha) / (sum_ + alpha * vocab_size) )
    
    return logprior, loglikelihood, vocab

Let's test the training function on a toy example.

In [0]:
data = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j")
]
texts, labels = zip(*data)
target_classes = ["c", "j"]

logprior, loglikelihood, vocab = train_naive_bayes(texts, labels, target_classes)

Let's confirm our implementation works correctly.

In [0]:
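# Compare with math.isclose rather than ==, since the values are floating-point
assert math.isclose(logprior['c'], math.log(0.75))
assert math.isclose(logprior['j'], math.log(0.25))
assert math.isclose(loglikelihood[('Chinese', 'c')], math.log(1/3))
assert math.isclose(loglikelihood[('Tokyo', 'c')], math.log(1/12))
assert math.isclose(loglikelihood[('Japan', 'c')], math.log(1/12))
assert math.isclose(loglikelihood[('Tokyo', 'j')], math.log(2/9))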

No `AssertionError` is raised, so the training step reproduces the expected values on this toy example.
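
The expected values in the assertions can be derived by hand. With duplicates removed within each document, class `c` contributes the counts Chinese: 3, Beijing: 1, Shanghai: 1, Macao: 1 (total 6), class `j` contributes Tokyo: 1, Japan: 1, Chinese: 1 (total 3), and the vocabulary has $|V| = 6$ words. With $\alpha = 1$:

$$
P(\text{Chinese} \mid c) = \frac{3 + 1}{6 + 6} = \frac{1}{3}, \qquad
P(\text{Tokyo} \mid c) = P(\text{Japan} \mid c) = \frac{0 + 1}{6 + 6} = \frac{1}{12}, \qquad
P(\text{Tokyo} \mid j) = \frac{1 + 1}{3 + 6} = \frac{2}{9}
$$

The priors follow from the document counts: $P(c) = 3/4$ and $P(j) = 1/4$.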

### Prediction Function

In [0]:
def test_naive_bayes(testdoc, logprior, loglikelihood, target_classes, vocab):
    sum_ = {}
    for c in  target_classes:
        sum_[c] = logprior[c]
        for w in testdoc.split():
            if w in vocab:
                sum_[c] += loglikelihood[(w,c)]
    # Return the class with the highest log score
    return max(sum_, key=sum_.get)

Let's try to predict the label for a test document.

In [0]:
print('Predicted class: %s' % test_naive_bayes('Chinese Chinese Tokyo Japan', logprior, loglikelihood, target_classes, vocab))

Predicted class: j
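
The prediction `j` depends on the binarized counting used in training. The sketch below recomputes the class scores for the same toy data with exact fractions, once with per-document de-duplication and once with raw counts. The function `class_scores` and its `binarize` flag are hypothetical names for this illustration; it works with probabilities instead of log probabilities, which gives the same ranking on such a small example.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Tokyo Japan"
classes = ["c", "j"]
vocab = {w for text, _ in train for w in text.split()}


def class_scores(binarize, alpha=1):
    """Return the exact unnormalized score P(c) * prod P(w|c) of test_doc per class."""
    counts = {c: Counter() for c in classes}
    ndocs = Counter()
    for text, c in train:
        words = text.split()
        # Binarized training counts each distinct word once per document.
        counts[c].update(set(words) if binarize else words)
        ndocs[c] += 1
    scores = {}
    for c in classes:
        total = sum(counts[c].values())
        score = Fraction(ndocs[c], len(train))
        for w in test_doc.split():
            score *= Fraction(counts[c][w] + alpha, total + alpha * len(vocab))
        scores[c] = score
    return scores


for binarize in (True, False):
    scores = class_scores(binarize)
    best = max(scores, key=scores.get)
    print('binarize=%s -> %s %s' % (binarize, best, {c: float(s) for c, s in scores.items()}))
```

On this example the two counting schemes disagree: the binarized counts favor `j`, while raw counts favor `c`, because the repeated occurrences of "Chinese" in class `c` are what make that class attractive for a document containing "Chinese" twice.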


Now, it is time to train our Naive Bayes model on the sentiment data.

In [0]:
target_classes = ['+1', '-1']    # we can construct a fixed set of classes from train_labels
logprior, loglikelihood, vocab = train_naive_bayes(train_texts, train_labels, target_classes)

In [0]:
test_naive_bayes("enigma is well-made , but it's just too dry and too placid .", logprior, loglikelihood, target_classes, vocab)

'-1'
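
The SLP book describes binary Naive Bayes as removing duplicate words from test documents as well as from training documents, whereas `test_naive_bayes` adds the log likelihood of a word once per occurrence. A possible variant that also de-duplicates at prediction time is sketched below. The function name `predict_binary_nb` and the hand-made probabilities are assumptions for illustration; using this variant would change the evaluation numbers in this notebook, which were not re-measured.

```python
import math


def predict_binary_nb(testdoc, logprior, loglikelihood, target_classes, vocab):
    """Score each class using every distinct in-vocabulary token exactly once."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    tokens = [w for w in dict.fromkeys(testdoc.split()) if w in vocab]
    scores = {
        c: logprior[c] + sum(loglikelihood[(w, c)] for w in tokens)
        for c in target_classes
    }
    return max(scores, key=scores.get)


# Hand-made toy model (assumption): the probabilities are not estimated from data.
toy_classes = ['pos', 'neg']
toy_vocab = {'good': 0, 'bad': 1}
toy_logprior = {'pos': math.log(0.5), 'neg': math.log(0.5)}
toy_loglikelihood = {
    ('good', 'pos'): math.log(0.6), ('bad', 'pos'): math.log(0.4),
    ('good', 'neg'): math.log(0.3), ('bad', 'neg'): math.log(0.7),
}
print(predict_binary_nb('bad bad good', toy_logprior, toy_loglikelihood,
                        toy_classes, toy_vocab))
```

In this toy model the repeated word "bad" would tip a per-occurrence scorer toward `neg`, while the de-duplicated scorer weighs "bad" and "good" once each.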

### Evaluation

We will calculate evaluation measures on the test data. You can implement evaluation measures by yourself, but in this notebook, we are going to use scikit-learn to do that.

Let's get predicted classes of test documents.

In [0]:
predicted_labels = [test_naive_bayes(s, logprior, loglikelihood, target_classes, vocab)
                    for s in test_texts]

In [0]:
from sklearn import metrics

print('Accuracy score: %f' % metrics.accuracy_score(test_labels, predicted_labels))

Accuracy score: 0.763244


We can calculate precision, recall, and F1 score per class.

In [0]:
for c in target_classes:
    print('Evaluation measures for class %s' % c)
    print('  Precision: %f' % metrics.precision_score(test_labels, predicted_labels, pos_label=c))
    print('  Recall: %f' % metrics.recall_score(test_labels, predicted_labels, pos_label=c))
    print('  F1: %f' % metrics.f1_score(test_labels, predicted_labels, pos_label=c))

Evaluation measures for class +1
  Precision: 0.769082
  Recall: 0.749529
  F1: 0.759180
Evaluation measures for class -1
  Precision: 0.757741
  Recall: 0.776844
  F1: 0.767174


We can also compute macro-averaged and micro-averaged f1 score.

In [0]:
print('Macro-averaged f1: %f' % metrics.f1_score(test_labels, predicted_labels, average='macro'))
print('Micro-averaged f1: %f' % metrics.f1_score(test_labels, predicted_labels, average='micro'))

Macro-averaged f1: 0.763177
Micro-averaged f1: 0.763244
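
The two averages differ in what they pool. Macro-averaging computes F1 per class and takes the unweighted mean, which matches the output above: (0.759180 + 0.767174) / 2 ≈ 0.763177. Micro-averaging pools true positives, false positives, and false negatives over all classes before computing F1, which in single-label classification equals the accuracy. The plain-Python sketch below illustrates both on made-up labels; the function names are hypothetical and this is not how scikit-learn implements them internally.

```python
def confusion_counts(y_true, y_pred, label):
    """Return (tp, fp, fn), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    # Unweighted mean of the per-class F1 scores.
    per_class = [f1_from_counts(*confusion_counts(y_true, y_pred, c)) for c in labels]
    return sum(per_class) / len(per_class)


def micro_f1(y_true, y_pred, labels):
    # Pool tp, fp, fn over all classes, then compute a single F1.
    totals = [sum(col) for col in zip(*(confusion_counts(y_true, y_pred, c) for c in labels))]
    return f1_from_counts(*totals)


# Made-up labels (assumption) with 3 of 5 predictions correct.
y_true = ['+1', '+1', '+1', '-1', '-1']
y_pred = ['+1', '+1', '-1', '-1', '+1']
print('Macro F1: %f' % macro_f1(y_true, y_pred, ['+1', '-1']))
print('Micro F1: %f' % micro_f1(y_true, y_pred, ['+1', '-1']))
```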


We can report all classification results at once.

In [0]:
print(metrics.classification_report(test_labels, predicted_labels))

              precision    recall  f1-score   support

          +1       0.77      0.75      0.76      1062
          -1       0.76      0.78      0.77      1071

    accuracy                           0.76      2133
   macro avg       0.76      0.76      0.76      2133
weighted avg       0.76      0.76      0.76      2133



## Programming Assignment 3

**Due date: March 24, 2020**

- Modify the implementation of the train/test functions for the Boolean Multinomial Naive Bayes model and evaluate it on the test set of the sentiment analysis data.
- (Optional, Bonus Points) Try to deal with negation (by adding NOT_ for the words after negation words).
- (Optional, Bonus Points) Try sentiment lexicon in feature extraction
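
As a starting point for the negation item, one common scheme prefixes `NOT_` to every token that follows a negation word, up to the next punctuation token. The sketch below is one possible reading of that idea, not a reference solution: the function name `mark_negation`, the list of negation words, and the set of clause-ending tokens are all assumptions that can be adjusted.

```python
# Assumed negation cues and clause boundaries; extend or change as needed.
NEGATION_WORDS = {'not', 'no', 'never'}
CLAUSE_END = {'.', ',', ';', ':', '!', '?'}


def mark_negation(text):
    """Prefix NOT_ to tokens between a negation word and the next punctuation token."""
    out = []
    negated = False
    for token in text.split():
        if token in CLAUSE_END:
            # Punctuation closes the negation scope and is kept unchanged.
            negated = False
            out.append(token)
        elif negated:
            out.append('NOT_' + token)
        else:
            out.append(token)
            if token in NEGATION_WORDS or token.endswith("n't"):
                negated = True
    return ' '.join(out)


print(mark_negation("i didn't like this movie , but i loved the cast ."))
```

The transformed text can be passed to the training and prediction functions in place of the original sentences, since they only rely on whitespace tokenization.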

What to submit:
- A link to your Google Colab notebook. Please grant me read permission (minhpham0902@gmail.com).
- Write your name, student id in your Google Colab.