# Naive Bayes and Sentiment Analysis

TEM501 - Text Mining

## Introduction

This document shows how to perform sentiment classification with the Naive Bayes algorithm, using [scikit-learn](http://scikit-learn.org/). After reading it, students should be able to perform the following tasks.

- Read sentiment data from a text file
- Split the data into training and test sets
- Feature extraction: convert a sentence into a feature vector
- Train a Naive Bayes model on the training data
- Evaluate the model on the test set
- Perform k-fold cross validation



## Data

We use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) of Bo Pang and Lillian Lee. The data file is available at [./data/sentiment.txt](./data/sentiment.txt).

Each line in the file is one review that has already been tokenized into words. Each review has a label (+1 for a positive review and -1 for a negative review).


## Loading data

We will load the data into a list of tuples $(d, c)$ in which $d$ denotes a document and $c$ denotes the label of the document. We define the function `load_data` as follows.

In [1]:
import re


def load_data(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if line == "":
                continue
            # Expect a label (+1 or -1), whitespace, then the review text.
            match = re.search(r"(\+1|-1)\s+(.+)$", line)
            if match:
                lb = match.group(1)
                sentence = match.group(2)  # never empty: ".+" needs at least one character
                data.append((sentence, lb))
    return data
            

We will use the above function to load sentiment data.

In [2]:
DATA_PATH = "./data/sentiment.txt"
data = load_data(DATA_PATH)

print("# Loaded {} examples".format(len(data)))

# Loaded 10662 examples


We would like to know how many positive and negative reviews are in the data.

In [3]:
from collections import Counter

docs, labels = zip(*data)
counter = Counter(labels)
for c in sorted(counter.keys()):
    print("Label {}: {}".format(c, counter[c]))

Label +1: 5331
Label -1: 5331


Our dataset is balanced. There are 5331 positive reviews and 5331 negative reviews.

## Split data into training/test data

Now we would like to randomly split the data into training and test sets. We will use the function `train_test_split` from the module `sklearn.model_selection`. Please refer to [http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for more details about the function. We use 80% of the data for training and 20% for testing.

In [4]:
from sklearn.model_selection import train_test_split

data = load_data(DATA_PATH)
docs, labels = zip(*data)

train_docs, test_docs, train_labels, test_labels = train_test_split(docs, labels,
                                                                   test_size=0.2,
                                                                   random_state=1337)
print("Training reviews: {}".format(len(train_docs)))
print("Test reviews: {}".format(len(test_docs)))

Training reviews: 8529
Test reviews: 2133
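
The split above is purely random, so the class ratio inside each part can drift slightly away from 50/50 (the confusion matrix later in this document implies 1040 positive and 1093 negative test reviews). As a hedged variant, `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in both parts. The toy documents and labels below are made up for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 positive and 10 negative "reviews".
toy_docs = (["good film {}".format(i) for i in range(10)]
            + ["bad film {}".format(i) for i in range(10)])
toy_labels = ["+1"] * 10 + ["-1"] * 10

# stratify=toy_labels keeps the +1/-1 ratio identical in both parts.
tr_docs, te_docs, tr_labels, te_labels = train_test_split(
    toy_docs, toy_labels, test_size=0.2, random_state=1337, stratify=toy_labels
)

train_counts = Counter(tr_labels)
test_counts = Counter(te_labels)
print("Train label counts:", dict(train_counts))
print("Test label counts:", dict(test_counts))
```

Whether stratification matters here is a judgment call: with a balanced dataset of this size the random drift is small, but a stratified split makes per-class scores easier to compare.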


## Feature extraction

Now we convert a sentence into a feature vector. We simply use the bag-of-words (BoW) of each review. In this section we implement the feature function ourselves; in the next section we use scikit-learn to convert the features into a matrix.

We define the feature function as follows. The input of the function is a sentence and the output is a feature vector. We remove stopwords and punctuation from the review.

In [5]:
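import re
from nltk.corpus import stopwords

# Requires the NLTK stopword list; run nltk.download("stopwords") once if it is missing.
eng_stop_words = set(stopwords.words('english'))


def is_punct(word):
    """Return True if the token consists only of punctuation characters."""
    return re.search(r"^[!\",.:;%&]*$", word) is not None


def feature_vec(doc):
    """Binary bag-of-words: map each kept word to 1.0."""
    vec = dict()
    for word in doc.split():
        word = word.lower()  # lowercase first so the stopword check is case-insensitive
        if word in eng_stop_words or is_punct(word):
            continue
        vec[word] = 1.0
    return vec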
    

Now we apply the feature function to a sentence.

In [6]:
vec = feature_vec("a thoughtful , provocative , insistently humanizing film .")
print(vec)

{'thoughtful': 1.0, 'provocative': 1.0, 'insistently': 1.0, 'humanizing': 1.0, 'film': 1.0}
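
The feature function above records only whether a word is present. For comparison, a count-based bag-of-words (the kind of representation `MultinomialNB` typically works with) can be sketched with `collections.Counter`. This is a minimal illustration, not the tutorial's feature function: the tiny stopword set and the helper names `count_vec` and `binary_vec` are assumptions made so the example runs without NLTK.

```python
from collections import Counter

# Hypothetical mini stopword list; stands in for the NLTK English list.
TOY_STOP_WORDS = {"a", "the", "is", "and"}
PUNCT = set("!\",.:;%&")


def count_vec(doc):
    """Map each kept token to the number of times it occurs in doc."""
    tokens = [w.lower() for w in doc.split()]
    kept = [w for w in tokens
            if w not in TOY_STOP_WORDS and not all(ch in PUNCT for ch in w)]
    return dict(Counter(kept))


def binary_vec(doc):
    """Presence-only version: every kept token maps to 1.0."""
    return {w: 1.0 for w in count_vec(doc)}


sentence = "a good , good film and a good cast ."
counts = count_vec(sentence)
binary = binary_vec(sentence)
print(counts)
print(binary)
```

The binary version discards the fact that "good" occurs three times, which is exactly the information a Bernoulli model does not use.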


## Converting data into matrix using DictVectorizer

`DictVectorizer` transforms lists of feature-value mappings to vectors. See [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) for more details.

We need to transform data (feature mapping) into numeric vectors so that machine learning algorithms can use them as input.

In [7]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
X_train = [ feature_vec(d) for d in train_docs ]
X_train = vectorizer.fit_transform(X_train)

Let us look at the matrix `X_train`. Scikit-learn stores the data in a sparse matrix data structure.

In [8]:
X_train

<8529x18887 sparse matrix of type '<class 'numpy.float64'>'
	with 91679 stored elements in Compressed Sparse Row format>
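
To see why a sparse format is the natural choice, the numbers in the summary above can be turned into a density figure. The short calculation below copies the three values from that output; on a live matrix, `X_train.shape` and `X_train.nnz` should give the same numbers.

```python
# Numbers copied from the X_train summary above.
n_rows, n_cols, n_stored = 8529, 18887, 91679

density = n_stored / (n_rows * n_cols)   # fraction of cells that are non-zero
avg_per_review = n_stored / n_rows       # average number of distinct features per review

print("Density: {:.6f}".format(density))
print("Average features per review: {:.2f}".format(avg_per_review))
```

Each review activates only about a dozen of the 18887 columns, so a dense array would consist almost entirely of zeros.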

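We now look at the vector representation of a sentence. Note that only four entries are stored although the feature dictionary of this sentence has five keys: `DictVectorizer` ignores features that it did not see during fitting, so presumably one of the words does not occur in the training data.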

In [9]:
print(vectorizer.transform(feature_vec("a thoughtful , provocative , insistently humanizing film .")))

  (0, 6532)	1.0
  (0, 8314)	1.0
  (0, 13102)	1.0
  (0, 16883)	1.0
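
The column indices in the output come from the vocabulary that `DictVectorizer` builds during fitting. The following sketch, on made-up feature dictionaries, shows how feature names are mapped to columns and what happens to a feature that was not seen when the vectorizer was fitted. The names `toy_train` and `toy_vectorizer` are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries (hypothetical), in the same format feature_vec produces.
toy_train = [
    {"good": 1.0, "film": 1.0},
    {"bad": 1.0, "film": 1.0},
]

toy_vectorizer = DictVectorizer()
toy_vectorizer.fit(toy_train)

# The vocabulary maps each feature name to a column index (sorted by name).
print(toy_vectorizer.vocabulary_)

# "brilliant" was never seen during fit, so it gets no column and is dropped.
row = toy_vectorizer.transform([{"good": 1.0, "brilliant": 1.0}])
print(row.toarray())
```

This is why the vectorizer must be fitted on the training data only and then reused unchanged for the test data.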


## Training model

We will now train a Naive Bayes model on the training data. Since our features are binary (a word is either present or absent), we can use the `BernoulliNB` class.

In [10]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
print(clf)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)


Now we train the model on the training data.

In [11]:
clf.fit(X_train, train_labels)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

We can use the model to predict the label for an example.

In [12]:
example = "a thoughtful , provocative , insistently humanizing film ."
test_x = vectorizer.transform(feature_vec(example))
print("Predicted class: {}".format(clf.predict(test_x)))

Predicted class: ['+1']
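
To make the prediction step less of a black box, here is a minimal sketch of the Bernoulli Naive Bayes decision rule computed by hand on a toy matrix, checked against `BernoulliNB`. The toy data and all variable names are assumptions for illustration; the smoothing formula follows the scikit-learn documentation for `alpha`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary feature matrix (hypothetical): columns = ["bad", "film", "good"].
X_toy = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
], dtype=float)
y_toy = np.array(["+1", "+1", "-1", "-1"])

alpha = 1.0
classes = np.unique(y_toy)  # sorted: ['+1', '-1']

log_prior, log_p, log_not_p = [], [], []
for c in classes:
    X_c = X_toy[y_toy == c]
    # Smoothed estimate of P(x_i = 1 | c); the factor 2 covers the two outcomes 0/1.
    p = (X_c.sum(axis=0) + alpha) / (X_c.shape[0] + 2 * alpha)
    log_prior.append(np.log(X_c.shape[0] / X_toy.shape[0]))
    log_p.append(np.log(p))
    log_not_p.append(np.log(1.0 - p))
log_prior, log_p, log_not_p = map(np.array, (log_prior, log_p, log_not_p))

x_new = np.array([0, 1, 1], dtype=float)  # contains "film" and "good"
# Joint log-likelihood: present features use log p, absent ones use log(1 - p).
scores = log_prior + log_p @ x_new + log_not_p @ (1.0 - x_new)
manual_pred = classes[np.argmax(scores)]

sk_clf = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)
sk_pred = sk_clf.predict(x_new.reshape(1, -1))[0]
print("Manual:", manual_pred, "| scikit-learn:", sk_pred)
```

Unlike the multinomial variant, the Bernoulli model also scores the words that are absent from the review, through the `log(1 - p)` term.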


## Evaluation on test set

We now use the test data to evaluate the trained model. In the first step, we need to transform the test data into a matrix.

In [13]:
X_test = [ feature_vec(d) for d in test_docs ]
X_test = vectorizer.transform(X_test)

We predict labels for the reviews in the test data by calling the `predict` method.

In [14]:
test_preds = clf.predict(X_test)

After that, we calculate the accuracy, precision, recall, and F1 score using the `sklearn.metrics` module.

In [15]:
from sklearn import metrics

accuracy = metrics.accuracy_score(test_labels, test_preds)
print("# Test accuracy: {}".format(accuracy))

# Test accuracy: 0.7543366150961087


Now we would like to know the precision, recall, and F1 score for each class.

In [16]:
import numpy as np

for label in ["+1", "-1"]:
    p = metrics.precision_score(test_labels, test_preds, average="binary", pos_label=label)
    r = metrics.recall_score(test_labels, test_preds, average="binary", pos_label=label)
    f1 = metrics.f1_score(test_labels, test_preds, average="binary", pos_label=label)
    print("Label: {}".format(label))
    print(" Precision: {}".format(p))
    print(" Recall: {}".format(r))
    print(" F1: {}".format(f1))

Label: +1
 Precision: 0.7402234636871509
 Recall: 0.7644230769230769
 F1: 0.7521286660359507
Label: -1
 Precision: 0.7686496694995278
 Recall: 0.7447392497712717
 F1: 0.7565055762081785
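
As an alternative to the loop above, `metrics.classification_report` prints the per-class precision, recall, and F1 score in one table. The sketch below uses made-up gold labels and predictions so that it runs on its own; with the tutorial's variables one would pass `test_labels` and `test_preds` instead.

```python
from sklearn import metrics

# Toy gold labels and predictions (hypothetical), using the same label strings.
toy_true = ["+1", "+1", "+1", "-1", "-1", "-1"]
toy_pred = ["+1", "+1", "-1", "-1", "-1", "+1"]

report = metrics.classification_report(
    toy_true, toy_pred, labels=["+1", "-1"], digits=4
)
print(report)
```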


We may want to look at the confusion matrix.

In [17]:
metrics.confusion_matrix(test_labels, test_preds, labels=["+1", "-1"])

array([[795, 245],
       [279, 814]])
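
The scores reported earlier can be recomputed directly from this matrix, which is a useful sanity check on how to read it. In scikit-learn's convention the rows are the true labels and the columns are the predicted labels, here in the order `["+1", "-1"]`. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[795, 245],
               [279, 814]])

tp, fn = cm[0]   # true +1 reviews predicted as +1 / as -1
fp, tn = cm[1]   # true -1 reviews predicted as +1 / as -1

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print("Accuracy: {:.4f}".format(accuracy))
print("Precision (+1): {:.4f}".format(precision_pos))
print("Recall (+1): {:.4f}".format(recall_pos))
print("F1 (+1): {:.4f}".format(f1_pos))
```

The values agree with the accuracy and the `+1` scores printed in the previous cells.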

## k-fold cross validation

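Next we perform k-fold cross validation. We use only the training data for this, because in general the test data should remain unseen until the final evaluation. Note that the vectorizer is fitted again inside every fold, so the held-out part never influences the vocabulary.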

In [18]:
from sklearn.model_selection import StratifiedKFold


train_docs, train_labels = np.asarray(train_docs), np.asarray(train_labels)

n_splits = 5
scores = []
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1337)
for i, (train_index, test_index) in enumerate(skf.split(train_docs, train_labels), start=1):
    
    clf = BernoulliNB()
    vectorizer = DictVectorizer()
    
    x_train, x_test = train_docs[train_index], train_docs[test_index]
    y_train, y_test = train_labels[train_index], train_labels[test_index]
    
    x_train = [ feature_vec(d) for d in x_train ]
    x_train = vectorizer.fit_transform(x_train)
    
    x_test = [ feature_vec(d) for d in x_test ]
    x_test = vectorizer.transform(x_test)
    
    clf.fit(x_train, y_train)
    y_preds = clf.predict(x_test)
    accuracy = metrics.accuracy_score(y_test, y_preds)
    scores.append(accuracy)

print("{}-fold accuracy scores: {}".format(n_splits, scores))
print("Average score: {}".format(np.mean(scores)))   
    

5-fold accuracy scores: [0.7644991212653779, 0.7819460726846424, 0.7608440797186401, 0.7589442815249267, 0.7659824046920821]
Average score: 0.7664431919771338
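
The explicit loop above can also be written more compactly by chaining the vectorizer and the classifier in a pipeline and handing it to `cross_val_score`, which refits the whole pipeline in every fold. This is a sketch under assumptions: the toy feature dictionaries stand in for `[feature_vec(d) for d in train_docs]`, and three folds are used only because the toy set is tiny.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy feature dictionaries (hypothetical): 6 positive and 6 negative reviews.
toy_X = [{"good": 1.0, "film": 1.0}] * 6 + [{"bad": 1.0, "film": 1.0}] * 6
toy_y = np.array(["+1"] * 6 + ["-1"] * 6)

# The pipeline fits DictVectorizer and BernoulliNB together on each training fold.
pipe = make_pipeline(DictVectorizer(), BernoulliNB())
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1337)

cv_scores = cross_val_score(pipe, toy_X, toy_y, cv=skf, scoring="accuracy")
print("Fold accuracies:", cv_scores)
print("Average score:", np.mean(cv_scores))
```

Because the feature function itself has no fitted state, computing the dictionaries once before cross validation does not leak information between folds; only the vectorizer and the classifier need to be refitted.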


## Exercises

1. Try different `alpha` values in `BernoulliNB` ([http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)).
2. Add bigram features and run the experiments again.
3. Try [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) instead of `BernoulliNB` and compare the performance of the two systems.
4. Use the feature extraction module of scikit-learn for the feature extraction phase. You may want to look at the tutorial [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) to see how to use the feature extraction modules in scikit-learn.