# Sentiment Polarity Prediction with Naive Bayes

This notebook contains a basic implementation of document-level sentiment analysis
for movie reviews with multinomial Naive Bayes and bag-of-words features
(optionally extended with bags of bigrams and trigrams), evaluated with cross-validation.
* No special treatment of rare or unknown words. Unknown words in the test data are skipped.
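
To make the model concrete before the sklearn-based implementation further down, here is a minimal pure-Python sketch of multinomial Naive Bayes over bag-of-words counts. It is not the code used in this notebook: the function names `train_nb` and `predict_nb`, the toy reviews and the use of add-one smoothing are assumptions of this sketch. As in the notebook, tokens not seen in training are skipped at prediction time.

```python
import math
from collections import Counter

def train_nb(labelled_docs):
    """Estimate log priors and add-one smoothed log likelihoods.

    labelled_docs: list of (tokens, label) pairs.
    """
    doc_counts = Counter()
    token_counts = {}
    vocab = set()
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    n_docs = sum(doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c, counts in token_counts.items():
        # add-one smoothing: every vocabulary entry gets one extra count
        total = sum(counts.values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((counts[t] + 1) / total) for t in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # tokens outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in vocab
        )
    return max(scores, key=scores.get)

# hypothetical toy data, not taken from the data set
toy_data = [
    ('a great and moving film'.split(), 'pos'),
    ('great cast , great script'.split(), 'pos'),
    ('a dull and boring film'.split(), 'neg'),
    ('boring script , dull cast'.split(), 'neg'),
]
toy_model = train_nb(toy_data)
print(predict_nb(toy_model, 'great film , unseenword'.split()))  # pos
print(predict_nb(toy_model, 'dull and boring'.split()))          # neg
```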

We use the movie review polarity data set of Pang and Lee (2004), [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](https://www.aclweb.org/anthology/P04-1035/), in version 2.0, available from http://www.cs.cornell.edu/People/pabo/movie-review-data (section "Sentiment polarity datasets"). This data set contains 1000 positive and 1000 negative reviews, each tokenised, sentence-split (one sentence per line) and lowercased. Each review has been assigned to one of 10 cross-validation folds by the authors, and this setup should be followed to compare with published results.
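
The loaders further down read the fold of a review from the first digit after the `cv` prefix of its file name. A small sketch of that rule follows; the helper name `fold_from_filename` and the example file names are hypothetical and only follow the `cvNNN_MMMMM.txt` pattern that the loaders check for.

```python
def fold_from_filename(filename):
    """Return the cross-validation fold (0 to 9) encoded in a file name."""
    if not (filename.startswith('cv') and filename.endswith('.txt')):
        raise ValueError('Not a review file: %r' % filename)
    # the character right after 'cv' is the fold number
    return int(filename[2])

# hypothetical file names following the expected pattern
print(fold_from_filename('cv123_45678.txt'))  # 1
print(fold_from_filename('cv907_00001.txt'))  # 9
```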


In [1]:
import os
import tarfile
import time
import urllib.request
import numpy
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# choose between 'local-tgz', 'local-folder' and 'web',
# see description under each heading below

data_source = 'local-folder'

# adjust paths as needed

data_folder = os.path.join('data', 'txt_sentoken')
data_tgz    = os.path.join('data', 'pang-and-lee-2004', 'review_polarity.tar.gz')

data_url = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz'

In [2]:
class PL04DataLoader_Part_1:
    
    def __init__(self):
        pass
    
    def get_labelled_dataset(self, fold = 0):
        ''' Compile a fold of the data set
        '''
        dataset = []
        for label in ('pos', 'neg'):
            for document in self.get_documents(
                fold = fold,
                label = label,
            ):
                dataset.append((document, label))
        return dataset
    
    def get_documents(self, fold = 0, label = 'pos'):
        ''' Enumerate the tokenised contents of the data set files
            for one fold and label.
            Args:
                fold: which fold to load (0 to n_folds-1)
                label: 'pos' or 'neg' to
                    select data with positive or negative sentiment
                    polarity
            Return:
                Iterable of tokenised documents, each a list of
                sentences that in turn are lists of tokens
        '''
        raise NotImplementedError

In [3]:
class PL04DataLoader(PL04DataLoader_Part_1):
    
    def get_xval_splits(self):
        ''' Split data with labels for cross-validation.
            Returns a list of k pairs (training_data, test_data)
            for k-fold cross-validation (here k = 10).
        '''
        # load the folds
        folds = []
        for i in range(10):
            folds.append(self.get_labelled_dataset(
                fold = i
            ))
        # create training-test splits
        retval = []
        for i in range(10):
            test_data = folds[i]
            training_data = []
            for j in range(9):
                ij1 = (i+j+1) % 10
                assert ij1 != i
                training_data = training_data + folds[ij1]
            retval.append((training_data, test_data))
        return retval
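
The splitting logic above can be checked on a toy example. The sketch below is a simplified variant, not the notebook's method: the function name `make_xval_splits` and the three toy folds are made up for illustration, and the training folds are concatenated in their original order rather than starting with the fold after the test fold.

```python
def make_xval_splits(folds):
    """Pair each fold as test data with all other folds as training data."""
    splits = []
    for i, test_data in enumerate(folds):
        training_data = []
        for j, fold in enumerate(folds):
            if j != i:
                training_data.extend(fold)
        splits.append((training_data, test_data))
    return splits

# three hypothetical folds with two items each
toy_folds = [['a0', 'a1'], ['b0', 'b1'], ['c0', 'c1']]
for training_data, test_data in make_xval_splits(toy_folds):
    print(test_data, training_data)
```

Every item appears in exactly one test set, and no test item is ever part of its own training set.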

In [4]:
class PL04DataLoaderFromStream(PL04DataLoader):
        
    def __init__(self, tgz_stream, **kwargs):
        super().__init__(**kwargs)
        self.data = {}
        counter = 0
        with tarfile.open(
            mode = 'r|gz',
            fileobj = tgz_stream
        ) as tar_archive:
            for tar_member in tar_archive:
                if counter == 2000:
                    break
                path_components = tar_member.name.split('/')
                filename = path_components[-1]
                if filename.startswith('cv') \
                and filename.endswith('.txt') \
                and '_' in filename:
                    label = path_components[-2]
                    fold = int(filename[2])
                    key = (fold, label)
                    if key not in self.data:
                        self.data[key] = []
                    f = tar_archive.extractfile(tar_member)
                    document = [
                        line.decode('utf-8').split()
                        for line in f.readlines()
                    ]
                    self.data[key].append(document)
                    counter += 1
            
    def get_documents(self, fold = 0, label = 'pos'):
        return self.data[(fold, label)]
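
The stream mode `'r|gz'` used above reads the archive strictly front to back, which is why each member is extracted while iterating. The following sketch tries this on a tiny in-memory archive; the archive contents and member names are invented for illustration and only mimic the layout that the loader expects.

```python
import io
import tarfile

def build_toy_archive():
    """Create a small gzip-compressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(mode='w:gz', fileobj=buffer) as archive:
        for name, text in (
            ('txt_sentoken/pos/cv000_00001.txt', 'a fine film .\nloved it .\n'),
            ('txt_sentoken/neg/cv100_00002.txt', 'a dull film .\n'),
        ):
            payload = text.encode('utf-8')
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer

documents = {}
with tarfile.open(mode='r|gz', fileobj=build_toy_archive()) as archive:
    for member in archive:
        parts = member.name.split('/')
        label, filename = parts[-2], parts[-1]
        fold = int(filename[2])
        # in stream mode the member must be read before moving on
        text = archive.extractfile(member).read().decode('utf-8')
        documents[(fold, label)] = [line.split() for line in text.splitlines()]

print(documents[(0, 'pos')])  # [['a', 'fine', 'film', '.'], ['loved', 'it', '.']]
```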

## Read Data from the Web
This should run efficiently both on Google Colab and locally but has the disadvantage that the same data is downloaded each time the notebook is run.

In [5]:
class PL04DataLoaderFromURL(PL04DataLoaderFromStream):
    
    def __init__(self, data_url, **kwargs):
        with urllib.request.urlopen(data_url) as tgz_stream:
            super().__init__(tgz_stream, **kwargs)
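
One possible way to avoid the repeated download is to keep a local copy after the first run. This is only a sketch under assumptions: the helper `fetch_once` and the cache path are hypothetical and not part of this notebook.

```python
import os
import shutil
import urllib.request

def fetch_once(url, cache_path):
    """Download url to cache_path unless the file already exists."""
    if not os.path.exists(cache_path):
        cache_dir = os.path.dirname(cache_path)
        if cache_dir:
            os.makedirs(cache_dir, exist_ok=True)
        # write to a temporary name first so that an interrupted
        # download does not leave a truncated cache file behind
        tmp_path = cache_path + '.part'
        with urllib.request.urlopen(url) as response, \
             open(tmp_path, 'wb') as out:
            shutil.copyfileobj(response, out)
        os.replace(tmp_path, cache_path)
    return cache_path
```

The returned path could then be handed to `PL04DataLoaderFromTGZ`, defined in the next section, instead of streaming from the URL on every run.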

## Read Data from a Local .tgz File

Download the .tgz manually once to a filesystem that can be accessed from the notebook, e.g. Google Drive on Colab; the notebook then reads the archive from this file in a single pass.

Note that if you are accessing files from Google Drive on Colab, you will need to mount your drive and enter an authentication token:

```
from google.colab import drive
drive.mount('/content/drive')
```

You will also have to change your *data_tgz* or *data_folder* paths above so that they start with *'/content/drive/My Drive/'*.

In [6]:
class PL04DataLoaderFromTGZ(PL04DataLoaderFromStream):
    
    def __init__(self, data_path, **kwargs):
        with open(data_path, 'rb') as tgz_stream:
            super().__init__(tgz_stream, **kwargs)

## Read Data from a Local Folder

Extract the .tgz to a local folder and only load the required files. This is usually the fastest option when storage is on a local SSD. On remote filesystems, however, this can be very slow.

In [7]:
class PL04DataLoaderFromFolder(PL04DataLoader):
    
    def __init__(self, data_dir, **kwargs):
        self.data_dir = data_dir
        super().__init__(**kwargs)
        
    def get_documents(self, fold = 0, label = 'pos'):
        # read folder contents
        path = os.path.join(self.data_dir, label)
        dir_entries = os.listdir(path)
        # must process entries in numeric order to
        # replicate order of original experiments
        dir_entries.sort()
        # check each entry and add to data if matching
        # selection criteria
        for filename in dir_entries:
            if filename.startswith('cv') \
            and filename.endswith('.txt'):
                if fold == int(filename[2]):
                    # correct fold
                    with open(os.path.join(path, filename), 'rt') as f:
                        document = [line.split() for line in f]
                    # "yield" turns this function into a generator:
                    # documents are produced one at a time, on
                    # demand, without creating a full list of all
                    # documents
                    yield document
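
The lazy behaviour of a generator can be seen in a toy example. The function `read_documents` and the two toy texts below are illustrative stand-ins for reading files from disk.

```python
def read_documents(texts):
    """Yield one tokenised document at a time."""
    for text in texts:
        print('reading one document')
        yield [line.split() for line in text.splitlines()]

toy_texts = ['a fine film .\nloved it .', 'a dull film .']
documents = read_documents(toy_texts)  # nothing has been read yet
first = next(documents)                # reads only the first document
rest = list(documents)                 # reads the remaining documents
print(first)
print(len(rest))
```

Note that a generator can be consumed only once; to iterate again, the function has to be called again.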

In [8]:
if data_source == 'local-folder':
    data_loader = PL04DataLoaderFromFolder(data_folder)
elif data_source == 'local-tgz':
    data_loader = PL04DataLoaderFromTGZ(data_tgz)
elif data_source == 'web':
    data_loader = PL04DataLoaderFromURL(data_url)
else:
    raise ValueError('Unsupported data source %r' %data_source)

In [9]:
def get_document_preview(document, max_length = 72):
    s = []
    count = 0
    reached_limit = False
    for sentence in document:
        for token in sentence:
            if count + len(token) + len(s) > max_length:
                reached_limit = True
                break
            s.append(token)
            count += len(token)
        if reached_limit:
            break
    return '|'.join(s)
    
for label in 'pos neg'.split():
    print(f'== {label} ==')
    print('doc sentences start of first sentence')
    for index, document in enumerate(data_loader.get_documents(
        label = label
    )):
        print('%3d %7d   %s' %(
            index, len(document), get_document_preview(document)
        ))
        if index == 4:
            break

== pos ==
doc sentences start of first sentence
  0      25   films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
  1      39   every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
  2      19   you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
  3      42   "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
  4      25   moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
== neg ==
doc sentences start of first sentence
  0      35   plot|:|two|teen|couples|go|to|a|church|party|,|drink|and|then|drive|.
  1      13   the|happy|bastard's|quick|movie|review|damn|that|y2k|bug|.|it's|got|a
  2      23   it|is|movies|like|these|that|make|a|jaded|movie|viewer|thankful|for|the
  3      19   "|quest|for|camelot|"|is|warner|bros|.|'|first|feature-length|,
  4      37   synopsis|:|a|mentally|unstable|man|undergoing|psychotherapy|saves|a|boy


## Create Training-Test Splits for Cross-Validation

In [10]:
splits = data_loader.get_xval_splits()

print('tr-size te-size (number of documents)')
for xval_tr_data, xval_te_data in splits:
    print('%7d %7d' %(len(xval_tr_data), len(xval_te_data)))

tr-size te-size (number of documents)
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200


## Interface for Sentiment Polarity Predictor
Let's define a base class to clarify how we plan to use polarity predictors. Its methods will have to be implemented in sub-classes.

In [11]:
class PolarityPredictorInterface:

    def train(self, data_with_labels):
        raise NotImplementedError
        
    def predict(self, data):
        raise NotImplementedError
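
To illustrate how a sub-class fills in this interface, here is a sketch of a trivial baseline that always predicts the most frequent training label. The class `MajorityClassPredictor` and the toy data are hypothetical and not used elsewhere in this notebook; the interface is repeated so that the example runs on its own.

```python
from collections import Counter

class PolarityPredictorInterface:

    def train(self, data_with_labels):
        raise NotImplementedError

    def predict(self, data):
        raise NotImplementedError

class MajorityClassPredictor(PolarityPredictorInterface):
    """Hypothetical baseline: always predict the most frequent label."""

    def train(self, data_with_labels):
        counts = Counter(label for _, label in data_with_labels)
        self.majority_label = counts.most_common(1)[0][0]

    def predict(self, data):
        return [self.majority_label for _ in data]

# toy (document, label) pairs; each document has a single sentence
toy_training = [
    ([['fine']], 'pos'), ([['dull']], 'neg'), ([['great']], 'pos'),
]
baseline = MajorityClassPredictor()
baseline.train(toy_training)
print(baseline.predict([([['unseen']], 'neg'), ([['film']], 'pos')]))  # ['pos', 'pos']
```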

In [12]:
class PolarityPredictorInit(PolarityPredictorInterface):
    
    def train(self, data_with_labels, features, learning_model):
        """
        Trains the model: builds the feature sets from the training data,
        extracts the feature matrix and the targets, and passes both to
        train_model_on_features().
        """
        
        # Initialise vocab set object
        self.reset_feature_sets()
        
        # Populate with the data
        self.add_to_feature_sets_from_data(data_with_labels, features)
        
        self.finalise_vocab(features)
        tr_features = self.extract_features(
            data_with_labels, features
        )
        tr_targets = self.get_targets(data_with_labels)
        self.train_model_on_features(tr_features, tr_targets, learning_model)

    def reset_feature_sets(self):
        """
        Initialises empty sets to hold the unigrams (vocab), bigrams
        and trigrams of the data.
        """
        self.vocab = set()
        self.bigrams = set()
        self.trigrams = set()
        
    def add_to_feature_sets_from_data(self, data, features):
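        """
        Collects the unigrams ('bow'), bigrams ('bob') and trigrams ('bot')
        of the data in the respective sets, depending on the requested
        features. N-grams do not cross sentence boundaries.
        """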
        if 'bow' in features:
            for document, label in data:
                for sentence in document:
                    for token in sentence:
                        self.vocab.add(token)
                        
        if 'bob' in features:
            for document, label in data:
                for sentence in document:
                    for index, token in enumerate(sentence):
                        if index != 0:
                            bigram = sentence[index-1] + ' ' + token
                            self.bigrams.add(bigram)
                        
        if 'bot' in features:
            for document, label in data:
                for sentence in document:
                    for index, token in enumerate(sentence):
                        if index not in [0, 1]:
                            trigram = sentence[index-2] + ' ' + sentence[index-1] + ' ' + token
                            self.trigrams.add(trigram)

    def finalise_vocab(self, features):
        """
        Converts each feature set to a list and creates a reverse map
        (feature -> column index) for fast lookup.
        """
        self.vocab = list(self.vocab)
        # create reverse map for fast token lookup
        self.vocab2index = {}
        for index, token in enumerate(self.vocab):
            self.vocab2index[token] = index
            
        self.bigrams = list(self.bigrams)
        # create reverse map for fast token lookup
        self.bigram2index = {}
        for index, token in enumerate(self.bigrams):
            self.bigram2index[token] = index
            
        self.trigrams = list(self.trigrams)
        # create reverse map for fast token lookup
        self.trigram2index = {}
        for index, token in enumerate(self.trigrams):
            self.trigram2index[token] = index
        
        
    def extract_features(self, data, features):
        raise NotImplementedError
    
    def get_targets(self, data):
        raise NotImplementedError
        
    def train_model_on_features(self, tr_features, tr_targets, learning_model):
        raise NotImplementedError
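
The bigram and trigram loops above can also be written with list slicing. The sketch below uses a hypothetical helper `sentence_ngrams` that joins tokens with a single space, as the notebook does, and never crosses a sentence boundary.

```python
def sentence_ngrams(sentence, n):
    """Return all n-grams of a tokenised sentence as space-joined strings."""
    return [
        ' '.join(sentence[i:i + n])
        for i in range(len(sentence) - n + 1)
    ]

toy_sentence = ['a', 'fine', 'film', '.']
print(sentence_ngrams(toy_sentence, 2))  # ['a fine', 'fine film', 'film .']
print(sentence_ngrams(toy_sentence, 3))  # ['a fine film', 'fine film .']
print(sentence_ngrams(['ok'], 2))        # []
```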

In [13]:
class PolarityPredictorExtractFeatures(PolarityPredictorInit):
    
    def __init__(self, clip_counts = True):
        self.clip_counts = clip_counts
        
    def extract_features(self, data, feature_list):
        """
        Creates a feature matrix from the data with one block of columns
        for each feature type in feature_list ('bow', 'bob', 'bot').
        """
        master_matrix = None
        for feature in feature_list:
            if feature == 'bow':
            # create numpy array of required size
                columns = len(self.vocab)
            elif feature == 'bob':
                columns = len(self.bigrams)
            elif feature == 'bot':
                columns = len(self.trigrams)
            rows = len(data)
            print("BoW Columns: ", columns)
            print("BoW Rows: ", rows)

            # Initialise a feature matrix with zeros
            features = numpy.zeros((rows, columns), dtype=numpy.int32)

            # populate feature matrix
            for row, item in enumerate(data):
                document, _ = item
                for sentence in document:
                    for position, token in enumerate(sentence):
                        try:
                            if feature == 'bow':
                                index = self.vocab2index[token]
                            elif feature == 'bob':
                                if position < 1:
                                    # no bigram ends at the first token
                                    continue
                                index = self.bigram2index[sentence[position-1] + ' ' + token]
                            elif feature == 'bot':
                                if position < 2:
                                    # no trigram ends at the first two tokens
                                    continue
                                index = self.trigram2index[sentence[position-2] + ' ' + sentence[position-1] + ' ' + token]
                        except KeyError:
                            # n-gram not seen in the training data
                            # --> skip it
                            # --> continue with next token
                            continue
                        if self.clip_counts:
                            features[row, index] = 1
                        else:
                            features[row, index] += 1
            new_matrix = features
            if master_matrix is None:
                master_matrix = new_matrix
            else:
                master_matrix = numpy.append(master_matrix,new_matrix, 1)
                
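        # min-max scale each column to the range [0, 1];
        # constant columns are mapped to 0
        # note: the statistics come from the matrix passed in, so test
        # data is scaled independently of the training data
        col_min = master_matrix.min(axis=0)
        col_range = master_matrix.max(axis=0) - col_min
        col_range[col_range == 0] = 1
        master_matrix = (master_matrix - col_min) / col_range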
        
        return master_matrix        
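
The effect of `clip_counts` and of the final min-max scaling is easier to see on a toy example. The sketch below is a pure-Python approximation with nested lists instead of numpy arrays; the helper names, the three-word vocabulary and the two toy documents are assumptions made for illustration.

```python
def count_features(documents, vocab2index, clip_counts=True):
    """Build a documents x vocabulary matrix of (clipped) token counts."""
    matrix = [[0] * len(vocab2index) for _ in documents]
    for row, document in enumerate(documents):
        for sentence in document:
            for token in sentence:
                column = vocab2index.get(token)
                if column is None:
                    continue  # unknown token: skip
                if clip_counts:
                    matrix[row][column] = 1
                else:
                    matrix[row][column] += 1
    return matrix

def min_max_scale(matrix):
    """Scale each column to [0, 1]; constant columns become 0."""
    columns = list(zip(*matrix))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1 for column in columns]
    return [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in matrix
    ]

toy_vocab2index = {'good': 0, 'bad': 1, 'film': 2}
toy_documents = [
    [['good', 'good', 'film']],
    [['bad', 'film', 'unseen']],
]
clipped = count_features(toy_documents, toy_vocab2index, clip_counts=True)
raw = count_features(toy_documents, toy_vocab2index, clip_counts=False)
print(clipped)             # [[1, 0, 1], [0, 1, 1]]
print(raw)                 # [[2, 0, 1], [0, 1, 1]]
print(min_max_scale(raw))  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

In this toy example the token `film` occurs once in both documents, so its column is constant and carries no information after scaling.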

In [14]:
class PolarityPredictorAssignTargets(PolarityPredictorExtractFeatures):
 
    def get_targets(self, data):
        ''' create column vector with target labels
        '''
        # prepare target vector
        targets = numpy.zeros(len(data), dtype=numpy.int8)
        index = 0
        for _, label in data:
            if label == 'pos':
                targets[index] = 1
            index += 1
        return targets

    def train_model_on_features(self, tr_features, tr_targets, learning_model):
        raise NotImplementedError

In [15]:
class PolarityPredictor(PolarityPredictorAssignTargets):

    def train_model_on_features(self, tr_features, tr_targets, model):
        # pass numpy array to sklearn to train NB
        if model == 'multinb':
            self.model = MultinomialNB()
        else:
            raise ValueError('Unsupported learning model %r' %model)
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, feature_list, get_accuracy = False,
        get_confusion_matrix = False
    ):
        # Extract features from unseen data
        features = self.extract_features(data, feature_list)
        # use the trained sklearn model to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [16]:
model = PolarityPredictor()
feature_list = ['bow', 'bob', 'bot']
learning_model = 'multinb'
model.train(splits[0][0], features=feature_list, learning_model=learning_model)

BoW Columns:  48359
BoW Rows:  1800
BoW Columns:  427601
BoW Rows:  1800
BoW Columns:  884583
BoW Rows:  1800


## Measuring Performance
We will report accuracy and the full confusion matrix.


In [17]:
def print_first_predictions(model, te_data, feature_list, n = 12):
    predictions = model.predict(te_data, feature_list)
    for i in range(n):
        document, label = te_data[i]
        prediction = predictions[i]
        print('%4d %s %s %s' %(
            i, label, prediction,
            get_document_preview(document),
        ))
    
print_first_predictions(model, splits[0][1], feature_list)

BoW Columns:  48359
BoW Rows:  200
BoW Columns:  427601
BoW Rows:  200
BoW Columns:  884583
BoW Rows:  200
   0 pos pos films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
   1 pos pos every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
   2 pos pos you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
   3 pos pos "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
   4 pos neg moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
   5 pos pos on|june|30|,|1960|,|a|self-taught|,|idealistic|,|yet|pragmatic|,|young
   6 pos pos apparently|,|director|tony|kaye|had|a|major|battle|with|new|line
   7 pos pos one|of|my|colleagues|was|surprised|when|i|told|her|i|was|willing|to|see
   8 pos pos after|bloody|clashes|and|independence|won|,|lumumba|refused|to|pander|to
   9 pos pos the|american|action|film|has|been|slowly|drowning|to|death|in|a|sea|of
  10 pos pos after|watching|"|rat|race|"|last|week|,|i|notic

In [18]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], feature_list, get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

BoW Columns:  48359
BoW Rows:  200
BoW Columns:  427601
BoW Rows:  200
BoW Columns:  884583
BoW Rows:  200
0.82
[[76 24]
 [12 88]]
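
As a quick sanity check, the accuracy can be recomputed from the confusion matrix printed above. The sketch relies on sklearn's convention that rows are true labels and columns are predicted labels, and on the target encoding used in `get_targets` (0 for 'neg', 1 for 'pos'); precision and recall for the positive class are added for illustration and are not reported elsewhere in this notebook.

```python
# confusion matrix copied from the output above
confusion = [[76, 24],
             [12, 88]]
(tn, fp), (fn, tp) = confusion

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

print('accuracy  %.3f' % accuracy)       # accuracy  0.820
print('precision %.3f' % precision_pos)  # precision 0.786
print('recall    %.3f' % recall_pos)     # recall    0.880
```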


## Cross-Validation Results

In [21]:
def evaluate_model(model, splits, verbose = False):
    # note: uses the global feature_list and learning_model defined above
    accuracies = []
    for fold, (tr_data, te_data) in enumerate(splits):
        if verbose:
            print('Evaluating fold %d of %d' %(fold+1, len(splits)))
        model.train(tr_data, features=feature_list, learning_model=learning_model)
        _, accuracy = model.predict(te_data, feature_list, get_accuracy = True)
        accuracies.append(accuracy)
        if verbose:
            print('-->', accuracy)
    n = float(len(accuracies))
    avg = sum(accuracies) / n
    # population variance and standard deviation of the fold accuracies
    variance = sum([(x-avg)**2 for x in accuracies]) / n
    return (avg, variance**0.5, min(accuracies),
            max(accuracies))

# this takes about 3 minutes
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
BoW Columns:  48359
BoW Rows:  1800
BoW Columns:  427601
BoW Rows:  1800
BoW Columns:  884583
BoW Rows:  1800
BoW Columns:  48359
BoW Rows:  200
BoW Columns:  427601
BoW Rows:  200
BoW Columns:  884583
BoW Rows:  200
--> 0.82
Evaluating fold 2 of 10
BoW Columns:  48546
BoW Rows:  1800
BoW Columns:  428799
BoW Rows:  1800
BoW Columns:  886391
BoW Rows:  1800
BoW Columns:  48546
BoW Rows:  200
BoW Columns:  428799
BoW Rows:  200
BoW Columns:  886391
BoW Rows:  200
--> 0.875
Evaluating fold 3 of 10
BoW Columns:  48429
BoW Rows:  1800
BoW Columns:  428216
BoW Rows:  1800
BoW Columns:  885922
BoW Rows:  1800
BoW Columns:  48429
BoW Rows:  200
BoW Columns:  428216
BoW Rows:  200
BoW Columns:  885922
BoW Rows:  200
--> 0.87
Evaluating fold 4 of 10
BoW Columns:  48733
BoW Rows:  1800
BoW Columns:  429743
BoW Rows:  1800
BoW Columns:  887916
BoW Rows:  1800
BoW Columns:  48733
BoW Rows:  200
BoW Columns:  429743
BoW Rows:  200
BoW Columns:  887916
BoW Rows:  200
--> 0.87
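
The summary statistics returned by `evaluate_model` can be reproduced with the standard library. The sketch below uses only the four fold accuracies visible in the output above, so the numbers are not the final 10-fold results.

```python
import statistics

# accuracies of folds 1 to 4 as printed above
fold_accuracies = [0.82, 0.875, 0.87, 0.87]

mean = statistics.mean(fold_accuracies)
# pstdev is the population standard deviation (division by n),
# which matches the formula used in evaluate_model
stddev = statistics.pstdev(fold_accuracies)

print('%.4f %.4f %.3f %.3f' % (
    mean, stddev, min(fold_accuracies), max(fold_accuracies)
))
```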

## Comparing Models

As an example of how the above function can be used to compare different models, we compare the model with and without count clipping (i.e. binary NB versus NB).

In [23]:
print('Clip  Accuracy Stddev   Min   Max   Duration' )
for clip in (True, False):
    start = time.time()
    model = PolarityPredictor(clip_counts = clip)
    eval_results = evaluate_model(model, splits)
    duration = time.time() - start
    print('%5r %8.3f %6.3f  %.3f %.3f  %.1f seconds  ' %(
        (clip,) + eval_results +
        (duration,)
    ))

Clip  Accuracy Stddev   Min   Max   Duration
BoW Columns:  48359
BoW Rows:  1800
BoW Columns:  427601
BoW Rows:  1800
BoW Columns:  884583
BoW Rows:  1800
BoW Columns:  48359
BoW Rows:  200
BoW Columns:  427601
BoW Rows:  200
BoW Columns:  884583
BoW Rows:  200
BoW Columns:  48546
BoW Rows:  1800
BoW Columns:  428799
BoW Rows:  1800
BoW Columns:  886391
BoW Rows:  1800
BoW Columns:  48546
BoW Rows:  200
BoW Columns:  428799
BoW Rows:  200
BoW Columns:  886391
BoW Rows:  200
BoW Columns:  48429
BoW Rows:  1800
BoW Columns:  428216
BoW Rows:  1800
BoW Columns:  885922
BoW Rows:  1800
BoW Columns:  48429
BoW Rows:  200
BoW Columns:  428216
BoW Rows:  200
BoW Columns:  885922
BoW Rows:  200
BoW Columns:  48733
BoW Rows:  1800
BoW Columns:  429743
BoW Rows:  1800
BoW Columns:  887916
BoW Rows:  1800
BoW Columns:  48733
BoW Rows:  200
BoW Columns:  429743
BoW Rows:  200
BoW Columns:  887916
BoW Rows:  200
BoW Columns:  48574
BoW Rows:  1800
BoW Columns:  426872
BoW Rows:  1800
BoW Columns:  