# Sentiment Polarity Prediction with Naive Bayes

This notebook contains a basic implementation of document-level sentiment analysis
for movie reviews with multinomial Naive Bayes and bag-of-words features
and of cross-validation.
* No special treatment of rare or unknown words. Unknown words in the test data are skipped.

We use the movie review polarity data set of Pang and Lee 2004 [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](https://www.aclweb.org/anthology/P04-1035/) in Version 2.0 available from http://www.cs.cornell.edu/People/pabo/movie-review-data (section "Sentiment polarity datasets"). This dataset contains 1000 positive and 1000 negative reviews, each tokenised, sentence-split (one sentence per line) and lowercased. Each review has been assigned to 1 of 10 cross-validation folds by the authors and this setup should be followed to compare with published results.


In [1]:
import os
import tarfile
import time
import urllib.request
import numpy
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk
nltk.download('stopwords')
data_source = 'local-folder'
data_folder = os.path.join('data', 'txt_sentoken')

[nltk_data] Downloading package stopwords to /Users/ivan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
class PL04DataLoader_Part_1:
    
    def __init__(self):
        pass
    
    def get_labelled_dataset(self, fold = 0):
        ''' Compile a fold of the data set
        '''
        dataset = []
        for label in ('pos', 'neg'):
            for document in self.get_documents(
                fold = fold,
                label = label,
            ):
                dataset.append((document, label))
        return dataset
    
    def get_documents(self, fold = 0, label = 'pos'):
        ''' Enumerate the raw contents of all data set files.
            Args:
                data_dir: relative or absolute path to the data set folder
                fold: which fold to load (0 to n_folds-1)
                label: 'pos' or 'neg' to
                    select data with positive or negative sentiment
                    polarity
            Return:
                List of tokenised documents, each a list of sentences
                that in turn are lists of tokens
        '''
        raise NotImplementedError

In [3]:
class PL04DataLoader(PL04DataLoader_Part_1):
    
    def get_xval_splits(self):
        ''' Split data with labels for cross-validation
            returns a list of k pairs (training_data, test_data)
            for k cross-validation
        '''
        # load the folds
        folds = []
        for i in range(10):
            folds.append(self.get_labelled_dataset(
                fold = i
            ))
        # create training-test splits
        retval = []
        for i in range(10):
            test_data = folds[i]
            training_data = []
            for j in range(9):
                ij1 = (i+j+1) % 10
                assert ij1 != i
                training_data = training_data + folds[ij1]
            retval.append((training_data, test_data))
        return retval

In [4]:
class PL04DataLoaderFromStream(PL04DataLoader):
        
    def __init__(self, tgz_stream, **kwargs):
        super().__init__(**kwargs)
        self.data = {}
        counter = 0
        with tarfile.open(
            mode = 'r|gz',
            fileobj = tgz_stream
        ) as tar_archive:
            for tar_member in tar_archive:
                if counter == 2000:
                    break
                path_components = tar_member.name.split('/')
                filename = path_components[-1]
                if filename.startswith('cv') \
                and filename.endswith('.txt') \
                and '_' in filename:
                    label = path_components[-2]
                    fold = int(filename[2])
                    key = (fold, label)
                    if key not in self.data:
                        self.data[key] = []
                    f = tar_archive.extractfile(tar_member)
                    document = [
                        line.decode('utf-8').split()
                        for line in f.readlines()
                    ]
                    self.data[key].append(document)
                    counter += 1
            
    def get_documents(self, fold = 0, label = 'pos'):
        return self.data[(fold, label)]

## Read Data from the Web
This should run efficiently both on google colab and locally but has the disadvantage that the same data is downloaded each time the notebook is run.

In [5]:
class PL04DataLoaderFromURL(PL04DataLoaderFromStream):
    
    def __init__(self, data_url, **kwargs):
        with urllib.request.urlopen(data_url) as tgz_stream:
            super().__init__(tgz_stream, **kwargs)

## Read Data from a Local .tgz File

You manually download the .tgz once to a filesystem that can be accessed from the notebook, e.g. google drive on colab, and this notebook reads this file in one chunk. 

Note that if you are accessing files from google drive on colab, you will need to mount your drive and enter an authentication token:

```
from google.colab import drive
drive.mount('/content/drive')
```

You will also have to change your *data_tgz* or *data_folder* paths above so that they start with *'/content/drive/My Drive/'*

In [6]:
class PL04DataLoaderFromTGZ(PL04DataLoaderFromStream):
    
    def __init__(self, data_path, **kwargs):
        with open(data_path, 'rb') as tgz_stream:
            super().__init__(tgz_stream, **kwargs)

## Read Data from a Local Folder

Extract the .tgz to a local folder and only load the required files. This is usually the fastest option when storage is on a local SSD. On remote filesystems, however, this can be very slow.

In [7]:
class PL04DataLoaderFromFolder(PL04DataLoader):
    
    def __init__(self, data_dir, **kwargs):
        self.data_dir = data_dir
        super().__init__(**kwargs)
        
    def get_documents(self, fold = 0, label = 'pos'):
        # read folder contents
        path = os.path.join(self.data_dir, label)
        dir_entries = os.listdir(path)
        # must process entries in numeric order to
        # replicate order of original experiments
        dir_entries.sort()
        # check each entry and add to data if matching
        # selection criteria
        for filename in dir_entries:
            if filename.startswith('cv') \
            and filename.endswith('.txt'):
                if fold == int(filename[2]):
                    # correct fold
                    f = open(os.path.join(path, filename), 'rt')
                    # "yield" tells Python to return an iterator
                    # object that produces the yields of this
                    # function as elements without creating a
                    # full list of all elements
                    yield [line.split() for line in f.readlines()]
                    f.close()

In [8]:
if data_source == 'local-folder':
    data_loader = PL04DataLoaderFromFolder(data_folder)
elif data_source == 'local-tgz':
    data_loader = PL04DataLoaderFromTGZ(data_tgz)
elif data_source == 'web':
    data_loader = PL04DataLoaderFromURL(data_url)
else:
    raise ValueError('Unsupported data source %r' %data_source)

In [9]:
def get_document_preview(document, max_length = 72):
    s = []
    count = 0
    reached_limit = False
    for sentence in document:
        for token in sentence:
            if count + len(token) + len(s) > max_length:
                reached_limit = True
                break
            s.append(token)
            count += len(token)
        if reached_limit:
            break
    return '|'.join(s)
    
for label in 'pos neg'.split():
    print(f'== {label} ==')
    print('doc sentences start of first sentence')
    for index, document in enumerate(data_loader.get_documents(
        label = label
    )):
        print('%3d %7d   %s' %(
            index, len(document), get_document_preview(document)
        ))
        if index == 4:
            break

== pos ==
doc sentences start of first sentence
  0      25   films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
  1      39   every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
  2      19   you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
  3      42   "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
  4      25   moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
== neg ==
doc sentences start of first sentence
  0      35   plot|:|two|teen|couples|go|to|a|church|party|,|drink|and|then|drive|.
  1      13   the|happy|bastard's|quick|movie|review|damn|that|y2k|bug|.|it's|got|a
  2      23   it|is|movies|like|these|that|make|a|jaded|movie|viewer|thankful|for|the
  3      19   "|quest|for|camelot|"|is|warner|bros|.|'|first|feature-length|,
  4      37   synopsis|:|a|mentally|unstable|man|undergoing|psychotherapy|saves|a|boy


## Create Training-Test Splits for Cross-Validation

In [10]:
splits = data_loader.get_xval_splits()

print('tr-size te-size (number of documents)')
for xval_tr_data, xval_te_data in splits:
    print('%7d %7d' %(len(xval_tr_data), len(xval_te_data)))

tr-size te-size (number of documents)
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200


## Interface for Sentiment Polarity Predictor
Let's define a base class to clarify how we plan to use polarity predictors. Its functions will have to be implemented in sub-classes.

In [11]:
class PolarityPredictorInterface:

    def train(self, data_with_labels):
        raise NotImplementedError
        
    def predict(self, data):
        raise NotImplementedError

In [12]:
class PolarityPredictorInit(PolarityPredictorInterface):
    
    def train(self, data_with_labels, feature):
        """
        Function which trains model. Extracts features from extract_features function
        (changes for different features). Gets targets also and passes both to training function.
        """
        
        # Initialise vocab set object
        self.reset_feature_sets()
        # negate first then remove stop words? probably makes most sense
        if self.negation:
            self.add_negation_to_data(data_with_labels)
            
        if self.remove_stopwords:
            self.get_stopwords()
            self.remove_stopwords_from_data(data_with_labels)
            
        # Populate with the data
        self.add_to_feature_sets_from_data(data_with_labels)
        
        self.finalise_vocab()
        tr_features = self.extract_features(data_with_labels, feature)
        tr_targets = self.get_targets(data_with_labels)
        self.train_model_on_features(tr_features, tr_targets)

    def reset_feature_sets(self):
        """
        Initialises a set to hold each of the feature sets.
        """
        self.vocab = set()
        self.bigrams = set()
        self.trigrams = set()
    
    def add_negation_to_data(self, data):
        for document, label in data:
            for sentence in document:
                negate = False
                for index, token in enumerate(sentence):
                    if token in ('not', 'no') or (token[-3:] == "n't"):
                        negate = True
                        continue
                    if token == '.':
                        negate = False
                    if negate:
                        sentence[index] = 'NOT_' + token
    
    def remove_stopwords_from_data(self, data):
        for document, label in data:
            for sentence in document:
                stopword_indices = []
                for index, token in enumerate(sentence):
                    if token in self.stopwords:
                        stopword_indices.append(index)
                stopword_indices.reverse()
                for index in stopword_indices:
                    del sentence[index]
    
    def get_stopwords(self):
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        
    def add_to_feature_sets_from_data(self, data):
        """
        Parses tokens in data and adds them to each feature set.
        """
        for document, label in data:
            for sentence in document:
#                 sentence.insert(0, '<s>')
#                 sentence.append('</s>')
                prev_token = None
                for index, token in enumerate(sentence):
                    self.vocab.add(token)
                    if index > 0:
                        bigram = (prev_token, token)
                        self.bigrams.add(bigram)
                    if index > 1:
                        trigram = (prev_prev_token, prev_token, token)
                        self.trigrams.add(trigram)
                    prev_prev_token = prev_token
                    prev_token = token
                        
    def finalise_vocab(self):
        """
        Creates a dict for the feature sets for faster operations.
        """
        self.vocab = list(self.vocab)
        # create reverse map for fast token lookup
        self.vocab2index = {}
        for index, token in enumerate(self.vocab):
            self.vocab2index[token] = index
            
        self.bigrams = list(self.bigrams)
        # create reverse map for fast token lookup
        self.bigram2index = {}
        for index, token in enumerate(self.bigrams):
            self.bigram2index[token] = index
            
        self.trigrams = list(self.trigrams)
        # create reverse map for fast token lookup
        self.trigram2index = {}
        for index, token in enumerate(self.trigrams):
            self.trigram2index[token] = index
        
        
    def extract_features(self, data, feature):
        raise NotImplementedError
    
    def get_targets(self, data, label2index = None):
        raise NotImplementedError
        
    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

In [13]:
class PolarityPredictorExtractFeatures(PolarityPredictorInit):
    
    def __init__(self, clip_counts = True, negation=False, remove_stopwords=False, learning_model=MultinomialNB()):
        self.clip_counts = clip_counts
        self.negation = negation
        self.remove_stopwords = remove_stopwords
        self.model = learning_model
        
    def extract_features(self, data, ngram):
        """
        Creates features from the data. This implementation creates a dict which contains the relevant feature
        matrices for different feature implementations.
        """
        feature_matrices = {}            
        for feature in ['bow', 'bob', 'bot']:
            rows = len(data)
            # Initialise a feature matrix with zeros
            feature_matrices['bow'] = numpy.zeros((rows, len(self.vocab)), dtype=numpy.int32)
            feature_matrices['bob'] = numpy.zeros((rows, len(self.bigrams)), dtype=numpy.int32)
            feature_matrices['bot'] = numpy.zeros((rows, len(self.trigrams)), dtype=numpy.int32)
            # populate feature matrix
            for row, item in enumerate(data):
                document, _ = item
                for sentence in document:
#                     if sentence[0] != '<s>':
#                         sentence.insert(0, '<s>')
#                     if sentence[-1] != '</s>':
#                         sentence.append('</s>')
                    prev_token = None
                    for idx, token in enumerate(sentence):
                        # word
                        try:
                            bow_index = self.vocab2index[token]
                        except KeyError:
                            continue
                        if self.clip_counts:
                            feature_matrices['bow'][row, bow_index] = 1
                        else:
                            feature_matrices['bow'][row, bow_index] += 1
                        # bigram
                        if idx > 0:
                            bigram = (prev_token, token)
                            try:
                                bob_index = self.vocab2index[bigram]
                            except KeyError:
                                continue
                            if self.clip_counts:
                                feature_matrices['bob'][row, bob_index] = 1
                            else:
                                feature_matrices['bob'][row, bob_index] += 1
                        # trigram
                        if idx > 1:
                            trigram = (prev_prev_token, prev_token, token)
                            try:
                                bot_index = self.trigram2index[trigram]
                            except KeyError:
                                continue
                            if self.clip_counts:
                                feature_matrices['bot'][row, bot_index] = 1
                            else:
                                feature_matrices['bot'][row, bot_index] += 1
                                                    
                        prev_prev_token = prev_token
                        prev_token = token
        if ngram == 'bow':
            return feature_matrices['bow']
        if ngram == 'bob':
            return feature_matrices['bob']
        if ngram == 'bot':
            return feature_matrices['bot']

In [14]:
class PolarityPredictorAssignTargets(PolarityPredictorExtractFeatures):
 
    def get_targets(self, data):
        ''' create column vector with target labels
        '''
        # prepare target vector
        targets = numpy.zeros(len(data), dtype=numpy.int8)
        index = 0
        for _, label in data:
            if label == 'pos':
                targets[index] = 1
            index += 1
        return targets

    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

In [15]:
class PolarityPredictor(PolarityPredictorAssignTargets):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train NB
        self.model.fit(tr_features, tr_targets)
        
    def predict(self, data, feature, get_accuracy = False, get_confusion_matrix = False):
        if self.negation:
            self.add_negation_to_data(data)
        if self.remove_stopwords:
            self.remove_stopwords_from_data(data)
        # Extract features from unseen data
        features = self.extract_features(data, feature)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

# Experiments

## Evaluation Table

The next class we define contains three different feature representations:
* Bag-of-Words (Unigrams)
* Bag-of-Bigrams
* Bag-of-Trigrams

The relevant functions take in a *feature* parameter which specifies which feature of the above to use. There is also a *learning_model* parameter which specifies which particular learning model to use.

We plan to run many different experiments using different feature representations and different learning models. Therefore, having an evaluation table which contains the details of each experiment and the corresponding evalation results will be useful. Below, we define table to store these.

In [16]:
evaluation_df = pd.DataFrame(columns=['name', 'learning_model', 'features', 'clip_counts', 'avg_cv_acc', 'rmse', 'min_acc', 'max_acc'])

## 1. Baseline

Below, we run the baseline approach as a functionality test.

In [17]:
model = PolarityPredictor(clip_counts=True, negation=False, remove_stopwords=False, learning_model=MultinomialNB())
feature = 'bow'
model.train(splits[0][0], feature)

### Measuring Performance
We will report accuracy, and the full confusion matrix.


In [18]:
def print_first_predictions(model, te_data, feature, n = 12):
    predictions = model.predict(te_data, feature)
    for i in range(n):
        document, label = te_data[i]
        prediction = predictions[i]
        print('%4d %s %s %s' %(i, label, prediction, get_document_preview(document),))
    
print_first_predictions(model, splits[0][1], feature)

   0 pos neg films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
   1 pos pos every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
   2 pos pos you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
   3 pos pos "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
   4 pos neg moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
   5 pos pos on|june|30|,|1960|,|a|self-taught|,|idealistic|,|yet|pragmatic|,|young
   6 pos pos apparently|,|director|tony|kaye|had|a|major|battle|with|new|line
   7 pos pos one|of|my|colleagues|was|surprised|when|i|told|her|i|was|willing|to|see
   8 pos pos after|bloody|clashes|and|independence|won|,|lumumba|refused|to|pander|to
   9 pos pos the|american|action|film|has|been|slowly|drowning|to|death|in|a|sea|of
  10 pos pos after|watching|"|rat|race|"|last|week|,|i|noticed|my|cheeks|were|sore
  11 pos pos i've|noticed|something|lately|that|i've|never|thought|of|before|.


In [19]:
labels, accuracy, confusion_matrix = model.predict(splits[0][1], feature, get_accuracy = True, get_confusion_matrix = True)

print(accuracy)
print(confusion_matrix)

0.795
[[82 18]
 [23 77]]


### Cross-Validation Results

In [49]:
def evaluate_model(model, splits, feature, verbose = False):
    accuracies = []
    fold = 0
    for tr_data, te_data in splits:
        if verbose:
            print('Evaluating fold %d of %d' %(fold+1, len(splits)))
            fold += 1
        model.train(tr_data, feature)
        _, accuracy = model.predict(te_data, feature, get_accuracy = True)
        accuracies.append(accuracy)
        if verbose:
            print('-->', accuracy)
    n = float(len(accuracies))
    avg = sum(accuracies) / n
    mse = sum([(x-avg)**2 for x in accuracies]) / n
    return (avg, mse**0.5, min(accuracies),
            max(accuracies))

# this takes about 3 minutes
avg, rmse, min_acc, max_acc = evaluate_model(model, splits, feature, verbose = True)
print("Average Accuracy: ", avg)
print("RMSE: ", rmse)
print("Min Accuracy: ", min_acc)
print("Max Accuracy: ", max_acc)

Evaluating fold 1 of 10
--> 0.795
Evaluating fold 2 of 10
--> 0.84
Evaluating fold 3 of 10
--> 0.84
Evaluating fold 4 of 10
--> 0.825
Evaluating fold 5 of 10
--> 0.835
Evaluating fold 6 of 10
--> 0.83
Evaluating fold 7 of 10
--> 0.84
Evaluating fold 8 of 10
--> 0.845
Evaluating fold 9 of 10
--> 0.785
Evaluating fold 10 of 10
--> 0.855
Average Accuracy:  0.829
RMSE:  0.021071307505705458
Min Accuracy:  0.785
Max Accuracy:  0.855


So, the baseline approach achieves an average accuracy score of 83.2%. Let's add this to our evaluation table.

In [50]:
evaluation_df.loc[0] = ['baseline-NB-BoW-clip', 'multinb', 'bow', True, avg, rmse, min_acc, max_acc]

In [51]:
evaluation_df

Unnamed: 0,name,learning_model,features,clip_counts,avg_cv_acc,rmse,min_acc,max_acc
0,baseline-NB-BoW-clip,multinb,bow,True,0.829,0.021071,0.785,0.855


## Comparing Models

As an example of how above function can be used to compare different models, we compare the model with and without count cut-off (i.e. NB versus binary NB).

In [55]:
print('RemoveSW  Negation  Clip  Accuracy Stddev   Min   Max   Duration' )
for negation in (True, False):
    for remove_stopwords in (True, False):
        for clip in (True, False):
            start = time.time()
            model = PolarityPredictor(clip_counts = clip, negation = negation, remove_stopwords = remove_stopwords, learning_model=MultinomialNB())
            accuracy, stddev, min_acc, max_acc = evaluate_model(model, splits, feature)
            duration = time.time() - start
            print(f'{remove_stopwords}     {negation}  {clip}  {accuracy}  {stddev}  {min_acc} {max_acc}  {duration}')
#             print('%5r %8.3f %6.3f  %.3f %.3f  %.1f seconds  ' %((clip,) + eval_results + (duration,)))

RemoveStopwords  Negation  Clip  Accuracy Stddev   Min   Max   Duration
True  True  True  0.8265  0.023350588857671214  0.775 0.86  68.45569515228271
True  True  False  0.8084999999999999  0.03619737559547651  0.74 0.845  121.99980902671814
False  True  True  0.8265  0.023350588857671214  0.775 0.86  68.11051321029663
False  True  False  0.8084999999999999  0.03619737559547651  0.74 0.845  121.99795413017273
True  False  True  0.8265  0.023350588857671214  0.775 0.86  66.6793200969696
True  False  False  0.8084999999999999  0.03619737559547651  0.74 0.845  121.04392004013062
False  False  True  0.8265  0.023350588857671214  0.775 0.86  66.276606798172
False  False  False  0.8084999999999999  0.03619737559547651  0.74 0.845  121.60758399963379


## 2. Baseline (clip_counts = False)

In [60]:
model = PolarityPredictor(clip_counts=False)
feature_list = ['bow']
learning_model = 'multinb'

In [61]:
model2_avg, model2_rmse, model2_min_acc, model2_max_acc = evaluate_model(model, splits, feature_list, learning_model, verbose = True)

print("Average Accuracy: ", model2_avg)
print("RMSE: ", model2_rmse)
print("Min Accuracy: ", model2_min_acc)
print("Max Accuracy: ", model2_max_acc)

Evaluating fold 1 of 10
--> 0.795
Evaluating fold 2 of 10
--> 0.81
Evaluating fold 3 of 10
--> 0.855
Evaluating fold 4 of 10
--> 0.815
Evaluating fold 5 of 10
--> 0.83
Evaluating fold 6 of 10
--> 0.845
Evaluating fold 7 of 10
--> 0.86
Evaluating fold 8 of 10
--> 0.83
Evaluating fold 9 of 10
--> 0.795
Evaluating fold 10 of 10
--> 0.835
Average Accuracy:  0.827
RMSE:  0.021817424229271406
Min Accuracy:  0.795
Max Accuracy:  0.86


Without clipping the counts, the accuracy of the model actually worsens slightly. Let's add this to our evaluation table.

In [62]:
evaluation_df.loc[evaluation_df.index.max() + 1] = ['baseline-NB-BoW', 'multinb', 'bow', False, model2_avg, model2_rmse, model2_min_acc, model2_max_acc]

In [63]:
evaluation_df

Unnamed: 0,name,learning_model,features,clip_counts,avg_cv_acc,rmse,min_acc,max_acc
0,baseline-NB-BoW-clip,multinb,bow,True,0.832,0.023791,0.785,0.87
1,baseline-NB-BoW,multinb,bow,False,0.827,0.021817,0.795,0.86


## 3. Bag-of-Bigrams NB

In [64]:
model = PolarityPredictor()
feature_list = ['bob']
learning_model = 'multinb'

In [65]:
model3_avg, model3_rmse, model3_min_acc, model3_max_acc = evaluate_model(model, splits, feature_list, learning_model, verbose = True)

print("Average Accuracy: ", model3_avg)
print("RMSE: ", model3_rmse)
print("Min Accuracy: ", model3_min_acc)
print("Max Accuracy: ", model3_max_acc)

Evaluating fold 1 of 10
--> 0.5
Evaluating fold 2 of 10
--> 0.5
Evaluating fold 3 of 10
--> 0.5
Evaluating fold 4 of 10
--> 0.5
Evaluating fold 5 of 10
--> 0.5
Evaluating fold 6 of 10
--> 0.5
Evaluating fold 7 of 10
--> 0.5
Evaluating fold 8 of 10
--> 0.5
Evaluating fold 9 of 10
--> 0.5
Evaluating fold 10 of 10
--> 0.5
Average Accuracy:  0.5
RMSE:  0.0
Min Accuracy:  0.5
Max Accuracy:  0.5


Without clipping the counts, the accuracy of the model actually worsens slightly. Let's add this to our evaluation table.

In [62]:
evaluation_df.loc[evaluation_df.index.max() + 1] = ['baseline-NB-BoW', 'multinb', 'bow', False, model2_avg, model2_rmse, model2_min_acc, model2_max_acc]

In [63]:
evaluation_df

Unnamed: 0,name,learning_model,features,clip_counts,avg_cv_acc,rmse,min_acc,max_acc
0,baseline-NB-BoW-clip,multinb,bow,True,0.832,0.023791,0.785,0.87
1,baseline-NB-BoW,multinb,bow,False,0.827,0.021817,0.795,0.86
