<h1 style="text-align: center;">Naive Bayes Sentiment Classification</h1>
<p>This blog post will introduce the concept of the Naive Bayes Classifier and evaluate its predictive capabilities on the labelled imdb data for sentiment analysis. The dataset can be found on Kaggle at the following link, https://www.kaggle.com/marklvl/sentiment-labell. The methods for performing experimentation on the Naive Bayes Classifier are referenced from the Data Mining-5334 assignment found at https://docs.google.com/document/d/1bmCm9TXwqp5tX7lpg14NkaB3dBSg15cCC7ICxeB-vB4/edit. All code and experiment results are my contribution to the methodology given. Numpy is the primary package used for handling the data along with base python structures. </p>

In [1]:
import numpy as np

<p>For the classifier that will be built in this blog, we only require the imdb_labelled.txt text file from Kaggle. As the classifier will predict for a given line of text the sentiment, we can simply just load the lines of the text file into an array as follows. </p>

In [2]:
data = []
with open('imdb_labelled.txt') as f:
    data = f.readlines()

<p>Next we will split our dataset into train/dev/test datasets. We will reserve 10% of the data for testing, this data will never be seen by any model and is solely used for final evaluation. The remaining 90% of the dataset are split into test and dev sets with 72% being used for training and 18% used for development/validation. The value 18% is used for development as it can split 90% evenly 5 ways which will be useful for 5-fold cross validation later. </p>

In [3]:
train_data = data[:720]

dev_data = data[720:900]

test_data = data[900:]

<p>The classifier that will be used to predict the sentiment for this dataset will be the Naive Bayes Classifier. This classifier works by using Baye's theorem to calculate the probability of the sentiment being positive given a word. This is abastracted to allow for a full sentence or multiple words by making a naive assumption about the conditional independence of the words, P(Sentiment|word1, word2) = P(word1|Sentiment) * P(word2|sentiment) * P(Sentiment) / P(word1, word2). This assumption allows the probabilities of positive and negative sentiments given the sentence to be calculated and compared. The classifier then predicts the class that has the higher probability value. In the unlikely case of a tie, we arbitrarily choose negative sentiment as the prediction. In the implementation following we leave out the calculation of the denominator as this will be a constant between predicting the likelihood of positive sentiment vs negative sentiment and thus will not impact the classification results.</p> <p>Our classifier trains on the provided training data via the train function. This function takes the sentence as input, does some minor filtering (replace commas and periods with space and make all words lower case), and places the words from the sentence into a temporary set. As the words are in a set, we are only looking for a single occurrence of a word per line. The labels are included at the end of the input line of the training data and is parsed by the model and removed from the line as to not include the value in the dictionary. Finally the set is iterated over and each word is stored in the dictionary as a 3-dimensional array; this array stores the total count of the word in all documents, the count of the word in positive documents, and the count of the word in negative documents. If the word is already present in the dictionary the appropriate values of the array are incremented based on the sentiment label. The implementation as described above is provided below.</p>

In [4]:
class NBC():
  def __init__(self):
    self.model_dict = {}
    self.pos_docs = 0
    self.neg_docs = 0
    self.tot_docs = 0

  def train(self, train_data):
    for data in train_data:
      label = data[len(data)-2]
      positive = True
      data = data[:len(data)-3]
      words = data.split()
      for i in range(len(words)):
        words[i] = words[i].replace(".","").replace(",","").lower()
      words = set(words)


      if label == '1':
        self.pos_docs += 1
      elif label == '0':
        self.neg_docs += 1
        positive = False
      self.tot_docs += 1

      for word in words:
        if word not in self.model_dict:
          if positive:
            self.model_dict[word] = [1, 1, 0]
          else:
            self.model_dict[word] = [1, 0, 1]
        else:
          self.model_dict[word][0] += 1
          if positive:
            self.model_dict[word][1] += 1
          else:
            self.model_dict[word][2] += 1

  def word_probability(self, word):
    return model.model_dict[word][0]/model.tot_docs

  def word_given_positive(self, word, smoothing=False):
    if smoothing:
      return (model.model_dict[word][1] + 1)/(model.pos_docs + 2)
    else:
      return model.model_dict[word][1]/model.pos_docs

  def word_given_negative(self, word, smoothing=False):
    if smoothing:
      return (model.model_dict[word][2] + 1)/(model.neg_docs + 2)
    else:
      return model.model_dict[word][2]/model.neg_docs

  def positive(self, word, smoothing=False):
    return ((self.word_given_positive(word, smoothing=smoothing))*(self.word_probability(word)))/(model.pos_docs/model.tot_docs)

  def negative(self, word, smoothing=False):
    return ((self.word_given_negative(word, smoothing=smoothing))*(self.word_probability(word)))/(model.neg_docs/model.tot_docs)

  def forward(self, test_data, smoothing=False):
    words = test_data.split()
    for i in range(len(words)):
        words[i] = words[i].replace(".","").replace(",","").lower()
    words = set(words)
    pos = model.pos_docs/model.tot_docs
    neg = model.neg_docs/model.tot_docs
    for word in words:
      if word in self.model_dict:
        pos = pos*self.word_given_positive(word, smoothing)
        neg = neg*self.word_given_negative(word, smoothing)
      else:
        if smoothing:
          pos = pos*(1/(model.pos_docs + 2))
          neg = neg*(1/(model.neg_docs + 2))
        else:
          pos = 0
          neg = 0

    return pos, neg

#Return True for positive sentiment, False for negative sentiment
  def predict(self, test_data, smoothing=False):
    pos, neg = self.forward(test_data, smoothing=smoothing)
    if pos > neg:
      return True
    else:
      return False

  def test(self, test_data, labels, smoothing=False):
    correct = 0
    total = 0
    for i in range(len(test_data)):
      if labels[i] == '1':
        label = True
      else:
        label = False
      prediction = self.predict(test_data[i], smoothing=smoothing)
      if label == prediction:
        correct += 1
      total += 1

    return correct/total

<p>Now we can initialize our model and train it on the training data from our previous split. </p>

In [5]:
model = NBC()
model.train(train_data)

<p>Now that the model is trained, we can calculate some probabilities based on the given data and see what our model has learned. First we will look at the probability of the occurrence of the word 'the'. </p>

In [6]:
#P('the')
prob = model.word_probability('the')
print(prob)

0.49166666666666664


<p>We get a result of .4917 or a 49.17% chance of the word 'the' occuring in a sentence. Next we will look at the probability of the word 'the' appearing given that the document (sentence) is of positive sentiment as well as the probability of 'the' occuring given a negative document.  </p>

In [7]:
#P('the'|Positive)
prob = model.word_given_positive('the')
print(prob)

#P('the'|Negative)
prob = model.word_given_negative('the')
print(prob)

0.47112462006079026
0.5089514066496164


<p>We get a result of .4711 or a 47.11% chance of the word 'the' appearing given a document's sentiment is positive and .5173 or a 51.73% chance of 'the' appearing given a document's sentiment is negative. </p>

<p>As the model only learns on the 72% of data that is reserved for testing, it may not have seen some words that will be presented to it in the development of test datasets. This will cause the probability of the sentiment given this word to be zero and the classifier will not be able to account for any other words in the sentence. One approach to fixing this issue is to perform cross validation to find which training dataset results in the best predictions. We will perform 5-fold cross validation on our model by splitting the 90% of the dataset reserved for train/dev evenly into 5 folds. The model will take 1 fold (18% of data) as validation and the remaining 4 folds as the training data. The model will then be validated by calculating the accuracy of predictions on the validation set for all 5 models to be trained. The implementation can be seen below.</p>

In [8]:
def validate(train_data, dev_data, smoothing=False):
  model = NBC()
  model.train(train_data)
  labels = []
  for x in dev_data:
    labels.append(x[len(x)-2])
    
  return model.test(dev_data, labels, smoothing=smoothing)

In [9]:
def FiveFoldCrossValidation(data, smoothing=False):
  accuracy = []
  train_data = data[180:900]
  dev_data = data[:180]

  accuracy.append(validate(train_data, dev_data, smoothing=smoothing))

  train_data = data[:180] + data[360:900]
  dev_data = data[180:360]

  accuracy.append(validate(train_data, dev_data, smoothing=smoothing))

  train_data = data[:360] + data[540:900]
  dev_data = data[360:540]

  accuracy.append(validate(train_data, dev_data, smoothing=smoothing))

  train_data = data[:540] + data[720:900]
  dev_data = data[540:720]

  accuracy.append(validate(train_data, dev_data, smoothing=smoothing))

  train_data = data[:720]
  dev_data = data[720:900]

  accuracy.append(validate(train_data, dev_data, smoothing=smoothing))

  return accuracy

In [10]:
accuracy = FiveFoldCrossValidation(data)
print(accuracy)

[0.6666666666666666, 0.65, 0.6277777777777778, 0.5777777777777777, 0.45555555555555555]


<p>After performing 5-fold cross validation we get the resulting accuracies of our 5 models as [0.6666666666666666, 0.65, 0.6277777777777778, 0.5777777777777777, 0.45555555555555555]. We can see that the first split where the first 18% of the dataset is reserved for validation performs the best. Our accuracy values are still fairly low and when inspecting the model to determine why this is we see that it is due to the problem of unseen words resulting in zero probabilities. As this issue is still not solved after optimizing our training dataset, we will need to look into another method of solving this. </p>

<p>In order to fix the zero probability problem for unseen words we implement the technique known as smoothing. Smoothing allows us to artificially add 2 new documents for positive sentiment where one document contains all words in the dictionary and the other contains no words in the dictionary (as this counts for all possibilities for every attribute since a word can either be included or discluded). The same is done for the negative sentiment as well. As the documents are not real, they are not actually added to the dataset. Instead the documents are accounted for in the calculation of the probability for a word given the sentiment where the numerator (number of times a word appeared in documents of the given sentiment) is increased by one and the denominator (number of documents of the given sentiment) is increased by 2. After implementing smoothing, we perform 5-fold cross validation again to get the optimal hyperparameters for our model. </p>

In [11]:
smoothing_accuracy = FiveFoldCrossValidation(data, smoothing=True)
print(smoothing_accuracy)

[0.8666666666666667, 0.8611111111111112, 0.8666666666666667, 0.9, 0.7277777777777777]


<p>After performing cross validation we get the resulting accuracies of the 5 models to be [0.8666666666666667, 0.8611111111111112, 0.8666666666666667, 0.9, 0.7277777777777777]. We can see that all of our models perform much better now that we are using the smoothing technique. Our best model gave a 90% accuracy on the validation set as the 4th fold. We use this model as our best and find the top ten words that predict positive and negative sentiments as well as their probabilities (P(Sentiment|word)). </p>

In [12]:
train_data = data[:540] + data[720:900]
dev_data = data[540:720]

model = NBC()
model.train(train_data)

pos = []
neg = []
for word in model.model_dict:
  pos.append((model.positive(word, smoothing=True), word))
  neg.append((model.negative(word, smoothing=True), word))

In [13]:
def sort_help(x):
  return x[0]

pos.sort(reverse=True, key=sort_help)
neg.sort(reverse=True, key=sort_help)

In [14]:
print(pos[:10])
print(neg[:10])

[(0.5334168650545044, 'the'), (0.2501942112517228, 'a'), (0.24207492795389043, 'and'), (0.1912709351376185, 'of'), (0.18322683038884016, 'is'), (0.15394896211836445, 'this'), (0.115774965543165, 'i'), (0.10162469197677818, 'it'), (0.09197677818151442, 'to'), (0.08896128304723717, 'in')]
[(0.506206896551724, 'the'), (0.17875862068965517, 'a'), (0.1724491600353669, 'and'), (0.16499381078691425, 'of'), (0.14500442086648982, 'is'), (0.12767462422634834, 'this'), (0.09687002652519892, 'i'), (0.09276038903625108, 'it'), (0.06930503978779841, 'in'), (0.06878160919540229, 'to')]


<p>We can see the top ten words for positive sentiment and the probability P(Positive|word) are [(0.5334168650545044, 'the'), (0.2501942112517228, 'a'), (0.24207492795389043, 'and'), (0.1912709351376185, 'of'), (0.18322683038884016, 'is'), (0.15394896211836445, 'this'), (0.115774965543165, 'i'), (0.10162469197677818, 'it'), (0.09197677818151442, 'to'), (0.08896128304723717, 'in')]
. The top ten words for negative sentiment and the probability P(Negative|word) are [(0.506206896551724, 'the'), (0.17875862068965517, 'a'), (0.1724491600353669, 'and'), (0.16499381078691425, 'of'), (0.14500442086648982, 'is'), (0.12767462422634834, 'this'), (0.09687002652519892, 'i'), (0.09276038903625108, 'it'), (0.06930503978779841, 'in'), (0.06878160919540229, 'to')].
 </p>

<p>It can be seen from the results that the same set of words are the top predictors for both positive and negative sentiment. This is because these are very common words in the English language. These words are included in a set of words called "Stop Words" and these words are typically filtered out of the dataset. Although they are typically filtered from models, our model still performs well with them included as they influence the positive and negative sentiment predictions near equally. </p>

<p>Finally we evaluate our best model with the hyperparameters found from the 5-fold cross validation on the last 10% of the dataset that was withheld for testing. </p>

In [15]:
test_labels = []
for x in test_data:
  test_labels.append(x[len(x)-2])
model.test(test_data, test_labels, smoothing=True)

0.71

<p>The final model accuracy on the test data is .71 or 71%. The accuracy is pretty decent taking into consideration the naive assumption being made and that the data being predicted on has never been seen by any of the models trained in the development process. </p>

<h2>References</h2>
<p>Assignment/Methods. https://docs.google.com/document/d/1bmCm9TXwqp5tX7lpg14NkaB3dBSg15cCC7ICxeB-vB4/edit </p>
<p>Kaggle IMDB labelled dataset. https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set </p>
<p>Github. https://github.com/jjames71396/NaiveBayesClassifier </p>
<p>Working Notebook. https://colab.research.google.com/drive/1XuhBSHC48XSjhW6L2XwD3lI_5Z9lT5Pg?usp=sharing  </p>