<a href="https://colab.research.google.com/github/julurisaichandu/3DSubjectChatbot/blob/main/notebook_02_NaiveBayes_distrib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook 2: Naive Bayes
===============

CS 6120 Natural Language Processing, Amir



Saichandu Juluri

Saving notebooks as pdfs
----------

Feel free to add cells to this notebook as you wish. Make sure to leave **code that you've written** and any **answers to questions** that you've written in your notebook. Turn in your notebook as a pdf at the end of the lecture day.


To convert your notebook to a pdf for turn in, you'll do the following:
1. Kernel -> Restart & Run All (clear your kernel's memory and run all cells)
2. File -> Download As -> .html -> open in a browser -> print to pdf

(The download as pdf option doesn't preserve formatting and output as nicely as taking the step "through" html, but will do if the above doesn't work for you.)

Task 1: Implement Binary Naive Bayes
-------

Recall that for a document of $n$ words, Naive Bayes makes predictions as

$\hat{y} = \arg\max_{y \in \{0, 1\}} P(y) \prod_{i=1}^{n} P(x_i|y)$

To make this calculation more stable we can operate in log space

$\hat{y} = \arg\max_{y \in \{0, 1\}} \log P(y) + \sum_{i=1}^{n} \log P(x_i|y)$
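
To see why log space is more stable, here is a minimal sketch with made-up numbers. The document length of 400 words and the per-word probability of 0.001 are illustrative assumptions, not values from the hotel review data.

```python
import math

# Hypothetical document: 400 words, each with P(x_i|y) = 0.001 (illustrative values only)
word_probs = [0.001] * 400
prior = 0.5

# Multiplying the probabilities directly underflows to 0.0 in double precision
product = prior
for p in word_probs:
    product *= p

# Summing the logs of the same probabilities stays finite
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(product)    # 0.0
print(log_score)  # a finite, large negative number
```

Because the logarithm is monotonic, the class with the larger log score is also the class with the larger product, so the argmax is unchanged.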

Training entails estimating:

1. class priors

$P(y) = \frac{N_y}{N}$
where $N_y$ is the number of documents with class $y$ and $N$ is the total number of documents


2. class conditional word probabilities

$P(x_i|y) = \frac{\text{count}(x_i,y)}{\sum_{x \in V} \text{count}(x,y)}$


Also recall that we should use smoothing when calculating the above probabilities

$P(x_i|y) = \frac{\text{count}(x_i,y) + \alpha}{\sum_{x \in V} \text{count}(x,y)+ \alpha|V|}$
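
As a small worked example of the smoothed estimate, the sketch below applies the formula to hypothetical toy counts for a single class. The words, the counts, and the helper name `smoothed_prob` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy counts for one class (not from the hotel review data)
counts = Counter({"clean": 3, "dirty": 0, "room": 1})
vocab = {"clean", "dirty", "room"}
total = sum(counts.values())  # 4 word tokens in this class

def smoothed_prob(word, alpha):
    """Add-alpha estimate of P(word | class) for the toy counts above."""
    return (counts[word] + alpha) / (total + alpha * len(vocab))

print(smoothed_prob("dirty", alpha=0))  # 0.0
print(smoothed_prob("dirty", alpha=1))  # 1/7, about 0.143
```

With $\alpha=1$ the unseen word gets a small nonzero probability, and the estimates over the vocabulary still sum to 1.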


In [None]:
from collections import Counter
import numpy as np
import math
from sklearn.metrics import precision_score, recall_score, f1_score
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
def read_data(fpath):
  """
  Reads the data and forms tuples: (text,label)
  Parameters:
    fpath - str path to file to read in
  Return:
    a list of tuples of strings formatted [(example_text, label), (example_text, label)....]
  """
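  dataset = []
  # the context manager closes the file even if parsing fails
  with open(fpath, "r", encoding="utf8") as f:
    for review in f:
      # skip blank lines
      if len(review.strip()) == 0:
        continue
      # tab-separated columns: text at index 1, integer label at index 2
      data = review.split("\t")
      dataset.append((data[1].strip(), int(data[2].strip())))
  return dataset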

def report_metrics(classifier, test_data):
  """
    Applies the trained classifier to test data and computes performance
  """
  golds = [data[1] for data in test_data]
  classified = [classifier.predict(data[0]) for data in test_data]
  print("Precision:", precision_score(golds, classified))
  print("Recall:", recall_score(golds, classified))
  print("F1:", f1_score(golds, classified))

In [None]:
class NaiveBayes:

  def __init__(self, alpha=0):

    #Prior class probabilities P(y)
    self.prior_pos = 0
    self.prior_neg = 0

    #Conditional class probabilities P(x|y)
    self.p_x_pos = {}
    self.p_x_neg = {}

    #smoothing constant
    self.alpha = alpha
    #vocabulary
    self.vocab = set()

  def fit(self, examples):
    """
      Fit the model parameters via Maximum Likelihood Estimation
    """
    n_positive_docs = 0
    n_negative_docs = 0
    #word counts per class
    word_counts_pos = Counter()
    word_counts_neg = Counter()

    #iterate through the training data
    for example in examples:

      x, y = example
      words = self.featurize(x)

      #keep track of class counts
      # incrementing the docs count according to the class
      # incrementing the word counts in specific class according to the class
      if y == 1:
        n_positive_docs += 1
        word_counts_pos.update(words)
      else:
        n_negative_docs += 1
        word_counts_neg.update(words)

      #keep track of words
      self.vocab.update(words)

    # calculate class priors
    self.prior_pos = n_positive_docs / len(examples)
    self.prior_neg = n_negative_docs / len(examples)


    #calculate conditional probs for each word
    # the denominators do not depend on the word, so compute them once
    denom_pos = sum(word_counts_pos.values()) + self.alpha * len(self.vocab)
    denom_neg = sum(word_counts_neg.values()) + self.alpha * len(self.vocab)
    for word in self.vocab:
      # smoothed P(word | positive class)
      self.p_x_pos[word] = (word_counts_pos[word] + self.alpha) / denom_pos

      # smoothed P(word | negative class)
      self.p_x_neg[word] = (word_counts_neg[word] + self.alpha) / denom_neg


  def score(self, data):
    """
      Compute scores for the positive and negative class given data
    """
    #get features
    words = self.featurize(data)

    p_neg_feat = 0
    p_pos_feat = 0

    for word in words:
      # skip words that we've never seen
      if word not in self.vocab:
        continue

      p_neg_feat += np.log(self.p_x_neg[word])
      p_pos_feat += np.log(self.p_x_pos[word])

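    # NOTE: exponentiating converts the log scores back to probabilities, which can
    # underflow to 0.0 for long documents; comparing the log scores directly avoids this
    neg_score = math.e ** (np.log(self.prior_neg) + p_neg_feat)
    pos_score = math.e ** (np.log(self.prior_pos) + p_pos_feat)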

    return [neg_score, pos_score]

  def predict(self, data):
    """
      Predict class given input data
    """
    scores = self.score(data)

    # calculating argmax of two scores as we have two classes only
    # if scores[0] >= scores[1]:
    #   return 0
    # else:
    #   return 1

    # argmax for finding class to which it belongs
    return np.argmax(scores)


  def featurize(self, data):
    """
      Basic feature extractor. Only applies white space tokenization
    """
    return data.split()




In [None]:
training = "data/hotel_reviews_train.txt"
testing = "data/hotel_reviews_test.txt"

model = NaiveBayes(alpha=1)

examples = read_data(training)
model.fit(examples)

test_data = read_data(testing)
report_metrics(model, test_data)



Precision: 0.8181818181818182
Recall: 0.6923076923076923
F1: 0.7500000000000001


Q1: Implement the fit() and predict() methods. Using the basic feature set and smoothing set to $\alpha=1$ you should get $F_1 = 0.75$


In [None]:
# Yes, fit() and predict() are implemented in the NaiveBayes class above.
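
As a sanity check on the reported numbers, $F_1$ is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values printed above.

```python
# Precision and recall copied from the printed output above
precision = 0.8181818181818182
recall = 0.6923076923076923

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.75
```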

Q2: What happens if you don't use smoothing? (try setting $\alpha=0$).

In [None]:
# trying with no smoothing
model_no_smoothing = NaiveBayes(alpha=0)

examples = read_data(training)
model_no_smoothing.fit(examples)

test_data = read_data(testing)
report_metrics(model_no_smoothing, test_data)

Precision: 1.0
Recall: 0.038461538461538464
F1: 0.07407407407407407


  p_neg_feat += np.log(self.p_x_neg[word])
  p_pos_feat += np.log(self.p_x_pos[word])


Reason:

Without smoothing, a word that appears in only one class during training has a count of zero in the other class, so its conditional probability for that class (stored in p_x_pos or p_x_neg) is exactly zero: with alpha=0, nothing is added to the numerator or the denominator. At test time, the score function sums the log probabilities, so any such word contributes log(0). np.log evaluates this to -inf and emits a divide-by-zero warning, and exponentiating -inf makes the score for that class 0. When this happens for both classes, np.argmax returns index 0 (the negative class), which is consistent with the very low recall above.
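
The effect of a zero probability can be reproduced in isolation. The sketch below is not the notebook's classifier; it only mirrors the arithmetic used in score() (np.log followed by math.e ** ...) on a hypothetical zero probability.

```python
import math
import warnings
import numpy as np

# With alpha=0, a word unseen in one class has probability exactly 0 for that class
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the divide-by-zero RuntimeWarning
    log_p = np.log(0.0)

print(log_p)            # -inf
print(math.e ** log_p)  # 0.0

# If both class scores collapse to 0.0, np.argmax breaks the tie with the first index
print(np.argmax([0.0, 0.0]))  # 0, i.e. the negative class
```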



Q3: Try to improve the performance by experimenting with preprocessing, tokenization and feature engineering. What do you observe? We managed to obtain $F_1=0.85$ just with better tokenization and simple preprocessing.

In [None]:
from nltk.stem import PorterStemmer


# inheriting the original NaiveBayes class and overriding the featurize method
class NaiveBayesWithTextProcessing(NaiveBayes):

  def __init__(self, alpha=0):
    super().__init__(alpha)


  # overriding the original method
  def featurize(self, data):
    """
      Feature extractor which tokenizes, stems, and removes stop words from the data
    """
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    # tokenization
    word_tokens = nltk.word_tokenize(data)

    # stemming
    words = [ps.stem(w) for w in word_tokens]

    # removing stop words (the comparison is against the stemmed forms)
    return [w for w in words if w not in stop_words]





In [None]:
# Model with text pre processing
model_text_processed = NaiveBayesWithTextProcessing(alpha=1)

examples = read_data(training)
model_text_processed.fit(examples)

test_data = read_data(testing)
report_metrics(model_text_processed, test_data)



Precision: 0.8275862068965517
Recall: 0.9230769230769231
F1: 0.8727272727272727


I observed that preprocessing the text (tokenizing with NLTK, stemming, and removing stop words) improves the classifier: $F_1$ rises from 0.75 to 0.87, mainly because recall increases from 0.69 to 0.92.

Reason:
Stop word removal eliminates common words that contribute little to the classification task. This reduces noise in the data and lets the classifier concentrate on the more informative features.

Word frequencies also affect the Naive Bayes scores, so removing very frequent, uninformative words reduces their influence on the scores, which reduces noise and improves performance.
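
One way to picture this is through each word's contribution to the difference between the two log scores. The sketch below uses hypothetical probabilities chosen for illustration. They are not estimated from the hotel review data, and `log_odds` is a helper name introduced here.

```python
import math

# Hypothetical smoothed probabilities P(word | class); illustrative values only
p_pos = {"the": 0.050, "great": 0.004}
p_neg = {"the": 0.049, "great": 0.001}

def log_odds(word):
    """Contribution of one occurrence of `word` to log-score(pos) minus log-score(neg)."""
    return math.log(p_pos[word]) - math.log(p_neg[word])

print(round(log_odds("the"), 3))    # 0.02
print(round(log_odds("great"), 3))  # 1.386
```

In this toy setting a single occurrence of "the" barely moves the decision, but a stop word that occurs many times in a review can add up to a shift driven by chance differences in the training counts rather than by sentiment.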