# Lab 2: Sentiment Analysis
*   **Course**: CM52065 Natural Language Processing
*   **Author**: Dr. Andrew Barnes

## Overview
This lab will provide a chance for you to implement some of the representations we saw in this week's lectures.

The learning objectives for this lab are as follows:

1. Implement BoNGrams document representation using Python.
2. Implement TF-IDF transformation.
3. Apply document representations to a sentiment analysis use-case.

To achieve these we will be using a sentiment analysis use-case.

Sentiment analysis is the task of estimating the sentiment expressed in a given document. For example, 'The latest innovations from Company X were terrible.' has a strongly negative sentiment, whereas 'The employees were thrilled at the increase in wages.' has a positive sentiment.

## Dataset

The dataset we will be working with this week is a set of IMDB movie reviews, each labelled as either positive or negative.

### 1.1 Loading the dataset
You will need to make sure the `Lab 2 Dataset.tar.gz` file is in the same folder as this script and has been extracted into a folder called `Lab 2 Dataset`. I have provided a method below to load the data into both train and test for you.

If you are using the Google Colab interface, you will need to upload the `.tar.gz` file and then extract it by running the extraction code below.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_name = '/content/drive/MyDrive/MSc Advanced Machine Learning/Semester 2/Natural Language Processing/Week 2/Lab 2 Dataset.gz'

In [None]:
# ONLY RUN THIS CODE BLOCK IF YOU NEED TO EXTRACT A .tar.gz FILE ON GOOGLE COLAB
import tarfile
tar = tarfile.open(file_name, "r")
#tar = tarfile.open('Lab 2 Dataset.gz', "r")
tar.extractall()
tar.close()


In [None]:
import pandas as pd
import os

def load_dataset(split='train'):
  data = []
  sentiment = {"pos": 1, "neg": 0}
  split_path = os.path.join('./aclImdb/', split)
  for label in ["pos", "neg"]:
      label_path = os.path.join(split_path, label)
      for fname in os.listdir(label_path):
          if fname.endswith(".txt"):
              with open(os.path.join(label_path, fname), encoding="utf-8") as f:
                  text = f.read().strip()
                  data.append([sentiment[label], text])
  return pd.DataFrame(data, columns=['label', 'text']).sample(500)

print("Loading training data...")
dataset = load_dataset(split="train")
print(f"Loaded {len(dataset)} examples.")


Loading training data...
Loaded 500 examples.


Let's take a look at a couple of examples:

In [None]:
print("Example of a positive review:")
print(dataset[dataset['label'] == 1].iloc[50]['text'])
print("Example of a negative review:")
print(dataset[dataset['label'] == 0].iloc[50]['text'])

Example of a positive review:
Someone(or, something thing..)is leaving puncture marks on the jugular and draining victims of their blood till dead. Police detective Karl Brettschneider(Melvyn Douglas, before slipping out of the B-movie horror genre for greater heights)is stumped at who..or what..is behind these notorious crimes. The village is overcome by hysteria and Karl depends on his trusted medical genius, Dr. Otto von Niemann(Lionel Atwill, in yet another effective mad scientist role)to provide some feedback as to what might be causing the deaths of innocents. He also fears for the safety of his beloved Ruth(the lovely Fay Wray who stars for the third time with Atwill after "Doctor X" & "The Mystery of the Wax Museum")who is Niemann's assistant.<br /><br />Dwight Frye steals the film as a rather loony village idiot who collects bats and carries a demented demeanor wherever he goes..it's easy to see why he becomes a suspect as local paranoia is at a fever pitch. Maude Eburne provi

## Part 1: Representing the Corpus

The first step to achieving our goal of building a sentiment classifier is to build our document representations using the techniques we learned in the lectures.

### Bag of Words Representations

In this section you will implement the three types of representation we learned in the lectures this week. Don't worry though, we may start out from first principles but we will move to libraries soon!

#### Bag of Unigrams
Your first task is to implement a method which will return a Bag of Unigrams representation for a given set of documents. Remember, you will first need to generate a vocabulary, and then build the feature set.

In [None]:
def bag_of_unigrams(documents):
  vocabulary = {}
  dataset = []
  #### BEGIN CODE ####
  for document in documents:
    for word in document:
      if word not in vocabulary:
        vocabulary[word] = len(vocabulary)


  for document in documents:
    ds = [0]*len(vocabulary)
    for word in document:
      if word in vocabulary:
        x = vocabulary[word]
        ds[x] = 1
    dataset.append(ds)


  #### END CODE ####
  return dataset, vocabulary

Now try your method out with the test function below!

In [None]:
def test_unigram(method):
  test, testv = method([['i', 'love', 'dogs'], ['i', 'cats']])
  if sum(test[0]) == 3:
      print("Test doc #1 passed.")
  else:
      print(f"Test doc #1 failed. Expected something like [1, 1, 1, 0], got {test[0]}")
  if sum(test[1]) == 2:
      print("Test doc #2 passed.")
  else:
      print(f"Test doc #2 failed. Expected something like [1, 0, 0, 1], got {test[1]}")
test_unigram(bag_of_unigrams)

Test doc #1 passed.
Test doc #2 passed.
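
To make the representation concrete, here is a minimal sketch applied to the two test documents. It assumes, as the test above does, binary presence features and a vocabulary indexed in first-seen order; `toy_bag_of_unigrams` is an illustrative name rather than part of the lab template.

```python
def toy_bag_of_unigrams(documents):
    # Assign each unseen word the next free column index (first-seen order).
    vocabulary = {}
    for document in documents:
        for word in document:
            vocabulary.setdefault(word, len(vocabulary))

    # One row per document: 1 if the word occurs in the document, else 0.
    vectors = []
    for document in documents:
        row = [0] * len(vocabulary)
        for word in document:
            row[vocabulary[word]] = 1
        vectors.append(row)
    return vectors, vocabulary

vectors, vocabulary = toy_bag_of_unigrams([['i', 'love', 'dogs'], ['i', 'cats']])
print(vocabulary)  # {'i': 0, 'love': 1, 'dogs': 2, 'cats': 3}
print(vectors)     # [[1, 1, 1, 0], [1, 0, 0, 1]]
```

Each vector has one entry per vocabulary item, so two documents sharing only the word 'i' overlap in exactly one column.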


#### Bag of N-Grams

Now it's time to expand this function to implement Bag of N-grams, where N can be any integer ≥ 1 within a reasonable range: there is little point in choosing an N as large as a whole document unless you are looking for an exact match.

The process is very similar to your previous method but this time you need to be more expansive with your dictionary.

HINT: Using `defaultdict` (from Python's built-in `collections` module) will save you some code.
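
As a sketch of how the hint might be used: a `defaultdict` whose factory returns the current dictionary size hands out the next free index whenever an unseen n-gram is looked up. The helper names `extract_ngrams` and `build_vocabulary` are illustrative and not part of the lab template.

```python
from collections import defaultdict

def extract_ngrams(tokens, n):
    # Slide a window of width n over the tokens; documents shorter than n yield nothing.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(documents, n):
    # Looking up a missing key calls the factory, which returns the current size,
    # so every unseen n-gram is assigned the next free column index.
    vocabulary = defaultdict(lambda: len(vocabulary))
    for tokens in documents:
        for ngram in extract_ngrams(tokens, n):
            vocabulary[ngram]
    return dict(vocabulary)

print(extract_ngrams(['i', 'love', 'dogs'], 2))
# ['i love', 'love dogs']
print(build_vocabulary([['i', 'love', 'dogs'], ['like', 'cats']], 2))
# {'i love': 0, 'love dogs': 1, 'like cats': 2}
```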

In [None]:
from collections import defaultdict
def bag_of_ngrams(documents, n=1):
  vocabulary = {}
  dataset = []
  ### START CODE HERE ###
  for document in documents:
    for i in range(len(document)):
        if ' '.join(document[i:i+n]) not in vocabulary and len(document[i:i+n]) == n:
            vocabulary[' '.join(document[i:i+n])]= len(vocabulary)


  for document in documents:
      ds = [0]*len(vocabulary)
      for i in range(len(document)):
          if len(document[i:i+n]) == n and ' '.join(document[i:i+n]) in vocabulary:
              x = vocabulary[' '.join(document[i:i+n])]
              ds[x] = 1
      dataset.append(ds)

  ### END CODE HERE ###
  return dataset, vocabulary

Now it's time to test your method:

In [None]:
def test_bigram(method):
  test, testv = method([['i', 'love', 'dogs'], ['like', 'cats']], 2)
  if sum(test[0]) == 2:
      print("Test doc #1 passed.")
  else:
      print(f"Test doc #1 failed. Expected something like [1, 1, 0], got {test[0]}")
  if sum(test[1]) == 1:
      print("Test doc #2 passed.")
  else:
      print(f"Test doc #2 failed. Expected something like [0, 0, 1], got {test[1]}")
def test_trigram(method):
  test, testv = method([['i', 'love', 'dogs', 'sometimes'], ['sometimes', 'like', 'cats']], 3)
  if sum(test[0]) == 2:
      print("Test doc #1 passed.")
  else:
      print(f"Test doc #1 failed. Expected something like [1, 1, 0], got {test[0]}")
  if sum(test[1]) == 1:
      print("Test doc #2 passed.")
  else:
      print(f"Test doc #2 failed. Expected something like [0, 0, 1], got {test[1]}")
print("Testing Unigram")
test_unigram(bag_of_ngrams)
print("Testing Bigram")
test_bigram(bag_of_ngrams)
print("Testing Trigram")
test_trigram(bag_of_ngrams)


Testing Unigram
Test doc #1 passed.
Test doc #2 passed.
Testing Bigram
Test doc #1 passed.
Test doc #2 passed.
Testing Trigram
Test doc #1 passed.
Test doc #2 passed.


#### TF-IDF

It's now time to implement the TF-IDF method. However, because the TF-IDF representation is a modification to the Bag of N-Grams method we can use this to our advantage.
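
Several variants of TF-IDF exist. One common formulation, taken here as an assumption rather than as the definition from the lectures, normalises each count by the document's total count and uses a natural-log inverse document frequency:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Here $f_{t,d}$ is the value of feature $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents in which feature $t$ is non-zero. A feature that appears in every document has $\mathrm{idf}(t) = \ln 1 = 0$ and so contributes nothing.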

Complete the method below to implement TF-IDF on a given BoNGrams representation.

In [None]:
import numpy as np
def tfidf_transformation(feature_vectors):
  transformed_feature_vectors = []
  # Let's transform the feature vectors into a matrix
  feature_vectors = np.array(feature_vectors)

  ### START CODE HERE ###
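  # Term frequency: each count divided by its document's total count.
  # Empty documents (row total of 0) keep a TF of 0 instead of producing NaNs.
  row_totals = feature_vectors.sum(axis=1, keepdims=True)
  tf = np.divide(feature_vectors, row_totals,
                 out=np.zeros(feature_vectors.shape, dtype=float),
                 where=row_totals > 0)

  # Inverse document frequency: log(N / number of documents containing the term).
  N = feature_vectors.shape[0]
  df = np.count_nonzero(feature_vectors, axis=0)
  idf = np.log(N / df)

  transformed_feature_vectors = (tf * idf).tolist()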

  ### END CODE HERE ###

  return transformed_feature_vectors


Test your code below. You should see outputs along the following lines; the exact values depend on how you structured your vocabulary and on which TF and IDF variants you implemented:

1. `[[0, 0.23., 0.46., 0], [0, 0, 0, 2.38.]]`
2. `[[0.69., 0], [0, 0.69.]]`
3. `[[0.69., 0], [0, 0]]`


In [None]:
print(tfidf_transformation([[1, 1, 2, 0], [1, 0, 0, 4]]))
print(tfidf_transformation([[1, 0], [0, 1]]))
print(tfidf_transformation([[1, 1], [0, 500]]))

[[0.0, 0.17328679513998632, 0.34657359027997264, 0.0], [0.0, 0.0, 0.0, 0.5545177444479562]]
[[0.6931471805599453, 0.0], [0.0, 0.6931471805599453]]
[[0.34657359027997264, 0.0], [0.0, 0.0]]
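
The first output above can be reproduced by hand. The sketch below assumes that term frequency is the count divided by the document's total count and that IDF is the natural log of N over the document frequency; other TF-IDF variants will give different numbers.

```python
import math

counts = [[1, 1, 2, 0], [1, 0, 0, 4]]
n_docs = len(counts)

# Document frequency: in how many documents each column is non-zero.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
# Inverse document frequency with a natural log; a term in every document scores 0.
idf = [math.log(n_docs / d) for d in df]

tfidf = []
for row in counts:
    total = sum(row)  # total count in this document
    tfidf.append([(c / total) * w for c, w in zip(row, idf)])

print(df)  # [2, 1, 1, 1]
print(tfidf)
```

The first column is zero in both rows because that feature occurs in every document, so its IDF is ln(2/2) = 0.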


## Part 2: Data Wrangling

We now have our data encoding methods in place, so it's time to wrangle our data and transform it into something usable.

To do this we will be making use of a new library called spaCy, which is another popular NLP library. spaCy offers a range of preprocessing and tokenisation options for our corpus; take some time to explore spaCy's capabilities [here](https://www.geeksforgeeks.org/nlp/tokenization-using-spacy-library/). Whilst it may seem insignificant now, this library will become very useful later on.

To save time, you have been provided with a simple tokenisation example below, which is all you need for this lab. Please do modify the example sentences to get a feel for what is happening.

NOTE: You may notice that spaCy preserves letter case (e.g. 'the' and 'The' are different tokens) and keeps punctuation as separate tokens. For our purposes today this is fine, but consider how you could preprocess the reviews further.

In [None]:
import spacy
def process_document(corpus):
  nlp = spacy.blank("en")
  processed_corpus = []
  for d in corpus:
    doc = nlp(d) # Tokenise
    doc_tokens = []
    for token in doc:
      if not token.is_stop: # Check for stop words
        doc_tokens.append(token.text)
    processed_corpus.append(doc_tokens)
  return processed_corpus

In [None]:
print(process_document(["this is a new document given to spacy."]))
print(process_document(["The dog in the hole barked loudly into the wind."]))

[['new', 'document', 'given', 'spacy', '.']]
[['dog', 'hole', 'barked', 'loudly', 'wind', '.']]
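
The outputs above still contain punctuation tokens such as '.', and a capitalised token would stay distinct from its lowercase form. As one possible direction for further preprocessing, the sketch below lowercases tokens and drops those made up entirely of punctuation. It works on plain lists of token strings; `normalise_tokens` is a hypothetical helper and only recognises ASCII punctuation from the standard library.

```python
import string

def normalise_tokens(tokens):
    # Lowercase every token and drop tokens made up only of ASCII punctuation
    # (empty strings are dropped as well).
    normalised = []
    for token in tokens:
        if all(ch in string.punctuation for ch in token):
            continue
        normalised.append(token.lower())
    return normalised

print(normalise_tokens(['The', 'dog', 'barked', 'loudly', '.']))
# ['the', 'dog', 'barked', 'loudly']
```

Whether this helps is an empirical question: punctuation such as '!' can carry sentiment, so it is worth comparing results with and without it.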


Now it's your turn! Expand the template method provided below to process and encode the datasets provided.

You will need to use:
1. Preprocessing `process_document(...)`
2. Bag of N-Grams `bag_of_ngrams(...)`
3. TF-IDF `tfidf_transformation(...)`

In [None]:
def encode_corpus(corpus, n=2, tfidf=False):
  """
  `n`:: The 'n' in the n-Gram calculation.
  `tfidf`:: Whether to apply TF-IDF transformation.
  """
  encoding = corpus
  ### START CODE ###

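  encoding = process_document(corpus)       # tokenise and remove stop words
  encoding, _ = bag_of_ngrams(encoding, n)  # binary bag-of-n-grams vectors
  if tfidf:                                 # only re-weight when requested
    encoding = tfidf_transformation(encoding)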
  ### END CODE ###
  return np.array(encoding)

Next we will use the code above to generate training and test datasets for different scenarios:

1. Unigrams without TF-IDF
2. Unigrams with TF-IDF
3. Trigrams without TF-IDF
4. Trigrams with TF-IDF

In [None]:
print("Encoding Unigram case...")
ug_raw = encode_corpus(dataset['text'], n=1)
print("Encoding Trigram case...")
tg_raw = encode_corpus(dataset['text'], n=3)
print("Encoding Unigram TF-IDF case...")
ug_tfidf = encode_corpus(dataset['text'], n=1, tfidf=True)
print("Encoding Trigram TF-IDF case...")
tg_tfidf = encode_corpus(dataset['text'], n=3, tfidf=True)

Encoding Unigram case...
Encoding Trigram case...
Encoding Unigram TF-IDF case...
Encoding Trigram TF-IDF case...


If this seemed eerily fast, that's because we only used a random sample of 500 documents from the training data (check out the dataset loading code to see this). You are more than welcome to increase this, but be warned: the size of your feature vectors is going to get large.

For context, I've provided some informative printouts below. Check these, and how much RAM you have, before you increase the size!

In [None]:
import math

def convert_size(size_bytes):
   if size_bytes == 0:
       return "0B"
   size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
   i = int(math.floor(math.log(size_bytes, 1024)))
   p = math.pow(1024, i)
   s = round(size_bytes / p, 2)
   return "%s %s" % (s, size_name[i])

print(f"Size of the Unigram feature set: {ug_raw.shape[0]}x{ug_raw.shape[1]} which takes up {convert_size(ug_raw.nbytes)}.")
print(f"Size of the Unigram TF-IDF feature set: {ug_tfidf.shape[0]}x{ug_tfidf.shape[1]} which takes up {convert_size(ug_tfidf.nbytes)}.")
print(f"Size of the Trigram feature set: {tg_raw.shape[0]}x{tg_raw.shape[1]} which takes up {convert_size(tg_raw.nbytes)}.")
print(f"Size of the Trigram TF-IDF feature set: {tg_tfidf.shape[0]}x{tg_tfidf.shape[1]} which takes up {convert_size(tg_tfidf.nbytes)}.")


Size of the Unigram feature set: 500x52134 which takes up 198.88 MB.
Size of the Unigram TF-IDF feature set: 500x52134 which takes up 198.88 MB.
Size of the Trigram feature set: 500x64807 which takes up 247.22 MB.
Size of the Trigram TF-IDF feature set: 500x64807 which takes up 247.22 MB.
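
These figures follow directly from the matrix shape: a dense array stores every cell, so its size is rows × columns × bytes per value. The sketch below assumes 8-byte `float64` values, NumPy's default floating-point type; `dense_size_mb` is an illustrative helper.

```python
def dense_size_mb(rows, cols, bytes_per_value=8):
    # Size of a dense matrix in MB, using 1 MB = 1024 * 1024 bytes as above.
    return rows * cols * bytes_per_value / (1024 ** 2)

print(round(dense_size_mb(500, 52134), 2))  # 198.88
print(round(dense_size_mb(500, 64807), 2))  # 247.22
```

Memory grows with the number of documents and with the vocabulary size, and the vocabulary itself grows as more documents are added, so try a modest increase before committing to a large one.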


> **Question:** Why does the trigram approach result in a larger vocabulary?

> **Question:** How would you expect the vocabulary size to change as we increase `n` to larger values? (e.g. `n=5`, `n=10`)

## Part 3: Building a Classifier

Now let's build our Naive-Bayes Classifier.

Populate the method below to accept a training dataset and output a Naive-Bayes model.

Hint: If you're stuck on how to create the classifier, try going [here](https://scikit-learn.org/stable/modules/naive_bayes.html).

In [None]:
from sklearn.naive_bayes import GaussianNB
def train_model(encoded_corpus, labels):
  model = GaussianNB()
  ### START CODE HERE ###

  model.fit(encoded_corpus, labels)

  ### END CODE HERE
  return model

Here's some code to split the data and create the models using your method above.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data for a unigram approach
ug_raw_train, ug_raw_test, ug_raw_labels_train, ug_raw_labels_test = train_test_split(
    ug_raw, dataset['label'], test_size=0.33, random_state=42)
# Split the data for a unigram approach with TF-IDF
ug_tfidf_train, ug_tfidf_test, ug_tfidf_labels_train, ug_tfidf_labels_test = train_test_split(
    ug_tfidf, dataset['label'], test_size=0.33, random_state=42)
# Split the data for a trigram approach
tg_raw_train, tg_raw_test, tg_raw_labels_train, tg_raw_labels_test = train_test_split(
    tg_raw, dataset['label'], test_size=0.33, random_state=42)
# Split the data for a trigram approach with TF-IDF
tg_tfidf_train, tg_tfidf_test, tg_tfidf_labels_train, tg_tfidf_labels_test = train_test_split(
    tg_tfidf, dataset['label'], test_size=0.33, random_state=42)

In [None]:
print("Training unigram model...")
ug_raw_model = train_model(ug_raw_train, ug_raw_labels_train)
print("Training unigram with TF-IDF model...")
ug_tfidf_model = train_model(ug_tfidf_train, ug_tfidf_labels_train)
print("Training trigram model...")
tg_raw_model = train_model(tg_raw_train, tg_raw_labels_train)
print("Training trigram with TF-IDF model...")
tg_tfidf_model = train_model(tg_tfidf_train, tg_tfidf_labels_train)

Training unigram model...
Training unigram with TF-IDF model...
Training trigram model...
Training trigram with TF-IDF model...


We now need to test our classifiers. Using the imported metrics, perform an evaluation of the four encoding schemes.

As you did this in the last lab, I've included code to evaluate the models.

In [None]:
from sklearn.metrics import precision_score, recall_score, accuracy_score

def evaluate_model(model_name, model, test_features, test_labels):
  predictions = model.predict(test_features)
  precision = precision_score(test_labels, predictions)
  recall = recall_score(test_labels, predictions)
  accuracy = accuracy_score(test_labels, predictions)
  # The scores are fractions between 0 and 1, so no percent sign is printed.
  print(f"{model_name}: Acc. - {accuracy:.2f}, Prec. - {precision:.2f}, Rec. - {recall:.2f}")

In [None]:
evaluate_model("Unigram Raw", ug_raw_model, ug_raw_test, ug_raw_labels_test)
evaluate_model("Unigram TF-IDF", ug_tfidf_model, ug_tfidf_test, ug_tfidf_labels_test)
evaluate_model("Trigram Raw", tg_raw_model, tg_raw_test, tg_raw_labels_test)
evaluate_model("Trigram TF-IDF", tg_tfidf_model, tg_tfidf_test, tg_tfidf_labels_test)

Unigram Raw: Acc. - 0.67, Prec. - 0.64, Rec. - 0.58
Unigram TF-IDF: Acc. - 0.67, Prec. - 0.64, Rec. - 0.58
Trigram Raw: Acc. - 0.58, Prec. - 0.52, Rec. - 0.68
Trigram TF-IDF: Acc. - 0.58, Prec. - 0.52, Rec. - 0.68
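
For reference, the three metrics reported above can be computed by hand from the true labels and the predictions. The sketch below treats label 1 (positive review) as the positive class and returns fractions between 0 and 1; `binary_metrics` is an illustrative helper, not the scikit-learn implementation.

```python
def binary_metrics(true_labels, predictions):
    # Count the confusion-matrix cells for the positive class (label 1).
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

accuracy, precision, recall = binary_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(accuracy, precision, recall)
```

In this toy case three of five predictions are correct (accuracy 0.6), and there are two true positives against one false positive and one false negative, so precision and recall are both 2/3.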


> **Question:** Can you explain why the models produced different results?

> **Question:** What effect does the train/test split have on the results?

> **Question:** What changes could you make to further improve the results?

## Wrap-up
And that's the end of this week's lab! I hope you enjoyed implementing and using the representations we learned this week.

If you fancy an additional challenge, can you implement sparsity reduction techniques and evaluate their ability to improve results?
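
One possible starting point, assuming "sparsity reduction" is taken to mean discarding n-grams that appear in very few documents: the sketch below removes every column whose document frequency falls below a threshold. `prune_rare_features` and its `min_df` parameter are illustrative names, and the inputs are in the same format as the output of the bag-of-n-grams methods above.

```python
def prune_rare_features(feature_vectors, vocabulary, min_df=2):
    # Document frequency of each column: number of documents with a non-zero entry.
    n_features = len(vocabulary)
    df = [sum(1 for row in feature_vectors if row[j] > 0) for j in range(n_features)]

    # Keep columns seen in at least min_df documents, preserving their order.
    kept = [j for j in range(n_features) if df[j] >= min_df]
    pruned_vectors = [[row[j] for j in kept] for row in feature_vectors]

    # Re-index the surviving terms so the vocabulary matches the new columns.
    index_to_term = {index: term for term, index in vocabulary.items()}
    pruned_vocabulary = {index_to_term[j]: new_j for new_j, j in enumerate(kept)}
    return pruned_vectors, pruned_vocabulary

vectors = [[1, 1, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0]]
vocabulary = {'i': 0, 'love': 1, 'dogs': 2, 'cats': 3}
pruned_vectors, pruned_vocabulary = prune_rare_features(vectors, vocabulary)
print(pruned_vectors)     # [[1, 1], [1, 0], [1, 1]]
print(pruned_vocabulary)  # {'i': 0, 'dogs': 1}
```

To evaluate the effect fairly, the threshold should be chosen using the training split only and the same columns then kept for the test split.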