# NLE Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and load some data. To provide each student with a unique dataset for analysis, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [1]:
candidateno=246611 #this MUST be updated to your candidate number so that you get a unique data sample


In [2]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


In [3]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

def get_train_test_data():
    
    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')
   
    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]
   
    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [4]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['"', 'last', 'night', '"', 'could', 'have', 'an', '"', ...], 'pos')


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why.

[20\%]

In [20]:
# I kinda wanna do this with a class: after a year of Java I have been seduced by the ways of OOP
# I'm gonna make my own LabelledFreqDict class.
# Each label (pos, neg, weather, football etc.) will hash to a two-item list A where:
# A[0] is a dictionary mapping tokens to their frequency under that label
# A[1] is the total number of tokens (counting repeats) that have been observed under that label
class LabelledFreqDict:
  def __init__(self):
    self.labels = dict()
    self.total = 0
  
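  def get_freq(self, token, label):
    # Read-only lookup: .get() avoids inserting zero-count entries (which
    # .setdefault() would do), and an unseen label is not created as a side effect
    if label not in self.labels:
      return 0
    return self.labels[label][0].get(token, 0)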

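  def p_label(self, label):
    # Nothing counted yet: avoid dividing by zero
    if self.total == 0:
      return 0
    return self.label_total(label) / self.total

  # p(label|token)
  def p_label_given(self, label, token):
    # Bayes' law babyyyyyyyy
    p_token = self.p_token(token)
    if p_token == 0:
      # Unseen token: fall back to a uniform guess over the known labels
      return 1 / len(self.labels) if self.labels else 0
    return (self.p_token_given(token, label) * self.p_label(label)) / p_token

  def p_token(self, token):
    if self.total == 0:
      return 0
    return self.total_of_token(token) / self.total

  # p(token|label)
  def p_token_given(self, token, label):
    label_total = self.label_total(label)
    # A label with no tokens would otherwise raise ZeroDivisionError
    if label_total == 0:
      return 0
    return self.get_freq(token, label) / label_total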

  # Returns the Dictionary at the given key, or a new empty one if the key does not exist yet
  def label_tokens(self, label):
    if label not in self.labels.keys():
      self.labels[label] = [dict(), 0]
    return self.labels[label][0]

  # Returns the total number of tokens in the dictionary at the given key, or 0 if there is
  # nothing at the given key. Note it does NOT create a new dictionary
  def label_total(self, label):
    if label not in self.labels.keys():
      return 0
    return self.labels[label][1]
  
  # Returns the set of all unique tokens that have been seen under any label.
  # Note that we don't need a special function for the unique tokens of a single label,
  # as we can just call .keys() on that label's dictionary
  def unique_tokens(self):
    u_t = set()

    for v in self.labels.values():
      for t in v[0].keys():
        u_t.add(t)
    
    return u_t
  
  # Applies add n smoothing to all tokens across all labels
  # 
  def add_smoothing(self, n):
    pass
  
  # Returns the total number of times the given token appears across all labels
  def total_of_token(self, token):
    # .get() keeps this a pure lookup, so no zero-count entries are created
    return sum(self.labels[l][0].get(token, 0) for l in self.labels)
  
  # Adds to the frequency of tokens at a given label
  def add_tokens(self, tokens, label):
    stop_words = set(stopwords.words('english'))
    freq_dict = self.label_tokens(label)
    
    for t in tokens:
      if t in stop_words:
        continue
      freq_dict[t] = freq_dict.setdefault(t, 0) + 1
      self.labels[label][1] += 1
      self.total += 1
    


In [22]:
term_freq = LabelledFreqDict()
for t in training_data:
  term_freq.add_tokens(t[0], t[1])
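
The `add_smoothing` method in the class above is still a stub. As a rough sketch of what add-n smoothing could compute, the function below works on a plain token-to-count dictionary rather than on the class itself. The name `smoothed_p_token_given` and the toy counts are assumptions made for illustration, not part of the assignment code.

```python
def smoothed_p_token_given(token, label_counts, vocab_size, n=1.0):
    """Add-n estimate of p(token | label).

    label_counts: dict mapping token -> raw count under one label
    vocab_size:   number of distinct tokens seen across all labels
    n:            pseudo-count added to every token in the vocabulary
    """
    total = sum(label_counts.values())
    return (label_counts.get(token, 0) + n) / (total + n * vocab_size)


# Toy counts (hypothetical, not taken from the movie review data)
pos_counts = {"great": 3, "fun": 1}
neg_counts = {"dull": 2, "fun": 1}
vocab = set(pos_counts) | set(neg_counts)

p_seen = smoothed_p_token_given("great", pos_counts, len(vocab))
p_unseen = smoothed_p_token_given("dull", pos_counts, len(vocab))
print(p_seen, p_unseen)
```

With smoothing, a token never seen under a label still gets a small non-zero probability, and the estimates over the whole vocabulary still sum to 1 for each label.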

In [25]:
test_word = "pizza"

case_1 = term_freq.p_label_given('pos', test_word)
case_2 = term_freq.p_label_given('neg', test_word)

print(case_1)
print(case_2)
print(abs((case_1 + case_2) - 1) < 0.01)

# I have run the test above with multiple test words and concluded that I can use
# p(a | t) = 1 - p(b | t), for a given token t and a BINARY label classification
# system of a and b, to save computational time


0.6666666666666667
0.33333333333333337
True
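
The check above passes by construction rather than by coincidence: with two labels, p(t) = p(t|a)p(a) + p(t|b)p(b), so both posteriors share a denominator equal to the sum of their numerators. The sketch below illustrates this with exact fractions. All counts are hypothetical values chosen for the example, not values read from the corpus.

```python
from fractions import Fraction

# Hypothetical counts of one token under each label
token_counts = {"pos": 2, "neg": 1}
# Hypothetical total number of tokens observed under each label
label_totals = {"pos": 50, "neg": 40}
grand_total = sum(label_totals.values())


def posterior(label):
    """p(label | token) via Bayes' rule, using exact fractions."""
    p_token_given_label = Fraction(token_counts[label], label_totals[label])
    p_label = Fraction(label_totals[label], grand_total)
    p_token = Fraction(sum(token_counts.values()), grand_total)
    return p_token_given_label * p_label / p_token


p_pos = posterior("pos")
p_neg = posterior("neg")
print(p_pos, p_neg, p_pos + p_neg)
```

The label totals cancel out, so with unsmoothed estimates the posterior reduces to count(token, label) / count(token).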


In [None]:
# Lemme just write stuff out as my brain is super fried rn
# My plan is to compare words based on some tf/df-style value (term frequency
# combined with document frequency); this should reduce the risk of picking
# words that occur super frequently in one specific review

255
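
One way to read the plan above is to count each word at most once per review (its document frequency) and rank words by the difference between the positive and negative counts. This is only a sketch of that reading, not necessarily the measure the note has in mind. The helpers `doc_freq` and `most_representative` and the toy reviews are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Count, for each word, the number of documents it appears in."""
    counts = Counter()
    for words in docs:
        counts.update(set(words))  # set() so repeats within a document count once
    return counts


def most_representative(target_docs, other_docs, k):
    """Top-k words by (doc freq in target) minus (doc freq in other)."""
    target_df = doc_freq(target_docs)
    other_df = doc_freq(other_docs)
    diff = {w: target_df[w] - other_df[w] for w in target_df}
    # Sort by descending difference, then alphabetically for a stable order
    return sorted(diff, key=lambda w: (-diff[w], w))[:k]


# Toy reviews (hypothetical): "pizza" is repeated heavily in a single review
pos_docs = [["great", "pizza", "pizza", "pizza", "pizza", "pizza"],
            ["great", "fun"],
            ["great", "plot"]]
neg_docs = [["dull", "plot"], ["dull", "fun"], ["dull", "mess"]]

top_pos = most_representative(pos_docs, neg_docs, 1)
print(top_pos)
```

By raw term frequency "pizza" would outrank "great" in the toy positive reviews, but because it appears in only one of them, its document frequency is 1 and "great" is ranked first.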

2) 
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]
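
As a starting point for part (a), a word list classifier can be sketched as a function that counts how many tokens of a review fall in each list and picks the label with more hits. The function name `classify_by_wordlist`, the tie-breaking `default` parameter, and the toy word lists are assumptions made for illustration.

```python
def classify_by_wordlist(words, pos_words, neg_words, default="pos"):
    """Label a tokenised review by counting hits against two word lists."""
    pos_set, neg_set = set(pos_words), set(neg_words)
    pos_hits = sum(1 for w in words if w in pos_set)
    neg_hits = sum(1 for w in words if w in neg_set)
    if pos_hits > neg_hits:
        return "pos"
    if neg_hits > pos_hits:
        return "neg"
    return default  # tie (including no hits at all): fall back to the default


# Toy word lists (hypothetical, not generated from the training data)
pos_words = ["great", "brilliant"]
neg_words = ["dull", "mess"]

label_a = classify_by_wordlist(["a", "great", "film"], pos_words, neg_words)
label_b = classify_by_wordlist(["a", "dull", "dull", "great", "mess"], pos_words, neg_words)
label_c = classify_by_wordlist(["nothing", "matches"], pos_words, neg_words)
print(label_a, label_b, label_c)
```

How ties are broken is a design decision worth stating in the written answer, since reviews that match neither list all receive the default label.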


3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]
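
The four metrics in part (a) can all be derived from counts of true positives, false positives and false negatives. The sketch below shows one way to do this from two parallel lists of labels. The function name `evaluate`, the choice of "pos" as the positive class, and the toy labels are assumptions made for illustration.

```python
def evaluate(gold, predicted, positive="pos"):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    # Each ratio falls back to 0.0 when its denominator is zero
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical, not results from the real classifier)
gold = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "neg"]

scores = evaluate(gold, predicted)
print(scores)
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 2/3 while accuracy is 3/5.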

4) 
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results. 

[12.5\%]

5) 
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions. 

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]


In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 437

import io
import nbformat  # nbformat.current is deprecated; use the top-level read() instead

#filepath="/content/drive/My Drive/NLE Notebooks/assessment/assignment1.ipynb"
filepath="/content/drive/My Drive/Colab Notebooks/NLE Notebooks/Assignment1/NLassignment2022.ipynb"
question_count=437

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)  # convert to the v4 notebook format on read

word_count = 0
for cell in nb.cells:  # v4 notebooks keep cells at the top level, not in worksheets
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

FileNotFoundError: ignored