<a href="https://colab.research.google.com/github/kamto101/Natural-language-Processing/blob/main/Sentiment_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ANLP Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data. In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [None]:
candidateno=22211789 #this MUST be updated to your candidate number so that you get a unique data sample


In [None]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [None]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

def get_train_test_data():
    
    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')
   
    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]
   
    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [None]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['matthew', 'broderick', 'and', 'high', 'school', ...], 'pos')


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

def remove_stopwords(filtered):
  filter_words = [word for word in filtered if word.isalpha() and word not in stop]
  return filter_words

remove_stopwords(training_data[0][0])


['matthew',
 'broderick',
 'high',
 'school',
 'comedy',
 'two',
 'terms',
 'practically',
 'inseparable',
 'since',
 'ferris',
 'buehler',
 'took',
 'day',
 'years',
 'later',
 'broderick',
 'another',
 'high',
 'school',
 'comedy',
 'election',
 'show',
 'world',
 'ferris',
 'buehler',
 'day',
 'showed',
 'educational',
 'setting',
 'similar',
 'pile',
 'marshmallows',
 'light',
 'fluffy',
 'tasty',
 'sparkling',
 'clean',
 'election',
 'far',
 'different',
 'production',
 'dark',
 'frighteningly',
 'realistic',
 'one',
 'much',
 'entertain',
 'minutes',
 'occupies',
 'shocks',
 'well',
 'jim',
 'mcallister',
 'matthew',
 'broderick',
 'type',
 'teacher',
 'makes',
 'american',
 'high',
 'schools',
 'proud',
 'kind',
 'caring',
 'younger',
 'man',
 'built',
 'life',
 'around',
 'carver',
 'high',
 'school',
 'turn',
 'carver',
 'high',
 'school',
 'provided',
 'home',
 'jim',
 'teacher',
 'year',
 'less',
 'three',
 'times',
 'year',
 'span',
 'well',
 'respected',
 'social',
 'studi

In [None]:
# Convert each review into a bag of words (FreqDist) after removing stopwords.
training_data2 = [(FreqDist(remove_stopwords(wordlist)),label) for (wordlist,label) in training_data]
testing_data2 = [(FreqDist(remove_stopwords(wordlist)), label) for (wordlist,label) in testing_data]

testing_data2[0]

(FreqDist({'slade': 15, 'film': 13, 'rock': 13, 'really': 11, 'like': 11, 'era': 11, 'glam': 9, 'one': 8, 'movement': 8, 'wild': 8, ...}),
 'pos')

In [None]:

positive_frequency = FreqDist()
negative_frequency = FreqDist()

for review,label in training_data2:
  if label == 'pos':
    positive_frequency+=review
  else:
    negative_frequency+=review



In [None]:
positive_frequencyy = FreqDist()
negative_frequencyy = FreqDist()

for review,label in testing_data2:
  if label == 'pos':
    positive_frequencyy+=review
  else:
    negative_frequencyy+=review

In [None]:
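def content_words(positive,negative,b):
    # FreqDist subtraction keeps only the words whose count in the first
    # argument exceeds their count in the second argument.
    difference = positive - negative
    most_common_diff = difference.most_common()
    # return the b words with the largest surplus
    words = [word for (word,freq) in most_common_diff[:b]]
    return words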


In [None]:
neg_words = content_words(negative_frequency,positive_frequency,10)
print(neg_words)

['movie', 'bad', 'plot', 'script', 'worst', 'boring', 'even', 'get', 'nothing', 'could']


In [None]:
pos_words = content_words(positive_frequency,negative_frequency,10)
print(pos_words)

['film', 'life', 'also', 'one', 'great', 'world', 'well', 'many', 'story', 'best']


**Explanation**

To generate content words, the training data first has to be preprocessed, because uninformative tokens would otherwise dominate the counts. I filtered each review with NLTK's English stopword list to remove function words, and used `isalpha()` to drop punctuation and numbers, so that the focus is on content words.

I then used `FreqDist` to turn each review into a bag-of-words representation, which maps every word to its frequency in that review. A for loop then added the counts of each training review to either `positive_frequency` or `negative_frequency`, according to the review's label.

Next, I wrote a function, `content_words`, that takes three parameters. It subtracts the second frequency distribution from the first and returns the most common words in the difference. The subtraction keeps only words that occur more often in the first distribution than in the second, so the top words are those most over-represented in one class.

The list of positive words was created by calling the function with the positive frequencies, the negative frequencies (in that order) and the number of words needed.

The list of negative words was created by calling the function with the negative frequencies, the positive frequencies (in that order) and the number of words needed.
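
The subtraction step can be illustrated with a minimal sketch. It uses `collections.Counter`, which `FreqDist` subclasses, and the counts and the helper name `top_surplus_words` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Toy counts (hypothetical, not taken from the assignment data).
pos_counts = Counter({"film": 50, "great": 30, "bad": 5, "plot": 10})
neg_counts = Counter({"film": 40, "great": 8, "bad": 35, "plot": 12})

def top_surplus_words(first, second, n):
    """Return the n words whose count in `first` most exceeds their count in `second`."""
    # Counter subtraction drops words with a zero or negative difference.
    difference = first - second
    return [word for word, _ in difference.most_common(n)]

pos_words_toy = top_surplus_words(pos_counts, neg_counts, 2)
neg_words_toy = top_surplus_words(neg_counts, pos_counts, 2)
print(pos_words_toy)
print(neg_words_toy)
```

Note that a frequent but neutral word such as "film" can still make the list, because the ranking is based on raw count differences.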

2) 
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


In [None]:

from nltk.classify.api import ClassifierI
import random

class Wordlistclassifier(ClassifierI):


  def __init__(self,pos,neg):
    self.__pos = pos
    self.__neg = neg

  def classify(self,doc):
    score = 0 

    for word,value in doc.items():
        if word in self.__pos:
          score += value
        if word in self.__neg:
          score -= value

    return "neg" if score < 0 else "pos"


  def labels(self):
      return("pos","neg")
    


In [None]:
classifier = Wordlistclassifier(pos_words,neg_words)


In [None]:
classifier.classify(FreqDist("The movie is awful".split()))


'neg'

In [None]:
#To compare the performance of the naive bayes classifier and the word list classifier, we would

**Explanation**

To build the word list classifier, I defined a class, `Wordlistclassifier`, that inherits from NLTK's `ClassifierI`.

The class defines three methods. The constructor stores the positive and negative word lists. The `classify` method sets a score to zero and then loops over the (word, count) pairs in the document's frequency distribution. The `labels` method returns the two possible labels.

If a word is in the positive list, the score is incremented by its count, and if it is in the negative list, the score is reduced by its count. The classifier returns `neg` if the final score is less than zero and `pos` otherwise, so a score of exactly zero is classified as positive.

The word list classifier is therefore intended to label positive reviews as `pos` and negative reviews as `neg`.

3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

In [None]:
def accuracy_tester(cls,test_data):
  acc = 0 
  docs,goldstandard = zip(*test_data)  # use the argument rather than the global testing_data2
  predictions = cls.classify_many(docs)
  for prediction,goldlabel in zip(predictions,goldstandard):
    if prediction == goldlabel:
        acc+=1
  print(len(predictions))
  print(len(goldstandard))
  return acc/(len(test_data))
    


In [None]:
score = accuracy_tester(classifier, testing_data2)  
print(score)

600
600
0.64


In [None]:
class ConfusionMatrix:
  def __init__ (self,predictions,goldstandard,classes = ('pos','neg')):

      (self.c1,self.c2) = classes
      self.TP = 0
      self.FP = 0
      self.TN = 0
      self.FN = 0
      for p,g in zip(predictions,goldstandard):
        if g == self.c1:
            if p == self.c1:
                self.TP+=1
            else:
                self.FN+=1

        elif p == self.c1:
            self.FP+=1
        else:
            self.TN+=1


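  def precision(self):
      # fraction of predicted positives that are actually positive
      p = self.TP / (self.TP + self.FP)
      print(p)  # debug print: this is why precision is echoed in the outputs below
      return p

  def recall(self):
      # fraction of actual positives that were predicted positive
      r = self.TP / (self.TP + self.FN)
      return r

  def F1(self):
      # harmonic mean of precision and recall
      p = self.precision()
      r = self.recall()
      f1 = (2 * p * r) / (p + r)
      return f1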




In [None]:
docs,labels = zip(*testing_data2)


In [None]:
prediction = classifier.classify_many(docs)

In [None]:
calculation = ConfusionMatrix(classifier.classify_many(docs),labels)

In [None]:
print(calculation.TP)
print(calculation.FP)
print(calculation.TN)
print(calculation.FN)

275
191
109
25


In [None]:
calculation.precision()


0.5901287553648069


0.5901287553648069

In [None]:
calculation.recall()

0.9166666666666666

In [None]:
calculation.F1()

0.5901287553648069


0.7180156657963446
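
As a cross-check, the four metrics can be recomputed directly from the confusion-matrix counts printed above. This is a small sketch using only those four numbers; the variable names are illustrative.

```python
# Confusion-matrix counts copied from the output above.
TP, FP, TN, FN = 275, 191, 109, 25

accuracy = (TP + TN) / (TP + FP + TN + FN)   # correct predictions / all predictions
precision = TP / (TP + FP)                   # correct positives / predicted positives
recall = TP / (TP + FN)                      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The counts also show that the test set contains 300 positive and 300 negative reviews, and that the classifier predicts `pos` for 466 of the 600 reviews.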

**Explanation**

No, it is not reasonable to evaluate a classifier by accuracy alone. Accuracy is not always the best indicator of performance, mainly because of class imbalance.

Class imbalance occurs when one class (the minority class) contains far fewer samples than the other class (the majority class).

When there is class imbalance, a classifier can achieve high accuracy simply by always predicting the majority class, even though it never identifies the minority class.

This is because accuracy is a single figure aggregated over all classes, so it does not show how well the classifier works on each class separately. Precision and recall do. The test set here is balanced (300 positive and 300 negative reviews), so the accuracy of 0.64 is not inflated by imbalance, but it still hides the fact that recall is 0.92 while precision is only 0.59.

An example is spam detection. Suppose 95 percent of emails are not spam and 5 percent are spam. A classifier that predicts every email as non-spam is 95 percent accurate, yet it is only doing well on the majority class.

It fails completely on the minority class: its recall for spam is zero.
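
The spam scenario can be worked through in a few lines. This is a hypothetical mailbox of 100 emails, not data from the assignment, and the label names `ham` and `spam` are assumptions made for the example.

```python
# Hypothetical mailbox: 95 legitimate emails and 5 spam emails.
gold = ["ham"] * 95 + ["spam"] * 5
# A trivial classifier that labels every email as non-spam.
predicted = ["ham"] * len(gold)

# Accuracy: proportion of emails labelled correctly.
correct = sum(p == g for p, g in zip(predicted, gold))
accuracy = correct / len(gold)

# Recall for the spam class: spam caught / all spam.
spam_caught = sum(p == "spam" and g == "spam" for p, g in zip(predicted, gold))
spam_recall = spam_caught / gold.count("spam")

print(accuracy, spam_recall)
```

Accuracy looks strong here even though not a single spam email is detected, which is why per-class measures are needed when classes are imbalanced.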

4) 
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results. 

[12.5\%]

In [None]:
naive_bayes_classifier = nltk.NaiveBayesClassifier.train(training_data2)


In [None]:
print(nltk.classify.accuracy(naive_bayes_classifier,testing_data2))
naive_bayes_classifier.show_most_informative_features(20)



0.68
Most Informative Features
               insulting = 1                 neg : pos    =     15.0 : 1.0
             beautifully = 1                 pos : neg    =     13.0 : 1.0
             wonderfully = 1                 pos : neg    =     12.7 : 1.0
                   lousy = 1                 neg : pos    =     11.0 : 1.0
                   sucks = 1                 neg : pos    =     11.0 : 1.0
                terrific = 2                 pos : neg    =     11.0 : 1.0
               ludicrous = 1                 neg : pos    =     10.2 : 1.0
            breathtaking = 1                 pos : neg    =     10.2 : 1.0
                     bad = 5                 neg : pos    =      9.7 : 1.0
                  avoids = 1                 pos : neg    =      9.7 : 1.0
                    slip = 1                 pos : neg    =      9.7 : 1.0
               uplifting = 1                 pos : neg    =      9.7 : 1.0
             outstanding = 1                 pos : neg    =      9.2 

To compare the performance of the Naive Bayes classifier and the word list classifier, I will take some of the words that Naive Bayes found most indicative of negative and positive reviews and check whether the word list classifier labels sentences containing them as negative or positive.

The most informative features of the Naive Bayes classifier are shown above.

Naive Bayes negative words: lousy, sucks, insulting


Naive Bayes positive words: beautifully, wonderfully, terrific


I will now pass these words, in short sentences, to the word list classifier to see how it classifies them.


In [None]:
classifier.classify(FreqDist("The film is lousy".split()))


'pos'

In [None]:
classifier.classify(FreqDist("This movie sucks".split()))


'neg'

In [None]:
classifier.classify(FreqDist("The film is insulting".split()))


'pos'

In [None]:
classifier.classify(FreqDist("The movie is beautifully made".split()))


'neg'

In [None]:
classifier.classify(FreqDist("The movie is wonderfully made".split()))


'neg'

In [None]:
classifier.classify(FreqDist("The film is terrific".split()))


'pos'

**Results**

On the test data, the Naive Bayes classifier is more accurate (0.68) than the word list classifier (0.64). The words that Naive Bayes identified as positive and negative also match their usage in real life.

However, the word list classifier labelled sentences containing positive words as negative and sentences containing negative words as positive. Only two of the six sentences were classified correctly.

The underlying problem is that the word lists contain words that carry no positive or negative sentiment in real life: `film` is in the positive list and `movie` is in the negative list. None of the six sentiment words tested is in either list, so each sentence was classified purely by whether it contained `film` or `movie`. Any sentence containing `film` became a positive review and any sentence containing `movie` became a negative one.
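
To see which list entries drive each decision, the test sentences can be checked against the two word lists from Q1. The helper `matched_words` below is hypothetical and is not part of the classifier; it only reports which tokens are found in each list.

```python
# Word lists copied from the Q1 outputs above.
pos_words = ['film', 'life', 'also', 'one', 'great', 'world', 'well', 'many', 'story', 'best']
neg_words = ['movie', 'bad', 'plot', 'script', 'worst', 'boring', 'even', 'get', 'nothing', 'could']

def matched_words(sentence):
    """Hypothetical helper: return the tokens found in the positive and negative lists."""
    tokens = sentence.split()
    in_pos = [t for t in tokens if t in pos_words]
    in_neg = [t for t in tokens if t in neg_words]
    return in_pos, in_neg

sentences = [
    "The film is lousy",
    "This movie sucks",
    "The film is insulting",
    "The movie is beautifully made",
    "The movie is wonderfully made",
    "The film is terrific",
]
for s in sentences:
    print(s, matched_words(s))
```

In every sentence the only matched token is `film` or `movie`, which accounts for all six predictions.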

5) 
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions. 

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]
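
One possible way to set up such an experiment is sketched below. It is only an outline under stated assumptions, not the experiment carried out for this submission: it uses `collections.Counter` in place of `FreqDist`, toy data invented for illustration, and hypothetical helper names (`top_surplus_words`, `classify`, `accuracy_by_length`).

```python
from collections import Counter

def top_surplus_words(first, second, n):
    """The n words whose count in `first` most exceeds their count in `second`."""
    return [w for w, _ in (first - second).most_common(n)]

def classify(doc, pos_words, neg_words):
    """Same rule as the word list classifier: 'neg' only if the score is below zero."""
    score = 0
    for word, count in doc.items():
        if word in pos_words:
            score += count
        if word in neg_words:
            score -= count
    return "neg" if score < 0 else "pos"

def accuracy_by_length(pos_freq, neg_freq, test_data, lengths):
    """Map each word list length to the accuracy obtained on test_data."""
    results = {}
    for n in lengths:
        pos_words = set(top_surplus_words(pos_freq, neg_freq, n))
        neg_words = set(top_surplus_words(neg_freq, pos_freq, n))
        correct = sum(classify(doc, pos_words, neg_words) == label
                      for doc, label in test_data)
        results[n] = correct / len(test_data)
    return results

# Toy data (hypothetical) standing in for the class frequencies and test set.
pos_freq = Counter({"film": 9, "great": 6, "best": 3})
neg_freq = Counter({"movie": 8, "bad": 5, "boring": 2})
toy_test = [
    (Counter("a great film".split()), "pos"),
    (Counter("a boring film".split()), "neg"),
    (Counter("a bad movie".split()), "neg"),
    (Counter("the best movie".split()), "pos"),
]
results = accuracy_by_length(pos_freq, neg_freq, toy_test, [1, 3])
print(results)
```

In the notebook, the same function could be called with `positive_frequency`, `negative_frequency`, `testing_data2` and a range of list lengths, and the resulting dictionary plotted with matplotlib to produce the required graph.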


**Recommendation**

I would recommend a Naive Bayes classifier for future work in this area.

The Naive Bayes classifier is easier to implement: it requires less code than the word list classifier. Writing the word list classifier took me longer to understand and implement, whereas with NLTK the Naive Bayes classifier is trained in a single call.

The Naive Bayes classifier is faster to set up: because so little code has to be written, the work can be completed in a shorter amount of time, which makes it more efficient.

The Naive Bayes classifier makes better use of the data: it was more accurate on the test set (0.68 against 0.64). The words it found most informative were genuinely negative or positive, whereas the word list classifier relied on neutral words such as `film` and `movie`.



In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 437

import io
from nbformat import current
from google.colab import drive
drive.mount('/content/drive')
#filepath="/content/drive/My Drive/NLE Notebooks/assessment/assignment1.ipynb"
filepath="/content/drive/My Drive/AppNLP_Notebooks/22211789.ipynb"
question_count=437

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Submission length is 812
