# NLE Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data. In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [1]:
candidateno=219060 #this MUST be updated to your candidate number so that you get a unique data sample


In [2]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [3]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

def get_train_test_data():
    
    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')
   
    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]
   
    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [4]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['the', 'opening', 'crawl', 'tells', 'us', 'that', ...], 'pos')


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

In [5]:
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
import string
from nltk.probability import FreqDist
stops = set(stopwords.words('english'))

#Takes in a list of words to normalise. Pure numbers become "NUM" and tokens that start with a digit (e.g. "1st") become "Nth"
def numberNormalise (tokenList):
    newList = list()
    for token in tokenList:
        if token.isdigit():
            newList.append("NUM")
        elif token[:1].isdigit():
            newList.append("Nth")
        else:
            newList.append(token)
    return(newList)

#Takes in a list of words to normalise. All stopwords are removed
def stopwordNormalise (tokenList):
    newList = list()
    stops = set(stopwords.words('english'))
    for x in tokenList:
        if x not in stops:
            newList.append(x)
    return(newList)

#Takes in a list of words to normalise. All words are returned in lower case form
def caseNormalise (tokenList):
    newList = list()
    for x in tokenList:
        newList.append(x.lower())
    return(newList)
        
#Takes in a list of words to normalise. A list is returned with all the words with all the punctuation removed
def punctuationNormalise (tokenList):
    newList = list()
    for x in tokenList:
        if x not in string.punctuation:
            newList.append(x)
    return(newList)
    
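#Returns the input list after it has been normalised. Case is normalised first so that stopwords (which are stored in lower case) are matched regardless of capitalisation
def normalise (tokenList):
    return punctuationNormalise(numberNormalise(stopwordNormalise(caseNormalise(tokenList))))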

#Takes the frequency distribution of words in all documents and returns the difference of word occurrences between positive and negative reviews
def set_frequency_difference(AllDocFreqDist):
    #creates a separate frequency distribution for pos and neg reviews
    pos_freq_dist=FreqDist()
    neg_freq_dist=FreqDist()
    #lists of (word, difference) tuples for words that are more frequent in one class than in the other
    pos_freq_diff = list()
    neg_freq_diff = list()
    
    for reviewDist,label in AllDocFreqDist:
        if label=='pos':
            pos_freq_dist+=reviewDist
        else:
            neg_freq_dist+=reviewDist    
    
    for word in pos_freq_dist:
        if word in neg_freq_dist:
            diff = pos_freq_dist[word]-neg_freq_dist[word]
            if diff>0:
                tempTuple = word, diff
                pos_freq_diff.append(tempTuple)
    for word in neg_freq_dist:
        if word in pos_freq_dist:
            diff = neg_freq_dist[word]-pos_freq_dist[word]
            if diff>0:
                tempTuple = word, diff
                neg_freq_diff.append(tempTuple)
    
    pos_freq_diff.sort(key=lambda a:a[1], reverse = True)
    neg_freq_diff.sort(key=lambda a:a[1], reverse = True)
    return pos_freq_diff, neg_freq_diff

#returns the first k words from a list of (word, frequency difference) tuples that is sorted from largest to smallest difference
def most_frequent_words(word_diffs, k):
    return [word for word, diff in word_diffs[:k]]
    
AllDocFreqDist = list()
for words, label in training_data:
    #one frequency distribution per normalised review, paired with its label
    AllDocFreqDist.append((FreqDist(normalise(words)), label))
    
pos_words, neg_words = set_frequency_difference(AllDocFreqDist)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Michael\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
positive_words = most_frequent_words(pos_words, 10)
print(positive_words)

['film', 'life', 'also', 'well', 'best', 'great', 'many', 'story', 'world', 'films']


In [7]:
negative_words = most_frequent_words(neg_words, 10)
print(negative_words)

['movie', 'bad', 'plot', 'get', 'worst', 'nothing', 'supposed', 'boring', 'stupid', 'script']


I have created several functions in order to normalise the words in the documents (reviews). I then created a frequency distribution for each document. These frequency distributions are then combined by label into two overall distributions (one for positive reviews and the other for negative reviews).

The two overall distributions are then compared. Two lists of tuples are created, one for positive words and one for negative words. Each tuple contains a word that appears in both positive and negative reviews and the number of extra times it appears in one type of review compared with the other. Each tuple in the positive list contains a word that appears more often in positive reviews than in negative reviews, together with how many more times it appears, e.g. ("film", 680) means that the word "film" appears 680 more times in the positive reviews than in the negative reviews. The negative list is the opposite. Words that occur in only one type of review are not included, because the function only compares words found in both distributions.
Each list is then sorted from most occurrences to fewest occurrences.

To get a list of the k most distinctive words for either positive or negative reviews, the function "most_frequent_words" is used. It returns the first 'k' words in the tuple list (where 'k' is the number of words to be returned).
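
The comparison described above can be illustrated on a handful of made-up reviews. This is only a minimal sketch: `toy_reviews` and `frequency_difference` are hypothetical names, and `collections.Counter` stands in for NLTK's `FreqDist` so that the example runs on its own.

```python
from collections import Counter

# Hypothetical toy reviews, already normalised (not taken from the corpus)
toy_reviews = [
    (["great", "film", "great", "story"], "pos"),
    (["film", "best", "story"], "pos"),
    (["bad", "film", "boring", "story"], "neg"),
    (["bad", "plot", "film", "film"], "neg"),
]

def frequency_difference(labelled_docs):
    """Rank words found in both classes by how much more often they occur in one class."""
    pos_counts, neg_counts = Counter(), Counter()
    for words, label in labelled_docs:
        if label == "pos":
            pos_counts.update(words)
        else:
            neg_counts.update(words)
    # only words seen in both classes are compared
    shared = set(pos_counts) & set(neg_counts)
    pos_diffs = [(w, pos_counts[w] - neg_counts[w]) for w in shared if pos_counts[w] > neg_counts[w]]
    neg_diffs = [(w, neg_counts[w] - pos_counts[w]) for w in shared if neg_counts[w] > pos_counts[w]]
    pos_diffs.sort(key=lambda pair: pair[1], reverse=True)
    neg_diffs.sort(key=lambda pair: pair[1], reverse=True)
    return pos_diffs, neg_diffs

toy_pos_diffs, toy_neg_diffs = frequency_difference(toy_reviews)
print(toy_pos_diffs)  # [('story', 1)]
print(toy_neg_diffs)  # [('film', 1)]
```

Note that "great" does not appear in either list even though it only occurs in positive reviews: restricting the comparison to words found in both classes leaves it out.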


The numberNormalise function normalises all numbers that are passed into it. Numbers get turned into "NUM" and ordinals, e.g. "1st", are changed into "Nth".
The stopwordNormalise function removes stop words from the input list of words.
The caseNormalise function returns all input words in lower case.
The punctuationNormalise function removes punctuation.
The normalise function calls the above functions. This makes the code easier to read, as one function can be called in-line instead of several.
The set_frequency_difference function takes the frequency distribution of words in all documents and returns the difference in word occurrences between positive and negative reviews.
The most_frequent_words function returns the most frequent words from a word distribution.
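
As a rough illustration of what the normalisation steps do to a single token list, the sketch below applies them in one pass. `normalise_tokens` and `TOY_STOPS` are hypothetical names, and the tiny stopword set is an assumption made so the example runs without NLTK (the notebook itself uses NLTK's English stopword list).

```python
import string

# Small hypothetical stopword set; the notebook uses NLTK's English list instead
TOY_STOPS = {"the", "is", "a", "of"}

def normalise_tokens(tokens, stops=TOY_STOPS):
    """Lower-case tokens, drop stopwords and punctuation, and replace numbers."""
    normalised = []
    for token in tokens:
        token = token.lower()  # lower-case first so stopwords match
        if token in stops or token in string.punctuation:
            continue
        if token.isdigit():
            normalised.append("NUM")   # pure numbers
        elif token[:1].isdigit():
            normalised.append("Nth")   # tokens such as "2nd"
        else:
            normalised.append(token)
    return normalised

example = normalise_tokens(["The", "2nd", "film", "of", "1999", "is", "great", "!"])
print(example)  # ['Nth', 'film', 'NUM', 'great']
```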

2) 
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


In [8]:
#Takes in the documents, the positive word list and the negative word list and, based on whether there are more occurrences of positive or negative words, classifies each document as being positive or negative
def classify_documents(documents, pos_words, neg_words):
    pos_doc_list = list()
    neg_doc_list = list()
    for document in documents:
        pos_count = 0
        neg_count = 0
        freqDist = FreqDist(normalise(document[0]))
        for word in freqDist:
            if word in pos_words:
                pos_count = pos_count + freqDist[word]
            if word in neg_words:
                neg_count = neg_count + freqDist[word]
        if pos_count > neg_count:
            doc_class = ["pos"]
        elif neg_count > pos_count:
            doc_class = ["neg"]
        else:
            doc_class = ["pos", "neg"]
        if random.choice(doc_class) == "pos":
            pos_doc_list.append(document)
        else:
            neg_doc_list.append(document)
    return pos_doc_list, neg_doc_list

pos_doc_list, neg_doc_list = classify_documents(testing_data, most_frequent_words(pos_words, 10), most_frequent_words(neg_words, 10))
print(len(pos_doc_list))
print(len(neg_doc_list))

429
171


I created a function that takes the testing data (the list of documents that need to be classified) as input. For every document in the list, the function normalises the document and creates a frequency distribution for it. Every word in the frequency distribution is compared to the positive and negative word lists created in part 1. If the word being checked is in either word list, the frequency of that word is added to the respective count. After all words in the review have been checked, the label with the higher count ("pos" or "neg") is added to the "doc_class" list. If the counts are the same, both labels are added to the list. A label is then chosen at random from this list and assigned to the document (if only one label is in the list, only that label can be chosen), and the document is added to the list for that label.
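
The decision rule for a single review can be sketched as follows. This is a simplified stand-in and not the notebook's `classify_documents`: `classify_with_wordlists`, `toy_pos` and `toy_neg` are hypothetical names, and the word lists are made up for illustration.

```python
import random
from collections import Counter

def classify_with_wordlists(tokens, pos_words, neg_words, rng=random):
    """Label one tokenised review by comparing counts of positive and negative list words."""
    counts = Counter(tokens)
    pos_score = sum(counts[word] for word in pos_words)
    neg_score = sum(counts[word] for word in neg_words)
    if pos_score > neg_score:
        return "pos"
    if neg_score > pos_score:
        return "neg"
    # tie (including no matches at all): choose a label at random
    return rng.choice(["pos", "neg"])

toy_pos = {"great", "best"}   # hypothetical positive word list
toy_neg = {"bad", "boring"}   # hypothetical negative word list
label_a = classify_with_wordlists(["great", "great", "bad"], toy_pos, toy_neg)
label_b = classify_with_wordlists(["boring", "bad", "best"], toy_pos, toy_neg)
label_c = classify_with_wordlists(["film"], toy_pos, toy_neg)
print(label_a, label_b)  # pos neg
```

A review with no list words, such as `["film"]`, is a tie, so its label is random. With short word lists this can happen often, which is one reason the length of the lists matters.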

3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

In [9]:
#Takes in the positive and negative lists of documents and counts how many are correctly and incorrectly sorted
def sort_correctness(pos_list, neg_list, data = testing_data):
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    for review in data:
        if review[1] == "pos":
            if review in pos_list:
                TP = TP+1
            else:
                FN = FN+1
        else:
            if review in neg_list:
                TN = TN+1
            else:
                FP = FP+1
    return TP, FP, TN, FN

#Takes in the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) and returns the accuracy of the classifier
def calculate_accuracy(TP, FP, TN, FN):
    return ((TP+TN)/(TP+FP+TN+FN))

#Takes in the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) and returns the precision of the classifier
def calculate_precision(TP, FP, TN, FN):
    return (TP/(TP+FP))

#Takes in the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) and returns the recall of the classifier
def calculate_recall(TP, FP, TN, FN):
    return (TP/(TP+FN))

#Takes in the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) and returns the F1 score of the classifier
def calculate_F1_score(TP, FP, TN, FN):
    return ((2*calculate_precision(TP, FP, TN, FN)*calculate_recall(TP, FP, TN, FN))/((calculate_precision(TP, FP, TN, FN))+calculate_recall(TP, FP, TN, FN)))

TP, FP, TN, FN = sort_correctness(pos_doc_list, neg_doc_list)
print(TP, FP, TN, FN)

261 168 132 39


In [10]:
print("Accuracy: ", calculate_accuracy(TP, FP, TN, FN))
print("Precision: ", calculate_precision(TP, FP, TN, FN))
print("Recall: ", calculate_recall(TP, FP, TN, FN))
print("F1 score: ", calculate_F1_score(TP, FP, TN, FN))

Accuracy:  0.655
Precision:  0.6083916083916084
Recall:  0.87
F1 score:  0.7160493827160495


The sort_correctness function counts and returns the True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) of the classifier.
The 'calculate' functions then use these counts to return the accuracy, precision, recall and F1 score.
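
As a cross-check, the four scores can be recomputed directly from the counts printed above (261, 168, 132, 39). The helper name `metrics_from_counts` is hypothetical, and the sketch assumes none of the denominators is zero. It also shows that F1 can be written without computing precision and recall first, as 2TP / (2TP + FP + FN).

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of the reviews labelled positive, how many really are
    recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = metrics_from_counts(261, 168, 132, 39)

# Equivalent form of F1 written directly in terms of the counts
f1_direct = 2 * 261 / (2 * 261 + 168 + 39)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.655 0.608 0.87 0.716
```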

It is not always a good idea to judge the effectiveness of a classifier based on its accuracy.
Accuracy is a good measure of how well a classifier works on a balanced data set, such as the testing data here (TP+FN = 300 positive reviews and TN+FP = 300 negative reviews), but it becomes less informative if the data set is skewed. If the data set is skewed and the classifier has a bias in the same direction as the skew, the classifier will appear to be working very accurately when in reality it may just be guessing which class the data belongs in.
As the guesses would follow the bias, it would appear to be working well. E.g. if 90% of the input data is negative and the classifier only ever guesses negative, it would achieve 90% accuracy (as only 10% would be classed wrong) without identifying a single positive review. If you were to use this same classifier with a balanced data set, the accuracy would drop to 50%.
Due to this, accuracy is only a reliable measure with balanced data sets. This is a problem, as you cannot rely on having a balanced data set when the classifier is used on real data (rather than training/testing data).

Because of this, precision, recall and F1 score are a much better way of judging a classifier's effectiveness when the classes are not balanced.
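
The 90% example can be checked with a few lines of code. The data below is hypothetical (100 made-up labels, 90 negative and 10 positive), and the "classifier" is simply a list of constant guesses.

```python
# Hypothetical skewed test set: 90 negative reviews and 10 positive reviews
true_labels = ["neg"] * 90 + ["pos"] * 10
# A "classifier" that always guesses negative
predictions = ["neg"] * len(true_labels)

pairs = list(zip(true_labels, predictions))
tp = sum(1 for truth, guess in pairs if truth == "pos" and guess == "pos")
fn = sum(1 for truth, guess in pairs if truth == "pos" and guess == "neg")
tn = sum(1 for truth, guess in pairs if truth == "neg" and guess == "neg")
fp = sum(1 for truth, guess in pairs if truth == "neg" and guess == "pos")

skewed_accuracy = (tp + tn) / len(pairs)
skewed_recall = tp / (tp + fn)

print(skewed_accuracy)  # 0.9
print(skewed_recall)    # 0.0
```

Accuracy looks high, but recall for the positive class is zero. Precision is undefined here because nothing was labelled positive (TP + FP = 0).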

4) 
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results. 

[12.5\%]

In [None]:
from nltk.probability import FreqDist

#Returns the probability that the training data has positive or negative labels
def prob_of_positive(pos_docs, neg_docs):
    total = len(pos_docs)+len(neg_docs)
    return (len(pos_docs)/total)

#Returns the probability of each word in a given list of (word, frequency) tuples
def prob_of_words(freq_dist):
    word_prob = list()
    #the total is the sum of all frequencies, so that the probabilities add up to 1
    total = sum(freq for word, freq in freq_dist)
    for word, freq in freq_dist:
        prob = freq / total
        prob_tuple = (word, prob)
        word_prob.append(prob_tuple)
    return word_prob
    
#Takes a list of mixed "pos" and "neg" documents and returns two lists, one of "pos" documents and the other of "neg" documents
def split_pos_neg(documents):
    pos_docs = list()
    neg_docs = list()
    for document in documents:
        if document[1] == "pos":
            pos_docs.append(document)
        else:
            neg_docs.append(document)
    return pos_docs, neg_docs

#Returns a list containing one frequency distribution per document
def frequency_distribution_in_documents(documents):
    docs_freq_dist = list()
    for document, label in documents:
        normalDoc = normalise(document)
        freqDist = FreqDist(normalDoc)
        docs_freq_dist.append(freqDist)
    return docs_freq_dist

#Performs the bayes calculation on the document that is inputted and then assigns a class based on the answer
def calculate_bayes(document, word_prob_list, class_prob_tuple):
    pos_class = class_prob_tuple[0]
    neg_class = class_prob_tuple[1]
    pos_prob = pos_class
    neg_prob = neg_class
    
    for word in document:
        for word_prob, label in word_prob_list:
            #word_prob is a (word, probability) tuple; stop searching at the first entry for this word
            if word == word_prob[0]:
                if label == "pos":
                    pos_prob = pos_prob*(word_prob[1]*pos_class)
                else:
                    neg_prob = neg_prob*(word_prob[1]*neg_class)
                break
            
    #keep only the class with the higher probability; keep both if they are equal
    if pos_prob > neg_prob:
        class_prob = {"pos": pos_prob}
    elif neg_prob > pos_prob:
        class_prob = {"neg": neg_prob}
    else:
        class_prob = {"pos": pos_prob, "neg":neg_prob}
    classes=list(class_prob.keys())
    return random.choice(classes)

#Performs the naive bayes calculation on all documents 
def naive_bayes_calculation(docs, freq_dist_tuple, pos_prob):
    documents = list()
    pos_docs = list()
    neg_docs = list()
    word_prob = list()

    for each in prob_of_words(freq_dist_tuple[0]):
        prob_tuple = (each, "pos")
        word_prob.append(prob_tuple)
    for each in prob_of_words(freq_dist_tuple[1]):
        prob_tuple = (each, "neg")
        word_prob.append(prob_tuple)
    class_prob_tuple = (pos_prob, (1-pos_prob))
    for document in docs:
        documents.append(normalise(document[0]))
    doc_tuples = list()
    for document in documents:
        doc_class = calculate_bayes(document, word_prob, class_prob_tuple)
        doc_tuple = (document, doc_class)
        doc_tuples.append(doc_tuple)
    return doc_tuples

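#the class probability is estimated from the training data, so that no labels from the testing data are used by the classifier
pos_docs, neg_docs = split_pos_neg(training_data)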
freq_dist_tuple = (pos_words, neg_words)
docs_tuple = (pos_docs, neg_docs)

classified_docs=naive_bayes_calculation(testing_data, freq_dist_tuple, prob_of_positive(pos_docs, neg_docs))

#This part of the code is used to count the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). This part of the code does not work
TP = 0
TN = 0
FP = 0
FN = 0

#classified_docs is in the same order as testing_data, so each prediction can be compared with the label at the same position
for (document, label), review in zip(classified_docs, testing_data):
    if review[1] == "pos":
        if label == "pos":
            TP = TP+1
        else:
            FN = FN+1
    else:
        if label == "neg":
            TN = TN+1
        else:
            FP = FP+1
print(TP, FP, TN, FN)


In [None]:
print("Accuracy: ", calculate_accuracy(TP, FP, TN, FN))
print("Precision: ", calculate_precision(TP, FP, TN, FN))
print("Recall: ", calculate_recall(TP, FP, TN, FN))
print("F1 score: ", calculate_F1_score(TP, FP, TN, FN))

In the code above, documents get sorted into positive and negative with a naive Bayes approach. The probability of a positive and of a negative review being chosen at random is calculated. The probability that a given word is in a given type of review is also calculated (the frequency of that word divided by the total frequency of all words in reviews of that type). These two values are then multiplied together to give a probability of that word being in a document of a given class. This probability is then built upon, until all words in the document have been reached, by being multiplied by the next word's probability of being in the same class.

After this has been completed for both positive and negative classes, a class is assigned to the document. This is done by adding the class with the highest probability to "class_prob". If both classes have the same probability, they are both added. From this, a random class is chosen and assigned.

The number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) is then calculated along with the accuracy, precision, recall and F1 score.


The prob_of_positive function takes the positive and negative document lists and returns the probability that a document chosen at random is positive.
The prob_of_words function returns the probability of each word in a given frequency distribution. It adds up all of the frequencies in the frequency distribution and divides each frequency by this total.
The split_pos_neg function takes a list of mixed "pos" and "neg" documents and returns two lists, one of "pos" documents and the other of "neg" documents. These are then fed into the prob_of_positive function.
The frequency_distribution_in_documents function goes through each document, normalises it and builds a frequency distribution for it. It returns these frequency distributions as a list ("docs_freq_dist").
The calculate_bayes function performs the Bayes calculation on each document that is inputted and then assigns a class based on the answer. Whichever class has the larger value (the probability that the document is positive or the probability that it is negative) gets added to "class_prob". A key from "class_prob" is chosen at random to be the label of the document.
The naive_bayes_calculation function runs the calculate_bayes function on all documents.
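
For comparison, below is a minimal sketch of the textbook multinomial Naive Bayes formulation. It is not the code above. It assumes per-class word counts taken from all training words, add-one (Laplace) smoothing so that unseen words do not give a zero probability, the class prior applied once per document, and sums of logarithms instead of products to avoid underflow on long reviews. All names (`train_naive_bayes`, `classify_naive_bayes`, `toy_training`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(labelled_docs):
    """Collect class priors, per-class word counts and the vocabulary."""
    class_doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in labelled_docs:
        class_doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_doc_counts.values())
    priors = {label: n / total_docs for label, n in class_doc_counts.items()}
    return priors, word_counts, vocab

def classify_naive_bayes(tokens, priors, word_counts, vocab):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)  # the prior is applied once per document
        for token in tokens:
            if token not in vocab:
                continue  # words never seen in training are ignored
            # add-one smoothing: every vocabulary word gets a non-zero probability
            score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, already normalised
toy_training = [
    (["great", "film"], "pos"),
    (["best", "story", "great"], "pos"),
    (["bad", "film"], "neg"),
    (["boring", "plot", "bad"], "neg"),
]
priors, word_counts, vocab = train_naive_bayes(toy_training)
nb_label_a = classify_naive_bayes(["great", "story"], priors, word_counts, vocab)
nb_label_b = classify_naive_bayes(["bad", "plot", "film"], priors, word_counts, vocab)
print(nb_label_a, nb_label_b)  # pos neg
```

In this sketch every word in the vocabulary contributes to the score of both classes, so a word that is frequent in positive reviews raises the positive score relative to the negative one.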

5) 
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions. 

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]


Initial hypothesis: The larger the word lists are, the longer it will take to classify each document, but documents will be classified more accurately. To measure this, I will use the values for the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). To measure the time taken, the time library will be used. I will then plot a graph of accuracy, precision, recall and F1 score, a graph of the time taken to classify with each word list length, and a graph of F1 score per second.

In [None]:
import time

AllDocFreqDist = list()
for words, label in training_data:
    #one frequency distribution per normalised review, paired with its label
    AllDocFreqDist.append((FreqDist(normalise(words)), label))
pos_words, neg_words = set_frequency_difference(AllDocFreqDist)

#Creates word lists of length k, classifies the testing data with them and returns the time taken and the correctness counts (TP, FP, TN, FN)
def run_wordlist_experiment(k):
    pos_k = most_frequent_words(pos_words, k)
    neg_k = most_frequent_words(neg_words, k)
    start_time = time.time()
    pos_doc_list, neg_doc_list = classify_documents(testing_data, pos_k, neg_k)
    classify_time = (time.time() - start_time)
    print("\n{} word classifier took: ".format(k), classify_time)
    return classify_time, sort_correctness(pos_doc_list, neg_doc_list)

#Word list lengths 10 to 100 in steps of 10: creation, classification and correctness
classify_time10, correctness_10 = run_wordlist_experiment(10)
classify_time20, correctness_20 = run_wordlist_experiment(20)
classify_time30, correctness_30 = run_wordlist_experiment(30)
classify_time40, correctness_40 = run_wordlist_experiment(40)
classify_time50, correctness_50 = run_wordlist_experiment(50)
classify_time60, correctness_60 = run_wordlist_experiment(60)
classify_time70, correctness_70 = run_wordlist_experiment(70)
classify_time80, correctness_80 = run_wordlist_experiment(80)
classify_time90, correctness_90 = run_wordlist_experiment(90)
classify_time100, correctness_100 = run_wordlist_experiment(100)

In [None]:
# Map each word list size to its (TP, FP, TN, FN) tuple
correctness_by_size = {"10": correctness_10, "20": correctness_20, "30": correctness_30, "40": correctness_40, "50": correctness_50,
                       "60": correctness_60, "70": correctness_70, "80": correctness_80, "90": correctness_90, "100": correctness_100}
# Map each column name to the function that computes it from (TP, FP, TN, FN)
metric_functions = {"Accuracy": calculate_accuracy, "Precision": calculate_precision, "Recall": calculate_recall, "F1_score": calculate_F1_score}
# Nested dict {metric: {size: score}}, giving one DataFrame column per metric
results = {name: {size: metric(*counts) for size, counts in correctness_by_size.items()} for name, metric in metric_functions.items()}
correctness = pd.DataFrame(results)
correctness_graph = correctness.plot(kind="line",title="Experiment Results")
correctness_graph.set_ylabel("Score")
correctness_graph.set_xlabel("Word List Size")

In [None]:
times = {"Classify time":{"10":classify_time10, "20":classify_time20, "30":classify_time30, "40":classify_time40, "50":classify_time50, "60":classify_time60, "70":classify_time70, "80":classify_time80, "90":classify_time90, "100":classify_time100}}
exp_times = pd.DataFrame(times)
times_graph = exp_times.plot(kind="line",title="Time to classify Results")
times_graph.set_ylabel("Time Taken (s)")
times_graph.set_xlabel("Word List Size")

In [None]:
# F1 score achieved per second of classification time, for each word list size
sizes = ["10", "20", "30", "40", "50", "60", "70", "80", "90", "100"]
all_counts = [correctness_10, correctness_20, correctness_30, correctness_40, correctness_50, correctness_60, correctness_70, correctness_80, correctness_90, correctness_100]
all_times = [classify_time10, classify_time20, classify_time30, classify_time40, classify_time50, classify_time60, classify_time70, classify_time80, classify_time90, classify_time100]
# The column label matches what is computed: F1 score divided by time
time_f1_ratio = {"F1 Score/Time": {size: calculate_F1_score(*counts) / seconds for size, counts, seconds in zip(sizes, all_counts, all_times)}}
time_f1 = pd.DataFrame(time_f1_ratio)
time_f1_graph = time_f1.plot(kind="line",title="F1 Score/Time Ratio")
time_f1_graph.set_ylabel("F1 Score Per Second")
time_f1_graph.set_xlabel("Word List Size")

This experiment investigates how the length of the word lists used by a word list classifier affects the classifier's running time and correctness.

Ten pairs of word lists (positive and negative) are created, with lengths from 10 to 100 words in increments of 10. These values were chosen to provide a wide range: the longest list is 10 times longer than the shortest, which shows how far the accuracy, precision, recall and F1 score change with list length.

For each word list length, the time taken to classify the test documents is recorded and plotted (see above).
A second graph shows the accuracy, precision, recall and F1 score for each word list length.

Method:
After all training documents are normalised and frequency distributions of positive and negative words are created, the first N words are taken from each distribution. N starts at 10 and increases by 10 until word lists of 100 words are created. For each pair of lists, the testing documents are classified. The time taken for each classification run is recorded, along with the number of true positives, true negatives, false positives and false negatives it produces. From these counts, the accuracy, precision, recall and F1 score are calculated and plotted. Another graph shows the time taken to classify the documents for each word list length.
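
The method can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's actual code: the helper names (`top_n_words`, `classify`, `confusion_counts`, `scores`), the tiny training and test sets, and the rule that ties go to "pos" are all assumptions made for the example.

```python
import time
from collections import Counter

def top_n_words(docs, n):
    """Return the n most frequent words across a list of tokenised documents."""
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(n)]

def classify(doc, pos_words, neg_words):
    """Label a document by comparing its positive and negative word list hits."""
    score = sum(w in pos_words for w in doc) - sum(w in neg_words for w in doc)
    return "pos" if score >= 0 else "neg"  # assumption: ties go to "pos"

def confusion_counts(predictions, labels):
    """Count (TP, FP, TN, FN), treating "pos" as the positive class."""
    pairs = list(zip(predictions, labels))
    TP = sum(p == "pos" and l == "pos" for p, l in pairs)
    FP = sum(p == "pos" and l == "neg" for p, l in pairs)
    TN = sum(p == "neg" and l == "neg" for p, l in pairs)
    FN = sum(p == "neg" and l == "pos" for p, l in pairs)
    return TP, FP, TN, FN

def scores(TP, FP, TN, FN):
    """Return (accuracy, precision, recall, F1), using 0.0 when a ratio is undefined."""
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical toy data standing in for the normalised movie reviews
train_pos = [["great", "fun", "film"], ["great", "acting"]]
train_neg = [["dull", "boring", "film"], ["boring", "plot"]]
test = [(["great", "film"], "pos"), (["boring", "film"], "neg"),
        (["fun", "plot"], "pos"), (["dull", "acting"], "neg")]

results = {}
for n in (1, 2):  # the experiment itself uses 10, 20, ..., 100
    pos_words = set(top_n_words(train_pos, n))
    neg_words = set(top_n_words(train_neg, n))
    start = time.perf_counter()
    predictions = [classify(doc, pos_words, neg_words) for doc, _ in test]
    elapsed = time.perf_counter() - start  # classification time only
    counts = confusion_counts(predictions, [label for _, label in test])
    results[n] = (counts, scores(*counts), elapsed)
```

Only the classification step sits between the two timer calls, so building the word lists does not count towards the recorded time.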

b)
As the word list grows, both the F1 score of the classifier and the time taken to classify increase. The two do not increase at the same rate, however: classification time grows faster. The F1 score gained per second of classification time therefore gets worse as the word lists get longer. For this reason, word list classifiers are best avoided for large numbers of documents: the more documents there are, the longer the word lists need to be to minimise the number of random choices made by the classifier.
For very large data sets, a Naive Bayes classifier would produce similar (or better) results faster, so I would recommend a Naive Bayes classifier for larger data sets.
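
For comparison, the sketch below shows what a small multinomial Naive Bayes classifier with add-one smoothing could look like. It is an illustration only and was not run as part of this experiment: the function names `train_naive_bayes` and `classify_naive_bayes` and the toy training data are assumptions, and it makes no claim about speed or accuracy on the real data.

```python
import math
from collections import Counter

def train_naive_bayes(labelled_docs):
    """Estimate log priors and add-one smoothed log likelihoods from (tokens, label) pairs."""
    label_counts = Counter(label for _, label in labelled_docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    log_prior = {label: math.log(n / total_docs) for label, n in label_counts.items()}
    log_likelihood = {}
    for label, counts in word_counts.items():
        denominator = sum(counts.values()) + len(vocab)  # add-one smoothing
        log_likelihood[label] = {w: math.log((counts[w] + 1) / denominator) for w in vocab}
    return log_prior, log_likelihood, vocab

def classify_naive_bayes(tokens, model):
    """Return the label with the highest log posterior; unseen words are ignored."""
    log_prior, log_likelihood, vocab = model
    best_label, best_score = None, -math.inf
    for label in log_prior:
        score = log_prior[label] + sum(log_likelihood[label][w] for w in tokens if w in vocab)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
training = [(["great", "fun", "film"], "pos"), (["great", "acting"], "pos"),
            (["dull", "boring", "film"], "neg"), (["boring", "plot"], "neg")]
model = train_naive_bayes(training)
prediction_a = classify_naive_bayes(["great", "film"], model)
prediction_b = classify_naive_bayes(["boring", "plot"], model)
```

Unlike the word list classifier, this uses every word in the training vocabulary, so there is no list length to tune.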

In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 437

import io
from nbformat import current

#filepath="/content/drive/My Drive/NLE Notebooks/assessment/assignment1.ipynb"
filepath="NLassignment2021.ipynb"
question_count=437

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))