# Data Preprocessing & Representation
I have chosen to model the data using dictionaries inside a dictionary. This is done for me by the get_word_statistics(targets) function which also counts all the necessary calculations I need to eventually calculate posterior probabilities eg. class frequencies, unique words in a class etc. The resulting dictionary has every word that appears in the full dataset mapping to a dictionary with the frequency of that word in each class. For example, the word "the" : {'B':18145,'A':1783,'E':24794,'V':1785} optimising runtime. Rather than using 0-1 attributes, I used the frequencies of the words as it improves the accuracy of the model since it means the words are weighted, meaning the probabilities are tuned to be more accurate. This is also very useful in regards to extending the implementation for better accuracy, as we can weight the words giving more frequent words appearing more significance. A binary classifier would cause a lot of information loss. Using frequencies rather than a binary approach can increase the runtime but since my representation of the data is collected efficiently, it did not make a big difference for me. I chose to represent the data in a dictionary rather than something such as a dataframe which I had also considered, because the complexity became very large using a dataframe such as CountVectorizer. The dictionary representation is very efficient and computationally minimal and does not require me to import many different packages and libraries, keeping the code simple but effective and consuming much less memory than a dataframe would. 
# Implementation & Improvement
I implemented the SNB by calling the get_word_statistics() and passing it the trg.csv file content. This generated the dictionary I needed for all my calculations, as well as the variables I need for calculations. I implemented a method called conditional calculates the conditional probability for a word appearing in a specific class using laplace smoothing which prevents 0 probabilities as well as overfitting of the data. I also implemented a posterior method which calculates the posterior probability for a class label and an abstract. Lastly I used a method called testing which takes one argument which is the test data. This runs the process of the SNB making class predictions. I tested my SND using tst.csv data as you can see below. I extended this model using the concept of data cleaning. Data cleaning can be done in different ways, eg.lowercasing or concatenating words etc. I have chosen to implement data cleaning by removing stop-words. Stop-words are words that are insignificant to the context of the data, such as "and". I made a dataclean() method to implement the extension of my SNB. I chose to do this since in my original word frequency I noticed the word "and" is one of the most common words to appear in this given dataset so I felt that the accuracy would improve by getting rid of these stopwords, such as "and". Data cleaning is an important part of preprocessing the data and by removing stopwords we decrease bias, and increase the accuracy as well as reducing overfitting of the dataset (counting evidence twice when it's just one piece of evidence). I could have further implemented the classifier by doing other forms of preprocessing but I was interested in seeing how much the accuracy would change from just removing stopwords since they were so frequent in the data. This is implemented in the dataclean() function I made.
# Evaluation
To evaluate the procedure I used cross-validation. I used a 5-fold cross validation as a 10-fold cross validation does not improve the accuracy much but increases the runtime a lot which means the classifier becomes very computationally consumptive. Training/validation split works by splitting the data up into a training set and validation set just once whereas cross-validation splits the data up multiple times and evaluates the accuracy by averaging the accuracies. This reduces optimisation bias and gives a better approximation of the accuracy which is why I chose to use cross-validation rather than training/validation split. The accuracy for the SNB classifier was 0.981, 98.1% whereas the accuracy for the extended implementation of the naive bayes classifier was 0.982, 98.2% This improvement in accuracy is very small despite what I thought would happen. I assume this is because the accuracy of the SNB implementation was already considerably high which means it is hard to improve such a high accuracy with just one form of data cleaning. To improve the accuracy more, I would need to implement more extensive data cleaning, and even consider implementing other improvements such as n-grams. However, we can still see the improvement in removing stop-words from the dataset, such small improvements add up in large datasets such as the one provided. 

In [650]:
#TASK 1
import math
import random
import sklearn
import nltk
from nltk.corpus import stopwords
import pandas as pd
import numpy as np

targets_read = pd.read_csv("datasets/trg.csv")
test_read = pd.read_csv("datasets/tst.csv")



labels = (targets_read["class"].unique()) #labels A,E,B,V
counting = targets_read["class"].value_counts().to_dict() # frequencies of the classes


prior_probs = {key: math.log(value) - math.log(sum(counting.values())) for key, value in counting.items()} #prior log probabilities
class_counts = targets_read["class"].value_counts(normalize=True).to_dict() #class frequencies 


def get_word_statistics(targets): #generates word frequency dictionary and other variables
    global word_frequencies, total_words_in_class, unique_words_in_class #allows access outside the function of the variables 
    
    for index, row in targets.iterrows():
        unique_words = set()
        
        for word in row["abstract"].split():
            unique_words.add(word)
            
            try:
                total_words_in_class[row["class"]] += 1         #counting total words in the particular class
            except:
                total_words_in_class[row["class"]] = 1
            
            try:
                try:
                    word_frequencies[word][row["class"]] += 1   #counting frequency of a word in a particular class
                except:
                    word_frequencies[word][row["class"]] = 1
            except:
                word_frequencies[word] = {row["class"]: 1}

        try:
            unique_words_in_class[row["class"]].update(unique_words)
        except:
            unique_words_in_class[row["class"]] = unique_words



In [655]:
#TASK 1
word_frequencies = {}
total_words_in_class = {}
unique_words_in_class = {}
get_word_statistics(targets_read)
total_unique = len(word_frequencies.keys())
#print(word_frequencies) #run this to see the data representation output!

In [651]:
#TASK 2

def conditional(label, word):
    if word in word_frequencies:
        try:
            word_count = word_frequencies[word][label] + 1 #laplace smoothing
        except KeyError:
            word_count = 1    # in case word of the test code is not in the training data
    else:
        word_count = 1 
    total_class_count = total_words_in_class[label] + total_unique #laplace smoothing for when probability is 0 
                                                                   #eg. the word is not in the training data when testing
    
    numerator = math.log(word_count) 
    denominator = math.log(total_class_count)
    result = numerator - denominator  #log(a/b) = log(a) - log(b)
    return result 

def posterior(words, label):
    probabilities = []
    for w in words:
        probabilities += [conditional(label, w)]    #log(a*b) = log(a) + log(b)
    return sum(probabilities) + prior_probs[label]  #log(a*b) = log(a) + log(b)
    
def testing(test_data):
    test_list = test_data.to_numpy()
    predictions = {}
    for item in test_list:

        i_d = item[0]
        content = item[1].split()
        probs = {}
        for l in labels:
            probs[l] = posterior(content, l) 

        maximum = {}
        for l in labels:
            maximum[l] = probs[l] + probs[l] - sum(value for value in probs.values()) #calculating posterior probability in log space

        predictions[i_d]= max(maximum, key = probs.get)
    return predictions
    



In [656]:
#TASK 2 code running for naive standard bayes on tst.csv
word_frequencies = {}
total_words_in_class = {}
unique_words_in_class = {}
get_word_statistics(targets_read)
total_unique = len(word_frequencies.keys())
#print(testing(test_read)) run this to see the output of the predictions!!


In [652]:
def dataclean(content):
    words = content.split()
    new_abstr = []
    for w in words:
        if (w not in set(stopwords.words("english"))): 
            new_abstr+=[w] #only adding words that are NOT stopwords
    
    return ' '.join(new_abstr) #returning abstract without stopwords

In [657]:
cleaned = targets_read.copy()
cleaned["abstract"] = cleaned['abstract'].apply(dataclean)
word_frequencies = {}
total_words_in_class = {}
unique_words_in_class = {}
get_word_statistics(cleaned)
total_unique = len(word_frequencies.keys())
#print(word_frequencies) run this to see the output of the cleaned data representation!


In [653]:
#TASK 3 
from sklearn.model_selection import train_test_split

def cross_validation(raw):

    average = 0
    
    for i in range(1, 6):
        word_frequencies = {}
        total_words_in_class = {}
        unique_words_in_class = {}  
    
        training_data = raw.sample(frac=0.8, random_state=22) #selecting random 80% of the data
        index = training_data.index #finding index of the 80% training data
        validation_data = raw.drop(index) #assigning validation_data to remaining 20% of the data
    
        labels = training_data["class"].unique()
        
        counting = training_data["class"].value_counts().to_dict()
        
        prior_probs = {key: math.log(value) - math.log(sum(counting.values())) for key, value in counting.items()}
        class_counts = training_data["class"].value_counts(normalize=True).to_dict() 

        #preliminary calculations used to calculate posterior probability
    
        get_word_statistics(training_data)     #training the data
        total_unique = len(word_frequencies.keys())
    
        validate = validation_data.drop(columns=['class'])  
        predicted_classes = testing(validate)      #testing validation data on the model trained by the training data
    
        correct = 0
        
        actual_classes = {}
        for index, row in validation_data.iterrows():
            actual_classes[ row["id"] ] = row["class"]
    
        for id_ , predicted_class in predicted_classes.items():
            if actual_classes[id_] == predicted_class:
                correct += 1
        average+=(correct/len(predicted_classes))        #calculating accuracy
    
    average = average/i    
    return average


In [648]:
targets_read = pd.read_csv("datasets/trg.csv")
standard = targets_read.copy()
print("The average accuracy when using a 5-fold cross-validation on the standard Naive Bayes implementation is {:.3f}".format(cross_validation(standard)))

The average accuracy when using a 5-fold cross-validation on the standard Naive Bayes implementation is 0.981


In [649]:
cleaned = targets_read.copy()
cleaned["abstract"] = cleaned['abstract'].apply(dataclean)
word_frequencies = {}
total_words_in_class = {}
unique_words_in_class = {}
get_word_statistics(cleaned)
total_unique = len(word_frequencies.keys())
print("The average accuracy when using a 5-fold cross-validation on the extended Naive Bayes implementation is {:.3f}".format(cross_validation(cleaned)))


The average accuracy when using a 5-fold cross-validation on the extended Naive Bayes implementation is 0.982
