# Data Representation (and Preprocessing) for Standard Naive Bayes (SNB)

## Pre-processing

Before any model can be applied, the data has to be processed into a form the algorithms can use. Since each record is just a block of text (an abstract), the first step is to separate the words that make up the abstract. For this I used the nltk library to split each abstract into individual words. Next, I remove stop words (filler words that don't add any meaning to the text) and some punctuation (full stops and commas). Lastly, I used a function to count the number of times each word appears across all abstracts of each class, so we can see which words are more common in that class and use Naive Bayes to predict the class. There is another tool, lemmatizing, that reduces words to their root forms, but if you follow along in the code section, I found that it decreased my accuracy on the testing set from 0.976 to 0.97. This could have many causes, but tracking them down would be quite time consuming.
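The steps described above can be sketched with the standard library alone. This is only an illustration: the regex tokenizer, the tiny `STOP_WORDS` set and the `count_words_per_class` helper are stand-ins I am assuming here, not the NLTK tools used in the code section, and the two example abstracts are made up.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer)
STOP_WORDS = {"the", "of", "a", "is", "and", "in", "to"}

def tokenize(text):
    """Lower-case the text and keep runs of letters, digits and inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def count_words_per_class(rows):
    """rows: iterable of (class_label, abstract) pairs -> {label: Counter of words}."""
    counts = {}
    for label, abstract in rows:
        tokens = [t for t in tokenize(abstract) if t not in STOP_WORDS]
        counts.setdefault(label, Counter()).update(tokens)
    return counts

# Made-up example rows, one per class
rows = [
    ("B", "The genome of the bacterium is circular."),
    ("E", "The human genome and the mouse genome."),
]
counts = count_words_per_class(rows)
print(counts["B"])
print(counts["E"])
```

Punctuation disappears here because the regex never matches it, whereas a tokenizer that emits punctuation tokens would need them filtered separately.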

## Proposed extensions

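For my Naive Bayes I tried four iterations, each intended to improve on the previous one. The four I used were:
1. Normal frequency weighting --> i.e. rather than only checking whether a word exists, count the number of times it appears and use these counts as weights for each word. The more often a word shows up in one specific class, the more likely an abstract containing it belongs to that class.
2. Inverse document frequency --> i.e. weight each word by log(number of abstracts / number of abstracts containing the word), so words that appear in almost every abstract count for less.
3. Complement --> i.e. estimate each word's probability from the abstracts of all the other classes, and score a class by how unlikely the abstract's words are outside it.
4. Lemmatizing words --> i.e. reduce each word to its root form before counting.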
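To make the frequency-weighting idea concrete, here is a minimal sketch of a multinomial Naive Bayes score with add-one (Laplace) smoothing. The function names and the toy documents are assumptions for illustration. The smoothing denominator here is the textbook "class tokens + vocabulary size", which is not exactly the denominator used in the code section.

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (label, tokens). Returns priors, per-class counts, vocabulary."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_docs}
    for label, tokens in docs:
        word_counts[label].update(tokens)
    priors = {label: n / len(docs) for label, n in class_docs.items()}
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, vocab

def log_score(tokens, label, priors, word_counts, vocab):
    """log P(label) + sum of count * log P(word | label), add-one smoothed."""
    total = sum(word_counts[label].values()) + len(vocab)
    score = log(priors[label])
    for word, n in Counter(tokens).items():
        # Counter returns 0 for unseen words, so they get a smoothed count of 1
        score += n * log((word_counts[label][word] + 1) / total)
    return score

def predict(tokens, priors, word_counts, vocab):
    """Return the class with the highest log score."""
    return max(priors, key=lambda c: log_score(tokens, c, priors, word_counts, vocab))

# Made-up toy training set
docs = [
    ("B", ["bacterium", "genome", "plasmid"]),
    ("B", ["bacterium", "strain"]),
    ("E", ["human", "genome", "mitochondrial"]),
    ("E", ["mouse", "human", "protein"]),
]
model = train(docs)
print(predict(["human", "protein", "genome"], *model))
print(predict(["bacterium", "plasmid"], *model))
```

Working in log space keeps the product of many small probabilities from underflowing, which is also why the code section takes `log` of the class probability before adding word terms.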


## Evaluation

How am I evaluating the model?

Need to show results based on the models.
Try to find a suitable graphing technique, or just make a table.

Include validation results.
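One possible way to report results, sketched under the assumption that true and predicted labels are available as two equal-length lists. The classes are heavily imbalanced (A and V each make up roughly 3% of the training data), so per-class recall may be worth showing next to overall accuracy. The helper names and the example labels below are made up for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Counter keyed by (true_label, predicted_label)."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """For each true class, the share of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels, only to show the shape of the output
y_true = ["A", "B", "B", "E", "E", "E", "V", "E"]
y_pred = ["B", "B", "B", "E", "E", "B", "V", "E"]
print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

The `(true, predicted)` counts can be laid out as a table with one row per true class, which would cover the "just make a table" option.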

# Code

## Get all required imports

In [1]:
import pandas as pd
from pandas import DataFrame
import nltk
from math import log

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()




## Load Data

In [2]:
# abstractTrainingData (absTrainData)
absTrainData = pd.read_csv("data/trg.csv")
absTestData  = pd.read_csv("data/tst.csv")

absTrainData = absTrainData.drop(columns="id")
absTestData  = absTestData.drop(columns="id")


classDic = {"A": "Archaea", "B": "Bacteria", "E": "Eucaryota", "V": "Virus"}

classKeys = list(classDic)


# Check the variables to see if they are as expected
print(absTrainData)
print(absTestData)
print(classDic)
print(classKeys)


     class                                           abstract
0        B  the 4 202 353 bp genome of the alkaliphilic ba...
1        A  the complete 1751377-bp sequence of the genome...
2        E  in 1992 we started assembling an ordered libra...
3        E  the aim of this study is to measure human mito...
4        B  the amino acid sequence of the spirulina maxim...
...    ...                                                ...
3995     E  we have isolated and characterized two diureti...
3996     E  myotonias are muscle diseases in which the fun...
3997     E  cysteine synthase o-acetylserine sulfhydrylase...
3998     E  a region of 25 nucleotides is highly conserved...
3999     B  thermoanaerobacter tengcongensis is a rod-shap...

[4000 rows x 2 columns]
                                              abstract
0    in a previous work all three components of com...
1    we compared morphology of two geographically c...
2    factor xiii mr 320000 is a blood coagulation f...
3    we rep

## Pre-processing

Break down each abstract into separate words so we can analyse word frequency. First we find all the words present.

I will use the NLTK library to do this: its tokenizing function (to split the text into words), its stop-word list (to remove stop words), and its FreqDist class (to count how often each word appears).

In [3]:
# Join every training abstract into one string, one abstract per line
allWords = "\n".join(absTrainData["abstract"])

# Split the text into word tokens
allWordsTokens = word_tokenize(allWords)

# Drop stop words (case-insensitive comparison)
filteredWords = [word for word in allWordsTokens if word.casefold() not in stop_words]

# Vocabulary of the training set
uniqueWords = set(filteredWords)
# print(len(uniqueWords))

# Frequency of each remaining token across all training abstracts
fD = FreqDist(filteredWords)
# print(len(fD))

## Creating the functions

In [4]:
class probVars():

    def __init__(self, cls, cls_p, numOfUniqueWords, filtered, freq_filtered) -> None:
        self.cls = cls
        self.cls_p = cls_p
        self.numOfUniqueWords = numOfUniqueWords
        self.filtered = filtered
        self.freq_filtered = freq_filtered

    def load(self):
        return self.cls, self.cls_p, self.numOfUniqueWords, self.filtered, self.freq_filtered


class BayesianModel:
    

    def __init__(self, training_data: DataFrame, class_variables) -> None:
        self.class_variables = class_variables
        self.training_data = training_data

        self.posterior_probability = {}
        self.word_frequncy = {}
        self.class_totals = {}
        for cls in class_variables:
            self.posterior_probability[cls] = 0
            self.word_frequncy[cls] = {}
            self.class_totals[cls] = 0
        self.needLaplaceSmoothing = False
        self.uniqueWords = set()
        self.calculate_posterior_probabilities()
        self.count_words()


    def count_words(self):
        data = self.training_data

        
        for cls in self.class_variables:
            wordsInClass = []
            for index in range(len(data)):
                if cls == data["class"][index]:
                    abstract = data["abstract"][index]
                    wordsInClass += word_tokenize(abstract)
            filteredInClass = [word for word in wordsInClass if word.casefold() not in stop_words]
            totalTokensInClass = len(filteredInClass)
            frequncyInClass = FreqDist(filteredInClass)
            for word in frequncyInClass.items():
                self.uniqueWords.add(word[0])
                self.word_frequncy[cls][word[0]] = word[1] + 1
                self.class_totals[cls] = totalTokensInClass
            # print(self.word_frequncy[cls])
        print("counted all")


        # data = self.training_data
        # allWords = ""
        # for index in range(len(data)):
        #     abstract = data["abstract"][index]
        #     allWords += "\n" + abstract
        
        # allTokens = word_tokenize(allWords)
        # filteredWords = [word for word in allTokens if word.casefold() not in stop_words]

        # uniqueWords = set(filteredWords)


        # self.needLaplaceSmoothing = self.need_laplace(uniqueWords)

        # wordFrequncy = FreqDist(filteredWords)
        
    def need_laplace(self, uniqueWords):
        data = self.training_data

        for cls in self.class_variables:
            wordsInClass = ""
            for index in range(len(data)):
                if cls == data["class"][index]:
                    abstract = data["abstract"][index]
                    wordsInClass += "\n" + abstract
            tokensInClass = word_tokenize(wordsInClass)
            filteredInClass = [word for word in tokensInClass if word.casefold() not in stop_words]
            uniqueInClass = set(filteredInClass)
            if len(uniqueInClass) != len(uniqueWords):
                return True

        return False

    def test(self, test_data: DataFrame) -> list:
        # result list to return
        results = []

        # go through each abstract in the test_data separately
        for index in range(len(test_data)):

            # copy the dictionary (a plain assignment would only copy the reference)
            cls_p = self.posterior_probability.copy()

            # get abstract and tokenise and remove stop words
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            filtered = [word for word in tokens if word.casefold() not in stop_words]
            freq_filtered = FreqDist(filtered)

            # get the unique words from the filtered list
            set_of_filtered = set(filtered)

            # use union operation to combine sets to find total new words for laplace smoothing
            all_words = set_of_filtered.union(self.uniqueWords)
            numOfUniqueWords = len(all_words)

            # seperate this and use parallel computing to speed things up --> nope not sure whats wrong...

            # pVs = []
            for cls in self.class_variables:
                pV = probVars(cls, cls_p, numOfUniqueWords, filtered, freq_filtered)
                self.calc_prob(pV)
                # pVs.append(pV)
            

            # # p = multiprocessing.Pool(processes=len(self.class_variables), initializer=lock_init, initargs=self.lock)
            # print("starting processors")
            # # p.map(self.calc_prob, pVs)

            # p1 = multiprocessing.Process(target=self.calc_prob, args=(pVs[0], self.lock))
            # p2 = multiprocessing.Process(target=self.calc_prob, args=(pVs[1], self.lock))
            # p3 = multiprocessing.Process(target=self.calc_prob, args=(pVs[2], self.lock))
            # p4 = multiprocessing.Process(target=self.calc_prob, args=(pVs[3], self.lock))

            # p1.start()
            # p2.start()
            # p3.start()
            # p4.start()

            # p1.join()
            # p2.join()
            # p3.join()
            # p4.join()


            
            # always do a laplace smoothing as it only makes things better
            # for cls in self.class_variables:
            #     cls_p[cls] = log(cls_p[cls])
            #     cls_t = self.class_totals.copy()
            #     cls_t[cls] += (len(cls_p)) * numOfUniqueWords
            #     word_frequncy:dict = self.word_frequncy[cls].copy()
            #     for word in set(filtered).difference(set(list(word_frequncy.keys()))):
            #             word_frequncy[word] = 1

            #     # could make this faster with the frequncy dist and then add it that many times instead of looping the entire list --> done
            #     for w_n in freq_filtered.items():
            #         cls_p[cls] += w_n[1] * (log(word_frequncy[w_n[0]]/cls_t[cls]))

            cls_prediction = list(cls_p.keys())[0]
            for cls in list(cls_p.keys()):
                if cls_p[cls] > cls_p[cls_prediction]:
                    cls_prediction = cls
            results.append(cls_prediction)

        return results

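    def calculate_posterior_probabilities(self):
        # Despite the name, these are the class priors P(class):
        # the fraction of training abstracts that belong to each class.
        data = self.training_data
        total = len(data)
        for cls in self.class_variables:
            self.posterior_probability[cls] = len(data[data['class'] == cls])/total
        print(self.posterior_probability)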

    def calc_prob(self, pV: probVars):
        cls, cls_p, numOfUniqueWords, filtered, freq_filtered = pV.load()
        # start from the log prior of the class
        cls_p[cls] = log(cls_p[cls])
        # smoothed denominator: tokens in the class plus (number of classes * vocabulary size)
        cls_t = self.class_totals[cls] + len(cls_p) * numOfUniqueWords
        word_frequncy: dict = self.word_frequncy[cls]

        # add count * log P(word | class) for every distinct word in the abstract;
        # stored counts already include +1, so unseen words fall back to a count of 1
        for word, count in freq_filtered.items():
            wF = word_frequncy.get(word, 1) / cls_t
            cls_p[cls] += count * log(wF)


### Test with some data so we can validate model

In [5]:
# Train with some data

def single_split_validation(
    data_set: DataFrame,
    training_validation_split=0.5,
):
    training_data = data_set.groupby('class', group_keys=False).apply(lambda x: x.sample(frac=training_validation_split, random_state= 200))
    validation_data = data_set.drop(training_data.index)
    training_data = training_data.reset_index(drop=True)
    validation_data = validation_data.reset_index(drop=True)
    BM = BayesianModel(training_data, classKeys)
    predictions = BM.test(validation_data)
    total = len(validation_data)
    correct = 0
    for i in range(len(validation_data)):
        if predictions[i] == validation_data['class'][i]:
            correct+=1
    print(f'{correct/total}')

# 90/10 split
single_split_validation(absTrainData, training_validation_split= 0.9)

# 70/30 split
single_split_validation(absTrainData, training_validation_split= 0.7)

# 50/50 split
single_split_validation(absTrainData, training_validation_split= 0.5)

{'A': 0.03194444444444444, 'B': 0.40055555555555555, 'E': 0.5361111111111111, 'V': 0.03138888888888889}
counted all
0.9125
{'A': 0.03214285714285714, 'B': 0.40035714285714286, 'E': 0.5360714285714285, 'V': 0.03142857142857143}
counted all
0.9108333333333334
{'A': 0.032, 'B': 0.4005, 'E': 0.536, 'V': 0.0315}
counted all
0.915


### Test with all data once the best hyperparameters are found

In [6]:
# Train with all data
BM = BayesianModel(absTrainData, classKeys)

predictions = BM.test(absTestData)

# Compare against a reference prediction file and report the fraction of matching labels
results = pd.read_csv("results/976prediction.csv")
total = len(results)
correct = 0
for i in range(len(results)):
    if predictions[i] == results['class'][i]:
        correct += 1
print(f"{correct/total}")

# Write the predictions as "class,id" rows; ids start at 1
with open("results/predictions.csv", "w") as file:
    file.write("class,id\n")
    for row_id, p in enumerate(predictions, start=1):
        file.write(f"{p},{row_id}\n")

{'A': 0.032, 'B': 0.4005, 'E': 0.536, 'V': 0.0315}
counted all
0.928


### Added Inverse document frequency
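As a small standalone illustration of the weighting added in this version, the sketch below computes `log(N / document_frequency)` for each word. The function name `idf_weights` and the toy documents are assumptions. It counts document frequency in a single pass, which should give the same weights as checking every word against every document's token set.

```python
from math import log

def idf_weights(documents):
    """documents: list of token lists -> {word: log(N / document_frequency)}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: log(n_docs / df) for word, df in doc_freq.items()}

# Made-up token lists standing in for filtered abstracts
documents = [
    ["genome", "sequence", "bacterium"],
    ["genome", "protein"],
    ["genome", "sequence", "virus"],
    ["genome", "human", "protein"],
]
weights = idf_weights(documents)
for word in ("genome", "sequence", "virus"):
    print(word, round(weights[word], 3))
```

A word that occurs in every document gets weight `log(1) = 0`, so it no longer contributes to any class score, while rarer words get larger weights.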

In [7]:
class probVars():

    def __init__(self, cls, cls_p, numOfUniqueWords, filtered, freq_filtered) -> None:
        self.cls = cls
        self.cls_p = cls_p
        self.numOfUniqueWords = numOfUniqueWords
        self.filtered = filtered
        self.freq_filtered = freq_filtered

    def load(self):
        return self.cls, self.cls_p, self.numOfUniqueWords, self.filtered, self.freq_filtered


class BayesianModel:
    
    def __init__(self, training_data: DataFrame, class_variables) -> None:
        self.class_variables = class_variables
        self.training_data = training_data

        self.posterior_probability = {}
        self.word_frequncy = {}
        self.word_weights = {}
        self.class_totals = {}
        for cls in class_variables:
            self.posterior_probability[cls] = 0
            self.word_frequncy[cls] = {}
            self.class_totals[cls] = 0
        self.uniqueWords = set()
        self.calculate_posterior_probabilities()
        self.count_words()


    def count_words(self):
        data = self.training_data

        for cls in self.class_variables:
            wordsInClass = []
            for index in range(len(data)):
                if cls == data["class"][index]:
                    abstract = data["abstract"][index]
                    wordsInClass += word_tokenize(abstract)
            filteredInClass = [word for word in wordsInClass if word.casefold() not in stop_words]
            totalTokensInClass = len(filteredInClass)
            frequncyInClass = FreqDist(filteredInClass)
            for word in frequncyInClass.items():
                self.uniqueWords.add(word[0])
                self.word_frequncy[cls][word[0]] = word[1] + 1
                self.class_totals[cls] = totalTokensInClass
            # print(self.word_frequncy[cls])
        print("counted all")

    def test(self, test_data: DataFrame) -> list:
        # result list to return
        results = []

        tokensList = [set() for i in range(len(test_data))]
        allWords = []

        for index in range(len(test_data)):
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            tokens = [word for word in tokens if word.casefold() not in stop_words]
            allWords += tokens
            tokensList[index] = set(tokens)
        
        for word in set(allWords):
            occurances = 0
            for tokens in tokensList:
                if word in tokens:
                    occurances += 1
            if occurances == 0:
                # print(word)
                occurances = 1
            self.word_weights[word] = log(len(test_data)/occurances)
        print(self.word_weights)

        # go through each abstract in the test_data seperately
        for index in range(len(test_data)):

            # copy the dictionary --> previously took the point instead
            cls_p = self.posterior_probability.copy()

            # get abstract and tokenise and remove stop words
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            filtered = [word for word in tokens if word.casefold() not in stop_words]
            freq_filtered = FreqDist(filtered)

            # get the unique words from the filtered list
            set_of_filtered = set(filtered)

            # use union operation to combine sets to find total new words for laplace smoothing
            all_words = set_of_filtered.union(self.uniqueWords)
            numOfUniqueWords = len(all_words)

            for cls in self.class_variables:
                pV = probVars(cls, cls_p, numOfUniqueWords, filtered, freq_filtered)
                self.calc_prob(pV)


            cls_prediction = list(cls_p.keys())[0]
            for cls in list(cls_p.keys()):
                if cls_p[cls] > cls_p[cls_prediction]:
                    cls_prediction = cls
            results.append(cls_prediction)

        return results

    def calculate_posterior_probabilities(self):
        data = self.training_data
        total = len(data)
        for cls in self.class_variables:
            self.posterior_probability[cls] = len(data[data['class'] == cls])/total
        print(self.posterior_probability)

    def calc_prob(self, pV: probVars):
        cls, cls_p, numOfUniqueWords, filtered, freq_filtered = pV.load()
        # start from the log prior of the class
        cls_p[cls] = log(cls_p[cls])
        # smoothed denominator: tokens in the class plus (number of classes * vocabulary size)
        cls_t = self.class_totals[cls] + len(cls_p) * numOfUniqueWords
        word_frequncy: dict = self.word_frequncy[cls]
        word_weights: dict = self.word_weights

        for word, count in freq_filtered.items():
            # scale the raw count by the word's inverse document frequency
            weight = word_weights[word]
            # stored counts already include +1, so unseen words fall back to a count of 1
            wF = word_frequncy.get(word, 1) / cls_t
            cls_p[cls] += weight * count * log(wF)


#### Test with some data so we can validate model

In [8]:
# Train with some data

def single_split_validation(
    data_set: DataFrame,
    training_validation_split=0.5,
):
    training_data = data_set.groupby('class', group_keys=False).apply(lambda x: x.sample(frac=training_validation_split, random_state= 200))
    validation_data = data_set.drop(training_data.index)
    training_data = training_data.reset_index(drop=True)
    validation_data = validation_data.reset_index(drop=True)
    BM = BayesianModel(training_data, classKeys)
    predictions = BM.test(validation_data)
    total = len(validation_data)
    correct = 0
    for i in range(len(validation_data)):
        if predictions[i] == validation_data['class'][i]:
            correct+=1
    print(f'{correct/total}')

# 90/10 split
single_split_validation(absTrainData, training_validation_split= 0.9)

# 70/30 split
single_split_validation(absTrainData, training_validation_split= 0.7)

# 50/50 split
single_split_validation(absTrainData, training_validation_split= 0.5)

{'A': 0.03194444444444444, 'B': 0.40055555555555555, 'E': 0.5361111111111111, 'V': 0.03138888888888889}




counted all
{'biob': 5.991464547107982, 'trnas': 4.382026634673881, '18452': 5.991464547107982, '35': 3.7942399697717626, 'sulphite': 5.298317366548036, 'divergently': 4.382026634673881, 'trans-activator-binding': 5.991464547107982, 'non-human': 5.991464547107982, 'lactobacillus': 5.991464547107982, 'plastids': 5.991464547107982, 'hyperglycaemia': 5.991464547107982, 'prevent': 5.991464547107982, 'ssb-12': 5.991464547107982, 'studied': 3.7942399697717626, 'intermediacy': 5.991464547107982, 'typhus': 5.991464547107982, 'specify': 5.991464547107982, 'linkers': 5.991464547107982, '250': 4.382026634673881, '1054': 5.298317366548036, 'asn-asp-ala-gly': 5.991464547107982, 'octanucleotide': 5.991464547107982, 'commercial': 5.991464547107982, '8p213-p22': 5.991464547107982, 'current': 4.605170185988092, 'homocysteine': 5.991464547107982, 'kd': 4.605170185988092, 'normally': 4.605170185988092, 'loop': 4.892852258439873, 'pinguinus': 5.991464547107982, 'deposited': 5.991464547107982, 'polyptides'



counted all
{'biob': 7.090076835776092, 'pre-t': 6.396929655216146, 'trnas': 4.199705077879927, '18452': 7.090076835776092, '35': 3.8319802977546096, 'sulphite': 6.396929655216146, 'divergently': 5.0106352940962555, 'trans-activator-binding': 7.090076835776092, 'non-human': 6.396929655216146, 'compose': 7.090076835776092, 'hyperexpress': 7.090076835776092, 'lactobacillus': 5.0106352940962555, 'plastids': 5.144166686720778, 'hyperglycaemia': 7.090076835776092, 'prevent': 7.090076835776092, 'relating': 6.396929655216146, 'stream': 7.090076835776092, 'ctla4icos': 7.090076835776092, 'ssb-12': 7.090076835776092, 'studied': 3.7578723256008875, 'cionin': 7.090076835776092, 'homodimerization': 7.090076835776092, 'intermediacy': 7.090076835776092, 'beta-amylase': 7.090076835776092, 'typhus': 5.991464547107982, 'specify': 5.703782474656201, 'formin-1': 7.090076835776092, 'linkers': 6.396929655216146, '120-megabase': 5.480638923341991, '250': 4.382026634673881, 'd-galactose-1-dehydrogenase': 7.09



counted all
{'pre-t': 6.907755278982137, 'trnas': 4.199705077879927, 'sulphite': 6.907755278982137, 'icp8-dbp': 7.600902459542082, 'divergently': 5.115995809754082, 'trans-activator-binding': 7.600902459542082, 'hyperexpress': 7.600902459542082, 'rent1hent1': 7.600902459542082, 'phytohormone': 7.600902459542082, 'plastids': 5.298317366548036, 'ctla4icos': 7.600902459542082, 'homodimerization': 7.600902459542082, 'studied': 3.816712825623821, 'zeta-carotene': 7.600902459542082, 'beta-amylase': 6.907755278982137, 'gene-related': 7.600902459542082, 'specify': 5.521460917862246, 'avium': 7.600902459542082, '120-megabase': 5.809142990314028, 'd-galactose-1-dehydrogenase': 7.600902459542082, '1054': 5.991464547107982, 'peridinin': 7.600902459542082, 'glutathione': 5.991464547107982, '8p213-p22': 7.600902459542082, 'acetoindichlorophenolindophenol': 7.600902459542082, 'funneled': 7.600902459542082, 'r11a': 7.600902459542082, 'bp180': 7.600902459542082, 'sinensis': 6.907755278982137, 'arrayed'

#### Test with all data once the best hyperparameters are found

In [9]:
# Train with all data
BM = BayesianModel(absTrainData, classKeys)

predictions = BM.test(absTestData)

# Compare against a reference prediction file and report the fraction of matching labels
results = pd.read_csv("results/976prediction.csv")
total = len(results)
correct = 0
for i in range(len(results)):
    if predictions[i] == results['class'][i]:
        correct += 1
print(f"{correct/total}")

# Write the predictions as "class,id" rows; ids start at 1
with open("results/predictions.csv", "w") as file:
    file.write("class,id\n")
    for row_id, p in enumerate(predictions, start=1):
        file.write(f"{p},{row_id}\n")

{'A': 0.032, 'B': 0.4005, 'E': 0.536, 'V': 0.0315}
counted all
{'biob': 6.907755278982137, 'amsd': 6.907755278982137, 'mas': 6.907755278982137, 'trnas': 4.509860006183766, '18452': 6.214608098422191, '35': 3.9633162998156966, 'brevinin-1sb': 6.907755278982137, 'algb': 6.907755278982137, 'end-directed': 6.907755278982137, 'transductor': 6.907755278982137, 'divergently': 6.214608098422191, 'vomiting': 6.907755278982137, 'monte': 6.907755278982137, 'plastids': 6.907755278982137, 'lactobacillus': 6.214608098422191, 'stream': 6.907755278982137, 'prevent': 5.521460917862246, 'relating': 6.907755278982137, 'homodimerization': 6.907755278982137, 'herbaceous': 6.907755278982137, 'studied': 3.1465551632885744, 'typhus': 5.115995809754082, 'folates': 6.907755278982137, 'gene-related': 6.907755278982137, 'riboside': 6.907755278982137, 'c-myc-initiated': 6.907755278982137, 'specify': 5.298317366548036, 'clot': 6.907755278982137, 'formin-1': 6.907755278982137, 'photolyase': 6.907755278982137, 'avium

### Complement version
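The idea behind the complement version can be shown in isolation with a minimal sketch. It scores a class using only the word counts of the other classes and subtracts the log-probabilities, so words that are common elsewhere count against the class. The name `complement_score` and the toy counts are assumptions. For brevity the sketch leaves out the class prior and the inverse document frequency weights that the notebook version also applies.

```python
from collections import Counter
from math import log

def complement_score(tokens, label, word_counts, vocab_size):
    """Score `label` from the word counts of every class except `label`."""
    other = Counter()
    for c, counts in word_counts.items():
        if c != label:
            other.update(counts)  # pool the counts of all other classes
    total = sum(other.values()) + vocab_size  # add-one smoothing denominator
    score = 0.0
    for word, n in Counter(tokens).items():
        # a word that is frequent outside `label` lowers the score for `label`
        score -= n * log((other[word] + 1) / total)
    return score

# Made-up per-class word counts
word_counts = {
    "B": Counter({"bacterium": 3, "genome": 2}),
    "E": Counter({"human": 3, "genome": 2}),
    "V": Counter({"virus": 3, "genome": 1}),
}
vocab_size = 4  # bacterium, genome, human, virus
tokens = ["virus", "genome", "virus"]
scores = {c: complement_score(tokens, c, word_counts, vocab_size) for c in word_counts}
print(max(scores, key=scores.get))
```

Because each class is scored from the pooled counts of all the others, a small class such as A or V is estimated from far more data than its own abstracts provide, which is the usual motivation for the complement variant on imbalanced data.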

In [10]:
class probVars():

    def __init__(self, cls, cls_p, numOfUniqueWords, filtered, freq_filtered) -> None:
        self.cls = cls
        self.cls_p = cls_p
        self.numOfUniqueWords = numOfUniqueWords
        self.filtered = filtered
        self.freq_filtered = freq_filtered

    def load(self):
        return self.cls, self.cls_p, self.numOfUniqueWords, self.filtered, self.freq_filtered


class BayesianModel:
    

    def __init__(self, training_data: DataFrame, class_variables) -> None:
        self.class_variables = class_variables
        self.training_data = training_data

        self.posterior_probability = {}
        self.word_frequncy = {}
        self.word_weights = {}
        self.class_totals = {}
        for cls in class_variables:
            self.posterior_probability[cls] = 0
            self.word_frequncy[cls] = {}
            self.class_totals[cls] = 0
        self.uniqueWords = set()
        self.calculate_posterior_probabilities()
        self.count_words()


    def count_words(self):
        data = self.training_data

        for cls in self.class_variables:
            wordsInClass = []
            for index in range(len(data)):
                if cls == data["class"][index]:
                    abstract = data["abstract"][index]
                    wordsInClass += word_tokenize(abstract)
            filteredInClass = [word for word in wordsInClass if word.casefold() not in stop_words]
            totalTokensInClass = len(filteredInClass)
            frequncyInClass = FreqDist(filteredInClass)
            for word in frequncyInClass.items():
                self.uniqueWords.add(word[0])
                self.word_frequncy[cls][word[0]] = word[1] + 1
                self.class_totals[cls] = totalTokensInClass
            # print(self.word_frequncy[cls])
        print("counted all")

    def test(self, test_data: DataFrame) -> list:
        # result list to return
        results = []

        tokensList = [set() for i in range(len(test_data))]
        allWords = []

        for index in range(len(test_data)):
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            tokens = [word for word in tokens if word.casefold() not in stop_words]
            allWords += tokens
            tokensList[index] = set(tokens)
        
        for word in set(allWords):
            occurances = 0
            for tokens in tokensList:
                if word in tokens:
                    occurances += 1
            if occurances == 0:
                # print(word)
                occurances = 1
            self.word_weights[word] = log(len(test_data)/occurances)
        print(self.word_weights)

        # go through each abstract in the test_data seperately
        for index in range(len(test_data)):

            # copy the dictionary --> previously took the point instead
            cls_p = self.posterior_probability.copy()

            # get abstract and tokenise and remove stop words
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            filtered = [word for word in tokens if word.casefold() not in stop_words]
            freq_filtered = FreqDist(filtered)

            # get the unique words from the filtered list
            set_of_filtered = set(filtered)
        
            # use union operation to combine sets to find total new words for laplace smoothing
            all_words = set_of_filtered.union(self.uniqueWords)
            numOfUniqueWords = len(all_words)

            



            # seperate this and use parallel computing to speed things up --> nope not sure whats wrong...

            # pVs = []
            for cls in self.class_variables:
                pV = probVars(cls, cls_p, numOfUniqueWords, filtered, freq_filtered)
                self.calc_prob(pV)
                # pVs.append(pV)


            cls_prediction = list(cls_p.keys())[0]
            for cls in list(cls_p.keys()):
                if cls_p[cls] > cls_p[cls_prediction]:
                    cls_prediction = cls
            results.append(cls_prediction)

        return results

    def calculate_posterior_probabilities(self):
        data = self.training_data
        total = len(data)
        for cls in self.class_variables:
            self.posterior_probability[cls] = len(data[data['class'] == cls])/total
        print(self.posterior_probability)

    def calc_prob(self, pV: probVars):
        cls, cls_p, numOfUniqueWords, filtered, freq_filtered = pV.load()
        # start from the log prior of the class
        cls_p[cls] = log(cls_p[cls])
        word_weights: dict = self.word_weights

        # smoothed total of tokens in every class except cls
        total_words_not_in_class = len(cls_p) * numOfUniqueWords
        for c in self.class_variables:
            if c != cls:
                total_words_not_in_class += self.class_totals[c]

        for word, count in freq_filtered.items():
            weight = word_weights[word]

            # occurrences of the word outside cls; stored counts already include +1,
            # and .get avoids writing unseen test words into the trained counts
            w_n_not_in_class = 0
            for c in self.class_variables:
                if c != cls:
                    w_n_not_in_class += self.word_frequncy[c].get(word, 1)

            wF = w_n_not_in_class / total_words_not_in_class
            # a word that is common outside cls counts against cls, hence the subtraction
            cls_p[cls] -= weight * count * log(wF)


#### Test with some data so we can validate model

In [11]:
# Train on a stratified sample and validate on the remainder

def single_split_validation(
    data_set: DataFrame,
    training_validation_split=0.5,
):
    # sample the same fraction from every class so the class proportions are preserved
    training_data = data_set.groupby('class', group_keys=False).apply(lambda x: x.sample(frac=training_validation_split, random_state=200))
    # every row that was not sampled for training becomes validation data
    validation_data = data_set.drop(training_data.index)
    training_data = training_data.reset_index(drop=True)
    validation_data = validation_data.reset_index(drop=True)
    BM = BayesianModel(training_data, classKeys)
    predictions = BM.test(validation_data)
    total = len(validation_data)
    correct = 0
    for i in range(len(validation_data)):
        if predictions[i] == validation_data['class'][i]:
            correct += 1
    # validation accuracy
    print(f'{correct/total}')

# 90/10 split
single_split_validation(absTrainData, training_validation_split= 0.9)

# 70/30 split
single_split_validation(absTrainData, training_validation_split= 0.7)

# 50/50 split
single_split_validation(absTrainData, training_validation_split= 0.5)

{'A': 0.03194444444444444, 'B': 0.40055555555555555, 'E': 0.5361111111111111, 'V': 0.03138888888888889}


counted all
{'biob': 5.991464547107982, 'trnas': 4.382026634673881, '18452': 5.991464547107982, '35': 3.7942399697717626, 'sulphite': 5.298317366548036, 'divergently': 4.382026634673881, 'trans-activator-binding': 5.991464547107982, 'non-human': 5.991464547107982, 'lactobacillus': 5.991464547107982, 'plastids': 5.991464547107982, 'hyperglycaemia': 5.991464547107982, 'prevent': 5.991464547107982, 'ssb-12': 5.991464547107982, 'studied': 3.7942399697717626, 'intermediacy': 5.991464547107982, 'typhus': 5.991464547107982, 'specify': 5.991464547107982, 'linkers': 5.991464547107982, '250': 4.382026634673881, '1054': 5.298317366548036, 'asn-asp-ala-gly': 5.991464547107982, 'octanucleotide': 5.991464547107982, 'commercial': 5.991464547107982, '8p213-p22': 5.991464547107982, 'current': 4.605170185988092, 'homocysteine': 5.991464547107982, 'kd': 4.605170185988092, 'normally': 4.605170185988092, 'loop': 4.892852258439873, 'pinguinus': 5.991464547107982, 'deposited': 5.991464547107982, 'polyptides'

counted all
{'biob': 7.090076835776092, 'pre-t': 6.396929655216146, 'trnas': 4.199705077879927, '18452': 7.090076835776092, '35': 3.8319802977546096, 'sulphite': 6.396929655216146, 'divergently': 5.0106352940962555, 'trans-activator-binding': 7.090076835776092, 'non-human': 6.396929655216146, 'compose': 7.090076835776092, 'hyperexpress': 7.090076835776092, 'lactobacillus': 5.0106352940962555, 'plastids': 5.144166686720778, 'hyperglycaemia': 7.090076835776092, 'prevent': 7.090076835776092, 'relating': 6.396929655216146, 'stream': 7.090076835776092, 'ctla4icos': 7.090076835776092, 'ssb-12': 7.090076835776092, 'studied': 3.7578723256008875, 'cionin': 7.090076835776092, 'homodimerization': 7.090076835776092, 'intermediacy': 7.090076835776092, 'beta-amylase': 7.090076835776092, 'typhus': 5.991464547107982, 'specify': 5.703782474656201, 'formin-1': 7.090076835776092, 'linkers': 6.396929655216146, '120-megabase': 5.480638923341991, '250': 4.382026634673881, 'd-galactose-1-dehydrogenase': 7.09

counted all
{'pre-t': 6.907755278982137, 'trnas': 4.199705077879927, 'sulphite': 6.907755278982137, 'icp8-dbp': 7.600902459542082, 'divergently': 5.115995809754082, 'trans-activator-binding': 7.600902459542082, 'hyperexpress': 7.600902459542082, 'rent1hent1': 7.600902459542082, 'phytohormone': 7.600902459542082, 'plastids': 5.298317366548036, 'ctla4icos': 7.600902459542082, 'homodimerization': 7.600902459542082, 'studied': 3.816712825623821, 'zeta-carotene': 7.600902459542082, 'beta-amylase': 6.907755278982137, 'gene-related': 7.600902459542082, 'specify': 5.521460917862246, 'avium': 7.600902459542082, '120-megabase': 5.809142990314028, 'd-galactose-1-dehydrogenase': 7.600902459542082, '1054': 5.991464547107982, 'peridinin': 7.600902459542082, 'glutathione': 5.991464547107982, '8p213-p22': 7.600902459542082, 'acetoindichlorophenolindophenol': 7.600902459542082, 'funneled': 7.600902459542082, 'r11a': 7.600902459542082, 'bp180': 7.600902459542082, 'sinensis': 6.907755278982137, 'arrayed'
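
The split in `single_split_validation` is stratified: the same fraction is drawn from every class, which is why the class proportions printed before each run stay the same. As a rough illustration of the idea without pandas, here is a sketch that works on a plain list of labels. The function name `stratified_split` and the toy labels are assumptions for this example, and it will not reproduce the exact rows pandas selects.

```python
import random
from collections import defaultdict

def stratified_split(labels, frac, seed=200):
    """Return (train_idx, valid_idx) with roughly `frac` of every class in train."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train = []
    for label in sorted(by_class):
        idx = by_class[label]
        # Draw the same fraction from each class.
        train.extend(rng.sample(idx, round(len(idx) * frac)))

    train_set = set(train)
    valid = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), valid

# Toy data: 10 abstracts of class "A" and 30 of class "B".
labels = ["A"] * 10 + ["B"] * 30
train_idx, valid_idx = stratified_split(labels, frac=0.7)
train_a = sum(1 for i in train_idx if labels[i] == "A")
train_b = sum(1 for i in train_idx if labels[i] == "B")
```

With these toy labels the training part keeps the 1:3 ratio between the two classes, and the two index lists never overlap.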

#### Test with all data once the best hyperparameters are found

In [12]:
# Train with all data
BM = BayesianModel(absTrainData, classKeys)

predictions = BM.test(absTestData)

# measure agreement with the reference predictions stored in 976prediction.csv
results = pd.read_csv("results/976prediction.csv")
total = len(results)
correct = 0
for i in range(len(results)):
    if predictions[i] == results['class'][i]:
        correct += 1
print(f"{correct/total}")

# write the predictions as "class,id" rows, with ids starting at 1
with open("results/predictions.csv", "w") as file:
    file.write("class,id\n")
    for id_, p in enumerate(predictions, start=1):
        file.write(f"{p},{id_}\n")

{'A': 0.032, 'B': 0.4005, 'E': 0.536, 'V': 0.0315}
counted all
{'biob': 6.907755278982137, 'amsd': 6.907755278982137, 'mas': 6.907755278982137, 'trnas': 4.509860006183766, '18452': 6.214608098422191, '35': 3.9633162998156966, 'brevinin-1sb': 6.907755278982137, 'algb': 6.907755278982137, 'end-directed': 6.907755278982137, 'transductor': 6.907755278982137, 'divergently': 6.214608098422191, 'vomiting': 6.907755278982137, 'monte': 6.907755278982137, 'plastids': 6.907755278982137, 'lactobacillus': 6.214608098422191, 'stream': 6.907755278982137, 'prevent': 5.521460917862246, 'relating': 6.907755278982137, 'homodimerization': 6.907755278982137, 'herbaceous': 6.907755278982137, 'studied': 3.1465551632885744, 'typhus': 5.115995809754082, 'folates': 6.907755278982137, 'gene-related': 6.907755278982137, 'riboside': 6.907755278982137, 'c-myc-initiated': 6.907755278982137, 'specify': 5.298317366548036, 'clot': 6.907755278982137, 'formin-1': 6.907755278982137, 'photolyase': 6.907755278982137, 'avium

### Lemmatize Words

In [13]:
class probVars():

    def __init__(self, cls, cls_p, numOfUniqueWords, filtered, freq_filtered) -> None:
        self.cls = cls
        self.cls_p = cls_p
        self.numOfUniqueWords = numOfUniqueWords
        self.filtered = filtered
        self.freq_filtered = freq_filtered

    def load(self):
        return self.cls, self.cls_p, self.numOfUniqueWords, self.filtered, self.freq_filtered


class BayesianModel:
    

    def __init__(self, training_data: DataFrame, class_variables) -> None:
        self.class_variables = class_variables
        self.training_data = training_data

        self.posterior_probability = {}
        self.word_frequncy = {}
        self.word_weights = {}
        self.class_totals = {}
        for cls in class_variables:
            self.posterior_probability[cls] = 0
            self.word_frequncy[cls] = {}
            self.class_totals[cls] = 0
        self.uniqueWords = set()
        self.calculate_posterior_probabilities()
        self.count_words()


    def count_words(self):
        data = self.training_data

        for cls in self.class_variables:
            # gather every token from the abstracts of this class
            wordsInClass = []
            for index in range(len(data)):
                if cls == data["class"][index]:
                    abstract = data["abstract"][index]
                    wordsInClass += word_tokenize(abstract)
            # drop stop words, then lemmatize what is left
            filteredInClass = [lemmatizer.lemmatize(word) for word in wordsInClass if word.casefold() not in stop_words]
            # set the class total once instead of on every loop iteration
            self.class_totals[cls] = len(filteredInClass)
            frequncyInClass = FreqDist(filteredInClass)
            for word, count in frequncyInClass.items():
                self.uniqueWords.add(word)
                # add-one (Laplace) smoothing on the stored count
                self.word_frequncy[cls][word] = count + 1
        print("counted all")

    def test(self, test_data: DataFrame) -> list:
        # result list to return
        results = []

        tokensList = [set() for i in range(len(test_data))]
        allWords = []

        for index in range(len(test_data)):
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            tokens = [lemmatizer.lemmatize(word) for word in tokens if word.casefold() not in stop_words]
            allWords += tokens
            tokensList[index] = set(tokens)
        
        # inverse document frequency: log(N / number of abstracts containing the word)
        for word in set(allWords):
            occurrences = 0
            for tokens in tokensList:
                if word in tokens:
                    occurrences += 1
            # every word in allWords came from some abstract, so occurrences is at least 1
            self.word_weights[word] = log(len(test_data)/occurrences)
        print(self.word_weights)

        # go through each abstract in the test_data separately
        for index in range(len(test_data)):

            # copy the priors so each abstract starts fresh (previously the shared reference was used)
            cls_p = self.posterior_probability.copy()

            # get abstract and tokenise and remove stop words
            abstract = test_data["abstract"][index]
            tokens = word_tokenize(abstract)
            filtered = [lemmatizer.lemmatize(word) for word in tokens if word.casefold() not in stop_words]
            freq_filtered = FreqDist(filtered)

            # get the unique words from the filtered list
            set_of_filtered = set(filtered)

            # union with the training vocabulary gives the vocabulary size used for Laplace smoothing
            all_words = set_of_filtered.union(self.uniqueWords)
            numOfUniqueWords = len(all_words)

            # tried to separate this step and run it in parallel to speed things up,
            # but it did not work and the cause is still unknown
            for cls in self.class_variables:
                pV = probVars(cls, cls_p, numOfUniqueWords, filtered, freq_filtered)
                self.calc_prob(pV)


            cls_prediction = list(cls_p.keys())[0]
            for cls in list(cls_p.keys()):
                if cls_p[cls] > cls_p[cls_prediction]:
                    cls_prediction = cls
            results.append(cls_prediction)

        return results

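    def calculate_posterior_probabilities(self):
        # despite the name, these are the class priors P(class):
        # the share of training abstracts that belong to each class
        data = self.training_data
        total = len(data)
        for cls in self.class_variables:
            self.posterior_probability[cls] = len(data[data['class'] == cls])/total
        print(self.posterior_probability)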

    def calc_prob(self, pV: probVars):
        cls, cls_p, numOfUniqueWords, filtered, freq_filtered = pV.load()
        # start from the log of the class prior
        cls_p[cls] = log(cls_p[cls])
        cls_t = self.class_totals
        word_frequncy: dict = self.word_frequncy
        word_weights: dict = self.word_weights

        # complement denominator: tokens in every other class plus a smoothing term
        total_words_not_in_class = len(cls_p) * numOfUniqueWords
        for c in self.class_variables:
            if c != cls:
                total_words_not_in_class += cls_t[c]

        # log-frequency of each distinct word outside cls
        wFs = {}
        sumwFs = 0  # only needed for the optional normalisation below
        for word, count in freq_filtered.items():
            # unseen words get a count of 1; .get() avoids writing into the model's dicts
            w_n_not_in_class = 0
            for c in self.class_variables:
                if c != cls:
                    w_n_not_in_class += word_frequncy[c].get(word, 1)

            wFs[word] = log(w_n_not_in_class/total_words_not_in_class)
            sumwFs += abs(wFs[word])

        for word, count in freq_filtered.items():
            weight = word_weights[word]

            wF = wFs[word] # /sumwFs
            # subtract the weighted log-likelihood of the word under the complement of cls
            cls_p[cls] -= weight * count * wF
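
The word weights built in `test` are inverse document frequencies: `log(N / df)`, where `N` is the number of abstracts and `df` is how many of them contain the word. The loop in the class scans every token set once per word. A possible one-pass alternative is sketched below with `collections.Counter`. The function name `idf_weights` and the toy documents are illustrative assumptions, and the sketch has not been run against the notebook's data.

```python
from collections import Counter
from math import log

def idf_weights(token_lists):
    """Map each word to log(number of documents / documents containing the word)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        # set() makes a word count once per document, however often it repeats.
        doc_freq.update(set(tokens))
    return {word: log(n_docs / df) for word, df in doc_freq.items()}

# Toy corpus of four tokenised abstracts.
docs = [["gene", "virus", "gene"], ["gene", "protein"], ["virus"], ["gene"]]
weights = idf_weights(docs)
```

Words that appear in few abstracts ("protein") get a larger weight than words that appear in most of them ("gene"), which matches the pattern in the printed weight dictionaries.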


#### Validate the model on held-out data

In [14]:
# Train on a stratified sample and validate on the remainder

def single_split_validation(
    data_set: DataFrame,
    training_validation_split=0.5,
):
    # sample the same fraction from every class so the class proportions are preserved
    training_data = data_set.groupby('class', group_keys=False).apply(lambda x: x.sample(frac=training_validation_split, random_state=200))
    # every row that was not sampled for training becomes validation data
    validation_data = data_set.drop(training_data.index)
    training_data = training_data.reset_index(drop=True)
    validation_data = validation_data.reset_index(drop=True)
    BM = BayesianModel(training_data, classKeys)
    predictions = BM.test(validation_data)
    total = len(validation_data)
    correct = 0
    for i in range(len(validation_data)):
        if predictions[i] == validation_data['class'][i]:
            correct += 1
    # validation accuracy
    print(f'{correct/total}')

# 90/10 split
single_split_validation(absTrainData, training_validation_split= 0.9)

# 70/30 split
single_split_validation(absTrainData, training_validation_split= 0.7)

# 50/50 split
single_split_validation(absTrainData, training_validation_split= 0.5)

{'A': 0.03194444444444444, 'B': 0.40055555555555555, 'E': 0.5361111111111111, 'V': 0.03138888888888889}


counted all
{'biob': 5.991464547107982, '18452': 5.991464547107982, '35': 3.7942399697717626, 'sulphite': 5.298317366548036, 'divergently': 4.382026634673881, 'trans-activator-binding': 5.991464547107982, 'non-human': 5.991464547107982, 'lactobacillus': 5.991464547107982, 'hyperglycaemia': 5.991464547107982, 'prevent': 5.991464547107982, 'ssb-12': 5.991464547107982, 'studied': 3.7942399697717626, 'intermediacy': 5.991464547107982, 'typhus': 5.991464547107982, 'specify': 5.991464547107982, 'tarsier': 5.991464547107982, 'linkers': 5.991464547107982, '250': 4.382026634673881, '1054': 5.298317366548036, 'asn-asp-ala-gly': 5.991464547107982, 'octanucleotide': 5.991464547107982, 'commercial': 5.991464547107982, '8p213-p22': 5.991464547107982, 'current': 4.382026634673881, 'homocysteine': 5.991464547107982, 'kd': 4.605170185988092, 'normally': 4.605170185988092, 'loop': 4.382026634673881, 'pinguinus': 5.991464547107982, 'deposited': 5.991464547107982, 'polyptides': 5.991464547107982, 'crta': 

counted all
{'biob': 7.090076835776092, 'pre-t': 6.396929655216146, '18452': 7.090076835776092, '35': 3.8319802977546096, 'sulphite': 6.396929655216146, 'divergently': 5.0106352940962555, 'trans-activator-binding': 7.090076835776092, 'non-human': 6.396929655216146, 'compose': 7.090076835776092, 'hyperexpress': 7.090076835776092, 'lactobacillus': 5.0106352940962555, 'hyperglycaemia': 7.090076835776092, 'prevent': 7.090076835776092, 'relating': 6.396929655216146, 'stream': 7.090076835776092, 'ctla4icos': 7.090076835776092, 'ssb-12': 7.090076835776092, 'studied': 3.7578723256008875, 'cionin': 7.090076835776092, 'homodimerization': 7.090076835776092, 'intermediacy': 7.090076835776092, 'beta-amylase': 7.090076835776092, 'typhus': 5.991464547107982, 'specify': 5.703782474656201, 'tarsier': 6.396929655216146, 'formin-1': 7.090076835776092, 'linkers': 6.396929655216146, '120-megabase': 5.480638923341991, '250': 4.382026634673881, 'd-galactose-1-dehydrogenase': 7.090076835776092, 'allocation': 

counted all
{'pre-t': 6.907755278982137, 'sulphite': 6.907755278982137, 'icp8-dbp': 7.600902459542082, 'divergently': 5.115995809754082, 'trans-activator-binding': 7.600902459542082, 'hyperexpress': 7.600902459542082, 'rent1hent1': 7.600902459542082, 'phytohormone': 7.600902459542082, 'ctla4icos': 7.600902459542082, 'homodimerization': 7.600902459542082, 'studied': 3.816712825623821, 'zeta-carotene': 7.600902459542082, 'beta-amylase': 6.907755278982137, 'gene-related': 7.600902459542082, 'specify': 5.521460917862246, 'avium': 7.600902459542082, '120-megabase': 5.809142990314028, 'd-galactose-1-dehydrogenase': 7.600902459542082, '1054': 5.991464547107982, 'peridinin': 7.600902459542082, 'glutathione': 5.991464547107982, '8p213-p22': 7.600902459542082, 'acetoindichlorophenolindophenol': 7.600902459542082, 'funneled': 7.600902459542082, 'r11a': 7.600902459542082, 'bp180': 7.600902459542082, 'sinensis': 6.907755278982137, 'arrayed': 6.907755278982137, 'thirteen': 5.403677882205863, 'create

#### Test with all data once the best hyperparameters are found

In [15]:
# Train with all data
BM = BayesianModel(absTrainData, classKeys)

predictions = BM.test(absTestData)

# measure agreement with the reference predictions stored in 976prediction.csv
results = pd.read_csv("results/976prediction.csv")
total = len(results)
correct = 0
for i in range(len(results)):
    if predictions[i] == results['class'][i]:
        correct += 1
print(f"{correct/total}")

# write the predictions as "class,id" rows, with ids starting at 1
with open("results/predictions.csv", "w") as file:
    file.write("class,id\n")
    for id_, p in enumerate(predictions, start=1):
        file.write(f"{p},{id_}\n")

{'A': 0.032, 'B': 0.4005, 'E': 0.536, 'V': 0.0315}
counted all
{'biob': 6.907755278982137, 'amsd': 6.907755278982137, '18452': 6.214608098422191, '35': 3.9633162998156966, 'brevinin-1sb': 6.907755278982137, 'algb': 6.907755278982137, 'end-directed': 6.907755278982137, 'transductor': 6.907755278982137, 'divergently': 6.214608098422191, 'vomiting': 6.907755278982137, 'monte': 6.907755278982137, 'lactobacillus': 6.214608098422191, 'stream': 6.907755278982137, 'prevent': 5.521460917862246, 'relating': 6.907755278982137, 'homodimerization': 6.907755278982137, 'herbaceous': 6.907755278982137, 'studied': 3.1465551632885744, 'typhus': 5.115995809754082, 'gene-related': 6.907755278982137, 'riboside': 6.907755278982137, 'c-myc-initiated': 6.907755278982137, 'specify': 5.298317366548036, 'clot': 6.907755278982137, 'tarsier': 6.907755278982137, 'formin-1': 6.907755278982137, 'photolyase': 6.907755278982137, 'avium': 6.214608098422191, '120-megabase': 5.115995809754082, '250': 5.115995809754082, 'y