## Model Improvements / Data Processing

To improve the performance of the models that will be implemented, several processing steps can be applied to the data to help improve the accuracy of the classifier:

1. Remove all punctuation and normalise capitalisation in the text of each document. Punctuation and capitalisation shouldn't have an effect on what the abstract is talking about, so they should not be considered by our classifier. This was done by taking the abstract column in the dataframe, removing every character in string.punctuation and converting the text to lowercase with str.lower.

2. Remove common English words that aren't specific to any abstract and that we therefore believe won't play a major role in determining the class, such as "and" and "of". These can be removed by specifying the argument stop_words='english' when creating the CountVectorizer().

3. Set a maximum document-frequency threshold for words to be considered by the classifier. As with common English words, words that appear in almost every document, and therefore across all classes, are unlikely to help distinguish between classes. This was done by specifying the argument max_df=0.9 when creating the CountVectorizer(), so any word that appears in more than 90% of the documents is discarded. Conversely, the min_df argument can be used to set a minimum document-frequency threshold for words we want to include in our model.

4. Include phrases as well as single words by specifying the n-gram range. For this data we use n-grams of 1 to 3 consecutive words, so the model considers not only individual words but also pairs and triplets of adjacent words. This was done by specifying the argument ngram_range=(1,3) when creating the CountVectorizer(). A word may not have much meaning by itself in a sentence, but its meaning can change significantly when it is joined with the words next to it. An example is "homo sapiens", where "homo" or "sapiens" may not have significant meaning alone, but together they have a very specific meaning.
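The effect of the vectorizer arguments described in steps 2 to 4 can be sketched on a small corpus. The three documents and the helper function `vocabulary` below are hypothetical and are not taken from the project data; the sketch only illustrates how each argument changes the vocabulary that CountVectorizer learns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used for illustration only
docs = [
    "the cell membrane of homo sapiens",
    "the genome of homo sapiens",
    "the cell wall of the bacteria",
]


def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vec = CountVectorizer(**kwargs)
    vec.fit(docs)
    return sorted(vec.vocabulary_)


# Default settings: every word of two or more characters is kept
plain = vocabulary()

# stop_words='english' drops common English words such as "the" and "of"
no_stop = vocabulary(stop_words="english")

# max_df=0.9 drops any term found in more than 90% of the documents;
# here "the" and "of" occur in all three documents, so they are removed
capped = vocabulary(max_df=0.9)

# ngram_range=(1, 3) adds pairs and triplets of adjacent words
ngrams = vocabulary(stop_words="english", ngram_range=(1, 3))

print(plain)
print(no_stop)
print(capped)
print(ngrams)
```

In scikit-learn the stop words are removed before the n-grams are built, so an n-gram such as "membrane homo" can join two words that were separated by a stop word in the original text.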


# Import Libraries / Training and Test Datasets

In [None]:
# importing necessary libraries
import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# importing training and test datasets
trg = pd.read_csv("/Users/pangnakh/Downloads/trg.csv")
tst = pd.read_csv("/Users/pangnakh/Desktop/tst.csv")

# Data Preprocessing


In [None]:
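# Make sure all punctuation is removed from documents (regex=True so the pattern is read as a character class)
punctuation = '[{}]'.format(string.punctuation)
trg['abstract'] = trg['abstract'].str.replace(punctuation, '', regex=True)
# clean the test abstracts the same way so both datasets are tokenised consistently
tst['abstract'] = tst['abstract'].str.replace(punctuation, '', regex=True)
trg.to_csv('punctuation.csv',index= False)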

In [None]:
# Make sure all documents converted to lowercase
trg['abstract'] = trg['abstract'].str.lower()
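As a minimal sketch of what the two cleaning steps do, the example below applies them to two made-up abstracts. The Series `abstracts` is hypothetical, and `re.escape` is used here as a defensive assumption so that every punctuation character is treated literally inside the character class.

```python
import re
import string

import pandas as pd

# Hypothetical abstracts, used for illustration only
abstracts = pd.Series(["Homo-sapiens: A Review!", "DNA, RNA & Proteins."])

# Character class matching any single punctuation character
pattern = "[{}]".format(re.escape(string.punctuation))

# Strip punctuation, then convert to lowercase
cleaned = abstracts.str.replace(pattern, "", regex=True).str.lower()

print(cleaned.tolist())
```

Punctuation is deleted rather than replaced with a space, so a hyphenated term such as "Homo-sapiens" becomes the single token "homosapiens", and removing "&" leaves two consecutive spaces behind.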

In [None]:
# Calculate and produce a table of each of the class prior probabilities
# (uses trg['Category'] directly, because Y_train is only assigned in a later cell)
class_counts = trg['Category'].value_counts()
class_priors = class_counts / len(trg)
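The prior of a class is the fraction of training documents that carry that label. A small worked sketch, using a hypothetical label column rather than the real Category data:

```python
import pandas as pd

# Hypothetical labels standing in for the Category column
labels = pd.Series(["A", "B", "A", "E", "A", "B"], name="Category")

# Number of documents per class, then divide by the total to get priors
class_counts = labels.value_counts()
class_priors = class_counts / len(labels)

# Equivalent shortcut: let pandas normalise the counts directly
class_priors_alt = labels.value_counts(normalize=True)

print(class_priors)
```

With three of six labels equal to "A", the prior for "A" is 0.5, and the priors across all classes sum to 1.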

In [None]:
#convert the abstract column of the training data to a document-term matrix by using the vectorizer
#exclude common English words and words that occur in more than 90% of the documents
#include single words and n-grams of up to 3 adjacent words in the matrix
#binary=True records whether a term is present in a document rather than how often it occurs
vectorizer = CountVectorizer(binary=True,stop_words='english', ngram_range=(1,3),max_df=0.9)
X_train = vectorizer.fit_transform(trg['abstract'])

#assign class label column as Y_train and test data as X_test to be used when predicting
Y_train = trg['Category']
X_test = vectorizer.transform(tst['abstract'])
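The vectorizer above is created with binary=True and is fitted on the training data only. The sketch below, on hypothetical documents, illustrates what those two choices mean: binary=True caps every non-zero count at 1, and transform() ignores any word that was not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents, used for illustration only
train_docs = ["gene gene protein", "protein cell"]

# Raw counts versus presence/absence for the same documents
counts = CountVectorizer().fit_transform(train_docs)
vec = CountVectorizer(binary=True)
binary = vec.fit_transform(train_docs)

# Columns follow the sorted vocabulary: cell, gene, protein
print(counts.toarray())
print(binary.toarray())

# "enzyme" was not seen during fitting, so it gets no column in the test matrix
test_matrix = vec.transform(["gene gene enzyme"])
print(test_matrix.toarray())
```

Fitting on the training data and only transforming the test data keeps the columns of both matrices aligned, which is what the classifier needs at prediction time.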

Write out the test and training datasets to be used for model implementation.

In [None]:
#create a dataframe from the vectorized test abstracts (X_test is a SciPy sparse matrix)
df = pd.DataFrame.sparse.from_spmatrix(X_test)
# Shift the index to start from 1
df.index += 1

# Export the DataFrame to a CSV file
df.to_csv('Articles test dataset.csv',index= True, index_label = 'Text')

df = pd.DataFrame(Y_train)
# Reset the index to start from 1
df.index += 1

# Export the DataFrame to a CSV file
df.to_csv('Articles train dataset.csv',index= True, index_label = 'Category')
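CountVectorizer returns a SciPy sparse matrix, so there is more than one way to write it out. The sketch below shows two options on a small hypothetical matrix: a sparse-backed DataFrame, which can be exported with to_csv like any other DataFrame, and SciPy's own .npz format, which is an assumption about what the downstream code can read. The column names and the file name are illustrative only.

```python
import os
import tempfile

import pandas as pd
from scipy import sparse

# Hypothetical 2x3 sparse matrix standing in for a vectorized dataset
matrix = sparse.csr_matrix([[0, 1, 1], [1, 0, 1]])

# Option 1: a sparse-backed DataFrame with one column per term
frame = pd.DataFrame.sparse.from_spmatrix(matrix, columns=["cell", "gene", "protein"])
frame.index += 1  # shift the index to start from 1

# Option 2: keep the sparse format on disk and load it back later
path = os.path.join(tempfile.mkdtemp(), "X_test.npz")
sparse.save_npz(path, matrix)
restored = sparse.load_npz(path)

print(frame)
print(restored.toarray())
```

With n-grams of up to three words the vocabulary can become very large, so the sparse file is likely to be much smaller than a CSV with one column per term.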