# Bag-Of-Words Classifier

Import the necessary dependencies from scikit-learn. Use [pandas](https://pandas.pydata.org/) to load the CSV file.

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

fileName = 'wikipedia_300/wikipedia_300.csv'
col = ['Text', 'Category']

# Read the dataset and split text and category
df = pd.read_csv(fileName, header=0, sep=',', names=col)
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Category'], random_state = 0, test_size=0.2)

# Init the CountVectorizer and convert the text documents to a matrix of token counts,
# i.e. create a bag of words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
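
To make the bag-of-words step above more concrete, here is a minimal pure-Python sketch of roughly what a count vectorizer does. The helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the toy documents are made up for illustration; the token rule is meant to approximate `CountVectorizer`'s default (lowercase, words of two or more characters), and it is not the library's actual implementation.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # roughly like CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(text, vocabulary):
    # One count per vocabulary entry; tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical toy documents, not taken from the dataset
toy_docs = ["Java is a programming language", "The game is a fun game"]
vocab = build_vocabulary(toy_docs)
matrix = [count_vector(doc, vocab) for doc in toy_docs]

print(list(vocab))  # column order of the matrix
print(matrix)       # one row of token counts per document
```

Each row is one document and each column one vocabulary word, which is the same shape of data that `X_train_counts` holds (there as a sparse matrix).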

### Using Multinomial Naive Bayes

In [17]:
from sklearn.naive_bayes import MultinomialNB

NB_model = MultinomialNB().fit(X_train_counts, y_train)

# Check if model is able to predict
print(NB_model.predict(count_vect.transform(["Script for games are good with java "])))

['Games']
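
As a rough idea of what the multinomial Naive Bayes classifier above computes, here is a small sketch with Laplace smoothing written in plain Python. The function names and the toy token lists are assumptions for illustration only, and the real `MultinomialNB` works on sparse count matrices rather than token lists.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: list of class names.
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label in class_docs:
        total = sum(token_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        # Laplace-smoothed log P(token | class)
        log_like = {
            tok: math.log((token_counts[label][tok] + alpha) / (total + alpha * len(vocab)))
            for tok in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, doc):
    # Score each class by its log prior plus the summed token log-likelihoods;
    # tokens outside the training vocabulary are skipped.
    scores = {
        label: log_prior + sum(log_like[tok] for tok in doc if tok in log_like)
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the dataset
toy_docs = [
    ["game", "level", "player"],
    ["player", "game", "score"],
    ["java", "code", "compiler"],
    ["code", "script", "java"],
]
toy_labels = ["Games", "Games", "Programming", "Programming"]
toy_model = train_multinomial_nb(toy_docs, toy_labels)

print(predict(toy_model, ["java", "script"]))   # Programming
print(predict(toy_model, ["player", "score"]))  # Games
```

The "naive" part is the sum over tokens: each word contributes independently to the class score, regardless of which other words appear.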


### Using Support Vector Machines

In [19]:
from sklearn.svm import LinearSVC

SVM_model = LinearSVC().fit(X_train_counts, y_train)

# Check if model is able to predict
print(SVM_model.predict(count_vect.transform(["Script for programs are good with java"])))

['Programming']


### Comparing Accuracy

Here we compare the accuracy of the Naive Bayes and SVM classifiers on the held-out test set. The SVM scores slightly higher than Naive Bayes (93.33% vs. 91.67%), which on 60 test documents is a difference of one correctly classified document.

In [4]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X_test_counts = count_vect.transform(X_test)
k_fold = KFold(n_splits=10)

NB_predictions = NB_model.predict(X_test_counts)
SVM_predictions = SVM_model.predict(X_test_counts)

print('/====== Multinomial Naive Bayes Classifier ======/')
print(classification_report(y_test, NB_predictions))
print(f'Accuracy Score: {round(accuracy_score(y_test, NB_predictions) * 100, 2)}%')
print('Confusion Matrix:')
print(confusion_matrix(y_test, NB_predictions))
print(k_fold)

print()
print()

print('/====== Support Vector Machine Classifier ======/')
print(classification_report(y_test, SVM_predictions))
print(f'Accuracy Score: {round(accuracy_score(y_test, SVM_predictions) * 100, 2)}%')
print('Confusion Matrix:')
print(confusion_matrix(y_test, SVM_predictions))

              precision    recall  f1-score   support

       Games       0.93      0.90      0.91        29
 Programming       0.91      0.94      0.92        31

   micro avg       0.92      0.92      0.92        60
   macro avg       0.92      0.92      0.92        60
weighted avg       0.92      0.92      0.92        60

Accuracy Score: 91.67%
Confusion Matrix:
[[26  3]
 [ 2 29]]
KFold(n_splits=10, random_state=None, shuffle=False)


              precision    recall  f1-score   support

       Games       0.96      0.90      0.93        29
 Programming       0.91      0.97      0.94        31

   micro avg       0.93      0.93      0.93        60
   macro avg       0.94      0.93      0.93        60
weighted avg       0.94      0.93      0.93        60

Accuracy Score: 93.33%
Confusion Matrix:
[[26  3]
 [ 1 30]]
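
The numbers in the reports above can be recomputed by hand from the two confusion matrices. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes); `metrics_from_confusion` is a helper name made up for this example.

```python
def metrics_from_confusion(matrix, labels):
    # matrix[i][j] = number of documents with true class i predicted as class j
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        predicted_as = sum(row[i] for row in matrix)  # column sum
        actually = sum(matrix[i])                     # row sum
        report[label] = {
            "precision": matrix[i][i] / predicted_as,
            "recall": matrix[i][i] / actually,
        }
    return report

nb_report = metrics_from_confusion([[26, 3], [2, 29]], ["Games", "Programming"])
svm_report = metrics_from_confusion([[26, 3], [1, 30]], ["Games", "Programming"])

print(round(nb_report["accuracy"] * 100, 2))        # 91.67
print(round(svm_report["accuracy"] * 100, 2))       # 93.33
print(round(nb_report["Games"]["precision"], 2))    # 0.93
print(round(nb_report["Games"]["recall"], 2))       # 0.9
```

The two matrices differ in a single cell, so the gap between the classifiers on this split comes down to one Programming document.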


### Comparing Accuracy using Cross Validation

Measuring the accuracy with cross-validation gives a different picture. Here we split the text documents into 10 folds (i.e. 10 groups) and compute the accuracy on each fold after training on the other nine. Comparing the mean accuracy, Naive Bayes now scores higher than the SVM (95.33% vs. 91.67%), but its scores are also more spread out (+/- 6.8% vs. +/- 4.47%).

I'm assuming this is because the generative Naive Bayes model copes better with small datasets than the discriminative SVM. That said, each fold still trains on 270 of the 300 documents, so a lack of training data alone probably doesn't explain the gap. Maybe it has more to do with Naive Bayes overfitting less on this particular dataset with the default parameters.

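Or perhaps it has something to do with the SVM learning the decision boundary between the classes directly, the way discriminative models target p(y|x), while Naive Bayes models the joint distribution p(x,y).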

Or, and this is probably my best guess: if the independence assumption of Naive Bayes holds reasonably well for the features of this dataset, and the classes overlap less on the smaller sets used in cross-validation, then Naive Bayes could outperform the SVM.
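
To illustrate the fold splitting described above, here is a minimal sketch of unshuffled k-fold index generation. `k_fold_indices` is a hypothetical helper, and the 300-document size is inferred from the 60-document test split at 20%. Note that with an integer `cv` and a classifier, `cross_val_score` uses stratified folds, which additionally keep the class proportions similar in each fold; this sketch leaves that part out.

```python
def k_fold_indices(n_samples, n_splits):
    # Yield (train_indices, test_indices) for contiguous, unshuffled folds.
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(300, 10))
train_idx, test_idx = folds[0]

print(len(folds))      # 10 folds
print(len(train_idx))  # 270 documents to train on
print(len(test_idx))   # 30 documents to score on
```

Every document ends up in exactly one test fold, so each of the 10 scores is measured on 30 documents that the model did not see during that round of training.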

In [5]:
X_data_counts = count_vect.transform(df['Text'])

print('/====== Multinomial Naive Bayes Classifier ======/')
scores = cross_val_score(NB_model, X_data_counts, df['Category'], cv=10)
print(scores)
print(f'Accuracy mean: {round(scores.mean()*100, 2)}% (+/- {round(scores.std()*2*100, 2)}%)')

print()
print('/====== Support Vector Machine Classifier ======/')
scores = cross_val_score(SVM_model, X_data_counts, df['Category'], cv=10)
print(scores)
print(f'Accuracy mean: {round(scores.mean()*100, 2)}% (+/- {round(scores.std()*2*100, 2)}%)')

[0.96666667 0.96666667 0.9        0.96666667 0.93333333 1.
 0.9        0.96666667 1.         0.93333333]
Accuracy mean: 95.33% (+/- 6.8%)

[0.93333333 0.93333333 0.9        0.93333333 0.86666667 0.9
 0.93333333 0.93333333 0.93333333 0.9       ]
Accuracy mean: 91.67% (+/- 4.47%)
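
The summary lines above can be reproduced from the printed fold scores using only the standard library. This is a small sketch; `summarize` is a name made up here, and the scores are copied from the output above.

```python
from statistics import mean, pstdev

nb_scores = [0.96666667, 0.96666667, 0.9, 0.96666667, 0.93333333,
             1.0, 0.9, 0.96666667, 1.0, 0.93333333]
svm_scores = [0.93333333, 0.93333333, 0.9, 0.93333333, 0.86666667,
              0.9, 0.93333333, 0.93333333, 0.93333333, 0.9]

def summarize(scores):
    # Mean accuracy and a +/- 2 standard deviation band, both in percent.
    # pstdev is the population standard deviation, matching NumPy's default std().
    return round(mean(scores) * 100, 2), round(pstdev(scores) * 2 * 100, 2)

print(summarize(nb_scores))   # (95.33, 6.8)
print(summarize(svm_scores))  # (91.67, 4.47)
```

The two bands overlap (roughly 88.5% to 100% for Naive Bayes and 87.2% to 96.1% for the SVM), so the difference in means should be read with some caution on a dataset this small.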


### Using TF-IDF Converter

One problem with using raw counts is that longer documents will have higher average counts than shorter documents, even when they are on the same topic, and this can create discrepancies.

By using TF (term frequencies) we can avoid this problem by scaling each word's count relative to the size of the document. Adding IDF (inverse document frequency) also down-weights words that occur in many documents and are therefore less informative (e.g. and, if, then).
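
A small sketch of both ideas, in plain Python. It assumes the L2 normalization that `TfidfTransformer` applies by default and the smoothed IDF formula from the scikit-learn documentation, `ln((1 + n) / (1 + df)) + 1`; the helper names and toy count vectors are made up for illustration.

```python
import math

def l2_normalize(counts):
    # Scale a count vector to unit Euclidean length.
    length = math.sqrt(sum(c * c for c in counts))
    return [c / length for c in counts] if length else list(counts)

def smoothed_idf(count_matrix):
    # idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1,
    # where df(t) is the number of documents containing term t.
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    doc_freq = [sum(1 for row in count_matrix if row[t] > 0) for t in range(n_terms)]
    return [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 3]

print([round(x, 4) for x in l2_normalize(short_doc)])
print([round(x, 4) for x in l2_normalize(long_doc)])
print([round(w, 4) for w in smoothed_idf([short_doc, long_doc, other_doc])])
```

After normalization the short and the long document get the same vector, which is the length effect being removed. In the IDF weights, the term that appears in all three documents gets the lowest weight and the term that appears in only one gets the highest.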

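We'll first transform the training data to a term-frequency representation. Note that `use_idf=False` means only the TF normalization is applied here; the IDF weighting is left off.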

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

... and then refit the classification models on the new TF representation

In [7]:
NB_model = MultinomialNB().fit(X_train_tf, y_train)
SVM_model = LinearSVC().fit(X_train_tf, y_train)

### Comparing Accuracy

In [8]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

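# The models were fit on TF features, so the test counts need the same TF transform.
# (The report printed below was recorded with the raw counts passed to predict(),
# so rerunning this cell may give slightly different numbers.)
X_test_counts = count_vect.transform(X_test)
X_test_tf = tf_transformer.transform(X_test_counts)

NB_predictions = NB_model.predict(X_test_tf)
SVM_predictions = SVM_model.predict(X_test_tf)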

print('/====== Multinomial Naive Bayes Classifier ======/')
print(classification_report(y_test, NB_predictions))
print(f'Accuracy Score: {round(accuracy_score(y_test, NB_predictions) * 100, 2)}%')
print('Confusion Matrix:')
print(confusion_matrix(y_test, NB_predictions))

print()
print()

print('/====== Support Vector Machine Classifier ======/')
print(classification_report(y_test, SVM_predictions))
print(f'Accuracy Score: {round(accuracy_score(y_test, SVM_predictions) * 100, 2)}%')
print('Confusion Matrix:')
print(confusion_matrix(y_test, SVM_predictions))

              precision    recall  f1-score   support

       Games       1.00      0.93      0.96        29
 Programming       0.94      1.00      0.97        31

   micro avg       0.97      0.97      0.97        60
   macro avg       0.97      0.97      0.97        60
weighted avg       0.97      0.97      0.97        60

Accuracy Score: 96.67%
Confusion Matrix:
[[27  2]
 [ 0 31]]


              precision    recall  f1-score   support

       Games       0.91      1.00      0.95        29
 Programming       1.00      0.90      0.95        31

   micro avg       0.95      0.95      0.95        60
   macro avg       0.95      0.95      0.95        60
weighted avg       0.95      0.95      0.95        60

Accuracy Score: 95.0%
Confusion Matrix:
[[29  0]
 [ 3 28]]


### Comparing Accuracy using Cross Validation

With the term-frequency weighting applied, the SVM outperforms the Naive Bayes classifier under cross-validation (95.0% vs. 92.67%). Naive Bayes even got worse than it was on raw counts (92.67% vs. 95.33%). Note that since `use_idf=False`, frequently occurring words are not actually down-weighted here; only the term frequencies are normalized. Maybe the reweighted features disrupt the independence assumption of Naive Bayes on small datasets, while improving SVM classification.

In [11]:
X_data_counts = count_vect.transform(df['Text'])
tf = TfidfTransformer(use_idf=False).fit(X_data_counts)
X_data_tf = tf.transform(X_data_counts)

print('/====== Multinomial Naive Bayes Classifier ======/')
scores = cross_val_score(NB_model, X_data_tf, df['Category'], cv=10)
print(scores)
print(f'Accuracy mean: {round(scores.mean()*100, 2)}% (+/- {round(scores.std()*2*100, 2)}%)')

print()
print('/====== Support Vector Machine Classifier ======/')
scores = cross_val_score(SVM_model, X_data_tf, df['Category'], cv=10)
print(scores)
print(f'Accuracy mean: {round(scores.mean()*100, 2)}% (+/- {round(scores.std()*2*100, 2)}%)')

[0.93333333 0.96666667 0.86666667 0.96666667 0.86666667 1.
 0.93333333 0.9        0.9        0.93333333]
Accuracy mean: 92.67% (+/- 8.33%)

[0.96666667 1.         0.9        0.93333333 0.9        0.96666667
 0.96666667 0.96666667 0.96666667 0.93333333]
Accuracy mean: 95.0% (+/- 6.15%)
