## Naive Bayes Video Classifier
This notebook walks through classifying the category of YouTube Trending videos from their titles using Naive Bayes.

Additionally, we will compare the performance of two specific variants of Naive Bayes:
    1. Multinomial Naive Bayes
    2. Bernoulli Naive Bayes

In [1]:
import pandas as pd
import seaborn as sns
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from yellowbrick.text import FreqDistVisualizer

from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve
from yellowbrick.classifier import ROCAUC

import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline


ModuleNotFoundError: No module named 'yellowbrick'

In [None]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

### Load in filtered dataset and look at the formatting
The columns of interest are ['filter_title', 'filter_title_no_stops', 'category_id']
    1. 'filter_title' --> raw title with punctuation removed and letters lowercased
    2. 'filter_title_no_stops' --> 'filter_title' with stopwords also removed
    3. 'category_id' --> output label, the category bucket for each video (16 total)

In [None]:
df_titles_info = pd.read_csv('./output/US_count_vectorizer_dataset.csv')
df_titles_info.head()

### Separate the dataframe into inputs and outputs

In [None]:
df_x = df_titles_info['filter_title']
df_y = df_titles_info['category_id']
df_x_stop = df_titles_info['filter_title_no_stops']

target_names = list(df_titles_info['category_id'].unique())

### Split dataset before vectorizing
Splitting first guards against leaking information from the testing set into the training set (80% training, 20% testing).

https://machinelearningmastery.com/data-leakage-machine-learning/
https://stackoverflow.com/questions/54491953/can-i-use-countvectorizer-on-both-test-and-train-data-at-the-same-time-or-do-i-n
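
As a rough illustration of why the split comes first, the sketch below fits a vectorizer and classifier inside a scikit-learn `Pipeline` on training titles only. The toy titles, labels, and the names `toy_pipeline`, `counts`, and `nb` are assumptions made for this example and are not part of the dataset or the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy titles and labels (hypothetical, not taken from the dataset).
train_titles = ["goal highlights final", "election debate live", "match goal replay"]
train_labels = ["sports", "news", "sports"]
test_titles = ["goal scored in overtime"]

# The pipeline fits the vectorizer only on the data passed to fit().
toy_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipeline.fit(train_titles, train_labels)

vocabulary = toy_pipeline.named_steps["counts"].vocabulary_
# "overtime" occurs only in the test title, so it never enters the vocabulary.
print("overtime" in vocabulary)

# At prediction time, words outside the training vocabulary are ignored.
toy_prediction = toy_pipeline.predict(test_titles)
print(toy_prediction)
```

Because the vocabulary is learned from the training titles alone, nothing about the test titles can influence the fitted model.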

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=117)
x_train_stop, x_test_stop = train_test_split(df_x_stop, test_size=0.2, random_state=117)

x_train_stop = x_train_stop.fillna(' ')
x_test_stop = x_test_stop.fillna(' ')
print("Training data size:", x_train.shape)
print("Testing data size:", x_test.shape)
print("Training data size (no stopwords):", x_train_stop.shape)
print("Testing data size (no stopwords):", x_test_stop.shape)
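
The cell above splits the stopword-free titles in a separate call, which only lines up with `y_train` and `y_test` if both calls select the same rows. The following sketch checks that on a small made-up frame (the `toy` data is hypothetical); it assumes both inputs have the same length and use the same `test_size` and `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe.
toy = pd.DataFrame({
    "filter_title": [f"the title {i}" for i in range(10)],
    "filter_title_no_stops": [f"title {i}" for i in range(10)],
    "category_id": [i % 2 for i in range(10)],
})

# Split once with labels and once without, using the same seed as the notebook.
a_train, a_test, lab_train, lab_test = train_test_split(
    toy["filter_title"], toy["category_id"], test_size=0.2, random_state=117)
b_train, b_test = train_test_split(
    toy["filter_title_no_stops"], test_size=0.2, random_state=117)

# The row indices match, so labels from the first call apply to the second.
same_rows = a_train.index.equals(b_train.index) and a_test.index.equals(b_test.index)
print(same_rows)
```

Passing all three columns to a single `train_test_split` call would give the same alignment more directly.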

### Tokenize words using CountVectorizer
A Bag-of-Words model that both tokenizes a collection of text documents and builds a vocabulary of known words

Each title becomes a vector whose length equals the size of the vocabulary learned from the training set, with each index holding the count of a specific word
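
To make the vector layout concrete, here is a minimal sketch on two made-up titles (the titles and the `toy_*` names are assumptions for illustration only).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_titles = ["cat video cat", "dog video"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each word to its column index; sort words by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_dense = toy_counts.toarray()

print(toy_vocab)   # column order of the count matrix
print(toy_dense)   # one row per title, one column per vocabulary word
```

Each row has one entry per vocabulary word, so the first title yields a count of 2 in the "cat" column and 0 in the "dog" column.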

In [None]:
count_vectorizer = CountVectorizer()
train_count_vector = count_vectorizer.fit_transform(x_train)
test_count_vector = count_vectorizer.transform(x_test)

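# Use a separate vectorizer so the vocabulary fitted on x_train is not overwritten
count_vectorizer_stop = CountVectorizer()
train_count_vector_stop = count_vectorizer_stop.fit_transform(x_train_stop)
test_count_vector_stop = count_vectorizer_stop.transform(x_test_stop)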

### Weight words using TfidfTransformer
Converts the count values from CountVectorizer into a TF-IDF weighted matrix

Term Frequency: How often a word appears in a particular title.
Inverse Document Frequency: Downscales words that appear often across many titles.

The main purpose is to reduce the importance of stopwords that are common across categories

In [None]:
tfidf_vectorizer = TfidfTransformer()
x_trained_tfidf_vector = tfidf_vectorizer.fit_transform(train_count_vector)
x_test_tfidf_vector = tfidf_vectorizer.transform(test_count_vector)
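
The downscaling effect can be seen on a tiny example. This sketch uses three invented titles (an assumption, not data from the notebook) and the default `TfidfTransformer` settings, and compares the weight of a word that appears in every title with one that appears in a single title.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_titles = ["the cat video", "the dog video", "the election debate"]
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_titles)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

col = toy_vec.vocabulary_  # word -> column index
# Within the first title, "the" (in all 3 titles) is weighted below "cat" (in 1 title).
weight_the = toy_tfidf[0, col["the"]]
weight_cat = toy_tfidf[0, col["cat"]]
print(weight_the < weight_cat)
```

Both words occur once in the first title, so the difference in weight comes entirely from the inverse document frequency term.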

### Naive Bayes Model Training

Create a Naive Bayes model for each of the 3 inputs from above

In [None]:
clf_count = MultinomialNB()
clf_count.fit(train_count_vector, y_train)

In [None]:
clf_count_stop = MultinomialNB()
clf_count_stop.fit(train_count_vector_stop, y_train)

In [None]:
clf_tfidf = MultinomialNB()
clf_tfidf.fit(x_trained_tfidf_vector, y_train)
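
The cells above train only `MultinomialNB`. For the Bernoulli variant mentioned in the introduction, a comparison could look roughly like the sketch below. It runs on invented titles and labels (assumptions for illustration), so it says nothing about how the two variants perform on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus, not taken from the dataset.
toy_titles = ["goal goal match", "match final goal", "election debate", "debate vote election"]
toy_labels = ["sports", "sports", "news", "news"]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_titles)

# MultinomialNB models how many times each word occurs.
toy_multinomial = MultinomialNB().fit(toy_X, toy_labels)
# BernoulliNB models word presence/absence; by default (binarize=0.0)
# any count above zero is treated as 1.
toy_bernoulli = BernoulliNB().fit(toy_X, toy_labels)

new_X = toy_vec.transform(["final match tonight"])
pred_multinomial = toy_multinomial.predict(new_X)[0]
pred_bernoulli = toy_bernoulli.predict(new_X)[0]
print(pred_multinomial, pred_bernoulli)
```

On the notebook's data, the same pattern would apply with `train_count_vector` and `y_train` in place of the toy inputs.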

### Accuracy and Classification Report
Test the accuracy of our models using the testing dataset

In [None]:
pred_tfidf = clf_tfidf.predict(x_test_tfidf_vector)
print("Tfidf Model")
print("Accuracy: ", accuracy_score(y_test, pred_tfidf))
print("Precision: ", metrics.precision_score(y_test, pred_tfidf, average='weighted'))
print("F1: ", metrics.f1_score(y_test, pred_tfidf, average='weighted'))
print(metrics.classification_report(y_test, pred_tfidf))


visualizer = ClassificationReport(clf_tfidf, support=True, cmap='Greens')
visualizer.fit(x_trained_tfidf_vector, y_train)
visualizer.score(x_test_tfidf_vector, y_test)
visualizer.show()

In [None]:
pred_count = clf_count.predict(test_count_vector)
print("Count Model")
print("Accuracy: ", accuracy_score(y_test, pred_count))
print("Precision: ", metrics.precision_score(y_test, pred_count, average='weighted'))
print("F1: ", metrics.f1_score(y_test, pred_count, average='weighted'))
print(metrics.classification_report(y_test, pred_count))


visualizer = ClassificationReport(clf_count, support=True, cmap='Greens')
visualizer.fit(train_count_vector, y_train)
visualizer.score(test_count_vector, y_test)
visualizer.show()

In [None]:
pred_count_stop = clf_count_stop.predict(test_count_vector_stop)
print("Count Model (No Stopwords)")
print("Accuracy: ", accuracy_score(y_test, pred_count_stop))
print("Precision: ", metrics.precision_score(y_test, pred_count_stop, average='weighted'))
print("F1: ", metrics.f1_score(y_test, pred_count_stop, average='weighted'))
print(metrics.classification_report(y_test, pred_count_stop))


visualizer = ClassificationReport(clf_count_stop, support=True, cmap='Greens')
visualizer.fit(train_count_vector_stop, y_train)
visualizer.score(test_count_vector_stop, y_test)
visualizer.show()
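
The `average='weighted'` option used in the cells above averages the per-class scores, weighting each class by its number of true samples. A minimal sketch on made-up labels (the `toy_*` values are assumptions) reproduces the library result by hand:

```python
import numpy as np
from sklearn import metrics

toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]

# Precision for each class separately, in label order 0, 1, 2.
per_class = metrics.precision_score(toy_true, toy_pred, average=None)
# Support: how many true samples each class has.
support = np.bincount(toy_true)

manual_weighted = float(np.sum(per_class * support) / support.sum())
library_weighted = metrics.precision_score(toy_true, toy_pred, average="weighted")
print(manual_weighted, library_weighted)
```

With imbalanced categories, the weighted average is dominated by the largest classes, so the per-class rows of the classification report are worth reading alongside it.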

### Confusion Matrix

A graphical breakdown of correct and incorrect predictions per category, showing which categories are confused with one another (false negatives and false positives)

In [None]:
conf_matrix = ConfusionMatrix(clf_count, percent=True, cmap='Greens')
conf_matrix.fit(train_count_vector, y_train)
conf_matrix.score(test_count_vector, y_test)
conf_matrix.show()

conf_matrix = ConfusionMatrix(clf_tfidf, percent=True, cmap='Greens')
conf_matrix.fit(x_trained_tfidf_vector, y_train)
conf_matrix.score(x_test_tfidf_vector, y_test)
conf_matrix.show()

conf_matrix = ConfusionMatrix(clf_count_stop, percent=True, cmap='Greens')
conf_matrix.fit(train_count_vector_stop, y_train)
conf_matrix.score(test_count_vector_stop, y_test)
conf_matrix.show()
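
If yellowbrick is not available, a similar table can be computed with scikit-learn alone. The sketch below uses invented labels (assumptions for illustration) and normalizes each row so that entries read as fractions of the true class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_true = ["news", "news", "sports", "sports", "sports", "music"]
toy_pred = ["news", "sports", "sports", "sports", "music", "music"]
toy_names = ["music", "news", "sports"]

# Rows are true classes and columns are predicted classes, in toy_names order.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_names)
# Divide each row by its total so entries are fractions of the true class.
toy_cm_rate = toy_cm / toy_cm.sum(axis=1, keepdims=True)

print(toy_cm)
print(np.round(toy_cm_rate, 2))
```

Off-diagonal entries in a row are false negatives for that class, and off-diagonal entries in a column are false positives for it.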

### Accuracy Comparison Between Model Variants

In [None]:
normalized_accuracy = accuracy_score(y_test, pred_tfidf)
regular_accuracy = accuracy_score(y_test, pred_count)
regular_accuracy_stop = accuracy_score(y_test, pred_count_stop)


fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
plt.ylim(0, 1)
x = ["Regular Filtered", "Filtered/No Stopwords", "Filtered/Normalized"]
y = [regular_accuracy, regular_accuracy_stop, normalized_accuracy]

fig.suptitle('Naive Bayes Input vs Accuracy', fontsize=15)
plt.xlabel('Input Type', fontsize=15)
plt.ylabel('Accuracy (0 to 1)', fontsize=15)
ax.bar(x,y)
plt.show()

# Conclusion

Overall, our title classification using Naive Bayes seems to perform well, with accuracy ranging between 86-90%. When looking at the titles of YouTube videos, we often see that the categories are highly separable. For example, videos about Sports or Education often use very specific words that are exclusive to that category. However, our model does seem to struggle when classifying Entertainment videos, as their title wording tends to be more general. For example, "the trump presidency last week tonight with jo..." falls under the Entertainment category even though its wording suggests it could belong to News & Politics instead.

We are confident in the performance of our Naive Bayes classifier due to the highly separable categories in the input dataset and our method of splitting the data before vectorizing, which guards against data leakage.