# Table of Contents
- [Introduction](#introduction)
- [Dataset Description and Preprocessing](#dataset)

<br/>

# 1 Introduction <a name="introduction"></a>


<br/>

We use the same dataset and features as in the clustering analysis.

To construct the dataset, we sample the same number of reviews from each category (up to 1000 per category in the experiments below) and combine them into a single set.

In [87]:
%load_ext autoreload
%autoreload 2

from util import import_data
import random

import pandas as pd

SEED = 2
random.seed(SEED)  # seeds Python's `random` only; pandas sampling is seeded below
size_per_class = 10

# (file path, label) for each review category
SOURCES = [
    ('datasets/Clothing_Shoes_and_Jewelry_5.json', 'clothing'),
    ('datasets/Books_5.json', 'books'),
    ('datasets/Movies_and_TV_5.json', 'movies'),
    ('datasets/Software_5.json', 'software'),
    ('datasets/Musical_Instruments_5.json', 'music'),
]


def generate_dataset(size):
    """Sample `size` reviews from each category and combine them into one frame."""
    frames = [
        import_data(path, size=10000, file_type='json', field='reviewText', label=label)
        .sample(size, random_state=SEED)
        for path, label in SOURCES
    ]
    # DataFrame.append was removed in pandas 2.0; concat with a fresh 0..n-1 index instead
    data = pd.concat(frames, ignore_index=True)
    print("Loaded data")
    return data


data = generate_dataset(size_per_class)

Loaded data


# 2 Classification

The goal is to train a classifier that assigns each user review to one of the product categories.
Such a model could be used to categorize reviews automatically or to identify user preferences.

## 2.1 Classification algorithms

For text classification we consider two classifiers: Naive Bayes and Support Vector Machines (SVM).
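As a rough illustration of how the two candidates can be set up side by side, the sketch below fits both on a tiny made-up corpus. It assumes scikit-learn's `TfidfVectorizer` and `LinearSVC` as stand-ins; the notebook itself builds features with its own `TextPreprocessor`, and the texts and labels here are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration (not taken from the review datasets)
texts = [
    "great novel with a gripping plot",
    "the author writes wonderful chapters",
    "the film had superb acting and directing",
    "a boring movie with a weak script",
    "the shoes fit well and look nice",
    "this jacket is warm and comfortable",
]
targets = ["books", "books", "movies", "movies", "clothing", "clothing"]

# Each candidate is a pipeline: TF-IDF features followed by a classifier
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

predictions = {}
for name, model in candidates.items():
    model.fit(texts, targets)
    predictions[name] = model.predict(["a comfortable pair of shoes"])[0]
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline keeps the comparison fair, since both models see exactly the same features.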

### 2.1.1 Naive Bayes

### 2.1.2 Support Vector Machines

In [88]:
from util import TextPreprocessor
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Build TF-IDF features from the review text
tpr = TextPreprocessor()
tfidf_vector, tfidf_matrix, dense_tfidf_matrix = tpr.generate_tfidf(data.input, debug=True)

# Hold out 20% of the reviews for testing, then fit a multinomial Naive Bayes classifier
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data.label, test_size=0.2)
clf = MultinomialNB()
clf.fit(X_train, y_train)

Created tfidf transformer
Created tfidf matrix
Created tfidf dense matrix


MultinomialNB()

### Initial attempt with 10 samples per class
We first present the results of the Naive Bayes classifier on a small number of samples (`size_per_class = 10`). The reported metrics are:

* Accuracy
* Precision
* Recall
* F1 score
* Confusion matrix

Accuracy is a global measure: the number of correct predictions divided by the total number of predictions.
The remaining metrics are calculated per class. Precision is the fraction of samples predicted as a given class that truly belong to it, so it drops when the classifier wrongly assigns samples from other classes to that class. Recall is the fraction of samples truly belonging to a class that the classifier identifies as such.
F1 is the harmonic mean of precision and recall. The confusion matrix counts, for each actual class, how many samples were assigned to each predicted class.
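To make these definitions concrete, the following sketch derives each metric from a confusion matrix and cross-checks the result against scikit-learn. The labels are made up for the example and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels, invented for illustration
classes = ["books", "movies", "software"]
y_true = ["books", "books", "books", "movies", "movies",
          "software", "software", "software"]
y_hat = ["books", "books", "movies", "movies", "software",
         "software", "software", "books"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=classes)
tp = np.diag(cm).astype(float)  # correctly classified samples per class

accuracy = tp.sum() / cm.sum()
precision = tp / cm.sum(axis=0)  # correct / all samples predicted as the class
recall = tp / cm.sum(axis=1)     # correct / all samples truly in the class
f1 = 2 * precision * recall / (precision + recall)

# Cross-check the hand-computed values against scikit-learn
p_ref, r_ref, f_ref, _ = precision_recall_fscore_support(y_true, y_hat, labels=classes)
print(accuracy, precision, recall, f1)
```

Note that the hand-written formulas divide by zero when a class is never predicted or never occurs; scikit-learn handles that case through its `zero_division` parameter.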


In [89]:
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
import plotly.express as px
import pandas as pd
from IPython.display import display

labels = ["books", "movies", "software", "clothing", "music"]
y_pred = clf.predict(X_test)
acc = clf.score(X_test, y_test)  # overall accuracy, repeated on every row below

# Per-class precision, recall and F1, in the order given by `labels`
pre, rec, f1, sup = precision_recall_fscore_support(y_test, y_pred, labels=labels)
metrics = pd.DataFrame({'Accuracy': [acc] * len(labels),
                        'Precision': pre,
                        'Recall': rec,
                        'F1': f1,
                        'Label': labels})
display(metrics)

# Confusion matrix: rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred, labels=labels)
df_cm = pd.DataFrame(cm, columns=labels, index=labels)
fig = px.imshow(df_cm, title=f"Confusion matrix for {size_per_class} samples per class",
                labels={'x': 'predicted (y_pred)', 'y': 'actual (y_test)'})
fig.show()


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.



| | Accuracy | Precision | Recall | F1 | Label |
|---|---|---|---|---|---|
| 0 | 0.2 | 1.0 | 0.5 | 0.666667 | books |
| 1 | 0.2 | 0.0 | 0.0 | 0.0 | movies |
| 2 | 0.2 | 1.0 | 0.5 | 0.666667 | software |
| 3 | 0.2 | 0.0 | 0.0 | 0.0 | clothing |
| 4 | 0.2 | 0.0 | 0.0 | 0.0 | music |


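In this run, books and software are the only categories with any correct predictions: both reach a precision of 1.0 and a recall of 0.5, while the remaining categories score 0. With only 10 reviews in the test set, these figures are too noisy to rank the categories reliably.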
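The `zero_division` warnings above indicate that the random split left at least one class without any test samples. One possible remedy, which the notebook itself does not use, is a stratified split. The sketch below assumes placeholder features and 10 samples per class.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples per class, features are just row numbers
labels_toy = ["books", "movies", "software", "clothing", "music"] * 10
features_toy = list(range(len(labels_toy)))

# Plain random split: class counts in the test set depend on chance
_, _, _, y_test_plain = train_test_split(
    features_toy, labels_toy, test_size=0.2, random_state=0)

# Stratified split: every class keeps its share of the test set
_, _, _, y_test_strat = train_test_split(
    features_toy, labels_toy, test_size=0.2, random_state=0, stratify=labels_toy)

print("plain:     ", Counter(y_test_plain))
print("stratified:", Counter(y_test_strat))
```

With stratification, each of the five classes contributes exactly two reviews to the 10-review test set, so every per-class metric has true samples to be computed from.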


Inspecting the cases where the prediction does not match the actual label shows that the input text is often very generic, for example *Love it!* or *Item as described*, and carries little category-specific vocabulary.

In [90]:
df_pred = pd.DataFrame({'pred': y_pred, 'actual': y_test}).join(data)
df_pred.loc[df_pred.pred != df_pred.label, ['pred', 'actual', 'input']]

| | pred | actual | input |
|---|---|---|---|
| 1 | movies | clothing | I ordered these for my eight-year old he. I or... |
| 0 | music | clothing | My first pair and I love them! |
| 47 | movies | music | Good |
| 6 | movies | clothing | My niece put them on as soon as she opened the... |
| 7 | movies | clothing | Great purchase |
| 17 | movies | books | I gave this as a gift to my niece, for when I ... |
| 49 | movies | music | LOVE this stand. It's very sturdy and well mad... |
| 36 | movies | software | Happy Happy Joy Joy! It was just as it was adv... |


### Adjusting the sample size
We now increase the size of the dataset used for training and examine how the scores, as well as the training time, are affected.
For precision, recall and F1 we report the average over all classes, weighted by the number of true samples in each class.
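How the per-class scores are averaged matters when classes are unevenly represented in the test set. The sketch below contrasts the `macro` and `weighted` options of `precision_recall_fscore_support` on made-up, deliberately imbalanced labels. It also shows that weighted-average recall coincides with accuracy, which is why those two columns are identical in the results table further down.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy, imbalanced labels invented for illustration
y_true = ["books"] * 6 + ["movies"] * 3 + ["music"] * 1
y_hat = ["books"] * 5 + ["movies"] + ["movies"] * 2 + ["books"] + ["books"]

# macro: plain mean over classes; weighted: mean weighted by true samples per class
macro = precision_recall_fscore_support(
    y_true, y_hat, average="macro", zero_division=0)
weighted = precision_recall_fscore_support(
    y_true, y_hat, average="weighted", zero_division=0)
accuracy = accuracy_score(y_true, y_hat)

print("macro recall:   ", macro[1])
print("weighted recall:", weighted[1])
print("accuracy:       ", accuracy)
```

The macro average penalizes the completely missed `music` class as heavily as the larger classes, while the weighted average is dominated by `books`.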

In [None]:
import time

sample_sizes = [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
acc, pre, rec, f1 = [], [], [], []
training_time = []
testing_time = []
feature_size = []

for size in sample_sizes:
    data = generate_dataset(size)

    # Generate features and split
    tpr = TextPreprocessor()
    tfidf_vector, tfidf_matrix, dense_tfidf_matrix = tpr.generate_tfidf(data.input, debug=True)
    X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data.label, test_size=0.2)
    feature_size.append(tfidf_matrix.shape[1])  # number of TF-IDF features

    # Train classifier; perf_counter resolves short intervals better than time.time
    start = time.perf_counter()
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    training_time.append(time.perf_counter() - start)

    # Time prediction on the held-out set
    start = time.perf_counter()
    y_pred = clf.predict(X_test)
    testing_time.append(time.perf_counter() - start)

    # Calculate scores (support is None when an average is requested, so it is discarded)
    acc.append(clf.score(X_test, y_test))
    pre_, rec_, f1_, _ = precision_recall_fscore_support(y_test, y_pred, labels=labels, average='weighted')
    pre.append(pre_)
    rec.append(rec_)
    f1.append(f1_)

Loaded data
Created tfidf transformer
Created tfidf matrix
Created tfidf dense matrix
Loaded data
Created tfidf transformer
Created tfidf matrix
Created tfidf dense matrix
Loaded data
Created tfidf transformer
Created tfidf matrix
Created tfidf dense matrix
Loaded data
Created tfidf transformer
Created tfidf matrix
Created tfidf dense matrix
Loaded data
Created tfidf transformer


In [86]:
import plotly.express as px

metrics = pd.DataFrame({
    'samples per class': sample_sizes,
    'training_time': training_time,
    'testing_time': testing_time,
    'feature_size': feature_size,
    'accuracy': acc,
    'precision': pre,
    'recall': rec,
    'F1': f1})

display(metrics)

# Long format: one row per (sample size, metric) pair, as expected by px.line
df_metrics = metrics.melt(id_vars=['samples per class', 'training_time', 'testing_time', 'feature_size'], var_name='metric', value_name='score')

fig = px.line(df_metrics, x='samples per class', y='score', color='metric', title='Different metrics vs training dataset size')
fig.show()

# Feature count and timings differ by orders of magnitude, hence the log-scaled y axis
df_time = metrics.loc[:, ['samples per class', 'feature_size', 'training_time', 'testing_time']]\
    .melt(id_vars=['samples per class'], var_name='metric', value_name='score')
fig2 = px.line(df_time, x='samples per class', y='score', color='metric', title='Feature size and timings vs training dataset size (log scale)', log_y=True)
fig2.show()

display(df_time)

| | samples per class | training_time | testing_time | feature_size | accuracy | precision | recall | F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 0.001003 | 0.0 | 3317 | 0.68 | 0.698718 | 0.68 | 0.683945 |
| 1 | 100 | 0.002001 | 0.0 | 5091 | 0.84 | 0.84503 | 0.84 | 0.838093 |
| 2 | 200 | 0.004674 | 0.001012 | 6914 | 0.815 | 0.822603 | 0.815 | 0.814197 |
| 3 | 300 | 0.003999 | 0.0 | 8254 | 0.843333 | 0.853895 | 0.843333 | 0.843323 |
| 4 | 400 | 0.004 | 0.001 | 9422 | 0.8725 | 0.877364 | 0.8725 | 0.872754 |
| 5 | 500 | 0.007051 | 0.0 | 10920 | 0.838 | 0.842642 | 0.838 | 0.837391 |
| 6 | 600 | 0.009001 | 0.0 | 11896 | 0.878333 | 0.879473 | 0.878333 | 0.878267 |
| 7 | 700 | 0.008265 | 0.0 | 12722 | 0.865714 | 0.867238 | 0.865714 | 0.864983 |
| 8 | 800 | 0.009677 | 0.001 | 13573 | 0.87625 | 0.877393 | 0.87625 | 0.87602 |
| 9 | 900 | 0.052888 | 0.00298 | 14164 | 0.881111 | 0.884194 | 0.881111 | 0.880312 |


Unnamed: 0,samples per class,metric,score
0,50,feature_size,3317.0
1,100,feature_size,5091.0
2,200,feature_size,6914.0
3,300,feature_size,8254.0
4,400,feature_size,9422.0
5,500,feature_size,10920.0
6,600,feature_size,11896.0
7,700,feature_size,12722.0
8,800,feature_size,13573.0
9,900,feature_size,14164.0
