# Lab 03: Text Classification on the DBpedia14 dataset

### Objectives:
1. Build a Naive Bayes classification model from scratch
2. Evaluate the performance of your model on the DBpedia14 dataset
3. Train an off-the-shelf NB classifier and compare its performance to your implementation
4. Train off-the-shelf implementations of the linear-SVM, RBF-kernel-SVM, and perceptron and compare their performance with the NB models

### Suggested Reading

1. https://arxiv.org/pdf/1811.12808.pdf

### Download the dataset

In [2]:
!pip install datasets


In [1]:
import datasets
import pandas as pd

train_ds, test_ds = datasets.load_dataset('dbpedia_14', split=['train[:80%]', 'test[80%:]'])
df_train: pd.DataFrame = train_ds.to_pandas()
df_test: pd.DataFrame = test_ds.to_pandas()

Reusing dataset d_bpedia14 (C:\Users\sangi\.cache\huggingface\datasets\d_bpedia14\dbpedia_14\2.0.0\7f0577ea0f4397b6b89bfe5c5f2c6b1b420990a1fc5e8538c7ab4ec40e46fa3e)



In [2]:
df_train.head()

|   | label | title | content |
|---|---|---|---|
| 0 | 0 | E. D. Abbott Ltd | Abbott of Farnham E D Abbott Limited was a Br... |
| 1 | 0 | Schwan-Stabilo | Schwan-STABILO is a German maker of pens for ... |
| 2 | 0 | Q-workshop | Q-workshop is a Polish company located in Poz... |
| 3 | 0 | Marvell Software Solutions Israel | Marvell Software Solutions Israel known as RA... |
| 4 | 0 | Bergan Mercy Medical Center | Bergan Mercy Medical Center is a hospital loc... |


In [3]:
set(df_train.head(500)['label'].tolist())

{0}

In [4]:
df_test.head()

|   | label | title | content |
|---|---|---|---|
| 0 | 11 | Jedan od onih života... | Jedan od onih života... (trans. One of Those ... |
| 1 | 11 | Wanna Be a Star | Wanna Be a Star is the ninth album by the Can... |
| 2 | 11 | AOK (album) | AOK is a studio album by the Polish singer an... |
| 3 | 11 | Coal (Leprous album) | Coal is the third studio album released by th... |
| 4 | 11 | 20th Century Masters – The Millennium Collecti... | 20th Century Masters – The Millennium Collect... |


# Part I: Build your own Naive Bayes classification model

### (5 pts) Task I: Build a model from scratch
Using your notes from lecture-02, implement a Naive Bayes model and train it on the DBpedia dataset. Also, feel free to use any text preprocessing you wish, such as the pipeline from Lab02. 

Below is a template class to help you think about the structure of this problem (feel free to design your own code if you like). It contains methods for each inference step in NB. It also has a classmethod that you could use to instantiate the class from a list of documents and a corresponding list of labels. Here we suggest creating a dictionary that maps each word to a unique index $i$ in the $\phi_{i,k}$ probability matrix, which you need to estimate. Because the labels are 0-indexed integers, each label $k$ maps naturally to a unique position in $\mu_{k}$ (you should check this to make sure).
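For reference, one common way to write the quantities the template asks for is sketched below. Only $\phi_{i,k}$ and $\mu_k$ come from the text above; the other symbols are assumptions introduced here: $D$ is the number of training documents, $D_k$ the number with label $k$, $c_{i,k}$ the total count of word $i$ over documents with label $k$, $N$ the vocabulary size, $K$ the number of classes, $x_i$ the count of word $i$ in the document being classified, and $\alpha$ the additive smoothing parameter. The notation in lecture-02 may differ.

$$
\mu_k = \frac{D_k + \alpha}{D + \alpha K}, \qquad
\phi_{i,k} = \frac{c_{i,k} + \alpha}{\sum_{j=1}^{N} c_{j,k} + \alpha N}
$$

$$
\hat{y} = \arg\max_k \left[ \log \mu_k + \sum_{i=1}^{N} x_i \log \phi_{i,k} \right]
$$

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied together.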

In [5]:
from typing import Union, List
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

class NaiveBayesModel:
    
    """Multinomial NB model class template"""
    
    phi: List[List[float]]  # (N, K): P(word n | class k)

    mu: List[float]  # (K,): prior P(class k)

    words_counts: List[List[int]]  # (D, N): count of word n in document d

    labels_list: List[int]  # (D,): label of each document

    n_words: int  # N, the vocabulary size

    count_vectorizer: CountVectorizer  # fitted vectorizer; holds the word -> column index map
    
    
    def __init__(self, n_words: int, words_counts: List[List[int]], labels_list: List[int], count_vectorizer: CountVectorizer):
        """
        Parameters
        ----------
        n_words: vocabulary size N
        words_counts: (D, N) matrix of per-document word counts
        labels_list: label of each document
        count_vectorizer: the fitted CountVectorizer that produced words_counts
        """
        self.words_counts = words_counts
        self.labels_list = labels_list
        self.n_words = n_words
        self.count_vectorizer = count_vectorizer
        self.mu = self.estimate_mu()
        self.phi = self.estimate_phi()
        return
    
    @classmethod
    def from_preprocessed_data(cls, docs_list: List[str], labels_list: List[int]):
        # Turn docs_list into count_vectorized df
        count_vectorizer = CountVectorizer()
        count_vectorizer.fit(docs_list)
        count_vector = count_vectorizer.transform(docs_list)
        return cls(count_vector.shape[1], count_vector.toarray(), labels_list, count_vectorizer)
    
    def estimate_mu(self, alpha: float = 1.):
        """
        Estimate P(Y), the prior over labels
        
        Parameters
        ----------
        alpha: smoothing parameter
        """
        # Labels are assumed to be 0-indexed integers, so label k maps to position k.
        labels = np.asarray(self.labels_list)
        n_class = int(labels.max()) + 1
        # Number of training documents per class
        class_counts = np.bincount(labels, minlength=n_class)
        # Additive smoothing keeps classes missing from the sample at non-zero probability
        self.mu = ((class_counts + alpha) / (len(labels) + alpha * n_class)).tolist()
        return self.mu
    
    def estimate_phi(self, alpha: float = 1.):
        """
        Estimate phi, the N x K matrix 
        describing the probability of
        the nth word in the kth class.
        
        Parameters
        ----------
        alpha: smoothing parameter
        """
        counts = np.asarray(self.words_counts)  # (D, N)
        labels = np.asarray(self.labels_list)   # (D,)
        n_class = int(labels.max()) + 1
        phi = np.zeros((self.n_words, n_class))
        for k in range(n_class):
            # Total count of each word over all documents labelled k, plus smoothing
            word_totals = counts[labels == k].sum(axis=0) + alpha
            # Normalize over the vocabulary so that each column is P(word | class k)
            phi[:, k] = word_totals / word_totals.sum()
        self.phi = phi.tolist()
        return self.phi
    
    def predict_label(self, text: Union[str, List[str]]) -> int:
        """
        Compute label given some input text
        
        Parameters
        ----------
        text: raw input text (a string, or a one-element list of strings)
        
        Returns
        -------
        int: corresponding to the predicted label
        """
        # CountVectorizer.transform expects an iterable of documents, so wrap a bare string
        docs = [text] if isinstance(text, str) else list(text)
        input_counts = self.count_vectorizer.transform(docs).toarray()[0]  # (N,)
        # Work in log space to avoid underflow:
        # log P(y=k | x) is proportional to log mu[k] + sum_n x[n] * log phi[n][k]
        log_posterior = np.log(self.mu) + input_counts @ np.log(self.phi)
        return int(np.argmax(log_posterior))

In [None]:
# Your code goes here

# Train on a small random sample: the model stores a dense document-term matrix,
# and the allocation for the full training split was reported as roughly 1.9 TiB.
data_train = df_train.sample(500)
docs_list = data_train['content'].tolist()
labels_list = data_train['label'].tolist()
print("Available labels in the sample: " + str(set(labels_list)))

# Fit the vectorizer and estimate mu and phi
naiveBayesModel = NaiveBayesModel.from_preprocessed_data(docs_list, labels_list)

# Sanity check on a single document from df_test
test_row = df_test.head(1)
test_doc_list = test_row['content'].tolist()
test_labels_list = test_row['label'].tolist()

print("Testing 1 doc...")
print(naiveBayesModel.predict_label(test_doc_list))

# Predict only the first 10 test documents to keep the run short
print("Testing the first 10 test docs...")
n_eval = 10
test_texts = df_test['content'].tolist()[:n_eval]
test_labels = df_test['label'].tolist()[:n_eval]
predictions = [naiveBayesModel.predict_label([text]) for text in test_texts]

for predicted, actual in zip(predictions, test_labels):
    print("Predicted: " + str(predicted) + " -- Actual: " + str(actual))


Available labels in the sample: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}


# Part II: Model performance evaluation

Evaluating the performance of a classification model may seem as simple as computing an accuracy, and in some cases that is sufficient, but in general accuracy is not a reliable metric by itself. Typically we need to evaluate our model using several different metrics. 

One common issue is class imbalance, where the label distribution in the data is far from uniform. In this case a high accuracy can be misleading because low-frequency labels contribute little to the score. More generally, this is one of the biggest drawbacks of using MLE in NLP: models tend to be much less sensitive to low-probability labels than to high-probability labels. Later in this class we will explore models that learn by predicting words given their context. Can you think of reasons why this can be problematic? Hint: remember Zipf's law.
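To make the imbalance problem concrete, here is a minimal sketch on made-up labels (the 95/5 split and the always-predict-0 "model" are assumptions chosen purely for illustration). It compares accuracy with per-class recall, computed by hand.

```python
from collections import Counter

# Hypothetical imbalanced test set: 95 documents of class 0, 5 of class 1
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

# Fraction of documents whose predicted label matches the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(label: int) -> float:
    """Fraction of documents with this true label that were predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / Counter(y_true)[label]


# Macro averaging weights every class equally, regardless of its frequency
macro_recall = (recall(0) + recall(1)) / 2

print(f"accuracy     = {accuracy:.2f}")
print(f"recall(0)    = {recall(0):.2f}")
print(f"recall(1)    = {recall(1):.2f}")
print(f"macro recall = {macro_recall:.2f}")
```

The accuracy looks strong even though the rare class is never predicted, which is why a class-averaged metric is worth reporting alongside it.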

Another reason to use multiple evaluation methods is that it can help you better understand your data. Evaluating performance on individual classes often reveals problems with the data that would otherwise go unnoticed. For example, if you observe an abundance of misclassified data specific to only a few classes, chances are you have inconsistent labels for those classes in the training set. This is very common in third-party Mechanical Turk data, where quality can vary wildly.

In this lab we will use three metrics and one visualization tool:

1. [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)
2. [F1 score](https://en.wikipedia.org/wiki/F-score)
3. [AUC ROC score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
4. [The confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

The [metrics module](https://scikit-learn.org/stable/modules/model_evaluation.html) within sklearn provides support for nearly any evaluation metric that you will need.
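As a rough sketch of how the four tools listed above can be called from `sklearn.metrics`, the example below uses a made-up 3-class problem. The arrays `y_true` and `y_score` are hypothetical, and macro averaging and one-vs-rest AUC are choices made here, not requirements of the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Hypothetical 3-class example: true labels and per-class probabilities from some model
y_true = np.array([0, 0, 1, 1, 2, 2])
y_score = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],  # a class-1 document that scores highest for class 0
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
# The predicted label is the class with the highest score
y_pred = y_score.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Multiclass AUC needs one score per class; each row must sum to 1
auc = roc_auc_score(y_true, y_score, multi_class="ovr")
# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

print("accuracy :", acc)
print("macro F1 :", macro_f1)
print("AUC (OvR):", auc)
print(cm)
```

Note that the AUC ROC score is computed from scores rather than hard labels, so each model needs to expose per-class probabilities or decision values to be evaluated this way.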

# Part III: Compare your performance to an off-the-shelf NB classifier
Open-source implementations of the NB classifier you built in Part I already exist, of course. One such implementation is [`sklearn.naive_bayes.MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) from the sklearn library.

### (5 pts) Task II: NB model comparison
Train this model on the same data and compare its performance with your model using the metrics from part II.
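One possible starting point is sketched below on a toy corpus. The documents and labels are made up to stand in for the `content` and `label` columns, and wrapping `CountVectorizer` and `MultinomialNB` in a pipeline is a convenience assumed here, not something the lab prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the 'content' and 'label' columns (hypothetical data)
train_docs = [
    "the company makes pens and pencils",
    "a software company based in israel",
    "the third studio album by the band",
    "a live album released by the singer",
]
train_labels = [0, 0, 11, 11]

# alpha=1.0 is add-one (Laplace) smoothing, the same default as in the template class
nb_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
nb_clf.fit(train_docs, train_labels)

# The pipeline vectorizes raw text before classifying it
album_pred = nb_clf.predict(["a new album by the polish singer"])
company_pred = nb_clf.predict(["a company that makes software"])
print(album_pred, company_pred)
```

For a fair comparison with your own model, fit both on the same sample and the same vectorizer settings, then score both with the metrics from Part II.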

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Your code goes here

# Part IV: Compare NB to other classification models

Now that we've built and validated our NB classifier, we want to evaluate other models on this task.

### (5 pts) Task III: Evaluate the perceptron, SVM (linear), and SVM (RBF kernel)
Train and evaluate the following models on this dataset, and compare them with the NB models.

1. [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
2. [Linear-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
3. [RBF-Kernel-SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
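The three models above share the same `fit`/`predict` interface, so they can be trained and scored in one loop. The sketch below assumes a toy corpus (hypothetical data), bag-of-words counts as features, and accuracy as the only metric; in the lab you would substitute the DBpedia14 sample and the full set of metrics from Part II.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC, LinearSVC

# Toy stand-ins for the training and test splits (hypothetical data)
train_docs = [
    "the company makes pens and pencils",
    "a software company based in israel",
    "a polish company that makes dice",
    "the third studio album by the band",
    "a live album released by the singer",
    "the ninth album by the canadian band",
]
train_labels = [0, 0, 0, 11, 11, 11]
test_docs = ["a german company that makes pens", "a studio album by the singer"]
test_labels = [0, 11]

# Use the same count features for every model so the comparison is like for like
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

models = {
    "Perceptron": Perceptron(random_state=0),
    "Linear-SVM": LinearSVC(),
    "RBF-Kernel-SVM": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Keeping the feature extraction fixed across models means any difference in the scores comes from the classifier rather than from the preprocessing.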

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, LinearSVC

In [None]:
# Your code goes here

### (5 pts) Task IV: Select the best model

1. Which model performed the best overall?
2. Which metric(s) influenced this decision?
3. Does the model that learns a non-linear decision boundary perform better than the linear models?