# CPSC 43N Assignment 2 - Text Classification

Follow the instructions in this notebook to develop 3 text classifiers:

1. Bag-of-words (BOW) classifier.
2. Word embedding-based (CBOW) classifier.
3. Transformer-based classifier.

You will test these classifiers on subsets of the 20 newsgroups dataset and analyze the errors and successes of each model compared to the others.



## Load Dataset

For this assignment, we train and test classification models on the 20 newsgroups dataset. This dataset comprises around 18,000 newsgroup posts on 20 topics. It is split into 2 subsets (train and test) by `sklearn`.

To keep training and inference time manageable, we will use only the samples of 20 newsgroups that belong to one of two classes ('__talk.politics.misc__' and '__talk.religion.misc__', used below). In this setting, we perform binary classification instead of multiclass classification.

Please **read carefully** the two links below which provide details about the 20newsgroups corpus and how to load and process it with sklearn:

* http://qwone.com/~jason/20Newsgroups/
* https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(
    subset='train', remove=('headers', 'footers', 'quotes'),
    categories=['talk.politics.misc','talk.religion.misc'])

newsgroups_test = fetch_20newsgroups(
    subset='test', remove=('headers', 'footers', 'quotes'),
    categories=['talk.politics.misc','talk.religion.misc'])

# Remove empty documents
for s, name in zip([newsgroups_train, newsgroups_test], ["train", "test"]):
  empty_indices = {i for i, doc in enumerate(s.data) if len(doc) == 0}
  orig_len = len(s.data)
  # Filter only the per-document fields; 'DESCR' is a single description string and is left as is.
  for k in ['data', 'filenames', 'target']:
    s[k] = [s[k][i] for i in range(orig_len) if i not in empty_indices]
  print(f"Removed {len(empty_indices)} empty documents from the {name} set. Before: {orig_len}. After: {len(s.data)}.")

Removed 21 empty documents from the train set. Before: 842. After: 821.
Removed 8 empty documents from the test set. Before: 561. After: 553.


Let's look at the data! Specifically, using the training set, let's find the length of the shortest and longest message in terms of number of characters, and the number of examples from each class.

In [2]:
####################################
#   Your code here
####################################

longest = len(max(newsgroups_train['data'], key=len))
shortest = len(min(newsgroups_train['data'], key=len))


class_names = ['talk.politics.misc','talk.religion.misc']
label_balance = {}
for target in newsgroups_train['target']:
    label_balance[class_names[target]] = label_balance.get(class_names[target], 0) + 1
####################################

print(f"Shortest message: {shortest} chars. Longest message: {longest} chars.")

# label_balance should be a dictionary from class name to the number of
# examples from that class in the train set
print(label_balance)

Shortest message: 1 chars. Longest message: 49094 chars.
{'talk.politics.misc': 458, 'talk.religion.misc': 363}


## Part 1 - BOW Classifier

The first classifier we will train is a bag-of-words (BOW) classifier, as we learned in class:

![](https://drive.google.com/uc?export=view&id=1HcgY3jHSuXoAMaBWmDnY4FlCoLghxh84)

Training this classifier requires the following steps:

1. Preprocessing the data, i.e. preparing the text features we would like to include. In our case, we will use the text content and will tokenize it.

2. Vectorizing the data. Each instance will be represented as a high-dimensional vector indicating the count of each word in the vocabulary.

3. Training the classifier.
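
As a rough illustration of step 2, the sketch below builds count vectors by hand using only the standard library. It is a toy stand-in for what a BOW vectorizer does, not the implementation used in this notebook; the helper names `build_vocabulary` and `count_vector` and the two tiny documents are made up for the example.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct training token to a column index (sorted for determinism)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def count_vector(tokens, vocabulary):
    """Return the count vector of one document; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocabulary)
    for tok, n in Counter(tokens).items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec

# Hypothetical, already-tokenized training documents
train_docs = [["tax", "vote", "tax"], ["faith", "church", "vote"]]
vocabulary = build_vocabulary(train_docs)   # church=0, faith=1, tax=2, vote=3
train_matrix = [count_vector(doc, vocabulary) for doc in train_docs]
print(train_matrix)
```

Each row has one entry per vocabulary word, so the number of features seen by the classifier equals the vocabulary size.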


### Tokenization

[Tokenization](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/) is a critical preprocessing step when we work with text data. A basic tokenizer separates the input text into tokens, which can be either words, characters, or subwords. In this assignment, we include additional processing to reduce the noise caused by typos and frequent but insignificant words contained in text.

First, download and install the trained English pipeline [en_core_web_lg](https://spacy.io/models/en) provided by spaCy:

In [3]:
import spacy

#!python -m spacy download en_core_web_lg

Now, complete the code below to perform the following preprocessing:

* Split the text into words

* Lowercase the words

* Remove stop words (which we expect to be less informative for text classification)

* Remove punctuation

* Lemmatize the tokens (to reduce the number of distinct words, or features, which would help the classifier generalize better).
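
To make the order of these steps concrete, here is a minimal standard-library sketch. It assumes a tiny made-up stop list (`TOY_STOP_WORDS`) and skips lemmatization, which needs a trained pipeline, so it only approximates the spaCy-based tokenizer this notebook asks for.

```python
import string

# Hypothetical, deliberately tiny stop list (the notebook uses spaCy's STOP_WORDS instead)
TOY_STOP_WORDS = {"the", "a", "of", "and", "all", "that"}

def toy_tokenizer(text):
    """Lowercase, split on whitespace, strip punctuation, then drop stop words."""
    words = text.lower().split()
    # Strip punctuation BEFORE the stop-word check, so "all," is recognized as "all"
    words = [w.strip(string.punctuation) for w in words]
    return [w for w in words if w and w not in TOY_STOP_WORDS]

print(toy_tokenizer("The Senate, after all, passed the bill!"))
```

Checking stop words before stripping punctuation would let tokens such as `all,` slip through, which is one reason a proper tokenizer is preferable to whitespace splitting.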

In [4]:
import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# Load trained English pipeline "en_core_web_lg"
nlp = spacy.load('en_core_web_lg')

# Create a custom tokenizer function using spaCy's built-in functionality.
def custom_tokenizer(doc):

    ####################################
    #   Your code here
    ####################################
    tokens = nlp(doc.lower())  # lower case and tokenize
    tokens = [token.lemma_ for token in tokens if token.text not in STOP_WORDS and not token.is_punct]  # remove stopwords and punctuations, and then lemmatize
    
    #### Note: we also tried the method below, which gave higher accuracy,
    #### but splitting on whitespace causes some bugs,
    #### e.g. "#That" should become 2 tokens, "#" and "That" (a stop word),
    #### so we use nlp's tokenizer directly, as in the uncommented code above.
    # lowercase = str.lower(doc)  # lowercase the text
    # words = lowercase.split()  # split into words

    # drop_stop_words = [word for word in words if word not in STOP_WORDS]  # remove stop words
    # drop_punctuation = [''.join(char for char in text if char not in string.punctuation) for text in drop_stop_words]  # remove punctuation

    # lemmatize (word by word): it gave slightly better results for NB, but we didn't keep it
    # tokens = [token.lemma_ for word in drop_punctuation for token in nlp(word)]

    ####################################

    # return preprocessed list of tokens (strings)
    return tokens

In [5]:
"that" in STOP_WORDS

True

In [6]:
newsgroups_train['data'][2]

"#That describes some straights -- and nearly all homosexual males.\n\nCan you provide any evidence that doesn't ahve massive selection\neffects?\n\nNo, I thought not.\n\nJust slander on your part.\n"

In [7]:
custom_tokenizer(newsgroups_train['data'][2])

['describe',
 'straight',
 'nearly',
 'homosexual',
 'male',
 '\n\n',
 'provide',
 'evidence',
 'ahve',
 'massive',
 'selection',
 '\n',
 'effect',
 '\n\n',
 'think',
 '\n\n',
 'slander',
 '\n']

### Build the pipeline for the BOW classification model

Now let's design the pipeline for the BOW classifier with sklearn. The overall pipeline for it should contain:

1. A BOW vectorizer applying the tokenizer implemented above

2. A classifier, which should be set to logistic regression for now

To learn how to use `CountVectorizer` to obtain BOW vectors with a customized tokenizer:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

To learn how to use the Pipeline object to implement classification models:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html




In [8]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 1))
classifier = LogisticRegression(max_iter=1000)

# Create the pipeline for the BOW classifier.
pipe = Pipeline([('vectorizer', bow_vector), ('classifier', classifier)])

Now let's train the model on the training set obtained from 20 newsgroups.

In [9]:
pipe.fit(newsgroups_train.data, newsgroups_train.target)



We can now evaluate the classifier's performance on the test set. As an example, here we only compute [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) as the evaluation metric. Other evaluation metrics such as [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), and [F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), may be used for deeper model analysis.
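
To show how these metrics relate, the sketch below computes them by hand for a binary task. The function name `binary_metrics` and the toy labels are made up for illustration; in the notebook itself the `sklearn.metrics` functions linked above would normally be used.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 positives, 5 negatives; the model finds only one of the positives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In this toy case accuracy is 0.625 while recall is only about 0.33, which is why accuracy alone can hide weaknesses on one class.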

In [10]:
from sklearn import metrics

predicted = pipe.predict(newsgroups_test.data)

# Model Accuracy
print(f"Logistic Regression Accuracy: {metrics.accuracy_score(newsgroups_test.target, predicted)*100:.2f}%")

Logistic Regression Accuracy: 72.33%


You may want to print the number of features with `classifier.coef_.shape[-1]`. Since we trained a BOW model, the number of features should be equal to the vocabulary size after preprocessing.

### Model Variations

Train and test the following variations: (1) changing the logistic regression classifier to Naive Bayes; and (2) including unigram __and bigram__ features. You may either make one modification at a time or experiment with combinations of these design choices. Implement the solution in the code cell below and print the accuracies of the different classifiers.
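
For intuition about variation (2), the sketch below lists the features produced when both unigrams and bigrams are included. The helper `extract_ngrams` is a made-up name for this example; it mimics the effect of an `ngram_range` of `(1, 2)` on one tokenized document rather than reproducing `CountVectorizer` itself.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for every n in the inclusive range."""
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Join consecutive tokens with a space to form one feature string
            features.append(" ".join(tokens[i:i + n]))
    return features

tokens = ["separation", "church", "state"]
print(extract_ngrams(tokens, (1, 1)))  # unigrams only
print(extract_ngrams(tokens, (1, 2)))  # unigrams followed by bigrams
```

Bigrams preserve some local word order, at the cost of a much larger and sparser feature space.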

In [11]:
from sklearn.naive_bayes import MultinomialNB

####################################
#   Your code here
####################################
classifier = MultinomialNB()

# Create the pipeline for the Naive Bayes BOW classifier.
nb_pipe = Pipeline([('vectorizer', bow_vector), ('classifier', classifier)])
nb_pipe.fit(newsgroups_train.data, newsgroups_train.target)
nb_predicted = nb_pipe.predict(newsgroups_test.data)
####################################

print(f"Naive Bayes Accuracy: {metrics.accuracy_score(newsgroups_test.target, nb_predicted)*100:.2f}%")

####################################
#   Your code here
####################################
bigram_classifier = MultinomialNB()

# Create the pipeline for the BOW classifier with unigram and bigram features.
bow_vector = CountVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 2))
bigram_pipe = Pipeline([('vectorizer', bow_vector), ('classifier', bigram_classifier)])
bigram_pipe.fit(newsgroups_train.data, newsgroups_train.target)
bigram_predicted = bigram_pipe.predict(newsgroups_test.data)
####################################

# Note: bigram_classifier above is MultinomialNB; the "LR" in the label below is a leftover name.
print(f"Bigram LR Accuracy: {metrics.accuracy_score(newsgroups_test.target, bigram_predicted)*100:.2f}%")



Naive Bayes Accuracy: 76.31%
Bigram LR Accuracy: 79.02%

Naive Bayes Accuracy: 76.13%
Bigram LR Accuracy: 80.29%


## Part 2 - Word Embedding-based classifier

The BOW pipeline implemented above consists of two components: the BOW vectorizer and the classifier (LR or NB). In that setup, sklearn's vectorizer calls our custom tokenizer internally.

This new classifier uses a distributed representation of the input document: the average of the embeddings of the words it contains. This is called CBOW because it is the continuous (and low-dimensional) version of the BOW approach, in which we summed the one-hot vectors of the words in the document.
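
The averaging step can be sketched with plain Python lists. The 2-dimensional `toy_embeddings` table below is invented for the example (real spaCy vectors in this notebook have 300 dimensions), and `average_embedding` is a hypothetical helper name.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        total = [a + b for a, b in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]

# Made-up 2-dimensional embeddings, for illustration only
toy_embeddings = {"tax": [1.0, 0.0], "vote": [0.0, 1.0], "church": [-1.0, 1.0]}

doc_vector = average_embedding(["tax", "vote", "unknown"], toy_embeddings, dim=2)
print(doc_vector)
```

Like BOW, this representation ignores word order: `["vote", "tax"]` yields the same vector as `["tax", "vote"]`.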

We could reuse the custom tokenizer as is and replace only the vectorizer. However, to make use of spaCy's word embeddings, which are produced as part of the annotation process, we will implement a combined tokenizer and vectorizer that takes a document and returns its feature vector (i.e., the average of the word embeddings in the document). For the sake of simplicity, we will not remove punctuation and stop words, nor lemmatize the words.

Complete the code in the following function in the cell below:

* __transform( )__: takes `X`, which contains all the documents in the input set (i.e. the train or test set), and converts each document into the average of its word embeddings, returning a list of feature vectors. This will be called for both the train and test data.

We include the following function as it's a part of the `BaseEstimator` class:

* __fit( )__: learns the model parameters from the training data. For the vectorizer, it doesn't do anything.

In [12]:
import numpy as np
from sklearn.base import BaseEstimator


class CBOWVectorizer(BaseEstimator):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def transform(self, X):
        ####################################
        #   Your code here
        ####################################
        avg_embeddings = []

        for doc in X:
            doc = self.nlp(doc)

            total_emb = np.zeros(self.dim)
            count = 0

            for word in doc:
                if word.has_vector:
                    total_emb += word.vector
                    count += 1

            avg_emb = total_emb / count if count > 0 else np.zeros(self.dim)
            avg_embeddings.append(avg_emb)

            #### Note: we also tried doc.vector directly, but the result was worse! 
            # so probably this low-level averaging of embeddings is better.

        return np.array(avg_embeddings)
        ####################################

    def fit(self, X, y=None):
        return self

# Create the pipeline for the word embedding-based classifier, and train it.
cbow_classifier = LogisticRegression(max_iter=1000)
cbow_pipe = Pipeline([
    ("vectorizer", CBOWVectorizer(nlp)), ("classifier", cbow_classifier)])
cbow_pipe.fit(newsgroups_train.data, newsgroups_train.target)

Let's evaluate the CBOW classifier on the test set to compare it with the best BOW performance.

In [29]:
cbow_predicted = cbow_pipe.predict(newsgroups_test.data)
print(f"CBOW Accuracy: {metrics.accuracy_score(newsgroups_test.target, cbow_predicted)*100:.2f}")

CBOW Accuracy: 78.66


## Part 3 - Transformer-Based classifier

Finally, in the last part of the assignment, you will develop a text classifier based on a pretrained language model, RoBERTa.

In the previous part, we encoded each word separately into a static word embedding. In this part, we encode the entire document with a contextualized representation, which allows us to dynamically compute word representations that capture the appropriate sense of each word in the given context.

To represent the entire document (__pooling__), we will take the embedding of the `[CLS]` token (called `<s>` in RoBERTa), i.e., the first embedding in the sequence.
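
As a small illustration of what pooling means here, the sketch below contrasts first-token pooling with mean pooling on a made-up matrix of hidden states (one row per token). The function names and numbers are hypothetical; real RoBERTa-base hidden states have 768 dimensions.

```python
def first_token_pooling(hidden_states):
    """Use the vector of the first token ([CLS] / <s>) as the document vector."""
    return hidden_states[0]

def mean_pooling(hidden_states):
    """Alternative: average the vectors of all tokens in the sequence."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / n for d in range(dim)]

# Made-up hidden states for a 3-token sequence with 2 dimensions per token
hidden = [[1.0, 2.0],   # first token, i.e. the [CLS] / <s> position
          [3.0, 0.0],
          [2.0, 4.0]]

print(first_token_pooling(hidden))
print(mean_pooling(hidden))
```

First-token pooling relies on the model having learned to summarize the sequence at that position, whereas mean pooling weighs all tokens equally.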

This is an overview of the classifier:

![](https://drive.google.com/uc?export=view&id=1HcsfSZ4Rix7-je4JWJqM_u9majur7Vw6)

Note that we will not fine-tune RoBERTa but instead use the vectors from RoBERTa as feature vectors (updating only the classifier parameters).
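
The idea of training only the classifier on top of fixed feature vectors can be sketched as follows. This is a toy gradient-descent logistic regression on invented 2-dimensional "frozen" features, not sklearn's `LogisticRegression` solver; `train_logreg`, `predict`, and the data are made up for the example.

```python
import math

def train_logreg(features, labels, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent; the features are never modified."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of class 1
            err = p - y                          # gradient of the log loss w.r.t. z
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    """Predict class 1 when the linear score is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Made-up stand-ins for the frozen encoder outputs of four documents
frozen_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

w, b = train_logreg(frozen_features, labels)
print([predict(x, w, b) for x in frozen_features])
```

Only `w` and `b` change during training; the encoder that produced the features is treated as a fixed function, which keeps training cheap compared with fine-tuning.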

First, download and install another spaCy English pipeline, [en_core_web_trf](https://spacy.io/models/en). This is a transformer pipeline based on [RoBERTa-base](https://ai.meta.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/).



In [13]:
# !pip install spacy-transformers
# !python -m spacy download en_core_web_trf

import spacy_transformers

In [14]:
nlp_trf = spacy.load('en_core_web_trf')

In [22]:
doc = nlp_trf(newsgroups_train.data[2])
# Alternative: the last-layer embedding of the first token (<s>)
# doc._.trf_data.model_output['last_hidden_state'][0, 0]
doc._.trf_data.model_output['pooler_output'].shape

(1, 768)

Again, you are required to implement a new vectorizer that takes the text documents and returns a RoBERTa-based vector representation for each document. Please check the [spaCy documentation](https://spacy.io/api/transformer) to understand how to obtain RoBERTa embeddings and the [Transformers documentation](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) to understand the outputs and dimensions. Please add comments to explain your code.

In [23]:
#import cupy as cp
from tqdm import tqdm

class RobertaVectorizer(BaseEstimator):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 768

    def transform(self, X):
        ####################################
        #   Your code here
        ####################################
        X_embeddings = []

        batch_size = 2  # Adjust the batch size as needed
        # Run the documents through the transformer pipeline in batches,
        # with a single progress bar over all documents.
        for doc in tqdm(self.nlp.pipe(X, batch_size=batch_size), total=len(X)):
            # 'pooler_output' is the final hidden state of the first token (<s>, RoBERTa's
            # [CLS]) passed through the model's pooling layer. Its first dimension indexes
            # the spans the document was split into (a short document gives shape (1, 768),
            # as seen above); [0] keeps the first span only.
            cls_output = doc._.trf_data.model_output['pooler_output'][0]
            # Alternative: the raw last-layer <s> embedding, without the pooling layer
            # cls_output = doc._.trf_data.model_output['last_hidden_state'][0, 0]
            X_embeddings.append(cls_output)
        return X_embeddings
        ####################################

    def fit(self, X, y=None):
        return self

# Create the pipeline for the transformer-based classifier.
trf_classifier = LogisticRegression(max_iter=1000)
trf_pipe = Pipeline([
    ("vectorizer", RobertaVectorizer(nlp_trf)), ("classifier", trf_classifier)])

We are now ready to train this classifier. Please read: https://spacy.io/usage/embeddings-transformers for how to use GPU for model training and inference.

In [24]:
from thinc.api import set_gpu_allocator, require_gpu, prefer_gpu

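# Use GPU if available.
# Note: spaCy's documentation recommends calling prefer_gpu()/require_gpu() before loading
# any pipeline, so a pipeline loaded earlier (like nlp_trf above) may stay on the CPU.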
if prefer_gpu():
  set_gpu_allocator("pytorch")
  require_gpu(0)
  print("Using GPU.")
else:
  print("GPU unavailable.")

# Model Training (This may take > 10 min depending on the GPU)
trf_pipe.fit(newsgroups_train.data, newsgroups_train.target)

GPU unavailable.


Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 821/821 [12:13<00:00,  1.12it/s]


Finally, let's evaluate the transformer-based classifier.

In [25]:
from sklearn import metrics  # metrics was not defined in this session; import it so the cell runs on its own

trf_predicted = trf_pipe.predict(newsgroups_test.data)
print(f"Transformer Accuracy: {metrics.accuracy_score(newsgroups_test.target, trf_predicted)*100:.2f}")

100%|██████████| 553/553 [07:38<00:00,  1.21it/s]