# CPSC 43N Assignment 2 - Text Classification

Follow the instructions in this notebook to develop 3 text classifiers:

1. Bag-of-words (BOW) classifier.
2. Word embedding-based (CBOW) classifier.
3. Transformer-based classifier.

You will test these classifiers on subsets of the 20 newsgroups dataset, and analyze the errors and successes of each model compared to the others.



## Load Dataset

For this assignment, we train and test classification models on the 20 newsgroups dataset. This dataset comprises around 18,000 newsgroups posts on 20 topics. It is split into 2 subsets (train and test) by `sklearn`.

To keep training and inference time manageable, we will use only the subset of 20 newsgroups whose samples belong to one of two classes ('__rec.sport.baseball__' and '__rec.sport.hockey__', used below). With this setting, we will perform binary classification instead of multiclass classification.

Please **read carefully** the two links below which provide details about the 20newsgroups corpus and how to load and process it with sklearn:

* http://qwone.com/~jason/20Newsgroups/
* https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(
    subset='train', remove=('headers', 'footers', 'quotes'),
    categories=['rec.sport.baseball','rec.sport.hockey'])

newsgroups_test = fetch_20newsgroups(
    subset='test', remove=('headers', 'footers', 'quotes'),
    categories=['rec.sport.baseball','rec.sport.hockey'])

# Remove empty documents
for s, name in zip([newsgroups_train, newsgroups_test], ["train", "test"]):
  empty_indices = {i for i, doc in enumerate(s.data) if len(doc) == 0}
  orig_len = len(s.data)
  for k in ['data', 'filenames', 'target']:  # 'DESCR' is a single description string, not a per-document field
    s[k] = [s[k][i] for i in range(orig_len) if i not in empty_indices]
  print(f"Removed {len(empty_indices)} empty documents from the {name} set. Before: {orig_len}. After: {len(s.data)}.")

Removed 26 empty documents from the train set. Before: 1197. After: 1171.
Removed 19 empty documents from the test set. Before: 796. After: 777.


Let's look at the data! Specifically, using the training set, let's find the length of the shortest and longest message in terms of number of characters, and the number of examples from each class.

In [2]:
####################################
#   Your code here
####################################
# Get lengths of each doc in training data, assigning min and max respectively
lengths = [len(doc) for doc in newsgroups_train.data]
shortest = min(lengths)
longest = max(lengths)

# Initialize label balance dictionary
label_balance = {name: 0 for name in newsgroups_train.target_names}
for label in newsgroups_train.target:
    class_name = newsgroups_train.target_names[label]
    label_balance[class_name] += 1

####################################

print(f"Shortest message: {shortest} chars. Longest message: {longest} chars.")

# label_balance should be a dictionary from class name to the number of
# examples from that class in the train set
print(label_balance)

Shortest message: 1 chars. Longest message: 74878 chars.
{'rec.sport.baseball': 581, 'rec.sport.hockey': 590}


## Part 1 - BOW Classifier

The first classifier we will train is a bag-of-words (BOW) classifier, such as we learned in class:

![](https://drive.google.com/uc?export=view&id=1HcgY3jHSuXoAMaBWmDnY4FlCoLghxh84)

Training this classifier requires the following steps:

1. Preprocessing the data, i.e. preparing the text features we would like to include. In our case, we will use the text content and will tokenize it.

2. Vectorizing the data. Each instance will be represented as a high-dimensional vector indicating the count of each word in the vocabulary.

3. Training the classifier.
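
As a rough illustration of the vectorizing step, the sketch below builds count vectors in plain Python. The helper names (`build_vocabulary`, `count_vector`) and the toy documents are hypothetical and only meant to show the idea; the assignment itself uses sklearn's `CountVectorizer`.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def count_vector(tokens, vocab):
    """Return one count per vocabulary word, in column order."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]  # dicts keep insertion order

# Toy, already-tokenized documents (hypothetical data, not from 20 newsgroups)
toy_docs = [["puck", "goal", "goal"], ["pitcher", "goal"]]
toy_vocab = build_vocabulary(toy_docs)
toy_vectors = [count_vector(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_vectors)
```

Each vector has one entry per vocabulary word, so its length grows with the vocabulary and most entries are zero for real documents.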


### Tokenization

[Tokenization](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/) is a critical preprocessing step when we work with text data. A basic tokenizer separates the input text into tokens, which can be either words, characters, or subwords. In this assignment, we include additional processing to reduce the noise caused by typos and frequent but insignificant words contained in text.

First, download and install the trained English pipeline [en_core_web_lg](https://spacy.io/models/en) provided by spaCy:

In [3]:
import spacy

!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


Now, complete the code below to perform the following preprocessing:

* Split the text into words

* Lowercase the words

* Remove stop words (which we expect to be less informative for text classification)

* Remove punctuation

* Lemmatize the tokens (to reduce the number of distinct words, or features, which would help the classifier generalize better).
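
To make the five steps concrete, here is a minimal plain-Python sketch. It is not the spaCy-based solution expected in the cell below: the stop-word set and the lemma lookup table are tiny hypothetical stand-ins, and `toy_preprocess` is an illustrative name.

```python
import string

# Hypothetical miniature resources, for illustration only
TOY_STOP_WORDS = {"the", "a", "is", "were", "in"}
TOY_LEMMAS = {"players": "player", "skating": "skate", "goals": "goal"}

def toy_preprocess(text):
    tokens = []
    for raw in text.split():                       # split the text into words
        word = raw.strip(string.punctuation)       # remove surrounding punctuation
        word = word.lower()                        # lowercase
        if not word or word in TOY_STOP_WORDS:     # drop stop words and empty strings
            continue
        tokens.append(TOY_LEMMAS.get(word, word))  # map to a lemma if one is known
    return tokens

toy_tokens = toy_preprocess("The players were skating, in the rink!")
print(toy_tokens)
```

A real lemmatizer relies on a trained model or a large dictionary rather than a lookup table, which is why the assignment uses the spaCy pipeline.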

In [4]:
import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# Load trained English pipeline "en_core_web_lg"
nlp = spacy.load('en_core_web_lg')

# Create a custom tokenizer function using spaCy's built-in functionality.
def custom_tokenizer(doc):

    ####################################
    #   Your code here
    ####################################
    processed_doc = nlp(doc)
    tokens = []

    # Process the document token by token
    for token in processed_doc:
        if not token.is_alpha:  # skip punctuation, numbers, and other non-alphabetic tokens
            continue
        if token.text.lower() in STOP_WORDS:
            continue           # skip stop words
        tokens.append(token.lemma_.lower()) # lemmatize and lowercase the word
    ####################################

    # return preprocessed list of tokens (strings)
    return tokens

### Build the pipeline for the BOW classification model

Now let's design the pipeline for the BOW classifier with sklearn. The overall pipeline for it should contain:

1. A BOW vectorizer applying the tokenizer implemented above

2. A classifier, which should be set to logistic regression for now

To learn how to use `CountVectorizer` to obtain BOW vectors with a customized tokenizer:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

To learn how to use the Pipeline object to implement classification models:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html




In [5]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 1))
classifier = LogisticRegression(max_iter=1000)

# Create pipeline for BOW classifier.
pipe = Pipeline([('vectorizer', bow_vector), ('classifier', classifier)])

Now let's train the model on the training set obtained from 20 newsgroups.

In [6]:
pipe.fit(newsgroups_train.data, newsgroups_train.target)



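**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('vectorizer', ...), ('classifier', ...)]` |
| transform_input | |
| memory | |
| verbose | `False` |

**CountVectorizer**

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | |
| lowercase | `True` |
| preprocessor | |
| tokenizer | `<function cus...t 0x13cb5f380>` |
| stop_words | |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |
| ngram_range | `(1, ...)` |

**LogisticRegression**

| Parameter | Value |
|---|---|
| penalty | `'l2'` |
| dual | `False` |
| tol | `0.0001` |
| C | `1.0` |
| fit_intercept | `True` |
| intercept_scaling | `1` |
| class_weight | |
| random_state | |
| solver | `'lbfgs'` |
| max_iter | `1000` |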

We can now evaluate the classifier's performance on the test set. As an example, here we only compute [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) as the evaluation metric. Other evaluation metrics, such as [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), and [F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), may be used for deeper model analysis.
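
For reference, the four metrics mentioned above can be computed by hand from the true and predicted labels. The sketch below is a plain-Python illustration with made-up labels; `binary_metrics` is a hypothetical helper, and in practice the sklearn functions linked above would be used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration (not model output)
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

Accuracy counts all correct predictions, while precision and recall look only at the chosen positive class, so they can diverge from accuracy when the classes are imbalanced.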

In [7]:
from sklearn import metrics

predicted = pipe.predict(newsgroups_test.data)

# Model Accuracy
print(f"Logistic Regression Accuracy: {metrics.accuracy_score(newsgroups_test.target, predicted)*100:.2f}%")

Logistic Regression Accuracy: 89.32%


You may want to print the number of features with `classifier.coef_.shape[-1]`. Since we trained a BOW model, the number of features should be equal to the vocabulary size after preprocessing.

### Model Variations

Train and test the following variations: (1) changing the logistic regression classifier to Naive Bayes; and (2) including unigram __and bigram__ features. You may either make one modification at a time or experiment with combinations of these design choices. Implement the solution in the code cell below and print the accuracies of the different classifiers.
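
To see what "unigram and bigram features" means, the sketch below enumerates n-grams over a token list in plain Python. `ngram_features` is a hypothetical helper for illustration; with sklearn, the equivalent behavior is selected through the `ngram_range` argument of `CountVectorizer`.

```python
def ngram_features(tokens, ngram_range=(1, 2)):
    """List all n-grams with n between the two bounds (inclusive)."""
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            # Join consecutive tokens with a space to form one feature name
            features.append(" ".join(tokens[i:i + n]))
    return features

toy_features = ngram_features(["power", "play", "goal"])
print(toy_features)
```

Adding bigrams lets the model use short phrases as features, at the cost of a much larger and sparser feature space.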

In [8]:
from sklearn.naive_bayes import MultinomialNB

####################################
#   Your code here
####################################

# 1) changing the logistic regression classifier to Naive Bayes

# Use a fresh vectorizer with the same settings, and change the classifier to Naive Bayes
pipe_bayes = Pipeline([('vectorizer', CountVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 1))),
                      ('classifier', MultinomialNB())])

# Fit and predict on the data
pipe_bayes.fit(newsgroups_train.data, newsgroups_train.target)
nb_predicted = pipe_bayes.predict(newsgroups_test.data)

####################################

print(f"Naive Bayes Accuracy: {metrics.accuracy_score(newsgroups_test.target, nb_predicted)*100:.2f}%")

####################################
#   Your code here
####################################

# 2) including unigram and bigram features

# Change the vectorizer to include bigram features. Use a fresh LogisticRegression
# so that refitting does not overwrite the classifier object shared with `pipe`.
pipe_bigram = Pipeline([('vectorizer', CountVectorizer(tokenizer=custom_tokenizer, ngram_range=(1, 2))),
                       ('classifier', LogisticRegression(max_iter=1000))])

# Fit and predict on the data
pipe_bigram.fit(newsgroups_train.data, newsgroups_train.target)
bigram_predicted = pipe_bigram.predict(newsgroups_test.data)

####################################

print(f"Bigram LR Accuracy: {metrics.accuracy_score(newsgroups_test.target, bigram_predicted)*100:.2f}%")



Naive Bayes Accuracy: 93.56%




Bigram LR Accuracy: 88.80%


## Part 2 - Word Embedding-based classifier

The BOW classifier pipeline implemented above consists of two components: the BOW vectorizer and the classifier (LR or NB). In sklearn, our custom tokenizer could be passed directly to the BOW vectorizer.

This new classifier will use a distributed representation of the input document: the average of the embeddings of the words contained in the document. This is called CBOW, since it is the continuous (and low-dimensional) version of the BOW approach, in which we summed the one-hot vectors of the words in the document.
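
The averaging idea can be sketched in plain Python as follows. The 3-dimensional embedding table and the name `average_embedding` are hypothetical and only illustrate the computation; the assignment uses the 300-dimensional spaCy vectors instead.

```python
# Hypothetical 3-dimensional embeddings, for illustration only
TOY_EMBEDDINGS = {
    "hockey": [1.0, 0.0, 2.0],
    "puck": [3.0, 2.0, 0.0],
}

def average_embedding(tokens, embeddings, dim=3):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim  # fall back to a zero vector when nothing is known
    # zip(*vectors) iterates over dimensions, giving one column per dimension
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# "zamboni" has no embedding here, so it is ignored in the average
toy_doc_vector = average_embedding(["hockey", "puck", "zamboni"], TOY_EMBEDDINGS)
print(toy_doc_vector)
```

Unlike a BOW vector, the result has a fixed, small dimension regardless of vocabulary size, and words with similar embeddings contribute similar signals.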

We could potentially reuse the custom tokenizer as is and replace only the vectorizer. However, to make use of the spaCy word embeddings, which are produced as part of the annotation process, we will implement a combined tokenizer and vectorizer that takes a document and returns its feature vector (i.e., the average of the word embeddings in the document). For the sake of simplicity, we will not remove punctuation and stop words, nor lemmatize the words.

Complete the code in the following function in the cell below:

* __transform( )__: gets `X`,  containing all the documents in the input set (i.e. train or test set), and converts each document into the average of its word embeddings, returning a list of feature vectors. This will be called for both the train and test data.

We include the following function as it's a part of the `BaseEstimator` class:

* __fit( )__: learns the model parameters from the training data. For the vectorizer, it doesn't do anything.

In [9]:
import numpy as np
from sklearn.base import BaseEstimator


class CBOWVectorizer(BaseEstimator):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def transform(self, X):
        ####################################
        #   Your code here
        ####################################
        vectors = []
        for doc in X:
            processed_doc = self.nlp(doc)
            
            # Get the embedding vectors for all tokens in documents
            word_vectors = [token.vector for token in processed_doc if token.has_vector]

            # If there are any, collect the average across all tokens
            if len(word_vectors) > 0:
                avg_vector = np.mean(word_vectors, axis=0)
            else:
                avg_vector = np.zeros(self.dim) # Otherwise return all zeros if there are no word vectors
            vectors.append(avg_vector)
        
        return np.array(vectors)
        ####################################

    def fit(self, X, y=None):
        return self

# Create the pipeline for the word embedding-based classifier, and train it.
cbow_classifier = LogisticRegression(max_iter=1000)
cbow_pipe = Pipeline([
    ("vectorizer", CBOWVectorizer(nlp)), ("classifier", cbow_classifier)])
cbow_pipe.fit(newsgroups_train.data, newsgroups_train.target)

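**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('vectorizer', ...), ('classifier', ...)]` |
| transform_input | |
| memory | |
| verbose | `False` |

**CBOWVectorizer**

| Parameter | Value |
|---|---|
| nlp | `<spacy.lang.e...t 0x138416780>` |

**LogisticRegression**

| Parameter | Value |
|---|---|
| penalty | `'l2'` |
| dual | `False` |
| tol | `0.0001` |
| C | `1.0` |
| fit_intercept | `True` |
| intercept_scaling | `1` |
| class_weight | |
| random_state | |
| solver | `'lbfgs'` |
| max_iter | `1000` |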


Let's evaluate the CBOW classifier on the test set to compare it with the best BOW performance.

In [10]:
cbow_predicted = cbow_pipe.predict(newsgroups_test.data)
print(f"CBOW Accuracy: {metrics.accuracy_score(newsgroups_test.target, cbow_predicted)*100:.2f}")

CBOW Accuracy: 91.51


## Part 3 - Transformer-Based classifier

Finally, in the last part of the assignment, you will develop a text classifier based on a pretrained language model, RoBERTa.

In the previous part, we encoded each word separately into a static word embedding. In this part, we encode the entire document with a contextualized representation, which allows us to dynamically compute word representations that capture the appropriate sense of each word in its context.

To represent the entire document (__pooling__), we will take the embedding of the `[CLS]` token (`<s>` in RoBERTa), i.e., the first embedding.

This is an overview of the classifier:

![](https://drive.google.com/uc?export=view&id=1HcsfSZ4Rix7-je4JWJqM_u9majur7Vw6)

Note that we will not fine-tune RoBERTa but instead use the vectors from RoBERTa as feature vectors (updating only the classifier parameters).
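
The pooling step described above amounts to picking one row from the matrix of per-token embeddings. The plain-Python sketch below uses made-up 2-dimensional hidden states; `cls_pool` and `mean_pool` are hypothetical names, and mean pooling is shown only as a common alternative for comparison, not as part of this assignment.

```python
def cls_pool(token_embeddings):
    """Use the embedding at position 0 (the special start token) as the document vector."""
    return token_embeddings[0]

def mean_pool(token_embeddings):
    """Alternative for comparison: average the embeddings of all positions."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]

# Made-up hidden states: one row per token, two dimensions each
toy_hidden_states = [
    [0.5, 1.0],  # position 0: special start token
    [2.0, 0.0],
    [3.5, 2.0],
]
toy_cls = cls_pool(toy_hidden_states)
toy_mean = mean_pool(toy_hidden_states)
print(toy_cls, toy_mean)
```

Because the encoder is frozen, these pooled vectors are computed once per document and then treated as fixed input features for the logistic regression classifier.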

First, download and install another spaCy English pipeline, [en_core_web_trf](https://spacy.io/models/en). We will use it for a transformer pipeline based on [RoBERTa-base](https://ai.meta.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/).



In [11]:
!pip install spacy-transformers
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


The cell below uses a GPU if one is available (e.g., in Colab). You can run on a CPU instead, but training the model will take longer.

In [12]:
import torch

# Changed because the previous PyTorch call is deprecated
is_using_gpu = spacy.prefer_gpu() and torch.cuda.is_available()

# Set the default device and dtype once, based on GPU availability
device = "cuda" if is_using_gpu else "cpu"
torch.set_default_device(device)
torch.set_default_dtype(torch.float32)
print("Using GPU" if is_using_gpu else "Using CPU")

Using CPU


Now, load the pre-trained RoBERTa base model.

In [13]:
import spacy_transformers

# Empty English pipeline
nlp_trf = spacy.blank("en")

# Create the config with the name of the model (RoBERTa base).
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "roberta-base"
    }
}

nlp_trf.add_pipe("transformer", config=config)
nlp_trf.initialize()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<thinc.optimizers.Optimizer at 0x12f3cb880>

Again, you are required to implement a new vectorizer that takes the text documents and returns a RoBERTa-based vector representation for each document. Please check the [spaCy documentation](https://spacy.io/usage/embeddings-transformers) to understand how to obtain RoBERTa embeddings and the [Transformers documentation](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) to understand the outputs and dimensions. Please add a comment to explain your code. You may truncate the inputs to save time.
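
One simple way to truncate inputs is to keep only the first few whitespace-separated words of each document before passing it to the model. The sketch below is an assumption about how this could be done, not part of the provided solution: `truncate_words` and the default of 200 words are hypothetical, and a word count is only a rough proxy for the number of RoBERTa subword tokens.

```python
def truncate_words(text, max_words=200):
    """Keep at most `max_words` whitespace-separated words of `text`."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short documents are returned unchanged
    return " ".join(words[:max_words])

toy_short = truncate_words("the goalie made a great save", max_words=3)
print(toy_short)
```

Truncation trades some information from the end of long posts for faster encoding, which may or may not matter for topic classification.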

In [14]:
import tqdm

class RobertaVectorizer(BaseEstimator):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 768

    def transform(self, X):
        ####################################
        #   Your code here
        ####################################
        embeddings = []
        
        for doc in tqdm.tqdm(X):
            # Process document
            processed_doc = self.nlp(doc)

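            # Get the transformer output, which holds the contextual embeddings
            transformer_data = processed_doc._.trf_data
            
            # tensors[0] is the last hidden state, indexed as [span][token][dim].
            # Take the first token of the first span: RoBERTa's <s> token, which
            # plays the role of [CLS]. Long documents may be split into several
            # spans, so this represents only the beginning of such documents.
            cls_embedding = transformer_data.tensors[0][0][0]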
            embeddings.append(cls_embedding)
            
        return np.array(embeddings)
        ####################################

    def fit(self, X, y=None):
        return self

# Create the pipeline for the transformer-based classifier, and train it.
trf_classifier = LogisticRegression(max_iter=1000)
trf_pipe = Pipeline([
    ("vectorizer", RobertaVectorizer(nlp_trf)), ("classifier", trf_classifier)])

We are now ready to train this classifier.

In [15]:
# Model Training (This may take > 10 min if you don't use a GPU).
trf_pipe.fit(newsgroups_train.data, newsgroups_train.target)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████| 1171/1171 [01:04<00:00, 18.14it/s]


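**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('vectorizer', ...), ('classifier', ...)]` |
| transform_input | |
| memory | |
| verbose | `False` |

**RobertaVectorizer**

| Parameter | Value |
|---|---|
| nlp | `<spacy.lang.e...t 0x337c1b170>` |

**LogisticRegression**

| Parameter | Value |
|---|---|
| penalty | `'l2'` |
| dual | `False` |
| tol | `0.0001` |
| C | `1.0` |
| fit_intercept | `True` |
| intercept_scaling | `1` |
| class_weight | |
| random_state | |
| solver | `'lbfgs'` |
| max_iter | `1000` |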


Finally, let's evaluate the transformer-based classifier.

In [16]:
trf_predicted = trf_pipe.predict(newsgroups_test.data)
print(f"Transformer Accuracy: {metrics.accuracy_score(newsgroups_test.target, trf_predicted)*100:.2f}")

100%|█████████████████████████████████████████| 777/777 [00:37<00:00, 20.75it/s]

Transformer Accuracy: 91.25



