# News Article Classification

**Project**: Feature generation using transformers

**Author**: Nada Stojanovic

**Task**: Use DistilBERT output to train a classifier using Vowpal Wabbit, in order to sort news articles into one of 24 categories, and compare performance to a classifier trained directly on vectorized data.

---

Transformers are deep neural network models designed to process sequential data, such as text, time series, images, and audio signals. They differ from traditional recurrent and convolutional neural networks in their ability to selectively attend to different parts of the input sequence, a mechanism known as **self-attention**, which allows them to capture long-range **dependencies** and **interactions** between different elements of the sequence.
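
To make the idea of self-attention more concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration only: the function names (`softmax`, `self_attention`) and the toy vectors are made up for this example, and it leaves out the learned query/key/value projections, multiple heads and masking that real transformers use, so the queries, keys and values are simply the input vectors themselves.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention over a list of equal-length vectors.

    Assumption for illustration: queries, keys and values are the inputs
    themselves (no learned projection matrices).
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        # Each output is a weighted average of all input vectors
        out = [sum(w[j] * x[j][i] for j in range(len(x))) for i in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights[0])  # attention paid by the first position to each position
```

Each row of `weights` sums to 1, so every output vector is a mixture of all positions in the sequence, regardless of how far apart they are.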

BERT is a **transformer-based**, open-source model for natural language processing. It was designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish **context**. The DistilBERT section below goes into more detail, but part of this contextual information is captured in the output for the CLS token, which is added to every input sequence passed to BERT.

In this notebook, I will extract the hidden state embeddings associated with the CLS tokens for short descriptions of news articles and use them, alongside existing features, to predict the category of each article.

## | Preparing the data

I will be using a Kaggle dataset found [here](https://www.kaggle.com/datasets/rmisra/news-category-dataset). It contains information about nearly 210,000 HuffPost news headlines from 2012 to 2022. Each record in the dataset consists of the following attributes: **category**, **headline**, authors, link, **short_description** and date. To avoid long processing times when running DistilBERT, I will only be using a subset of the dataset, namely the first 2,500 rows.

In [None]:
import pandas as pd

# Downloading the dataset from Kaggle
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

! kaggle datasets download rmisra/news-category-dataset
! unzip news-category-dataset.zip

# Reading the file into a dataframe
df = pd.read_json('News_Category_Dataset_v3.json', lines = True)

There are 26 unique categories of news articles represented in the first 2,500 instances, which we will be looking to predict based on the article's short description, headline and DistilBERT output. We will not be using the 'authors', 'link' and 'date' columns.

In [None]:
# Removing unnecessary columns
df = df.drop(['link', 'authors', 'date'], axis=1)
# Factorizing the category column
df['category'] = pd.factorize(df['category'])[0]
# Using a subset of the dataset
df = df.iloc[:2500]
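
For reference, `pd.factorize` assigns integer codes in order of first appearance, starting at 0. The plain-Python sketch below mimics that behaviour on made-up category names; the helper `factorize_in_order` is hypothetical and only meant as an illustration. Because codes follow first appearance, the first 2,500 rows use a contiguous range of codes starting at 0, whether the subset is taken before or after factorizing.

```python
def factorize_in_order(values):
    # Map each distinct value to the next free integer, in order of first appearance
    codes, uniques = [], {}
    for v in values:
        if v not in uniques:
            uniques[v] = len(uniques)
        codes.append(uniques[v])
    return codes, list(uniques)

codes, uniques = factorize_in_order(["POLITICS", "SPORTS", "POLITICS", "TRAVEL"])
print(codes)    # [0, 1, 0, 2]
print(uniques)  # ['POLITICS', 'SPORTS', 'TRAVEL']
```

Note that these codes are 0-based, which matters for tools that expect class labels to start at 1. VW's `--oaa k` reduction, for instance, expects labels in the range 1 to k.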

## | Feature Extraction using DistilBERT

I will be using **DistilBERT**, a compressed version of BERT. It uses a smaller architecture and fewer parameters than BERT, making it computationally lighter and faster to train and deploy. Despite its smaller size, DistilBERT achieves comparable performance to BERT, preserving over 95% of its performance as measured on the GLUE language understanding benchmark.

DistilBERT takes as input a sequence of tokens and passes them through a stack of encoder layers, which output one vector for each token of the input sequence, including special tokens such as CLS and SEP.

The **CLS token** is of particular interest for this task. It is added to the start of each input sequence, and contains **high-level information** about the sequence as a whole. The output vector corresponding to the CLS token can be used for **classification** and allows the model to make predictions based on the overall meaning of the input, rather than just local context.

A graphical representation in which we view DistilBERT as a black box may help visualize this process:

        [CLS]  word1  word2  word3  word4 ...    
          ↓      ↓      ↓      ↓      ↓       
      |------------------------------------|
      |             DistilBERT             |
      |------------------------------------|      
          ↓ 
    output vector
          ↓
    |------------|
    | Classifier |        ...
    |------------|
          ↓
    Classification

I will be using a pre-defined tokenizer and a pre-trained model from DistilBERT's base, imported from HuggingFace.

In [None]:
# Importing the model and tokenizer from HuggingFace
from transformers import AutoTokenizer, AutoModel
import torch

# Use GPU if possible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the pre-defined DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Load the pre-trained DistilBert model
model = AutoModel.from_pretrained("distilbert-base-uncased").to(device)

I will then split the dataset for training and testing, and use the pre-loaded tokenizer on the short description values.

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and testing sets, with 20% allocated for testing
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Tokenize short description values, with applied padding and truncation to ensure
# consistent sequence lengths, and return PyTorch tensors
tokenized_train = tokenizer(train_df["short_description"].values.tolist(), padding = True, truncation = True, return_tensors="pt")
tokenized_test = tokenizer(test_df["short_description"].values.tolist() , padding = True, truncation = True,  return_tensors="pt")

# Move to device
tokenized_train = {k:v.clone().detach().to(device) for k,v in tokenized_train.items()}
tokenized_test = {k:v.clone().detach().to(device) for k,v in tokenized_test.items()}
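
The effect of `padding=True` and `truncation=True` can be sketched without the tokenizer itself. The toy function below is hypothetical and works on made-up token ids, with 0 as the padding id. It truncates anything beyond a maximum length, pads every sequence in the batch to the length of the longest one, and builds the attention mask that tells the model which positions hold real tokens. The real tokenizer is more careful than this sketch; for example, it keeps the special tokens in place when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    # Cut every sequence down to at most max_length token ids
    truncated = [seq[:max_length] for seq in batch]
    # Pad to the longest remaining sequence in the batch
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7592, 102], [101, 2023, 2003, 2146, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[101, 7592, 102, 0], [101, 2023, 2003, 2146]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```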

I will then pass the tokenized data to DistilBERT and extract the hidden state corresponding to the CLS token at the start of each input sequence.

In [None]:
# Disable gradient calculation to save memory and time
with torch.no_grad():
  # Generate hidden states using the tokenized data from the previous step as input
  hidden_train = model(**tokenized_train)
  hidden_test = model(**tokenized_test)

# Obtain the hidden state embedding information associated with the CLS token
cls_train = hidden_train.last_hidden_state[:,0,:]
cls_test = hidden_test.last_hidden_state[:,0,:]

# Moving to CPU
x_train = cls_train.to("cpu")
y_train = train_df["category"]

x_test = cls_test.to("cpu")
y_test = test_df["category"]
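
The indexing `last_hidden_state[:,0,:]` keeps, for every sequence in the batch, the vector at position 0, which is where the CLS token sits. The sketch below reproduces that slice on a tiny nested list standing in for a tensor of shape (batch size, sequence length, hidden size). The numbers are made up; the hidden size of `distilbert-base-uncased` is 768 rather than 2.

```python
# Toy "last_hidden_state": 2 sequences, 3 token positions, hidden size 2
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # sequence 0: [CLS], token, token
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # sequence 1: [CLS], token, token
]

# Equivalent of tensor[:, 0, :]: one vector per sequence, taken at position 0
cls_vectors = [sequence[0] for sequence in last_hidden_state]
print(cls_vectors)  # [[0.1, 0.2], [0.7, 0.8]]
```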

## | Training a Classifier with DistilBERT output

I chose to train a one-against-all classifier with a hinge loss function in VW, although other options could work well too.

In [None]:
import vowpalwabbit
from sklearn.metrics import precision_recall_fscore_support
from sklearn.feature_extraction.text import CountVectorizer

vw = vowpalwabbit.Workspace("--loss_function hinge --oaa 24")

# Define VW format
def to_vw_format(x1, x2, y, cls):
    # Flatten the DistilBERT output tensor 
    cls_str = ' '.join(map(str, cls.flatten().tolist()))
    res = f"{int(y)} | headline: {x1} | description: {x2}| cls: {cls_str}"
    return res

# Learn from the training set
for x1, x2, y, cls in zip(train_df['headline'], train_df['short_description'], train_df['category'], x_train):
    instance = to_vw_format(x1, x2, y, cls)
    vw.learn(instance)

# Make predictions from the test set
predictions = []
for x1, x2, y, cls in zip(test_df['headline'], test_df['short_description'], test_df['category'], x_test):
    instance = to_vw_format(x1, x2, y, cls)
    predicted_class = vw.predict(instance)
    predictions.append(predicted_class)

In order to gain a more comprehensive understanding of the classifier's performance, I opted to compute accuracy, precision, recall and F1 score.

In [None]:
# Evaluate the accuracy of the classifier
accuracy = len(y_test[y_test == predictions]) / len(y_test)
print(f"Model accuracy {accuracy:.2f}")

# Evaluate the precision, recall and F1 score of the classifier
# The F1 value is stored as f1 so that it does not shadow sklearn's f1_score function
precision, recall, f1, _ = precision_recall_fscore_support(y_test, predictions, average='weighted', zero_division=1)

print(f"Model precision: {precision:.2f}")
print(f"Model recall: {recall:.2f}")
print(f"Model F1-score: {f1:.2f}")

Model accuracy 0.32
Model precision: 0.40
Model recall: 0.32
Model F1-score: 0.26
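
With `average='weighted'`, the per-class precision, recall and F1 scores are averaged using the number of true instances of each class as weights. A side effect is that weighted recall always equals accuracy, which is why those two numbers coincide above. The plain-Python sketch below shows the computation on toy labels; the function `weighted_scores` is made up for illustration and treats an undefined precision (a class that is never predicted) as 1, in the spirit of `zero_division=1`. It is a sketch of the idea, not a reimplementation of sklearn.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predicted instances per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # Undefined precision (class never predicted) is treated as 1
        p = correct[label] / predicted[label] if predicted[label] else 1.0
        r = correct[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        # Weight each class by its share of the true labels
        weight = n_true / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return accuracy, precision, recall, f1

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"{accuracy:.2f} {recall:.2f}")  # accuracy and weighted recall are both 4/6
```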


## | Training a classifier on original data
Rather than comparing the model trained above to a dummy classifier, I thought it would only be fair to compare its performance to that of a classifier trained directly on original data, without the CLS token information.

In [None]:
vw = vowpalwabbit.Workspace("--loss_function hinge --oaa 24")

# Define VW format
def to_vw_format(x1, x2, y):
    res = f"{int(y)} | headline: {x1} | description: {x2}"
    return res

# Learn from the training set
for x1, x2, y in zip(train_df['headline'], train_df['short_description'], train_df['category']):
    instance = to_vw_format(x1, x2, y)
    vw.learn(instance)

# Make predictions from the test set
predictions = []
for x1, x2, y in zip(test_df['headline'], test_df['short_description'], test_df['category']):
    instance = to_vw_format(x1, x2, y)
    predicted_class = vw.predict(instance)
    predictions.append(predicted_class)

In [None]:
# Evaluate the accuracy of the classifier
accuracy = len(test_df[test_df.category == predictions]) / len(test_df)
print(f"Model accuracy {accuracy:.2f}")

# Evaluate the precision, recall and F1 score of the classifier
# The F1 value is stored as f1 so that it does not shadow sklearn's f1_score function
precision, recall, f1, _ = precision_recall_fscore_support(test_df.category, predictions, average='weighted', zero_division=1)

print(f"Model precision: {precision:.2f}")
print(f"Model recall: {recall:.2f}")
print(f"Model F1-score: {f1:.2f}")

Model accuracy 0.62
Model precision: 0.63
Model recall: 0.62
Model F1-score: 0.58


## | Discussion

I expected the first classifier to outperform the second in at least some respect, but it did not. The first model predicted article category with an accuracy of 32%, precision of 40%, recall of 32% and F1 score of 0.26. The second model predicted article category with an accuracy of 62%, precision of 63%, recall of 62% and F1 score of 0.58.

On the basis of these results, it can be concluded that the second approach substantially **outperforms** the first, across the board. However, I theorize that this may not be a reflection of the value of DistilBERT's encoding.

In fact, I believe that the results would be different had VW been able to handle DistilBERT's dense numeric output appropriately. Furthermore, VW and its classifiers are optimized for larger datasets, whereas we are only dealing with 2,500 instances.
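
One concrete place where the embedding may be getting lost is the input format itself. In VW's text format a feature is written as `name:value`, and a token without a value is read as a feature name with an implicit value of 1, so a bare number such as `0.1234` is treated as a token rather than as a magnitude. Colons and pipes inside raw text are also parsed as syntax. In addition, `--oaa k` expects labels in the range 1 to k, whereas `pd.factorize` produces labels starting at 0. The sketch below shows one possible formatting that addresses these points. The helper names (`clean_text`, `to_vw_dense`) and the namespace names are assumptions for illustration, and this variant has not been run on the dataset, so no results are claimed for it.

```python
def clean_text(text):
    # ':' and '|' are syntax in VW's text format, so replace them in raw text
    return text.replace(":", " ").replace("|", " ").replace("\n", " ")

def to_vw_dense(headline, description, label, cls_vector):
    # cls_vector is assumed to be a plain list of floats (e.g. tensor.tolist())
    # Write each embedding dimension as index:value so VW reads it as numeric
    cls_str = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(cls_vector))
    # Shift the 0-based label into the 1..k range expected by --oaa
    return (f"{int(label) + 1} |headline {clean_text(headline)} "
            f"|description {clean_text(description)} |cls {cls_str}")

example = to_vw_dense("Budget vote: what's next", "Lawmakers | markets react", 0, [0.25, -1.5])
print(example)
```

Putting the text and the embedding in separate namespaces also makes it possible to weight or ignore them independently when experimenting.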

## | Further experimentation

In order to test my hypothesis, I will conduct an additional experiment. I will train a Support Vector Machine model from the sklearn library, first on DistilBERT's output, and then just on original data.

The reason I opted for SVMs, although other classifiers would work well too, is that they have been shown to perform well in text classification tasks. They are also known to perform well on smaller datasets, which is fitting in this case.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train a linear SVM model on DistilBERT's output
svm = LinearSVC(random_state=42, max_iter=10000)
svm.fit(x_train, y_train)

# Make predictions on the test data
y_pred = svm.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Evaluate the precision of the classifier
precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
print(f'Precision: {precision:.2f}')

# Evaluate the recall of the classifier
recall = recall_score(y_test, y_pred, average='weighted', zero_division=1)
print(f'Recall: {recall:.2f}')

# Evaluate the F1 score of the classifier
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1-score: {f1:.2f}\n')

Accuracy: 0.64
Precision: 0.66
Recall: 0.64
F1-score: 0.63



And now, I will train the same classifier model on vectorized original data, without DistilBERT's output.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(train_df['short_description'])
X_test_vect = vectorizer.transform(test_df['short_description'])

Y_train_vect = train_df['category']
Y_test_vect = test_df['category']

# Train a linear SVM model
svm = LinearSVC(random_state=42, max_iter=10000)
svm.fit(X_train_vect, Y_train_vect)

# Make predictions on the test data
y_pred = svm.predict(X_test_vect)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(Y_test_vect, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Evaluate the precision of the classifier
precision = precision_score(Y_test_vect, y_pred, average='weighted', zero_division=1)
print(f'Precision: {precision:.2f}')

# Evaluate the recall of the classifier
recall = recall_score(Y_test_vect, y_pred, average='weighted', zero_division=1)
print(f'Recall: {recall:.2f}')

# Evaluate the F1 score of the classifier
f1 = f1_score(Y_test_vect, y_pred, average='weighted')
print(f'F1-score: {f1:.2f}\n')

Accuracy: 0.51
Precision: 0.53
Recall: 0.51
F1-score: 0.48
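
For comparison with the CLS embedding, it helps to recall what `CountVectorizer` produces: a vector of word counts per description, over a vocabulary learned from the training set. The plain-Python sketch below approximates its default behaviour (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary) on made-up sentences; the function names are hypothetical. Word order and context are discarded in this representation, which is exactly the information the CLS embedding is meant to retain.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    # Sorted vocabulary mapping each token to a column index
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(texts, vocabulary):
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        # Tokens unseen during fitting are ignored
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train = ["The market fell", "The team won the game"]
vocab = fit_vocabulary(train)
print(vocab)  # {'fell': 0, 'game': 1, 'market': 2, 'team': 3, 'the': 4, 'won': 5}
print(transform(["the market won big"], vocab))  # [[0, 0, 1, 0, 1, 1]]
```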



## | Conclusion

Now, the first model predicted article category with an accuracy of 64%, precision of 66%, recall of 64% and F1 score of 0.63. The second model predicted article category with an accuracy of 51%, precision of 53%, recall of 51% and F1 score of 0.48.

This suggests that there is real value in DistilBERT's encoding, and is consistent with its ability to selectively attend to different parts of the input, thus capturing long-range dependencies and interactions between different elements of the sequence.

The embeddings appear to retain the parts of the input sequence that are most relevant for the task at hand, while down-weighting irrelevant information. Translating this concept to the area of feature generation and feature engineering would be extremely valuable.

However, as shown by the results of the first set of experiments, further investigations are required to determine the robustness of these results. I've outlined some of the ideas I have for continuing this research below.

## | Further research
* Experiment with different transformer-based models, such as RoBERTa or GPT-3
* Investigate the impact of using larger datasets on model performance, especially in VW
* Explore the use of other classifiers, such as Neural Networks, and see if they yield better performance
* Experiment with hyperparameter tuning
* Experiment with different input data types
* Investigate the potential for transfer learning by training the model on one dataset and fine-tuning it on a related dataset

## | Citations
1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).