<a href="https://colab.research.google.com/github/maddran/headlineclassification/blob/main/Headline_classification_XLMRoberta_BBC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Topic Classification Using Transformers

This blog post (or Colab notebook) is a demonstration of applying [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) to a pretrained transformer model. Here, we will start with a sequence classification model provided by [Hugging Face](https://huggingface.co/models), and fine-tune it using a dataset of [BBC News content](http://mlg.ucd.ie/datasets/bbc.html) to produce a news topic classification pipeline.

**N.B.** - this demonstration assumes you have some basic background in Natural Language Processing (NLP) and Deep Learning (DL), and as such, I will not be explaining the methods used in great detail. That said, I will embed links to material that could help you along the way.

**N.B.** - portions of the code that follows are heavily based on the following sources:

* Hugging Face examples on [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html).
* [This post](https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1) by Medium user Aniruddha Choudhury.

---



## Install Modules

Let's start by installing two requisite modules:


*   **Sentencepiece** - Python implementation of [Google's unsupervised text tokenizer](https://github.com/google/sentencepiece). This is only required if the transformer model chosen uses SentencePiece tokenization.
*   **transformers** - [Hugging Face's library](https://huggingface.co/transformers/index.html) which contains Python implementations of many of the widely adopted NLP models along with a host of useful helper functions for pre-processing, training, pipelining, and more.



In [1]:
!pip install -Uq Sentencepiece
!pip install -Uq transformers



## Download BBC data

Now, let's download and parse the [BBC Dataset](http://mlg.ucd.ie/datasets/bbc.html) into a dataframe.

In [2]:
import os
import pandas as pd
from glob import glob
import numpy as np

# Check if file has already been downloaded
if "bbc-fulltext.zip" not in glob("bbc*.zip"):
  !wget "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
  !unzip -q bbc*.zip

# Function to extract and decode text from each file
def get_text(filepath, headline = False):
  with open(filepath, 'rb') as f:
    if headline:
      text = f.readline()
    else:
      text = f.read()
  
  return text.decode('utf-8', 'replace')

path = "bbc"
# Get all subfolders
subfolders = [f.path for f in os.scandir(path) if f.is_dir()]
res = []

# Iterate through each subfolder and get text
for sf in subfolders:
  # get news category from file path
  category = sf.split('/')[-1]
  glob_pattern = os.path.join(f'bbc/{category}', '*')
  # get filepaths of all files in subfolder
  filepaths = sorted(glob(glob_pattern), key=os.path.getctime)
  # call get_text for each filepath
  res.append([{"category":category, "text":get_text(fp)} 
              for fp in filepaths])

# Flatten resulting list of dictionaries    
res = [item for sublist in res for item in sublist]

# Create DataFrame
bbc_data = pd.DataFrame(res)
# Replace linebreaks with spaces
bbc_data['text'] = bbc_data['text'].replace(r'\n',' ', regex=True) 


--2021-03-03 20:03:32--  http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip
Resolving mlg.ucd.ie (mlg.ucd.ie)... 137.43.93.132
Connecting to mlg.ucd.ie (mlg.ucd.ie)|137.43.93.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2874078 (2.7M) [application/zip]
Saving to: ‘bbc-fulltext.zip’


2021-03-03 20:03:33 (2.87 MB/s) - ‘bbc-fulltext.zip’ saved [2874078/2874078]
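
The `headline` flag of `get_text` is not exercised in the loop above, so here is a minimal sketch of what it does. The function body mirrors the cell above; the temporary file and its contents are made up for illustration, on the assumption that the first line of each article file is its headline.

```python
import os
import tempfile

def get_text(filepath, headline=False):
    # Same logic as above: read raw bytes, then decode leniently
    with open(filepath, 'rb') as f:
        text = f.readline() if headline else f.read()
    return text.decode('utf-8', 'replace')

# Hypothetical article file: headline on the first line, body after it
with tempfile.TemporaryDirectory() as tmp:
    fp = os.path.join(tmp, "001.txt")
    with open(fp, 'wb') as f:
        f.write(b"Example headline\n\nFirst paragraph of the body.\n")

    # readline() keeps the trailing newline, so strip it
    headline_only = get_text(fp, headline=True).strip()
    full_text = get_text(fp)

print(headline_only)
print(len(full_text))
```

Passing `headline=True` in the list comprehension above would build a headline-only dataset instead of a full-text one.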



Let's take a look at the dataset in a bit more detail.

In [3]:
bbc_data.describe()

       category                                               text
count      2225                                               2225
unique        5                                               2127
top       sport  What high-definition will do to DVDs First it...
freq        511                                                  2


So, it's a relatively small dataset with 2225 entries and five categories, which are listed below along with their frequency counts.

In [4]:
bbc_data.category.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

## Prepare and split data

Put text data into a list and encode the category labels from strings into integers. That is to say, the classification model will output an integer corresponding to the topic class (well, not exactly, but let's say that for simplicity).

We also store the label encoding for use during prediction tasks.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Put string data into a list
data = bbc_data['text'].tolist()

# Encode label categories from strings to ints
le = LabelEncoder()
labels = le.fit_transform(bbc_data.category.tolist())
# Save encoder classes to use at inference
np.save('classes.npy', le.classes_)

# Split data and labels in train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(data, labels, test_size=.2)
# Further split train data into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

The encoded labels and their corresponding categories are as follows:

In [6]:
dict(zip(range(len(le.classes_)),le.classes_))

{0: 'business', 1: 'entertainment', 2: 'politics', 3: 'sport', 4: 'tech'}
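
To show why the classes are saved, here is a minimal sketch of turning integer predictions back into category names at inference time. It uses plain NumPy indexing rather than a `LabelEncoder`, writes to a temporary directory instead of the working directory, and the predicted ids are made up.

```python
import os
import tempfile
import numpy as np

# Class names in the order shown in the mapping above
classes = np.array(['business', 'entertainment', 'politics', 'sport', 'tech'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'classes.npy')
    np.save(path, classes)      # what the training notebook does
    restored = np.load(path)    # what an inference script would do

# Hypothetical integer predictions coming out of the classifier
predicted_ids = np.array([3, 0, 4])
predicted_names = restored[predicted_ids].tolist()
print(predicted_names)  # ['sport', 'business', 'tech']
```

Indexing the restored array with the predicted ids is equivalent to `le.inverse_transform` for this purpose, since the encoder assigns ids by position in `classes_`.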

## Picking a pre-trained model

Just as we encoded the topic labels for use in model training, we must also encode the text to be fed into the model. In essence, we want to convert a sequence of words or *tokens* into a sequence of numbers. As you can imagine, this is a less than trivial task.

Thankfully, the `transformers` library has predefined tokenizers for each pre-trained model. All we have to do is download the tokenizer for our chosen model and apply the requisite pre-processing steps.

So, what model should we pick? In this demonstration, I will start with the pre-trained [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html) model which was developed and released by [Facebook AI](https://arxiv.org/pdf/1911.02116.pdf) in 2020. While we could go with any one of the pre-trained models provided by Hugging Face, I chose XLM-R because, as of writing this in February 2021, it is relatively new, and more crucially, it is a multilingual model trained on 100 different languages. 

Although the BBC dataset is monolingual (i.e. it only contains English language news), we should be able to use it to train the XLM-R model to classify news in any of the 100 languages it has been trained on 🤯.

However, the downside of using such a large model is that it is, well ... large. So unless you have a multilingual dataset to classify, I strongly urge you to look to the more pared down monolingual models.

## Tokenizing and encoding text

Ok, so we picked the XLM-R model, let's get to it! We start by downloading the XLM-R tokenizer from Hugging Face. We tell the tokenizer to load from a pre-trained XLM-R model `xlm-roberta-base`. The function `encode_text` calls the tokenizer on the text it is fed using the following parameters:

* `max_length` - the maximum length (in tokens) of each input sequence.
* `add_special_tokens` - adds beginning and end of sequence tokens as required by the model.
* `truncation` - truncates each sequence to `max_length`
* `padding` - pads each sequence to `max_length`

The result is a dataset of encoded sequences of uniform length, ready to be used in model training.

In [7]:
# Download the tokenizer for the model being used
from transformers import XLMRobertaTokenizer
print("Downloading Tokenizer...")
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Tokenizes and encodes the text using 
# the tokenizer loaded above
def encode_text(text, tokenizer):
  return tokenizer(text, 
            max_length=128, # Could use longer or shorter max lengths
            add_special_tokens = True, # Adds end of sequence and start of sequence tokens
            truncation=True, # Truncates sequences to max_length
            padding='max_length')  # Pads sequences to max_length




Downloading Tokenizer...






Let's get a better handle on what an encoded sentence looks like by running some sample sentences in different languages through `encode_text`:

In [8]:
# sample sequences in English, Arabic, and French
# All credit to Will Smith, DJ Jazzy Jeff and Google Translate
sample_text = ['Here it is the groove slightly transformed', 
        'فقط قليلا من كسر من القاعدة',
        'Juste un petit quelque chose pour briser la monotonie']

sample_encoded = [encode_text(text, tokenizer) for text in sample_text]

for i, encoded in enumerate(sample_encoded):
  print("Input :",sample_text[i])
  print("Output :",encoded['input_ids'][0:25],"...")
  print(f"Length of output : {len(encoded['input_ids'])}")
  print(f"Length of output w/o padding : {len([e for e in encoded['input_ids'] if e != 1])}\n")

Input : Here it is the groove slightly transformed
Output : [0, 11853, 442, 83, 70, 11969, 8206, 161549, 27198, 297, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ...
Length of output : 128
Length of output w/o padding : 11

Input : فقط قليلا من كسر من القاعدة
Output : [0, 7012, 185997, 230, 6, 135154, 230, 207934, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ...
Length of output : 128
Length of output w/o padding : 9

Input : Juste un petit quelque chose pour briser la monotonie
Output : [0, 9563, 13, 51, 10174, 38944, 19667, 578, 14799, 2189, 21, 182902, 478, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ...
Length of output : 128
Length of output w/o padding : 14



So, all of the input text sequences have been converted to integer sequences. A few things to note:

* Each output sequence starts with a `0` (which indicates the beginning),
* Followed by several integers, then a `2` (which indicates the end),
* Finally followed by a series of `1`s to pad the sequence to the defined length of 128 tokens.
* The length of the output sequence excluding padding is not necessarily equal to the number of words in the input sequence + 2 (for the begin and end tokens). This is due to the way SentencePiece uses subword splitting to more efficiently represent a large vocabulary of words ([see here](https://www.aclweb.org/anthology/D18-2012/) for details).

That seems to work! Let's apply it to the split BBC dataset:

In [9]:
texts = dict(train=train_texts, val=val_texts, test=test_texts)
encoded_text = {key : encode_text(value, tokenizer) for key, value in texts.items()}
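
To make the padding behaviour concrete, here is a toy sketch in plain Python of what `truncation=True` and `padding='max_length'` amount to, including the `attention_mask` the tokenizer returns alongside `input_ids`. The helper `pad_or_truncate` is hypothetical and is not how the library implements it; it also assumes the last id of the input is the end-of-sequence token.

```python
PAD_ID = 1  # padding id seen in the sample output above

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Toy stand-in for truncation=True, padding='max_length'."""
    if len(token_ids) > max_length:
        # Keep the final end-of-sequence id when cutting the sequence short
        ids = token_ids[:max_length - 1] + [token_ids[-1]]
    else:
        ids = list(token_ids)
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# Ids of the first sample sentence, without padding
sample_ids = [0, 11853, 442, 83, 70, 11969, 8206, 161549, 27198, 297, 2]

input_ids, attention_mask = pad_or_truncate(sample_ids, max_length=128)
print(len(input_ids), sum(attention_mask))  # 128 11

short_ids, short_mask = pad_or_truncate(sample_ids, max_length=5)
print(short_ids)  # [0, 11853, 442, 83, 2]
```

The attention mask is what lets the model ignore the padded positions during training.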

Before we continue, please note that much of the remaining code is heavily based on the following sources:

* Hugging Face examples on [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html).
* [This post](https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1) by Medium user Aniruddha Choudhury.

## Preparing datasets

We're finally ready to dip our toes into PyTorch! 

Let's start by defining a class `Dataset` which effectively wraps the `torch.utils.data.Dataset` class to fit our sequence classifier. Then we define an instance of `Dataset` for each of the three training, validation, and testing datasets.


In [10]:
import torch

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Dataset(encoded_text['train'], train_labels)
val_dataset = Dataset(encoded_text['val'], val_labels)
test_dataset = Dataset(encoded_text['test'], test_labels)

Here are a couple of functions to help with monitoring the progress of model training. 

* `accuracy` - returns the accuracy of input predictions `preds`, given true labels `labels`.
* `format_time` - takes a time in seconds and returns a string `hh:mm:ss`

In [11]:
def accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

import time
import datetime
def format_time(elapsed):
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

Now we set ALL the random seeds (for reproducibility) and set `device` to `cuda` if a GPU is available, otherwise we will use the CPU. Please note that the CPU could be several orders of magnitude slower than a GPU in model training.

In [12]:
import random
import numpy as np

seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Finally we set the `batch_size` that the PyTorch `DataLoader` will use to create training mini-batches ([see this](https://datascience.stackexchange.com/questions/16807/why-mini-batch-size-is-better-than-one-single-batch-with-all-training-data) for why mini-batches are used). I chose `batch_size = 16`, but this is one of the model hyperparameters that can be tuned to optimize the classifier.

We then create `DataLoader` instances for the training and validation datasets.

In [13]:
from torch.utils.data import DataLoader

batch_size = 16

train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
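
As a sanity check on the split sizes and the number of mini-batches, the arithmetic can be sketched as below. It assumes, as I understand scikit-learn's behaviour, that `train_test_split` rounds the held-out share up when `test_size` is a fraction, and that the `DataLoader` keeps the last, smaller batch (its default).

```python
import math

n_total = 2225      # rows in bbc_data
test_size = 0.2     # fraction used in both calls to train_test_split
batch_size = 16

# First split: train+validation versus test
n_test = math.ceil(n_total * test_size)
n_train_val = n_total - n_test

# Second split: train versus validation
n_val = math.ceil(n_train_val * test_size)
n_train = n_train_val - n_val

# Number of mini-batches per training epoch
n_train_batches = math.ceil(n_train / batch_size)

print(n_train, n_val, n_test, n_train_batches)  # 1424 356 445 89
```

So the model is fine-tuned on 1424 articles, and each epoch consists of 89 optimizer steps.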


## Prepare model

Now that we have our data, let's turn our attention to the model. We start by downloading the pre-trained `XLMRobertaForSequenceClassification` model. One thing to note is that we have to define how many labels we want our classifier to identify in the `num_labels` parameter - this sets the dimension of the output layer of the transformer model.

Once the model is configured, we load it to `device` and set it to train mode.

The last step is to define the optimizer and number of epochs we wish to use in model training. In this case we will use the [AdamW optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.AdamW) with a learning rate = `5e-5` and `10` epochs. While all of these parameters were the result of some trial and error, in practice, there are more programmatic ways to arrive at them. [This tutorial](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html) on hyperparameter tuning in PyTorch may be a good starting point.

In [14]:
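from transformers import XLMRobertaForSequenceClassification
# transformers.AdamW is deprecated in later releases, so we use the PyTorch
# optimizer linked above (its defaults differ slightly, e.g. weight_decay)
from torch.optim import AdamW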

print("Downloading XLM-R model from Hugging Face...")
model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base',
          num_labels = len(set(labels)),
          output_attentions = False,
          output_hidden_states = False)
model.to(device)
model.train();

optim = AdamW(model.parameters(), lr=5e-5)
epochs = 10
loss_values = []

Downloading XLM-R model from Hugging Face...






Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

## Model training

What follows is a standard PyTorch implementation of model training via backpropagation. For each training epoch, the code outputs the average training loss as well as the calculated accuracy on the validation dataset.

We also checkpoint the trained model after every epoch - this can be done more often (e.g. every n steps) for longer training times.

In [15]:
for epoch in range(epochs):
    print("")
    print(f'======== Epoch {epoch+1} / {epochs} ========')
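    print('Training...')
    # Switch back to train mode; the validation step below calls model.eval()
    model.train()
    total_loss = 0
    t0 = time.time()
    # Avoid a zero interval when there are fewer than 5 batches
    update_interval = max(1, len(train_loader)//5)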
    for step, batch in enumerate(train_loader):
      if step % update_interval == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print(f'  Batch {step:>5,}  of  {len(train_loader):>5,}.    Elapsed: {elapsed}.')
      optim.zero_grad()
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      outputs = model(input_ids, 
                      attention_mask=attention_mask, 
                      labels=labels)
      
      loss = outputs[0]
      total_loss += loss.item()
      loss.backward()

      optim.step()

    avg_train_loss = total_loss / len(train_loader)
    loss_values.append(avg_train_loss) 
    print("\n  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))

    nb_eval_steps, eval_accuracy = 0, 0
    print("\nRunning Validation...")
    t0 = time.time()

    model.eval()

    for batch in val_loader:
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)

      with torch.no_grad(): 
        outputs = model(input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=None)

      logits = outputs[0]                   
      logits = logits.detach().cpu().numpy()
      label_ids = labels.to('cpu').numpy()

      eval_accuracy += accuracy(logits, label_ids)
      nb_eval_steps += 1

    print("\n  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

    torch.save(model, f"xlmr_multilingual_categorization_bbc_dataset_{epoch+1}_epochs.pt")

    



Training...
  Batch    17  of     89.    Elapsed: 0:00:04.
  Batch    34  of     89.    Elapsed: 0:00:08.
  Batch    51  of     89.    Elapsed: 0:00:12.
  Batch    68  of     89.    Elapsed: 0:00:16.
  Batch    85  of     89.    Elapsed: 0:00:20.

  Average training loss: 0.64
  Training epoch took: 0:00:21

Running Validation...

  Accuracy: 0.85
  Validation took: 0:00:01

Training...
  Batch    17  of     89.    Elapsed: 0:00:04.
  Batch    34  of     89.    Elapsed: 0:00:08.
  Batch    51  of     89.    Elapsed: 0:00:12.
  Batch    68  of     89.    Elapsed: 0:00:16.
  Batch    85  of     89.    Elapsed: 0:00:19.

  Average training loss: 0.17
  Training epoch took: 0:00:20

Running Validation...

  Accuracy: 0.94
  Validation took: 0:00:01

Training...
  Batch    17  of     89.    Elapsed: 0:00:04.
  Batch    34  of     89.    Elapsed: 0:00:08.
  Batch    51  of     89.    Elapsed: 0:00:12.
  Batch    68  of     89.    Elapsed: 0:00:16.
  Batch    85  of     89.    Elapsed: 0:00:
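
One detail of the validation loop above is that it averages the accuracy of each batch, so a smaller final batch counts as much as a full one. The sketch below, with made-up outcomes and hypothetical helper names, compares that average with the accuracy over all examples.

```python
def mean_of_batch_accuracies(correct_flags, batch_size):
    # Mirrors the validation loop: average the accuracy of each batch
    batches = [correct_flags[i:i + batch_size]
               for i in range(0, len(correct_flags), batch_size)]
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def overall_accuracy(correct_flags):
    # Fraction of correct predictions over all examples
    return sum(correct_flags) / len(correct_flags)

# Hypothetical outcome: 20 examples, batch size 16,
# with every mistake falling in the short last batch
flags = [1] * 16 + [1, 0, 0, 0]

batch_avg = mean_of_batch_accuracies(flags, batch_size=16)
overall = overall_accuracy(flags)
print(batch_avg, overall)  # 0.625 0.85
```

With hundreds of validation examples the gap is usually much smaller than in this extreme toy case, but counting correct predictions across all batches avoids it entirely.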

## Results

As per the training progress above, we get as high as `0.98` accuracy on the validation set, and the average training loss drops to `0.01`. Not bad at all!

Let's plot the progress of the average training loss.

In [16]:
import plotly.express as px
f = pd.DataFrame(loss_values)
f.columns=['Loss']
fig = px.line(f, x=f.index + 1, y=f.Loss)
fig.update_layout(title='Training loss of the Model',
                   xaxis_title='Epoch',
                   yaxis_title='Loss')
fig.show()

## Testing the model

We now have a trained model! In fact, we have a few. Let's pick the one that corresponds to the lowest average loss (as in the plot above) and see how it does with our test dataset.

The function `predict` is essentially the same as the validation code we used previously, without the outer (epoch) loop. When classifying new data, the model outputs a vector of [logits](https://en.wikipedia.org/wiki/Logit), i.e. unnormalized scores, one per category. The largest logit corresponds to the label that the model deems most probable for the given text.

In [17]:
test_loader = DataLoader(test_dataset, batch_size=batch_size)

def predict(data_loader, model):
  predictions = []
  true_labels = []

  for batch in data_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    with torch.no_grad(): 
      outputs = model(input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=None)

    logits = outputs[0]                   
    logits = logits.detach().cpu().numpy()
    label_ids = labels.to('cpu').numpy()

    predictions.append(logits)
    true_labels.append(label_ids)

  return predictions, true_labels

model_path = glob(f"xlmr_*{np.argmin(loss_values)+1}_epochs.pt")[0]
model = torch.load(model_path)
model.eval()

predictions, true_labels = predict(test_loader, model)

And here is what the predictions look like:

In [18]:
print("# of predictions =", sum(len(p) for p in predictions))
print("# of logits per prediction =", len(predictions[0][0]))
print('\n', predictions[0][0:5])

# of predictions = 445
# of logits per prediction = 5

 [[-0.5814546   0.12978259 -0.78264946 -1.2094038   1.9385245 ]
 [-0.4160801  -0.8368629  -1.5777658  -2.4954176   5.7345533 ]
 [-1.4676574  -1.0874201  -1.6271573   6.6566863  -1.6515472 ]
 [-0.7357026  -0.96731544 -1.0665618  -2.6714253   5.478526  ]
 [-1.7571563  -0.7775552   5.796031   -1.4086983  -2.1100788 ]]


So, we have a vector of five logits (or log-odds) for each text input, where each logit corresponds to one of the topic categories.
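
If probabilities are easier to read than logits, a softmax converts each vector into values that sum to 1. The sketch below applies a hand-written `softmax` (my own helper, not part of the notebook) to the first two logit vectors printed above.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# First two logit vectors from the output above
logits = np.array([
    [-0.5814546, 0.12978259, -0.78264946, -1.2094038, 1.9385245],
    [-0.4160801, -0.8368629, -1.5777658, -2.4954176, 5.7345533],
])

probs = softmax(logits)
top_class = probs.argmax(axis=1)  # index of the most probable class per row

print(probs.round(3))
print(top_class)
```

Softmax preserves the ordering of the logits, so the `argmax` is the same whether it is taken over logits or probabilities. Both rows point to index 4 (`tech`), and the second row is the more confident of the two.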

### Assessing predictions

Yay, predictions! But how good are our predictions? The following code finds the category which corresponds to the largest logit in each prediction vector and compares it against the true category to produce a classification report.

In [19]:
from sklearn.metrics import classification_report

pred_cats1 = [np.argmax(predictions[i], axis=1).flatten() for i in range(len(true_labels))]
pred_cats1 = np.concatenate(pred_cats1).ravel()
pred_cats1 = le.inverse_transform(pred_cats1)

true_cats = le.inverse_transform(np.concatenate(true_labels).ravel())

print(classification_report(true_cats, pred_cats1))

               precision    recall  f1-score   support

     business       0.89      0.93      0.91       115
entertainment       1.00      0.92      0.96        71
     politics       0.85      0.98      0.91        81
        sport       1.00      0.97      0.98        94
         tech       0.95      0.86      0.90        84

     accuracy                           0.93       445
    macro avg       0.94      0.93      0.93       445
 weighted avg       0.93      0.93      0.93       445
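
As a reminder of how the precision and recall columns are derived, here is a small sketch in plain Python. The labels are a made-up toy example, not the BBC test set, and `per_class_precision_recall` is a hypothetical helper rather than the scikit-learn implementation.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    pred_count = Counter(y_pred)   # how often each class was predicted
    true_count = Counter(y_true)   # how often each class actually occurs
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives

    report = {}
    for label in sorted(true_count):
        # Precision: share of predictions for this class that were right
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: share of actual members of this class that were found
        recall = tp[label] / true_count[label]
        report[label] = (precision, recall)
    return report

y_true = ['tech', 'tech', 'business', 'sport']
y_pred = ['tech', 'business', 'business', 'sport']

report = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(f"{label:>10}  precision={precision:.2f}  recall={recall:.2f}")
```

In the toy data one `tech` article is predicted as `business`, which lowers recall for `tech` and precision for `business`. The report above shows the same pattern for those two categories.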



As you can see from the `accuracy` values above, the model performs almost as well on the test dataset (`0.93`) as on the validation dataset (`0.98`) (ah, sweet, sweet generalization).

Let's do one more bit of analysis. Now, we grab the 2nd largest value in each prediction vector and see how well it matches the true category for *input text that was categorized incorrectly*. i.e. are there ambiguous cases where the text could reasonably be classified into multiple categories?

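The code below gets the `k`-th largest logit in each vector, but keeps it only when exactly `k` logits in that vector are positive (for `k = 2`, the runner-up logit is `> 0.0` and the remaining three are not). The `0.0` cutoff is a heuristic: a logit of `0.0` corresponds to `50%` probability only under a sigmoid, not under the softmax used for multi-class classification. In this case, let's just look at the topic category corresponding to the 2nd largest logit value.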

In [20]:
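def kth_largest(predictions_i, k):
  # Index of the k-th largest logit in each row
  kth = np.argsort(-predictions_i, axis = 1)[:,k-1]
  preds = le.inverse_transform(list(kth))
  if k != 1:
    # Count positive logits per row; counting row-wise (rather than with
    # np.bincount) also covers trailing rows that have no positive logit
    n_positive = (predictions_i > 0).sum(axis=1)
    # Keep the k-th prediction only for rows with exactly k positive logits
    preds = np.array([val if n == k else None
                      for val, n in zip(preds, n_positive)], dtype=object)

  return preds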

pred_cats2 = [kth_largest(predictions[i], 2).flatten() for i in range(len(predictions))]
pred_cats2 = np.concatenate(pred_cats2).ravel()
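
To see the selection rule in isolation, here is a toy sketch with made-up logits. As I read `kth_largest`, the runner-up is kept only for rows with exactly two positive logits; the category list is hard-coded here in place of the fitted `LabelEncoder`.

```python
import numpy as np

categories = ['business', 'entertainment', 'politics', 'sport', 'tech']

# Hypothetical logits for three texts
logits = np.array([
    [ 2.1,  0.4, -1.0, -2.0, -0.5],   # two positive logits: ambiguous
    [-1.0, -0.8,  5.7, -1.4, -2.1],   # one positive logit: confident
    [ 0.3,  0.2,  0.1, -1.0, -2.0],   # three positive logits
])

order = np.argsort(-logits, axis=1)      # class indices, best first
second_idx = order[:, 1]                 # index of the 2nd largest logit
n_positive = (logits > 0).sum(axis=1)    # positive logits per row

# Keep the runner-up only when exactly two logits are positive
second_choice = [categories[i] if n == 2 else None
                 for i, n in zip(second_idx, n_positive)]
print(second_choice)  # ['entertainment', None, None]
```

Only the first row yields a second prediction; the other two are dropped because they have one and three positive logits respectively.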

Now, let's look at the classification report comparing the actual categories of the input instances that were misclassified, to the 2nd most probable predicted categories.

In [21]:
res = pd.DataFrame({'actual':true_cats,
                    'prediction_1':pred_cats1, 
                    'prediction_2':pred_cats2, 
                    'text':test_texts})

subres = res[(~res['prediction_2'].isna()) & (res['prediction_1']!=res['actual'])]

print(classification_report(subres['actual'], subres['prediction_2']))

               precision    recall  f1-score   support

     business       0.62      1.00      0.77         5
entertainment       1.00      0.60      0.75         5
     politics       0.33      1.00      0.50         1
        sport       0.50      1.00      0.67         1
         tech       1.00      0.50      0.67         8

     accuracy                           0.70        20
    macro avg       0.69      0.82      0.67        20
 weighted avg       0.85      0.70      0.70        20



As you can see, the 2nd most probable prediction (with logit value > `0.0`) matches the actual category for `70%` of the miscategorized input texts that have one. However, with only 20 such cases, the sample size is rather small and as such, we should take this result with a grain of salt.

That said, let's look at a sample of the text that was miscategorized:

In [26]:
for i, row in subres.sample(5).iterrows():
  print(f"Text: {row['text'][0:120]}...")
  print(f"Actual category: {row['actual']}")
  print(f"Top prediction: {row['prediction_1']}")
  print(f"2nd prediction: {row['prediction_2']}\n")

Text: Prop Jones ready for hard graft  Adam Jones says the Wales forwards are determined to set the perfect attacking platform...
Actual category: sport
Top prediction: politics
2nd prediction: sport

Text: Arthur Hailey: King of the bestsellers  Novelist Arthur Hailey, who has died at the age of 84, was known for his bestsel...
Actual category: entertainment
Top prediction: politics
2nd prediction: sport

Text: Bets off after Big Brother 'leak'  A bookmaker has stopped taking bets on Celebrity Big Brother after claiming "sensitiv...
Actual category: entertainment
Top prediction: politics
2nd prediction: business

Text: BT boosts its broadband packages  British Telecom has said it will double the broadband speeds of most of its home and b...
Actual category: tech
Top prediction: business
2nd prediction: tech

Text: Millions to miss out on the net  By 2025, 40% of the UK's population will still be without internet access at home, says...
Actual category: tech
Top prediction: business
2n

From the above, it can be noted that while some of the examples are slightly ambiguous in their category, there are a few instances where the predictions are flat-out incorrect. Depending on your intended application, the `k-th` largest logit approach may add unnecessary complexity for a marginal improvement in accuracy.

For now, let's see what effect using the 2nd predictions has on the accuracy of the test dataset predictions:

In [28]:
p1 = [ p==l for (p,l) in zip(pred_cats1, true_cats)]
p2 = [ p==l for (p,l) in zip(pred_cats2, true_cats)]
acc1 = sum(p1)/len(p1)
# An instance counts as correct if either the first or the second prediction matches
acc12 = sum(a | b for (a, b) in zip(p1, p2))/len(p1)

print(f"Accuracy of first prediction = {acc1:.2}")
print(f"Accuracy of first or second prediction = {acc12:.2}")

Accuracy of first prediction = 0.93
Accuracy of first or second prediction = 0.96


There we go! An improvement of 3 percentage points.
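
For comparison, the more common way to report this kind of result is plain top-k accuracy, which counts a prediction as correct if the true label is among the `k` largest logits, with no cutoff on the logit value. The sketch below uses made-up logits and a hypothetical `top_k_accuracy` helper; it is not the exact metric computed above.

```python
import numpy as np

def top_k_accuracy(logits, true_ids, k=2):
    # Indices of the k largest logits in each row
    top_k = np.argsort(-logits, axis=1)[:, :k]
    # A row is a hit if the true class index appears among its top k
    hits = (top_k == true_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical logits for three texts and three classes
logits = np.array([
    [ 2.0,  1.0, -1.0],
    [ 0.5,  1.5, -0.2],
    [-1.0,  0.2,  0.1],
])
true_ids = np.array([0, 0, 2])

top1 = top_k_accuracy(logits, true_ids, k=1)
top2 = top_k_accuracy(logits, true_ids, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")  # top-1: 0.33, top-2: 1.00
```

Because it has no cutoff, top-2 accuracy is always at least as high as the "first or second prediction" figure computed with the positive-logit rule.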

I mentioned earlier that XLM-RoBERTa is a multilingual model; however, so far, we have only tested the model on English language news. In the next post, we'll explore how to use our trained model to classify completely new news text in other languages.