# DIGI405 - Text Classification with Transformers

Notebook developed by Karin Stahel.  

⚠️ **Run this notebook in Google Colab - not in the DIGI405 JupyterHub**. You will want to change to a GPU/TPU accelerated runtime. To do this go to Runtime > Change runtime type. Read more about GPU/TPU availability in Google Colab [here](https://research.google.com/colaboratory/faq.html#gpu-availability).  

This notebook fine-tunes the ["distilbert-base-uncased"](https://huggingface.co/distilbert/distilbert-base-uncased) model on the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). DistilBERT base (uncased) is accessed through the 🤗 Hugging Face [transformers](https://huggingface.co/docs/hub/en/transformers) library.  

Compare the performance of your fine-tuned transformer model with the model built using the [Text Classification Introduction notebook](https://github.com/polsci/text-classification-introduction).

---

**Remember:** Each time you change a setting below, rerun that cell and all of the cells after it, in order, so that the change is carried through the whole classification pipeline.


This cell installs a required library. If Colab prompts you to restart the session, do that and then continue running the cells.

In [None]:
# Install datasets
!pip install datasets==2.14.5

In [None]:
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import pandas as pd

## Load corpus and set train/test split

In [None]:
# Preview the category names
categories = datasets.fetch_20newsgroups().target_names
for category in categories:
    print(category)

In [None]:
# this chooses the categories to load
cats = ['talk.politics.misc', 'talk.religion.misc']

In [None]:
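# this downloads/loads the data
# remove=(...) strips headers, footers and quoted replies, so the model must learn from the message text itself
# to keep that metadata instead, use: dataset = datasets.fetch_20newsgroups(subset='train', categories=cats)
dataset = datasets.fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)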

In [None]:
# assign the train/test split - 0.2 is 80% for training, 20% for testing
test_size = 0.2

# do the train test split ...
# X_train and X_test are the documents
# y_train and y_test are the labels
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, 
                                                    test_size = test_size, random_state=42)
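
The fine-tuning cell later holds out a further 20% of the training documents as an evaluation set, so the corpus ends up in three parts. The sketch below works through that arithmetic. The corpus size of 1000 is hypothetical, and `split_counts` is an illustrative helper that assumes the test count is rounded up and the remainder goes to training, which is how `train_test_split` is understood to treat a float `test_size`.

```python
import math

def split_counts(n_docs, test_size):
    """Return (n_train, n_test) for a float test_size.

    Assumption: the held-out count is rounded up and the rest go to training.
    """
    n_test = math.ceil(n_docs * test_size)
    return n_docs - n_test, n_test

n_docs = 1000  # hypothetical corpus size, not the real document count

# First split: training set vs. test set
n_train, n_test = split_counts(n_docs, 0.2)

# Second split (done in the fine-tuning cell): new training set vs. evaluation set
n_train_b, n_eval = split_counts(n_train, 0.2)

print(f"train: {n_train_b}, eval: {n_eval}, test: {n_test}")
```

Under these assumptions roughly 64% of the documents are used to update the model, 16% to monitor it during training, and 20% are kept back for the final test.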

---

## Fine-tune DistilBERT

**Question:** Run this model and compare the results to the Bag of Words (Naive Bayes) model. Make note of some of the pros and cons of each and discuss with your neighbour or the tutors.
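
One difference that may help with the comparison: a bag-of-words model only sees how often each token occurs, while DistilBERT receives the tokens as a sequence with position information. The sketch below illustrates the first half of that statement with two made-up sentences. `bag_of_words` is a hypothetical helper using simple whitespace tokenisation, which will differ from the vectoriser in the introduction notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a simplified tokeniser)."""
    return Counter(text.lower().split())

# Two invented sentences with the same words in a different order
a = bag_of_words("the senate rejected the church proposal")
b = bag_of_words("the church rejected the senate proposal")

# The two counts are identical, so a bag-of-words model cannot tell them apart
print(a == b)
```

Whether word order matters for separating these newsgroups is something the results from the two notebooks can help you judge.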

In [None]:
# Import libraries
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch

The following cell sets up and trains the model, then returns the results of evaluation against the test set.

Just run the cell. There is no need to change anything.
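
The training arguments in the cell include `warmup_steps=500`, which is counted in optimiser steps rather than documents or epochs. The sketch below shows one way to compare that number with the length of a run. It assumes a single device and no gradient accumulation, and the training set size of 540 is hypothetical. After running the cell, `len(train_dataset)` gives the real value.

```python
import math

def total_training_steps(n_train_docs, batch_size, epochs):
    """Optimiser steps in a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train_docs / batch_size)
    return steps_per_epoch * epochs

n_train_docs = 540  # hypothetical; use len(train_dataset) for the real value
total = total_training_steps(n_train_docs, batch_size=8, epochs=3)
warmup_steps = 500

print(f"total steps: {total}, warmup steps: {warmup_steps}")
print("warmup longer than the run:", warmup_steps >= total)
```

If the warmup is longer than the run, the learning rate would still be ramping up when training stops, which is worth keeping in mind when interpreting the losses on a small selection of categories.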

In [None]:
print('\n\nWe further split our training set into a new training set and an evaluation set\n\n')
docs_train_b, docs_eval, y_train_b, y_eval = train_test_split(X_train, y_train, test_size=0.2, random_state=None)

# Set up DistilBERT model
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=len(dataset.target_names))

train_encodings = tokenizer(docs_train_b, truncation=True, padding=True, max_length=512)
eval_encodings = tokenizer(docs_eval, truncation=True, padding=True, max_length=512)

train_dataset = Dataset.from_dict({'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask'], 'labels': y_train_b})
eval_dataset = Dataset.from_dict({'input_ids': eval_encodings['input_ids'], 'attention_mask': eval_encodings['attention_mask'], 'labels': y_eval})

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",
    report_to='none'
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

print('\n\nTraining...please be patient 😉\n')

print("""
Key points:
Convergence: Ideally, both training and validation loss should be decreasing in each epoch and converging to a lower value.
Overfitting: If the training loss decreases while the validation loss increases, it indicates that the model is overfitting to the training data.
Underfitting: If both training and validation loss remain high, it indicates the model is underfitting - it is not learning well from the data.
""")

# Train the model
trainer.train()

def distilbert_classify(texts, model, tokenizer):
    """
    Classify new text using our fine-tuned DistilBERT model.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # check for cuda availability and set device
    model = model.to(device) # move model to device
    model.eval() # switch to evaluation mode so dropout is disabled during prediction
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(device) # move inputs to device
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    return probabilities, probabilities.argmax(dim=-1).tolist()

print('\nNow see how the fine-tuned model performs on new data - our original test set')

# Evaluate the model
probabilities, hf_predictions = distilbert_classify(X_test, model, tokenizer)

print("\nDistilBERT Results:")
print(classification_report(y_test, hf_predictions, target_names=dataset.target_names))

# Display confusion matrix
cm = metrics.confusion_matrix(y_true=y_test, y_pred=hf_predictions, labels=range(len(dataset.target_names)))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dataset.target_names)
disp.plot(include_values=True, cmap='Blues', xticks_rotation='vertical')
plt.show()
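
The key points printed during training can be turned into a rough check on the per-epoch losses in the results table. This is a minimal sketch only. `diagnose` is a hypothetical helper, the loss values are invented for illustration, and the `high_loss` threshold is an arbitrary choice (for two balanced classes, always predicting 50/50 gives a cross-entropy loss of about 0.693).

```python
def diagnose(train_losses, eval_losses, high_loss=0.6):
    """Label a run from per-epoch losses using the rules of thumb above.

    high_loss is a hypothetical threshold chosen for illustration only.
    """
    train_falling = train_losses[-1] < train_losses[0]
    eval_rising = eval_losses[-1] > min(eval_losses)
    if train_falling and eval_rising:
        return "possible overfitting"
    if train_losses[-1] > high_loss and eval_losses[-1] > high_loss:
        return "possible underfitting"
    return "looks like convergence"

# Invented loss values for three epochs, not results from this notebook
print(diagnose([0.65, 0.45, 0.30], [0.60, 0.48, 0.40]))
print(diagnose([0.60, 0.40, 0.20], [0.50, 0.45, 0.55]))
print(diagnose([0.70, 0.69, 0.69], [0.70, 0.70, 0.69]))
```

With only three epochs the pattern can be noisy, so treat a label like this as a prompt to look at the numbers rather than a verdict.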

In [None]:
# Identify and rank incorrectly classified documents
incorrect_indices = [i for i, (true, pred) in enumerate(zip(y_test, hf_predictions)) if true != pred]
incorrect_docs = [
    (
        i,  # Original index
        X_test[i],
        dataset.target_names[y_test[i]],  # Map true label to text
        dataset.target_names[hf_predictions[i]],  # Map predicted label to text
        probabilities[i].max().item()
    )
    for i in incorrect_indices
]

# Sort by the highest confidence in the incorrect class
incorrect_docs_sorted = sorted(incorrect_docs, key=lambda x: x[4], reverse=True)
df_incorrect = pd.DataFrame(incorrect_docs_sorted, columns=['Test set index', 'Document', 'True Label', 'Predicted Label', 'Confidence'])

print("\nIncorrectly Classified Documents\n")
print("Confidence is the probability the model has assigned for the given document belonging to the predicted class.\n")
display(df_incorrect)
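
The confidence column comes from applying softmax to the model's raw output scores (logits) and taking the largest value. The sketch below reproduces that calculation in plain Python for a single document. The logits are hypothetical, and `softmax` here is a hand-written stand-in for `torch.nn.functional.softmax`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0]  # hypothetical scores for the two classes
probs = softmax(logits)

predicted = probs.index(max(probs))  # index of the most probable class
confidence = max(probs)              # the value reported as "Confidence"

print(predicted, round(confidence, 3))
```

A high value means the model strongly preferred one class over the other. It is not guaranteed to be a well-calibrated probability of being correct, which is why confidently wrong documents are worth reading closely.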

In [None]:
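# Row of the table above to print in full (0 is the most confident mistake).
# This is the row position in df_incorrect, not the 'Test set index' value.
index_to_display = 1
print(df_incorrect.loc[index_to_display, 'Document'])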