<a href="https://colab.research.google.com/github/Josh-Em/text-classification/blob/main/text_multiclassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Dependencies for Text Multi-classification with BERT**

This cell installs the Python libraries needed to fine-tune a BERT model for multi-class text classification in a Google Colab notebook: PyTorch (deep learning), Transformers (pre-trained models such as BERT), NumPy (array manipulation), pandas (data manipulation and analysis), scikit-learn (dataset loading, splitting, and metrics), Datasets (managing and loading datasets), and tqdm (progress bars).

In [1]:
!pip install torch transformers numpy pandas scikit-learn
!pip install datasets
!pip install tqdm

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

# 📚 **Dataset Preparation & Tokenization** 📚

In this section, we load the 20 Newsgroups dataset, shuffle it, split it into training and validation sets, and tokenize the text with BERT's tokenizer. Holding out 20% of the data for validation lets us measure performance on examples the model never saw during training. The tokenizer converts raw text into fixed-length sequences of token IDs plus attention masks, which is the input format the pre-trained BERT model expects.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch
from tqdm import tqdm
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
data = pd.DataFrame({'text_data': newsgroups.data, 'label': newsgroups.target})

# Visualize newsgroup data object
entry_index = 0
print(f"Text:\n{newsgroups['data'][entry_index]}\n\n")
print(f"Label index: {newsgroups['target'][entry_index]}")
print(f"Label name: {newsgroups['target_names'][newsgroups['target'][entry_index]]}")

# Shuffle the dataset (seeded so that reruns produce the same order)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Split the dataset into training and validation sets (80:20 ratio)
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Initialize BERT tokenizer using the pretrained 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_len = 128

def tokenize_data(data, tokenizer, max_seq_len):
    input_ids, attention_masks, labels = [], [], []

    # Iterate through each row in the dataset
    for index, row in tqdm(data.iterrows(), total=len(data)):
        # Tokenize the text using BERT's tokenizer with additional parameters
        encoded = tokenizer.encode_plus(
            row["text_data"],
            add_special_tokens=True,  # Add [CLS] and [SEP] tokens
            max_length=max_seq_len,  # Set max sequence length to 128
            padding="max_length",  # Pad shorter sequences to max_seq_len
            truncation=True,  # Truncate longer sequences to max_seq_len
            return_attention_mask=True,  # Return attention masks
        )

        # Append tokenized data to respective lists
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])
        labels.append(row["label"])

    # Convert lists to tensors
    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(labels)

# Tokenize both the training and validation data using the defined function
train_input_ids, train_attention_masks, train_labels = tokenize_data(train_data, tokenizer, max_seq_len)
val_input_ids, val_attention_masks, val_labels = tokenize_data(val_data, tokenizer, max_seq_len)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

100%|██████████| 15076/15076 [02:31<00:00, 99.69it/s] 
100%|██████████| 3770/3770 [00:21<00:00, 177.18it/s]
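To make the `padding="max_length"` and `truncation=True` options more concrete, here is a minimal pure-Python sketch of what they do to a list of token IDs. The helper `pad_and_truncate` and the toy IDs are hypothetical and only for illustration; the sketch assumes the `bert-base-uncased` special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) and is not the tokenizer's actual implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs assumed from bert-base-uncased

def pad_and_truncate(token_ids, max_len):
    """Hypothetical helper: wrap token_ids in [CLS]/[SEP], then truncate or pad to max_len."""
    # Reserve two positions for [CLS] and [SEP], truncating the content if needed
    content = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# A short sequence is padded, a long one is truncated (toy IDs, chosen arbitrarily)
short_ids, short_mask = pad_and_truncate([7592, 2088], max_len=6)
long_ids, long_mask = pad_and_truncate([1, 2, 3, 4, 5, 6, 7], max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every encoded example ends up with exactly `max_len` entries, which is what allows the lists to be stacked into a single tensor.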


# 🔀 **Batch Processing with DataLoader** 🔀

After tokenizing the data, this section focuses on creating DataLoader objects for the training and validation sets. DataLoader helps with efficiently processing the data in batches, enabling better resource management during model training and evaluation. This step makes your dataset ready for the subsequent model training and evaluation stages.

In [3]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 16

# Create a TensorDataset object for the training set
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
# Use RandomSampler to shuffle the samples in the dataset
train_sampler = RandomSampler(train_dataset)
# Create DataLoader for the training set using dataset, sampler, and batch size
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)

# Create a TensorDataset object for the validation set
val_dataset = TensorDataset(val_input_ids, val_attention_masks, val_labels)
# Use SequentialSampler to process the validation dataset sequentially
val_sampler = SequentialSampler(val_dataset)
# Create DataLoader for the validation set using dataset, sampler, and batch size
val_dataloader = DataLoader(val_dataset, sampler=val_sampler, batch_size=batch_size)
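As a quick sanity check on the batching, the sketch below computes how many batches each DataLoader yields, assuming the default behavior of keeping the final partial batch. The helper `batch_sizes` is hypothetical; the sample counts (15076 training and 3770 validation examples) are taken from the tokenization progress bars in this notebook.

```python
def batch_sizes(num_samples, batch_size):
    """Return the size of every batch when the last partial batch is kept."""
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

train_batches = batch_sizes(15076, 16)
val_batches = batch_sizes(3770, 16)

# Number of batches and size of the final batch for each split
print(len(train_batches), train_batches[-1])
print(len(val_batches), val_batches[-1])
```

The final batch in each split is smaller than 16, which matters whenever a metric is averaged per batch rather than per sample.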

# 🤖 **Loading the BERT Model for Classification Task** 🤖

Here, we configure and load a pre-trained BERT model for our classification task. This involves setting the size of the model's output layer to the number of labels and disabling outputs we do not need, such as attention weights and hidden states. Moving the model to the GPU (if one is available) speeds up training considerably.

In [4]:
from transformers import BertForSequenceClassification
# The AdamW class in transformers is deprecated, so use PyTorch's implementation
from torch.optim import AdamW

# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=20,  # Number of labels (20) corresponds to the 20 newsgroups dataset
    output_attentions=False,  # Do not output attention weights
    output_hidden_states=False,  # Do not output hidden states
)

# Move the model to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,
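The warning in the output above says that `classifier.weight` and `classifier.bias` are newly initialized. As a rough sketch, assuming the classification head is a single linear layer mapping the 768-dimensional hidden state to the 20 labels (the truncated printout does not show it), the number of parameters that start from random values can be worked out as follows. The variable names are illustrative.

```python
hidden_size = 768  # embedding width shown in the model printout
num_labels = 20    # one output per newsgroup

# A linear layer has one weight per (output, input) pair and one bias per output
classifier_weight_params = num_labels * hidden_size
classifier_bias_params = num_labels
new_params = classifier_weight_params + classifier_bias_params

print(f"classifier.weight: {classifier_weight_params} parameters")
print(f"classifier.bias:   {classifier_bias_params} parameters")
print(f"newly initialized: {new_params} parameters")
```

These parameters carry no pre-trained knowledge, which is why the warning recommends training the model before using it for predictions.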

# 🚂 **BERT Model Training** 🚂

This section fine-tunes the BERT model on the dataset and evaluates it after every epoch. We define one function that trains the model for an epoch and returns the average training loss, and another that computes accuracy on the validation set. An optimizer (AdamW) updates the weights, and a scheduler decays the learning rate linearly over the course of training. Together, these pieces show the overall process of training BERT for a classification task and assessing the model's performance.
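To illustrate the learning-rate schedule, here is a minimal pure-Python sketch of a linear schedule with warmup: the rate ramps up linearly during warmup and then decays linearly to zero. The function `linear_schedule_factor` is a hypothetical stand-in written for illustration, not the library's code; the step count assumes the 943 batches per epoch and 3 epochs of the run recorded in this notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 943 * 3  # batches per epoch * epochs

# Learning rate at the start, middle, and end of training (no warmup)
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With `num_warmup_steps=0`, training starts at the full learning rate and the rate reaches zero exactly at the last step.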

In [5]:
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, classification_report

num_epochs = 3
total_steps = len(train_dataloader) * num_epochs

# Create the optimizer and scheduler for fine-tuning the model
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

def train_epoch(model, dataloader, optimizer, scheduler, device):
    model.train()
    total_loss = 0

    # Use a progress bar during training
    progress_bar = tqdm(dataloader, desc="Training", position=0, leave=True)

    # Iterate through each batch in a training epoch
    for batch in progress_bar:
        input_ids, attention_masks, labels = [t.to(device) for t in batch]

        # Zero out gradients before each backward pass
        optimizer.zero_grad()

        # Forward pass to compute the outputs and loss
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)
        loss = outputs[0]
        total_loss += loss.item()

        # Perform a backward pass and update optimizer/scheduler steps
        loss.backward()
        optimizer.step()
        scheduler.step()

        progress_bar.set_description(f"Training - Loss: {loss.item():.4f}")

    return total_loss / len(dataloader)

def evaluate(model, dataloader, device):
    model.eval()
    total_eval_accuracy = 0

    # Use a progress bar during evaluation
    progress_bar = tqdm(dataloader, desc="Evaluation", position=0, leave=True)

    # Iterate through each batch in a validation epoch
    for batch in progress_bar:
        input_ids, attention_masks, labels = [t.to(device) for t in batch]

        # Disable gradient calculations during evaluation
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_masks)

        logits = outputs[0].detach().cpu().numpy()
        label_ids = labels.cpu().numpy()

        # Calculate accuracy for the current batch
        batch_accuracy = accuracy_score(label_ids, logits.argmax(axis=-1))
        total_eval_accuracy += batch_accuracy

        progress_bar.set_description(f"Evaluation - Batch Accuracy: {batch_accuracy:.4f}")

    return total_eval_accuracy / len(dataloader)

# Train for num_epochs epochs, evaluating on the validation set after each one
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_dataloader, optimizer, scheduler, device)
    val_accuracy = evaluate(model, val_dataloader, device)

    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    print(f"Loss: {train_loss:.4f} - Validation Accuracy: {val_accuracy:.4f}")

Training - Loss: 1.3537: 100%|██████████| 943/943 [05:42<00:00,  2.75it/s]
Evaluation - Batch Accuracy: 0.6000: 100%|██████████| 236/236 [00:29<00:00,  8.00it/s]



Epoch 1/3
Loss: 1.4618 - Validation Accuracy: 0.7110


Training - Loss: 2.1332: 100%|██████████| 943/943 [05:40<00:00,  2.77it/s]
Evaluation - Batch Accuracy: 0.5000: 100%|██████████| 236/236 [00:29<00:00,  8.00it/s]



Epoch 2/3
Loss: 0.7950 - Validation Accuracy: 0.7296


Training - Loss: 0.9589: 100%|██████████| 943/943 [05:39<00:00,  2.78it/s]
Evaluation - Batch Accuracy: 0.7000: 100%|██████████| 236/236 [00:29<00:00,  7.95it/s]


Epoch 3/3
Loss: 0.5738 - Validation Accuracy: 0.7387
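One detail of `evaluate` is worth noting: it averages per-batch accuracies, so a smaller final batch (10 samples in this run, since 3770 = 235 × 16 + 10) counts as much as a full batch of 16. The sketch below, using hypothetical helper names and toy numbers, contrasts that with pooling correct predictions over all samples.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, giving every batch equal weight."""
    return sum(correct / size for correct, size in batches) / len(batches)

def overall_accuracy(batches):
    """Pool correct counts across batches, giving every sample equal weight."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size

# Toy (correct, batch_size) pairs: two full batches of 16 and a final batch of 4
toy_batches = [(12, 16), (12, 16), (4, 4)]
print(f"mean of batch accuracies: {mean_of_batch_accuracies(toy_batches):.4f}")
print(f"overall accuracy:         {overall_accuracy(toy_batches):.4f}")
```

With hundreds of batches the two numbers are usually very close, but the pooled version is the one that matches `accuracy_score` computed over the whole validation set.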





# 📊 **Evaluating BERT Model Using Performance Metrics** 📊

Building on the previous model training and evaluation process, this section is dedicated to extracting the BERT model's predictions and comparing them with the true labels in the validation dataset. We define a function that gathers predictions and true labels, enabling the calculation of accuracy and a detailed classification report. This assessment step is crucial for identifying how well the model performs on unseen data.

In [6]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

def get_predictions(model, dataloader, device):
    model.eval()
    predictions, true_labels = [], []

    # Use tqdm for a progress bar
    for batch in tqdm(dataloader, desc="Evaluating"):
        input_ids, attention_masks, labels = [t.to(device) for t in batch]

        # Disable gradient calculations during evaluation
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_masks)

        logits = outputs[0].detach().cpu().numpy()
        label_ids = labels.cpu().numpy()

        # Add the batch's predicted class indices and true labels to the running lists
        predictions.extend(logits.argmax(axis=-1))
        true_labels.extend(label_ids)

    # Convert the lists to NumPy arrays
    return np.array(predictions), np.array(true_labels)

# Obtain the model's predictions and true labels for the validation dataset
predictions, true_labels = get_predictions(model, val_dataloader, device)

# Calculate the accuracy on the validation dataset
accuracy = accuracy_score(true_labels, predictions)

# Generate a classification report with more detailed performance metrics
report = classification_report(true_labels, predictions, digits=4)

# Print the accuracy and classification report
print(f"Validation Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)

Evaluating: 100%|██████████| 236/236 [00:29<00:00,  7.92it/s]

Validation Accuracy: 0.7387
Classification Report:
              precision    recall  f1-score   support

           0     0.5165    0.6104    0.5595       154
           1     0.6667    0.6927    0.6795       179
           2     0.7950    0.6465    0.7131       198
           3     0.6532    0.6840    0.6682       212
           4     0.7358    0.7396    0.7377       192
           5     0.8542    0.8241    0.8389       199
           6     0.8146    0.8743    0.8434       191
           7     0.5788    0.7545    0.6550       224
           8     0.7632    0.7360    0.7494       197
           9     0.9325    0.8492    0.8889       179
          10     0.9243    0.9000    0.9120       190
          11     0.8042    0.7755    0.7896       196
          12     0.6552    0.7037    0.6786       189
          13     0.8333    0.8757    0.8540       177
          14     0.8488    0.8208    0.8345       212
          15     0.6787    0.7972    0.7332       212
          16     0.6720    0.6
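To show where the columns of the classification report come from, here is a minimal pure-Python sketch that computes precision, recall, and F1 for a single class from paired label lists. The helper `precision_recall_f1` and the toy labels are hypothetical; in the notebook itself, `classification_report` does this work for all 20 classes.

```python
def precision_recall_f1(true_labels, predictions, label):
    """Compute precision, recall, and F1 for one class from paired label lists."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for a three-class problem
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 2, 2]
for label in sorted(set(toy_true)):
    p, r, f = precision_recall_f1(toy_true, toy_pred, label)
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

The support column in the report is simply the number of true examples of each class.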




# 💾 **Save Model** 💾

In [None]:
output_dir = "./model/"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.txt',
 './model/added_tokens.json')

# 🔍 **Classifying a News Article with the BERT Model** 🔍

In this section, we classify a sample news article with the fine-tuned BERT model. The code tokenizes a text string, feeds it to the model, and returns the index of the predicted label, which we then map to the newsgroup name. This shows how the model can be applied to new, unseen text. To try other inputs, replace the `sample_news_article` string.

In [7]:
def classify_news_article(model, tokenizer, device, text, max_len=128):
    # Tokenize the text using BERT tokenizer
    encoded = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="pt",  # Return tensors
    )

    # Move tensors to device (GPU or CPU)
    input_ids, attention_mask = encoded["input_ids"].to(device), encoded["attention_mask"].to(device)

    # Predict the label using the trained model (eval mode disables dropout)
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    # Find the index of the class with the highest score (outputs[0] holds the logits)
    predicted_label_index = outputs[0].argmax(-1).item()

    return predicted_label_index


sample_news_article = "The Orion spacecraft will be launched on a new mission to explore deep space."

# Get the predicted label index for the given sample article
predicted_label_index = classify_news_article(model, tokenizer, device, sample_news_article)

# Print the predicted label index and name
print(f"Predicted Label Index: {predicted_label_index}")
print(f"Predicted Class: {newsgroups.target_names[predicted_label_index]}")

Predicted Label Index: 14
Predicted Class: sci.space
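The function above returns only the top class. If you also want a rough sense of how confident the model is, the logits can be turned into probabilities with a softmax. The following is a minimal pure-Python sketch; the helpers `softmax` and `top_k` and the toy logits are hypothetical and do not come from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy logits for three newsgroups (made-up values for illustration)
toy_names = ["sci.med", "sci.space", "rec.autos"]
toy_logits = [0.5, 3.0, -1.0]
ranking = top_k(toy_logits, toy_names, k=2)
for name, prob in ranking:
    print(f"{name}: {prob:.4f}")
```

Softmax probabilities from a fine-tuned classifier are not guaranteed to be well calibrated, so treat them as a relative ranking rather than as exact likelihoods.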
