# BERT News Classification – Starter Notebook

This notebook demonstrates an **end-to-end BERT-based text classification pipeline**
using the **AG News dataset**.

### Key Highlights
- Pre-trained **BERT (bert-base-uncased)**
- Fine-tuned for **multi-class classification**
- Achieved **~94% accuracy**
- Deployed using **Gradio**


In [1]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score
import gradio as gr


## 📊 Dataset Overview

We use the **AG News Dataset**, which contains news articles categorized into:

- 🌍 World
- 🏅 Sports
- 💼 Business
- 🔬 Science & Technology

Each sample includes:
- News **title**
- News **description**
- Corresponding **label**


In [2]:
# CSV URLs
train_url = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
test_url  = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv"

In [3]:
# Load CSV
train_df = pd.read_csv(train_url, header=None)
test_df  = pd.read_csv(test_url, header=None)

In [4]:
train_df = train_df.iloc[:, :3]
test_df  = test_df.iloc[:, :3]

In [5]:
train_df.columns = ["label", "title", "description"]
test_df.columns  = ["label", "title", "description"]

In [6]:
train_df["text"] = train_df["title"] + " " + train_df["description"]
test_df["text"]  = test_df["title"] + " " + test_df["description"]

In [7]:
train_df["label"] = train_df["label"] - 1
test_df["label"]  = test_df["label"] - 1

In [8]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df["text"].tolist(),
    train_df["label"].tolist(),
    test_size=0.1,
    random_state=42
)

test_texts  = test_df["text"].tolist()
test_labels = test_df["label"].tolist()

print(f"Train: {len(train_texts)}, Validation: {len(val_texts)}, Test: {len(test_texts)}")

Train: 108000, Validation: 12000, Test: 7600


## 🧹 Text Preprocessing

- Concatenated **title + description** into a single `text` field
- Shifted labels from **1-4** to **0-3** so they start from **0**
- Split data into:
  - Training set
  - Validation set
  - Test set

The BERT tokenizer handles:
- Tokenization
- Padding
- Truncation
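To make padding and truncation concrete, here is a minimal sketch that works on made-up token ids. `pad_or_truncate` and `PAD_ID` are illustrative names, not part of the notebook or of the `transformers` API; the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores.

```python
PAD_ID = 0  # assumption: padding id 0, consistent with padding_idx=0 in the model printout

def pad_or_truncate(token_ids, max_len):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]          # truncation: drop anything past max_len
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_len - len(ids)       # number of pad positions to append
    return ids + [PAD_ID] * padding, mask + [0] * padding

# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.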


In [9]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "label": torch.tensor(label, dtype=torch.long)
        }

train_dataset = NewsDataset(train_texts, train_labels, tokenizer)
val_dataset   = NewsDataset(val_texts, val_labels, tokenizer)
test_dataset  = NewsDataset(test_texts, test_labels, tokenizer)



In [10]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=16)
test_loader  = DataLoader(test_dataset, batch_size=16)

## 🧠 Custom PyTorch Dataset

We define a custom `Dataset` class to:
- Tokenize text on the fly
- Return `input_ids`, `attention_mask`, and `label`
- Support batching via `DataLoader`

Tokenizing inside `__getitem__` avoids holding every encoded tensor in memory, at the cost of re-tokenizing each sample every epoch.
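As a rough illustration of what batching does with the per-sample dictionaries, the sketch below groups samples key by key. `toy_collate` is a hypothetical helper that uses plain lists and made-up token ids; PyTorch's `DataLoader` does the equivalent by stacking tensors.

```python
def toy_collate(samples):
    """Group a list of per-sample dicts into one dict of lists, one entry per key."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}

# Two fake samples shaped like the output of NewsDataset.__getitem__
samples = [
    {"input_ids": [101, 7592, 102, 0], "attention_mask": [1, 1, 1, 0], "label": 1},
    {"input_ids": [101, 2088, 2739, 102], "attention_mask": [1, 1, 1, 1], "label": 0},
]
batch = toy_collate(samples)
print(batch["label"])
print(len(batch["input_ids"]))
```

This is why the training loop can read `batch["input_ids"]`, `batch["attention_mask"]`, and `batch["label"]` directly.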

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4
)
model.to(device)

Using device: cuda



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## 🤖 Model Architecture

We use:

- **BERT Base (Uncased)**
- Hidden size: 768
- Transformer layers: 12
- Output layer customized for **4 classes**

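The classification head (`classifier.weight` and `classifier.bias`) is newly initialized and trained from scratch, while the pre-trained encoder weights are fine-tuned.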
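A quick back-of-the-envelope check shows how small the new head is. The sketch assumes the head is a single linear layer from the 768-dimensional hidden state to the 4 classes, which matches the two `classifier` tensors named in the initialization warning.

```python
hidden_size = 768   # hidden size from the model printout
num_labels = 4      # AG News classes

weight_params = hidden_size * num_labels   # entries in classifier.weight
bias_params = num_labels                   # entries in classifier.bias
head_params = weight_params + bias_params
print(head_params)
```

Only these few thousand parameters start from random values; everything else starts from the pre-trained checkpoint.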


In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)
epochs = 2

for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    
    print(f"Epoch {epoch+1} | Avg Loss: {total_loss/len(train_loader):.4f}")

## 🚀 Training Strategy

- Optimizer: **AdamW**
- Learning Rate: `5e-5`
- Batch Size: `16`
- Epochs: `2`
- Loss Function: **CrossEntropyLoss** (computed inside the model when `labels` is passed)
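For intuition about the loss values printed during training, here is a minimal pure-Python sketch of cross-entropy for a single sample. The `cross_entropy` helper and the example logits are illustrative; the notebook itself relies on the loss returned by the model.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    shift = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # no preference among 4 classes
confident = cross_entropy([0.0, 5.0, 0.0, 0.0], target=1)  # strong, correct preference
print(round(uniform, 4), round(confident, 4))
```

With four classes, a model that has no preference scores ln(4), so an average loss well below that indicates the classifier has learned something.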

Training performed on **GPU (CUDA)** when available.
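These settings translate into a fixed number of optimizer steps. The arithmetic below uses the training-set size printed by the split cell and assumes the final partial batch, if any, is kept (the `DataLoader` default).

```python
import math

train_samples = 108_000  # "Train: 108000" from the split cell
batch_size = 16
epochs = 2

steps_per_epoch = math.ceil(train_samples / batch_size)  # batches per pass over the data
total_steps = steps_per_epoch * epochs                   # optimizer updates overall
print(steps_per_epoch, total_steps)
```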

In [None]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

acc = accuracy_score(all_labels, all_preds)
f1  = f1_score(all_labels, all_preds, average="weighted")

print(f"Test Accuracy: {acc:.4f}")
print(f"Test F1-score: {f1:.4f}")


## 📈 Model Evaluation

We evaluate the model using:

- **Accuracy**
- **Weighted F1-score**

Accuracy is the overall fraction of correct predictions; weighted F1 averages the per-class F1 scores, weighting each class by its number of test samples.
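To show what the weighted average does, the sketch below recomputes both metrics by hand on a tiny made-up example. `weighted_f1` is an illustrative helper, not the scikit-learn implementation, and the labels are invented for the demonstration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true  # weight the class F1 by its support
    return total / len(y_true)

# Toy labels: one "World" (0) sample is misclassified as "Sports" (1)
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(round(accuracy, 4), round(score, 4))
```

A single error lowers the F1 of both the true class (recall drops) and the predicted class (precision drops), which is why F1 and accuracy can differ.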

In [None]:
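# Class names indexed by label id; named `label_names` so it does not
# clash with the `labels` tensors used in the training and evaluation loops
label_names = ["World", "Sports", "Business", "Sci/Tech"]

def predict_headline(text):
    # Tokenize with the same settings used for training
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=64,
        return_tensors="pt"
    ).to(device)
    model.eval()
    with torch.no_grad():
        logits = model(**encoding).logits
    # Pick the class with the highest logit
    pred = torch.argmax(logits, dim=1).item()
    return label_names[pred]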

interface = gr.Interface(
    fn=predict_headline,
    inputs=gr.Textbox(lines=2, placeholder="Enter a news headline..."),
    outputs="text",
    title="News Topic Classifier",
    description="Enter a news headline and BERT predicts the topic!"
)

interface.launch()


## 💾 Saving Model Artifacts

After training and evaluation, we save the **fine-tuned BERT model** and its
associated components for future use.

### What is saved?
- ✅ Trained BERT model weights
- ✅ Tokenizer configuration and vocabulary
- ✅ Label mapping for class interpretation

### Why is this important?
- Enables **reuse without retraining**
- Required for **deployment** (Streamlit / Gradio / FastAPI)
- Ensures **consistent predictions** across environments

All artifacts are stored locally and can be easily loaded for inference or production deployment.


In [None]:
model.save_pretrained("bert_news_model")
tokenizer.save_pretrained("bert_news_model")

print("Model and tokenizer saved successfully!")

In [None]:
import json

# Map class ids to human-readable names
label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}

with open("label_mapping.json", "w") as f:
    json.dump(label_mapping, f)
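One detail to keep in mind when reloading the mapping: JSON object keys are always strings, so the integer ids come back as `"0"`, `"1"`, and so on. The sketch below round-trips the mapping through a temporary file (the temporary path is only for illustration) and converts the keys back.

```python
import json
import os
import tempfile

label_mapping = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Write to a throwaway location so the example does not touch real artifacts
path = os.path.join(tempfile.mkdtemp(), "label_mapping.json")
with open(path, "w") as f:
    json.dump(label_mapping, f)

with open(path) as f:
    raw = json.load(f)  # keys are now strings

# Convert keys back to int so they can be indexed with a predicted class id
restored = {int(key): name for key, name in raw.items()}
print(restored[2])
```

Without the conversion, looking up a predicted class id such as `2` in the loaded dictionary would raise a `KeyError`.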