# **TASK 1 — News Classification with BERT**

**1. Problem Statement & Objective**

**Problem Statement:**
News organizations publish thousands of articles daily. Manual categorization is inefficient and error-prone.

**Objective:**

The objective is to fine-tune a pre-trained Transformer model that classifies news articles into categories automatically and with high accuracy.

**2. Dataset Loading & Preprocessing**

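We used the AG News dataset (`ag_news` on the Hugging Face Hub), which pairs news article text with one of four category labels: World, Sports, Business, and Sci/Tech. To keep CPU training time manageable, we trained on a random subset of 20,000 of the 120,000 training examples and evaluated on 2,000 of the 7,600 test examples.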

**Steps:**

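- Loaded the dataset with Hugging Face Datasets
- Shuffled each split with a fixed seed and selected the subset
- Tokenized the text with the DistilBERT tokenizer
- Truncated inputs to 128 tokens and padded each batch dynamically to its longest sequence
- Used the integer class labels (0 to 3) that the dataset already provides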

Purpose: Convert raw text into numerical tensors understandable by the Transformer.
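The truncation and padding steps can be sketched in plain Python. This is only a simplified illustration, not the tokenizer's or the data collator's actual implementation: the token ids, the `PAD_ID` value, and the helper names `truncate` and `pad_batch` are assumptions made for the example.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding id for this sketch


def truncate(ids: List[int], max_length: int) -> List[int]:
    """Keep at most max_length tokens.

    A real tokenizer also preserves the trailing special token;
    this sketch simply cuts the list.
    """
    return ids[:max_length]


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch (dynamic padding)."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three headlines of different lengths.
examples = [[101, 7592, 2088, 102], [101, 2739, 102], [101, 1, 2, 3, 4, 5, 102]]
batch = pad_batch([truncate(ids, max_length=5) for ids in examples])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short headlines from being filled with padding tokens, which matters when training on CPU.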


**3. Model Development & Training**

Model used: distilbert-base-uncased (a smaller, distilled version of BERT)

**Architecture:**

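- Pre-trained DistilBERT encoder
- Classification head: a dense layer followed by a linear output layer that produces one logit per class (softmax is applied inside the loss)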

**Training process:**

- Loss: Cross-Entropy
- Optimizer: AdamW (the Trainer default), learning rate 3e-5
- Epochs: 2
- Batch size: 16, trained with the Trainer API

Transfer learning allows faster convergence with limited data.
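As a rough illustration of the cross-entropy loss listed above, the sketch below turns the four class logits into probabilities with softmax and computes the loss for one example. The logit values and the function names are assumptions for the example; during training the library computes this internally.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits for the classes World, Sports, Business, Sci/Tech.
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(logits, label=0)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks as the probability of the true class approaches 1, and a model that is undecided between all four classes has a loss of ln 4.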

**4. Evaluation with Metrics**

**Metrics used:**

- Accuracy
- F1-Score (macro-averaged over the four classes)

**Final Results:**

| Metric | Value |
| --- | --- |
| Accuracy | 92.05% |
| F1-Score | 92.14% |
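To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and macro-averaged F1 on toy labels. The labels, predictions, and function names are made up for illustration; the notebook itself relies on `evaluate` and scikit-learn, and this sketch assumes every class index from 0 to `num_classes - 1` is scored.

```python
from typing import List


def accuracy(labels: List[int], preds: List[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def macro_f1(labels: List[int], preds: List[int], num_classes: int) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for l, p in zip(labels, preds) if l == c and p == c)
        fp = sum(1 for l, p in zip(labels, preds) if l != c and p == c)
        fn = sum(1 for l, p in zip(labels, preds) if l == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy example: one Business (2) article is misclassified as Sci/Tech (3).
labels = [0, 1, 2, 3, 0, 1, 2, 3]
preds = [0, 1, 2, 3, 0, 1, 3, 3]
acc = accuracy(labels, preds)
f1 = macro_f1(labels, preds, num_classes=4)
print(acc, round(f1, 4))
```

Macro averaging gives each class equal weight, so a class the model handles poorly lowers the score even when overall accuracy stays high.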

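**5. Visualization**

The trained model is served through a Gradio web interface, where a user enters a headline and sees the predicted topic with its confidence score.

**6. Final Summary / Insights**

The fine-tuned DistilBERT model reached 92.05% accuracy on the 2,000-example test subset after two epochs on CPU, which suggests that pre-trained language models can be adapted efficiently to domain-specific classification tasks.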

In [None]:
!pip install -q -U transformers datasets evaluate scikit-learn accelerate



In [None]:
!pip uninstall -y torch torchvision torchaudio
!pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cpu

# ensure latest Transformers & Datasets
!pip install -U transformers datasets gradio scikit-learn



In [None]:
!pip install transformers datasets torch scikit-learn gradio



In [None]:
# Task 1: fine-tune DistilBERT on the AG News dataset

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import numpy as np
from sklearn.metrics import f1_score
import evaluate

print("Device:", "cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
dataset = load_dataset("ag_news")

# OPTIONAL: SUBSET FOR SPEED
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(20000))   # 20k instead of 120k
dataset["test"]  = dataset["test"].shuffle(seed=42).select(range(2000))     # 2k instead of 7.6k

# Tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding=False,
        max_length=128   # shorter for CPU speed
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

data_collator = DataCollatorWithPadding(tokenizer)

# Load Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4
)

# Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_score(labels, preds, average="macro")
    }

# Training Args
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=50,
    learning_rate=3e-5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    report_to="none"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument (transformers >= 4.46)
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print("\nFINAL METRICS:")
print(results)

# Save Model
trainer.save_model("./distilbert_agnews_model")
tokenizer.save_pretrained("./distilbert_agnews_model")

print("\nModel saved to ./distilbert_agnews_model")





Device: cpu



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2337,0.266285,0.915,0.915836
2,0.1796,0.272406,0.9205,0.921442





FINAL METRICS:
{'eval_loss': 0.27240559458732605, 'eval_accuracy': 0.9205, 'eval_f1': 0.9214415118922797, 'eval_runtime': 226.6821, 'eval_samples_per_second': 8.823, 'eval_steps_per_second': 0.551, 'epoch': 2.0}

Model saved to ./distilbert_agnews_model


In [None]:
!pip install streamlit transformers torch





In [None]:
!pip install gradio transformers torch




In [None]:
# Serve the fine-tuned model through a Gradio web interface

import gradio as gr
from transformers import pipeline

clf = pipeline("text-classification", model="./distilbert_agnews_model")

label_map = {
    "LABEL_0": "World",
    "LABEL_1": "Sports",
    "LABEL_2": "Business",
    "LABEL_3": "Sci/Tech"
}

def predict_topic(text):
    if not text.strip():
        return "Please enter a headline"

    result = clf(text)[0]
    topic = label_map[result["label"]]
    score = result["score"]

    return f"{topic}  (confidence: {score:.3f})"

app = gr.Interface(
    fn=predict_topic,
    inputs=gr.Textbox(label="Enter News Headline"),
    outputs=gr.Textbox(label="Predicted Topic"),
    title="News Topic Classifier - DistilBERT"
)

app.launch(share=True)


Device set to use cpu


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://76f217f3c9c557c42e.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


