# Task 1 — News Topic Classifier Using BERT (AG News)

**Objective:** Fine-tune `bert-base-uncased` to classify news headlines into the four AG News categories (World, Sports, Business, Sci/Tech).

This notebook includes: dataset loading, preprocessing, model fine-tuning (Trainer), evaluation (accuracy & macro F1), and a short inference demo.


## 1) Install & Imports (run in Colab / local environment)


In [None]:
# Uncomment to install the dependencies (needed on a fresh Colab or local environment)
# !pip install -q transformers datasets evaluate accelerate sentencepiece
import os
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score
import evaluate
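
Library APIs in this stack change between releases, so it can help to record the installed versions before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are assumptions for illustration, not part of the original notebook.

```python
from importlib import metadata

# Packages named in the install line above, plus scikit-learn for the metrics
PACKAGES = ["transformers", "datasets", "evaluate", "accelerate", "scikit-learn"]

def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```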


## 2) Load & Inspect Dataset


In [None]:
dataset = load_dataset('ag_news')
print(dataset)
dataset['train'][0]
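
Each record holds a `text` string and an integer `label`. As a rough illustration of how the integer ids map to the four category names, here is a small sketch. The `describe` helper and the sample record are hypothetical; on the loaded dataset the same names should be available from `dataset['train'].features['label'].names`.

```python
# AG News class ids and their category names
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def describe(example):
    """Format one record (a dict with 'text' and 'label') for quick inspection."""
    snippet = example["text"][:60]
    return f"[{ID2LABEL[example['label']]}] {snippet}"

# Hypothetical record, shaped like dataset['train'][0]
sample = {"text": "Stocks rally as oil prices ease", "label": 2}
print(describe(sample))
```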


## 3) Tokenization & Preprocessing


In [None]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_fn(batch):
    # Truncate to 128 tokens and pad every example to that length so all batches share one shape
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)
tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column('label','labels')
tokenized.set_format(type='torch', columns=['input_ids','attention_mask','labels'])
tokenized
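
To make the effect of `truncation=True` and `padding='max_length'` concrete, the sketch below mimics both on plain lists of token ids. It is a toy stand-in, not the tokenizer's implementation: the function name `pad_or_truncate` and the ids are illustrative, and unlike the real tokenizer (which keeps the closing `[SEP]` token when truncating) this version simply cuts the tail.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it and build the attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# Illustrative ids only; a short sequence is padded, a long one is truncated
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Both outputs have exactly `max_length` entries, which is why every tensor in the tokenized dataset ends up with the same shape.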


## 4) Fine-tune with Trainer (small demo run)


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1_macro': f1_score(labels, preds, average='macro')
    }
training_args = TrainingArguments(
    output_dir='./outputs/bert_agnews',
    evaluation_strategy='epoch',  # named eval_strategy in recent transformers releases
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    save_total_limit=1,
    learning_rate=2e-5,
    logging_steps=100
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    tokenizer=tokenizer,  # recent transformers releases prefer processing_class=tokenizer
    compute_metrics=compute_metrics
)
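# NOTE: Training is disabled by default. Uncomment the lines below to train and evaluate.
# With batch size 8, one epoch over the 120,000 training examples is 15,000 steps,
# so a GPU is strongly recommended.
# trainer.train()
# metrics = trainer.evaluate()
# print(metrics)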
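
The two metrics reported by `compute_metrics` can be checked by hand on a toy batch. The sketch below recomputes accuracy and macro F1 in plain Python rather than calling scikit-learn; the function name and the toy logits are assumptions for illustration. Note that it averages over all `num_classes`, whereas scikit-learn by default averages over the labels that actually appear.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_macro_f1(logits, labels, num_classes):
    preds = [argmax(row) for row in logits]
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    f1_scores = []
    for c in range(num_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro F1 is the unweighted mean of the per-class F1 scores
    return {"accuracy": accuracy, "f1_macro": sum(f1_scores) / num_classes}

# Toy batch: four examples, one per class; the last one is misclassified as class 2
toy_logits = [
    [2.0, 0.1, 0.0, 0.0],
    [0.0, 3.0, 0.1, 0.0],
    [0.0, 0.2, 1.5, 0.0],
    [0.1, 0.0, 2.0, 0.3],
]
toy_labels = [0, 1, 2, 3]
result = accuracy_and_macro_f1(toy_logits, toy_labels, num_classes=4)
print(result)
```

Here accuracy is 0.75 while macro F1 is lower (about 0.667), because the completely missed class contributes an F1 of 0 to the average.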


## 5) Inference demo


In [None]:
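from transformers import TextClassificationPipeline
# NOTE: unless trainer.train() was run above, the classification head is randomly
# initialized and the scores printed here are not meaningful.
# top_k=None returns scores for every class (replaces the deprecated return_all_scores=True).
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=None)
example = 'NASA announces new telescope for deep space exploration'
print(pipe(example))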
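
Since the model was created without an `id2label` mapping, the pipeline reports generic names such as `LABEL_0`. The sketch below shows, in plain Python, how raw logits can be turned into probabilities and a readable category name. The logits are hypothetical values, not output from a real run, and the helper names are illustrative.

```python
import math

ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits):
    """Return the most likely category name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits for one headline (not produced by a trained model)
label, prob = top_label([0.2, -1.0, 0.1, 2.5])
print(label, round(prob, 3))
```

Passing the same mapping as `id2label` (and its inverse as `label2id`) to `from_pretrained` is one way to have the pipeline print category names directly.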


## 6) Summary & Next Steps

- Uncomment the training lines and train for more epochs on a GPU.
- Save the fine-tuned model directory and load it in a Streamlit app for the demo.
- Report accuracy and macro F1 on the test split after full training.
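
When planning longer runs, it helps to estimate how many optimizer steps a given configuration implies. The following back-of-the-envelope sketch assumes a single device and no gradient accumulation; the function name `optimizer_steps` is hypothetical, and 120,000 is the size of the AG News training split.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

one_epoch = optimizer_steps(120_000, batch_size=8, epochs=1)
three_epochs = optimizer_steps(120_000, batch_size=8, epochs=3)
print(one_epoch, three_epochs)  # 15000 45000
```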
