PHASE 1: Repository & Environment Setup

Step 1: Folder structure

ai-ml-internship-projects/
└── task-1-bert-news-classifier/
    ├── bert_news_classifier.ipynb
    ├── app.py
    ├── requirements.txt
    └── README.md
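The layout above can be scaffolded with a few lines of standard-library Python. This is only a sketch: the helper name create_skeleton is hypothetical, the files are created as empty placeholders, and the demo writes into a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path

FILE_NAMES = ("bert_news_classifier.ipynb", "app.py", "requirements.txt", "README.md")


def create_skeleton(root: Path) -> list[Path]:
    """Create the task folder with empty placeholder files and return their paths."""
    task_dir = root / "ai-ml-internship-projects" / "task-1-bert-news-classifier"
    task_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in FILE_NAMES:
        path = task_dir / name
        path.touch(exist_ok=True)  # empty placeholder, to be filled in later
        created.append(path)
    return created


# Demo in a temporary directory so nothing is written into the working tree.
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(Path(tmp))
    created_names = sorted(p.name for p in created)
    all_exist = all(p.exists() for p in created)

print(created_names)
```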


PHASE 2: Dataset Loading (AG News)
Task 1: News Topic Classifier (BERT)

Notebook Section 1: Problem Statement

In [3]:
"""
Task 1: News Topic Classification using BERT

Objective:
Fine-tune a BERT-based transformer model to classify news headlines
into one of four categories using the AG News dataset.
"""



Notebook Section 2: Imports

In [4]:
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)

from sklearn.metrics import accuracy_score, f1_score


Notebook Section 3: Dataset Loading & Inspection

Step 1: Load the AG News Dataset

In [5]:
dataset = load_dataset("ag_news")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

AG News has:

4 classes

Text field: text

Label field: label

Step 2: Inspect a Sample

In [6]:
dataset["train"][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

Step 3: Define Label Mapping

In [7]:
label_names = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}

Sanity check:

In [8]:
label_names[dataset["train"][0]["label"]]

'Business'
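The mapping is often needed in the other direction as well, for example when turning a class name back into its integer id. A minimal sketch, assuming the same four labels; the names label2id and decode_labels are illustrative and not part of the notebook:

```python
label_names = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Inverse mapping: class name -> integer id.
label2id = {name: idx for idx, name in label_names.items()}

# Class names must be unique, otherwise the inverse mapping would lose entries.
assert len(label2id) == len(label_names)


def decode_labels(ids):
    """Translate a sequence of integer labels into class names."""
    return [label_names[i] for i in ids]


decoded = decode_labels([2, 0, 3])
print(decoded)
```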

Step 4: Dataset Balance Check

In [9]:
from collections import Counter

Counter(dataset["train"]["label"])

Counter({2: 30000, 3: 30000, 1: 30000, 0: 30000})
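The counts above can be turned into proportions to make the balance explicit. The sketch below hard-codes the counts printed by the cell rather than reloading the dataset; proportions and imbalance_ratio are illustrative names.

```python
from collections import Counter

# Counts copied from the cell output above.
counts = Counter({2: 30000, 3: 30000, 1: 30000, 0: 30000})

total = sum(counts.values())
proportions = {label: n / total for label, n in sorted(counts.items())}

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total, proportions, imbalance_ratio)
```

With every class at 25% of the training split, accuracy and macro-averaged F1 can be expected to stay close to each other, so neither metric is dominated by a majority class.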

PHASE 3: Tokenization & Preprocessing (BERT-Style)

Step 5: Load Tokenizer

In [10]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Step 6: Tokenization Function

In [11]:
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
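To make the effect of truncation=True and padding="max_length" concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. It is not the tokenizer's actual implementation: it skips WordPiece splitting and sentence pairs, and the function name pad_or_truncate and the example ids are made up. The special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]) are those of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to exactly max_length."""
    # Reserve two positions for [CLS] and [SEP] before truncating the body.
    body = list(token_ids)[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token

    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding
    }


short = pad_or_truncate([11, 12, 13, 14, 15], max_length=8)  # gets padded
long = pad_or_truncate([11, 12, 13, 14, 15], max_length=5)   # gets truncated
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the dataset be batched into fixed-shape tensors. The cost is that short headlines carry many padding positions, all masked out by attention_mask.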

Step 7: Apply Tokenization

In [12]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Step 8: Prepare for PyTorch

In [13]:
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")

In [14]:
tokenized_dataset["train"][0]

{'labels': tensor(2),
 'input_ids': tensor([  101,  2813,  2358,  1012,  6468, 15020,  2067,  2046,  1996,  2304,
          1006, 26665,  1007, 26665,  1011,  2460,  1011, 19041,  1010,  2813,
          2395,  1005,  1055,  1040, 11101,  2989,  1032,  2316,  1997, 11087,
          1011, 22330,  8713,  2015,  1010,  2024,  3773,  2665,  2153,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,  

PHASE 4: Model Fine-Tuning (CORE ML)

Step 1: Load Pretrained BERT Model

In [15]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step 2: Define Evaluation Metrics (MANDATORY)

In [16]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="macro")

    return {
        "accuracy": accuracy,
        "f1": f1
    }
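For reference, average="macro" computes an F1 score per class and then takes the unweighted mean. The sketch below reproduces that calculation in plain Python on a tiny made-up example; the labels and predictions are invented for illustration, and the function macro_f1 is not part of the notebook.

```python
def macro_f1(labels, predictions, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Made-up example with four classes, two samples each.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [0, 1, 1, 1, 2, 2, 3, 0]

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
f1 = macro_f1(labels, predictions, num_classes=4)
print(accuracy, round(f1, 4))  # 0.75 0.7417
```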

Step 3: Training Arguments

In [17]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"   # avoids extra integrations
)

Step 4: Initialize Trainer

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



Step 5: Start Training

In [19]:
trainer.train()



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

“Why didn’t I train on the full data?”

“The first run was interrupted manually (the KeyboardInterrupt above) because of CPU-only constraints. I reduced the dataset size for efficiency. The objective was to demonstrate correct fine-tuning, evaluation, and deployment rather than exhaustive training.”

Step 2: Create a CPU-Optimized Dataset

In [20]:
# Reduce dataset size for CPU-friendly training
small_train = tokenized_dataset["train"].shuffle(seed=42).select(range(10000))
small_test = tokenized_dataset["test"].shuffle(seed=42).select(range(2000))

len(small_train), len(small_test)

(10000, 2000)
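A random 10,000-example subset of a balanced dataset should itself be close to balanced, which is worth checking before training on it. The sketch below simulates this with synthetic labels and Python's random module. It does not reproduce the exact shuffle that the datasets library performs with seed=42, so the counts it prints are not those of small_train.

```python
import random
from collections import Counter

# Synthetic stand-in for the 120,000 balanced AG News training labels.
full_labels = [i % 4 for i in range(120_000)]

# Draw 10,000 distinct indices at random (assumption: plain uniform sampling).
rng = random.Random(42)
subset_indices = rng.sample(range(len(full_labels)), 10_000)

subset_counts = Counter(full_labels[i] for i in subset_indices)
print(sorted(subset_counts.items()))
```

On the real subset the equivalent check would be a Counter over the labels column of small_train, in the same way as the balance check on the full training split.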

Step 3: Adjust TrainingArguments (CPU Friendly)

In [21]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,          # key change: one epoch instead of two
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"
)


Step 4: Re-Initialize Trainer

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



Step 5: Start Training

In [23]:
trainer.train()

Epoch  Training Loss  Validation Loss  Accuracy  F1
1      0.2407         0.28363          0.9225    0.923431


TrainOutput(global_step=1250, training_loss=0.2910706115722656, metrics={'train_runtime': 10257.8547, 'train_samples_per_second': 0.975, 'train_steps_per_second': 0.122, 'total_flos': 657789450240000.0, 'train_loss': 0.2910706115722656, 'epoch': 1.0})
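The numbers in TrainOutput follow from the subset size, batch size, and runtime, and they also give a rough idea of what the full run would have cost. The sketch below redoes that arithmetic. The extrapolation to 120,000 examples and 2 epochs assumes the same per-sample throughput as the reduced run, which is only an approximation (the original configuration used a batch size of 16, and evaluation time is not included).

```python
import math

# Values taken from the reduced run above.
train_samples, batch_size, epochs = 10_000, 8, 1
train_runtime_s = 10257.8547

steps = math.ceil(train_samples / batch_size) * epochs
samples_per_s = train_samples * epochs / train_runtime_s
steps_per_s = steps / train_runtime_s

# Rough extrapolation to the original configuration (assumes equal throughput).
full_samples, full_epochs = 120_000, 2
full_hours = full_samples * full_epochs / samples_per_s / 3600

print(steps)                    # 1250
print(round(samples_per_s, 3))  # 0.975
print(round(steps_per_s, 3))    # 0.122
print(round(full_hours, 1))     # 68.4
```

At roughly 68 hours of CPU time for the full configuration, against just under 3 hours for the subset, reducing the dataset was the practical choice.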

In [24]:
trainer.evaluate()



{'eval_loss': 0.28363004326820374,
 'eval_accuracy': 0.9225,
 'eval_f1': 0.9234312467558867,
 'eval_runtime': 611.9516,
 'eval_samples_per_second': 3.268,
 'eval_steps_per_second': 0.409,
 'epoch': 1.0}

In [25]:
save_dir = "./bert-agnews-model"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
print("Saved to:", save_dir)

Saved to: ./bert-agnews-model


In [27]:
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("./bert-agnews-model")
tokenizer = BertTokenizer.from_pretrained("./bert-agnews-model")
print(tokenizer)


BertTokenizer(name_or_path='./bert-agnews-model', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


In [28]:
tokenizer("Apple releases new AI-powered iPhone")

{'input_ids': [101, 6207, 7085, 2047, 9932, 1011, 6113, 18059, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [29]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("./bert-agnews-model")
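Once the model is reloaded, its output for a headline is a vector of four logits, one per class. Turning that into a readable prediction is a softmax followed by an argmax and a lookup in the label mapping. The sketch below shows only that post-processing step in plain Python; the logits are hypothetical values, not real model output, and logits_to_prediction is an illustrative helper name.

```python
import math

label_names = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits):
    """Return (class name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_names[best], probs[best]


# Hypothetical logits for one headline (not produced by the trained model).
example_logits = [-1.2, -0.8, 0.3, 3.1]
label, confidence = logits_to_prediction(example_logits)
print(label, round(confidence, 3))
```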