# 🧠 Project: BERT Text Classification on AG News Dataset

This project fine-tunes a pre-trained BERT model to classify news articles from the AG News dataset into one of four categories: World, Sports, Business, and Sci/Tech. We'll load the dataset, tokenize the text with the BERT tokenizer, fine-tune the model, evaluate its accuracy, and test it on sample inputs.


# Step 1: Install Required Libraries

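We install the Hugging Face 'transformers' and 'datasets' libraries, which provide pretrained BERT models and access to standard NLP datasets. Later cells also import scikit-learn, which this command does not install; it is assumed to be available in the environment already.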


In [1]:
!pip install transformers datasets -q


# Step 2: Import Required Modules

We'll import necessary components including PyTorch, Hugging Face's BERT models, tokenizer, Trainer API, and accuracy metrics.


In [4]:
import torch
from datasets import load_dataset
from transformers import (
    BertTokenizerFast, BertForSequenceClassification, Trainer, TrainingArguments,
    DataCollatorWithPadding, pipeline
)
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings


# Step 3: Load AG News Dataset

We'll load the AG News dataset, a benchmark for text classification with 120,000 training and 7,600 test articles, each labeled with one of 4 classes.


In [5]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*HF_TOKEN.*")
    dataset = load_dataset("ag_news")

# Display sample from training set
dataset["train"].to_pandas().head()


| | text | label |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |
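The integer labels in the preview map onto the four category names. A minimal sketch of that mapping, assuming the class order World, Sports, Business, Sci/Tech (the order listed in the introduction; `LABEL_NAMES` and `label_name` are illustrative names, not part of the notebook):

```python
# Class names in label-id order. This order is an assumption here; in the
# notebook it can be read from dataset["train"].features["label"].names.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def label_name(label_id: int) -> str:
    """Translate an integer class id into its category name."""
    return LABEL_NAMES[label_id]

# The five preview rows all carry label 2.
print(label_name(2))  # Business
```

Under that assumption, label 2 corresponds to Business, which is consistent with the finance and oil headlines shown in the preview.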


# Step 4: Tokenize the Dataset

We'll tokenize the text with 'BertTokenizerFast', truncating any text that exceeds the model's maximum input length (512 tokens for 'bert-base-uncased'). We'll also rename the 'label' column to 'labels', the argument name the model expects.


In [6]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")


Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]
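To make the effect of truncation concrete, here is a toy sketch. It assumes a whitespace split instead of BERT's WordPiece tokenization, and `encode_with_truncation` is a hypothetical helper, not a library function. It only illustrates that the special tokens count toward the length limit and that excess tokens are dropped from the end.

```python
def encode_with_truncation(words, max_length):
    """Toy stand-in for truncation of a single text.

    `words` is a pre-split list of tokens; real BERT tokenization uses
    WordPiece, which this sketch does not attempt to reproduce.
    """
    budget = max_length - 2      # reserve room for [CLS] and [SEP]
    kept = words[:budget]        # drop tokens that do not fit
    return ["[CLS]"] + kept + ["[SEP]"]

tokens = "oil prices soar to all-time record".split()
short = encode_with_truncation(tokens, max_length=6)
print(short)
# ['[CLS]', 'oil', 'prices', 'soar', 'to', '[SEP]']
```

With a limit of 6, only four of the six words survive because two positions go to the special tokens.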

# Step 5: Load Pre-trained BERT Model

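We'll use the BERT base model with a classification head for 4 labels. 'DataCollatorWithPadding' pads each batch to the length of its longest sequence (dynamic padding), so all inputs in a batch have the same length without padding every example to a fixed size.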


In [7]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=4
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
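Dynamic padding can be sketched in plain Python. This is an illustration of the idea rather than the collator's actual implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is assumed to match the '[PAD]' token of 'bert-base-uncased'.

```python
PAD_ID = 0  # assumed pad token id for bert-base-uncased

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up example ids, framed by 101 ([CLS]) and 102 ([SEP])
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Because the target length is chosen per batch, a batch of short headlines is padded far less than it would be with a fixed length of 512.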


# Step 6: Define Evaluation Metric

We'll define a function to compute accuracy by comparing predicted labels with true labels during model evaluation.


In [8]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}
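As a worked example of what this metric computes, the sketch below applies the same argmax-then-compare logic to a few made-up logit rows in plain Python. The values and the helper names `argmax` and `accuracy_from_logits` are illustrative only.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [
    [0.1, 2.3, -0.5, 0.0],   # highest score at index 1
    [1.9, 0.2, 0.4, -1.0],   # highest score at index 0
    [-0.3, 0.1, 0.2, 3.1],   # highest score at index 3
    [0.0, 0.5, 2.2, 0.1],    # highest score at index 2
]
labels = [1, 0, 2, 2]
acc = accuracy_from_logits(logits, labels)
print(acc)  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.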


# Step 7: Define Training Arguments

We'll set parameters for training such as batch size, number of epochs, logging steps, and output directory using 'TrainingArguments'.


In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    eval_strategy="epoch",  # evaluate after each epoch; older transformers releases call this evaluation_strategy
    save_strategy="no",
    logging_steps=10,
    report_to="none"
)

# Step 8: Initialize Trainer and Train

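We'll create a 'Trainer' object with the model, tokenized dataset, training arguments, data collator, and metric function. To keep the run short, we train on the first 1,000 training examples and evaluate on the first 200 test examples rather than the full dataset.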


In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].select(range(1000)),
    eval_dataset=tokenized_dataset["test"].select(range(200)),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4706 | 0.417388 | 0.885 |


TrainOutput(global_step=63, training_loss=0.7397070479771447, metrics={'train_runtime': 1419.8769, 'train_samples_per_second': 0.704, 'train_steps_per_second': 0.044, 'total_flos': 61478646493056.0, 'train_loss': 0.7397070479771447, 'epoch': 1.0})
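The step count and throughput in this output can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_examples = 1000   # size of the training subset passed to Trainer
batch_size = 16         # per_device_train_batch_size
epochs = 1              # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 63, matching global_step in TrainOutput

train_runtime = 1419.8769  # seconds, copied from TrainOutput
samples_per_second = round(train_examples * epochs / train_runtime, 3)
print(samples_per_second)  # 0.704, matching train_samples_per_second
```

The table's Training Loss (0.4706) and the 'train_loss' in TrainOutput (0.7397) differ. A likely explanation is that the table shows the most recently logged value while TrainOutput reports the average over all steps.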

# Step 9: Evaluate the Model

After training, we'll evaluate the model on the same 200-example test subset and print the accuracy.




In [12]:
eval_result = trainer.evaluate()
print(f"\n✅ Evaluation Accuracy: {eval_result['eval_accuracy']:.2f}")



✅ Evaluation Accuracy: 0.89
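Since this accuracy comes from only 200 examples, it carries some sampling uncertainty. As a rough sketch, a normal-approximation (Wald) interval can be computed from the reported accuracy of 0.885. This is a back-of-the-envelope estimate that assumes independent test examples, not something the notebook itself computes.

```python
import math

accuracy = 0.885   # eval accuracy reported in the training table
n = 200            # number of evaluation examples

correct = round(accuracy * n)                       # correctly classified examples
std_err = math.sqrt(accuracy * (1 - accuracy) / n)  # binomial standard error
margin = 1.96 * std_err                             # approximate 95% half-width

print(correct)  # 177
print(f"{accuracy - margin:.3f} to {accuracy + margin:.3f}")  # 0.841 to 0.929
```

Under these assumptions, the accuracy on the full test set could plausibly sit a few points above or below 0.885.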


# Step 10: Make Sample Predictions

We'll use a 'pipeline' for text classification with the fine-tuned model to predict labels on new sample texts and print the predicted categories.


In [16]:
# Get label names from the dataset metadata
label_names = dataset["train"].features["label"].names

# Create text classification pipeline using our fine-tuned model
text_classifier = pipeline(
    "text-classification", model=model, tokenizer=tokenizer
)

# Custom sample headlines for testing
sample_texts = [
    "Real Madrid defeats Barcelona in a thrilling El Clásico.",
    "Oil prices hit a record low amid global tensions.",
    "NASA launches a new rocket to Mars.",
    "The stock market saw a huge increase today.",
    "Manchester United won the football match.",
    "Apple announces the release of the iPhone 15."
]

# Print predictions
print("\n Sample Predictions:\n")
for text in sample_texts:
    pred = text_classifier(text)[0]
    # Default pipeline labels look like "LABEL_2"; take the trailing class id
    label_index = int(pred['label'].split("_")[-1])
    label = label_names[label_index]
    print(f"Text: {text}\nPredicted Label: {label}\n")


Device set to use cpu



 Sample Predictions:

Text: Real Madrid defeats Barcelona in a thrilling El Clásico.
Predicted Label: Sports

Text: Oil prices hit a record low amid global tensions.
Predicted Label: Business

Text: NASA launches a new rocket to Mars.
Predicted Label: Sci/Tech

Text: The stock market saw a huge increase today.
Predicted Label: Business

Text: Manchester United won the football match.
Predicted Label: Sports

Text: Apple announces the release of the iPhone 15.
Predicted Label: Sci/Tech
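The label parsing in the loop above can be isolated into a small helper. This is a sketch that assumes the pipeline returns default labels of the form "LABEL_<id>" (as it does when the model config carries no custom label names) and that the class order is World, Sports, Business, Sci/Tech. `readable_label` and the example prediction are hypothetical.

```python
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]  # assumed class order

def readable_label(pipeline_label, names=LABEL_NAMES):
    """Map a default pipeline label such as 'LABEL_2' to a class name."""
    index = int(pipeline_label.split("_")[-1])
    return names[index]

# Hypothetical pipeline output for one input; the score is made up.
pred = {"label": "LABEL_1", "score": 0.97}
print(readable_label(pred["label"]))  # Sports
```

If the model were loaded with custom label names in its config, the pipeline would return those names directly and this parsing step would no longer apply.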

