# Hugging Face BERT for Text Classification
## Objective

Fine-tune a pre-trained BERT model for text classification using the Hugging Face ecosystem.

Unlike previous notebooks:

- Feature extraction is learned end-to-end
- Tokenization becomes model-dependent
- Leakage risks move to the tokenization + split boundary
## Why End-to-End Transformers

Compared to frozen embeddings (e.g. SBERT):

- Representations adapt to the task
- Contextual semantics are optimized
- Higher performance ceiling

Trade-offs:

- More compute
- Less interpretability
- Greater tuning sensitivity
## Imports and Setup

In [2]:
import numpy as np
import pandas as pd
import torch

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from sklearn.metrics import accuracy_score, f1_score


# Reproducibility Settings

In [56]:
SEED = 2010
torch.manual_seed(SEED)
np.random.seed(SEED)
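
The cell above seeds PyTorch and NumPy but not Python's built-in `random` module. A minimal sketch of a broader helper follows; `seed_everything` is a hypothetical name, and the PyTorch import is guarded so the snippet also runs where PyTorch is not installed. `TrainingArguments` additionally accepts its own `seed` argument, which this sketch does not touch.

```python
import random

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and (if installed) PyTorch generators. Hypothetical helper."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch not available; the other generators are still seeded
    torch.manual_seed(seed)


# Drawing after re-seeding should reproduce the same numbers.
seed_everything(2010)
first = (random.random(), np.random.rand(3).tolist())
seed_everything(2010)
second = (random.random(), np.random.rand(3).tolist())
print(first == second)
```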

## Example Dataset
Binary sentiment-style classification.

In [6]:
data = {
    "text": [
        "This model works very well",
        "Excellent performance and stability",
        "Terrible results and poor accuracy",
        "Bad predictions and unreliable output",
        "Robust and interpretable system",
        "Awful behavior and weak model"
    ],
    "label": [1, 1, 0, 0, 1, 0]
}

df = pd.DataFrame(data)
df


|   | text | label |
|---|------|-------|
| 0 | This model works very well | 1 |
| 1 | Excellent performance and stability | 1 |
| 2 | Terrible results and poor accuracy | 0 |
| 3 | Bad predictions and unreliable output | 0 |
| 4 | Robust and interpretable system | 1 |
| 5 | Awful behavior and weak model | 0 |


# Train / Validation Split

**Critical:** Split before tokenization.

In [8]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(
    df,
    test_size=0.3,
    random_state=SEED,
    stratify=df["label"]
)
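
One leak that is easy to check right after splitting is the same text landing in both splits. The sketch below is one way to test for it, under the assumption that case and whitespace differences should not count as distinct texts; `normalize` and `overlapping_texts` are hypothetical helpers, and the two lists are made-up stand-ins for the real splits.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def overlapping_texts(train_texts, val_texts):
    """Return the normalized texts that occur in both splits."""
    train_seen = {normalize(t) for t in train_texts}
    val_seen = {normalize(t) for t in val_texts}
    return sorted(train_seen & val_seen)


# Hypothetical splits; the second validation text duplicates a training text.
train_texts = ["This model works very well", "Terrible results and poor accuracy"]
val_texts = ["Awful behavior and weak model", "this model works  very well"]

leaks = overlapping_texts(train_texts, val_texts)
print(leaks)  # ['this model works very well']
```

With the notebook's frames the call would be `overlapping_texts(train_df["text"], val_df["text"])`, and an empty list is the desired result.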


## Convert to Hugging Face Datasets

In [10]:
train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds = Dataset.from_pandas(val_df.reset_index(drop=True))

# Load Tokenizer and Model
## Model Choice

In [12]:
model_name = "bert-base-uncased"

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenization Function

In [15]:
def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
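
Padding every example to `max_length=128` is simple, but it is costly for short sentences. The sketch below estimates how many padding positions that creates for the six example texts, compared with padding only to the longest sequence. It assumes a rough proxy for token counts (whitespace words plus two special tokens); a real WordPiece tokenizer may split words further, so the numbers are indicative only.

```python
MAX_LENGTH = 128

texts = [
    "This model works very well",
    "Excellent performance and stability",
    "Terrible results and poor accuracy",
    "Bad predictions and unreliable output",
    "Robust and interpretable system",
    "Awful behavior and weak model",
]


def approx_token_count(text: str) -> int:
    """Proxy only: whitespace words plus [CLS] and [SEP]."""
    return len(text.split()) + 2


lengths = [approx_token_count(t) for t in texts]
longest = max(lengths)

# Padding positions when every row is padded to MAX_LENGTH.
pad_fixed = sum(MAX_LENGTH - n for n in lengths)
# Padding positions when rows are padded only to the longest row.
pad_dynamic = sum(longest - n for n in lengths)

print(lengths)                  # [7, 6, 7, 7, 6, 7]
print(pad_fixed, pad_dynamic)   # 728 2
```

In the Hugging Face API, `padding="longest"` in the tokenizer call or a `DataCollatorWithPadding` passed to the `Trainer` are the usual ways to pad per batch instead of to a fixed length.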

## Apply Tokenization

In [17]:
train_ds = train_ds.map(tokenize_batch, batched=True)
val_ds = val_ds.map(tokenize_batch, batched=True)


## Set Dataset Format for PyTorch

In [19]:
columns = ["input_ids", "attention_mask", "label"]

train_ds.set_format(type="torch", columns=columns)
val_ds.set_format(type="torch", columns=columns)


## Define Evaluation Metrics

In [21]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

## Training Arguments

In [23]:
training_args = TrainingArguments(
    output_dir="./bert_results",
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"
)


In [24]:
# Source - https://stackoverflow.com/a/76452964
# Posted by alvas, modified by community. See post 'Timeline' for change history
# Retrieved 2026-02-06, License - CC BY-SA 4.0

! pip install -U accelerate
! pip install -U transformers









## Trainer Setup

In [26]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    # Newer transformers releases deprecate `tokenizer=` in favor of `processing_class=`.
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



## Train the Model

In [28]:
trainer.train()



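| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | No log | 0.669275 | 0.5 | 0.666667 |
| 2 | No log | 0.658616 | 0.5 | 0.666667 |
| 3 | No log | 0.649734 | 0.5 | 0.666667 |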




TrainOutput(global_step=3, training_loss=0.6913228034973145, metrics={'train_runtime': 9.4453, 'train_samples_per_second': 1.27, 'train_steps_per_second': 0.318, 'total_flos': 789333166080.0, 'train_loss': 0.6913228034973145, 'epoch': 3.0})
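
The `No log` entries and `global_step=3` follow from simple step arithmetic. A minimal sketch, assuming a single device and no gradient accumulation (neither is set in the training arguments), with the four training rows that a 0.3 test split of six examples leaves:

```python
import math

n_train = 4          # 6 rows, test_size=0.3 -> 2 validation rows, 4 training rows
batch_size = 8       # per_device_train_batch_size
epochs = 3           # num_train_epochs
logging_steps = 10   # training loss is logged every 10 optimizer steps

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
logs_emitted = total_steps // logging_steps

print(steps_per_epoch, total_steps, logs_emitted)  # 1 3 0
```

Training ends after three steps, before the first logging step is reached, so no training loss is recorded per epoch. Lowering `logging_steps` to 1 would show it.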

## Evaluation

In [30]:
trainer.evaluate()



{'eval_loss': 0.6692748069763184,
 'eval_accuracy': 0.5,
 'eval_f1': 0.6666666666666666,
 'eval_runtime': 0.1651,
 'eval_samples_per_second': 12.117,
 'eval_steps_per_second': 6.059,
 'epoch': 3.0}
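
An accuracy of 0.5 together with an F1 of 0.667 has a simple reading. Assuming the stratified split put one example of each class into the two-row validation set, these values match a model that predicts class 1 for both rows. The worked arithmetic, in plain Python:

```python
labels = [0, 1]   # assumed validation labels: one per class
preds = [1, 1]    # model predicts the positive class for both rows

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 6))  # 0.5 0.5 1.0 0.666667
```

F1 on the positive class can look respectable while the model has learned nothing about the negative class, which is why accuracy is reported alongside it.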

## Inference Example

In [32]:
text = "The model shows excellent performance"

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

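# Trainer may have moved the model to a GPU; keep the inputs on the same device.
inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}

model.eval()  # disable dropout for inference
with torch.no_grad():
    outputs = model(**inputs)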

prediction = torch.argmax(outputs.logits, dim=1).item()
prediction


1
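
`argmax` returns a label even when the model is nearly undecided. The reported training loss of about 0.691 is close to ln 2 ≈ 0.693, the cross-entropy of a uniform guess over two classes, so the probabilities behind a prediction are worth inspecting. The sketch below applies a softmax to hypothetical logits; the values are illustrative and not taken from the run.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one input: [class 0, class 1].
logits = np.array([[-0.05, 0.10]])
probs = softmax(logits)

predicted = int(probs.argmax(axis=-1)[0])
print(predicted, probs.round(3))
```

With the notebook's tensors, `torch.softmax(outputs.logits, dim=1)` gives the same quantity.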

# Interpretation Notes

- BERT learns task-specific representations
- Attention ≠ explanation
- Use probing or SHAP-style methods for insights
# Common Mistakes

- Tokenizing before splitting
- Using excessive `max_length`
- Ignoring class imbalance
- Over-training small datasets
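
For the class-imbalance item, one common remedy is to weight the loss by inverse class frequency. A minimal sketch using the "balanced" heuristic `n_samples / (n_classes * count)`; `balanced_class_weights` is a hypothetical helper, and the skewed label list is made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Labels from the example dataset: already balanced, so both weights are 1.0.
balanced = balanced_class_weights([1, 1, 0, 0, 1, 0])

# Hypothetical skewed labels: nine negatives, one positive.
skewed = balanced_class_weights([0] * 9 + [1])

print(balanced)  # {0: 1.0, 1: 1.0}
print(skewed)
```

Weights like these could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss` in a custom loss; that wiring is not part of this notebook.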

# Key Takeaways

- Transformers require strict pipeline discipline
- Tokenization is part of the model
- Fine-tuning can yield strong gains given enough data; the six-example run above stays at 0.5 validation accuracy
- Baselines still matter for validation