# Classics with LLMs - fine-tuning BERT for Iris dataset classification task

## Introduction
Ever wondered how the new generation of LLMs handles classical ML tasks? The goal of this series is to check how LLMs perform on classical datasets, through a set of simple experiments that fine-tune LLMs on various classic tasks.

This time we'll see how a fine-tuned BERT handles the Iris classification task.

Iris is the "Hello World" dataset of machine learning. On the LLM side we will use BERT, one of the simplest encoder-only models. An encoder is a natural fit because in tasks like classification we care more about the model building a rich representation of the data (the encoder's focus) than about sequence generation (the decoder's focus).

Let's see how BERT handles the Iris dataset.


In [1]:
# !pip install datasets
# !pip install accelerate
# !pip install transformers
# !pip install scikit-learn
# !pip install numpy
# !pip install pandas

In [2]:
from datasets import Dataset, DatasetDict, load_metric
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification, TrainingArguments, Trainer

from sklearn.datasets import load_iris

import numpy as np
import pandas as pd

## 1. Data preprocessing

### 1.1 Loading dataset

We will simply load the Iris dataset from the scikit-learn library.

In [3]:
iris_data = load_iris()

In [4]:
iris_data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [5]:
iris_data["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
iris_data["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [7]:
iris_df = pd.DataFrame(data = iris_data["data"], columns = iris_data["feature_names"])

In [8]:
iris_df["label"] = iris_data["target"]

In [9]:
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [10]:
iris_df["label_decoded"] = iris_df["label"].apply(lambda label_idx: iris_data["target_names"][label_idx])

In [11]:
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label | label_decoded |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |



We will be performing a classification task, and the numerical labels are fine for that purpose.

However, we need to turn the four individual numerical features into a single text field to use as the LLM input. We will do this by concatenating the four values of each row.

In [12]:
iris_df["text"] = iris_df.apply(lambda row:
                                str(row["sepal length (cm)"]) + " " +
                                str(row["sepal width (cm)"]) + " " +
                                str(row["petal length (cm)"]) + " " +
                                str(row["petal width (cm)"]), axis=1)

In [13]:
iris_df

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label | label_decoded | text |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa | 5.1 3.5 1.4 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa | 4.9 3.0 1.4 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa | 4.7 3.2 1.3 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa | 4.6 3.1 1.5 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa | 5.0 3.6 1.4 0.2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica | 6.7 3.0 5.2 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica | 6.3 2.5 5.0 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica | 6.5 3.0 5.2 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica | 6.2 3.4 5.4 2.3 |


Finally, we need to convert the DataFrame into the Hugging Face `Dataset` format.

In [15]:
dataset = Dataset.from_pandas(iris_df)

In [16]:
dataset

Dataset({
    features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text'],
    num_rows: 150
})

In [17]:
dataset[0]

{'sepal length (cm)': 5.1,
 'sepal width (cm)': 3.5,
 'petal length (cm)': 1.4,
 'petal width (cm)': 0.2,
 'label': 0,
 'label_decoded': 'setosa',
 'text': '5.1 3.5 1.4 0.2'}

To train the model and later evaluate it properly, we will now create separate train, validation and test splits. The train split will contain 70% of the examples, and the validation and test splits 15% each. We want each split to reproduce the class balance of the original dataset, so we will use a stratified split.

In [18]:
# column we want to stratify with
stratify_column_name = "label"

# create class label column and stratify
t = dataset.class_encode_column(
    stratify_column_name
)

In [19]:
# column we want to stratify first needs to be transformed into categorical variable type
dataset_testvalid = dataset.class_encode_column("label"
                                                ).train_test_split(test_size=0.3, shuffle=True, stratify_by_column="label")

test_valid = dataset_testvalid["test"].train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")

dataset = DatasetDict({
    "train": dataset_testvalid["train"],
    "validation": test_valid["train"],
    "test": test_valid["test"]

})

In [20]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text'],
        num_rows: 105
    })
    validation: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text'],
        num_rows: 22
    })
    test: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text'],
        num_rows: 23
    })
})

In [21]:
dataset["train"][0]

{'sepal length (cm)': 5.0,
 'sepal width (cm)': 3.5,
 'petal length (cm)': 1.6,
 'petal width (cm)': 0.6,
 'label': 0,
 'label_decoded': 'setosa',
 'text': '5.0 3.5 1.6 0.6'}

We will be using 105 examples for training, 22 for validation and 23 for testing.
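
To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is not the implementation used by `datasets`; the helper `stratified_split` and the fixed seed are assumptions made for illustration. It only shows why a 70/30 split of 150 balanced examples leaves 35 training examples per class.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Iris has 50 examples per class
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, rest_idx = stratified_split(labels, test_size=0.3)

train_counts = Counter(labels[i] for i in train_idx)
rest_counts = Counter(labels[i] for i in rest_idx)
print(len(train_idx), dict(train_counts))
print(len(rest_idx), dict(rest_counts))
```

The remaining 45 examples (15 per class) cannot be halved evenly, which is why the validation and test splits end up with 22 and 23 rows rather than equal sizes.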

### 1.2 Tokenization

To keep this experiment simple we will use the BERT model. Let's apply regular tokenization to the `text` field, which contains the numerical features converted to strings and concatenated. We will also create a simple data collator to use later during model training.

In [24]:
MODEL_CHECKPOINT = "bert-base-uncased"

In [26]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

In [27]:
def preprocess_data(example):
    return tokenizer(example["text"])

In [28]:
tokenized_dataset = dataset.map(preprocess_data, batched=True)

In [29]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 105
    })
    validation: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 22
    })
    test: Dataset({
        features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label', 'label_decoded', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 23
    })
})

In [30]:
tokenized_dataset["train"][0]

{'sepal length (cm)': 5.0,
 'sepal width (cm)': 3.5,
 'petal length (cm)': 1.6,
 'petal width (cm)': 0.6,
 'label': 0,
 'label_decoded': 'setosa',
 'text': '5.0 3.5 1.6 0.6',
 'input_ids': [101,
  1019,
  1012,
  1014,
  1017,
  1012,
  1019,
  1015,
  1012,
  1020,
  1014,
  1012,
  1020,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
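
The 14 `input_ids` above can be understood with a rough approximation of BERT's pre-tokenization, which splits text on whitespace and punctuation, so each decimal number becomes three tokens: digit, dot, digit. The sketch below is only an approximation based on a regular expression, not the actual WordPiece tokenizer (which may split longer numbers further), and the helper name `approx_bert_tokens` is hypothetical.

```python
import re


def approx_bert_tokens(text):
    """Approximate BERT pre-tokenization: split on whitespace and punctuation.

    WordPiece sub-word splitting is ignored, so this is only reliable for
    short pieces such as single digits and dots.
    """
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return ["[CLS]"] + pieces + ["[SEP]"]


float_tokens = approx_bert_tokens("5.0 3.5 1.6 0.6")
int_tokens = approx_bert_tokens("1 2 3 4")

print(float_tokens)
print(len(float_tokens), len(int_tokens))
```

Under this approximation the float-formatted row has 14 tokens, matching the length of `input_ids` above, while an integer-formatted input such as `1 2 3 4` has only 6. The model therefore sees a structurally different sequence even when the numerical values are the same.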

In [31]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
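
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at batch time. The following is a minimal pure-Python sketch of that idea, not the library's code: the function name `pad_batch` is hypothetical, the padding id of 0 is assumed to match the `[PAD]` token of `bert-base-uncased`, and the second sequence uses ids inferred from the digit ids in the output above.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in the batch to the longest one (dynamic padding)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    # "5.0 3.5 1.6 0.6", copied from the tokenized example above
    [101, 1019, 1012, 1014, 1017, 1012, 1019, 1015, 1012, 1020, 1014, 1012, 1020, 102],
    # assumed ids for the shorter input "1 2 3 4" (illustrative only)
    [101, 1015, 1016, 1017, 1018, 102],
]
padded = pad_batch(batch)
print(padded["attention_mask"][1])
```

Every Iris measurement is written as one digit, a dot and one digit, so the training sequences here should all be 14 tokens long. Padding therefore matters mostly for differently formatted inputs at inference time.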

## 2. Model

### 2.1 Metrics

We first define the metrics used to evaluate the model. Since this is a classification problem, we use classification metrics.

In [33]:
def compute_metrics(eval_pred):
    metric1 = load_metric("precision", trust_remote_code=True)
    metric2 = load_metric("recall", trust_remote_code=True)
    metric3 = load_metric("f1", trust_remote_code=True)
    metric4 = load_metric("accuracy", trust_remote_code=True)

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    precision = metric1.compute(predictions=predictions, references=labels, average="micro")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels, average="micro")["recall"]
    f1 = metric3.compute(predictions=predictions, references=labels, average="micro")["f1"]
    accuracy = metric4.compute(predictions=predictions, references=labels)["accuracy"]

    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
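
One consequence of `average="micro"` is worth spelling out: in a single-label multi-class problem every misclassified example counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro-averaged precision, recall and F1 all equal accuracy. This is why the four metric columns in the training log are identical. The sketch below checks this in plain Python; the function `micro_scores` and the toy labels are made up for illustration.

```python
def micro_scores(predictions, references):
    """Micro-averaged precision/recall/F1 plus accuracy, from pooled counts."""
    classes = set(predictions) | set(references)
    tp = fp = fn = 0
    for c in classes:
        for pred, ref in zip(predictions, references):
            if pred == c and ref == c:
                tp += 1
            elif pred == c:
                fp += 1  # predicted c, but the true class differs
            elif ref == c:
                fn += 1  # true class is c, but something else was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical labels and predictions, for illustration only
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
scores = micro_scores(predictions, references)
print(scores)
```

If per-class behaviour matters, `average="macro"` gives figures that can differ from accuracy.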

### 2.2 Model training

We can now load the base model with the sequence classification head that we will fine-tune. In this experiment we use `bert-base-uncased`, which is a relatively simple model, as far as transformers go.

In [34]:
MODEL_CHECKPOINT

'bert-base-uncased'

In [35]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We then configure the training arguments with rather standard settings.

In [36]:
# Remove previous training checkpoints if they exist (-f avoids an error when they don't)
!rm -rf "./bert-base-uncased-iris"

In [37]:
training_args = TrainingArguments("bert-base-uncased-iris",
                                  num_train_epochs=15,
                                  logging_steps=50,
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  learning_rate=2e-5,
                                  weight_decay=0.01,
                                  push_to_hub=False)

Finally we create the `Trainer` object with the evaluation metrics and the training and validation datasets, and run the training itself.

In [38]:
trainer = Trainer(
    model,
    training_args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

In [39]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.734884 | 0.636364 | 0.636364 | 0.636364 | 0.636364 |
| 2 | No log | 0.585901 | 0.681818 | 0.681818 | 0.681818 | 0.681818 |
| 3 | No log | 0.463981 | 0.909091 | 0.909091 | 0.909091 | 0.909091 |
| 4 | 0.709500 | 0.345151 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | 0.709500 | 0.404985 | 0.818182 | 0.818182 | 0.818182 | 0.818182 |
| 6 | 0.709500 | 0.215089 | 0.954545 | 0.954545 | 0.954545 | 0.954545 |
| 7 | 0.709500 | 0.159396 | 0.954545 | 0.954545 | 0.954545 | 0.954545 |
| 8 | 0.319000 | 0.065181 | 1.0 | 1.0 | 1.0 | 1.0 |
| 9 | 0.319000 | 0.153938 | 0.954545 | 0.954545 | 0.954545 | 0.954545 |
| 10 | 0.319000 | 0.126551 | 0.954545 | 0.954545 | 0.954545 | 0.954545 |


TrainOutput(global_step=210, training_loss=0.29829856270835514, metrics={'train_runtime': 159.9781, 'train_samples_per_second': 9.845, 'train_steps_per_second': 1.313, 'total_flos': 11331349337700.0, 'train_loss': 0.29829856270835514, 'epoch': 15.0})
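
The numbers in the log can be cross-checked with a little arithmetic. The sketch below assumes the `Trainer` default per-device batch size of 8 and a single device, neither of which is stated explicitly in the notebook. Under those assumptions it reproduces `global_step=210` and explains why, with `logging_steps=50`, the training loss is first reported at epoch 4 and next updated at epoch 8.

```python
import math

num_train_examples = 105
batch_size = 8          # assumption: Trainer default, single device
num_epochs = 15
logging_steps = 50

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epoch (1-based) during which the first and second logging events happen
first_log_epoch = math.ceil(logging_steps / steps_per_epoch)
second_log_epoch = math.ceil(2 * logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_log_epoch, second_log_epoch)
```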

In [43]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/msznajder/bert-base-uncased-iris/commit/25a4cd00c291958916709c9ff4dc2bf2da4d0581', commit_message='Training complete', commit_description='', oid='25a4cd00c291958916709c9ff4dc2bf2da4d0581', pr_url=None, pr_revision=None, pr_num=None)

Judging by the steadily falling training loss, the model is most likely overfitting to some degree, which is expected when such a complex model is applied to such a simple problem. Still, the validation loss also keeps falling overall, which suggests that the model generalizes and should work well outside the training data.

By the end of training the model reaches a very low validation loss and perfect classification performance on the validation dataset.

## 3. Evaluation

### 3.1 Quantitative evaluation

Let's now evaluate the trained model more thoroughly, starting with metrics on the test dataset.

In [41]:
prediction = trainer.predict(tokenized_dataset["test"])

In [42]:
prediction[2]

{'test_loss': 0.25261515378952026,
 'test_precision': 0.9565217391304348,
 'test_recall': 0.9565217391304348,
 'test_f1': 0.9565217391304348,
 'test_accuracy': 0.9565217391304348,
 'test_runtime': 1.3089,
 'test_samples_per_second': 17.572,
 'test_steps_per_second': 2.292}

On the test dataset the performance drops to a more realistic level, but the results are still very good, especially considering that the model was trained on only 105 examples and tested on 23 examples it never saw during training.
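
With such small splits, each accuracy value corresponds to a whole number of misclassified examples. As a quick sanity check (a sketch using only the split sizes and scores reported above; the helper `errors_from_accuracy` is hypothetical), the test accuracy of 0.9565 on 23 examples means exactly one mistake:

```python
def errors_from_accuracy(accuracy, num_examples):
    """Recover the number of misclassified examples from an accuracy score."""
    correct = round(accuracy * num_examples)
    return num_examples - correct


test_errors = errors_from_accuracy(0.9565217391304348, 23)
val_errors_epoch_1 = errors_from_accuracy(0.636364, 22)
val_errors_epoch_10 = errors_from_accuracy(0.954545, 22)

print(test_errors, val_errors_epoch_1, val_errors_epoch_10)
```

A single example moves validation accuracy by about 4.5 percentage points, which is worth keeping in mind when comparing epochs.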

### 3.2 Qualitative evaluation

Let's now use the model to make predictions on a few interesting examples, to see how it performs in practice and to probe some properties of the trained model.

We start by loading the model from the Hugging Face Hub and wrapping it in a pipeline to make predictions.

In [45]:
from transformers import pipeline

model_checkpoint = "msznajder/bert-base-uncased-iris"
iris_sequence_classifier = pipeline("text-classification", model=model_checkpoint)

Let's start with actual, and hence realistic, examples. We will input the textual representations of two actual flowers from the dataset.

In [46]:
iris_df.iloc[0]

sepal length (cm)                5.1
sepal width (cm)                 3.5
petal length (cm)                1.4
petal width (cm)                 0.2
label                              0
label_decoded                 setosa
text                 5.1 3.5 1.4 0.2
Name: 0, dtype: object

In [47]:
iris_sequence_classifier("5.1 3.5 1.4 0.2")

[{'label': 'LABEL_0', 'score': 0.9922292828559875}]

In [48]:
iris_df.iloc[100]

sepal length (cm)                6.3
sepal width (cm)                 3.3
petal length (cm)                6.0
petal width (cm)                 2.5
label                              2
label_decoded              virginica
text                 6.3 3.3 6.0 2.5
Name: 100, dtype: object

In [49]:
iris_sequence_classifier("6.3 3.3 6.0 2.5")

[{'label': 'LABEL_2', 'score': 0.9816123843193054}]

The model returns the correct class for both data points with a very high probability score.
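
The pipeline reports generic names such as `LABEL_0` because no class names were stored in the model config. A small post-processing helper can map them back to species. This is a sketch: `decode_prediction` is a hypothetical helper, and the name order is taken from `iris_data["target_names"]` shown earlier. A common alternative, not used in this notebook, is to pass `id2label` and `label2id` to `from_pretrained` before training.

```python
# Order copied from iris_data["target_names"]
TARGET_NAMES = ["setosa", "versicolor", "virginica"]


def decode_prediction(prediction, target_names=TARGET_NAMES):
    """Replace a generic 'LABEL_<idx>' with the species name at that index."""
    label_idx = int(prediction["label"].split("_")[-1])
    return {"label": target_names[label_idx], "score": prediction["score"]}


# Pipeline outputs copied from the two cells above
decoded_first = decode_prediction({"label": "LABEL_0", "score": 0.9922292828559875})
decoded_second = decode_prediction({"label": "LABEL_2", "score": 0.9816123843193054})
print(decoded_first["label"], decoded_second["label"])
```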

Let's now experiment with extreme values that are orders of magnitude larger than any real petal or sepal measurement.

In [50]:
iris_sequence_classifier("10000.1 1700.8 10.9 107.1")

[{'label': 'LABEL_2', 'score': 0.9472421407699585}]

In [53]:
iris_sequence_classifier("10000.1 1700.8 10388.9 1777707.1")

[{'label': 'LABEL_2', 'score': 0.9476704001426697}]

In [52]:
iris_sequence_classifier("100.1 100.1 2000.1 2000.1")

[{'label': 'LABEL_2', 'score': 0.8828025460243225}]

Of course, in our setting the model has no choice but to assign some class even to such unrealistic values. It does just that, and it assigns lower probabilities to these cases, though not as low as one would expect.
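
The reason is that the score is a softmax over the three class logits, so the probabilities always sum to 1 and the top score can never drop below 1/3, however strange the input. A minimal sketch follows. The logits are made-up values for illustration, and it assumes the pipeline applies softmax for this multi-class model, which should be its default for single-label models with several classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: nearly indifferent vs. clearly favouring class 2
uncertain = softmax([0.1, 0.0, 0.2])
confident = softmax([-1.0, 0.5, 3.5])

print(max(uncertain), max(confident))
```

Even the nearly indifferent case yields a top score above one third, so a score that looks moderately high does not by itself mean the input was sensible.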

We will now experiment with differences in numerical format. The inputs will differ not in value (or only marginally) but in representation: integer vs. float, and lower vs. higher precision.

In [65]:
iris_sequence_classifier("1.0 2.0 3.0 4.0")

[{'label': 'LABEL_1', 'score': 0.9904539585113525}]

In [64]:
iris_sequence_classifier("1 2 3 4")

[{'label': 'LABEL_1', 'score': 0.5277131199836731}]

In [59]:
iris_sequence_classifier("1.01 2.01 3.01 4.01")

[{'label': 'LABEL_1', 'score': 0.9848111867904663}]

In [60]:
iris_sequence_classifier("1.012 2.012 3.012 4.012")

[{'label': 'LABEL_1', 'score': 0.9679797291755676}]

Here the results differ noticeably between examples, even though the numerical values are identical or differ only slightly.

The first example is the most natural one: its data format, and roughly its range of values, resemble the training dataset. The model makes its prediction with high probability.

When we switch from float to integer notation, keeping the same numerical values and changing only the format, the result changes dramatically: the model predicts the same class as before, but the probability score is the lowest we have seen so far. Again, the values did not change, only the format.

We can see a similar pattern, though much weaker, when we use higher floating-point precision than the original single decimal place.

It looks like an LLM fine-tuned on this data learned not only the values themselves but also the format in which they are written, which is not the case for traditional models that operate on numbers directly. So if the model was trained on numbers with one decimal place, its predictions become less confident when we provide integers or numbers with three decimal places at inference time.
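
One practical consequence is to normalize inputs to the training format before inference. The helper below is a sketch under that assumption (`to_training_format` is a hypothetical name, not part of the original notebook). Note that it rounds to one decimal place, so it deliberately discards extra precision.

```python
def to_training_format(text, decimals=1):
    """Rewrite whitespace-separated numbers with a fixed number of decimals."""
    values = [float(token) for token in text.split()]
    return " ".join(f"{value:.{decimals}f}" for value in values)


as_int = to_training_format("1 2 3 4")
as_high_precision = to_training_format("1.012 2.012 3.012 4.012")
print(as_int)
print(as_high_precision)
```

Non-numeric input such as `a b c d` raises a `ValueError` here, which is arguably preferable to a confident-looking prediction.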

As the last edge case, let's feed the model a series of letters instead of numbers.

In [58]:
iris_sequence_classifier("a b c d")

[{'label': 'LABEL_2', 'score': 0.7101045250892639}]

Interestingly, the result, at least in terms of the top-class probability, is comparably low to the case where we input integers!

This could suggest that, unsurprisingly but still worth a reminder, the model learned the data format as the main precondition for working properly, and only then uses the values to predict the class.

## 4. Summary

Is it worth using a GPU to train an Iris classifier? Probably not.

Is it interesting to see how LLMs perform on specific "traditional" machine learning tasks? I think very much so. These datasets and tasks are simple and elementary. Analyzing the behaviour of LLMs fine-tuned on such tasks can help us understand these models better, especially when they are used for tasks that are not text related.

Overall, the fine-tuned model behaves pretty much like a regular, non-token-based classifier, and its performance is quite good. There is one caveat: this holds when we input the numerical data in the format the model was trained with. That is the lesson to remember when handling numerical values in a similar project in a more realistic context.

The results of this very simple experiment suggest that model inference on numerical problems is somewhat limited, or at least that the input format of the numerical data has to be handled carefully for the model to work. This is especially important because products and applications that apply LLMs directly to numeric problems are already appearing (e.g. Microsoft Copilot for Finance), and these are basically doing math using tokens.

Stay tuned for the next experiment in this series.