# Finetuning a Pre-trained Google BERT model for classification

Source:

Textbook: Chapter 11, "Fine-Tuning Representation Models for Classification", in Alammar and Grootendorst, "Hands-On Large Language Models", O'Reilly Media Inc., September 2024.

https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/ch11.html


### Load dataset

In [1]:
from datasets import load_dataset

# Prepare data and splits
tomatoes = load_dataset("rotten_tomatoes")
tomatoes 


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [2]:
tomatoes["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

In [3]:
tomatoes["test"][0, -1]

{'text': ['lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .',
  "enigma is well-made , but it's just too dry and too placid ."],
 'label': [1, 0]}

In [4]:
train_data, test_data = tomatoes["train"], tomatoes["test"]

### Tokenize and setup training data

Definitions:

AutoModelForSequenceClassification
https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification

BertForSequenceClassification
https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#transformers.BertForSequenceClassification

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

# Load model and tokenizer
model_id = "bert-base-cased"

configuration = AutoConfig.from_pretrained(model_id)
configuration.hidden_dropout_prob = 0.1
# configuration.attention_probs_dropout_prob = 0.1
configuration.num_labels = 2

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, config = configuration
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
from transformers import DataCollatorWithPadding

# Pad to the longest sequence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def preprocess_function(examples):
    """Tokenize input data"""
    return tokenizer(examples["text"], truncation=True)

# Tokenize train/test data
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)
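
To make the "pad to the longest sequence in the batch" comment concrete, here is a minimal pure-Python sketch of dynamic padding. It is an illustration only, not the `DataCollatorWithPadding` implementation: the helper `pad_batch` is hypothetical and the token ids are made up for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence of token ids to the longest one in this batch.

    Hypothetical helper for illustration; the real collator also handles
    tensors, labels, and tokenizer-specific padding settings.
    """
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different length (illustrative ids, not real tokenizer output)
batch = [[101, 1103, 2523, 102], [101, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded length depends on each batch rather than on the whole dataset, short batches stay short, which is why tokenization above uses `truncation=True` without a fixed `padding` length.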

### Define evaluation metrics

In [7]:
import numpy as np
from evaluate import load

# Load the metric once rather than on every evaluation call
f1_metric = load("f1")

def compute_metrics(eval_pred):
    """Calculate F1 score"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"f1": f1}

### Instantiate model trainer

In [8]:
from transformers import TrainingArguments, Trainer

# Training arguments for parameter tuning
training_args = TrainingArguments(
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="epoch",
    report_to="none",
    output_dir="output",
#     seed=0,
    )

In [9]:
# Trainer which executes the training process
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    )


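### Perform finetuning on training data

Note that this cell calls `trainer.evaluate()` without a preceding `trainer.train()`, so the metrics it reports describe the pre-trained encoder with a randomly initialized classification head, not a fine-tuned model. The textbook metrics listed afterwards were recorded after one epoch of training.
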
In [10]:
trainer.evaluate()

{'eval_loss': 0.8508212566375732,
 'eval_model_preparation_time': 0.0021,
 'eval_f1': 0.6666666666666666,
 'eval_runtime': 1.3989,
 'eval_samples_per_second': 762.012,
 'eval_steps_per_second': 47.894}
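
An F1 of about 0.667 is what a classifier scores when it predicts the positive class for every example of a balanced binary dataset. The sketch below works this out by hand. It assumes the 1,066-row test split contains 533 positive and 533 negative reviews and that the untrained head predicted label 1 throughout; both are assumptions offered as one explanation consistent with the number above, not something verified in this notebook. The helper `f1_binary` is hypothetical.

```python
def f1_binary(predictions, references, positive=1):
    """Binary F1 for the positive class, computed from raw counts."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed balanced test split: 533 positive, 533 negative reviews
references = [1] * 533 + [0] * 533
# Assumed degenerate model: always predicts the positive class
predictions = [1] * 1066

score = f1_binary(predictions, references)
print(round(score, 4))
```

Precision is 0.5 and recall is 1.0 in this scenario, so F1 is 2/3, which is why a score near 0.667 should be read as a baseline rather than as evidence of learning.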

#### Expected metrics (textbook): 

{'eval_loss': 0.3663691282272339,
 'eval_f1': 0.8492366412213741,
 'eval_runtime': 4.5792,
 'eval_samples_per_second': 232.791,
 'eval_steps_per_second': 14.631,
 'epoch': 1.0}

### Freezing Layers

We freeze the main BERT model so that gradient updates reach only the classification head. All other settings stay the same, so any difference in performance can be attributed to the frozen layers.

First, reload the model so that the classification head starts from freshly initialized weights:

In [11]:
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Print layer names
for name, param in model.named_parameters():
    print(name)

bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.enc

In [13]:
for name, param in model.named_parameters():
    # Trainable classification head
    if name.startswith("classifier"):
        param.requires_grad = True

    # Freeze everything else
    else:
        param.requires_grad = False
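
With this rule applied, the only trainable tensors are `classifier.weight` and `classifier.bias`. The following back-of-the-envelope count shows how small that is. It assumes the bert-base hidden size of 768 and a head consisting of a single linear layer, which matches the two `classifier.*` names in the initialization message above.

```python
# Assumed architecture constants for bert-base with a 2-class head
hidden_size = 768
num_labels = 2

# A linear layer maps hidden_size features to num_labels logits
classifier_weight = hidden_size * num_labels
classifier_bias = num_labels
trainable_params = classifier_weight + classifier_bias

print(trainable_params)
```

To check the figure against the loaded model, `sum(p.numel() for p in model.parameters() if p.requires_grad)` gives the same kind of count directly from the parameters.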

In [14]:
from transformers import TrainingArguments, Trainer

# Trainer which executes the training process
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    )

trainer.train()


Step,Training Loss
500,0.6867
1000,0.6758
1500,0.6724
2000,0.6708
2500,0.6636
3000,0.6626
3500,0.6625
4000,0.6603
4500,0.6589
5000,0.6594


TrainOutput(global_step=5340, training_loss=0.6667037363802449, metrics={'train_runtime': 105.9814, 'train_samples_per_second': 804.858, 'train_steps_per_second': 50.386, 'total_flos': 2273050323914520.0, 'train_loss': 0.6667037363802449, 'epoch': 10.0})
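
The `global_step=5340` in the output can be reproduced from the dataset size and the training arguments. This is a small worked calculation, assuming a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

train_rows = 8530   # size of the train split shown in the dataset summary
batch_size = 16     # per_device_train_batch_size
epochs = 10         # num_train_epochs

# The last batch of each epoch is smaller than 16 but still counts as a step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The training loss table stops at step 5000 because the loss is logged every 500 steps, and the next multiple of 500 would fall after the final step.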

In [15]:
trainer.evaluate()

{'eval_loss': 0.6473310589790344,
 'eval_f1': 0.6230769230769232,
 'eval_runtime': 1.1893,
 'eval_samples_per_second': 896.349,
 'eval_steps_per_second': 56.337,
 'epoch': 10.0}

#### Expected metrics (textbook): 

{'eval_loss': 0.6821751594543457,
 'eval_f1': 0.6331058020477816,
 'eval_runtime': 4.0175,
 'eval_samples_per_second': 265.337,
 'eval_steps_per_second': 16.677,
 'epoch': 1.0}

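### Modification: freeze everything before encoder block 11

Encoder block 11 (counting from 1, which is `bert.encoder.layer.10` in the parameter listing) starts at parameter index 165. Everything before that index is frozen, so the last two encoder blocks, the pooler, and the classification head remain trainable.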

In [16]:
# Load model
model_id = "bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Encoder block 11 starts at index 165 and
# we freeze everything before that block
for index, (name, param) in enumerate(model.named_parameters()):    
    if index < 165:
        param.requires_grad = False

# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.4729
1000,0.3766
1500,0.3523
2000,0.3202
2500,0.2897
3000,0.2649
3500,0.2445
4000,0.2215
4500,0.216
5000,0.1851


TrainOutput(global_step=5340, training_loss=0.2876098761397801, metrics={'train_runtime': 156.236, 'train_samples_per_second': 545.969, 'train_steps_per_second': 34.179, 'total_flos': 2273050323914520.0, 'train_loss': 0.2876098761397801, 'epoch': 10.0})
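
The index 165 used in the freezing loop can be derived from the parameter listing printed earlier, which shows 5 embedding tensors followed by 16 tensors per encoder layer. The sketch below does that arithmetic; the constants are read off that listing and the helper `first_param_index` is hypothetical.

```python
EMBEDDING_PARAMS = 5    # word, position, token_type, LayerNorm weight and bias
PARAMS_PER_LAYER = 16   # weight and bias tensors listed for one encoder layer


def first_param_index(layer):
    """Index in named_parameters() of the first tensor of encoder.layer.<layer>."""
    return EMBEDDING_PARAMS + PARAMS_PER_LAYER * layer


print(first_param_index(10))  # layer.10 is the 11th block when counting from 1
print(first_param_index(11))  # layer.11 is the 12th and last block
```

So `index < 165` freezes the embeddings and `bert.encoder.layer.0` through `bert.encoder.layer.9`, leaving the last two encoder layers and everything after them trainable.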

In [17]:
trainer.evaluate()

{'eval_loss': 0.5359287261962891,
 'eval_f1': 0.8322274881516587,
 'eval_runtime': 1.1619,
 'eval_samples_per_second': 917.451,
 'eval_steps_per_second': 57.663,
 'epoch': 10.0}

#### Expected metrics (textbook): 

{'eval_loss': 0.40812647342681885,
 'eval_f1': 0.8,
 'eval_runtime': 3.7125,
 'eval_samples_per_second': 287.137,
 'eval_steps_per_second': 18.047,
 'epoch': 1.0}

### Conclusion:

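These runs demonstrate that, although we generally want to train as many layers as possible, you can get away with training fewer if you lack the necessary computing power. With only the classification head trainable, F1 on the test split reached about 0.62 after 10 epochs; with the last two encoder blocks trainable as well, it reached about 0.83.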