<a href="https://colab.research.google.com/github/nov05/Google-Colaboratory/blob/master/20241130_finetune_bert_solution_4_class_weights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **\<TOP\>**  

* changed by nov05 on 2024-11-28  
* Udacity AWS MLE Nanodegree (ND189)  
  Course 4, 3.15 Excercise: Fine-Tuning BERT    
* Colab or local env `conda activate cuda_py310` with cuda enabled  
* data source: [CoLA dataset on KaggleHub](https://www.kaggle.com/datasets/krazy47/cola-the-corpus-of-linguistic-acceptability)  
  The Corpus of Linguistic Acceptability   

---  

* This notebook is based on the 3-partial-freeze one.  
* Set `config.use_class_weights = True` in this notebook for training to see if there is improvement.    

* freeze top 6 layers + class weights  
```
Step 1515: [1300/6413 (99%)] Loss: 0.069613
100%|██████████| 15/15 [10:05<00:00, 40.36s/it]🟢 Test Accuracy (%):  80.59247737556561
```  
* train the whole model + class weights  
```
Step 808: [1300/6413 (99%)] Loss: 0.006557
 53%|█████▎    | 8/15 [09:14<08:04, 69.21s/it]🟢 Test Accuracy (%):  84.40328054298642  
```
```
Step 1515: [1300/6413 (99%)] Loss: 0.431469
100%|██████████| 15/15 [17:19<00:00, 69.30s/it]🟢 Test Accuracy (%):  84.19824660633485
```  

# **Solution: Fine-tune BERT model**

In [1]:
import os
# import sys
# import json
from tqdm import tqdm
import wandb
import numpy as np
import pandas as pd
import torch
from torch.optim import AdamW
# import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer # type: ignore
from transformers import get_linear_schedule_with_warmup # type: ignore
from sklearn.model_selection import train_test_split # type: ignore
from sklearn.utils.class_weight import compute_class_weight # type: ignore

## log training process with W&B if commented
# os.environ['WANDB_MODE'] = 'disabled'

In [2]:
class Config:
    def __init__(self):
        """
        hyperparameters
        """
        self.wandb = True
        self.device = torch.device('cpu')
        self.max_len = int(64) ## this is the max length of the sentence
        self.epochs = int(15)
        self.batch_size = int(64)  ## ⚠️ important. if too small, the model might not learn.
        self.opt_lr = 2e-5    ## ⚠️ VERY important. keep it small for pre-trained model.
        self.opt_weight_decay = 1e-6
        self.use_class_weights = True
        self.freeze_layers = 0 #int(9)  ## from bottom to top, total 15 for BERT-base

config = Config()
config.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"👉 Running on device type: {config.device}")

👉 Running on device type: cuda:0


# **Prepare train-test datasets**    

---



In [3]:
## get raw data
!mkdir -p cola_public
!mkdir -p cola_public/raw
!wget https://raw.githubusercontent.com/nov05/udacity-CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter/refs/heads/main/cd0387_common_model_arch_types_fine_tuning/cola_public/raw/in_domain_train.tsv -O cola_public/raw/in_domain_train.tsv

--2024-11-30 07:54:09--  https://raw.githubusercontent.com/nov05/udacity-CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter/refs/heads/main/cd0387_common_model_arch_types_fine_tuning/cola_public/raw/in_domain_train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 428578 (419K) [text/plain]
Saving to: ‘cola_public/raw/in_domain_train.tsv’


2024-11-30 07:54:09 (13.3 MB/s) - ‘cola_public/raw/in_domain_train.tsv’ saved [428578/428578]



In [4]:
df = pd.read_csv(
   r"/content/cola_public/raw/in_domain_train.tsv",
   sep="\t",
   header=None,
   usecols=[1, 3],
   names=["label", "sentence"],
)
sentences = df.sentence.values
labels = df.label.values
print(df.shape)
## ⚠️ there is some imbalance in the training dataset
print(sum(labels)/len(labels))
df.sample(3)

(8551, 2)
0.704362062916618


Unnamed: 0,label,sentence
7710,1,Michael abandoned an old friend at Mardi Gras
7177,1,She may have and should have thawed the roast.
4839,1,What causes students to select particular majors?


In [5]:
## train-test split
!mkdir -p data
!mkdir -p data/cola
train_df, test_df = train_test_split(df, stratify=labels)
train_df.to_csv(r"data/cola/train.csv", index=False)
test_df.to_csv(r"data/cola/test.csv", index=False)

# **Functions**

In [6]:
print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat==labels_flat) / len(labels_flat)


def get_train_data_loader(batch_size):
    dataset = pd.read_csv(os.path.join("data", "cola", "train.csv"))
    sentences = dataset.sentence.values
    labels = dataset.label.values

    input_ids = []
    for sent in sentences:
        encoded_sent = tokenizer.encode(sent, add_special_tokens=True)
        input_ids.append(encoded_sent)

    # pad shorter sentences
    input_ids_padded = []
    for id in input_ids:
        while len(id) < config.max_len:
            id.append(0)
        input_ids_padded.append(id)
    input_ids = input_ids_padded

    # mask; 0: added, 1: otherwise
    attention_masks = []
    # For each sentence...
    for sent in input_ids:
        att_mask = [int(token_id > 0) for token_id in sent]
        attention_masks.append(att_mask)

    # convert to PyTorch data types.
    train_inputs = torch.tensor(input_ids)
    train_labels = torch.tensor(labels)
    train_masks = torch.tensor(attention_masks)

    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    return train_dataloader


def get_test_data_loader(test_batch_size):
    dataset = pd.read_csv(os.path.join("data", "cola", "test.csv"))
    sentences = dataset.sentence.values
    labels = dataset.label.values

    input_ids = []
    for sent in sentences:
        encoded_sent = tokenizer.encode(sent, add_special_tokens=True)
        input_ids.append(encoded_sent)

    # pad shorter sentences
    input_ids_padded = []
    for id in input_ids:
        while len(id)<config.max_len:
            id.append(0)
        input_ids_padded.append(id)
    input_ids = input_ids_padded

    # mask; 0: added, 1: otherwise
    attention_masks = []
    # For each sentence...
    for sent in input_ids:
        att_mask = [int(token_id>0) for token_id in sent]
        attention_masks.append(att_mask)

    # convert to PyTorch data types.
    train_inputs = torch.tensor(input_ids)
    train_labels = torch.tensor(labels)
    train_masks = torch.tensor(attention_masks)

    train_data = TensorDataset(train_inputs, train_masks, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=test_batch_size)

    return train_dataloader


def train():
    train_loader = get_train_data_loader(config.batch_size)
    test_loader = get_test_data_loader(config.batch_size)

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased",  # Use the 12-layer BERT model, with an uncased vocab.
        num_labels=2,  # The number of output labels--2 for binary classification.
        output_attentions=False,  # Whether the model returns attentions weights
        output_hidden_states=False,  # Whether the model returns all hidden-states
    )
    ## from bottom to top, freeze {config.freeze_layers} layers
    param_names_to_freeze = []
    for i in range(config.freeze_layers):
        if i==0:
            g = model.bert.embeddings.named_parameters()  ## generator
            prefix = "bert.embeddings."
        elif i<=12:
            g = model.bert.encoder.layer[i-1].named_parameters()
            prefix = f"bert.encoder.layer{i-1}."
        elif i==13:
            g = model.bert.pooler.named_parameters()
            prefix = "bert.pooler."
        else:
            # g = model.classifier.named_parameters()
            raise Exception("⚠️ No more layers to freeze")
        for name, params in g:
            params.requires_grad = False
            param_names_to_freeze.append(prefix+name)
    model = model.to(config.device)

    ## set up optimizer
    optimizer_grouped_parameters = [{
        "params": [params for name,params in model.named_parameters()
            if not name in param_names_to_freeze],
        "lr": config.opt_lr,
        "weight_decay": config.opt_weight_decay,
    }]
    optimizer = AdamW(optimizer_grouped_parameters)

    ## set up scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(train_loader)*config.epochs)

    ## set up loss function
    if config.use_class_weights:
        class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=labels)
        class_weights = torch.tensor(class_weights, dtype=torch.float).to(config.device)
        loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

    total_steps = 0
    for epoch in tqdm(range(config.epochs)):
        print(f"👉 Train Epoch {epoch}:")
        loss_epoch = 0
        model.train()
        for step, batch in enumerate(train_loader):
            total_steps += 1
            b_input_ids = batch[0].to(config.device)
            b_input_mask = batch[1].to(config.device)
            b_labels = batch[2].to(config.device)
            model.zero_grad()

            outputs = model(
                b_input_ids,                 ## Shape: (batch_size, sequence_length)
                token_type_ids=None,         ## Shape: (batch_size, sequence_length)
                attention_mask=b_input_mask, ## Shape: (batch_size, sequence_length)
                labels=b_labels)             ## Shape: (batch_size,)

            if config.use_class_weights:
                logits = outputs.logits  ## same with outputs[1]
                loss = loss_fn(logits.view(-1, 2), b_labels.view(-1))
            else:
                loss = outputs.loss  ## same with outputs[0]
            wandb.log({"train_loss": loss.item()}, step=total_steps)
            loss_epoch += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)
            optimizer.step()
            scheduler.step()
            if step%10==0:
                print(
                    f"Step {total_steps}: "
                    f"[{step*len(batch[0])}/{len(train_loader.sampler)} "
                    f"({(100.0*step/len(train_loader)):.0f}%)] "
                    f"Loss: {loss.item():.6f}"
                )
        wandb.log({"train_loss_epoch": loss_epoch/config.batch_size}, step=total_steps)
        eval_accuracy = test(model, test_loader)
        wandb.log({f"eval_accuracy_epoch (%)": eval_accuracy*100}, step=total_steps)
    return model


def test(model, test_loader):
    model.eval()
    eval_accuracy_steps = 0
    total_steps = 0
    with torch.no_grad():
        for batch in test_loader:
            total_steps += 1
            b_input_ids = batch[0].to(config.device)
            b_input_mask = batch[1].to(config.device)
            b_labels = batch[2].to(config.device)
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)
            logits = outputs.logits.detach().cpu().numpy()
            label_ids = b_labels.to("cpu").numpy()
            eval_accuracy_steps += flat_accuracy(logits, label_ids)
    eval_accuracy = eval_accuracy_steps / total_steps
    print("🟢 Test Accuracy (%): ", eval_accuracy*100)
    return eval_accuracy

Loading BERT tokenizer...


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

# **👉 Train the model**  



In [7]:
wandb.init(
    # set the wandb project where this run will be logged
    project="udacity-awsmle-bert-cola",
    config=config
)
train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/15 [00:00<?, ?it/s]

👉 Train Epoch 0:


  7%|▋         | 1/15 [01:10<16:27, 70.55s/it]

🟢 Test Accuracy (%):  81.34898190045249
👉 Train Epoch 1:


 13%|█▎        | 2/15 [02:19<15:05, 69.64s/it]

🟢 Test Accuracy (%):  82.54736990950227
👉 Train Epoch 2:


 20%|██        | 3/15 [03:28<13:54, 69.52s/it]

🟢 Test Accuracy (%):  83.68566176470588
👉 Train Epoch 3:


 27%|██▋       | 4/15 [04:38<12:43, 69.38s/it]

🟢 Test Accuracy (%):  83.60082013574662
👉 Train Epoch 4:


 33%|███▎      | 5/15 [05:47<11:33, 69.33s/it]

🟢 Test Accuracy (%):  84.01442307692308
👉 Train Epoch 5:


 40%|████      | 6/15 [06:56<10:23, 69.28s/it]

🟢 Test Accuracy (%):  84.22299208144796
👉 Train Epoch 6:


 47%|████▋     | 7/15 [08:05<09:13, 69.24s/it]

🟢 Test Accuracy (%):  83.50890837104072
👉 Train Epoch 7:


 53%|█████▎    | 8/15 [09:14<08:04, 69.21s/it]

🟢 Test Accuracy (%):  84.40328054298642
👉 Train Epoch 8:


 60%|██████    | 9/15 [10:24<06:55, 69.22s/it]

🟢 Test Accuracy (%):  83.62556561085972
👉 Train Epoch 9:


 67%|██████▋   | 10/15 [11:33<05:45, 69.20s/it]

🟢 Test Accuracy (%):  83.34629524886877
👉 Train Epoch 10:


 73%|███████▎  | 11/15 [12:42<04:36, 69.18s/it]

🟢 Test Accuracy (%):  83.90130090497738
👉 Train Epoch 11:


 80%|████████  | 12/15 [13:51<03:27, 69.20s/it]

🟢 Test Accuracy (%):  83.79171380090497
👉 Train Epoch 12:


 87%|████████▋ | 13/15 [15:00<02:18, 69.25s/it]

🟢 Test Accuracy (%):  84.33611425339367
👉 Train Epoch 13:


 93%|█████████▎| 14/15 [16:10<01:09, 69.26s/it]

🟢 Test Accuracy (%):  84.03916855203619
👉 Train Epoch 14:


100%|██████████| 15/15 [17:19<00:00, 69.30s/it]

🟢 Test Accuracy (%):  84.19824660633485





BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [8]:
wandb.finish()

VBox(children=(Label(value='0.132 MB of 0.132 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval_accuracy_epoch (%),▁▄▆▆▇█▆█▆▆▇▇█▇█
train_loss,█▇█▇▆▄▃▃▃▃▂▂▄▃▂▂▁▁▁▃▂▁▁▃▂▁▁▂▁▂▁▃▂▁▁▁▁▁▁▁
train_loss_epoch,█▆▄▃▂▂▂▂▂▁▁▁▁▁▁

0,1
eval_accuracy_epoch (%),84.19825
train_loss,0.43147
train_loss_epoch,0.04133


# **Explore BERT-base Architecture**  

```python
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
```

In [9]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=2,  # The number of output labels--2 for binary classification
    output_attentions=False,  # Whether the model returns attentions weights
    output_hidden_states=False,  # Whether the model returns all hidden-states
)
## Some weights of BertForSequenceClassification were not initialized from the model checkpoint
## at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
## You should probably TRAIN this model on a down-stream task to be able to use
## it for predictions and inference.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
param_names = [name for name,param in model.named_parameters()]
print(len(param_names))
param_names[:23]  ## bottom layers param names

201


['bert.embeddings.word_embeddings.weight',
 'bert.embeddings.position_embeddings.weight',
 'bert.embeddings.token_type_embeddings.weight',
 'bert.embeddings.LayerNorm.weight',
 'bert.embeddings.LayerNorm.bias',
 'bert.encoder.layer.0.attention.self.query.weight',
 'bert.encoder.layer.0.attention.self.query.bias',
 'bert.encoder.layer.0.attention.self.key.weight',
 'bert.encoder.layer.0.attention.self.key.bias',
 'bert.encoder.layer.0.attention.self.value.weight',
 'bert.encoder.layer.0.attention.self.value.bias',
 'bert.encoder.layer.0.attention.output.dense.weight',
 'bert.encoder.layer.0.attention.output.dense.bias',
 'bert.encoder.layer.0.attention.output.LayerNorm.weight',
 'bert.encoder.layer.0.attention.output.LayerNorm.bias',
 'bert.encoder.layer.0.intermediate.dense.weight',
 'bert.encoder.layer.0.intermediate.dense.bias',
 'bert.encoder.layer.0.output.dense.weight',
 'bert.encoder.layer.0.output.dense.bias',
 'bert.encoder.layer.0.output.LayerNorm.weight',
 'bert.encoder.layer

In [11]:
param_names[-10:]  ## top layers param names

['bert.encoder.layer.11.intermediate.dense.weight',
 'bert.encoder.layer.11.intermediate.dense.bias',
 'bert.encoder.layer.11.output.dense.weight',
 'bert.encoder.layer.11.output.dense.bias',
 'bert.encoder.layer.11.output.LayerNorm.weight',
 'bert.encoder.layer.11.output.LayerNorm.bias',
 'bert.pooler.dense.weight',
 'bert.pooler.dense.bias',
 'classifier.weight',
 'classifier.bias']

* 🟢 BERT-base layers from the bottom to the top:  
`[embeddings, encoder.layer[0], encoder.layer[1], ..., encoder.layer[11], pooler, classifier]`

In [12]:
## access a specific layer
for name, params in model.bert.embeddings.named_parameters():
    print(name)

word_embeddings.weight
position_embeddings.weight
token_type_embeddings.weight
LayerNorm.weight
LayerNorm.bias


In [13]:
## access a specific encoder layer
for name, params in model.bert.encoder.layer[0].named_parameters():
    print(name)

attention.self.query.weight
attention.self.query.bias
attention.self.key.weight
attention.self.key.bias
attention.self.value.weight
attention.self.value.bias
attention.output.dense.weight
attention.output.dense.bias
attention.output.LayerNorm.weight
attention.output.LayerNorm.bias
intermediate.dense.weight
intermediate.dense.bias
output.dense.weight
output.dense.bias
output.LayerNorm.weight
output.LayerNorm.bias


In [14]:
## access a specific layer
for name, params in model.bert.pooler.named_parameters():
    print(name)

dense.weight
dense.bias


# **Online discussion**    

* [Unfreezing all layers of BERT giving good results than freezing and adding custom Forward layer for Fine-Tuning](https://www.reddit.com/r/MLQuestions/comments/1d07qlz/unfreezing_all_layers_of_bert_giving_good_results/)   

* [I just can't fine tune BERT over 40% accuracy for text-classification task](https://www.reddit.com/r/MachineLearning/comments/1bx5r8r/d_i_just_cant_fine_tune_bert_over_40_accuracy_for/)  

* [Why do we train whole BERT model for fine tuning and not freeze it?](https://www.reddit.com/r/deeplearning/comments/ndmqm6/why_do_we_train_whole_bert_model_for_fine_tuning/)   
  > That is how you usually train BERT. It gives more room for improvement and adjustment than just adding a classifier. The pretrained weights are not destroyed since you use very low learning rates, e.g. 1e-5  

* [Why don't we regularize the bias term?](https://www.deepwizai.com/simply-deep/why-does-regularizing-the-bias-lead-to-underfitting-in-neural-networks#:~:text=Regularization%20and%20Bias&text=Each%20bias%20controls%20only%20a,a%20significant%20amount%20of%20underfitting.%E2%80%9D)    
  > Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.” May 17, 2021   

* > When using weight decay in optimizers (like in AdamW), **the decay is typically applied only to the weights of the model** (usually the parameters associated with the kernels in layers like Linear or Convolutional). Biases, along with parameters like batch normalization weights, generally do not have weight decay applied by default.

  > You **don't have to manually freeze the biases** for this behavior to occur. The optimizer can be set up to apply weight decay only to specific parameters (like weights) and exclude others (like biases). This is often done by creating parameter groups in your optimizer configuration.