# BitFit

## I. Presentation

BitFit is a fine-tuning technique that updates only a subset of the model's parameters, namely the bias terms.
In other words, the weights of the layers are kept frozen.
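
To make the idea concrete before moving to a real model, here is a minimal toy sketch in plain Python. It is not the notebook's method, only an illustration: a one-dimensional linear model `y = w * x + b` where the "pretrained" weight `w` is frozen and gradient descent touches only the bias `b`. The data, the learning rate and the function name `bitfit_step` are all hypothetical.

```python
# Toy BitFit illustration: a 1-D linear model y = w * x + b.
# Only the bias b receives gradient updates; the weight w stays frozen.

def bitfit_step(w, b, xs, ys, lr=0.1):
    """One gradient-descent step on the mean squared error, updating b only."""
    n = len(xs)
    # d/db of mean((w*x + b - y)^2) = 2 * mean(w*x + b - y)
    grad_b = 2.0 * sum(w * x + b - y for x, y in zip(xs, ys)) / n
    return w, b - lr * grad_b  # w is returned unchanged (frozen)

# Hypothetical data generated by y = 2x + 3; the "pretrained" weight is w = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = 2.0, 0.0

for _ in range(100):
    w, b = bitfit_step(w, b, xs, ys)

print(w, round(b, 4))
```

The weight never changes, while the bias converges towards 3, the value that fits this toy data.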

## II. Example

We take the conversational example.

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
# Prepare the example
# For the sake of illustrating fine-tuning, we do not split the data and do not evaluate during training.


# import
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

ckp_data = "yahma/alpaca-cleaned"
ckp = "bigscience/bloomz-1b1"

# load dataset
data = load_dataset(ckp_data, split="train[:1000]")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckp)

# process data
def process(sample):

    MAX_LEN = 256

    human = tokenizer("Human: " + "\n".join([sample["instruction"], sample["input"]]).strip() + "\n\nAssistant: ")
    ml = tokenizer(sample["output"] + tokenizer.eos_token)

    input_ids = human["input_ids"] + ml["input_ids"]
    attention_mask = human["attention_mask"] + ml["attention_mask"]
    labels = [-100] * len(human["input_ids"]) + ml["input_ids"]

    if len(input_ids) > MAX_LEN:

        input_ids = input_ids[:MAX_LEN]
        attention_mask = attention_mask[:MAX_LEN]
        labels = labels[:MAX_LEN]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# tokenize dataset
tokenized_data = data.map(process, remove_columns=data.column_names)

# load model
model = AutoModelForCausalLM.from_pretrained(ckp, low_cpu_mem_usage=True)

# define training arguments
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1
)

# define trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

# train
trainer.train()


| Step | Training Loss |
|------|---------------|
| 10 | 2.375 |
| 20 | 2.3978 |
| 30 | 2.3615 |
| 40 | 2.3327 |
| 50 | 2.2497 |
| 60 | 2.2426 |
| 70 | 2.0264 |
| 80 | 2.3792 |
| 90 | 1.8954 |
| 100 | 1.9446 |


TrainOutput(global_step=125, training_loss=2.1997402267456057, metrics={'train_runtime': 198.2195, 'train_samples_per_second': 5.045, 'train_steps_per_second': 0.631, 'total_flos': 556685400268800.0, 'train_loss': 2.1997402267456057, 'epoch': 1.0})
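
The `process` function above masks the prompt tokens so that the loss is computed on the answer only. The sketch below shows that masking on made-up token ids; the ids and the helper name `build_example` are hypothetical, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def build_example(prompt_ids, answer_ids, max_len):
    """Concatenate prompt and answer, mask the prompt in the labels, then truncate."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids[:max_len], labels[:max_len]

# Made-up ids: three prompt tokens, then two answer tokens and an EOS token (2).
input_ids, labels = build_example([11, 12, 13], [21, 22, 2], max_len=5)

print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

Note that truncation can cut off the EOS token, as it does here, so very long samples end without an end-of-sequence label.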

In [3]:
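# compute the model size (number of parameters) and a rough training-memory estimate

params = sum(param.numel() for param in model.parameters())
print("model size: ", params/1e9, "B parameters")
# rough estimate used here: 4 + 4 + 12 = 20 bytes per parameter for full fine-tuning
print("total required memory: ", round(params/1e9 * (4 + 4 + 12), 2), "GB")

model size:  1.065314304 B parameters
total required memory:  21.31 GB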


While training runs, about 21 GB of GPU memory is in use, out of the 24 GB available on my GPU.

## III. Use BitFit

In [5]:
# Freeze every parameter in the model that is not a bias

# The number of updated (trainable) parameters drops sharply

params_count = 0
total_count = 0

for name, param in model.named_parameters():

    if "bias" not in name:
        # not a bias: freeze the parameter
        # and add its size to the total count
        param.requires_grad = False
        total_count += param.numel()
    else:
        # bias: keep it trainable and count it
        # both as trainable and in the total
        params_count += param.numel()
        total_count += param.numel()

print("update parameters: ", params_count, "over total of: ", total_count, " (", round(params_count/total_count*100, 4), "%)" )

update parameters:  408576 over total of:  1065314304  ( 0.0384 %)
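
The figure of 408576 can be cross-checked by hand. The sketch below assumes the configuration of `bigscience/bloomz-1b1` (hidden size 1536, 24 transformer blocks) and the usual BLOOM layout, in which each block holds two LayerNorms, a fused query/key/value projection, an attention output projection and a two-layer MLP, all with biases. These figures are assumptions and are not stated in the notebook. Note that the `"bias" not in name` test also matches LayerNorm biases, so they are counted and trained along with the linear-layer biases.

```python
HIDDEN = 1536   # assumed hidden size of bloomz-1b1
N_LAYERS = 24   # assumed number of transformer blocks

# Bias sizes inside one transformer block (assumed BLOOM layout).
per_block = {
    "input_layernorm": HIDDEN,
    "self_attention.query_key_value": 3 * HIDDEN,
    "self_attention.dense": HIDDEN,
    "post_attention_layernorm": HIDDEN,
    "mlp.dense_h_to_4h": 4 * HIDDEN,
    "mlp.dense_4h_to_h": HIDDEN,
}

# LayerNorm biases outside the blocks: one after the embeddings, one before the LM head.
outside_blocks = 2 * HIDDEN

bias_params = N_LAYERS * sum(per_block.values()) + outside_blocks
print(bias_params)  # 408576
```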


In [6]:
# fine-tune only the biases

# training is noticeably faster (about 106 s vs. 198 s for the same 125 steps in the runs shown here)

# define training arguments
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1
)

# define trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

# train
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 0.8669 |
| 20 | 0.9688 |
| 30 | 0.9557 |
| 40 | 0.9266 |
| 50 | 1.0318 |
| 60 | 0.9047 |
| 70 | 0.9681 |
| 80 | 1.0221 |
| 90 | 1.044 |
| 100 | 1.096 |


TrainOutput(global_step=125, training_loss=1.118237762451172, metrics={'train_runtime': 106.3603, 'train_samples_per_second': 9.402, 'train_steps_per_second': 1.175, 'total_flos': 556685400268800.0, 'train_loss': 1.118237762451172, 'epoch': 1.0})

Training now uses less than 6 GB of GPU memory, compared to about 21 GB previously.
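
The two memory figures can be roughly reconciled with a back-of-the-envelope estimate. The sketch below reuses the notebook's 4 + 4 + 12 bytes-per-parameter rule and assumes (this is an assumption, not something measured here) that the first 4 bytes are the weights themselves, needed for every parameter, while the remaining 16 bytes of gradients and optimizer-related state are needed only for trainable parameters. Activations and framework overhead are ignored, and the function name is hypothetical.

```python
def estimate_training_memory_gb(total_params, trainable_params,
                                weight_bytes=4, per_trainable_bytes=4 + 12):
    """Rough training-memory estimate in GB (1 GB = 1e9 bytes)."""
    weights = total_params * weight_bytes                    # stored for every parameter
    training_state = trainable_params * per_trainable_bytes  # assumed trainable-only cost
    return (weights + training_state) / 1e9

TOTAL_PARAMS = 1_065_314_304   # from the parameter count printed above
BIAS_PARAMS = 408_576          # trainable parameters under BitFit

full_ft = estimate_training_memory_gb(TOTAL_PARAMS, TOTAL_PARAMS)
bitfit = estimate_training_memory_gb(TOTAL_PARAMS, BIAS_PARAMS)

print(round(full_ft, 2))  # 21.31
print(round(bitfit, 2))   # 4.27
```

An estimate of about 4.3 GB is consistent with the observed usage of under 6 GB; the remainder plausibly comes from activations and framework overhead, which this sketch leaves out.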

## IV. Inference

In [4]:
from transformers import pipeline

# before fine-tuning, try the model's prediction

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=model.device)
human = "human: {}\n{}".format("List five steps for comparing two products.", "").strip() + "\n\nAssistant: "
pipe(human, max_length=256)



[{'generated_text': 'human: List five steps for comparing two products.\n\nAssistant: 1. Identify the main difference between the two products\n2. Compare the two products\n3. Compare the two products by their features\n4. Compare the two products by their specifications\n5. Compare the two products by their performance'}]

In [7]:
# after fine-tuning, generate the prediction again

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=model.device)
human = "human: {}\n{}".format("List five steps for comparing two products.", "").strip() + "\n\nAssistant: "
pipe(human, max_length=256)

[{'generated_text': 'human: List five steps for comparing two products.\n\nAssistant: 1. Identify the main difference between the two products\n2. Compare the two products\n3. Evaluate the performance of the two products\n4. Compare the two products\n5. Evaluate the overall performance of the two products'}]
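
One detail worth noting: the training prompt in `process` starts with `"Human: "`, while the inference cells use lowercase `"human: "`. A small helper shared by training and inference avoids this kind of mismatch. The sketch below is hypothetical (the name `build_prompt` does not appear in the notebook), and whether the difference in case changes the generated text was not tested here.

```python
def build_prompt(instruction, input_text=""):
    """Format a prompt the same way `process` does for training samples."""
    body = "\n".join([instruction, input_text]).strip()
    return "Human: " + body + "\n\nAssistant: "

prompt = build_prompt("List five steps for comparing two products.")
print(repr(prompt))
```

The resulting string can be passed to `pipe(...)` in place of the hand-formatted `human` variable.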

## V. Conclusion

With BitFit, we can greatly reduce the number of trainable parameters and thereby lower memory usage during training. However, the method has limited practical applicability because the full model still has to be saved every time, which requires a lot of storage space.