### We will use Langboat/bloom-1b4-zh from Hugging Face as the base model. ###

## BitFit is a minimalist fine-tuning strategy that updates only the bias terms of a pretrained model, which is exactly what makes it so efficient and surprising. ##

🧠 What Does “Only Adjusting Bias” Mean?
In neural networks, each layer typically has:

- Weights: the main parameters that transform inputs.

- Biases: offsets (one per output unit) added after the weighted sum.

BitFit freezes all weights and only updates the biases during fine-tuning. That means:

- The model’s core representations stay intact.

- Only the additive shifts (biases) are tuned to adapt to the new task.
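
As a toy illustration of the idea (not the model used below), consider a single linear unit `y = w·x + b` trained with squared error. The sketch uses made-up numbers and a hypothetical helper `bitfit_step`; it updates only `b` and leaves `w` untouched, which is the BitFit rule in miniature.

```python
def bitfit_step(weights, bias, x, target, lr=0.1):
    """One gradient-descent step on (w.x + b - target)^2 that updates only the bias."""
    pred = sum(w * xi for w, xi in zip(weights, x)) + bias
    grad_bias = 2.0 * (pred - target)  # derivative of the squared error w.r.t. b
    # The weights are returned unchanged: they are "frozen".
    return weights, bias - lr * grad_bias


weights, bias = [0.5, -1.0], 0.0   # illustrative "pretrained" values
x, target = [2.0, 1.0], 3.0        # one illustrative training example

for _ in range(50):
    weights, bias = bitfit_step(weights, bias, x, target)

print(weights)         # the weights are exactly what they were at the start
print(round(bias, 3))  # the bias has moved close to 3.0 to absorb the error
```

With the weights fixed, the only way to reduce the error is to shift the output, so the bias converges to the offset that the task needs.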

## 1. Load modules

In [1]:
from datasets import load_dataset

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

W0805 16:29:32.835000 96695 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


In [3]:
import datasets

In [4]:
datasets.__version__

'3.6.0'

In [5]:
import warnings
warnings.filterwarnings('ignore')

## 2. Load data

In [6]:
ds = load_dataset("json", data_dir = "./alpaca_data_zh/")
ds = ds["train"]

In [7]:
ds

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 48818
})

In [9]:
ds[:3]

{'instruction': ['保持健康的三个提示。', '三原色是什么？', '描述原子的结构。'],
 'input': ['', '', ''],
 'output': ['以下是保持健康的三个提示：\n\n1. 保持身体活动。每天做适当的身体运动，如散步、跑步或游泳，能促进心血管健康，增强肌肉力量，并有助于减少体重。\n\n2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物，避免高糖、高脂肪和加工食品，以保持健康的饮食习惯。\n\n3. 睡眠充足。睡眠对人体健康至关重要，成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力，促进身体恢复，并提高注意力和记忆力。',
  '三原色通常指的是红色、绿色和蓝色（RGB）。它们是通过加色混合原理创建色彩的三种基础颜色。在以发光为基础的显示设备中（如电视、计算机显示器、智能手机和平板电脑显示屏）, 三原色可混合产生大量色彩。其中红色和绿色可以混合生成黄色，红色和蓝色可以混合生成品红色，蓝色和绿色可以混合生成青色。当红色、绿色和蓝色按相等比例混合时，可以产生白色或灰色。\n\n此外，在印刷和绘画中，三原色指的是以颜料为基础的红、黄和蓝颜色（RYB）。这三种颜色用以通过减色混合原理来创建色彩。不过，三原色的具体定义并不唯一，不同的颜色系统可能会采用不同的三原色。',
  '原子是物质的基本单位，它由三种基本粒子组成：质子、中子和电子。质子和中子形成原子核，位于原子中心，核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中，质子带正电，中子不带电（中性）。原子核非常小且致密，占据了原子总质量的绝大部分。电子带负电，通常围绕核运动，形成若干层次，称为壳层或电子层。电子数量与质子数量相等，使原子呈电中性。\n\n电子在每个壳层中都呈规律分布，并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子，其次一层最多可容纳8个电子，再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响：强力和电磁力。强力的作用范围非常小，主要限制在原子核内，具有极强的吸引作用，使核子（质子和中子）紧密结合在一起。电磁力的作用范围较大，主要通过核外的电子与原子核相互作用，发挥作用。\n\n这就是原子的基本结构。原子内部结构复杂多样，不同元素的原子核中质子、中子数量不同

## 3. Preprocess data

In [10]:
tokenizer = AutoTokenizer.from_pretrained("Langboat/bloom-1b4-zh")

In [11]:
tokenizer

BloomTokenizerFast(name_or_path='Langboat/bloom-1b4-zh', vocab_size=46145, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [16]:
def process_func(example):
    MAX_LENGTH = 256
    # Prompt: "Human: <instruction>\n<input>\n\nAssistant: " (the input line is dropped when empty).
    instruction = tokenizer("\n".join(["Human: " + example["instruction"],
                                       example["input"]]).strip() + "\n\nAssistant: ")
    # Response: the reference output followed by EOS, so the model learns when to stop.
    response = tokenizer(example["output"] + tokenizer.eos_token)
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = instruction["attention_mask"] + response["attention_mask"]
    # Use -100 to ignore the prompt/instruction part during loss computation.
    # The model still receives the full input_ids (instruction + response),
    # but the loss is computed only on the response.
    # The prompt positions are masked rather than removed so that labels stay
    # aligned with input_ids; mismatched sequence lengths would cause errors.
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]
    # Truncate all three sequences to the same maximum length.
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
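
The masking and truncation logic can be seen in isolation on plain Python lists. This is a sketch with made-up token IDs and a hypothetical helper `build_features`; it mirrors `process_func` but skips the tokenizer.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_features(prompt_ids, response_ids, max_length):
    """Concatenate prompt and response, mask the prompt in the labels, then truncate."""
    input_ids = prompt_ids + response_ids
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# Made-up IDs: a 3-token prompt and a 4-token response ending in EOS (id 2).
features = build_features([11, 12, 13], [21, 22, 23, 2], max_length=6)
print(features["input_ids"])  # [11, 12, 13, 21, 22, 23]
print(features["labels"])     # [-100, -100, -100, 21, 22, 23]
```

Note that truncation cut off the trailing EOS in this example; the same happens in `process_func` to any sample longer than `MAX_LENGTH`.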

In [17]:
tokenized_ds = ds.map(process_func, remove_columns = ds.column_names)

In [19]:
tokenized_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 48818
})

In [23]:
tokenizer.decode(tokenized_ds[2]["input_ids"])

'Human: 描述原子的结构。\n\nAssistant: 原子是物质的基本单位，它由三种基本粒子组成：质子、中子和电子。质子和中子形成原子核，位于原子中心，核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中，质子带正电，中子不带电（中性）。原子核非常小且致密，占据了原子总质量的绝大部分。电子带负电，通常围绕核运动，形成若干层次，称为壳层或电子层。电子数量与质子数量相等，使原子呈电中性。\n\n电子在每个壳层中都呈规律分布，并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子，其次一层最多可容纳8个电子，再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响：强力和电磁力。强力的作用范围非常小，主要限制在原子核内，具有极强的吸引作用，使核子（质子和中子）紧密结合在一起。电磁力的作用范围较大，主要通过核外的电子与原子核相互作用，发挥作用。\n\n这就是原子的'
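
The decoded text follows the prompt template built in `process_func`. Pulled out as a small helper (the name `build_prompt` is introduced here for illustration, and the second call uses a made-up record), the template looks like this:

```python
def build_prompt(instruction, input_text=""):
    """Join the instruction and the optional input, then append the assistant cue."""
    human_turn = "\n".join(["Human: " + instruction, input_text]).strip()
    return human_turn + "\n\nAssistant: "


# Record with an empty input field, as in the sample decoded above.
print(repr(build_prompt("描述原子的结构。")))
# 'Human: 描述原子的结构。\n\nAssistant: '

# Made-up record with a non-empty input field.
print(repr(build_prompt("Translate to English.", "你好")))
# 'Human: Translate to English.\n你好\n\nAssistant: '
```

The `.strip()` call is what removes the dangling newline when the input field is empty.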

In [30]:
tokenizer.decode([token_id for token_id in tokenized_ds[2]["labels"] if token_id != -100])

'原子是物质的基本单位，它由三种基本粒子组成：质子、中子和电子。质子和中子形成原子核，位于原子中心，核外的电子围绕着原子核运动。\n\n原子结构具有层次性。原子核中，质子带正电，中子不带电（中性）。原子核非常小且致密，占据了原子总质量的绝大部分。电子带负电，通常围绕核运动，形成若干层次，称为壳层或电子层。电子数量与质子数量相等，使原子呈电中性。\n\n电子在每个壳层中都呈规律分布，并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子，其次一层最多可容纳8个电子，再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响：强力和电磁力。强力的作用范围非常小，主要限制在原子核内，具有极强的吸引作用，使核子（质子和中子）紧密结合在一起。电磁力的作用范围较大，主要通过核外的电子与原子核相互作用，发挥作用。\n\n这就是原子的'
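
Filtering out -100 here is only for display. During training, positions labeled -100 contribute nothing to the loss, because -100 is the default `ignore_index` of PyTorch's cross-entropy loss. A minimal pure-Python sketch of that averaging follows, with made-up per-token loss values and a hypothetical helper `masked_mean_loss`; it leaves out the one-position shift between inputs and labels that the causal LM applies internally.

```python
IGNORE_INDEX = -100


def masked_mean_loss(token_losses, labels):
    """Average per-token losses over positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = [-100, -100, 21, 22, 2]          # prompt masked, response kept
token_losses = [9.0, 9.0, 1.0, 2.0, 3.0]  # made-up values

print(masked_mean_loss(token_losses, labels))  # 2.0
```

The two large losses on the prompt positions do not affect the result, so the model is never penalized for how it would have predicted the prompt.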

In [33]:
tokenized_ds[2]["input_ids"]

[26283,
 29,
 210,
 10096,
 1742,
 8328,
 7241,
 672,
 189,
 4340,
 17245,
 29,
 210,
 11392,
 584,
 10009,
 15139,
 7066,
 355,
 1954,
 1321,
 25020,
 5099,
 23972,
 7720,
 1038,
 2993,
 1020,
 554,
 655,
 27702,
 7964,
 420,
 2993,
 1020,
 24405,
 1020,
 6454,
 11392,
 3317,
 355,
 5699,
 11392,
 3669,
 355,
 3317,
 13589,
 7964,
 22273,
 2282,
 11392,
 3317,
 6053,
 672,
 189,
 11392,
 7241,
 5140,
 25008,
 1237,
 420,
 11392,
 3317,
 655,
 355,
 2993,
 1020,
 3099,
 1395,
 1936,
 355,
 39089,
 643,
 3099,
 1936,
 928,
 27810,
 927,
 420,
 11392,
 3317,
 4433,
 1225,
 2409,
 2880,
 3211,
 355,
 24588,
 658,
 11392,
 2333,
 8554,
 373,
 41727,
 420,
 7964,
 3099,
 4002,
 1936,
 355,
 6770,
 22273,
 3317,
 6053,
 355,
 6454,
 11915,
 25008,
 355,
 9833,
 15953,
 4673,
 1326,
 7964,
 4673,
 420,
 7964,
 10162,
 1235,
 2993,
 1020,
 10162,
 32775,
 355,
 1408,
 11392,
 8168,
 1936,
 27810,
 672,
 189,
 7964,
 587,
 9993,
 15953,
 4673,
 33629,
 8168,
 22003,
 10740,
 355,
 6187,
 3657,


In [24]:
tokenized_ds[2]["labels"]

[-100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 11392,
 584,
 10009,
 15139,
 7066,
 355,
 1954,
 1321,
 25020,
 5099,
 23972,
 7720,
 1038,
 2993,
 1020,
 554,
 655,
 27702,
 7964,
 420,
 2993,
 1020,
 24405,
 1020,
 6454,
 11392,
 3317,
 355,
 5699,
 11392,
 3669,
 355,
 3317,
 13589,
 7964,
 22273,
 2282,
 11392,
 3317,
 6053,
 672,
 189,
 11392,
 7241,
 5140,
 25008,
 1237,
 420,
 11392,
 3317,
 655,
 355,
 2993,
 1020,
 3099,
 1395,
 1936,
 355,
 39089,
 643,
 3099,
 1936,
 928,
 27810,
 927,
 420,
 11392,
 3317,
 4433,
 1225,
 2409,
 2880,
 3211,
 355,
 24588,
 658,
 11392,
 2333,
 8554,
 373,
 41727,
 420,
 7964,
 3099,
 4002,
 1936,
 355,
 6770,
 22273,
 3317,
 6053,
 355,
 6454,
 11915,
 25008,
 355,
 9833,
 15953,
 4673,
 1326,
 7964,
 4673,
 420,
 7964,
 10162,
 1235,
 2993,
 1020,
 10162,
 32775,
 355,
 1408,
 11392,
 8168,
 1936,
 27810,
 672,
 189,
 7964,
 587,
 9993,
 15953,
 4673,
 33629,
 8168,
 22003,
 10740,
 355,
 6187,
 3

In [36]:
tokenizer.decode(tokenized_ds[2]["labels"][100:])

'通常围绕核运动，形成若干层次，称为壳层或电子层。电子数量与质子数量相等，使原子呈电中性。\n\n电子在每个壳层中都呈规律分布，并且不同壳层所能容纳的电子数也不同。在最里面的壳层一般只能容纳2个电子，其次一层最多可容纳8个电子，再往外的壳层可容纳的电子数逐层递增。\n\n原子核主要受到两种相互作用力的影响：强力和电磁力。强力的作用范围非常小，主要限制在原子核内，具有极强的吸引作用，使核子（质子和中子）紧密结合在一起。电磁力的作用范围较大，主要通过核外的电子与原子核相互作用，发挥作用。\n\n这就是原子的'

In [37]:
len(tokenized_ds[2]["labels"])

256

## 4. Create model

In [38]:
model = AutoModelForCausalLM.from_pretrained("Langboat/bloom-1b4-zh", low_cpu_mem_usage = True)

config.json:   0%|          | 0.00/801 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/5.59G [00:00<?, ?B/s]

In [39]:
model

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(46145, 2048)
    (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=2048, out_features=6144, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=2048, out_features=8192, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=8192, out_features=2048, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (l

In [40]:
sum(param.numel() for param in model.parameters())

1303111680

### BitFit
BitFit selects only the bias parameters of the model for tuning; everything else is frozen.

In [42]:
num_param = 0
for name, param in model.named_parameters():
    if "bias" not in name:
        # Freeze every non-bias parameter.
        param.requires_grad = False
    else:
        # Bias parameters stay trainable; count them.
        num_param += param.numel()

num_param

544768

In [43]:
num_param / sum(param.numel() for param in model.parameters())

0.000418051659240749
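
The count can be cross-checked by hand from the layer shapes in the printed model. The sketch assumes that the parameters whose names contain "bias" are exactly the `Linear` biases and the `LayerNorm` shift vectors (PyTorch stores the latter under the name `bias`); the variable names are introduced here for illustration.

```python
hidden = 2048
num_blocks = 24

# Bias sizes inside one BloomBlock, read off the printed architecture.
per_block = (
    hidden          # input_layernorm
    + 3 * hidden    # self_attention.query_key_value (out_features=6144)
    + hidden        # self_attention.dense
    + hidden        # post_attention_layernorm
    + 4 * hidden    # mlp.dense_h_to_4h (out_features=8192)
    + hidden        # mlp.dense_4h_to_h
)

# Two more LayerNorms sit outside the blocks: word_embeddings_layernorm and ln_f.
bias_params = num_blocks * per_block + 2 * hidden
total_params = 1303111680  # total parameter count reported earlier

print(bias_params)                 # 544768
print(bias_params / total_params)  # about 0.000418, roughly 0.04% of the model
```

Assuming 4 bytes per value (float32), the trainable biases amount to about 2.2 MB.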

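The number of parameters to tune drops sharply, from about 1.3B to 544,768 (about 0.04% of the model).

This method is simple enough that Hugging Face does not ship it as a ready-made method, so it is implemented by hand here.

We use it to get familiar with the overall fine-tuning workflow.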

## 5. Configure the training

In [44]:
args = TrainingArguments(
    output_dir = "./chatbot",  # where checkpoints and other training outputs are written
    per_device_train_batch_size = 1,  # 8 by default
    gradient_accumulation_steps = 8,  # 1 by default; accumulate gradients over 8 batches before each optimizer step (effective batch size 8 per device), which saves memory
    logging_steps = 10,  # log the training loss every 10 optimizer steps
    num_train_epochs = 1,  # number of passes over the training set
)
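
Gradient accumulation can be checked on a toy problem: summing the scaled gradients of 8 single-sample batches gives the same gradient as one batch of 8. The sketch uses made-up data and a one-parameter model `y = x + b`; `grad_mse` is a hypothetical helper, not part of `transformers`.

```python
def grad_mse(b, xs, ys):
    """Gradient of mean((b + x - y)^2) with respect to b over one batch."""
    return sum(2.0 * (b + x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.5, 2.0, 4.0, 4.5, 6.5, 7.0, 9.0]
b = 0.0

# One batch of 8 samples.
full = grad_mse(b, xs, ys)

# Eight batches of 1 sample, each gradient scaled by 1/8 and accumulated.
accum = 0.0
for x, y in zip(xs, ys):
    accum += grad_mse(b, [x], [y]) / len(xs)

print(full, accum)  # -1.875 -1.875
```

With a per-device batch size of 1 and 8 accumulation steps, one epoch over the 48,818 samples comes to roughly 6,100 optimizer steps on a single device.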

## 6. Create the trainer

In [45]:
trainer = Trainer(
    model = model,  # the model with all non-bias parameters frozen
    args = args,
    train_dataset = tokenized_ds,
    # Builds each batch: pads input_ids, attention_mask, and labels to a common length
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer, padding = True)
)

`data_collator` takes a list of dataset samples (from the DataLoader) and converts it into a single batch.

Think of it as:

- The function that builds each batch during training or evaluation.

- It pads every sequence in the batch to a common length (labels are padded with -100) and converts the result to tensors. Truncation does not happen here; it was already done in `process_func`.
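
A simplified sketch of what the collator does to a batch, on plain lists. It is not the library's implementation: `pad_batch` is a hypothetical helper, tensor conversion is omitted, and the pad ID 3 and left padding are taken from the tokenizer printout earlier in the notebook.

```python
def pad_batch(features, pad_token_id=3, label_pad_id=-100, padding_side="left"):
    """Pad every feature in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    fill = {"input_ids": pad_token_id, "attention_mask": 0, "labels": label_pad_id}
    batch = {key: [] for key in fill}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        for key, value in fill.items():
            padding = [value] * n_pad
            row = padding + f[key] if padding_side == "left" else f[key] + padding
            batch[key].append(row)
    return batch


# Two made-up samples of different lengths.
features = [
    {"input_ids": [11, 12, 21, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 21, 2]},
    {"input_ids": [13, 22], "attention_mask": [1, 1], "labels": [-100, 22]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[11, 12, 21, 2], [3, 3, 13, 22]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [0, 0, 1, 1]]
print(batch["labels"])          # [[-100, -100, 21, 2], [-100, -100, -100, 22]]
```

Padding the labels with -100 keeps the padded positions out of the loss, just like the masked prompt tokens.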

In [46]:
trainer.train()

Step,Training Loss
10,2.6956
20,2.8198
30,2.7107
40,2.6258
50,2.4725
60,2.661
70,2.8262
80,2.6269
90,2.4617
100,2.7249


SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })

### Unlike PyCharm, Jupyter keeps holding RAM after a run finishes or is paused. You need to restart the kernel to release the resources.