# Prompt Tuning

## I. Presentation

Prompt tuning prepends a small trainable prompt-embedding module to the input embeddings of the base model:

    ____________________________________________________________
    |      prompt embedding          |    Embedding            |
    ------------------------------------------------------------
    |_____________ __________________|____________ ____________|
                  |                               |
                Prompt                     base model Input     

Training therefore updates only the prompt embeddings and leaves the base model intact.
This drastically reduces the number of trainable parameters.
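
As a minimal sketch of the layout in the diagram, assuming PyTorch: a small trainable matrix of prompt embeddings is concatenated in front of the frozen token embeddings. The class name `SoftPrompt` and the toy sizes are illustrative assumptions, not the peft implementation.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepend trainable prompt embeddings to frozen token embeddings (illustrative)."""

    def __init__(self, embedding: nn.Embedding, num_virtual_tokens: int):
        super().__init__()
        self.embedding = embedding
        # freeze the base embedding table, as the base model is left intact
        for p in self.embedding.parameters():
            p.requires_grad = False
        # the only trainable tensor: (num_virtual_tokens, hidden_dim)
        self.prompt = nn.Parameter(
            torch.randn(num_virtual_tokens, embedding.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tokens = self.embedding(input_ids)  # (batch, seq, dim)
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # prompt embeddings first, then the embeddings of the real input
        return torch.cat([prompt, tokens], dim=1)  # (batch, P + seq, dim)


# toy sizes: vocabulary of 100, hidden size 16, 4 virtual tokens
layer = SoftPrompt(nn.Embedding(100, 16), num_virtual_tokens=4)
out = layer(torch.randint(0, 100, (2, 5)))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(tuple(out.shape), trainable)
```

The sequence grows by the number of virtual tokens, and only the prompt matrix receives gradients.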

Here we try both variants of prompt tuning: soft and hard.


In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## II. Parameter-efficient fine-tuning (PEFT)

We will use the peft module from Hugging Face for fine-tuning. The models supported for this type of fine-tuning are listed at:
https://huggingface.co/docs/peft/v0.6.2/en/index

In [2]:
# install peft

# uncomment and run the following line if no peft is installed, otherwise, skip

# !python -m pip install peft --break-system-packages

## III. Example

We use the same example as for BitFit.

In [3]:
# prepare example

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

ckp_data = "yahma/alpaca-cleaned"
ckp = "bigscience/bloomz-1b1"

# load dataset
data = load_dataset(ckp_data, split="train[:1000]")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckp)

# process data
def process(sample):

    MAX_LEN = 256

    # prompt part: instruction and optional input, followed by the assistant marker
    human = tokenizer("Human: " + "\n".join([sample["instruction"], sample["input"]]).strip() + "\n\nAssistant: ")
    # response part, terminated with EOS so the model learns where to stop
    ml = tokenizer(sample["output"] + tokenizer.eos_token)

    input_ids = human["input_ids"] + ml["input_ids"]
    attention_mask = human["attention_mask"] + ml["attention_mask"]
    # -100 masks the prompt tokens out of the loss; only the response is supervised
    labels = [-100] * len(human["input_ids"]) + ml["input_ids"]

    # truncate overly long samples
    if len(input_ids) > MAX_LEN:

        input_ids = input_ids[:MAX_LEN]
        attention_mask = attention_mask[:MAX_LEN]
        labels = labels[:MAX_LEN]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# tokenize dataset
tokenized_data = data.map(process, remove_columns=data.column_names)

# load model
model = AutoModelForCausalLM.from_pretrained(ckp, low_cpu_mem_usage=True)

# send to device
if torch.cuda.is_available():
    model = model.to("cuda:0")


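
The label construction in `process` can be checked in isolation. The sketch below reproduces it with made-up token ids (the ids and the helper name `build_example` are hypothetical): prompt positions get the label -100, which the cross-entropy loss ignores, so only the response tokens are supervised.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def build_example(prompt_ids, response_ids, max_len):
    """Concatenate prompt and response, mask the prompt in the labels, truncate."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {
        "input_ids": input_ids[:max_len],
        "attention_mask": [1] * min(len(input_ids), max_len),
        "labels": labels[:max_len],
    }


# toy ids: 3 prompt tokens, 3 response tokens (the last one standing in for EOS)
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22, 2], max_len=5)
supervised = [t for t in example["labels"] if t != IGNORE_INDEX]
print(example)
print(supervised)
```

Note that truncation can cut off the end-of-sequence token, as it does in this toy case.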


In [4]:
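# compute model size

params = sum(param.numel() for param in model.parameters())
# params/1e9 is the parameter count in billions, not a size in GB
print("model size: ", params/1e9, "B parameters")
# rough estimate for full fine-tuning, in bytes per parameter; the terms are read here as
# 4 (weights) + 4 (gradients) + 12 (optimizer state)
print("total required memory: ", round(params/1e9 * (4 + 4 + 12), 2), "GB")

model size:  1.065314304 B parameters
total required memory:  21.31 GB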
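
The estimate above can be written as a small function. This is a rough sketch under stated assumptions: the three terms are read as bytes per parameter for weights, gradients and optimizer state, activations are ignored, and the function name and the `n_trainable` argument are hypothetical. It also shows why freezing the base model helps: gradients and optimizer state are needed only for the trainable parameters.

```python
def training_memory_gb(n_params, n_trainable=None,
                       weight_bytes=4, grad_bytes=4, optim_bytes=12):
    """Rough training memory in GB (1e9 bytes), ignoring activations."""
    if n_trainable is None:
        n_trainable = n_params  # full fine-tuning: every parameter is trained
    total_bytes = n_params * weight_bytes + n_trainable * (grad_bytes + optim_bytes)
    return total_bytes / 1e9


n_params = 1_065_314_304  # parameter count printed above

full = training_memory_gb(n_params)
# assumed example: 10 virtual tokens of dimension 1536 as the only trainable part
prompt_only = training_memory_gb(n_params, n_trainable=10 * 1536)
print(round(full, 2), round(prompt_only, 2))
```

Under these assumptions the weights dominate in the prompt-tuning case, since the trainable part is negligible.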


## IV. Soft prompt

For the soft prompt, there is no concrete value for the added prompt, so its embeddings are initialised randomly and learned from the data.

In [5]:
# config for soft prompt

from peft import PromptTuningConfig, get_peft_model, TaskType, PromptTuningInit

config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10)
config

PromptTuningConfig(peft_type=<PeftType.PROMPT_TUNING: 'PROMPT_TUNING'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, num_virtual_tokens=10, token_dim=None, num_transformer_submodules=None, num_attention_heads=None, num_layers=None, prompt_tuning_init=<PromptTuningInit.RANDOM: 'RANDOM'>, prompt_tuning_init_text=None, tokenizer_name_or_path=None, tokenizer_kwargs=None)

In [6]:
# wrap the model

# the original BLOOM model becomes a PEFT model

peft_model = get_peft_model(model, config)
peft_model

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1536)
      (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
            (dense): Linear(in_features=1536, out_features=1536, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
          )
        )
      

In [7]:
# show parameters of wrapped model

peft_model.print_trainable_parameters()

trainable params: 15,360 || all params: 1,065,329,664 || trainable%: 0.0014
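
The trainable count above is consistent with one embedding vector per virtual token. A quick check, taking the base parameter count and the embedding dimension 1536 from the outputs above (the function name is hypothetical):

```python
def prompt_tuning_params(num_virtual_tokens, token_dim):
    """One trainable embedding vector of size token_dim per virtual token."""
    return num_virtual_tokens * token_dim


base_params = 1_065_314_304  # parameters of the frozen base model
trainable = prompt_tuning_params(num_virtual_tokens=10, token_dim=1536)
total = base_params + trainable
print(f"trainable params: {trainable:,} || all params: {total:,} "
      f"|| trainable%: {100 * trainable / total:.4f}")
```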


In [8]:
# compared to the earlier config output, some fields such as token_dim and num_layers are now filled in:
# get_peft_model updates them from the base model.

config

PromptTuningConfig(peft_type=<PeftType.PROMPT_TUNING: 'PROMPT_TUNING'>, auto_mapping=None, base_model_name_or_path='bigscience/bloomz-1b1', revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, num_virtual_tokens=10, token_dim=1536, num_transformer_submodules=1, num_attention_heads=16, num_layers=24, prompt_tuning_init=<PromptTuningInit.RANDOM: 'RANDOM'>, prompt_tuning_init_text=None, tokenizer_name_or_path=None, tokenizer_kwargs=None)

In [9]:
# here, the loss does not seem to decrease
# it may actually be decreasing, but too slowly to be visible in the logged values.

# define training arguments
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=50,
    num_train_epochs=3
)

# define trainer
trainer = Trainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

# train
trainer.train()

Step,Training Loss
50,2.6865
100,2.5316
150,2.5924
200,2.7851
250,2.5489
300,2.577
350,2.6745


TrainOutput(global_step=375, training_loss=2.6220494079589844, metrics={'train_runtime': 337.8862, 'train_samples_per_second': 8.879, 'train_steps_per_second': 1.11, 'total_flos': 1670056200806400.0, 'train_loss': 2.6220494079589844, 'epoch': 3.0})
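
The `global_step=375` in the output follows from the training arguments. A small sketch of the arithmetic, assuming a single device; the function name is hypothetical, and the exact rounding at epoch boundaries may differ between transformers versions (it does not matter here because 1000 is divisible by 8):

```python
import math


def total_optimizer_steps(num_samples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_devices=1):
    """Number of optimizer updates over the whole training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs


# 1000 samples, batch size 1, 8 accumulation steps, 3 epochs
steps = total_optimizer_steps(1000, 1, 8, 3)
logged_rows = steps // 50  # logging_steps=50
print(steps, logged_rows)
```

Each update sees only 8 samples, and the prompt receives 375 updates in total, which is little for randomly initialised embeddings.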

In [10]:
from transformers import pipeline

# try the prediction of the base model; it is frozen during prompt tuning,
# so this is the prediction before fine-tuning

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=model.device)
human = "human: {}\n{}".format("List five steps for comparing two products.", "").strip() + "\n\nAssistant: "
pipe(human, max_length=256)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': 'human: List five steps for comparing two products.\n\nAssistant:  What are the steps for comparing two products?'}]

In [11]:
# inference

# if we use the pipeline as in bitfit, we get the error: The model 'PeftModelForCausalLM' is not
# supported for text-generation. The pipeline does not recognise the PEFT wrapper class,
# so we have to generate the result ourselves.

# the result does not look right, which means that the prompt is not trained well.
# The training was probably too short: soft prompt tuning needs more training to be effective.

# note: with input=None the literal string "None" is formatted into the prompt (visible in the
# output below), and the prefix "human: " differs from the "Human: " used for the training data.

def generate(model, tokenizer, instruction, input=None):

    prompt = "human: {}\n{}".format(instruction, input).strip() + "\n\nAssistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)

    generation_output = model.generate(
        input_ids=input_ids,
        output_scores=True,
        max_new_tokens=256
    )
    for seq in generation_output:
        output = tokenizer.decode(seq, skip_special_tokens=True)
        print(output)

generate(peft_model, tokenizer, "List five steps for comparing two products.")

human: List five steps for comparing two products.
None

Assistant:  No
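
The output above shows a literal `None` in the prompt, and the lowercase `human: ` prefix differs from the `Human: ` used when tokenizing the training data. A possible prompt builder that mirrors the training format is sketched below; the name `build_prompt` and its `extra_input` argument are hypothetical, and the outputs recorded in this notebook were not produced with it.

```python
def build_prompt(instruction, extra_input=""):
    """Format a prompt the same way as the training samples."""
    # treat None like an empty input so that "None" never reaches the prompt
    body = "\n".join([instruction, extra_input or ""]).strip()
    return "Human: " + body + "\n\nAssistant: "


prompt = build_prompt("List five steps for comparing two products.")
print(repr(prompt))
```

Passing this string to the tokenizer inside `generate` would keep inference consistent with the format seen during training.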


## V. Hard prompt

Contrary to the soft prompt, and as its name indicates, the hard prompt takes an initial prompt value: the virtual tokens are initialised from the embeddings of a given text and then trained.

In [12]:
# config
from peft import PromptTuningConfig, get_peft_model, TaskType, PromptTuningInit

# hard prompt

config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, 
                            prompt_tuning_init=PromptTuningInit.TEXT,
                            prompt_tuning_init_text="below is a human model dialog.",
                            num_virtual_tokens=len(tokenizer("below is a human model dialog.")["input_ids"]),
                            tokenizer_name_or_path=ckp
                            )
config

PromptTuningConfig(peft_type=<PeftType.PROMPT_TUNING: 'PROMPT_TUNING'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, num_virtual_tokens=7, token_dim=None, num_transformer_submodules=None, num_attention_heads=None, num_layers=None, prompt_tuning_init=<PromptTuningInit.TEXT: 'TEXT'>, prompt_tuning_init_text='below is a human model dialog.', tokenizer_name_or_path='bigscience/bloomz-1b1', tokenizer_kwargs=None)

In [13]:
# wrap the model

peft_model = get_peft_model(model, config)
peft_model

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1536)
      (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
            (dense): Linear(in_features=1536, out_features=1536, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
          )
        )
      

In [14]:
# compared to the soft prompt, we have fewer parameters to train since our prompt is shorter

peft_model.print_trainable_parameters()

trainable params: 10,752 || all params: 1,065,325,056 || trainable%: 0.0010


In [15]:
# compared to the soft prompt, the loss visibly decreases

# define training arguments
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=50,
    num_train_epochs=3
)

# define trainer
trainer = Trainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

# train
trainer.train()

Step,Training Loss
50,2.5887
100,2.4084
150,2.4088
200,2.5077
250,2.1515
300,2.1916
350,2.1931


TrainOutput(global_step=375, training_loss=2.3334760030110675, metrics={'train_runtime': 335.4992, 'train_samples_per_second': 8.942, 'train_steps_per_second': 1.118, 'total_flos': 1670056200806400.0, 'train_loss': 2.3334760030110675, 'epoch': 3.0})
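
To make the comparison of the two runs less dependent on eyeballing, the logged losses can be summarised. The sketch below compares the mean of the first three and the last three logged values of each run; the values are copied from the two training logs in this notebook, and the helper name is hypothetical. This is a single noisy run per setting, so it is an indication rather than a measurement.

```python
# logged training losses (steps 50 to 350), copied from the outputs above
soft = [2.6865, 2.5316, 2.5924, 2.7851, 2.5489, 2.577, 2.6745]
hard = [2.5887, 2.4084, 2.4088, 2.5077, 2.1515, 2.1916, 2.1931]


def loss_drop(losses, window=3):
    """Mean of the first `window` logged losses minus mean of the last `window`."""
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return first - last


soft_drop = loss_drop(soft)
hard_drop = loss_drop(hard)
print(f"soft prompt drop: {soft_drop:.4f}")
print(f"hard prompt drop: {hard_drop:.4f}")
```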

In [16]:
# inference

generate(peft_model, tokenizer, "List five steps for comparing two products.")

human: List five steps for comparing two products.
None

Assistant: Product A is a product that is sold in a store. Product B is a product that is sold in a store. Product A is better than product B.


## Load trained PEFT model

In [None]:
# by default the trainer saves the model every 500 steps (save_steps); the runs above stop at
# step 375, so save_steps has to be lowered (e.g. to 100) for a checkpoint to be written.
# The corresponding files can be found in the directory defined by "output_dir" in the training arguments.
# The files are called:
# - adapter_config.json
# - adapter_model.bin (or adapter_model.safetensors, depending on the peft version)
# which contain the trained prompt model

# to load the saved models and use them, we do the following:

from peft import PeftModel

# load base model

model = AutoModelForCausalLM.from_pretrained(ckp)

ckp_peft = "./checkpoint/checkpoint-100" # example checkpoint path; replace it with a checkpoint under the output_dir defined in the training arguments

# construct peft model
peft_model = PeftModel.from_pretrained(model=model, model_id=ckp_peft) # make sure the base model is on the CPU

# inference
if torch.cuda.is_available():
    peft_model = peft_model.cuda()
generate(peft_model, tokenizer, "List five steps for comparing two products.")
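
Rather than hard-coding the checkpoint path, the most recent checkpoint can be looked up in the output directory. The sketch below assumes the Trainer's `checkpoint-<step>` directory naming; the helper `latest_checkpoint` is hypothetical, and the demo uses a temporary directory instead of a real training run.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # compare by step number, not by name ("checkpoint-1000" > "checkpoint-200")
    return max(candidates, key=lambda item: item[0])[1]


# demo with fake checkpoint directories
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 200, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    found_name = latest_checkpoint(tmp).name
    print(found_name)
```

The returned path could then be passed as `model_id` to `PeftModel.from_pretrained`.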


## VI. Conclusion



Even though the results are not good, hard prompt fine-tuning does better than the soft one here.
With more training, the results should improve. Also, generating with do_sample=True gives more varied outputs.