# P-Tuning

## I. Presentation

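In the prompt tuning section, we saw that soft prompt tuning is less effective than hard prompt tuning. There are ways to improve it, however, and P-Tuning is one of them.

It adds a prompt encoder (an LSTM or an MLP) on top of the prompt embedding: the virtual tokens are embedded, passed through the encoder, and the result is placed next to the regular token embeddings that feed the base model: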

    ____________________________________________________________
    |      prompt embedding          |    Embedding            |
    ------------------------------------------------------------
    |_____________ __________________|____________ ____________|
                  |                               |
            Prompt Encoder                  base model Input
                  ^
                  |
        lstm/mlp + prompt Embedding
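
The diagram can be sketched in a few lines of PyTorch. This is only an illustration of the idea, not the PEFT implementation: the class name `TinyPromptEncoder`, the toy sizes, and the stand-in `word_embeddings` layer are all hypothetical.

```python
import torch
import torch.nn as nn


class TinyPromptEncoder(nn.Module):
    """Illustrative MLP prompt encoder: virtual-token embeddings passed through an MLP."""

    def __init__(self, num_virtual_tokens: int, token_dim: int, hidden_size: int):
        super().__init__()
        # learnable embeddings, one row per virtual token
        self.embedding = nn.Embedding(num_virtual_tokens, token_dim)
        # reparameterization network applied to those embeddings
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, token_dim),
        )
        self.register_buffer("indices", torch.arange(num_virtual_tokens))

    def forward(self, batch_size: int) -> torch.Tensor:
        prompts = self.mlp(self.embedding(self.indices))  # (num_virtual_tokens, token_dim)
        return prompts.unsqueeze(0).expand(batch_size, -1, -1)


# toy sizes, chosen only for the demonstration
encoder = TinyPromptEncoder(num_virtual_tokens=10, token_dim=32, hidden_size=32)
word_embeddings = nn.Embedding(100, 32)  # stands in for the frozen base-model embedding

input_ids = torch.randint(0, 100, (2, 7))   # batch of 2 sequences, 7 tokens each
token_embeds = word_embeddings(input_ids)    # (2, 7, 32)
prompt_embeds = encoder(batch_size=2)        # (2, 10, 32)

# the base model receives the encoded prompt followed by the regular token embeddings
inputs_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 17, 32])
```

Only the encoder would be trained in this setup; the base model and its embedding layer stay frozen.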


## II. Example

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [10]:
# prepare example

# we use the same example as bitfit

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

ckp_data = "yahma/alpaca-cleaned"
ckp = "bigscience/bloomz-1b1"

# load dataset
data = load_dataset(ckp_data, split="train[:1000]")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ckp)

# process data
def process(sample):

    MAX_LEN = 256

    human = tokenizer("Human: " + "\n".join([sample["instruction"], sample["input"]]).strip() + "\n\nAssistant: ")
    ml = tokenizer(sample["output"] + tokenizer.eos_token)

    input_ids = human["input_ids"] + ml["input_ids"]
    attention_mask = human["attention_mask"] + ml["attention_mask"]
    labels = [-100] * len(human["input_ids"]) + ml["input_ids"]

    if len(input_ids) > MAX_LEN:

        input_ids = input_ids[:MAX_LEN]
        attention_mask = attention_mask[:MAX_LEN]
        labels = labels[:MAX_LEN]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

# tokenize dataset
tokenized_data = data.map(process, remove_columns=data.column_names)

# load model
model = AutoModelForCausalLM.from_pretrained(ckp, low_cpu_mem_usage=True)

# send to device
if torch.cuda.is_available():
    model = model.to("cuda:0")


In [4]:
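# compute model size

params = sum(param.numel() for param in model.parameters())
# params/1e9 is a parameter count in billions, not a size in GB
print("model size: ", params/1e9, "B parameters")
# rough estimate for full fine-tuning: 20 bytes per parameter
# (4 + 4 + 12, presumably weights, gradients and optimizer states)
print("total required memory: ", round(params/1e9 * (4 + 4 + 12), 2), "GB")

model size:  1.065314304 B parameters
total required memory:  21.31 GB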
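
The estimate above can be wrapped in a small helper to see how much of it comes from the weights alone. This is a rough sketch: the function name `estimate_training_memory_gb` is hypothetical, reading the factors 4 + 4 + 12 as weights, gradients and optimizer states is an assumption, and activations are ignored.

```python
def estimate_training_memory_gb(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=12):
    """Rough memory estimate in GB (1 GB taken as 1e9 bytes); activations are ignored."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


num_params = 1_065_314_304  # parameter count reported for bloomz-1b1 above

# full fine-tuning: every parameter carries gradient and optimizer state
full_ft = estimate_training_memory_gb(num_params)
# weights only: what a fully frozen model would need
weights_only = estimate_training_memory_gb(num_params, grad_bytes=0, optimizer_bytes=0)

print(round(full_ft, 2))       # 21.31
print(round(weights_only, 2))  # 4.26
```

Under these assumptions, most of the budget goes to gradients and optimizer states, which is the part that parameter-efficient methods shrink by training only a small number of parameters.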


## III. P-Tuning

In [5]:
# config

from peft import PromptEncoderConfig, get_peft_model, TaskType

# soft prompt
# by default the encoder type is mlp
config = PromptEncoderConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10)
config

PromptEncoderConfig(peft_type=<PeftType.P_TUNING: 'P_TUNING'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, num_virtual_tokens=10, token_dim=None, num_transformer_submodules=None, num_attention_heads=None, num_layers=None, encoder_reparameterization_type=<PromptEncoderReparameterizationType.MLP: 'MLP'>, encoder_hidden_size=None, encoder_num_layers=2, encoder_dropout=0.0)

In [12]:
# wrap model

peft_model = get_peft_model(model, config)
peft_model

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1536)
      (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
            (dense): Linear(in_features=1536, out_features=1536, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
          )
        )
      

In [7]:
# compared to prompt tuning, there are far more trainable parameters

# for reference, prompt tuning gave:
# trainable params: 15,360 || all params: 1,065,329,664 || trainable%: 0.0014

peft_model.print_trainable_parameters()

trainable params: 7,097,856 || all params: 1,072,412,160 || trainable%: 0.6619
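
The reported count can be reproduced by hand. The sketch below assumes the MLP encoder is made of three linear layers with biases, and that `encoder_hidden_size=None` falls back to the token dimension (1536 for this model); both are assumptions about the encoder layout, but they reproduce the number printed above.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


num_virtual_tokens = 10
token_dim = 1536      # hidden size of bloomz-1b1, from the model printout
hidden = token_dim    # assumption: encoder_hidden_size=None falls back to token_dim

# the virtual-token embedding alone is what plain prompt tuning trains
embedding = num_virtual_tokens * token_dim

# assumption: Linear -> ReLU -> Linear -> ReLU -> Linear
mlp = (linear_params(token_dim, hidden)
       + linear_params(hidden, hidden)
       + linear_params(hidden, token_dim))

print(embedding)        # 15360
print(embedding + mlp)  # 7097856
```

The embedding accounts for the 15,360 parameters seen with prompt tuning; everything else comes from the encoder.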


In [8]:
# compared to prompt tuning, training is much slower because there are more trainable parameters,
# but the loss also decreases more.

# define training arguments
args = TrainingArguments(
    output_dir="../tmp/checkpoint",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    logging_steps=50,
    num_train_epochs=3
)

# define trainer
trainer = Trainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

# train
trainer.train()

| Step | Training Loss |
|------|---------------|
| 50   | 2.3521        |
| 100  | 1.9946        |
| 150  | 1.9948        |
| 200  | 2.1375        |
| 250  | 1.9353        |
| 300  | 1.9841        |
| 350  | 1.9877        |

TrainOutput(global_step=375, training_loss=2.0450433654785156, metrics={'train_runtime': 331.0104, 'train_samples_per_second': 9.063, 'train_steps_per_second': 1.133, 'total_flos': 1670056200806400.0, 'train_loss': 2.0450433654785156, 'epoch': 3.0})
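
The `global_step=375` in the output follows from the training arguments. A minimal sketch of the arithmetic, assuming a single GPU (as set through `CUDA_VISIBLE_DEVICES`); the variable names simply mirror the arguments used above.

```python
num_samples = 1000               # train[:1000]
per_device_batch_size = 1
gradient_accumulation_steps = 8
num_epochs = 3
num_devices = 1                  # assumption: a single GPU

# number of samples that contribute to one optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# 1000 divides evenly by 8, so the rounding of a partial last step does not matter here
steps_per_epoch = num_samples // effective_batch
total_steps = steps_per_epoch * num_epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 125 375
```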

In [11]:
# inference

# if we use the pipeline as in bitfit, we get the error: "The model 'PeftModelForCausalLM' is not
# supported for text-generation", because the pipeline does not support PEFT causal LM models yet,
# so we have to generate the result ourselves.

# if the result does not look right, the training was not good enough:
# soft prompt methods need more training to be effective.

def generate(model, tokenizer, instruction, input=None):

    prompt = "human: {}\n{}".format(instruction, input).strip() + "\n\nAssistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)

    generation_output = model.generate(
        input_ids=input_ids,
        output_scores=True,
        max_new_tokens=256
    )
    for seq in generation_output:
        output = tokenizer.decode(seq, skip_special_tokens=True)
        print(output)

generate(model, tokenizer, "List five steps for comparing two products.")

human: List five steps for comparing two products.
None

Assistant:  No


In [9]:
# inference after finetuning

generate(peft_model, tokenizer, "List five steps for comparing two products.")

human: List five steps for comparing two products.
None

Assistant: 1. Choose a product to compare. 2. Choose a comparison method. 3. Choose a reference product. 4. Choose a reference method. 5. Choose a reference reference product.
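
Note that the prompt built by `generate` differs from the one used in `process` during training: it starts with lowercase `human:` and, when `input` is `None`, the literal string `None` ends up in the prompt, as the outputs above show. A minimal sketch of a prompt builder that mirrors the training format instead; `build_prompt` is a hypothetical helper, not part of the notebook.

```python
def build_prompt(instruction, input_text=None):
    """Build an inference prompt in the same format `process` uses for training."""
    # treat a missing input as an empty string, as in the dataset
    parts = [instruction, input_text or ""]
    return "Human: " + "\n".join(parts).strip() + "\n\nAssistant: "


print(repr(build_prompt("List five steps for comparing two products.")))
# 'Human: List five steps for comparing two products.\n\nAssistant: '
```

Keeping the inference prompt identical to the training prompt may help the tuned model, since the virtual tokens were trained against that exact format.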


## IV. Layer Type

We can change some of the P-Tuning parameters, such as the encoder type.

In [7]:
# change to lstm instead of mlp
# depending on the values of these variables, the number of trainable parameters
# changes, which affects the training process and the result

from peft import PromptEncoderReparameterizationType

config = PromptEncoderConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10,
                             encoder_reparameterization_type=PromptEncoderReparameterizationType.LSTM,
                             encoder_dropout=0.1,
                             encoder_num_layers=5,
                             encoder_hidden_size=1024)

In [9]:
# compared to the 3-layer mlp, which has 7M parameters to train, here we have 129M

model = get_peft_model(model, config)
# print_trainable_parameters() prints by itself and returns None, so it is not wrapped in print()
model.print_trainable_parameters()

trainable params: 129,075,712 || all params: 1,194,390,016 || trainable%: 10.8068
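
The 129M figure can also be reproduced by hand. The sketch below assumes a bidirectional LSTM followed by a two-layer MLP head that maps back to the token dimension; this layout is an assumption about the encoder, but it reproduces the count reported above. The helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def lstm_params(input_size, hidden_size, num_layers, bidirectional=True):
    """Parameter count of a stacked LSTM (four gates, two bias vectors per layer and direction)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # layers after the first consume the concatenated outputs of all directions
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
        total += directions * per_direction
    return total


num_virtual_tokens, token_dim = 10, 1536
hidden, num_layers = 1024, 5   # encoder_hidden_size and encoder_num_layers from the config

embedding = num_virtual_tokens * token_dim
lstm = lstm_params(token_dim, hidden, num_layers)
# assumption: Linear(2*hidden, 2*hidden) -> ReLU -> Linear(2*hidden, token_dim)
head = linear_params(2 * hidden, 2 * hidden) + linear_params(2 * hidden, token_dim)

print(lstm)                     # 121716736
print(embedding + lstm + head)  # 129075712
```

Almost all of the trainable parameters sit in the LSTM itself, so `encoder_num_layers` and `encoder_hidden_size` dominate the cost of this configuration.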


## V. Conclusion

P-Tuning can improve the performance of soft prompt fine-tuning, at the cost of more trainable parameters.