## Prompt Tuning with `GPT-2`

This project aims to explore prompt tuning for improving the task-specific performance of a causal language model (CLM), specifically `GPT-2`. Prompt tuning is a parameter-efficient fine-tuning method that optimizes a small set of virtual tokens prepended to the input sequence, enabling task adaptation without updating the entire model’s parameters. This approach is particularly useful for reducing computational overhead while leveraging the capabilities of pre-trained large language models.
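To make "virtual tokens prepended to the input sequence" concrete, here is a minimal pure-Python sketch of the idea at the embedding level. It is an illustration only, not how the PEFT library implements it: the helper name `prepend_virtual_tokens`, the toy 4-dimensional vectors, and the list-of-lists representation are all assumptions chosen for readability.

```python
from typing import List, Tuple

Vector = List[float]


def prepend_virtual_tokens(
    virtual_embeddings: List[Vector],
    input_embeddings: List[Vector],
    attention_mask: List[int],
) -> Tuple[List[Vector], List[int]]:
    """Prepend trainable prompt vectors to one sequence of token embeddings."""
    # The virtual tokens go in front of the real tokens.
    extended_embeddings = virtual_embeddings + input_embeddings
    # Virtual tokens are always attended to, so their mask entries are 1.
    extended_mask = [1] * len(virtual_embeddings) + attention_mask
    return extended_embeddings, extended_mask


# Toy example: 5 virtual tokens in front of a 3-token input (hypothetical values).
hidden_size = 4
num_virtual_tokens = 5
virtual = [[0.01 * (i + 1)] * hidden_size for i in range(num_virtual_tokens)]
tokens = [[1.0] * hidden_size for _ in range(3)]
mask = [1, 1, 1]

embeddings, extended_mask = prepend_virtual_tokens(virtual, tokens, mask)
print(len(embeddings), extended_mask)  # 8 [1, 1, 1, 1, 1, 1, 1, 1]
```

During training only the `virtual` vectors would receive gradient updates; the token embeddings and every other model weight stay frozen.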

### Background

Traditional fine-tuning methods require updating all model parameters, making them computationally expensive and less feasible for resource-constrained environments. In contrast, prompt tuning introduces and trains a small number of task-specific parameters, allowing efficient adaptation for various tasks while preserving the pre-trained model’s general knowledge.
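A rough calculation shows how small "a small number of task-specific parameters" is here. The sketch assumes that each virtual token is a single vector of the model's embedding width (768 for the `gpt2` checkpoint) and uses a rounded total of 124 million parameters for the base model; both constants are approximations introduced for this illustration.

```python
GPT2_HIDDEN_SIZE = 768           # embedding width of the `gpt2` (small) checkpoint
GPT2_TOTAL_PARAMS = 124_000_000  # rounded total parameter count (approximation)


def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int = GPT2_HIDDEN_SIZE) -> int:
    """Trainable parameters when each virtual token is one embedding vector."""
    return num_virtual_tokens * hidden_size


for n in (5, 20):
    trainable = prompt_tuning_params(n)
    share = 100 * trainable / GPT2_TOTAL_PARAMS
    print(f"{n} virtual tokens: {trainable} trainable parameters ({share:.4f}% of the model)")
```

With 5 virtual tokens this gives 3,840 trainable values and with 20 it gives 15,360, in both cases a tiny fraction of the frozen base model.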

### Dataset

The dataset used for this project is `fka/awesome-chatgpt-prompts` from the Hugging Face Hub, a collection of role-play prompts (most of them starting with "I want you to act as ...") designed to guide language models. Only the `prompt` text field is used for training.
	•	Train Size: 203 samples (the first 203 rows of the `train` split)
	•	Format: tabular records; training uses the `prompt` text field of each record.

### Methods

1.	Model Configuration:

    •	Pre-trained `GPT-2` model

    •	Prompt tuning applied using the PEFT (Parameter-Efficient Fine-Tuning) library.

2.	Prompt Tuning Setup:

    •	Number of virtual tokens: 5 initially, later increased to 20.

    •	Initialization: Random embeddings.

    •	Task type: Causal Language Modeling.

    •	Optimization: AdamW with a learning rate of 0.05.


3.	Training Process:

    •	Initial configuration: `epochs=5`, `virtual_tokens=5`.

    •	Results were suboptimal, suggesting the model required more iterations for convergence.

    •	Updated configuration: `epochs=10`, `virtual_tokens=20`.

    •	This produced noticeably better outputs, and training still took only about 10 minutes on CPU (610 s, versus 286 s for the initial run).


4.	Evaluation:

    •	Model performance was assessed qualitatively: the generated responses were inspected by hand to judge how well they matched the task described in the prompt.
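Because the evaluation is a manual read-through, it helps to look only at the text the model added, not at the echoed prompt. The helper below is a small sketch for that purpose; `continuation_only` is a hypothetical name, and the sample string is a shortened excerpt of one generation from this notebook, used purely as demo input.

```python
from typing import List


def continuation_only(decoded_outputs: List[str], prompt: str) -> List[str]:
    """Drop the echoed prompt from each decoded generation, keeping only the new text."""
    cleaned = []
    for text in decoded_outputs:
        if text.startswith(prompt):
            text = text[len(prompt):]
        cleaned.append(text.strip())
    return cleaned


prompt = "I want you to act as a rapper."
# Shortened excerpt of a decoded generation, in the list form `batch_decode` returns.
sample_outputs = [prompt + "\nThe first thing I wanted is to be able for the audience to understand"]

print(continuation_only(sample_outputs, prompt))
```

Applying the same helper to the outputs of the original and the prompt-tuned models makes side-by-side comparison easier.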

### Results

Initial Configuration: The model trained with `epochs=5` and `virtual_tokens=5` produced responses that lacked coherence and alignment with the prompts, indicating insufficient training.

Improved Configuration: Increasing the training duration to `epochs=10` and the prompt length to `virtual_tokens=20` led to better results. The training loss dropped from 3.86 to 3.65, and the generated responses were more fluent and closer to the task described in the prompt.

Training Time: Despite the increased epochs and number of virtual tokens, training remained cheap: about 610 s on CPU, compared with 286 s for the initial configuration.




In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token



In [10]:
inputs = tokenizer("I want you to act as a rapper.", return_tensors='pt')
output = tokenizer.batch_decode(
    model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

dash_line = '-' * 99  # separator line of 99 dashes
print(dash_line)
print("Original model:\n", output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
Original model:
 ['I want you to act as a rapper. I\'m not gonna be the one of them."\n"You\'re going through this?" he asked, "and it\'s like being in an old times? You know what that means when they say \'you are?\' and then there is no way for me or anybody else who knows anything about my life right now because we don\'t have any records at all!" He paused again before continuing: "\'Cause if people think your record label doesn\'t care']


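The base model does not follow the instruction: it continues the text as a loose narrative instead of responding in the persona of a rapper.

To tune the model, I use the `fka/awesome-chatgpt-prompts` dataset from Hugging Face. Its entries are role-play instructions of the form "I want you to act as ...", so tuning on them should push the model toward continuing such instructions in a matching style.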

In [4]:
from datasets import load_dataset

dataset_path = "fka/awesome-chatgpt-prompts"
dataset = load_dataset(dataset_path)
dataset = dataset.map(lambda x: tokenizer(x['prompt']), batched=True)
train_prompt = dataset["train"].select(range(203))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [5]:
from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

tuning_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=5,
    tokenizer_name_or_path=model_name
)

pt_model_with_5_vt = get_peft_model(model, tuning_config)

In [6]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    use_cpu=True,
    output_dir="./",
    auto_find_batch_size=True,
    num_train_epochs=5,
    learning_rate=0.05,
    optim='adamw_hf',
)

In [7]:
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=pt_model_with_5_vt,
    args=training_args,
    train_dataset=train_prompt,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

In [8]:
trainer.train()





TrainOutput(global_step=130, training_loss=3.859867976262019, metrics={'train_runtime': 286.1026, 'train_samples_per_second': 3.548, 'train_steps_per_second': 0.454, 'total_flos': 92923509888000.0, 'train_loss': 3.859867976262019, 'epoch': 5.0})
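The reported `global_step=130` and throughput can be cross-checked with simple arithmetic. The sketch assumes the `Trainer` default batch size of 8 per device, a single device, no gradient accumulation, and that `auto_find_batch_size` did not have to shrink the batch; the helper name `expected_optimizer_steps` is hypothetical.

```python
import math


def expected_optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


# 203 training samples, assumed batch size 8, 5 epochs.
steps = expected_optimizer_steps(num_samples=203, batch_size=8, epochs=5)
print(steps)  # 130

# Samples processed per second: all samples over all epochs divided by the runtime.
throughput = 203 * 5 / 286.1026
print(round(throughput, 3))  # 3.548
```

Both values agree with the `TrainOutput` above, which supports the assumption that 203 samples were trained in batches of 8.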

In [9]:
inputs = tokenizer("I want you to act as a rapper.", return_tensors='pt')

original_model_output = tokenizer.batch_decode(
    model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

pt_model_with_5_vt_output = tokenizer.batch_decode(
    pt_model_with_5_vt.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_output}')
print(dash_line)
print(f'PROMPT TUNED (5 VIRTUAL TOKENS) MODEL:\n{pt_model_with_5_vt_output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
['I want you to act as a rapper. I\'m not going into that."\n"You\'re gonna be like, \'Oh my God.\' " —Drake on his new album The Black Album (featuring Drake)\n\n']
---------------------------------------------------------------------------------------------------
PROMPT TUNED (5 VIRTUAL TOKENS) MODEL:
['I want you to act as a rapper.\nThe first thing I wanted is to be able for the audience to understand what it means and how they can relate with each other, so that we are all have an understanding of our own personalities."\n\n\xa0This was my second attempt at this project: "A New York City" (the last one being about 20 years ago). It\'s been going on since 2005 when there were no more than 10 people in NYC who could sing along or even listen']


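As we can see, the prompt-tuned output stays closer to the topic (performing for an audience) than the original model's output. Since the training cost was low (under five minutes on CPU), I decided to increase the number of training epochs from 5 to 10 and the number of virtual tokens from 5 to 20.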

In [11]:
tuning_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=20,
    tokenizer_name_or_path=model_name
)

pt_model_with_20_vt = get_peft_model(model, tuning_config)

In [12]:
training_args = TrainingArguments(
    use_cpu=True,
    output_dir="./",
    auto_find_batch_size=True,
    num_train_epochs=10,
    learning_rate=0.05,
    optim='adamw_hf',
)

In [13]:
trainer = Trainer(
    model=pt_model_with_20_vt,
    args=training_args,
    train_dataset=train_prompt,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

In [14]:
trainer.train()





TrainOutput(global_step=260, training_loss=3.6462167593149037, metrics={'train_runtime': 609.754, 'train_samples_per_second': 3.329, 'train_steps_per_second': 0.426, 'total_flos': 185634720000000.0, 'train_loss': 3.6462167593149037, 'epoch': 10.0})
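The two `TrainOutput` records can be compared more intuitively by converting the mean training loss into perplexity, `exp(loss)`. This is a rough sketch: the numbers are training losses averaged over each whole run, not losses on a held-out set, so they indicate fit to the training prompts only. The `runs` dictionary and its labels are introduced here just for the comparison.

```python
import math

# Values copied from the two TrainOutput records in this notebook.
runs = {
    "5 virtual tokens, 5 epochs": {"train_loss": 3.859867976262019, "train_runtime": 286.1026},
    "20 virtual tokens, 10 epochs": {"train_loss": 3.6462167593149037, "train_runtime": 609.754},
}

perplexities = {}
for name, metrics in runs.items():
    # Perplexity is the exponential of the mean cross-entropy loss.
    perplexities[name] = math.exp(metrics["train_loss"])
    minutes = metrics["train_runtime"] / 60
    print(f"{name}: perplexity ~ {perplexities[name]:.1f}, runtime {minutes:.1f} min")
```

The second run roughly doubles the runtime while lowering the training perplexity, which is consistent with the qualitative impression from the generated text.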

In [17]:
inputs = tokenizer("I want you to act as a rapper.", return_tensors='pt')

original_model_output = tokenizer.batch_decode(
    model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

pt_model_with_5_vt_output = tokenizer.batch_decode(
    pt_model_with_5_vt.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

pt_model_with_20_vt_output = tokenizer.batch_decode(
    pt_model_with_20_vt.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        repetition_penalty=1.5,
        eos_token_id=tokenizer.eos_token_id
    ), skip_special_tokens=True
)

print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_output}')
print(dash_line)
print(f'PROMPT TUNED (5 VIRTUAL TOKENS) MODEL:\n{pt_model_with_5_vt_output}')
print(dash_line)
print(f'PROMPT TUNED (20 VIRTUAL TOKENS) MODEL:\n{pt_model_with_20_vt_output}')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
['I want you to act as a rapper. I\'m not saying that\'s what he does, but it is the way we do things like this year and how they\'re doing them in their own ways."\n"It was just one of those moments where there were two or three different guys who had been on stage for years with me being so much more than myself," said Taylor Swift at his first performance last week before an audience during The New York Times Music Awards press conference Thursday night (Feb 9']
---------------------------------------------------------------------------------------------------
PROMPT TUNED (5 VIRTUAL TOKENS) MODEL:
['I want you to act as a rapper.\nThe following are some of the most important things that I\'ve learned about music in my life:\n\n"You can\'t be an artist without being able-bodied." - The Beatles\' "A Day In A New York City". (1954)']
-------------------------------------

As we can see, with 20 virtual tokens and 10 epochs the model's output now reflects what a rapper does.

The results of this experiment suggest that hyperparameters matter for prompt tuning, in particular the number of training epochs and the number of virtual tokens. Increasing the epochs from 5 to 10 and the virtual tokens from 5 to 20 lowered the training loss from 3.86 to 3.65 and gave more relevant outputs, while training still finished in about 10 minutes on CPU. This highlights prompt tuning’s potential as an effective and resource-efficient approach for task adaptation of pre-trained language models.