<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/5-Fine%20Tuning/5_5_Optimizing_Prompt_Tokens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Models Projects
### Apply and Implement Strategies for Large Language Models
## 5.5-Optimizing Prompt Tokens.

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

This notebook explores using Prompt Tuning to reduce the number of tokens needed in the prompt that carries the model's instructions.

The example is a classification problem: detecting hate speech in sentences.

The original prompt is:

>You are a highly accurate and efficient moderator.
Your task is to detect hate speech in a given sentence.
Analyze the sentence looking for hate speech and respond with "hate detected" or "no hate detected" as appropriate.
Sentence: {sentence} Label:

After fine-tuning with prompt tuning, the prompt needed to obtain the same result is:

> Sentence: {sentence} Label:

This is a simple example in which the original prompt is not very long. The final reduction depends on the sentence being analyzed, but in the case used in this notebook the prompt went from 69 to 22 tokens when analyzing the sentence:

> I Dont Like short people, I have no idea why they exist.
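
The saving can be put into numbers with a few lines of arithmetic. The sketch below uses the two token counts quoted above and the `NUM_VIRTUAL_TOKENS` value (20) configured later in this notebook. Prompt tuning prepends its virtual tokens to the input, so the sequence the model processes is longer than the text prompt alone; `reduction` is a hypothetical helper written for this illustration.

```python
# Token counts quoted in this notebook for the long and the short prompt.
ORIGINAL_PROMPT_TOKENS = 69
SHORT_PROMPT_TOKENS = 22
# Virtual tokens configured later in the notebook (NUM_VIRTUAL_TOKENS).
NUM_VIRTUAL_TOKENS = 20


def reduction(before: int, after: int) -> float:
    """Return the relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Text tokens that no longer have to be sent with every request.
text_saving = reduction(ORIGINAL_PROMPT_TOKENS, SHORT_PROMPT_TOKENS)

# Positions the model processes once the virtual tokens are prepended.
effective_length = SHORT_PROMPT_TOKENS + NUM_VIRTUAL_TOKENS
effective_saving = reduction(ORIGINAL_PROMPT_TOKENS, effective_length)

print(f"Text tokens: {ORIGINAL_PROMPT_TOKENS} -> {SHORT_PROMPT_TOKENS} ({text_saving:.1f}% fewer)")
print(f"Sequence length with virtual tokens: {effective_length} ({effective_saving:.1f}% fewer)")
```

The gap between the two percentages grows with the length of the instructions being replaced, since the number of virtual tokens stays fixed.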



## Loading the PEFT Library
This library contains the Hugging Face implementation of various fine-tuning techniques, including Prompt Tuning.

In [1]:
!pip install -q peft==0.11.1
!pip install -q datasets==2.20.0
!pip install -q accelerate==0.32.1
!pip install -q bitsandbytes==0.43.1


From the transformers library, we import the necessary classes to instantiate the model and the tokenizer.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, BitsAndBytesConfig
import torch

## Loading the model and the tokenizers.

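Bloom offers some of the smallest models that can be trained with the PEFT library using Prompt Tuning; `bigscience/bloomz-560m` is the option that minimizes training time and avoids memory issues in Colab.

The second assignment in the next cell overrides the first, so the run shown here uses `microsoft/Orca-2-7b`, loaded in 4-bit to reduce memory use. Comment that line out to go back to the small Bloom model, or feel free to try a bigger one if you have access to a good GPU.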

In [3]:
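#Small option: fast to train and light on memory.
model_name = "bigscience/bloomz-560m"
#This second assignment overrides the first: the run shown here uses Orca-2-7b.
#Comment it out to train the small Bloom model instead.
model_name = "microsoft/Orca-2-7b"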

NUM_VIRTUAL_TOKENS = 20
#If you just want to test the solution, you can reduce the EPOCHs.
NUM_EPOCHS_CLASSIFIER = 10
#device = "cuda" #Replace with "mps" for Silicon chips.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          use_fast=False)

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
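
A back-of-the-envelope estimate shows why 4-bit loading matters on Colab. The sketch below only multiplies a parameter count by the bytes needed per parameter. The count is the "all params" figure that PEFT reports later in this notebook, and the result ignores quantization constants, layers kept in higher precision, activations and the CUDA context, so treat it as a rough lower bound rather than a measurement. `weight_memory_gb` is a hypothetical helper.

```python
# "all params" figure reported by PEFT later in this notebook.
TOTAL_PARAMS = 6_738_522_112

# Storage cost per parameter for each weight format, in bytes.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "nf4 (4-bit)": 0.5}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gigabytes in estimates.items():
    print(f"{name:>12}: ~{gigabytes:.1f} GB")
```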

In [6]:
foundational_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    #device_map = device
)
foundational_model.use_cache = False

`low_cpu_mem_usage` was None, now set to True since model is quantized.

## Inference with the pre-trained model



In [7]:
#Returns the token ids generated by the given model for the given inputs.
def get_outputs(model, inputs, max_new_tokens=400):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        #temperature=0.2,
        #top_p=0.95,
        #do_sample=True,
        #repetition_penalty=1.1, #Avoid repetition.
        early_stopping=True, #Only affects beam search; generation stops at EOS either way.
        eos_token_id=tokenizer.eos_token_id
    )
    return outputs

To compare the pre-trained model with the same model after the prompt-tuning process, I will give both of them the same classification task.

The pre-trained model only knows what its mission is when the full instructions are included in the prompt, so this first run uses the long prompt.

# Hate Classifier


In [8]:
system_message="""You are a highly accurate and efficient moderator.
Your task is to detect hate speech in the user message,
and label it with "hate detected" or "no hate detected" as corresponding.
"""

user_message="I hate black people, I want to kill them all."

prompt=f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

In [9]:
prompt

'<|im_start|>system\nYou are a highly accurate and efficient moderator.\nYour task is to detect hate speech in the user message, \nand label it with "hate detected" or "no hate detected" as corresponding.\n<|im_end|>\n<|im_start|>user\nI hate black people, I want to kill them all.<|im_end|>\n<|im_start|>assistant'
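
The template used above can be wrapped in a small helper so the same layout is reused for every sentence. This is only a sketch: `build_chat_prompt` is a hypothetical name, not part of the notebook or of any library, and it simply reproduces the f-string from the previous cell. The inputs are shortened placeholders.

```python
def build_chat_prompt(system_message: str, user_message: str) -> str:
    """Assemble a ChatML-style prompt that ends where the assistant should answer."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant"
    )


# Placeholder inputs used only to show the layout.
example_system = 'Label the user message with "hate detected" or "no hate detected".\n'
example_user = "Head is the shape of a light bulb."
example_prompt = build_chat_prompt(example_system, example_user)
print(example_prompt)
```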

In [10]:
input_classifier = tokenizer(prompt, return_tensors="pt")


In [11]:
token_count = len(input_classifier['input_ids'][0])
print (f"Tokens: {token_count}")

Tokens: 71


In [12]:
#input_classifier = tokenizer("Sentence : Head is the shape of a light bulb. Label : ", return_tensors="pt")
foundational_outputs_prompt = get_outputs(foundational_model,
                                          input_classifier.to(device))

print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))



['<|im_start|> system\nYou are a highly accurate and efficient moderator.\nYour task is to detect hate speech in the user message, \nand label it with "hate detected" or "no hate detected" as corresponding.\n <|im_end|> \n <|im_start|> user\nI hate black people, I want to kill them all. <|im_end|> \n <|im_start|> assistant\nhate detected']
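
`generate` returns the prompt followed by the new tokens, which is why the decoded output repeats the whole prompt before the label. A common way to keep only the answer is to drop the first `len(prompt)` positions before decoding. The sketch below shows the idea with plain Python lists standing in for tensors; the token ids are made up and `strip_prompt` is a hypothetical helper.

```python
from typing import List


def strip_prompt(generated_ids: List[int], prompt_ids: List[int]) -> List[int]:
    """Return only the tokens produced after the prompt."""
    # Decoder-only models echo the prompt at the start of the generated sequence.
    if generated_ids[: len(prompt_ids)] != prompt_ids:
        raise ValueError("The generated sequence does not start with the prompt.")
    return generated_ids[len(prompt_ids):]


# Made-up ids: three prompt tokens followed by two newly generated tokens.
prompt_ids = [1, 315, 902]
generated_ids = [1, 315, 902, 4479, 17809]
new_tokens = strip_prompt(generated_ids, prompt_ids)
print(new_tokens)
```

With the tensors used in this notebook, the equivalent would be slicing each row of the `generate` output from `inputs["input_ids"].shape[1]` onward before calling `tokenizer.batch_decode`.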


With the full instructions in the prompt (71 tokens), the pre-trained model labels the sentence correctly. The goal of prompt tuning is to get the same behavior from a much shorter prompt.

## Loading the Dataset

* https://huggingface.co/datasets/SetFit/ethos_binary

In [13]:
dataset_classifier = "SetFit/ethos_binary"

def concatenate_columns_classifier(dataset):
    def concatenate(example):
        example['text'] = "Sentence : {} Label : {}".format(example['text'], example['label_text'])
        return example

    dataset = dataset.map(concatenate)
    return dataset

In [14]:
from datasets import load_dataset
data_classifier = load_dataset(dataset_classifier)
data_classifier['train'] = concatenate_columns_classifier(
    data_classifier['train'])

Repo card metadata block was not found. Setting CardData to empty.

In [15]:
data_classifier['train'][0]

{'text': 'Sentence : I would beat the shit out of every Russian Label : hate speech',
 'label': 1,
 'label_text': 'hate speech'}

In [16]:
data_classifier = data_classifier.map(
    lambda samples: tokenizer(samples["text"]),
    batched=True)
train_sample_classifier = data_classifier["train"].remove_columns(
    ['label', 'label_text', 'text'])

In [17]:
data_classifier

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'attention_mask'],
        num_rows: 598
    })
    test: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'attention_mask'],
        num_rows: 400
    })
})

In [18]:
data_classifier['train'][0]

{'text': 'Sentence : I would beat the shit out of every Russian Label : hate speech',
 'label': 1,
 'label_text': 'hate speech',
 'input_ids': [1,
  28048,
  663,
  584,
  306,
  723,
  16646,
  278,
  528,
  277,
  714,
  310,
  1432,
  10637,
  15796,
  584,
  26277,
  12032],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [19]:
train_sample_classifier

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 598
})

I have deleted all the columns from the dataset that are not essential for the model's learning process.

In [20]:
print(train_sample_classifier[2:3])

{'input_ids': [[1, 28048, 663, 584, 12252, 338, 278, 8267, 310, 263, 3578, 8227, 29890, 15796, 584, 694, 26277, 12032]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


## Prompt-tuning configuration
API docs: https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig



In [21]:
from peft import  get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

generation_config_classifier = PromptTuningConfig(
    #This type indicates the model will generate text.
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Indicate if the text contains hate speech or no hate speech.",
    #Number of virtual tokens to be added and trained.
    num_virtual_tokens=NUM_VIRTUAL_TOKENS,
    #The pre-trained model.
    tokenizer_name_or_path=model_name
)

In [22]:
peft_model_classifier = get_peft_model(
    foundational_model,
    generation_config_classifier)
#print_trainable_parameters() prints the summary itself and returns None.
peft_model_classifier.print_trainable_parameters()

trainable params: 81,920 || all params: 6,738,522,112 || trainable%: 0.0012
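
The trainable-parameter count follows directly from the configuration: prompt tuning learns one embedding vector per virtual token, so the total is `num_virtual_tokens × hidden_size`. The sketch below checks that arithmetic against the figures printed above. The hidden size of 4096 is an assumption based on the 7B Llama-2 architecture that Orca-2-7b derives from; it is not read from the model here.

```python
NUM_VIRTUAL_TOKENS = 20      # Same value as in the notebook.
HIDDEN_SIZE = 4096           # Assumed embedding width of the 7B model.
ALL_PARAMS = 6_738_522_112   # "all params" figure printed by PEFT above.

# One trainable embedding vector per virtual token.
trainable_params = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
trainable_percent = 100 * trainable_params / ALL_PARAMS

print(f"trainable params: {trainable_params:,} || trainable%: {trainable_percent:.4f}")
```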


In [23]:
import os
working_dir = "./"

#It is best to store each model in its own folder.
#Build the path of the directory where the model will be stored.
output_directory_classifier =  os.path.join(working_dir, "peft_outputs_classifier")

#Create the directories if they do not exist.
os.makedirs(output_directory_classifier, exist_ok=True)

## Training Arguments

In [24]:
from transformers import TrainingArguments
def create_training_arguments(path, learning_rate=0.0035, epochs=6, autobatch=True):
    training_args = TrainingArguments(
        output_dir=path, # Where the model predictions and checkpoints will be written
        #use_cpu=True, # This is necessary for CPU clusters.
        auto_find_batch_size=autobatch, # Find a suitable batch size that will fit into memory automatically
        learning_rate= learning_rate, # Higher learning rate than full fine-tuning
        #per_device_train_batch_size=4,
        num_train_epochs=epochs
    )
    return training_args

In [25]:
training_args_classifier = create_training_arguments(
    output_directory_classifier,
    3e-2,
    NUM_EPOCHS_CLASSIFIER)

## Training

In [26]:
from transformers import Trainer, DataCollatorForLanguageModeling
def create_trainer(model, training_args, train_dataset):
    trainer = Trainer(
        model=model, # We pass in the PEFT version of the foundation model.
        args=training_args, #The args for the training.
        train_dataset=train_dataset, #The dataset used to train the model.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False indicates not to use masked language modeling
    )
    return trainer
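
With `mlm=False`, the collator prepares batches for causal language modeling: it pads the sequences to a common length and uses a copy of `input_ids` as `labels`, with padding positions set to `-100` so the loss ignores them. The following is a simplified pure-Python sketch of that behavior, not the library code. `collate_causal_lm` and the pad id `0` are assumptions made for illustration, and right-padding is assumed.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default.


def collate_causal_lm(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Pad sequences to the same length and build labels for next-token prediction."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
        # Labels mirror the inputs; padded positions do not contribute to the loss.
        labels.append(ids + [IGNORE_INDEX] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids for two sequences of different length.
collated = collate_causal_lm([[1, 584, 306], [1, 723]], pad_token_id=0)
print(collated["labels"])
```

The shift between inputs and targets happens inside the model, so the labels are not offset here.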

In [27]:
#peft_model_classifier = peft_model_classifier.to(device)
trainer_classifier = create_trainer(peft_model_classifier,
                                   training_args_classifier,
                                   train_sample_classifier)



In [None]:
trainer_classifier.train()

| Step | Training Loss |
|------|---------------|
| 500  | 2.3371 |
| 1000 | 2.2848 |
| 1500 | 2.2444 |
| 2000 | 2.2163 |
| 2500 | 2.1639 |
| 3000 | 2.145  |
| 3500 | 2.0393 |
| 4000 | 2.0083 |
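
The step numbers in the log depend on the batch size that `auto_find_batch_size` settles on, which the output does not show. As a rough check, the sketch below computes the total number of optimizer steps for a few candidate batch sizes, assuming a single device and no gradient accumulation. `total_training_steps` is a hypothetical helper; the dataset size (598) and the epoch count (10) come from earlier cells.

```python
import math


def total_training_steps(num_examples: int, epochs: int, batch_size: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


TRAIN_EXAMPLES = 598  # Rows in the training split.
NUM_EPOCHS = 10       # NUM_EPOCHS_CLASSIFIER in this notebook.

steps_by_batch_size = {
    batch_size: total_training_steps(TRAIN_EXAMPLES, NUM_EPOCHS, batch_size)
    for batch_size in (8, 4, 2, 1)
}
for batch_size, steps in steps_by_batch_size.items():
    print(f"batch size {batch_size}: {steps} steps")
```

Under these assumptions, a log that reaches step 4000 is only compatible with a batch size of 1 among the candidates above.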




In [116]:
trainer_classifier.model.save_pretrained(output_directory_classifier)



## Inference with the prompt-tuned model

In [117]:
from peft import PeftModel
loaded_model_peft = PeftModel.from_pretrained(foundational_model,
                                         output_directory_classifier,
                                         #device_map=device,
                                         is_trainable=False)

In [118]:
loaded_model_peft.load_adapter(output_directory_classifier, adapter_name="classifier")
loaded_model_peft.set_adapter("classifier")

In [119]:
short_prompt = """
Sentence: {sentence} Label:
"""

sentence = "I Dont Like short people, I have no idea why they exist."
input_classifier_short = tokenizer(short_prompt.format(sentence=sentence), return_tensors="pt")

In [120]:
token_count = len(input_classifier_short['input_ids'][0])
print (f"Tokens: {token_count}")

Tokens: 22
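
Note that the training examples were formatted as `Sentence : {} Label : {}`, while the short prompt above uses slightly different spacing and surrounding newlines. Keeping both in a single template is one way to make sure inference matches the format seen during training. This is a suggestion rather than something the notebook does; `PROMPT_TEMPLATE`, `training_text` and `inference_prompt` are hypothetical names.

```python
PROMPT_TEMPLATE = "Sentence : {sentence} Label :"


def training_text(sentence: str, label: str) -> str:
    """Full example used for training: the prompt followed by the expected label."""
    return f"{PROMPT_TEMPLATE.format(sentence=sentence)} {label}"


def inference_prompt(sentence: str) -> str:
    """Prompt used at inference time: identical to training, minus the label."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


example_sentence = "Head is the shape of a light bulb"
print(inference_prompt(example_sentence))
print(training_text(example_sentence, "no hate speech"))
```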


In [121]:
loaded_model_sentences_outputs = get_outputs(loaded_model_peft,
                                             input_classifier_short.to(device),
                                             max_new_tokens=3)
print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))

['\nSentence: I Dont Like short people, I have no idea why they exist. Label:\nNo hate speech']
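
Because the decoded text contains the prompt as well as the answer, downstream code usually needs just the label. A minimal sketch, assuming the answer is whatever follows the last `Label:` marker; `extract_label` is a hypothetical helper, and lowercasing the result is a design choice made here, not something the notebook does.

```python
def extract_label(decoded: str, marker: str = "Label:") -> str:
    """Return the normalized text that follows the last occurrence of `marker`."""
    _, found, answer = decoded.rpartition(marker)
    if not found:
        raise ValueError(f"Marker {marker!r} not found in the model output.")
    return answer.strip().lower()


# Decoded output copied from the cell above.
decoded = "\nSentence: I Dont Like short people, I have no idea why they exist. Label:\nNo hate speech"
predicted_label = extract_label(decoded)
print(predicted_label)
```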




Let's check how the model's response has changed with training:

**Input for the model**
```
Sentence : Head is the shape of a light bulb. Label :
Sentence : I don't like short people, no idea why they exist. Label :
```

**Original model**
```
Sentence : Head is the shape of a light bulb. Label :  head
Sentence : I don't like short people, no idea why they exist. Label :  No
```
**Trained for classification with Prompt-tuning**
```
Sentence : Head is the shape of a light bulb. Label :  no hate speech
Sentence : I don't like short people, no idea why they exist. Label :  hate speech
```

It's clear that the training has fulfilled its purpose. The original model doesn't know what its mission is and tries to complete the sentences as best as it can. On the other hand, the updated model with prompt-tuning does know what its mission is and is able to classify the sentences correctly and in the indicated format.


# Conclusion
Prompt Tuning is an amazing technique that can save us hours of training and a significant amount of money. In this notebook we trained only 81,920 parameters, the virtual tokens, while the base model stayed frozen. Because each adapter is so small, several of them can share one base model in memory and provide service to different clients.

If you want to try different combinations and models, the notebook is ready to use another model from the Bloom family.

You can change the number of training epochs, the number of virtual tokens, and the model, and there are many other settings left to explore.

*The responses of the fine-tuned models may vary every time we train them. I've pasted the results of one of my trainings, but the actual results may differ.*