<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


#  Instruction-Tuning with LLMs

Instruction-based fine-tuning, also referred to as instruction GPT, trains language models to follow specific instructions and generate appropriate responses. For instruction-tuning, the dataset plays an important role because it provides structured examples of instructions, contexts, and responses, allowing the model to learn how to handle various tasks effectively. Instruction GPT often uses human feedback to refine and improve model performance; however, this lab doesn't cover this aspect.

The context and instruction are concatenated to form a single input sequence that the model can understand and use to generate the correct response.

#### Context and instruction

	•	Instruction: A command to specify what the model should do
	•	Context: Additional information or background required for performing the instruction
	•	Combined input: The instruction and context combine together into a single input sequence
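
The combination step can be sketched as a small helper. This is a minimal illustration rather than the lab's own code: the function name `build_input`, the `### Context:` header, and the sample SQL query are assumptions chosen for this example.

```python
def build_input(instruction: str, context: str = "") -> str:
    """Combine an instruction and optional context into one input sequence.

    The section headers used here are illustrative assumptions.
    """
    parts = [f"### Instruction:\n{instruction}"]
    if context:  # context is optional; skip the section when it is empty
        parts.append(f"### Context:\n{context}")
    return "\n\n".join(parts)


# Hypothetical example: the SQL query below is made up for illustration.
combined = build_input(
    "Amend the following SQL query to select distinct elements",
    "SELECT name FROM users;",
)
print(combined)
```

When no context is given, the helper returns only the instruction section, which mirrors examples that have an empty context.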
    

Let's review certain examples for various templates:

---
#### Response template
Template: `### Question: {question}\n ### Answer: {answer}`

Example:
```
### Question: What is the capital of France?
### Answer: Paris
```

---
#### Conversation template

Template: `### User: {user_input}\n ### Bot: {bot_response}`

Example:
```
### User: How are you today?
### Bot: I'm doing great, thank you! How can I assist you today?
```

---
#### Instruction and output template

Template: `### Instruction: {instruction}\n ### Output: {output}`

Example:
```
### Instruction: Translate the following sentence to Spanish: "Hello, how are you?"
### Output: "Hola, ¿cómo estás?"
```

---
#### Completion template

Template: `{prompt} ### Completion: {completion}`

Example:
```
Once upon a time in a faraway land, ### Completion: there lived a wise old owl who knew all the secrets of the forest.
```

---
#### Summarization template

Template: `### Text: {text}\n ### Summary: {summary}`

Example:
```
### Text: The quick brown fox jumps over the lazy dog.
### Summary: A fox jumps over a dog.
```

---
#### Dialogue template

Template: `### Speaker 1: {utterance_1}\n ### Speaker 2: {utterance_2}\n ### Speaker 1: {utterance_3}`

Example:
```
### Speaker 1: Hi, what are you doing today?
### Speaker 2: I'm going to the park.
### Speaker 1: That sounds fun!
```

---
#### Code generation template

Template: `### Task: {task_description}\n ### Code: {code_output}`

Example:
```
### Task: Write a function to add two numbers in Python.
### Code: def add(a, b):\n    return a + b
```

---
#### Data analysis template

Template: `### Analysis Task: {task_description}\n ### Analysis: {analysis_output}`

Example:
```
### Analysis Task: Provide insights from the sales data of Q1 2022.
### Analysis: The sales increased by 15% compared to Q4 2021, with the highest growth in the electronics category.
```

---
#### Recipe template

Template: `### Recipe Name: {recipe_name}\n ### Ingredients: {ingredients}\n ### Instructions: {instructions}`

Example:
```
### Recipe Name: Chocolate Chip Cookies
### Ingredients: Flour, Sugar, Chocolate Chips, Butter, Eggs, Vanilla Extract
### Instructions: Mix the dry ingredients, add the wet ingredients, fold in the chocolate chips, and bake at 350°F for 10-12 minutes.
```

---
#### Explanation template

Template: `### Concept: {concept}\n ### Explanation: {explanation}`

Example:
```
### Concept: Photosynthesis
### Explanation: Photosynthesis is the process by which green plants use sunlight to synthesize nutrients from carbon dioxide and water.
```

---
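
The templates above are plain format strings, so applying one amounts to filling in its placeholders. The following is a minimal sketch under that assumption; the `TEMPLATES` registry and the `render` helper are hypothetical names introduced only for this example.

```python
# Hypothetical registry mapping a template name to its format string.
TEMPLATES = {
    "response": "### Question: {question}\n ### Answer: {answer}",
    "instruction_output": "### Instruction: {instruction}\n ### Output: {output}",
    "summarization": "### Text: {text}\n ### Summary: {summary}",
}


def render(template_name: str, **fields: str) -> str:
    """Fill the named template with the given fields."""
    return TEMPLATES[template_name].format(**fields)


example = render("response", question="What is the capital of France?", answer="Paris")
print(example)
```

Keeping the templates in one place makes it easy to switch formats without touching the code that iterates over the dataset.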


## Objectives

After completing this lab, you will be able to:

 - Understand the various types of templates including instruction-response, question-answering, summarization, code generation, dialogue, data analysis, and explanation and their applications for fine-tuning large language models (LLMs).
 - Create and apply different templates to fine-tune LLMs for various tasks.
 - Format datasets based on the created templates to prepare them for effective model training
 - Perform instruction fine-tuning using Hugging Face libraries and tools
 - Apply Low-Rank Adaptation (LoRA) techniques to fine-tune LLMs efficiently
 - Configure and use the SFTTrainer for supervised fine-tuning of instruction-following models


The concepts presented in this lab would apply to the other template formats as well.


# __Table of contents__

<ol>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Install-required-libraries">Install required libraries</a></li>
            <li><a href="#Import-required-libraries">Import required libraries</a></li>
            <li><a href="#Define-the-device">Define the device</a></li>
        </ol>
    </li>
    <li><a href="#Dataset-description">Dataset description</a></li>
    <li><a href="#Model-and-tokenizer">Model and tokenizer</a></li>
    <li><a href="#Preprocessing-the-data">Preprocessing the data</a></li>
    <li>
        <a href="#Test-the-base-model">Test the base model</a>
        <ol>
            <li><a href="#BLEU-score">BLEU score</a></li>
        </ol>
    </li>
    <li><a href="#Perform-instruction-fine-tuning-with-LoRA">Perform instruction fine-tuning with LoRA</a></li>
    <li><a href="#Exercises">Exercises</a></li>
</ol>


# Setup

### Install required libraries

For this lab, use the following libraries, which are __not__ preinstalled in the Skills Network Labs environment. You can install libraries by running the code in the below cell. 

```bash
!pip install -qq datasets==2.20.0 trl==0.9.6 transformers==4.42.3 peft==0.11.1 tqdm==4.66.4 numpy==1.26.4 pandas==2.2.2 matplotlib==3.9.1 seaborn==0.13.2 scikit-learn==1.5.1 sacrebleu==2.4.2 evaluate==0.4.2
```

### Import required libraries

The following code imports the required libraries.


In [1]:
# Set environment variables
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Check CUDA availability
import torch
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name()}")

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Transformers and NLP-related libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset
from torch.utils.data import Dataset

# Training and evaluation utilities
from tqdm import tqdm
import evaluate
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

# Parameter-efficient fine-tuning (PEFT) utilities
from peft import get_peft_model, LoraConfig, TaskType

# General-purpose libraries
import pickle
import json
import matplotlib.pyplot as plt
from urllib.request import urlopen
import io

print("All imports loaded successfully!")

CUDA Available: True
CUDA Device: Tesla P40
All imports loaded successfully!


### Define the device

The code below sets your device to 'cuda' if a CUDA-compatible GPU is available; otherwise, it falls back to 'cpu'.


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dataset description

Use the code below to download the CodeAlpaca 20k dataset, a programming code dataset. The dataset is available [here](https://github.com/sahil280114/codealpaca?tab=readme-ov-file#data-release). The CodeAlpaca dataset contains the following elements:


- `instruction`: **str**, describes the task the model should perform. Each of the 20K instructions is unique.
- `input`: **str**, optional context or input for the task. For example, when the instruction is "Amend the following SQL query to select distinct elements", the input is the SQL query. Around 40% of the examples have an input.
- `output`: **str**, the answer to the instruction as generated by text-davinci-003.

The following code block downloads the training split from the CodeAlpaca-20k dataset:


In [3]:
dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 20022
})

Let's look at an example from the dataset:


In [4]:
dataset[1000]

{'instruction': 'Create a JavaScript code snippet to get a list of all the elements in an array with even index.',
 'input': 'let arr = [1,2,3,4,5,6];',
 'output': 'let evenIndexArr = arr.filter((elem, index) => index % 2 === 0);'}

To keep things simple let's just focus on the examples that do not have any `input`:


In [5]:
dataset = dataset.filter(lambda example: example["input"] == '')
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 9764
})

The original CodeAlpaca dataset may not have been shuffled. The following line indicates how to shuffle a `datasets.arrow_dataset.Dataset()` object with a random seed:


In [6]:
dataset = dataset.shuffle(seed=42)

In [7]:
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 9764
})

The code above loaded only the training split of the CodeAlpaca 20k dataset. To obtain both a training set and a test set, split this data by assigning 80% of it to the training set and 20% to the testing set.


In [8]:
dataset_split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset_split['train']
test_dataset = dataset_split['test']
dataset_split

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 7811
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 1953
    })
})
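
The split sizes above can be checked with a little arithmetic. The sketch below assumes the test size is rounded up, which is consistent with the counts printed above; the index-based split is only an illustration and is not the `datasets` implementation.

```python
import math
import random

num_rows = 9764          # rows left after filtering out examples with an input
test_fraction = 0.2

# Rounding the test size up reproduces the counts shown above.
num_test = math.ceil(num_rows * test_fraction)
num_train = num_rows - num_test
print(num_train, num_test)

# Sketch of a seeded split over row indices (not the datasets implementation).
indices = list(range(num_rows))
random.Random(42).shuffle(indices)
test_indices = indices[:num_test]
train_indices = indices[num_test:]
```

Fixing the seed makes the split reproducible, so the same rows land in the test set on every run.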

In [9]:
# Select a small subset of the data because of resource limitations
# These subsets are used only for evaluation, not for training
tiny_test_dataset=test_dataset.select(range(10))
tiny_train_dataset=train_dataset.select(range(10))

# Model and tokenizer

In this exercise, let's fine-tune the [`opt-350m`](https://huggingface.co/facebook/opt-350m) model from Facebook. A description of this open-source model was published [here](https://arxiv.org/abs/2205.01068), and the model was originally made available on [metaseq's GitHub repository](https://github.com/facebookresearch/metaseq).

The below lines load the base model from Hugging Face:


In [10]:
# Base model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to(device)

In [11]:
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m", padding_side='left')

Let's find the end of sentence (EOS) token. This is a special tokenizer token. Once this token is encountered, the model will stop generating further tokens:


In [12]:
tokenizer.eos_token

'</s>'

# Preprocessing the data

To perform the fine-tuning, first, preprocess the data by creating functions that generate the prompt.

The `formatting_prompts_func` function takes a dataset as input. For every element in the dataset, it formats the instruction and the output into a template using the format:

```
### Instruction:
Translate the following sentence to Spanish: "Hello, how are you?"

### Response:
"Hola, ¿cómo estás?"</s>
```

_**Note:**_ 
1. The template provided in this section may differ from the **Instruction and output template** presented in the introduction of this lab. You can replace the  `### Response:` with `### Output:` to generate similar results.

2. Introducing the `</s>` end of sentence token at the end of the text informs the model to stop generating text beyond this point.

Finally, the `formatting_prompts_func_no_response` function behaves similarly to the `formatting_prompts_func` except the response is not included.


In [13]:
def formatting_prompts_func(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Instruction:\n{mydataset['instruction'][i]}"
            f"\n\n### Response:\n{mydataset['output'][i]}</s>"
        )
        output_texts.append(text)
    return output_texts

def formatting_prompts_func_no_response(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Instruction:\n{mydataset['instruction'][i]}"
            f"\n\n### Response:\n"
        )
        output_texts.append(text)
    return output_texts

The following code block generates the `instructions` (the part of the prompt that does not include the response), the `instructions_with_responses` (the full prompt with the response and `eos` token), and the `expected_outputs`, which are the parts of the `instructions_with_responses` that are between the `instructions` and the `eos` token.

To find the `expected_outputs`, tokenize `instructions` and the `instructions_with_responses`. Then, count the number of tokens in `instructions`, and discard that many tokens, minus one, from the beginning of the tokenized `instructions_with_responses` vector, so that the newline ending the prompt is kept. Finally, decode the resulting vector using the tokenizer with `skip_special_tokens=True`, which drops the `eos` token and results in the `expected_output`:


In [14]:
expected_outputs = []

instructions_with_responses = formatting_prompts_func(test_dataset)

instructions = formatting_prompts_func_no_response(test_dataset)

for i in tqdm(range(len(instructions_with_responses))):
    tokenized_instruction_with_response = tokenizer(instructions_with_responses[i], return_tensors="pt", max_length=1024, truncation=True, padding=False)
    tokenized_instruction = tokenizer(instructions[i], return_tensors="pt")
    expected_output = tokenizer.decode(tokenized_instruction_with_response['input_ids'][0][len(tokenized_instruction['input_ids'][0])-1:], skip_special_tokens=True)
    expected_outputs.append(expected_output)


100%|█████████████████████████████████████| 1953/1953 [00:00<00:00, 3027.61it/s]
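
For intuition, the same idea can be expressed at the string level instead of the token level. This is a swapped-in, simplified alternative and not the method used in the cell above; the helper name `extract_expected_output` and the sample prompt are hypothetical.

```python
EOS = "</s>"


def extract_expected_output(prompt: str, full_text: str) -> str:
    """Return the response part of full_text, without the prompt and EOS."""
    if not full_text.startswith(prompt):
        raise ValueError("full_text must start with prompt")
    response = full_text[len(prompt):]
    if response.endswith(EOS):
        response = response[: -len(EOS)]
    return response


# Hypothetical prompt and response, formatted like the template in this lab.
prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
full_text = prompt + "Hello!" + EOS
print(extract_expected_output(prompt, full_text))
```

Unlike the token-based cell above, this version does not keep the newline that ends the prompt, so the two approaches can differ by leading whitespace.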


In [15]:
instructions_with_responses[0]

'### Instruction:\nWhat type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.\n\n### Response:\nThe data structure to use for key-value pairs in Python is a dictionary. A dictionary is a data type that consists of key-value pairs, and is denoted by {} in Python. Each key has a unique value associated with it that can be accessed using the key. For example, a dictionary called "person" could look like this: \n\nperson = {\'name\':\'John\', \'age\': 32} \n\nThe value of the key "name" can be accessed using person[\'name\'] which returns "John".</s>'

In [16]:
instructions[0]

'### Instruction:\nWhat type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.\n\n### Response:\n'

Let's look at an example to view what `instructions`, `instructions_with_responses`, and `expected_outputs` include:


In [17]:
print('############## instructions ##############\n' + instructions[0])
print('############## instructions_with_responses ##############\n' + instructions_with_responses[0])
print('\n############## expected_outputs ##############' + expected_outputs[0])

############## instructions ##############
### Instruction:
What type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.

### Response:

############## instructions_with_responses ##############
### Instruction:
What type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.

### Response:
The data structure to use for key-value pairs in Python is a dictionary. A dictionary is a data type that consists of key-value pairs, and is denoted by {} in Python. Each key has a unique value associated with it that can be accessed using the key. For example, a dictionary called "person" could look like this: 

person = {'name':'John', 'age': 32} 

The value of the key "name" can be accessed using person['name'] which returns "John".</s>

############## expected_outputs ##############
The data structure to use for key-value pairs in Python is a dictionary. A dictionary is a da

Instead of keeping the instructions as-is, it's beneficial to convert the `instructions` list into a `torch` `Dataset`. The following code defines a class called `ListDataset` that inherits from `Dataset` and creates a `torch` `Dataset` from a list. This class is then used to generate a `Dataset` object from `instructions`: 


In [18]:
class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list
    
    def __len__(self):
        return len(self.original_list)
    
    def __getitem__(self, i):
        return self.original_list[i]

instructions_torch = ListDataset(instructions)

In [19]:
instructions_torch[0]

'### Instruction:\nWhat type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.\n\n### Response:\n'

# Test the base model

Let's understand how the base model performs before any fine-tuning. This involves generating responses from the base model, that is, from the non-fine-tuned model.

The code below defines a text generation pipeline using the `pipeline` function from `transformers`. This pipeline generates text given a model and a tokenizer:


In [20]:
gen_pipeline = pipeline("text-generation",
                        model=model,
                        tokenizer=tokenizer,
                        device=device,
                        batch_size=2,
                        max_length=50,
                        truncation=True,
                        padding=False,
                        return_full_text=False)

**_Note:_** The generation pipeline can generate tokens or text. If ```return_tensors=True```, the pipeline returns token IDs; otherwise, it returns decoded text. Additionally, the generation pipeline returns both the instructions *and* the responses by default. However, to assess the model's performance, exclude the instructions and focus on the responses. To do this, set ```return_full_text=False```.


The below code leverages the pre-defined generation pipeline to generate outputs using the model. 

**_Note:_** Generation can take a long time on a CPU, so the code below runs on only three records. Instead of generating outputs for the full test set here, you can load pre-generated outputs from this model later.


In [21]:
tokenizer.padding_side = 'left'

with torch.no_grad():
    # Due to resource limitations, apply the pipeline to only 3 records using "instructions_torch[:3]"
    pipeline_iterator= gen_pipeline(instructions_torch[:3], 
                                    max_length=50, # this is set to 50 due to resource constraint, using a GPU, you can increase it to the length of your choice
                                    num_beams=5,
                                    early_stopping=True,)

generated_outputs_base = []
for text in pipeline_iterator:
    generated_outputs_base.append(text[0]["generated_text"])

In [22]:
generated_outputs_base

['What type of data structure would you use to store key-value pairs',
 'This is an example of a method to solve an equation of the form',
 'The CSS rule is set to “big-header”']

In [23]:
# urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VvQRrSqS1P0_GobqtL-SKA/instruction-tuning-generated-outputs-base.pkl')
# generated_outputs_base = pickle.load(io.BytesIO(urlopened.read()))

Let's look at the sample responses generated by the base model and the expected responses from the dataset.


In [24]:
for i in range(3):
    print('@@@@@@@@@@@@@@@@@@@@')
    print('@@@@@ Instruction '+ str(i+1) +': ')
    print(instructions[i])
    print('\n\n')
    print('@@@@@ Expected response '+ str(i+1) +': ')
    print(expected_outputs[i])
    print('\n\n')
    print('@@@@@ Generated response '+ str(i+1) +': ')
    print(generated_outputs_base[i])
    print('\n\n')
    print('@@@@@@@@@@@@@@@@@@@@')
    

@@@@@@@@@@@@@@@@@@@@
@@@@@ Instruction 1: 
### Instruction:
What type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.

### Response:




@@@@@ Expected response 1: 

The data structure to use for key-value pairs in Python is a dictionary. A dictionary is a data type that consists of key-value pairs, and is denoted by {} in Python. Each key has a unique value associated with it that can be accessed using the key. For example, a dictionary called "person" could look like this: 

person = {'name':'John', 'age': 32} 

The value of the key "name" can be accessed using person['name'] which returns "John".



@@@@@ Generated response 1: 
What type of data structure would you use to store key-value pairs



@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@
@@@@@ Instruction 2: 
### Instruction:
Describe a method to solve an equation of the form ax + b = 0. Write corresponding code in Python.

### Response:




@@@@@ Expected response 2:

You can see that the responses generated by the base model fall well short of the expected responses. The base model also tends to extend and repeat its answers until it reaches the maximum number of tokens. Later on, you will see that instruction-tuning can fix both of these issues. First, the instruction fine-tuned model will be able to provide more meaningful responses. Second, because you appended the `eos` token `</s>` to the output, instruction fine-tuning teaches the model to stop instead of generating responses without bound.


## BLEU score

Let's set up a metric that compares the generated responses and the expected responses in the test environment. In this lab, let's use the [BLEU score](https://en.wikipedia.org/wiki/BLEU), a metric originally intended to check the quality of translations made by translation models. You can calculate the BLEU scores for individual generated segments by comparing them with a set of expected outputs and average the scores for the individual segments. Depending on the implementation, BLEU scores range from 0 to 1 or from 0 to 100 (as in the implementation used herein), with higher scores indicating a better match between the model generated output and the expected output.

_**Note:**_ 
1. The BLEU score was originally implemented for assessing the quality of translations. However, it may not necessarily be the best metric for instruction fine-tuning in general, but it is nonetheless a useful metric that gives a sense of the alignment between the model generated output and the expected output.
2. BLEU scores are very challenging to compare from one study to the next because BLEU is a parametrized metric whose value depends on choices such as tokenization. As a result, you can employ a variant of BLEU called [SacreBLEU](https://aclanthology.org/W18-6319/), which standardizes these choices so that scores are comparable.
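
To make the metric more concrete, here is a deliberately simplified BLEU-style score. It is a sketch for intuition only and is not SacreBLEU: it assumes whitespace tokenization, uses n-grams up to length two instead of four, applies no smoothing, and scores a single prediction against a single reference. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(prediction: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale (whitespace tokens)."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = Counter(ngrams(pred, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: each n-gram counts at most as often as in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize predictions shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "a dictionary stores key value pairs in python"
print(simple_bleu(reference, reference))
print(simple_bleu("a dictionary stores", reference))
```

The second call shows the effect of the brevity penalty: every n-gram in the short prediction matches the reference, yet the score drops well below 100 because the prediction is much shorter than the reference.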


In [28]:
import evaluate

# Load sacrebleu metric
sacrebleu = evaluate.load("sacrebleu")

# Debugging: Print lengths of predictions and references
print(f"Number of predictions: {len(generated_outputs_base)}")
print(f"Number of references: {len(expected_outputs)}")

# Align lengths if necessary
if len(generated_outputs_base) != len(expected_outputs):
    min_len = min(len(generated_outputs_base), len(expected_outputs))
    generated_outputs_base = generated_outputs_base[:min_len]
    expected_outputs = expected_outputs[:min_len]

# Ensure references are wrapped in a list
expected_outputs = [[ref] for ref in expected_outputs]

# Compute sacrebleu metric
results_base = sacrebleu.compute(predictions=generated_outputs_base,
                                 references=expected_outputs)

print(list(results_base.keys()))
print(round(results_base["score"], 1))


Number of predictions: 3
Number of references: 3
['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
0.1


The SacreBLEU score of 0.1/100 on these three examples (the reference result quoted for this lab is 0.4/100) indicates that there is very little alignment between the base model's generated responses and the expected responses for the examples in the test dataset.

---

## Perform instruction fine-tuning with LoRA

To save time, let's perform instruction fine-tuning using a parameter-efficient fine-tuning (PEFT) method called low-rank adaptation (LoRA).
First, convert the model into a PEFT model suitable for LoRA fine-tuning by defining a `LoraConfig` object from the `peft` library that outlines LoRA parameters, such as the LoRA rank and the target modules. Next, apply LoRA configuration on the model using `get_peft_model()`, which effectively converts `model` into a LoRA `model`.


In [29]:
lora_config = LoraConfig(
    r=16,  # Low-rank dimension
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA
    lora_dropout=0.1,  # Dropout rate
    task_type=TaskType.CAUSAL_LM  # Task type should be causal language model
)

model = get_peft_model(model, lora_config)
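
To see why LoRA is parameter-efficient, the number of added parameters can be estimated by hand. The sketch below assumes that `opt-350m` has a hidden size of 1024 and 24 decoder layers, and that `q_proj` and `v_proj` are square matrices of that size; these dimensions are not stated in this lab, so treat them as assumptions. The helper name is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r, num_layers, modules_per_layer):
    """Count parameters added by LoRA.

    Each adapted weight gets two low-rank matrices: A (r x d_in) and B (d_out x r).
    """
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# Assumed dimensions for opt-350m (not stated in this lab).
hidden_size = 1024
num_layers = 24
r, lora_alpha = 16, 32

added = lora_trainable_params(hidden_size, hidden_size, r, num_layers, modules_per_layer=2)
full = hidden_size * hidden_size * 2 * num_layers  # weights of q_proj and v_proj themselves
print(added, full, lora_alpha / r)
```

Under these assumptions, the adapters add about 1.6 million parameters, compared with about 50 million in the adapted projection weights alone, and the low-rank update is scaled by `lora_alpha / r = 2`. On the real model, `model.print_trainable_parameters()` reports the actual counts.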

Instruction fine-tuning using the `SFTTrainer` has the effect of generating the instructions *and* the responses. However, for the purposes of assessing the quality of the generated text, consider only the quality of the response and not the quality of the instruction. For the purposes of calculating the BLEU score, eliminate the length of tokens corresponding to the instruction from the beginning of the tokenized model output. 

For example, suppose the tokenized instruction had a length of ten, but the generated text had a length of fourteen. Then the tokenized response that was kept for the purposes of calculating the BLEU score was just the four tokens at the end of the tokenized generated text because the first ten tokens represent the model's generation of the tokenized instruction.

Eliminating the first few tokens of the tokenized output worked well for the purposes of calculating BLEU. During fine-tuning, however, the tokens corresponding to the instruction should not have an impact on the loss function. You can mask those tokens by setting their labels to -100, the default value ignored by PyTorch loss functions such as [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). By masking the tokens corresponding to the instruction with -100, only the tokens associated with the response can bear the loss.

You can create such a masking manually by defining your own function. However, it is easier to instead use the `DataCollatorForCompletionOnlyLM` class from `trl`:


In [30]:
response_template = "### Response:\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
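
For intuition, the effect of this masking can be sketched by hand on a list of token IDs. This is a simplified illustration and not the collator's implementation: it assumes the number of prompt tokens is already known, whereas the collator locates the response template inside the tokenized text. The helper name and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def mask_prompt_labels(input_ids, num_prompt_tokens):
    """Copy input_ids into labels, replacing the prompt positions with IGNORE_INDEX."""
    labels = list(input_ids)
    for i in range(min(num_prompt_tokens, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


# Hypothetical token IDs: the first ten belong to the instruction, the last four to the response.
input_ids = list(range(100, 114))
labels = mask_prompt_labels(input_ids, num_prompt_tokens=10)
print(labels)
```

Only the last four positions keep their token IDs as labels, so only the response tokens contribute to the loss.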

Now, pass the `collator`, a `DataCollatorForCompletionOnlyLM` object, to `SFTTrainer` as its data collator, so that the instruction tokens have no bearing on the loss.

To perform the training, first configure the training arguments with `SFTConfig`, and then create the `SFTTrainer` object, passing it the `collator`:

```python
training_args = SFTConfig(
    output_dir="/tmp",
    num_train_epochs=10,
    save_strategy="epoch",
    fp16=True,
    per_device_train_batch_size=2,  # Reduce batch size
    per_device_eval_batch_size=2,  # Reduce batch size
    max_seq_length=1024,
    do_eval=True
)

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    packing=False,
    data_collator=collator,
)
```

In [31]:
# Enable Memory-Efficient Features
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

training_args = SFTConfig(
    output_dir="/tmp",
    num_train_epochs=3,
    save_strategy="epoch",
    fp16=True,  # Mixed precision training
    per_device_train_batch_size=4,  # Reduce if needed
    per_device_eval_batch_size=2,  # Reduce if needed
    max_seq_length=1024,
    do_eval=True,
    gradient_accumulation_steps=4,  # Simulates a batch size of 16
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    packing=False,
    data_collator=collator,
)


Please ignore any warning shown above.
The following cell runs the trainer. Note that training would take a long time on a CPU.


In [32]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
500,1.7696
1000,1.6548


TrainOutput(global_step=1464, training_loss=1.6827167031543502, metrics={'train_runtime': 5776.1949, 'train_samples_per_second': 4.057, 'train_steps_per_second': 0.253, 'total_flos': 9288178628542464.0, 'train_loss': 1.6827167031543502, 'epoch': 2.998463901689708})
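
The step count and final epoch in this output can be reproduced with a little arithmetic. The sketch below assumes a single device and assumes that leftover batches which do not fill a gradient accumulation window are not counted as an optimizer step; both assumptions are consistent with the logged values but are not taken from the library documentation.

```python
import math

num_train_examples = 7811
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
# Assumption: leftover batches that do not fill an accumulation window are not
# counted as a step, which matches the logged global_step.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
global_step = updates_per_epoch * num_train_epochs
final_epoch = global_step * gradient_accumulation_steps / batches_per_epoch
print(batches_per_epoch, updates_per_epoch, global_step, round(final_epoch, 3))
```

Each optimizer step therefore covers 16 examples, and the final epoch value falls slightly short of 3 because of the uncounted leftover batches.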

After training, the `trainer` object holds a state history of the logged training steps. You can access this state history using the following lines:


In [37]:
log_history_lora = trainer.state.log_history
log_history_lora

[{'loss': 1.7696,
  'grad_norm': 6.382034778594971,
  'learning_rate': 3.2957650273224044e-05,
  'epoch': 1.0240655401945724,
  'step': 500},
 {'loss': 1.6548,
  'grad_norm': 6.261003017425537,
  'learning_rate': 1.5881147540983607e-05,
  'epoch': 2.048131080389145,
  'step': 1000},
 {'train_runtime': 5776.1949,
  'train_samples_per_second': 4.057,
  'train_steps_per_second': 0.253,
  'total_flos': 9288178628542464.0,
  'train_loss': 1.6827167031543502,
  'epoch': 2.998463901689708,
  'step': 1464}]
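
The log history mixes per-step loss entries with a final summary entry, so a common first step is to separate them. The following is a minimal sketch that uses a shortened copy of the entries above; the variable names are chosen for this example.

```python
# Shortened copy of the log history shown above.
log_history = [
    {"loss": 1.7696, "epoch": 1.0240655401945724, "step": 500},
    {"loss": 1.6548, "epoch": 2.048131080389145, "step": 1000},
    {"train_runtime": 5776.1949, "train_loss": 1.6827167031543502, "step": 1464},
]

# Keep only the entries that report a training loss.
loss_entries = [entry for entry in log_history if "loss" in entry]
steps = [entry["step"] for entry in loss_entries]
losses = [entry["loss"] for entry in loss_entries]
print(steps, losses)
```

The resulting `steps` and `losses` lists can then be passed to a plotting call such as `plt.plot(steps, losses)` to visualize the training loss.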

Instead of extracting the state history above, let's load the state history of a model that was instruction fine-tuned to the above specifications on a GPU.


In [39]:
urlopened = urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/49I70jQD0-RNRg2v-eOoxg/instruction-tuning-log-history-lora.json')
log_history_lora = json.load(io.BytesIO(urlopened.read()))
log_history_lora

[{'loss': 1.964,
  'grad_norm': 3.12174391746521,
  'learning_rate': 4.936251920122888e-05,
  'epoch': 0.12800819252432155,
  'step': 500},
 {'loss': 1.7329,
  'grad_norm': 3.3765358924865723,
  'learning_rate': 4.872247823860727e-05,
  'epoch': 0.2560163850486431,
  'step': 1000},
 {'loss': 1.731,
  'grad_norm': 4.632175922393799,
  'learning_rate': 4.808371735791091e-05,
  'epoch': 0.38402457757296465,
  'step': 1500},
 {'loss': 1.7693,
  'grad_norm': 3.1081297397613525,
  'learning_rate': 4.74436763952893e-05,
  'epoch': 0.5120327700972862,
  'step': 2000},
 {'loss': 1.6826,
  'grad_norm': 1.7782740592956543,
  'learning_rate': 4.680363543266769e-05,
  'epoch': 0.6400409626216078,
  'step': 2500},
 {'loss': 1.7027,
  'grad_norm': 1.7686882019042969,
  'learning_rate': 4.6163594470046084e-05,
  'epoch': 0.7680491551459293,
  'step': 3000},
 {'loss': 1.6579,
  'grad_norm': 8.598638534545898,
  'learning_rate': 4.5523553507424476e-05,
  'epoch': 0.8960573476702509,
  'step': 3500},
 {'

In [40]:
tokenizer.padding_side = 'left'

with torch.no_grad():
    # Due to resource limitations, apply the pipeline to only 3 records using "instructions_torch[:3]"
    pipeline_iterator= gen_pipeline(instructions_torch[:3], 
                                    max_length=50, # this is set to 50 due to resource constraint, using a GPU, you can increase it to the length of your choice
                                    num_beams=5,
                                    early_stopping=True,)

generated_outputs_lora = []
for text in pipeline_iterator:
    generated_outputs_lora.append(text[0]["generated_text"])

Let's have a look at some of the responses from the instruction fine-tuned model and the expected responses.


In [41]:
for i in range(3):
    print('@@@@@@@@@@@@@@@@@@@@')
    print('@@@@@ Instruction '+ str(i+1) +': ')
    print(instructions[i])
    print('\n\n')
    print('@@@@@ Expected response '+ str(i+1) +': ')
    print(expected_outputs[i])
    print('\n\n')
    print('@@@@@ Generated response '+ str(i+1) +': ')
    print(generated_outputs_lora[i])
    print('\n\n')
    print('@@@@@@@@@@@@@@@@@@@@')
    

@@@@@@@@@@@@@@@@@@@@
@@@@@ Instruction 1: 
### Instruction:
What type of data structure would you use to store key-value pairs in a Python program? Write corresponding code in Python.

### Response:




@@@@@ Expected response 1: 
[['\nThe data structure to use for key-value pairs in Python is a dictionary. A dictionary is a data type that consists of key-value pairs, and is denoted by {} in Python. Each key has a unique value associated with it that can be accessed using the key. For example, a dictionary called "person" could look like this: \n\nperson = {\'name\':\'John\', \'age\': 32} \n\nThe value of the key "name" can be accessed using person[\'name\'] which returns "John".']]



@@@@@ Generated response 1: 
The key-value pairs in a Python program can be stored in a



@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@
@@@@@ Instruction 2: 
### Instruction:
Describe a method to solve an equation of the form ax + b = 0. Write corresponding code in Python.

### Response:




@@@@@ Expected 

In [42]:
sacrebleu = evaluate.load("sacrebleu")
results_lora = sacrebleu.compute(predictions=generated_outputs_lora,
                                 references=expected_outputs)
print(list(results_lora.keys()))
print(round(results_lora["score"], 1))

['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
0.3


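The score printed above (0.3) is computed on only three generations truncated at `max_length=50`, so it is not representative of the model's quality. In the results reported for this lab, the fine-tuned model achieves a SacreBLEU score of 14.7/100, significantly better than the 0.4/100 achieved by the base model.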
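
The result dictionary above exposes `precisions`, `bp`, `sys_len`, and `ref_len`. As a rough illustration of how those pieces interact, here is a simplified sketch of modified n-gram precision and the brevity penalty for a single sentence pair. It is not the SacreBLEU implementation, which also handles tokenization, smoothing, and corpus-level aggregation. The function names and the example sentences are assumptions made for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Share of hypothesis n-grams that also occur in the reference (clipped)."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0


def brevity_penalty(sys_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    if sys_len == 0:
        return 0.0
    return 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)


ref = "the data structure to use for key value pairs in python is a dictionary".split()
hyp = "the data structure to use".split()  # a truncated generation

p1 = modified_precision(hyp, ref, 1)
bp = brevity_penalty(len(hyp), len(ref))

print(p1)            # 1.0
print(round(bp, 3))  # 0.165
```

Even though every word of the truncated hypothesis appears in the reference, the brevity penalty scales the score down sharply. This is one reason generations cut short by a small `max_length` tend to receive low BLEU scores.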

In conclusion, the instruction fine-tuned model generates responses that align much better with the expected responses in the dataset than those of the base model.

---

# Exercises


### Exercise 1: Try with another response template (Question-Answering)

Create a `formatting_prompts_response_template` function that formats `train_dataset` using the response template.

Template: `### Question: {question}\n ### Answer: {answer}`


In [None]:
#write your code here
def formatting_prompts_response_template(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Question:\n{mydataset['instruction'][i]}"
            f"\n\n### Answer:\n{mydataset['output'][i]}</s>"
        )
        output_texts.append(text)
    return output_texts

<details>
    <summary>Click here for the solution</summary>

```python
def formatting_prompts_response_template(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Question:\n{mydataset['instruction'][i]}"
            f"\n\n### Answer:\n{mydataset['output'][i]}</s>"
        )
        output_texts.append(text)
    return output_texts
```

</details>
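
To check the function by hand, it can help to run it on a tiny batch. The sketch below repeats the solution and applies it to a hypothetical one-row dictionary (`toy_batch`) that mimics the `instruction` and `output` columns the function expects.

```python
def formatting_prompts_response_template(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Question:\n{mydataset['instruction'][i]}"
            f"\n\n### Answer:\n{mydataset['output'][i]}</s>"
        )
        output_texts.append(text)
    return output_texts


# Hypothetical one-row batch with the same column names as the lab's dataset.
toy_batch = {
    "instruction": ["What is the capital of France?"],
    "output": ["Paris"],
}

formatted = formatting_prompts_response_template(toy_batch)
print(formatted[0])
```

The trailing `</s>` is written as a literal string by the solution. If you train a different model, it is worth comparing it with `tokenizer.eos_token`, since not every tokenizer uses `</s>` as its end-of-sequence token.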


Create a `formatting_prompts_response_template_no_response` function to format the `test_dataset` in the Response Template, excluding the response.


Template: `### Question: {question}\n ### Answer: `


In [None]:
#write your code here

In [None]:
def formatting_prompts_response_template_no_response(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Question:\n{mydataset['instruction'][i]}"
            f"\n\n### Answer:\n"
        )
        output_texts.append(text)
    return output_texts

<details>
    <summary>Click here for the solution</summary>

```python
def formatting_prompts_response_template_no_response(mydataset):
    output_texts = []
    for i in range(len(mydataset['instruction'])):
        text = (
            f"### Question:\n{mydataset['instruction'][i]}"
            f"\n\n### Answer:\n"
        )
        output_texts.append(text)
    return output_texts
```

</details>
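
At inference time the prompt ends right after the answer marker, so the model's continuation is the answer. The sketch below shows one possible way to build such a prompt and to recover the answer from text that still contains the prompt. The helpers `build_prompt` and `extract_answer` are hypothetical, and whether the generated text includes the prompt depends on how the generation pipeline is configured.

```python
ANSWER_MARKER = "### Answer:\n"


def build_prompt(question):
    """Format a single question in the response template, without the answer."""
    return f"### Question:\n{question}\n\n{ANSWER_MARKER}"


def extract_answer(generated_text):
    """Return the text after the answer marker, without the end-of-sequence string."""
    _, found, answer = generated_text.partition(ANSWER_MARKER)
    if not found:
        # No marker present: assume the text is already only the answer.
        return generated_text.strip()
    return answer.replace("</s>", "").strip()


prompt = build_prompt("What is the capital of France?")

# Simulated model output: the prompt followed by the continuation.
full_text = prompt + "Paris</s>"

print(repr(prompt))
print(extract_answer(full_text))  # Paris
```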


### Exercise 2: Try with another LLM (EleutherAI/gpt-neo-125m)

`EleutherAI/gpt-neo-125m` is a small variant of the GPT-Neo family of models developed by EleutherAI. With 125 million parameters, it is computationally efficient while still performing reasonably well on a range of natural language processing tasks.

Download and load the `EleutherAI/gpt-neo-125m` model and its tokenizer.


In [None]:
#write your code here
model_name = "EleutherAI/gpt-neo-125m"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

<details>
    <summary>Click here for the solution</summary>

```python
model_name = "EleutherAI/gpt-neo-125m"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

</details>


Initialize LoRA Configuration:

- r: 8 (Low-rank dimension)
- lora_alpha: 16 (Scaling factor)
- target_modules: ["q_proj", "v_proj"] (Modules to apply LoRA)
- lora_dropout: 0.1 (Dropout rate)
- task_type: TaskType.CAUSAL_LM (Task type should be causal language model)


In [None]:
#write your code here
lora_config = LoraConfig(
    r=8,  # Low-rank dimension
    lora_alpha=16,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA
    lora_dropout=0.1,  # Dropout rate
    task_type=TaskType.CAUSAL_LM  # Task type should be causal language model
)

<details>
    <summary>Click here for the solution</summary>

```python

lora_config = LoraConfig(
    r=8,  # Low-rank dimension
    lora_alpha=16,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA
    lora_dropout=0.1,  # Dropout rate
    task_type=TaskType.CAUSAL_LM  # Task type should be causal language model
)
```

</details>
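
To get a feel for what `r=8` and `lora_alpha=16` imply, the following back-of-the-envelope sketch counts the parameters LoRA would add. It assumes a model with 12 transformer layers and a hidden size of 768, with square `q_proj` and `v_proj` projections in every layer. These architecture numbers are assumptions for illustration and should be checked against `model.config` in the lab. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds two matrices: A with shape (r, d_in) and B with shape (d_out, r)."""
    return r * d_in + d_out * r


# Assumed architecture values (verify against model.config in practice).
hidden_size = 768
num_layers = 12

# Values taken from the LoRA configuration in this exercise.
r = 8
lora_alpha = 16
target_modules = ["q_proj", "v_proj"]

per_module = lora_param_count(hidden_size, hidden_size, r)
total_lora = per_module * len(target_modules) * num_layers
full_per_module = hidden_size * hidden_size
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(per_module)                    # 12288
print(total_lora)                    # 294912
print(per_module / full_per_module)  # fraction of a full weight matrix
print(scaling)                       # 2.0
```

Under these assumptions, each adapted projection trains about 2% of the parameters of the full weight matrix. After wrapping the model with `get_peft_model`, PEFT's `model.print_trainable_parameters()` reports the actual counts for the loaded model.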


Apply LoRA Configuration to the model.


In [None]:
#write your code here
model = get_peft_model(model, lora_config)

<details>
    <summary>Click here for the solution</summary>

```python
model = get_peft_model(model, lora_config)
```

</details>


## Congratulations! You have completed the lab


## Authors

[Wojciech "Victor" Fulmyk](https://www.linkedin.com/in/wfulmyk) is a Data Scientist and a PhD Candidate in Economics at the University of Calgary.

[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a Ph.D. candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.

[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo) has a Ph.D. in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his Ph.D.

## References

[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)

[Finetuning To Follow Instructions](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb)

[Finetuning with LoRA -- A Hands-On Example](https://lightning.ai/lightning-ai/studios/code-lora-from-scratch)

© Copyright IBM Corporation. All rights reserved.
