<p> <center> <a href="../../LLM-Application.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 75%; text-align: center;">
        <a>1</a>
         <a href="trt-llama-chat.ipynb">2</a>
          <a href="trt-custom-model.ipynb">3</a>
        <a href="triton-llama.ipynb">4</a>
        <a href="nemo-guardrails.ipynb">5</a>
        <a href="challenge.ipynb">6</a>
    </span>
    <span style="float: left; width: 23%; text-align: right;"><a href="trt-llama-chat.ipynb">Next Notebook</a></span>
</div>

# Finetuning Llama-2-7B with Custom Data
---

<div style="text-align:left; color:#FF0000; height:80px; text-color:red; font-size:20px">Please note that you can only run this lab using the Llama Fine-tuning container</div>

In this notebook, we demonstrate how to preprocess a dataset for a text generation task and fine-tune the Llama-2-7b-chat model on it. We use 4-bit quantization via [QLoRA](https://arxiv.org/abs/2305.14314) together with Parameter-Efficient Fine-Tuning (PEFT) techniques, which make fine-tuning efficient by updating only a small subset of the model's parameters. The last section of the notebook describes the steps to run inference on the fine-tuned model.

## Overview of LLAMA 2

Llama 2 is a family of pre-trained and fine-tuned Large Language Models (LLMs) from [Meta](https://llama.meta.com/llama2), consisting of Llama 2 and Llama 2-Chat. The pre-trained models come in 7B, 13B, and 70B parameter variants. The fine-tuned models (Llama 2-Chat) are optimized for conversational applications using Reinforcement Learning from Human Feedback ([RLHF](https://arxiv.org/abs/2305.18438)). Llama 2 is trained on 2 trillion tokens and has a context length of 4K tokens, so the model can take in and generate fairly long passages. To improve inference scalability, the larger Llama 2 variants adopt [Grouped Query Attention (GQA)](https://arxiv.org/abs/2305.13245), in which groups of query heads share a single key/value head; this shrinks the key-value cache kept for previously processed tokens and speeds up attention at inference time.
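To get a feel for why sharing key/value heads matters, the sketch below estimates the key-value cache size for one full-length sequence with and without grouped heads. The layer count, head counts, and head dimension are assumptions chosen to resemble a 70B-scale configuration, and the function name is made up for this illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The factor 2 accounts for storing both a key and a value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-style configuration: 80 layers, 64 query heads, head_dim 128,
# 4096-token context, 16-bit cache entries.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
# With grouped-query attention, 8 key/value heads serve all 64 query heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"one KV head per query head: {mha / 2**30:.2f} GiB")
print(f"8 shared KV heads:          {gqa / 2**30:.2f} GiB")
print(f"reduction factor:           {mha // gqa}x")
```

The estimate covers the cache only; model weights and activations are not included.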

<center><img src="images/llama2-chat-arc2.png" width="900px" height="900px" alt-text="Arc"/></center>
<center><i>Source: </i> <a href="https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/">Llama 2: Open Foundation and Fine-Tuned Chat Models </a> paper</center>
    
We focus on using a custom dataset and fine-tuning the Llama 2 Chat 7 billion (Llama-2-7b-chat) variant. The variant was created using supervised fine-tuning and refined using the RLHF technique by rejection sampling and [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347). 
Please run the cell below to check whether the `Llama-2-7b-chat` model already exists in your directory; if it does not, uncomment the download cell that follows to fetch it.

In [None]:
!ls -LR ../../model/Llama-2-7b-chat

**Expected Output:**
```python
../../model/Llama-2-7b-chat:
LICENSE.txt		   USE_POLICY.md	params.json
README.md		   checklist.chk	tokenizer.model
Responsible-Use-Guide.pdf  consolidated.00.pth	tokenizer_checklist.chk
```

In [None]:
###### download Llama-2-7b-chat model. Remove all comments to run the cell #####################

#!python3 ../../source_code/Llama2/download-llama2-chat.py   
#print("extracting files......")
#!tar -xf ../../model/Llama-2-7b-chat.tar  -C ../../model
#print("files extraction done! removing tar file......")
#!rm -rf ../../model/Llama-2-7b-chat.tar
#print("All done!!!")

## Text Generation

Text generation is the task of producing text that closely resembles human-written text. It is typically framed as causal language modeling, in which a model predicts each next token from the tokens that precede it and thereby produces coherent, meaningful text. Models for this task are trained to learn patterns, long-range dependencies, and contextual information so that they can generate new text from a given prompt. Popular large language models such as GPT (Generative Pre-trained Transformer), Mistral, and Llama 2 are used for the task. Use cases include:

- Instruction Models: adapt to follow instructions
- Code Generation: trained on code from scratch to help the programmers with repetitive coding tasks.
- Stories Generation: receives input text and creates story-like text based on the given text.

You can read more about text generation task variants, such as Completion Generation Models, Text-to-Text Generation Models, and Text Generation from Image and Text, [here](https://huggingface.co/tasks/text-generation).


## Data Preprocessing

We use [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) as our dataset for fine-tuning. The openassistant-guanaco dataset is a subset of the [Open Assistant Conversations](https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main) dataset that contains only the highest-rated paths in the conversation tree, with a total of about 9.85k training samples and 518 test samples. OpenAssistant Conversations (OASST1) is a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers ([source](https://huggingface.co/datasets/OpenAssistant/oasst1)).

### Dataset Structure

Each row in the `openassistant-guanaco` jsonl dataset is a dictionary with a single `text` field that holds a whole conversation: `Human` turns carry the instructions, and `Assistant` turns carry the responses that provide context for later turns. Within `text`, each turn starts with a `###` delimiter followed by the speaker label (`### Human:` or `### Assistant:`).

```bash

{ "text": "### Human: Can you write a short introduction about the relevance of the term \"monopsony\" in economics? Please use examples related to...
           ### Assistant: \"Monopsony\" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term  is...   
           ### Human: Now explain it to a dog"
}
{ "text": "### Human: \u00bfCUales son las etapas del desarrollo y en qu\u00e9 consisten seg\u00fan Piaget?
           ### Assistant: Jean Piaget fue un psic\u00f3log suizo que propuso una teor\u00eda sobre el desarrollo cognitivo...
           ### Human: \u00bfHay otras teor\u00edas sobre las etapas del desarrollo que reafirmen o contradigan a la teor\u00eda de Piaget?"
}
{"text": "### Human: Can you give me an example of a python script that opens an api point and serves a string?
          ### Assistant: Sure! Here's an example Python script that uses the Flask web framework to create a simple API endpoint that serves a string:\n\n`         \nfrom flask import Flask\n\napp = Flask(__name__)\n\n@app.route('/')\ndef hello_world():\n    return 'Hello, world!'\n\nif __name__ ==  \n ...   
          ### Human: What changes would you need to make to the code above to serve a JSON object instead of a string?
          ### Assistant: To serve a JSON object instead of a string, you can modify the \"hello_world()\" function to return a JSON response using the  Flask \"jsonify\" function. Here's an example of how to modify the previous code to serve a JSON object:\n\n... "
}
...

```
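To make the turn structure concrete, here is a minimal sketch that splits one `text` value into `(speaker, content)` pairs with a regular expression. The function name `split_turns` and the shortened sample string are illustrative only and are not part of the notebook's pipeline.

```python
import re

# Match "### Human:" or "### Assistant:" and capture everything up to the
# next speaker marker (or the end of the string).
TURN_PATTERN = re.compile(
    r"###\s*(Human|Assistant):\s*(.*?)(?=###\s*(?:Human|Assistant):|\Z)",
    re.DOTALL,
)


def split_turns(text):
    """Return a list of (speaker, content) tuples in conversation order."""
    return [(role, content.strip()) for role, content in TURN_PATTERN.findall(text)]


sample = (
    "### Human: What is a monopsony?"
    "### Assistant: A market with a single buyer."
    "### Human: Now explain it to a dog"
)
turns = split_turns(sample)
for role, content in turns:
    print(f"{role:>9}: {content}")
```

Note that a conversation may end on a `Human` turn with no reply, as the last turn of this sample does.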
Please run the cell below to check if the dataset exists in the data directory; otherwise, uncomment the nested cell below to download it.

In [None]:
!ls -LR ../../data

**Expected Output:**
```python
../../data:
README.md  openassistant_best_replies_eval.jsonl   simple_data.json
filtered   openassistant_best_replies_train.jsonl
...

```


In [None]:
###### download dataset. Remove comment to run the cell ###########
#!python3 ../../source_code/Llama2/download-guanco-ds.py

Import all required libraries 

In [None]:
# If you have limited computing resources and run into out-of-memory errors, uncomment the os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64" line at the end of this cell


import os
import torch
import json
from datasets import load_dataset, load_from_disk
from langdetect import detect
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
import re
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

### Data Extraction

Let's execute the `read_jsonl` function below to read the dataset

In [None]:

def read_jsonl(file_path):
    with open(file_path) as f:
        data = [json.loads(line) for line in f]
        return data
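As a quick illustration of the JSON Lines layout that `read_jsonl` expects (one JSON object per line), the sketch below writes two toy rows to a temporary file and reads them back. The helper names and the toy rows are assumptions made for this example.

```python
import json
import os
import tempfile


def write_jsonl(rows, file_path):
    """Write one JSON object per line."""
    with open(file_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl_lines(file_path):
    """Parse every non-empty line as its own JSON document."""
    with open(file_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


rows = [
    {"text": "### Human: Hi### Assistant: Hello!"},
    {"text": "### Human: ¿Qué tal?"},
]

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    write_jsonl(rows, path)
    loaded = read_jsonl_lines(path)

print(loaded == rows)
```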


Let's perform the following steps in the cell below:
- Set the path to the train and test jsonl files
- Read both files using the `read_jsonl` function
- Extract 5k samples from the training set
- Display the samples to see the content and format

In [None]:
#set path to jsonl files
train_path = '../../data/openassistant_best_replies_train.jsonl'
test_path = '../../data/openassistant_best_replies_eval.jsonl'

# read the files
raw_train_data = read_jsonl(train_path)
raw_test_data = read_jsonl(test_path)

# extract 5000 samples 
train_samples = raw_train_data[:5000]

print("length of training samples: ", len(train_samples))
train_samples[:10]

**Expected output:**
```python

length of training samples:  5000
[{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as ..'},
 ...
 {'text': '### Human: ¿CUales son las etapas del desarrollo y en qué consisten según Piaget?### Assistant: Jean Piaget fue un psicólogo suizo que propuso una teoría sobre el desarrollo cognitivo humano que consta de cuatro etapas:\n\nEtapa sensoriomotora (0-2 años): Durante esta etapa, el niño aprende a través de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea básica de objetividad y continuidad.\n\nEtapa preoperatoria (2-7 años): En esta etapa, el niño comienza a desarrollar un pensamiento simbólico y a comprender que las cosas pueden representar a otras cosas. También comienzan a desarrollar un pensamiento lógico y a comprender conceptos como la causa y el efecto.\n\nEtapa de ...?'},
 ...
 {'text': '### Human: Schreibe einen kurze und präzise Konstruktionsbeschreibung zu einem Dreieck ABC mit c=6\xa0cm, h_c=5\xa0cm und γ=40°. γ ist hierbei der von Seite c gegenüberliegende Winkel.### Assistant: Dreieck ABC ist ein rechtwinkliges Dreieck mit der Seitenlänge c=6 cm als Hypotenuse. Die Höhe h_c von c beträgt 5 cm und der Winkel γ von c gegenüberliegend beträgt 40°.### Human: Vielen Dank, das hat mir sehr weitergeholfen.'},
 {'text': '### Human: Напиши информацию о игре Hytale### Assistant: Hytale - это предстоящая игра-песочница, придуманная Hypixel Studios. Объединяя в себе диапазон игры-песочницы с глубиной ролевой игры, Hytale погружает игроков в живой космос, полный эскапад и творчества.\n\nВ Hytale геймеры могут путешествовать по просторному, процедурно созданному миру и принимать участие в разнообразных действиях. От возведения баз и создания собственных персонажей до создания карт приключений и сражений с монстрами или другими игроками - возможности Hytale безграничны. Игроки также могут создавать свои ..'},
 {'text': '### Human: 私は猫ちゃんが好きなんですけど\n貴方も猫ちゃんが好きですか?### Assistant: 猫ちゃんが好きなんですね。\n私も猫ちゃんが好きですよ！\n猫ちゃんのどんなところが好きですか？### Human: 猫ちゃんの全てが好きです！猫ちゃんは天使です！！### Assistant: わかりますよ、その気持ち！\n猫ちゃんの愛くるしい姿は、天使みたいに可愛いですよね！\nあなたのお家に、猫ちゃんはいるんですか？### Human: 勿論です！とっても可愛くて、毎日が幸せです！'},

 {'text': "### Human: Quins sinònims amb altres dialectes d'aquesta llengua té nen o nena?### Assistant: Al·lot o al·lota, vailet o vaileta, manyac o manyaga, nin o nina, xiquet o xiqueta, xic o xica, marrec, minyó o minyona."},
...
```

We can see from the content displayed that the file contains text written in different languages. For demonstration purposes, we want to consider only English text for training and prompting. Therefore, the train and test sets are processed to filter out non-English samples using the `detect` function from the `langdetect` library imported above. The English-only texts (about 1,775 training samples; the exact count can differ slightly between runs because `langdetect` is not deterministic by default) are saved as the new training set. The same process is applied to the test samples. Run the three cells below to execute the filter process.

In [None]:
# keep only the rows detected as English
def remove_nonEnglish_rows(ds):
    new_ds = []
    for row in (ds):
        if detect(str(row)) == 'en':
            new_ds.append(row)
    return new_ds
        
# save English text samples
def save_jsonl(ds, filename):
    # write one JSON object per line so the file is valid JSON Lines
    with open(f"../../data/filtered/{filename}.jsonl", "w") as write_file:
        for row in ds:
            write_file.write(json.dumps(row) + "\n")
    print("dataset saved in jsonl format ....")

In [None]:
#calling filter function
filter_train_samples = remove_nonEnglish_rows(train_samples)
filter_test_samples = remove_nonEnglish_rows(raw_test_data )

print("len of training samples: ",len(filter_train_samples))
filter_train_samples[:10]

**Expected output:**
```python
len of training samples:  1775
[{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as ...'},
 {'text': '### Human: Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?### Assistant: Sure! Let\'s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and ...'},
 {'text': "### Human: I want to start doing astrophotography as a hobby, any suggestions what could i do?### Assistant: Getting started in astrophotography can seem daunting, but with some patience and practice, you can become a master of the craft. To begin, you'll need a good camera and lens, a tripod, and a dark sky location free of light pollution. You will also need to learn about the basics of astrophotography, such as what camera settings to use, how to capture star trails, and the best techniques for tracking celestial objects. You can also purchase or rent different types of telescopes, depending on what types of objects you want to capture. Additionally, it's important to keep up with the latest astrophotography news and trends. Once you have the necessary ..."},
...
```
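The number of rows kept depends on the language detector, and detectors can also fail on empty or symbol-only text. The sketch below shows one defensive way to structure such a filter: the detector is passed in as a function, and rows that raise an error are skipped. `filter_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in (ASCII check) so the example runs without extra packages.

```python
def filter_by_language(rows, detect_fn, target="en"):
    """Keep rows whose 'text' field is detected as the target language."""
    kept = []
    for row in rows:
        try:
            if detect_fn(row["text"]) == target:
                kept.append(row)
        except Exception:
            # Detectors may raise on empty or symbol-only text; skip such rows.
            continue
    return kept


def toy_detect(text):
    """Hypothetical stand-in detector: 'en' when the text is pure ASCII."""
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if text.isascii() else "other"


rows = [
    {"text": "### Human: Hello there"},
    {"text": "### Human: ¿Qué tal?"},
    {"text": "   "},
]
english_rows = filter_by_language(rows, toy_detect)
print(len(english_rows), "row(s) kept")
```

In the notebook, `langdetect.detect` could be passed as `detect_fn`. If reproducible counts matter, `langdetect` documents setting `DetectorFactory.seed` before the first call.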

In [None]:
# set file names  
save_train_filename = 'train'
save_test_filename = 'test'

# save file
save_jsonl(filter_train_samples, save_train_filename)
save_jsonl(filter_test_samples, save_test_filename)

### Data Transformation

In this section, we format our dataset into the prompt template that Llama 2 expects for fine-tuning:

```python

<s>[INST] {human text} [/INST] {assistant/context} </s>
<s>[INST]{human text} [/INST] </s>

```
- **Human text**: the human instruction to the model. The human text is enclosed within the instruction tags `[INST] [/INST]`.
- **Assistant**: the response that follows the instruction and provides context for later turns. The assistant text is placed after the human text, and the pair is wrapped in the segment tags `<s>  </s>`.

One or more segments can exist within a training sample text. A segment can consist of both human instruction text and assistant context or human instruction text only.
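The template can be sketched as a small formatting function. The name `to_llama2_template` and the example turns are illustrative; a reply of `None` marks a segment that contains a human instruction only.

```python
def to_llama2_template(turns):
    """Format (human_text, assistant_text_or_None) pairs as Llama 2 segments."""
    segments = []
    for human_text, assistant_text in turns:
        if assistant_text is None:
            # Segment with a human instruction only
            segments.append(f"<s>[INST] {human_text} [/INST] </s>")
        else:
            # Segment with a human instruction followed by the assistant text
            segments.append(f"<s>[INST] {human_text} [/INST] {assistant_text} </s>")
    return "".join(segments)


example = to_llama2_template(
    [
        ("What is a monopsony?", "A market with a single buyer."),
        ("Now explain it to a dog", None),
    ]
)
print(example)
```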


<img src="images/template.png"/>


Next, we load the filtered dataset by running the cell below.

In [None]:
dataset = load_dataset('../../data/filtered')

Execute the function to transform the filtered dataset to the format explained above.

In [None]:
### credit: https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k ### 

# Define a function to transform the data
def transform_to_template(example):
    conversation_text = example['text']
    segments = conversation_text.split('###')

    reformatted_segments = []

    # Step through the segments two at a time: a human turn and, if present, its assistant reply
    for i in range(1, len(segments), 2):
        human_text = segments[i].strip().replace('Human:', '').strip()

        # Check if there is a corresponding assistant segment before processing
        if i + 1 < len(segments):
            assistant_text = segments[i+1].strip().replace('Assistant:', '').strip()

            # Apply the new template
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
        else:
            # Handle the case where there is no corresponding assistant segment
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] </s>')

    return {'text': ''.join(reformatted_segments)}


# Apply the transformation
template_dataset = dataset.map(transform_to_template)

Let's display a sample to inspect if our training set is in the right format as shown in the screenshot above.

In [None]:
template_dataset['train'][2:3]

Save the preprocessed dataset to the directory `../../data/ds_preprocess`.

In [None]:
template_dataset.save_to_disk('../../data/ds_preprocess')

## Fine-tuning LLAMA 2

As mentioned earlier in the notebook, our choice model for finetuning is the `Llama-2-7b-chat`. Below is the list of walkthrough steps to complete the task.

- Convert the `Llama-2-7b-chat` to Hugging Face transformer format
- Set the paths to the model and load the dataset
- Load the model tokenizer
- Set the training parameters
- Configure Parameter-Efficient Fine-Tuning with LoRA
- Apply 4-bit quantization
- Set up the trainer to start the fine-tuning process


### Convert model to Hugging Face Transformers Format
We will convert our model checkpoint to the Hugging Face transformer format. The benefits are as follows:  

- Provides a convenient way to fine-tune the Llama 2 model with limited GPU computing resources (such as laptops, workstations, or Google Colab) through techniques that the Transformers ecosystem supports, such as PEFT and quantization.
- Lets you run a model on a given input text immediately using Transformers `pipelines`.
- A `pipeline` groups the pre-trained model together with the preprocessing that was used during its training, which enables quick inferencing.


To convert our model checkpoint, we use the `convert_llama_weights_to_hf.py` script located on [GitHub](https://github.com/cedrickchee/transformers-llama/tree/llama_push/src/transformers/models/llama). The script takes the following arguments:

- `--input_dir`: the directory of the Llama model to be converted
- `--model_size`: the parameter size of the Llama model
- `--output_dir`: the directory in which to save the converted model

<img src="images/llama-hf.png" height="550px" width="900px" />

If you already have the Hugging Face format of the model `Llama-2-7b-chat-hf`, you can skip running the cell below.



In [None]:
!python3 ../../source_code/Llama2/llama/convert_llama_weights_to_hf.py \
--input_dir ../../model/Llama-2-7b-chat --model_size 7B --output_dir ../../model/Llama-2-7b-chat-hf

**Expected Output:**
```python
...
https://github.com/huggingface/transformers/pull/24565
Fetching all parameters from the checkpoint at ../model/Llama-2-7b-chat.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.95it/s]
Saving in the Transformers format.

```

Now, we can initialize the path to our transformer format model and load the transformed/preprocessed training dataset from the directory where it was saved. 

In [None]:

# initialize the path to the base model
base_model = "../../model/Llama-2-7b-chat-hf"

# set the path to the preprocessed training split
data_path = "../../data/ds_preprocess/train"

# set the path to the preprocessed evaluation split
eval_path = "../../data/ds_preprocess/test"

# load the transformed dataset
dataset = load_from_disk(data_path)
eval_dataset = load_from_disk(eval_path)

### Loading tokenizer

Tokenization is the process of breaking text into smaller units such as words or sub-words; each unit is called a token and is mapped to an integer id that the model consumes. The [LLaMA tokenizer](https://huggingface.co/docs/transformers/en/model_doc/llama2) is a byte-pair-encoding (BPE) model based on [sentencepiece](https://aclanthology.org/P16-1162/), an unsupervised text tokenizer and detokenizer for neural-network-based text generation systems in which the vocabulary size is fixed before the neural model is trained.

In the cell below, we load the tokenizer from our base model directory and set the parameters:

- **pad_token**: a special token used to make arrays of tokens the same size for batching purposes. 
- **padding_side**: side to pad

A comprehensive list of Llama tokenizer parameters can be found [here](https://huggingface.co/docs/transformers/en/model_doc/llama2)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
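To picture what `pad_token` and `padding_side` do during batching, the sketch below pads lists of token ids to a common length and builds the matching attention mask. It is a plain-Python illustration, not the tokenizer's own implementation, and the token ids are hypothetical.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to the same length and return (ids, attention_mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "right":
            padded.append(seq + [pad_id] * n_pad)
            masks.append([1] * len(seq) + [0] * n_pad)
        else:
            padded.append([pad_id] * n_pad + seq)
            masks.append([0] * n_pad + [1] * len(seq))
    return padded, masks


# Hypothetical token ids; 2 stands in for the EOS id reused as the pad token.
batch = [[1, 15043, 3186], [1, 15043]]
ids, mask = pad_batch(batch, pad_id=2, padding_side="right")
print(ids)
print(mask)
```

The attention mask marks real tokens with 1 and padding with 0, so padded positions can be ignored by the model.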

### Set Training Hyperparameters

Training hyperparameters are customized using the `TrainingArguments` class. The class provides an API that offers a wide range of options to customize and optimize the training process. Please find a comprehensive description of the hyperparameters [here](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#transformers.TrainingArguments). After completing the fine-tuning process once, you can modify the hyperparameter values in the next cell and rerun the cells to see how they affect the training outcome.

**Note:** *If running on a single DGX A100 GPU, use the hyperparameter settings below to modify the next cell.*

```python 
training_params = TrainingArguments(
    output_dir="../../model/results",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    group_by_length=True,
    save_steps=50,
    logging_steps=50, 
    ...
)
```

The cell below contains hyperparameter settings for running on a MIG (Multi-Instance GPU) instance with 20 GB of memory.

In [None]:
training_params = TrainingArguments(
    output_dir="../../model/results",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    group_by_length=True,
    save_steps=25,
    logging_steps=25, 
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    optim="paged_adamw_32bit",
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
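The batch size, gradient accumulation, and epoch settings together determine how many optimizer updates a run performs. The sketch below gives a rough estimate for the two settings shown above, assuming a single GPU and the filtered training set of 1,775 samples shown in the expected output earlier. The helper name is made up, and the Trainer's own rounding may differ by a step.

```python
import math


def training_steps(num_samples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Return (effective batch size, updates per epoch, total updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


NUM_SAMPLES = 1775  # assumed size of the filtered training set

a100 = training_steps(NUM_SAMPLES, per_device_batch_size=4, grad_accum_steps=4, epochs=5)
mig = training_steps(NUM_SAMPLES, per_device_batch_size=1, grad_accum_steps=1, epochs=2)

print("A100 setting (effective batch, steps/epoch, total):", a100)
print("MIG setting  (effective batch, steps/epoch, total):", mig)
```

A larger effective batch means fewer but smoother updates, which is why `save_steps` and `logging_steps` differ between the two settings.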

### Configure PEFT With LoRA

Fine-tuning large pre-trained language models is computationally costly. Our main goal is to accelerate the fine-tuning process with minimal memory consumption. One way to achieve that is to use a state-of-the-art [Parameter-Efficient Finetuning (PEFT)](https://github.com/huggingface/peft/tree/main) approach. [PEFT](https://arxiv.org/abs/2305.16742) fine-tunes a small number of (extra) model parameters instead of all the model's parameters, which significantly decreases the computational and storage costs. One way to implement PEFT is the Low-Rank Adaptation (LoRA) technique. LoRA makes fine-tuning more efficient by greatly reducing the number of trainable parameters for downstream tasks. It does this by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. According to the [authors of LoRA](https://arxiv.org/abs/2106.09685), compared with full fine-tuning of GPT-3 175B, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times, while delivering higher training throughput and adding no inference latency. For quick background on LoRA, please follow this [link](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora).

<center><img src="images/lora-arch.png" height="500px" width="900px"  /></center>
<center>  LoRA reparametrization and Weight merging. <a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora"> View source</a> </center>

LoRA techniques are applied through `LoraConfig`, which provides PEFT parameters that control how the method is applied to the base model. A description of the parameter used in the cell below is given as follows:

- **lora_alpha**: LoRA scaling factor
- **lora_dropout**: The dropout probability for LoRA layers.
- **r**: the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
- **bias**: Specifies if the bias parameters should be trained. It can be 'none', 'all', or 'lora_only'.
- **task_type**: The task type; possible values include `CAUSAL_LM`, `FEATURE_EXTRACTION`, `QUESTION_ANS`, `SEQ_2_SEQ_LM`, `SEQ_CLS`, and `TOKEN_CLS`.

Because the task we want to perform is text generation, we have set the task_type to Causal language model `(CAUSAL_LM)`, which is frequently used for text generation tasks. Please run the cell below to set up the LoRA configuration. 

In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
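To see how the rank `r` limits the number of trainable parameters, the sketch below compares a full weight matrix with its LoRA update for the values configured above (`r=64`, `lora_alpha=16`). The hidden size of 4096 is an assumption for a 7B-scale model, and the function name is illustrative.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096  # assumed hidden size of a 7B-scale model
R = 64
LORA_ALPHA = 16

full_params = HIDDEN_SIZE * HIDDEN_SIZE  # one square projection matrix
lora_params = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"full matrix:  {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters ({lora_params / full_params:.2%})")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Lowering `r` shrinks the LoRA factors linearly, at the cost of a less expressive update.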

### 4-bit quantization configuration

Model quantization is a popular deep-learning optimization method in which model data—network parameters and activations—are converted from floating-point to lower-precision representation, typically using 8-bit integers. Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially in large language models (LLMs). It can be combined with PEFT methods to make it easier to train and load LLMs for inference.
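As a small illustration of the idea, the sketch below applies symmetric "absmax" quantization to a handful of values: each value is divided by a scale so that it fits the signed 8-bit range, rounded to an integer, and later multiplied back. This is a generic textbook scheme written in plain Python for clarity; it is not the NF4 data type used later in this notebook, and the weights are made up.

```python
def quantize_absmax(values, n_bits=8):
    """Map floats to signed integers using a single scale for the whole list."""
    q_max = 2 ** (n_bits - 1) - 1  # 127 for 8 bits
    abs_max = max(abs(v) for v in values)
    scale = abs_max / q_max if abs_max > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print("quantized:", q)
print("restored: ", [round(v, 4) for v in restored])
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The rounding error per value is at most half of the scale, which is the precision given up in exchange for the smaller representation.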

<center><img src="images/quantization.png" height="400px" width="700px" /></center>
<center> <a href="https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/" > source: Using Quantization Aware Training with NVIDIA TensorRT</a></center>

Several quantization methods and algorithms are described [here](https://huggingface.co/docs/peft/main/en/developer_guides/quantization). The `bitsandbytes` library makes it easy to apply quantization and integrates with Transformers. Its config parameters for quantizing a model to 8 or 4 bits are exposed through the `BitsAndBytesConfig` class. The 4-bit parameters used in the cell below are described as follows:

- **load_in_4bit**: set `True` to quantize the model to 4-bits when you load it
- **bnb_4bit_quant_type**: set to `"nf4"` to use a special 4-bit data type for weights initialized from a normal distribution
- **bnb_4bit_use_double_quant**: set `True` to use a nested quantization scheme to quantize the already quantized weights
- **bnb_4bit_compute_dtype**: the data type used for computation, for example `torch.float16` or `torch.bfloat16`; computing in a 16-bit type is faster than computing in 32-bit precision

Run the cell below to set the 4-bit quantization for our model.

In [None]:

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
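A back-of-the-envelope estimate shows why 4-bit loading matters on a 20 GB GPU instance. The sketch below computes the memory needed for the weights alone at several precisions, assuming a nominal 7 billion parameters. It ignores quantization constants, activations, gradients, and optimizer state, so real usage is higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7_000_000_000  # nominal parameter count of a 7B model (assumption)

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")
```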

### Loading Base Model

The next step is to load our base model `(Llama-2-7b-chat-hf)` with the causal language model class used for the text generation task. We do this by passing the base model path, the quantization config, and the GPU device ID to `AutoModelForCausalLM.from_pretrained`, as shown in the cell below.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

### Set the Trainer Hyperparameters

To set up our model trainer, we create a trainer object using the [Supervised Fine-Tuning (SFT)](https://huggingface.co/docs/trl/en/sft_trainer) trainer. SFT is part of the [Transformer Reinforcement Learning (TRL)](https://huggingface.co/docs/trl/en/index) library, a set of tools for training transformer language models with reinforcement learning. Other steps it supports include [Reward Modeling (RM)](https://huggingface.co/docs/trl/en/reward_trainer) and [Proximal Policy Optimization (PPO)](https://arxiv.org/abs/1707.06347). In our SFT trainer object, we set the model, training dataset, evaluation dataset, PEFT config object, model tokenizer, and training arguments. We also specify the field (`text`) to use within our dataset.

**Note:** *If running on a single DGX A100 GPU, change the value of `max_seq_length` to 1024 or set it to `None` (the default).*

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset = eval_dataset,
    dataset_text_field="text",
    peft_config=peft_params,
    args=training_params,
    max_seq_length=512,
    packing=False,
)

Run the cell below to train the SFT trainer object. *Please note that it takes an hour or more to complete the training (each epoch takes 31 minutes or more)*.

In [None]:
trainer.train()

Save the fine-tuned model and tokenizer in the directory `../../model/Llama-2-7b-chat-hf-finetune`.

In [None]:
# save model
new_model = "../../model/Llama-2-7b-chat-hf-finetune"
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

## Inferencing

Now that we have completed the fine-tuning process, we can run a quick inference to test how our new model performs. To do that, we will do the following:

- a transformer pipeline with four parameter inputs: `model`, `tokenizer`, `max_length`, and `task`.  
- format prompt as: `f"<s>[INST] {prompt} [/INST]"`
- pass the prompt into the pipeline object and get result.

```python
 
inf_pipeline = pipeline(model=model, tokenizer=tokenizer, max_length=200, task="text-generation")
prompt = f"<s>[INST] {prompt} [/INST]"
result = inf_pipeline(prompt)
print(result[0]['generated_text'])
```
You can modify the `max_length` to decide the length of text generated by the model. 


In [None]:

def run_inference(prompt):
    inf_pipeline = pipeline(model=model, tokenizer=tokenizer, max_length=200, task="text-generation")
    prompt = f"<s>[INST] {prompt} [/INST]"
    result = inf_pipeline(prompt)
    print(result[0]['generated_text'])

In [None]:
prompt = "explain what is astrophotography?"
run_inference(prompt)

**Likely output:**

```bash
<s>[INST] explain what is astrophotography? [/INST] Astrophotography is the branch of photography that deals with the photographing of celestial objects such as stars, planets, galaxies, and other astronomical phenomena. Astrophotographers use specialized cameras and telescopes to capture images of these objects, often in low light conditions. Astrophotography requires a great deal of patience, skill, and knowledge, as photographers must be able to accurately track the movement of celestial objects and compensate for the effects of light pollution and atmospheric distortion. Astrophotography has become increasingly popular in recent years, with many amateur astronomers and professional photographers engaging in this hobby and art form. 

Astrophotography can be done using a variety of techniques, including long exposure times, tracking mounts, and using narrow-band filters to
```

In [None]:
prompt = "can you explain further?"
run_inference(prompt)

**Likely output:**

```bash
<s>[INST] can you explain further? [/INST] Sure, I'd be happy to explain further! Could you please provide more context or clarify what you're asking about? 😃 👍 💡 🖊 📝 📅 🕰 🕰️ 🕳️ 🕷️ 🕸️ 🕷️ 🕸️ 🕳️ 🕷️ 🕸️ 🕳️ 🕷️ 🕸️ 🕰️ 🕰️ 🕳️ 🕷️ 🕸️ 🕳️ 🕷️ 🕸️ 🕰️

```

In [None]:
prompt = "I want you to explain further on astrophotography"
run_inference(prompt)

**Likely output:**

```bash

<s>[INST] I want you to explain further on astrophotography [/INST] Sure, I'd be happy to explain further on astrophotography. Astrophotography is the process of capturing images of celestial objects such as stars, planets, galaxies, and other astronomical phenomena. It involves using specialized cameras and techniques to capture high-quality images of these objects in the night sky.

There are several key components to astrophotography:

    Camera: Astrophotography cameras are designed specifically for capturing images of celestial objects. They typically have large sensors, fast lenses, and specialized features such as built-in equatorial mounts, motorized tracking systems, and cooling systems to reduce noise and improve image quality.

    Telescope: A telescope is used to gather light from the celestial object being photographed. There are several types of

```
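
The second prompt above received a generic reply because each call to `run_inference` sends a single, independent prompt, so the model never sees the earlier exchange. One way to carry context forward is to place the previous turns in the prompt. The sketch below assumes the multi-turn layout commonly used with Llama-2-chat, in which each completed exchange is closed with `</s>`. `build_chat_prompt` is a hypothetical helper, not part of the notebook.

```python
def build_chat_prompt(history, new_message):
    """Build a multi-turn Llama-2 chat prompt.

    history: list of (user_message, assistant_reply) pairs from earlier turns.
    new_message: the user message the model should answer next.
    """
    turns = [
        f"<s>[INST] {user.strip()} [/INST] {assistant.strip()} </s>"
        for user, assistant in history
    ]
    # The final turn is left open so the model generates the reply
    turns.append(f"<s>[INST] {new_message.strip()} [/INST]")
    return "".join(turns)


# The assistant reply below is shortened for illustration.
history = [
    ("explain what is astrophotography?",
     "Astrophotography is the branch of photography that deals with celestial objects."),
]
print(build_chat_prompt(history, "can you explain further?"))
```

The resulting string would be passed to the pipeline directly rather than through `run_inference`, which adds its own `[INST]` tags. A longer prompt also uses up more of the `max_length` budget, so that value may need to be raised.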

### Reload model in FP16 and merge with LoRA weights

To obtain a single standalone model that is easier to use in downstream tasks, we reload the base model in FP16 and merge it with the LoRA weights using `model.merge_and_unload()`. The tokenizer is reloaded, configured with a padding token, and saved along with the merged model in the same directory, `../../model/Llama-2-7b-chat-hf-merged`.
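
To illustrate what merging means, the toy example below folds a low-rank update into a weight matrix and checks that the merged matrix gives the same output as running the base weight and the adapter side by side. It assumes the standard LoRA formulation, W_merged = W + (alpha / r) · B · A. It is a sketch of the arithmetic only, not of how PEFT implements `merge_and_unload()`, and all numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter
weight = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight W, shape (out=2, in=2)
lora_a = [[1.0, 2.0]]               # A, shape (rank=1, in=2)
lora_b = [[0.5], [-1.0]]            # B, shape (out=2, rank=1)
alpha, rank = 2, 1

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)

x = [[3.0], [4.0]]                  # one input column vector
# Unmerged path: W x + (alpha / rank) * B (A x)
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
two_path = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]
# Merged path: a single matrix multiplication
merged_out = matmul(merged, x)

print(merged)                  # [[2.0, 2.0], [-2.0, -3.0]]
print(merged_out == two_path)  # True
```

After merging, inference needs only the single merged matrix per layer, which is why the merged model can be saved and loaded like an ordinary checkpoint.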

In [None]:
# Reload the base model in FP16 and merge it with the LoRA weights
load_base_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    return_dict=True,
    device_map={"": 0},  # place the whole model on GPU 0
)

model = PeftModel.from_pretrained(load_base_model, new_model)
model = model.merge_and_unload()


# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


In [None]:
model.save_pretrained("../../model/Llama-2-7b-chat-hf-merged", safe_serialization=True)
tokenizer.save_pretrained("../../model/Llama-2-7b-chat-hf-merged")


**Expected output:**
```python
('../../model/Llama-2-7b-chat-hf-merged/tokenizer_config.json',
 '../../model/Llama-2-7b-chat-hf-merged/special_tokens_map.json',
 '../../model/Llama-2-7b-chat-hf-merged/tokenizer.model',
 '../../model/Llama-2-7b-chat-hf-merged/added_tokens.json',
 '../../model/Llama-2-7b-chat-hf-merged/tokenizer.json')
```

<div style="text-align:left; color:#FF0000; height:80px; text-color:red; font-size:20px">Please close the Jupyter notebook and switch to the TRT-LLM Container to continue with the next lab</div>

---
## References

- https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
- https://llama.meta.com/llama2
- https://huggingface.co/tasks/text-generation
- https://arxiv.org/abs/2304.07327
- https://huggingface.co/datasets/timdettmers/openassistant-guanaco
- https://huggingface.co/docs/transformers/en/model_doc/llama2
- https://huggingface.co/docs/peft/main/en/developer_guides/quantization

## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

