#  Instruction-Tuning with LLMs


Instruction-based fine-tuning, often called instruction tuning, trains language models to follow specific instructions and generate appropriate responses. The dataset plays an important role in instruction tuning: it provides structured examples of instructions, contexts, and responses, which lets the model learn how to handle a variety of tasks. Instruction-tuned models are often refined further with human feedback; however, this lab doesn't cover that aspect.

The context and instruction are concatenated to form a single input sequence that the model can understand and use to generate the correct response.

#### Context and instruction

- Instruction: A command that specifies what the model should do
- Context: Additional information or background needed to carry out the instruction
- Combined input: The instruction and context joined into a single input sequence
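As a minimal sketch of the combination step, the helper below joins an instruction and an optional context into one string. The function name `build_input` and the `Instruction:` / `Context:` / `Response:` labels are hypothetical and chosen only for illustration; this lab itself relies on the TinyLlama chat template instead.

```python
def build_input(instruction: str, context: str = "") -> str:
    """Concatenate an instruction and optional context into one input string."""
    if context:
        return f"Instruction: {instruction}\nContext: {context}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


example = build_input(
    instruction="Summarize the text in one sentence.",
    context="LoRA adds small trainable matrices to a frozen model.",
)
print(example)
```

The model is then trained to continue this single sequence with the desired response.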

In [None]:
!pip install torch torchvision torchaudio
!pip install transformers datasets peft trl bitsandbytes accelerate

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, pipeline
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer

# Load a tokenizer and define a format function

In [2]:
template_tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

def format_prompt(example):
    """Format a conversation with the <|user|> chat template that
    TinyLlama uses.
    """

    # Render the list of chat messages as a single prompt string
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}


### Load and format the data using the TinyLlama chat template

In [3]:
dataset = (
  load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
    .shuffle(seed=42)
    .select(range(3_000))
)

dataset = dataset.map(format_prompt)

Example of a formatted prompt:

In [4]:
print(dataset["text"][2576])

<|user|>
Given the text: Knock, knock. Who’s there? Hike.
Can you continue the joke based on the given text material "Knock, knock. Who’s there? Hike"?</s>
<|assistant|>
Sure! Knock, knock. Who's there? Hike. Hike who? Hike up your pants, it's cold outside!</s>
<|user|>
Can you tell me another knock-knock joke based on the same text material "Knock, knock. Who's there? Hike"?</s>
<|assistant|>
Of course! Knock, knock. Who's there? Hike. Hike who? Hike your way over here and let's go for a walk!</s>
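To make the layout of the printed example explicit, here is a hand-written approximation of what the chat template produces. It is inferred only from the output above (role tag, newline, content, `</s>`), and `render_chat` is a hypothetical helper, not the tokenizer's actual template code.

```python
def render_chat(messages):
    """Render {"role", "content"} dicts in the layout of the printed example."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, then the content closed by </s>
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Knock, knock."},
    {"role": "assistant", "content": "Who's there?"},
]
print(render_chat(chat))
```

In practice, `apply_chat_template` should remain the source of truth, since the template ships with the tokenizer.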



# Model Quantization

Now that we have our data, we can start loading in our model. This is where
we apply the Q in QLoRA, namely quantization. We use the
`bitsandbytes` package to compress the pretrained model to a 4-bit
representation.

In `BitsAndBytesConfig`, you can define the quantization scheme. We
follow the steps used in the original QLoRA paper and load the model in
4-bit (`load_in_4bit`) with a normalized float representation
(`bnb_4bit_quant_type`) and double quantization
(`bnb_4bit_use_double_quant`):

In [5]:
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization configuration - The Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4", # Quantization type
    bnb_4bit_compute_dtype="float16", # Compute dtype
    bnb_4bit_use_double_quant=True, # Apply nested quantization
)
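To give a feel for what 4-bit storage means, the sketch below quantizes a small block of weights with one absmax scale per block. This is a simplified stand-in and not the NF4 scheme: it uses evenly spaced levels, whereas NF4 uses levels tuned to normally distributed weights. The function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes using one absmax scale per block."""
    max_code = 2 ** (bits - 1) - 1  # 7 for 4 bits, so codes span -7..7
    absmax = max(abs(v) for v in values) or 1.0
    codes = [round(v / absmax * max_code) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, bits=4):
    """Recover approximate floats from the integer codes and the block scale."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code * absmax for c in codes]


weights = [0.12, -0.53, 0.98, -0.05, 0.31, -0.88, 0.44, 0.02]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, scale)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a small integer plus a shared per-block scale. Double quantization, as enabled above, additionally compresses those per-block scales.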

### Load the model to train on the GPU

In [6]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",

    # Leave this out for regular SFT
    quantization_config=bnb_config
)
model.config.use_cache = False
model.config.pretraining_tp = 1

### Load LLaMA tokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

We use `padding_side = "left"` with TinyLlama (a decoder-only model) so that the real tokens sit at the end of each sequence. This matters because these models generate text autoregressively, continuing from the last position. With left padding, the attention mask marks the `<PAD>` tokens on the left so the model ignores them.
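The effect of the padding side can be sketched without any library. `pad_batch` below is a hypothetical helper that pads lists of token IDs and builds the matching attention mask; the token IDs and `pad_id=0` are made up for illustration.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + seq)
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(seq + pads)
            attention_mask.append(ones + [0] * n_pad)
    return input_ids, attention_mask


batch = [[5, 6, 7], [8, 9]]
ids, mask = pad_batch(batch, pad_id=0, side="left")
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

With left padding, the last column of every row holds a real token, which is the position a decoder-only model continues from.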

This quantization procedure allows us to decrease the size of the original
model while retaining most of the original weights’ precision. Loading the
model now only uses ~1 GB VRAM compared to the ~4 GB of VRAM it
would need without quantization. 

> Note that fine-tuning itself needs more VRAM than the ~1 GB required
> just to load the model.
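A rough back-of-the-envelope estimate shows where these numbers come from. The sketch assumes about 1.1 billion parameters (taken from the "1.1B" in the model name) and counts only the weights; the gap between the 4-bit estimate and the observed ~1 GB is plausibly due to layers that stay unquantized and runtime overhead, which is an assumption rather than something measured here.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1.1e9  # assumption: approximate count implied by the model name

fp32_gb = weight_memory_gb(n_params, 32)
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)

print(f"32-bit: {fp32_gb:.2f} GB")  # 4.40 GB
print(f"16-bit: {fp16_gb:.2f} GB")  # 2.20 GB
print(f" 4-bit: {int4_gb:.2f} GB")  # 0.55 GB
```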

# LoRA Configuration

Next, we define our LoRA configuration using the `peft` library. The
configuration holds the hyperparameters of the fine-tuning process:

In [8]:
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[ 
        "k_proj",
        "gate_proj",
        "v_proj",
        "up_proj",
        "q_proj",
        "o_proj",
        "down_proj",
    ], # Layers to target 
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

`r`
This is the rank of the compressed matrices.
Increasing this value also increases the size of the compressed
matrices, which means less compression and therefore more
representational power. Values typically range between 4 and 64.

`lora_alpha`
Controls the amount of change that is added to the original weights. In
essence, it balances the knowledge of the original model with that of the
new task. A rule of thumb is to choose a value twice the size of `r`.

`target_modules`
Controls which layers to target. The LoRA procedure can choose to
ignore specific layers, like specific projection layers. This can speed up
training but reduce performance and vice versa.
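The trade-off behind `r` and `lora_alpha` can be made concrete with a little arithmetic. For a weight matrix of shape `d_out x d_in`, LoRA trains two matrices of shapes `d_out x r` and `r x d_in`, and the update is scaled by `lora_alpha / r`. The 2048 x 2048 layer size below is an illustrative assumption, not a value read from the model.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


d_out = d_in = 2048  # assumption: illustrative projection size
full_params = d_out * d_in

for r in (4, 16, 64):
    added = lora_params(d_out, d_in, r)
    print(f"r={r:>2}: {added:>7} trainable params "
          f"({added / full_params:.2%} of the full matrix)")

# Scaling applied to the LoRA update, using the values from the config above
lora_alpha, r = 32, 64
scaling = lora_alpha / r
print(f"scaling = {scaling}")  # 0.5
```

Note that with `lora_alpha=32` and `r=64`, the configuration in this lab gives a scaling of 0.5 rather than the factor of 2 that the rule of thumb would produce.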

# Training Configuration 

In [9]:
OUTPUT_DIR = "./results"

# Training Arguments
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)
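These arguments determine how many optimizer steps one epoch takes. The sketch below works through the arithmetic for the 3,000 selected examples, assuming a single GPU (`num_devices = 1` is an assumption, not something the configuration states).

```python
import math

num_examples = 3_000              # size of the subset selected earlier
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                   # assumption: training on a single GPU
num_train_epochs = 1

# Examples consumed per optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size)  # 8
print(total_steps)           # 375
```

Gradient accumulation keeps per-step memory at the level of a batch of 2 while updating the weights as if the batch size were 8.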

# Training

In [10]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,

    # Leave this out for regular SFT
    peft_config=peft_config,
)

# Train model
trainer.train()

# Save QLoRA weights
QLORA_MODEL = "TinyLlama-1.1B-qlora"
trainer.model.save_pretrained(QLORA_MODEL)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  3%|▎         | 10/375 [00:36<21:56,  3.61s/it]

{'loss': 1.6705, 'grad_norm': 0.26160547137260437, 'learning_rate': 0.00019964928592495045, 'epoch': 0.03}


  5%|▌         | 20/375 [01:13<21:30,  3.64s/it]

{'loss': 1.4756, 'grad_norm': 0.25835326313972473, 'learning_rate': 0.0001985996037070505, 'epoch': 0.05}


  8%|▊         | 30/375 [01:49<21:11,  3.69s/it]

{'loss': 1.4511, 'grad_norm': 0.18799914419651031, 'learning_rate': 0.0001968583161128631, 'epoch': 0.08}


 11%|█         | 40/375 [02:26<20:39,  3.70s/it]

{'loss': 1.488, 'grad_norm': 0.19733956456184387, 'learning_rate': 0.00019443763702374812, 'epoch': 0.11}


 13%|█▎        | 50/375 [03:03<19:39,  3.63s/it]

{'loss': 1.478, 'grad_norm': 0.19109897315502167, 'learning_rate': 0.0001913545457642601, 'epoch': 0.13}


 16%|█▌        | 60/375 [03:39<19:00,  3.62s/it]

{'loss': 1.3903, 'grad_norm': 0.19982600212097168, 'learning_rate': 0.00018763066800438636, 'epoch': 0.16}


 19%|█▊        | 70/375 [04:15<18:17,  3.60s/it]

{'loss': 1.4949, 'grad_norm': 0.2272220402956009, 'learning_rate': 0.00018329212407100994, 'epoch': 0.19}


 21%|██▏       | 80/375 [04:51<17:42,  3.60s/it]

{'loss': 1.4499, 'grad_norm': 0.19912979006767273, 'learning_rate': 0.000178369345732584, 'epoch': 0.21}


 24%|██▍       | 90/375 [05:27<17:18,  3.64s/it]

{'loss': 1.4275, 'grad_norm': 0.2007320374250412, 'learning_rate': 0.00017289686274214118, 'epoch': 0.24}


 27%|██▋       | 100/375 [06:04<16:52,  3.68s/it]

{'loss': 1.4041, 'grad_norm': 0.22838197648525238, 'learning_rate': 0.00016691306063588583, 'epoch': 0.27}


 29%|██▉       | 110/375 [06:41<16:08,  3.66s/it]

{'loss': 1.4145, 'grad_norm': 0.202215313911438, 'learning_rate': 0.0001604599114862375, 'epoch': 0.29}


 32%|███▏      | 120/375 [07:17<15:27,  3.64s/it]

{'loss': 1.377, 'grad_norm': 0.18593859672546387, 'learning_rate': 0.00015358267949789966, 'epoch': 0.32}


 35%|███▍      | 130/375 [07:53<14:49,  3.63s/it]

{'loss': 1.3321, 'grad_norm': 0.19036248326301575, 'learning_rate': 0.00014632960351198618, 'epoch': 0.35}


 37%|███▋      | 140/375 [08:30<14:33,  3.72s/it]

{'loss': 1.497, 'grad_norm': 0.2050643414258957, 'learning_rate': 0.0001387515586452103, 'epoch': 0.37}


 40%|████      | 150/375 [09:08<13:54,  3.71s/it]

{'loss': 1.3465, 'grad_norm': 0.2627912759780884, 'learning_rate': 0.00013090169943749476, 'epoch': 0.4}


 43%|████▎     | 160/375 [09:44<13:01,  3.64s/it]

{'loss': 1.4115, 'grad_norm': 0.20871563255786896, 'learning_rate': 0.00012283508701106557, 'epoch': 0.43}


 45%|████▌     | 170/375 [10:20<12:20,  3.61s/it]

{'loss': 1.454, 'grad_norm': 0.17071916162967682, 'learning_rate': 0.00011460830285624118, 'epoch': 0.45}


 48%|████▊     | 180/375 [10:56<11:45,  3.62s/it]

{'loss': 1.3244, 'grad_norm': 0.2169497162103653, 'learning_rate': 0.00010627905195293135, 'epoch': 0.48}


 51%|█████     | 190/375 [11:33<11:07,  3.61s/it]

{'loss': 1.4192, 'grad_norm': 0.1909182369709015, 'learning_rate': 9.790575801166432e-05, 'epoch': 0.51}


 53%|█████▎    | 200/375 [12:09<10:40,  3.66s/it]

{'loss': 1.4746, 'grad_norm': 0.19712603092193604, 'learning_rate': 8.954715367323468e-05, 'epoch': 0.53}


 56%|█████▌    | 210/375 [12:45<09:51,  3.59s/it]

{'loss': 1.404, 'grad_norm': 0.1976407915353775, 'learning_rate': 8.126186854142752e-05, 'epoch': 0.56}


 59%|█████▊    | 220/375 [13:21<09:16,  3.59s/it]

{'loss': 1.342, 'grad_norm': 0.19072221219539642, 'learning_rate': 7.310801793847344e-05, 'epoch': 0.59}


 61%|██████▏   | 230/375 [13:58<08:51,  3.67s/it]

{'loss': 1.3612, 'grad_norm': 0.1956234574317932, 'learning_rate': 6.51427952678185e-05, 'epoch': 0.61}


 64%|██████▍   | 240/375 [14:34<08:14,  3.66s/it]

{'loss': 1.3872, 'grad_norm': 0.1848766803741455, 'learning_rate': 5.7422070843492734e-05, 'epoch': 0.64}


 67%|██████▋   | 250/375 [15:10<07:31,  3.61s/it]

{'loss': 1.3536, 'grad_norm': 0.19254186749458313, 'learning_rate': 5.000000000000002e-05, 'epoch': 0.67}


 69%|██████▉   | 260/375 [15:47<06:56,  3.62s/it]

{'loss': 1.3462, 'grad_norm': 0.19941206276416779, 'learning_rate': 4.2928643231556844e-05, 'epoch': 0.69}


 72%|███████▏  | 270/375 [16:23<06:18,  3.60s/it]

{'loss': 1.4659, 'grad_norm': 0.19478286802768707, 'learning_rate': 3.6257601025131026e-05, 'epoch': 0.72}


 75%|███████▍  | 280/375 [16:59<05:37,  3.56s/it]

{'loss': 1.4339, 'grad_norm': 0.20654813945293427, 'learning_rate': 3.0033665948663448e-05, 'epoch': 0.75}


 77%|███████▋  | 290/375 [17:34<05:01,  3.55s/it]

{'loss': 1.3874, 'grad_norm': 0.20368149876594543, 'learning_rate': 2.4300494434824373e-05, 'epoch': 0.77}


 80%|████████  | 300/375 [18:09<04:23,  3.51s/it]

{'loss': 1.3763, 'grad_norm': 0.17710012197494507, 'learning_rate': 1.9098300562505266e-05, 'epoch': 0.8}


 83%|████████▎ | 310/375 [18:45<03:55,  3.63s/it]

{'loss': 1.3951, 'grad_norm': 0.19326646625995636, 'learning_rate': 1.4463573983949341e-05, 'epoch': 0.83}


 85%|████████▌ | 320/375 [19:23<03:28,  3.78s/it]

{'loss': 1.4378, 'grad_norm': 0.1809847354888916, 'learning_rate': 1.042882397605871e-05, 'epoch': 0.85}


 88%|████████▊ | 330/375 [20:00<02:46,  3.70s/it]

{'loss': 1.3868, 'grad_norm': 0.19370228052139282, 'learning_rate': 7.022351411174866e-06, 'epoch': 0.88}


 91%|█████████ | 340/375 [20:37<02:08,  3.69s/it]

{'loss': 1.3882, 'grad_norm': 0.1877432018518448, 'learning_rate': 4.268050246793276e-06, 'epoch': 0.91}


 93%|█████████▎| 350/375 [21:13<01:30,  3.62s/it]

{'loss': 1.3135, 'grad_norm': 0.18830986320972443, 'learning_rate': 2.1852399266194314e-06, 'epoch': 0.93}


 96%|█████████▌| 360/375 [21:50<00:54,  3.63s/it]

{'loss': 1.4441, 'grad_norm': 0.19025741517543793, 'learning_rate': 7.885298685522235e-07, 'epoch': 0.96}


 99%|█████████▊| 370/375 [22:27<00:18,  3.71s/it]

{'loss': 1.4522, 'grad_norm': 0.19409486651420593, 'learning_rate': 8.771699011416168e-08, 'epoch': 0.99}


100%|██████████| 375/375 [22:47<00:00,  3.65s/it]


{'train_runtime': 1367.461, 'train_samples_per_second': 2.194, 'train_steps_per_second': 0.274, 'train_loss': 1.416849353790283, 'epoch': 1.0}
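The `learning_rate` values in the log follow the cosine schedule set by `lr_scheduler_type="cosine"`. As a sketch, assuming no warmup steps and decay to zero over all 375 steps, the logged values can be reproduced with the formula below (`cosine_lr` is a hypothetical helper written for this illustration).

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 to zero at total_steps (no warmup)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))


base_lr, total_steps = 2e-4, 375

for step in (10, 250, 370):
    print(step, cosine_lr(step, total_steps, base_lr))
```

Step 250 lands at a quarter of the base rate, 5e-05, which agrees with the logged value at epoch 0.67.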


# Merge Weights

After we have trained our QLoRA weights, we still need to combine them
with the original weights to use them. We reload the model in 16 bits,
instead of the quantized 4 bits, to merge the weights. Although the tokenizer
was not updated during training, we save it to the same folder as the model
for easier access:

In [13]:
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora", # QLORA_MODEL
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()
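Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below illustrates this with plain Python lists; the matrices and helper names are made up for illustration and do not reflect how `merge_and_unload` is implemented internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[0.5], [1.0]]            # LoRA "up" matrix (2 x r), r = 1
A = [[0.2, -0.4]]             # LoRA "down" matrix (r x 2)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

x = [[1.0], [2.0]]            # column-vector input
merged_out = matmul(merged, x)

# Same result as running the base layer and the adapter separately
adapter_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + 2.0 * a[0]]
                for b, a in zip(matmul(W, x), adapter_out)]
print(merged_out, separate_out)
```

After merging, inference needs a single matrix multiplication per layer instead of a base pass plus an adapter pass.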

After merging the adapter with the base model, we can use it with the
prompt template that we defined earlier:

In [15]:
# Use our predefined prompt template
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

# Run our instruction-tuned model
pipe = pipeline(
    task="text-generation", model=merged_model, tokenizer=tokenizer
)
print(pipe(prompt)[0]["generated_text"])

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of data, including text, audio, and video, and are capable of generating complex and nuanced language.

LLMs are used in a variety of applications, including natural language processing (NLP), machine translation, and chatbots. They can be used to generate text, speech, or images, and can be trained to understand different languages and dialects.

One of the most significant applications of LLMs is in the field of natural language generation (NLG). LLMs can be used to generate text in a variety of languages, including English, French, and German. They can also be used to generate speech, such as in a chatbot or voice assistant.

LLMs have the potential to revolutionize the way we communicate and interact with each other. They can help us create more engaging and personal