# Lab | QLoRA Tuning using PEFT from Hugging Face

<!-- ### Introduction to Quantization & Fine-tune a Quantized Model -->

**Note:** This is more or less the same notebook you saw in the previous lesson, and that is fine: this is an LLM fine-tuning lab. In class we used one set of datasets and models; in the lab you are required to change the LLMs and the datasets, including the pre-processing pipelines.

# Brief Introduction to Quantization
The main idea of quantization is simple: Reduce the precision of floating-point numbers, which normally occupy 32 bits, to integers of 8 or even 4 bits.

This reduction occurs in the model’s parameters, specifically in the weights of the neural layers, and in the activation values that flow through the model’s layers.

This not only reduces the model’s storage size and memory consumption but can also speed up its calculations.

Naturally, there is a loss of precision, but particularly in the case of 8-bit quantization, this loss is minimal.



## Let's see an example of a quantized number.

In reality, what I want to examine is the precision loss that occurs when transitioning from a 32-bit number to a quantized 8/4-bit number and then returning to its original 32-bit value.

First, I'm going to create a function to quantize and another to unquantize.

In [None]:
!pip install -U bitsandbytes --quiet
!apt-get install --quiet
!pip install -U datasets --quiet
!pip uninstall -y peft transformers accelerate trl bitsandbytes
!pip install -q accelerate==0.29.3
!pip install -q bitsandbytes==0.43.1
!pip install -q trl==0.8.6
!pip install -q peft==0.10.0
!pip install "transformers>=4.41.0,<5.0.0" --upgrade

In [None]:
!pip uninstall -y torch torchvision torchaudio triton transformers datasets accelerate trl

# Fresh install for Colab (default PyTorch wheels; pass --index-url to select a CPU-only or specific CUDA build)
!pip install torch torchvision torchaudio
!pip install triton
!pip install transformers datasets accelerate trl

In [None]:
!pip install transformers trl bitsandbytes

In [None]:
# Import the necessary libraries
import numpy as np
import math
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer
import torch
import matplotlib.pyplot as plt
from datasets import load_dataset
import peft
from peft import LoraConfig, get_peft_model

In [None]:
#Functions to quantize and unquantize
def quantize(value, bits=4):
    quantized_value = np.round(value * (2**(bits - 1) - 1))
    return int(quantized_value)

def unquantize(quantized_value, bits=4):
    value = quantized_value / (2**(bits - 1) - 1)
    return float(value)

Quantized values:

In [None]:
quant_4 = quantize(0.622, 4)
print(quant_4)
quant_8 = quantize(0.622, 8)
print(quant_8)

Unquantized values:

In [None]:
unquant_4 = unquantize(quant_4, 4)
print(unquant_4)
unquant_8 = unquantize(quant_8, 8)
print(unquant_8)

If we consider that the original number was 0.622, it can be said that 8-bit quantization barely loses precision, and the loss from 4-bit quantization is manageable.
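
To put numbers on that claim, the round trip can be measured directly. The sketch below restates the two helpers so that it runs on its own (using Python's built-in `round`, which rounds the same way as `np.round` here); `round_trip_error` is a helper name introduced only for this illustration.

```python
def quantize(value, bits=4):
    # Scale a value in [-1, 1] to the signed integer range for `bits` bits.
    return int(round(value * (2 ** (bits - 1) - 1)))

def unquantize(quantized_value, bits=4):
    # Map the integer back to a float in [-1, 1].
    return float(quantized_value / (2 ** (bits - 1) - 1))

def round_trip_error(value, bits):
    # Absolute difference between the original value and its
    # quantize -> unquantize reconstruction (hypothetical helper).
    return abs(value - unquantize(quantize(value, bits), bits))

original = 0.622
for bits in (4, 8):
    restored = unquantize(quantize(original, bits), bits)
    print(f"{bits}-bit: restored={restored:.5f}, "
          f"error={round_trip_error(original, bits):.5f}")
```

With 4 bits the value 0.622 maps to the integer 4 and comes back as 4/7 ≈ 0.571, an error of about 0.05. With 8 bits it maps to 79 and comes back as 79/127 ≈ 0.622, an error below 0.0001.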

In [None]:
x = np.linspace(-1, 1, 50)
y = [math.cos(val) for val in x]


y_quant_8bit = np.array([quantize(val, bits=8) for val in y])
y_unquant_8bit = np.array([unquantize(val, bits=8) for val in y_quant_8bit])

y_quant_4bit = np.array([quantize(val, bits=4) for val in y])
y_unquant_4bit = np.array([unquantize(val, bits=4) for val in y_quant_4bit])

Let’s plot the original cosine curve together with its 8-bit and 4-bit unquantized versions.


In [None]:
plt.figure(figsize=(10, 12))

plt.subplot(4, 1, 1)
plt.plot(x, y, label="Original")
plt.plot(x, y_unquant_8bit, label="unquantized_8bit")
plt.plot(x, y_unquant_4bit, label="unquantized_4bit")
plt.legend()
plt.title("Quantized Curves Graph Comparison")
plt.grid(True)

As you can see, the difference between the 8-bit and the original values is minimal. However, we need to use 4-bit quantization if we want to load the 7B Model into a 16GB GPU without problems.
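
The memory argument can be checked with a little arithmetic. This is a rough sketch that counts only the weights and assumes a round 7 billion parameters; activations, the CUDA context, and training state are not included, so real usage will be higher. `weight_memory_gb` is a name made up for this example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: a round 7B-parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take roughly 28 GB at 32 bits, 14 GB at 16 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits. The 16-bit version would nominally fit in 16 GB, but it leaves very little room for anything else, which is why 4 bits is the comfortable choice.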


# QLoRA. Fine-tuning a 4-bit Quantized Model using LoRA.
We are going to use LoRA to fine-tune a 7B model quantized to 4 bits.

## Load the PEFT and Datasets Libraries.

The PEFT library contains the Hugging Face implementation of different fine-tuning techniques, like LoRA Tuning.

Using the Datasets library we have access to a huge number of datasets.

#### Check the library versions before downloading or making adjustments

In [None]:
import importlib

def check_version(lib_name):
    try:
        lib = importlib.import_module(lib_name)
        version = getattr(lib, '__version__', '❓ Version not found')
        print(f"{lib_name}: ✅ Installed, version {version}")
    except ImportError:
        print(f"{lib_name}: ❌ Not installed")


check_version("accelerate")
check_version("bitsandbytes")
check_version("trl")
check_version("peft")
check_version("transformers")
check_version("datasets")

The next cell (commented out by default) installs the peft and Transformers libraries from their GitHub repositories instead of from PyPI. This is not strictly necessary, but it gives you the newest versions of the libraries, with support for newer models. If you want to try one of the latest models, you can use this trick.


In [None]:
#Install the latest versions of the peft & transformers libraries, recommended
#if you want to work with the most recent models
#!pip install -q git+https://github.com/huggingface/peft.git
#!pip install -q git+https://github.com/huggingface/transformers.git
#!pip install -q git+https://github.com/huggingface/accelerate.git
#!pip install -q git+https://github.com/huggingface/trl.git
#!pip install --upgrade datasets gcsfs
#!pip install fsspec

From the Transformers library, we import the necessary classes to load the model and the tokenizer.

The notebook is ready to work with different models. I tested it with models from the Bloom family and Llama-3.

I recommend testing different models.

## Hugging Face login

## Load Model

In [None]:
#Use any model you want, if you want to do some fast test, just use the smallest one.

#model_name = "bigscience/bloomz-560m"
#model_name="bigscience/bloom-1b1"
#model_name = "bigscience/bloom-7b1"
#target_modules = ["query_key_value"]

model_name = "bigscience/bloomz-560m"
target_modules = ["query_key_value"] #YOU MAY CHANGE THIS BASED ON YOUR MODEL
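
Since the right `target_modules` depend on the architecture, it can help to list the linear layers a model actually contains. The helper below is a hypothetical sketch, not part of PEFT: it takes the `(name, module)` pairs that a PyTorch model's `named_modules()` yields and collects the short names of layers whose class is called `Linear` or `Linear4bit` (the class name bitsandbytes uses for 4-bit layers). The stand-in classes exist only so the example runs without loading a model.

```python
def candidate_target_modules(named_modules, layer_types=("Linear", "Linear4bit")):
    """Return the sorted short names of linear-like layers.

    `named_modules` is any iterable of (full_name, module) pairs, for
    example the result of `model.named_modules()` on a PyTorch model.
    """
    names = set()
    for full_name, module in named_modules:
        if type(module).__name__ in layer_types:
            # Keep only the last path component, e.g. "query_key_value".
            names.add(full_name.split(".")[-1])
    return sorted(names)


# Stand-in classes so the sketch runs without torch (illustration only).
class Linear:
    pass

class LayerNorm:
    pass

fake_modules = [
    ("transformer.h.0.input_layernorm", LayerNorm()),
    ("transformer.h.0.self_attention.query_key_value", Linear()),
    ("transformer.h.0.self_attention.dense", Linear()),
    ("transformer.h.0.mlp.dense_h_to_4h", Linear()),
    ("lm_head", Linear()),
]
print(candidate_target_modules(fake_modules))

# In the notebook, after loading the model, the call would look like:
# candidate_target_modules(foundation_model.named_modules())
```

The output head (often called `lm_head`) will also show up in such a list; it is usually left out of `target_modules`.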

To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll achieve this with the BitsAndBytesConfig from the Transformers library.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

We are specifying the use of 4-bit quantization and also enabling double quantization, which quantizes the quantization constants themselves to save additional memory.

For the bnb_4bit_quant_type parameter, I've used the recommended value in the paper [QLoRA: Efficient Finetuning of Quantized LLMs.](https://arxiv.org/abs/2305.14314)
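
The effect of `bnb_4bit_use_double_quant` can be estimated with the figures given in the QLoRA paper: weights are quantized in blocks of 64 with one 32-bit constant per block, and double quantization stores those constants in 8 bits, with one extra 32-bit constant per 256 blocks. The sketch below only reproduces that arithmetic; the function names are made up for this example, and the 7 billion parameter count is a round-number assumption.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    # Extra bits per parameter for one constant per block of weights.
    return constant_bits / block_size

def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_bits=8, second_bits=32):
    # 8-bit constants per block, plus one 32-bit constant
    # for every `second_block_size` first-level constants.
    return first_bits / block_size + second_bits / (block_size * second_block_size)

plain = constant_overhead_bits()        # overhead without double quantization
double = double_quant_overhead_bits()   # overhead with double quantization
saved_bits = plain - double

NUM_PARAMS = 7e9  # assumption: a round 7B-parameter model
saved_mb = saved_bits * NUM_PARAMS / 8 / 1e6

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved on a 7B model: about {saved_mb:.0f} MB")
```

The saving works out to about 0.37 bits per parameter, or a little over 300 MB for a 7B model under these assumptions.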

Now, we can go ahead and load the model.

In [None]:
device_map = {"": 0}
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map=device_map,
                    use_cache = False)



Now we have the quantized version of the model in memory. You can try to load the unquantized version to see if it's possible.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

## Inference with the pre-trained model.
I'm going to do a test with the pre-trained model without fine-tuning, to see if something changes after the fine-tuning.

In [None]:
# Return the tokens generated by the given model for the given inputs.
def get_outputs(model, inputs, max_new_tokens=100): # PLAY WITH ARGS AS YOU SEE FIT
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5, # Avoid repetition.
        early_stopping=False, # Only relevant for beam search; no effect on the default greedy decoding.
        eos_token_id=tokenizer.eos_token_id,
    )
    return outputs

The dataset used for the fine-tuning contains prompts to be used with Large Language Models.

I'm going to ask the pre-trained model to act like a motivational coach.

In [None]:
#Inference original model
input_sentences = tokenizer("acts like a motivational coach", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(foundation_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

In [None]:
input_sentences = tokenizer("As a motivational coach, you recommand me : ", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(foundation_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

The answer is good enough; the model used is a really well-trained one. But we will try to improve the quality with a short fine-tuning process.


## Preparing the Dataset.
The dataset used is:

https://huggingface.co/datasets/fka/awesome-chatgpt-prompts

In [None]:
from datasets import load_dataset, load_from_disk
import shutil
import os

# This will remove your local datasets cache (be careful: deletes ALL datasets cache!)
shutil.rmtree(os.path.expanduser("~/.cache/huggingface/datasets"), ignore_errors=True)

In [None]:
data = load_dataset("fka/awesome-chatgpt-prompts")

data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
train_sample = data["train"].select(range(50))

del data
train_sample = train_sample.remove_columns('act')

display(train_sample)

In [None]:
print(train_sample[:1])

## Fine-Tuning.
The first step will be to create a LoRA configuration object where we will set the variables that specify the characteristics of the fine-tuning process.

In [None]:
# TARGET_MODULES
# https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220
lora_config = LoraConfig(
    r=16, # The larger r is, the more parameters are trained.
    lora_alpha=16, # Scaling factor for the LoRA update; higher values give the new training more weight.
    target_modules=target_modules,
    lora_dropout=0.05, # Helps to avoid overfitting.
    bias="none", # Specifies which bias parameters, if any, should be trained.
    task_type="CAUSAL_LM"
)

The most important parameter is **r**: it sets the rank of the LoRA matrices and therefore how many parameters are trained. A larger value means more parameters to train, and it also lets the model learn more complicated relations between inputs and outputs.

You can find a list of the **target_modules** available in the [Hugging Face PEFT repository](https://github.com/huggingface/peft/blob/39ef2546d5d9b8f5f8a7016ec10657887a867041/src/peft/utils/other.py#L220).

**lora_alpha**. The larger this number, the more weight the LoRA activations carry, so the fine-tuning has more impact as the value grows.

**lora_dropout** works like ordinary dropout and is used to avoid overfitting.

**bias**. I hesitated between *none* and *lora_only*. For text classification the most common value is *none*; for chat or question answering, *all* or *lora_only*.

**task_type**. Indicates the task the model is being trained for. In this case, text generation.
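
To make the role of **r** and **lora_alpha** concrete, the number of trainable parameters can be estimated by hand. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two small matrices holding `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / r`. The sketch below assumes the published shape of bloomz-560m (hidden size 1024, 24 transformer blocks, and a `query_key_value` layer mapping 1024 to 3 × 1024); these figures are assumptions worth checking against the model config.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed architecture of bigscience/bloomz-560m (check model.config to confirm).
hidden_size = 1024
num_layers = 24

r = 16
lora_alpha = 16

# query_key_value projects hidden_size -> 3 * hidden_size in each block.
per_layer = lora_param_count(hidden_size, 3 * hidden_size, r)
total = per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the LoRA update

print(f"LoRA parameters per block: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"scaling factor (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters add about 1.6 million trainable parameters, a small fraction of the roughly 560 million in the base model. After wrapping a model with PEFT, `print_trainable_parameters()` reports the actual figure.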

In [None]:
#Create a directory to contain the Model
import os
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")

In TrainingArguments we specify the number of epochs we want to train, the output directory, and the learning rate.

In [None]:
#Creating the TrainingArgs
import transformers
from transformers import TrainingArguments # , Trainer
training_args = TrainingArguments(
    output_dir=output_directory,
    auto_find_batch_size=True, # Retry with a smaller batch size if training runs out of GPU memory.
    learning_rate= 2e-4, # Higher learning rate than full fine-tuning.
    num_train_epochs=5
)

In [None]:
!pip show trl

Now we can train the model.
To train the model we need:

* The model.
* The training_args.
* The dataset.
* The data collator, which gets the dataset ready to be processed in batches.
* The LoRA config.





In [None]:
train_sample = train_sample.rename_column("prompt", "text")

In [None]:
!pip uninstall -y trl
!pip install git+https://github.com/huggingface/trl.git

In [None]:
from trl import SFTTrainer
help(SFTTrainer)

In [None]:
tokenizer.pad_token = tokenizer.eos_token
trainer = SFTTrainer(
    model=foundation_model,
    args=training_args,
    train_dataset=train_sample,
    peft_config = lora_config,
    #dataset_text_field="text",
    processing_class=tokenizer,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()

In [None]:
#Save the model.
peft_model_path = os.path.join(output_directory, f"lora_model")


In [None]:
trainer.model.save_pretrained(peft_model_path)

In [None]:
# Free some GPU memory before reloading the model from disk.
import gc
import torch
del foundation_model
del trainer
del train_sample
torch.cuda.empty_cache()
gc.collect()

## Load the fine-tuned model

In [None]:
#import peft
from peft import AutoPeftModelForCausalLM, PeftConfig
#import os

device_map = {"": 0}
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")
peft_model_path = os.path.join(output_directory, f"lora_model")


In [None]:
bnb_config2 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
#Load the Model.
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        #torch_dtype=torch.bfloat16,
                                        is_trainable=False,
                                        #load_in_4bit=True,
                                        quantization_config=bnb_config2,
                                        device_map = 'cuda')

## Inference with the fine-tuned model.

In [None]:
input_sentences = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

The result is really good. Let's compare the answer of the pre-trained model with the fine-tuned one:

* **Pretrained Model**: 'I want you to act as a motivational coach. \xa0You are going on an adventure with me, and I need your help.\nWe will be traveling through the land of “What If.” \xa0 This is not some place that exists in reality; it’s more like one those places we see when watching'

* **Fine-Tuned Model**: 'I want you to act as a motivational coach.  I will provide some information about an individual or group of people who need motivation, and your role is help them find the inspiration they require in order achieve their goals successfully! You can use techniques such as positive reinforcement, visualization exercises etc., depending on what'

As you can see, the result is really similar to the samples contained in the dataset used to fine-tune the model. And we only trained the model for a few epochs and with a really small number of rows.

 - Complete the prompts similar to what we did in class.
     - Try a few versions if you have time
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where the model either hallucinated or was wrong
 - What did you learn?

## Exploring with a bigger model: bloom-1b1

### 1. Update model name and target modules

In [None]:
model_name = "bigscience/bloom-1b1"
target_modules = ["query_key_value"]

### 2. Reload tokenizer & model with quantization

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

foundation_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # "auto" may better balance layers if VRAM is tight
    use_cache=False
)

### 3. LoRA Config

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

### 4. Reload and Adjust training sample size

In [None]:
from datasets import load_dataset
dataset = "fka/awesome-chatgpt-prompts"

#Create the Dataset to create prompts.
data = load_dataset(dataset)

data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
train_sample = data["train"].select(range(20)) # downsize for test run

del data
train_sample = train_sample.remove_columns('act')

display(train_sample)

### 5. SFTTrainer

In [None]:
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model=foundation_model,             # quantized base model
    args=training_args,                 # TrainingArguments (learning rate, epochs, etc.)
    train_dataset=train_sample,         # your tokenized dataset
    peft_config=lora_config,            # LoRA config
    processing_class=tokenizer,         # required instead of tokenizer=...
    data_collator=transformers.DataCollatorForLanguageModeling(
        tokenizer, mlm=False            # causal LM task = NOT masked LM
    )
)

trainer.train()

#### Save the model

In [None]:
#Create a directory to contain the Model
import os
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")

In [None]:
#Save the model.
peft_model_path = os.path.join(output_directory, f"lora_model_bloom-1b1")

trainer.model.save_pretrained(peft_model_path)

## Reload Fine-Tuned bloom-1b1 with LoRA (4-bit QLoRA)

In [None]:
#import peft
from peft import AutoPeftModelForCausalLM, PeftConfig
#import os

device_map = {"": 0}
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")
peft_model_path = os.path.join(output_directory, f"lora_model_bloom-1b1")

In [None]:
bnb_config2 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
#Load the Model.
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        #torch_dtype=torch.bfloat16,
                                        is_trainable=False,
                                        #load_in_4bit=True,
                                        quantization_config=bnb_config2,
                                        device_map = 'cuda')

### Inference with the fine-tuned model.

In [None]:
input_sentences = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

## Exploring with a bigger model: mistralai/Mistral-7B-v0.1

In [None]:
!pip install -q fsspec==2025.3.2
!pip install -q git+https://github.com/huggingface/peft.git
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q git+https://github.com/huggingface/trl.git
!pip install -q bitsandbytes accelerate

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer
import torch

In [None]:
from huggingface_hub import login
from getpass import getpass

# Prompt user to input HF token (hidden input)
hf_token = getpass("🔐 Enter your Hugging Face token: ")

login(token=hf_token)
print("✅ Logged in to Hugging Face.")

### 1. Model name and target modules

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

### 2. Tokenizer & model with quantization

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)  # `token` replaces the deprecated `use_auth_token`
tokenizer.pad_token = tokenizer.eos_token

foundation_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # "auto" may better balance layers if VRAM is tight
    use_cache=False,
    token=True  # `token` replaces the deprecated `use_auth_token`
)

### 3. LoRA config

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

### 4. Reload and Adjust training sample size

In [None]:
from datasets import load_dataset
dataset = "fka/awesome-chatgpt-prompts"

#Create the Dataset to create prompts.
data = load_dataset(dataset)

data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
train_sample = data["train"].select(range(20)) # downsize for test run

del data
train_sample = train_sample.remove_columns('act')

display(train_sample)

#### Setting TrainingArguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./peft_lab_outputs/checkpoints_mistral",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    learning_rate=2e-4,
    logging_steps=5,
    save_steps=10,
    save_total_limit=2,
    fp16=True,  # use bf16=True if you're on A100 with bfloat16
    report_to="none"
)
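
These settings determine how many optimizer updates the run performs. The sketch below works through the arithmetic for the 20-row sample used here, assuming a single GPU; `training_schedule` is a helper invented for this illustration, and the floor division is a simplification that is exact in this case because 20 divides evenly by 4.

```python
import math

def training_schedule(num_samples, per_device_batch_size,
                      grad_accum_steps, epochs, num_devices=1):
    """Return (effective batch size, updates per epoch, total updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, updates_per_epoch, updates_per_epoch * epochs

effective, per_epoch, total = training_schedule(
    num_samples=20, per_device_batch_size=1, grad_accum_steps=4, epochs=5
)
print(f"effective batch size: {effective}")
print(f"updates per epoch:    {per_epoch}")
print(f"total updates:        {total}")
```

With an effective batch size of 4, each epoch has 5 updates and the whole run has 25. With `logging_steps=5` the loss should therefore be logged about once per epoch, and with `save_steps=10` checkpoints would be written at steps 10 and 20.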

### 5. SFTTrainer

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model=foundation_model,             # quantized base model
    args=training_args,                 # TrainingArguments (learning rate, epochs, etc.)
    train_dataset=train_sample,         # your tokenized dataset
    peft_config=lora_config,            # LoRA config
    processing_class=tokenizer,         # required instead of tokenizer=...
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)          # causal LM task = NOT masked LM
)

trainer.train()

#### Save the model

In [None]:
#Create a directory to contain the Model
import os
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")

In [None]:
#Save the model.
peft_model_path = os.path.join(output_directory, f"lora_model_Mistral-7B")

trainer.model.save_pretrained(peft_model_path)

### Reload Fine-Tuned mistralai/Mistral-7B-v0.1 with LoRA (4-bit QLoRA)

In [None]:
#import peft
from peft import AutoPeftModelForCausalLM, PeftConfig
#import os

device_map = "auto"
working_dir = './'

output_directory = os.path.join(working_dir, "peft_lab_outputs")
peft_model_path = os.path.join(output_directory, f"lora_model_Mistral-7B")

In [None]:
bnb_config2 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
#Load the Model.
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        peft_model_path,
                                        #torch_dtype=torch.bfloat16,
                                        is_trainable=False,
                                        #load_in_4bit=True,
                                        quantization_config=bnb_config2,
                                        device_map = 'auto')

#### Define outputs function for Mistral-7B model

In [None]:
def get_outputs(model, inputs, max_new_tokens=100):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask", None),
        max_new_tokens=max_new_tokens,
        do_sample=True,                # Enable sampling for more creative output
        top_p=0.95,                    # Nucleus sampling
        temperature=0.7,               # Control randomness
        repetition_penalty=1.2,        # Lower than 1.5 to avoid over-penalizing
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,  # Required for some decoding
    )
    return outputs
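
To show what `temperature` and `top_p` do, here is a small pure-Python sketch of the two ideas on a toy distribution. It is a simplified illustration, not the code that `generate` runs internally, and both function names are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by a temperature below 1 sharpens the distribution;
    # a temperature above 1 flattens it.
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def top_p_indices(probs, top_p=0.95):
    # Keep the most likely tokens until their cumulative
    # probability reaches top_p (nucleus sampling).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(f"temperature={temperature}: "
          f"probs={[round(p, 3) for p in probs]}, "
          f"kept={top_p_indices(probs, top_p=0.95)}")
```

The next token is then sampled only from the kept indices, so a lower `top_p` or a lower `temperature` both make the output more conservative.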


#### Inference with the fine-tuned model.

In [None]:
input_sentences = tokenizer("I want you to act as a motivational coach. ", return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

#### Inference with the fine-tuned model on additional prompts.

In [None]:
prompts = [
    "Act as a startup advisor.",
    "You are a helpful travel planner.",
    "Pretend you're a Shakespearean poet.",
]

In [None]:
input_sentences = tokenizer(prompts[0], return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

In [None]:
input_sentences = tokenizer(prompts[1], return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

In [None]:
input_sentences = tokenizer(prompts[2], return_tensors="pt").to('cuda')
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)

print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))

#### Batch multiple prompts for faster inference:

In [None]:
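# Decoder-only models should be left-padded for batched generation.
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to('cuda')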
outputs = get_outputs(loaded_model, inputs, max_new_tokens=60)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

## My observation

My Lab Feedback — Exploring LoRA Fine-Tuning with Larger Models

This lab was particularly engaging because it gave me hands-on experience with scaling instruction tuning using more capable base models like bloom-1b1 and mistralai/Mistral-7B-v0.1, all while keeping memory usage low via 4-bit quantization.

I started by revisiting the bloom-1b1 architecture, adapting my LoRA target modules and confirming that quantization + LoRA still provided coherent completions. Even on a tiny training set, the fine-tuned model returned contextual responses aligned with the intended prompts. The process helped solidify my understanding of how instruction-tuned models learn patterns even with minimal updates.

Then I moved on to Mistral-7B, which required slightly more care due to access restrictions and increased resource demands. Switching to an A100 GPU and handling gated repo access manually reminded me of real-world deployment constraints. Once configured, Mistral’s performance was impressive: it generated stylistically rich and relevant completions across various prompts (startup advice, travel planning, and poetry) even after just a few training epochs.

Key takeaways for me:
- I now feel confident setting up and training LoRA adapters on quantized models.
- I learned how to troubleshoot common issues (HF token gating, session resets, CUDA memory balancing).
- It was insightful to see how larger base models can produce much more expressive text with only lightweight tuning.
- Batch inference and prompt engineering clearly play a major role in how effective the final model is during deployment.

This lab didn’t just deepen my understanding of fine-tuning — it also gave me a reusable workflow for testing other Hugging Face LLMs with LoRA adapters in resource-constrained environments. I’m looking forward to applying this to more realistic datasets and exploring evaluation metrics next.

