### Parameter Efficient Fine-Tuning
In this notebook, you will fine-tune a large language model within limited GPU memory.

In [1]:
# Original library versions
# %pip install --quiet transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2

# Preferred versions for Colab as of October 2025 (thanks, Lev!)
%pip install --quiet "bitsandbytes==0.45.3" "transformers>=4.43,<4.46" "accelerate>=0.33,<0.36" "peft>=0.11.1" "optimum>=1.20.0" "sentencepiece"

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import transformers
from tqdm.auto import tqdm, trange
assert torch.cuda.is_available(), "you need cuda for this part"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
model_name = 'Enoch/llama-7b-hf'

# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)  # tokenizers run on CPU; no device_map needed
tokenizer.pad_token_id = tokenizer.eos_token_id

# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



### Prompt tuning: the story of a fox (1 point)

![img](https://i.imgur.com/Ux3qQAu.png) (source: theodd1souts.fandom.com)

In [4]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

for i in range(10):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist()))



Output: <s>A quick brown fox jumps over the lazy dog.
A quick


What a blatant lie! This particular fox assures you that it didn't in fact jump over the lazy dog. No, sir! The fox was just minding its own business. __Your task is to train the model to tell the truth: no dog was jumped over today.__

In [5]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
outputs = model(**batch)

next_word_logits = outputs.logits[:, :-1]
true_next_tokens = batch['input_ids'][:, 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

print("Loss:", loss)

Loss: tensor(3.0729, device='cuda:0', grad_fn=<NllLossBackward0>)


Except we can't train the entire model: the gradients alone would take about 28 GB in float32 (roughly 7 billion parameters × 4 bytes each). Instead, let's run [prompt tuning](https://arxiv.org/abs/2104.08691).

![img](https://i.imgur.com/VwNNKnb.png)
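
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The parameter count (about 7 billion), the hidden size (4096) and the number of prompt vectors (16) are assumptions chosen for illustration, not values measured from the loaded model.

```python
# Back-of-the-envelope memory estimate (assumed sizes, not measured values)
n_params = 7e9            # approximate parameter count of a 7B model (assumption)
bytes_per_float32 = 4

full_grad_gb = n_params * bytes_per_float32 / 1e9
print(f"float32 gradients for the full model: ~{full_grad_gb:.0f} GB")

hidden_size = 4096        # assumed embedding width of Llama-7B
num_prompt_vectors = 16   # illustrative number of trainable prompt vectors
prompt_params = num_prompt_vectors * hidden_size
prompt_grad_kb = prompt_params * bytes_per_float32 / 1024
print(f"prompt tuning: {prompt_params} trainable values, ~{prompt_grad_kb:.0f} KB of gradients")
```

Under these assumptions, prompt tuning shrinks the trainable state from tens of gigabytes to a few hundred kilobytes, which is why it fits next to a 4-bit model on a single GPU.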


In [6]:
class WordEmbeddingsWithLearnedPrompts(nn.Module):
    """
    To perform prompt tuning, you will need to replace the model's original word embeddings with a layer - THIS layer
    - that inserts trainable prompts instead of the first N token embeddings.
    """

    def __init__(self, word_embeddings: nn.Embedding, num_prompts: int):
        super().__init__()
        self.original_word_embeddings = word_embeddings
        self.num_prompts = num_prompts
        self.learnable_prompts = nn.Parameter(
            torch.randn(1, num_prompts, word_embeddings.embedding_dim), requires_grad=True
        )

    def forward(self, input_ids: torch.LongTensor):
        # input_ids shape: [batch_size, seq_length]
        assert input_ids.dtype == torch.int64
        assert input_ids.shape[1] > self.num_prompts
        assert torch.all(input_ids[:, :self.num_prompts] == tokenizer.pad_token_id).item(), (
            "Don't forget to prepend num_prompts PAD tokens to input_ids"
        )

        # Embed the input_ids using the original word embeddings
        input_embeddings = self.original_word_embeddings(input_ids)  # Shape: [batch_size, seq_length, embedding_dim]

        # Replace the first num_prompts token embeddings with the learnable prompts
        batch_size = input_ids.shape[0]
        learnable_prompts_expanded = self.learnable_prompts.expand(batch_size, -1, -1)  # Shape: [batch_size, num_prompts, embedding_dim]
        remaining_embeddings = input_embeddings[:, self.num_prompts:, :]  # Shape: [batch_size, seq_length - num_prompts, embedding_dim]

        # Concatenate learnable prompts with the embeddings of the remaining tokens
        output_embeddings = torch.cat([learnable_prompts_expanded, remaining_embeddings], dim=1)

        return output_embeddings


In [7]:
num_prompts = 16
test_emb_layer = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)
test_input_ids = tokenizer("a cat sat on a mat", return_tensors='pt')['input_ids'].to(device)

space_for_prompts = torch.full([len(test_input_ids), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
test_inputs_with_prompts = torch.cat([space_for_prompts, test_input_ids], dim=1)

with torch.autocast(device_type='cuda'):  # torch.cuda.amp.autocast() is deprecated
    test_prompt_embeddings = test_emb_layer(test_inputs_with_prompts)

assert test_prompt_embeddings.shape[:2] == test_inputs_with_prompts.shape
assert test_prompt_embeddings.shape[-1] == model.config.hidden_size
assert torch.allclose(test_prompt_embeddings[:, :num_prompts], test_emb_layer.learnable_prompts.float())
assert torch.allclose(test_prompt_embeddings[:, num_prompts:], model.model.embed_tokens(test_input_ids).float())
print("Looks legit!")

Looks legit!




__Now that it works,__ let's inject learnable prompts into the main model and teach it about foxes.

In [8]:
assert isinstance(model.model.embed_tokens, nn.Embedding), "you have already replaced the embedding layer. If the replacement is broken, please reload the model"

model.model.embed_tokens = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)

opt = torch.optim.Adam([model.model.embed_tokens.learnable_prompts], lr=0.01)

In [9]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
space_for_prompts = torch.full([len(batch['input_ids']), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)

outputs = model(**batch)
next_word_logits = outputs.logits[:, num_prompts : -1, :]
true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))
print("Loss:", loss)


#raise NotImplemented("Your task: iteratively train the model to reduce loss using prompt optimizer (opt)")

model.train()
num_epochs = 100

for epoch in range(num_epochs):
    opt.zero_grad()

    outputs = model(**batch)
    next_word_logits = outputs.logits[:, num_prompts : -1, :]
    true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
    loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

    loss.backward()
    opt.step()

    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Loss: tensor(7.6538, device='cuda:0', grad_fn=<NllLossBackward0>)
Epoch 0, Loss: 7.6538
Epoch 20, Loss: 2.5694
Epoch 40, Loss: 0.4408
Epoch 60, Loss: 0.0512
Epoch 80, Loss: 0.0190


In [10]:
# Final loss assertion
assert loss.item() <= 0.1
print("Good job!")

Good job!


In [11]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)


for i in range(15):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0, num_prompts:].cpu().numpy().tolist()))

# if you did everything right, the model will deny that the fox jumped over the lazy dog


Output: <s>A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Using HuggingFace PEFT (2 points)

[`peft`](https://huggingface.co/docs/peft/index) is a sister library of `transformers` that lets you apply various __p__arameter-__e__fficient __f__ine-__t__uning methods to pre-trained transformers. The library implements prompt tuning, prefix tuning, and several adapter-based techniques under a common interface:



In [12]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,
)


In [13]:
import peft
assert isinstance(model.model.embed_tokens, nn.Embedding), "please reload the model"

peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM, num_virtual_tokens=16)
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies model in-place
print("Trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))
print("Total parameters (excluding quantization):", sum(p.numel() for p in model.parameters()))

Trainable parameters: 65536
Total parameters (excluding quantization): 3500478464


In [14]:
# Your task: optimize the PEFT-wrapped model to achieve next token prediction loss < 0.1, but this time using PEFT
# Please note: you no longer need to prepend PAD tokens, but you still need to skip :num_virtual_tokens: first logits.
# Finally, generate the sentence to make sure that the model learned the truth.

In [15]:
# Define the ground truth sentence
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

In [16]:
# Training Configuration
loss_threshold = 0.1  # Desired loss threshold
num_epochs = 100  # Max number of epochs
learning_rate = 0.01  # Learning rate

# Define the optimizer over the trainable parameters only (the PEFT prompt embeddings)
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=learning_rate)

# Define the ground truth
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

# Training Loop
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(**batch)

    # Skip logits for virtual tokens and the last token
    num_virtual_tokens = model.peft_config["default"].num_virtual_tokens
    next_word_logits = outputs.logits[:, num_virtual_tokens:-1, :]  # Skip virtual tokens
    true_next_tokens = batch['input_ids'][:, 1:]  # Shift ground truth tokens by one

    # Compute the loss
    loss = F.cross_entropy(
        next_word_logits.flatten(0, 1),
        true_next_tokens.flatten(0, 1)
    )

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss for tracking
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}")

    # Stop training if loss is below threshold
    if loss.item() < loss_threshold:
        print("Loss threshold reached. Stopping training.")
        break
else:
    print("Maximum epochs reached without meeting the loss threshold.")

Epoch 1/100, Loss: 8.107062339782715
Epoch 2/100, Loss: 7.172728061676025
Epoch 3/100, Loss: 6.555593490600586
Epoch 4/100, Loss: 5.788492202758789
Epoch 5/100, Loss: 4.948576927185059
Epoch 6/100, Loss: 4.367550373077393
Epoch 7/100, Loss: 3.864560842514038
Epoch 8/100, Loss: 3.5208868980407715
Epoch 9/100, Loss: 3.2717020511627197
Epoch 10/100, Loss: 3.0810165405273438
Epoch 11/100, Loss: 2.9245119094848633
Epoch 12/100, Loss: 2.8051042556762695
Epoch 13/100, Loss: 2.711143732070923
Epoch 14/100, Loss: 2.6316347122192383
Epoch 15/100, Loss: 2.559340000152588
Epoch 16/100, Loss: 2.4900951385498047
Epoch 17/100, Loss: 2.4222800731658936
Epoch 18/100, Loss: 2.3562161922454834
Epoch 19/100, Loss: 2.2931833267211914
Epoch 20/100, Loss: 2.23398756980896
Epoch 21/100, Loss: 2.178670883178711
Epoch 22/100, Loss: 2.1272459030151367
Epoch 23/100, Loss: 2.080005645751953
Epoch 24/100, Loss: 2.037116289138794
Epoch 25/100, Loss: 1.9978734254837036
Epoch 26/100, Loss: 1.9604932069778442
Epoch 27/

In [17]:
# Final assertion to ensure loss is below threshold
assert loss.item() < loss_threshold, "Training failed to reduce loss below threshold."
print("Training successful! Loss is below 0.1.")

Training successful! Loss is below 0.1.


In [18]:
prompt = "A quick brown fox"
batch = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(device)

# Generate 15 tokens
for i in range(15):
    # Forward pass to get the logits
    outputs = model(**batch)
    next_token = outputs.logits[0, -1].argmax(-1).reshape(1, 1)

    # Append the next token to input_ids
    batch["input_ids"] = torch.cat([batch["input_ids"], next_token], dim=-1)

    # Update the attention_mask to match the new input_ids length
    new_attention_mask = torch.ones_like(next_token, dtype=batch["attention_mask"].dtype).to(device)
    batch["attention_mask"] = torch.cat([batch["attention_mask"], new_attention_mask], dim=-1)

# Decode the generated sequence
# PEFT prepends the virtual tokens internally, so input_ids holds only real tokens and needs no slicing
decoded_output = tokenizer.decode(batch["input_ids"][0].cpu().numpy().tolist(), skip_special_tokens=True)
print("\nOutput:", decoded_output)



Output: A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Parameter-efficient finetuning with LoRA (2 points)

When training on more serious tasks, you can use low-rank adapters based on the [LoRA paper](https://arxiv.org/pdf/2106.09685.pdf).

The core idea is to add low-rank adapters __in parallel with existing linear layers,__ like this:
<center><img src="https://i.imgur.com/6bQLNiG.png" width=240px></center>

In the original LoRA paper, the adapters were only added to attention projection matrices. However, [subsequent works](https://arxiv.org/abs/2305.14314) show that it is useful to adapt FFNs as well. But before we do any training, we need to implement the basic LoRA layer.
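
As a small illustration of the idea (a toy sketch with made-up sizes, not the layer you are asked to implement), the snippet below checks that an adapter applied in parallel is the same function as one dense layer whose weight received a low-rank update, and compares parameter counts. The names `W`, `A`, `B` and the dimensions are hypothetical.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 64, 48, 4      # toy sizes chosen for illustration

W = torch.randn(d_out, d_in)       # frozen weight, same layout as nn.Linear.weight
A = torch.randn(d_in, rank) * 0.1  # trainable down-projection
B = torch.randn(rank, d_out) * 0.1 # trainable up-projection (zero-initialized in practice)

x = torch.randn(5, d_in)

# adapter applied in parallel with the frozen layer
parallel_out = x @ W.T + (x @ A) @ B

# the same function written as one dense layer with a low-rank weight update
merged_out = x @ (W + (A @ B).T).T

outputs_match = torch.allclose(parallel_out, merged_out, atol=1e-4)
full_params = W.numel()
adapter_params = A.numel() + B.numel()
print("parallel adapter == merged weight:", outputs_match)
print(f"full layer: {full_params} weights, adapter: {adapter_params} weights")
```

In real LoRA training `B` starts at zero, so the adapted model initially behaves exactly like the pre-trained one; random values are used here only so that the equivalence check is not trivial.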

In [19]:
# re-load the model to remove any previous PEFT tuners
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False
model.gradient_checkpointing_enable()
model.enable_input_require_grads()


In [20]:
class LoRALayer(nn.Module):
    """Wraps an existing (frozen) linear layer with a LoRA-like low-rank adapter."""
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module  # pre-trained (frozen) linear layer
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))

    def forward(self, input):
        # Apply self.module and LoRA adapter, return the sum (self.module outputs + adapter outputs)
        original_output = self.module(input)
        lora_output = input @ self.adapter_A @ self.adapter_B

        return original_output + lora_output

In [21]:
# test your implementation
test_linear = nn.Linear(128, 128)
test_linear.weight.data[...] = torch.eye(128)
test_adapter = LoRALayer(test_linear, rank=8)

assert torch.allclose(test_adapter(torch.ones(1, 1, 128)), test_linear.bias + 1), "please check your forward pass"

test_adapter.adapter_A.data[...] = torch.linspace(0.1, -0.5, 128 * 8).view(128, 8)
test_adapter.adapter_B.data[...] = torch.linspace(0.5, -0.1, 128 * 8).view(8, 128)
test_linear.bias.data[...] = torch.linspace(1., -1., 128)

dummy_loss = F.mse_loss(test_adapter(torch.ones(1, 128) / 128).squeeze(), torch.linspace(-1, 1, 128))
assert torch.allclose(dummy_loss, torch.tensor(1.3711389), rtol=0, atol=1e-4)
dummy_loss.backward()
assert all(w.grad is not None for w in [test_adapter.adapter_A, test_adapter.adapter_B]), "some adapter weights have no grad"
assert torch.allclose(test_adapter.adapter_A.grad.sum(), torch.tensor(-0.60158), rtol=0, atol=1e-4), "bad grad w.r.t. A"
assert torch.allclose(test_adapter.adapter_B.grad.sum(), torch.tensor(0.9931), rtol=0, atol=1e-4), "bad grad w.r.t. B"
# note: bad grad means that your code is different from LoRA paper OR that your code is not autograd-friendly (e.g. no_grad)
del dummy_loss, test_linear, test_adapter
print("All tests passed!")

All tests passed!


### Apply LoRA to the model

The code below applies LoRA adapters on top of Q/K/V linear layers in Llama attention. You may also choose to modify other layers:
* self_attn.o_proj - attention output projection
* mlp.up_proj, mlp.gate_proj, mlp.down_proj - transformer feedforward layers
* lm_head - output LM head
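
To get a feel for how much each choice costs, here is a rough count of wrapped layers and trainable adapter weights. It assumes the usual Llama-7B shapes (32 decoder layers, hidden size 4096, MLP width 11008) and rank 8; these numbers are assumptions taken from the published model configuration, and `lora_params` is a hypothetical helper.

```python
num_layers, hidden, mlp_width, rank = 32, 4096, 11008, 8  # assumed Llama-7B shapes

def lora_params(in_features, out_features, rank):
    """Adapter A is [in_features, rank], adapter B is [rank, out_features]."""
    return rank * (in_features + out_features)

# q_proj, k_proj, v_proj are all hidden -> hidden
attn_qkv = 3 * lora_params(hidden, hidden, rank)
# gate_proj and up_proj are hidden -> mlp_width, down_proj is mlp_width -> hidden
mlp = 2 * lora_params(hidden, mlp_width, rank) + lora_params(mlp_width, hidden, rank)

qkv_layers = 3 * num_layers
qkv_mlp_layers = 6 * num_layers
qkv_total = num_layers * attn_qkv
qkv_mlp_total = num_layers * (attn_qkv + mlp)

print("wrapped layers, Q/K/V only:", qkv_layers)
print("wrapped layers, Q/K/V + MLP:", qkv_mlp_layers)
print("trainable weights, Q/K/V only:", qkv_total)
print("trainable weights, Q/K/V + MLP:", qkv_mlp_total)
```

Under these assumptions, even adapting all attention inputs and MLP projections keeps the trainable weights well under one percent of the 7B frozen ones.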

In [22]:
lora_rank = 8

for name, module in model.model.layers.named_modules():
    if 'LlamaDecoderLayer' in repr(type(module)):
        module.self_attn.q_proj = LoRALayer(module.self_attn.q_proj, rank=lora_rank).to(device)
        module.self_attn.k_proj = LoRALayer(module.self_attn.k_proj, rank=lora_rank).to(device)
        module.self_attn.v_proj = LoRALayer(module.self_attn.v_proj, rank=lora_rank).to(device)

assert sum(isinstance(module, LoRALayer) for module in model.modules()) == 96  # for Llama-7B

In [23]:
batch = tokenizer("This model wants to share its greatest secret:", return_tensors='pt', return_token_type_ids=False).to(device)
# test a single training step, make sure we get meaningful gradients
with torch.autocast(device_type='cuda', dtype=torch.float32):  # torch.cuda.amp.autocast() is deprecated
    out = model.forward(**batch)
    (out.logits.norm() / 100).backward()

for i, module in enumerate(model.modules()):
    if isinstance(module, LoRALayer):
        assert module.adapter_B.grad is not None
        assert module.adapter_B.grad.norm().item() > 0

model.zero_grad(set_to_none=True)
print("Grad check successful, well done!")



Grad check successful, well done!


### (example) How to train your model

The example below shows how to train the LoRA adapters on a dummy dataset. You will need to run a _similar_ training task later.

__Note:__ please scroll down for the homework task

In [24]:
# # checking if the model can learn. Change max_steps for proper training
# import datasets
# # NOTE: this is just an example! you do not have to wait for this progressbar to finish :)

### Final task: *actually* train the model (5 points)

Your task is to fine-tune the model to _generate python code_. Please use the above examples for inspiration. More specifically,

* __dataset:__ use [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean) or any other data containing python code. Since you do not need much data for this exercise, a small subset of the `codeparrot-clean` train split is enough
* __preprocessing:__ select python code based on file extensions (.py)  (you may skip this for codeparrot - it is 100% python)
* __short lines:__ please take the first 512 characters of each line
* __adapter type:__ please use LoRA as defined above __plus at least one of:__
   - extra adapter on lm_head
   - extra adapter on MLP components (mlp.*)
   - trainable input embeddings (requires tweaking memory usage)

* __training:__ you do not have to train to convergence. If all goes well, your model should `.generate` code after 500 steps. Please use a batch size of at least 4 (4 x 1 x 512 tokens) using `gradient_accumulation_steps=4`.
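
For intuition about gradient accumulation, the toy sketch below (tiny linear models and random data, all names hypothetical) checks that four micro-batches of size 1, each with its loss divided by the number of accumulation steps, produce the same gradients as one batch of 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model_a = nn.Linear(8, 1)
model_b = nn.Linear(8, 1)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

inputs, targets = torch.randn(4, 8), torch.randn(4, 1)
accumulation_steps = 4

# (a) one backward pass over the full batch of 4
F.mse_loss(model_a(inputs), targets).backward()

# (b) four micro-batches of size 1; scale each loss so the gradients sum to the batch mean
for x, y in zip(inputs.split(1), targets.split(1)):
    micro_loss = F.mse_loss(model_b(x), y) / accumulation_steps
    micro_loss.backward()  # gradients accumulate in .grad until zero_grad() is called

weights_match = torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6)
biases_match = torch.allclose(model_a.bias.grad, model_b.bias.grad, atol=1e-6)
print("accumulated gradients match the full-batch gradients:", weights_match and biases_match)
```

The same pattern applies to the real model: call `backward()` on each scaled micro-batch loss and step the optimizer once per group of micro-batches, so only one sequence needs to be in memory at a time.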


Note: the peft library also has a LoRA implementation. However, we ask that for this assignment you show at least one complete training run with your own LoRA code.

__Alternative assignment:__ Instead of doing python code, feel free to substitute the task with any other dataset, e.g. your favorite artist or podcast, as long as it's ethical. If you choose your own task, please show examples of what your model learned - or did not learn, akin to the code examples below.

In [25]:
# %pip install --quiet "datasets" "accelerate>=0.33,<0.36" "transformers>=4.43,<4.46" "bitsandbytes==0.45.3" "sentencepiece"

In [26]:
# model_name = 'Enoch/llama-7b-hf'

# tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)
# tokenizer.pad_token_id = tokenizer.eos_token_id

# model = transformers.AutoModelForCausalLM.from_pretrained(
#     model_name,
#     device_map='auto',
#     low_cpu_mem_usage=True,
#     offload_state_dict=True,
#     load_in_4bit=True,
#     torch_dtype=torch.float32,
# )

# # Freeze all weights
# for param in model.parameters():
#     param.requires_grad = False

# model.gradient_checkpointing_enable()
# model.enable_input_require_grads()

In [27]:
lora_rank = 8

for name, module in model.model.layers.named_modules():
    if 'LlamaDecoderLayer' in repr(type(module)):
        try:
            if hasattr(module, 'mlp') and isinstance(module.mlp.gate_proj, nn.Linear):
                module.mlp.gate_proj = LoRALayer(module.mlp.gate_proj, rank=lora_rank).to(device)
                module.mlp.up_proj = LoRALayer(module.mlp.up_proj, rank=lora_rank).to(device)
                module.mlp.down_proj = LoRALayer(module.mlp.down_proj, rank=lora_rank).to(device)
        except AttributeError:
            continue  # skip layers that do not expose the expected MLP projections

total_lora_layers = sum(isinstance(module, LoRALayer) for module in model.modules())
print(f"Total: {total_lora_layers}.")

Total: 192.


In [28]:
!pip install --quiet datasets
import torch
import torch.nn as nn
import transformers
from transformers import BitsAndBytesConfig, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from IPython.display import HTML, display


block_size = 256
dataset = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
dataset = dataset.shuffle(seed=42, buffer_size=1000).take(650)

all_text = " ".join(item['content'][:512] for item in dataset)
tokenized = tokenizer(all_text, truncation=False, add_special_tokens=False)

input_ids = tokenized['input_ids']
# step through the token stream in block_size strides; the range bound guarantees every slice is full-length
chunks = [
    input_ids[i : i + block_size]
    for i in range(0, len(input_ids) - block_size, block_size)
]

print(f"Total chunks: {len(chunks)}")


Total chunks: 417


In [29]:
import torch

class ChunkDataset(Dataset):
    def __init__(self, chunks):
        self.chunks = chunks
    def __len__(self):
        return len(self.chunks)
    def __getitem__(self, idx):
        return torch.tensor(self.chunks[idx], dtype=torch.long)

train_dataset = ChunkDataset(chunks)
data_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

In [31]:
prompts = ['', 'import', 'from', 'while', 'try', 'if', 'for', 'torch']

def generate_samples(model, prompts, max_new_tokens=80):
    model.eval()
    results = []
    with torch.no_grad():
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt").to(device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                top_p=0.95,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )
            gen = tokenizer.decode(outputs[0], skip_special_tokens=True)
            results.append(gen)
    return results
# <A WHOLE LOT OF YOUR CODE>
# generate baseline samples with the selected prompts before finetuning
# please feel free to use transformers.Trainer (as above) or your custom training code
# after the training concludes, please show examples of text generated by your model. It is expected to look like Python code fragments
# print the generation examples nicely (suggestion: use pandas or HTML) for easier comparison
# note: your LoRA-enhanced model can run generation the same way as the non-trained model (above)

In [32]:
print("Generating BEFORE fine-tuning...")
before_samples = generate_samples(model, prompts)

Generating BEFORE fine-tuning...


In [33]:
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)  # only the LoRA adapters are trainable
model.train()

data_iter = iter(data_loader)
steps = 0
max_steps = 500
grad_accum = 4

pbar = tqdm(total=max_steps, desc="Training LoRA on Python code")

while steps < max_steps:
    optimizer.zero_grad()
    total_loss = 0.0

    for _ in range(grad_accum):
        try:
            batch = next(data_iter).to(device)
        except StopIteration:
            data_iter = iter(data_loader)
            batch = next(data_iter).to(device)

        if batch.shape[1] > block_size:
            batch = batch[:, :block_size]

        outputs = model(input_ids=batch, labels=batch)
        loss = outputs.loss / grad_accum
        loss.backward()
        total_loss += loss.item()

    optimizer.step()
    steps += 1
    pbar.set_postfix({'loss': total_loss})
    pbar.update(1)

pbar.close()
print("Training completed!")

Training completed!
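
A note on the `loss = outputs.loss / grad_accum` line in the loop above: dividing each micro-batch loss by the number of accumulation steps makes the summed gradients match those of one large batch, provided the loss is a mean and the micro-batches have equal size. The sketch below checks this on a toy linear layer. The layer, the random data and the MSE loss are stand-ins chosen for illustration, not part of the notebook's training setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real model: one linear layer with a mean-reduced loss.
toy_model = nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = nn.MSELoss()  # averages over the batch, similar in spirit to outputs.loss

# Reference: a single backward pass over the full batch of 8 examples.
toy_model.zero_grad()
loss_fn(toy_model(x), y).backward()
full_grad = toy_model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2 examples, each loss divided by grad_accum.
grad_accum = 4
toy_model.zero_grad()
for xb, yb in zip(x.chunk(grad_accum), y.chunk(grad_accum)):
    (loss_fn(toy_model(xb), yb) / grad_accum).backward()  # gradients add up in .grad
accum_grad = toy_model.weight.grad.clone()

print("gradients match:", torch.allclose(full_grad, accum_grad, atol=1e-6))
```

With language-model batches the match is only approximate when micro-batches contain different numbers of non-padding tokens, because each micro-batch loss is averaged over its own tokens.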


In [34]:
model.config.use_cache = True
print("Generating AFTER fine-tuning...")
after_samples = generate_samples(model, prompts)

Generating AFTER fine-tuning...


In [37]:
# This template helps to compare generated code samples in pretty table form
# feel free to present your work in other forms

from IPython.display import HTML, display
table_template = """<table style="border:1px solid black" >
  <tr>
    <th style="text-align: center; border:1px solid black">PROMPT</th>
    <th style="text-align: center; border:1px solid black">BEFORE</th>
    <th style="text-align: center; border:1px solid black">AFTER</th>
  </tr>
{}
</table>"""

row_template = '''  <tr>
    <td style="width:20%; border:1px solid black"><pre align="left">`{}`</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
  </tr>'''

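import html

rows = []

for prompt, before, after in zip(prompts, before_samples, after_samples):
    # escape <, > and & so that generated code is shown verbatim instead of being parsed as HTML
    rows.append(row_template.format(
        html.escape(prompt),
        html.escape(before),
        html.escape(after)
    ))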

display(HTML(table_template.format('\n'.join(rows))))

PROMPT,BEFORE,AFTER
``,"Anchor & Cline The Anchor & Cline, formerly known as the Anchor & Crab, is a pub in the centre of the village of Eccleshall, Staffordshire. It is the only pub in the village and is the focal point for many village events such as the annual May Day celebrations. The Anchor & Cline is one of","describe(""Backbone.History"", function(){  var root = {};  var history = {};  // TODO: switch to backbone.js #88  var previous = {};  var current = {};  var navigation = {};  function _start(options){  options || (options = {});  if (!options"
`import`,import Foundation class SHOWCASE_HEADER_PAGE_SECTIONS {  let sections: NSArray  init() {  sections = NSArray(  objects: NSArray()  )  } },"import os import shutil import subprocess import tempfile import textwrap from ..util import reporter, echo class NoPythonAvailable:  """"""A helper class to raise an appropriate error when no Python is installed.""""""  def __init__(self, short_description):  self.short_description = short_description  self."
`from`,"from __future__ import absolute_import from __future__ import print_function import abc import inspect import os import sys from . import exc from . import util from . import _discovery def _is_path_containing_package(path):  """"""  Returns True if path is a path to a",from django.contrib import messages from django.contrib.auth.decorators import login_required from django.core import mail as django_mail from django.db.models import Q from django.shortcuts import render from django.utils.translation import ugettext as _ from couchdbkit.exceptions import ResourceNotFound from applications.
`while`,"while (true) while (true) {  if (x > 0) {  return true;  }  else {  return false;  } } \end{code} Comment: It's not a ""while"" loop. Comment: @TimSchmelter The OP said he is looking for a way to","while True:  # do some stuff  continue  # do some more stuff  break \end{code} Note that a `break` statement is \strong{not} the same as an `continue` statement. This can be important in some cases, as for example if you have a loop with a `break` statement, but no `continue"
`try`,try our best to make you satisfied!,"try:  from StringIO import StringIO except ImportError:  from io import StringIO import unittest from te """""" Copyright (c) 2015, Frappe Technologies and contributors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the ""License""); you"
`if`,"if [[ $HOSTSFILE && -x $HOSTSFILE ]] then  HOSTSFILE=""/etc/hosts""  echo ""Updating hosts file""  sed -i """" ""s/.*\(^| \).*$/$HOSTSFILE/g"" $HOSTSFILE  sed -i """" ""s/.*\(^| \).",if(class_exists('PEAR_I18N')) {  PEAR_I18N::setDefaultDateTimeFormat(DateTime::ATOM); } /**  * PEAR PHP XML Component  *  * Copyright (c) 2005-2013 Brett Watkins  *  * This program is free software: you
`for`,for the 2018-19 season. Academy of Music and Ballet is a 501(c)3 non-profit organization.,"for pkg in pkgs:  print(""Processing %s"" % pkg)  # Do stuff with the package ...  # ...  print(""Done"") all # do stuff ... # This will be called at the very end of the program. # You can do stuff like this: # print ""Goodbye"" exit"
`torch`,torchbearers.net - The official site of the 2012 Summer Olympics torch relay. torchrelay2012.com - The official site of the 2012 Summer Olympics torch relay. 2012 Olympics torchbearer - The official site of the 2012 Summer Olympics torchbearer.,"torch_module_license = ''' Copyright (c) 2014 CERN.  This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version."


In [38]:
print("=== RESULTS ANALYSIS ===")
print(f"Number of prompts: {len(prompts)}")
print(f"Lengths of generated texts before training: {[len(s) for s in before_samples]}")
print(f"Lengths of generated texts after training: {[len(s) for s in after_samples]}")

# Quick check of Python code generation quality
for i, prompt in enumerate(prompts):
    print(f"\n--- Prompt: '{prompt}' ---")
    print("BEFORE training (excerpt):", before_samples[i][:100])
    print("AFTER training (excerpt):", after_samples[i][:100])

=== RESULTS ANALYSIS ===
Number of prompts: 8
Lengths of generated texts before training: [294, 177, 276, 221, 35, 184, 89, 244]
Lengths of generated texts after training: [258, 327, 327, 290, 263, 206, 242, 296]

--- Prompt: '' ---
BEFORE training (excerpt): Anchor &amp; Cline
The Anchor & Cline, formerly known as the Anchor & Crab, is a pub in the centre o
AFTER training (excerpt): 
describe("Backbone.History", function(){

  var root = {};
  var history = {};

  // TODO: switch t

--- Prompt: 'import' ---
BEFORE training (excerpt): import Foundation

class SHOWCASE_HEADER_PAGE_SECTIONS {
    let sections: NSArray
    
    init() {
AFTER training (excerpt): import os
import shutil
import subprocess
import tempfile
import textwrap

from ..util import report

--- Prompt: 'from' ---
BEFORE training (excerpt): from __future__ import absolute_import
from __future__ import print_function

import abc
import insp
AFTER training (excerpt): from django.contrib import messages
from django.contr

If you reach this: congratulations! You've completed everything in this practice session.

If you want to dig deeper, try to implement prompt-tuning (for bonus points!).
You can read more about prompt tuning variants in paper [1](https://arxiv.org/abs/2104.08691) or paper [2](https://arxiv.org/abs/2101.00190). Both versions can be implemented by passing trainable prompts as `model.forward(..., past_key_values=your_prompts)`.
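
As a starting point, here is a minimal sketch of the input-level variant from paper [1]: a few trainable "virtual token" vectors are prepended to the frozen word embeddings. This is an alternative to the `past_key_values` route mentioned above, not the same thing. The `SoftPrompt` class, the toy embedding table and all sizes are hypothetical, chosen only so the example runs on its own.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends `num_virtual_tokens` trainable vectors to frozen token embeddings."""

    def __init__(self, embed_tokens: nn.Embedding, num_virtual_tokens: int = 8):
        super().__init__()
        self.embed_tokens = embed_tokens
        for p in self.embed_tokens.parameters():
            p.requires_grad = False  # the base word embeddings stay frozen
        hidden = embed_tokens.embedding_dim
        # the only trainable weights: one vector per virtual token
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed_tokens(input_ids)  # [batch, seq, hidden]
        prefix = self.prompt.unsqueeze(0).expand(tok.shape[0], -1, -1)
        return torch.cat([prefix, tok], dim=1)  # [batch, virtual + seq, hidden]


# Toy embedding table standing in for the model's word embeddings (an assumption).
toy_embeddings = nn.Embedding(100, 16)
soft_prompt = SoftPrompt(toy_embeddings, num_virtual_tokens=8)

input_ids = torch.randint(0, 100, (2, 5))
inputs_embeds = soft_prompt(input_ids)
trainable = [name for name, p in soft_prompt.named_parameters() if p.requires_grad]
print(inputs_embeds.shape, trainable)
```

With a real causal LM, the result would typically be passed as `model(inputs_embeds=...)` instead of `input_ids`. The attention mask and the labels then need to be extended by the same number of positions, with the label value `-100` on the virtual tokens so that they are ignored by the loss.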



### Read more

* How post-training quantization works: https://arxiv.org/abs/2208.07339
* An overview of running large models: https://huggingface.co/docs/accelerate/package_reference/big_modeling
* A general library for different adapter types: https://adapterhub.ml/


### [extra info] Running other models.

This notebook's code can run with other models of similar size, such as [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b), [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b) or [BLOOM-7.1B](https://huggingface.co/bigscience/bloom-7b1). However, they will require minor code tweaks:
1. Change the model name in `AutoModelForCausalLM.from_pretrained()` __and__ `AutoTokenizer`.
2. In the prompt tuning code, change `model.model.embed_tokens` to refer to the target model's word embeddings. Simply `print(model)` to navigate to them.
3. Change the code that adds LoRA layers, specifically the part where you select the transformer block components, since those components have different names in other models.
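
For steps 2 and 3, reading the full `print(model)` output can be tedious. One option is to list the distinct names of the linear layers programmatically. The helper `linear_layer_suffixes` and the toy block with its layer names are hypothetical, written only to illustrate the idea.

```python
import torch.nn as nn


def linear_layer_suffixes(model: nn.Module) -> list[str]:
    """Return the sorted, de-duplicated last name components of all nn.Linear submodules."""
    return sorted({
        name.split(".")[-1]
        for name, module in model.named_modules()
        if name and isinstance(module, nn.Linear)
    })


class ToyBlock(nn.Module):
    """Stand-in for a transformer block; the attribute names are made up."""

    def __init__(self):
        super().__init__()
        self.query_key_value = nn.Linear(8, 24)
        self.dense = nn.Linear(8, 8)
        self.norm = nn.LayerNorm(8)  # not a Linear layer, so it is not listed


toy_model = nn.Sequential(ToyBlock(), ToyBlock())
suffixes = linear_layer_suffixes(toy_model)
print(suffixes)
```

On a real checkpoint, the printed names are the candidates for LoRA wrapping. For the word embeddings, Hugging Face models also expose `model.get_input_embeddings()`, which avoids hard-coding an attribute path such as `model.model.embed_tokens`.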