# Using PEFT with custom models

`peft` allows us to fine-tune models efficiently with LoRA. In this short notebook, we will demonstrate how to train a simple multilayer perceptron (MLP) using `peft`.

## Imports

Make sure that you have the latest version of `peft` installed. To ensure that, run this in your Python environment:
    
    python -m pip install --upgrade peft

In [None]:
import copy
import os

# ignore bnb warnings
os.environ["BITSANDBYTES_NOWELCOME"] = "1"

In [2]:
import peft
import torch
from torch import nn
import torch.nn.functional as F

In [3]:
torch.manual_seed(0)

<torch._C.Generator at 0x7f906c95f510>

## Data

We will create a toy dataset consisting of random data for a classification task. The label depends on the sum of the features, so there is signal in the data and we should expect the model's loss to improve during training.

In [4]:
X = torch.rand((1000, 20))
y = (X.sum(1) > 10).long()
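
As a quick sanity check on this labelling rule, the sketch below re-creates it with Python's standard `random` module instead of `torch` (an assumption made only to keep the example dependency-free; the helper name `make_labels` and the seed are illustrative). Each of the 20 features is uniform on [0, 1), so the row sum has mean 10 and the two classes should come out roughly balanced.

```python
import random


def make_labels(n_samples=1000, n_features=20, threshold=10.0, seed=0):
    """Label a row 1 when the sum of its uniform features exceeds the threshold."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_samples):
        row = [rng.random() for _ in range(n_features)]
        labels.append(int(sum(row) > threshold))
    return labels


labels = make_labels()
positive_fraction = sum(labels) / len(labels)
print(f"fraction of positive labels: {positive_fraction:.3f}")
```

The exact fraction depends on the random draw, but it should stay close to 0.5.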

In [5]:
n_train = 800
batch_size = 64

In [6]:
train_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[:n_train], y[:n_train]),
    batch_size=batch_size,
    shuffle=True,
)
eval_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[n_train:], y[n_train:]),
    batch_size=batch_size,
)

## Model

As a model, we use a simple multilayer perceptron (MLP). For demonstration purposes, we use a very large number of hidden units. This is totally overkill for this task but it helps to demonstrate the advantages of `peft`. In more realistic settings, models will also be quite large on average, so this is not far-fetched.

In [7]:
class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)
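
To get a feeling for how large this model is, here is a small back-of-the-envelope count of its parameters. It is a sketch in plain Python: `linear_params` is a hypothetical helper, not part of `torch` or `peft`, and it relies only on the fact that a linear layer stores one weight per input/output pair plus one bias per output. The `ReLU` and `LogSoftmax` layers have no parameters.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


num_units_hidden = 2000
layer_shapes = [
    (20, num_units_hidden),                # seq.0
    (num_units_hidden, num_units_hidden),  # seq.2
    (num_units_hidden, 2),                 # seq.4
]
per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_shapes]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the roughly 4 million parameters sit in the hidden-to-hidden layer.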

## Training

Here are just a few training hyper-parameters and a simple function that performs the training and evaluation loop.

In [8]:
lr = 0.002
batch_size = 64
max_epochs = 30
device = "cuda" if torch.cuda.is_available() else "cpu"

In [9]:
def train(model, optimizer, criterion, train_dataloader, eval_dataloader, epochs):
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for xb, yb in train_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            outputs = model(xb)
            loss = criterion(outputs, yb)
            train_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        for xb, yb in eval_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            with torch.no_grad():
                outputs = model(xb)
            loss = criterion(outputs, yb)
            eval_loss += loss.detach().float()

        eval_loss_total = (eval_loss / len(eval_dataloader)).item()
        train_loss_total = (train_loss / len(train_dataloader)).item()
        print(f"{epoch=:<2}  {train_loss_total=:.4f}  {eval_loss_total=:.4f}")
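
One detail of the reported numbers: the loop averages the per-batch mean losses, so every batch counts equally regardless of its size. With 800 training and 200 evaluation samples and a batch size of 64, the last batch of each loader is smaller than the others. The sketch below works this out in plain Python; the helper names and the per-batch loss values are made up for illustration.

```python
def batch_sizes(n_samples, batch_size):
    """Batch sizes when the last, smaller batch is kept (no dropping)."""
    full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def mean_of_batch_means(batch_losses):
    """What the loop reports: every batch counts equally."""
    return sum(batch_losses) / len(batch_losses)


def per_sample_mean(batch_losses, sizes):
    """Alternative: weight each batch by the number of samples in it."""
    return sum(loss * n for loss, n in zip(batch_losses, sizes)) / sum(sizes)


train_sizes = batch_sizes(800, 64)
eval_sizes = batch_sizes(200, 64)
print(len(train_sizes), train_sizes[-1])
print(len(eval_sizes), eval_sizes[-1])

made_up_losses = [0.2, 0.2, 0.2, 1.0]  # hypothetical per-batch eval losses
print(mean_of_batch_means(made_up_losses))
print(per_sample_mean(made_up_losses, eval_sizes))
```

For a toy example like this one the difference hardly matters, but it explains why a single unusual final batch can move the reported eval loss noticeably.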

### Training without peft

Let's start without using `peft` to see what we can expect from the model training.

In [10]:
module = MLP().to(device)
optimizer = torch.optim.Adam(module.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
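
A side note on this pairing: the model already ends in `LogSoftmax`, and `nn.CrossEntropyLoss` applies a log-softmax to its input as well. This is harmless, because log-softmax leaves values that are already log-probabilities unchanged (`nn.NLLLoss` would be the more direct counterpart). The sketch below illustrates the idempotence in plain Python; `log_softmax` here is a hand-written helper and the logits are made up.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_norm for v in values]


logits = [2.0, -1.0]              # made-up raw scores for two classes
log_probs = log_softmax(logits)   # what the final LogSoftmax layer outputs
again = log_softmax(log_probs)    # what the loss applies on top

print(log_probs)
print(again)
```

Both printed lists agree up to floating point error, since the probabilities already sum to 1 and the second normalization subtracts log(1) = 0.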

In [11]:
%time train(module, optimizer, criterion, train_dataloader, eval_dataloader, epochs=max_epochs)

epoch=0   train_loss_total=0.7970  eval_loss_total=0.6472
epoch=1   train_loss_total=0.5597  eval_loss_total=0.4898
epoch=2   train_loss_total=0.3696  eval_loss_total=0.3323
epoch=3   train_loss_total=0.2364  eval_loss_total=0.5454
epoch=4   train_loss_total=0.2428  eval_loss_total=0.2843
epoch=5   train_loss_total=0.1251  eval_loss_total=0.2514
epoch=6   train_loss_total=0.0952  eval_loss_total=0.2068
epoch=7   train_loss_total=0.0831  eval_loss_total=0.2395
epoch=8   train_loss_total=0.0655  eval_loss_total=0.2524
epoch=9   train_loss_total=0.0380  eval_loss_total=0.3650
epoch=10  train_loss_total=0.0363  eval_loss_total=0.3495
epoch=11  train_loss_total=0.0231  eval_loss_total=0.2360
epoch=12  train_loss_total=0.0162  eval_loss_total=0.2276
epoch=13  train_loss_total=0.0094  eval_loss_total=0.2716
epoch=14  train_loss_total=0.0065  eval_loss_total=0.2237
epoch=15  train_loss_total=0.0054  eval_loss_total=0.2366
epoch=16  train_loss_total=0.0035  eval_loss_total=0.2673
epoch=17  trai

Okay, so we got an eval loss of ~0.26, which is much better than random guessing (about 0.69, i.e. ln 2, for two balanced classes).

### Training with peft

Now let's train with `peft`. First we check the names of the modules, so that we can configure `peft` to fine-tune the right modules.

In [12]:
[(n, type(m)) for n, m in MLP().named_modules()]

[('', __main__.MLP),
 ('seq', torch.nn.modules.container.Sequential),
 ('seq.0', torch.nn.modules.linear.Linear),
 ('seq.1', torch.nn.modules.activation.ReLU),
 ('seq.2', torch.nn.modules.linear.Linear),
 ('seq.3', torch.nn.modules.activation.ReLU),
 ('seq.4', torch.nn.modules.linear.Linear),
 ('seq.5', torch.nn.modules.activation.LogSoftmax)]

Next we can define the LoRA config. There is nothing special going on here. We set the LoRA rank to 8 and select the layers `seq.0` and `seq.2` to be used for LoRA fine-tuning. As for `seq.4`, which is the output layer, we list it in `modules_to_save`, which means it is also trained but no LoRA is applied.

*Note: Not all layer types can be fine-tuned with LoRA. At the moment, linear layers, embeddings, `Conv2D` and `transformers.pytorch_utils.Conv1D` are supported.*

In [13]:
config = peft.LoraConfig(
    r=8,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)
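
To make the role of the rank `r` more concrete, here is a minimal sketch of what a LoRA-adapted linear layer computes, written in plain Python rather than with `peft`. It follows the usual LoRA formulation: the frozen weight `W` is left untouched and a scaled low-rank product `B A` is added on top, with `A` of shape `(r, in)` and `B` of shape `(out, r)`. The helper names, toy sizes, values and the scaling of 1.0 are assumptions for illustration, not the library's implementation; biases are omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(x, weight, lora_a, lora_b, scaling):
    """Frozen base projection plus the scaled low-rank update B(Ax)."""
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy sizes (illustrative): 3 inputs, 2 outputs, rank 1.
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # frozen, shape (out, in)
lora_a = [[0.1, 0.2, 0.3]]                    # trainable, shape (r, in)
lora_b_init = [[0.0], [0.0]]                  # trainable, shape (out, r), zero at start
lora_b_trained = [[1.0], [-2.0]]              # made-up values after some training
x = [1.0, 2.0, 3.0]

print(lora_linear(x, weight, lora_a, lora_b_init, scaling=1.0))
print(lora_linear(x, weight, lora_a, lora_b_trained, scaling=1.0))
```

With `B` initialized to zero, the adapted layer starts out identical to the base layer, and only the small matrices `A` and `B` need to be trained.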

Now let's create the `peft` model by passing our initial MLP, as well as the config we just defined, to `get_peft_model`.

In [14]:
module = MLP().to(device)
module_copy = copy.deepcopy(module)  # we keep a copy of the original model for later
peft_model = peft.get_peft_model(module, config)
optimizer = torch.optim.Adam(peft_model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
peft_model.print_trainable_parameters()

trainable params: 56,164 || all params: 4,100,164 || trainable%: 1.369798866581922


Checking the numbers, we see that only about 1.4% of the parameters are actually trained, which is what we like to see.
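
The printed figures can be reproduced by hand. The sketch below is plain Python with hypothetical helper names. It assumes that LoRA adds a matrix of shape `(r, in)` and one of shape `(out, r)` per targeted layer, and that `seq.4` exists twice in the `peft` model (the original plus a trainable copy), with both copies counted as trainable. That last point is an inference that happens to match the output exactly, not something the output states.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def lora_params(in_features, out_features, r):
    """LoRA adds A with shape (r, in) and B with shape (out, r), no biases."""
    return r * in_features + out_features * r


r = 8
base = linear_params(20, 2000) + linear_params(2000, 2000) + linear_params(2000, 2)
lora = lora_params(20, 2000, r) + lora_params(2000, 2000, r)
head = linear_params(2000, 2)  # seq.4, listed in modules_to_save

all_params = base + lora + head  # base model + LoRA matrices + extra copy of seq.4
trainable = lora + 2 * head      # assumption: both copies of seq.4 are counted
print(f"{trainable:,} / {all_params:,} = {100 * trainable / all_params:.2f}%")
```

Under these assumptions, the LoRA matrices account for 48,160 of the trainable parameters and each copy of `seq.4` for another 4,002.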

Now let's start the training:

In [15]:
%time train(peft_model, optimizer, criterion, train_dataloader, eval_dataloader, epochs=max_epochs)

epoch=0   train_loss_total=0.6918  eval_loss_total=0.6518
epoch=1   train_loss_total=0.5975  eval_loss_total=0.6125
epoch=2   train_loss_total=0.5402  eval_loss_total=0.4929
epoch=3   train_loss_total=0.3886  eval_loss_total=0.3476
epoch=4   train_loss_total=0.2677  eval_loss_total=0.3185
epoch=5   train_loss_total=0.1938  eval_loss_total=0.2294
epoch=6   train_loss_total=0.1712  eval_loss_total=0.2653
epoch=7   train_loss_total=0.1555  eval_loss_total=0.2764
epoch=8   train_loss_total=0.1218  eval_loss_total=0.2104
epoch=9   train_loss_total=0.0846  eval_loss_total=0.1756
epoch=10  train_loss_total=0.0710  eval_loss_total=0.1873
epoch=11  train_loss_total=0.0372  eval_loss_total=0.1539
epoch=12  train_loss_total=0.0350  eval_loss_total=0.2348
epoch=13  train_loss_total=0.0298  eval_loss_total=0.4605
epoch=14  train_loss_total=0.0355  eval_loss_total=0.2208
epoch=15  train_loss_total=0.0099  eval_loss_total=0.1583
epoch=16  train_loss_total=0.0051  eval_loss_total=0.2042
epoch=17  trai

In the end, we see that the eval loss is very similar to the one we saw earlier when we trained without `peft`. This is quite nice to see, given that we are training a much smaller number of parameters.

#### Check which parameters were updated

Finally, just to check that LoRA was applied as expected, we check which original weights were updated and which weights stayed the same.

In [16]:
for name, param in peft_model.base_model.named_parameters():
    if "lora" not in name:
        continue

    print(f"New parameter {name:<13} | {param.numel():>5} parameters | updated")

New parameter model.seq.0.lora_A.default.weight |   160 parameters | updated
New parameter model.seq.0.lora_B.default.weight | 16000 parameters | updated
New parameter model.seq.2.lora_A.default.weight | 16000 parameters | updated
New parameter model.seq.2.lora_B.default.weight | 16000 parameters | updated


In [17]:
params_before = dict(module_copy.named_parameters())
for name, param in peft_model.base_model.named_parameters():
    if "lora" in name:
        continue

    name_before = name.partition(".")[-1].replace("original_", "").replace("module.", "").replace("modules_to_save.default.", "")
    param_before = params_before[name_before]
    if torch.allclose(param, param_before):
        print(f"Parameter {name_before:<13} | {param.numel():>7} parameters | not updated")
    else:
        print(f"Parameter {name_before:<13} | {param.numel():>7} parameters | updated")

Parameter seq.0.weight  |   40000 parameters | not updated
Parameter seq.0.bias    |    2000 parameters | not updated
Parameter seq.2.weight  | 4000000 parameters | not updated
Parameter seq.2.bias    |    2000 parameters | not updated
Parameter seq.4.weight  |    4000 parameters | not updated
Parameter seq.4.bias    |       2 parameters | not updated
Parameter seq.4.weight  |    4000 parameters | updated
Parameter seq.4.bias    |       2 parameters | updated
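
The reason `seq.4` shows up twice is the name mapping in the cell above: both the original output layer and its trainable copy map back to the same name in the plain MLP. The sketch below isolates that string handling. The three example names are inferred from the replacements in the cell and from the printed output, not copied from a run, and they may differ between `peft` versions; `original_name` is a hypothetical helper.

```python
def original_name(peft_name):
    """Map a parameter name inside the peft model back to its name in the plain MLP."""
    return (
        peft_name.partition(".")[-1]  # drop the leading "model." prefix
        .replace("original_", "")
        .replace("module.", "")
        .replace("modules_to_save.default.", "")
    )


examples = [
    "model.seq.0.weight",
    "model.seq.4.original_module.weight",
    "model.seq.4.modules_to_save.default.weight",
]
for name in examples:
    print(f"{name:<45} -> {original_name(name)}")
```

The first `seq.4` entry in the output (not updated) would then be the original layer, and the second one (updated) the copy that is actually trained.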


So we can see that, apart from the new LoRA weights that were added, only the last layer was updated. Since the LoRA weights and the last layer have comparatively few parameters, this gives us a big boost in efficiency.

#### Pushing the model to Hugging Face Hub

With the `peft` model, it is also very easy to push a model to the Hugging Face Hub. Below, we demonstrate how it works. It is assumed that you have a valid Hugging Face account and are logged in:

In [18]:
from huggingface_hub import delete_repo

In [19]:
user = "BenjaminB"  # put your user name here
model_name = "peft-lora-with-custom-model"
model_id = f"{user}/{model_name}"

In [20]:
peft_model.push_to_hub(model_id);

adapter_model.bin:   0%|          | 0.00/211k [00:00<?, ?B/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

As we can see, the adapter size is only 211 kB.
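
This size is roughly what a quick estimate predicts. The sketch below assumes that the uploaded file holds only the LoRA matrices and the trained copy of `seq.4`, stored as 32-bit floats; both points are assumptions, and the small remainder up to 211 kB is presumably serialization overhead.

```python
BYTES_PER_FLOAT32 = 4

lora_weights = 8 * (20 + 2000) + 8 * (2000 + 2000)  # A and B for seq.0 and seq.2
head_weights = 2000 * 2 + 2                          # seq.4 weight and bias
full_model = (20 * 2000 + 2000) + (2000 * 2000 + 2000) + head_weights

adapter_bytes = (lora_weights + head_weights) * BYTES_PER_FLOAT32
full_bytes = full_model * BYTES_PER_FLOAT32
print(f"adapter ~ {adapter_bytes / 1e3:.0f} kB, full model ~ {full_bytes / 1e6:.1f} MB")
```

Under these assumptions, storing the full model would take almost 80 times as much space as the adapter.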

Finally, as a cleanup step, you may want to delete the repo.

In [21]:
delete_repo(model_id)