# Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

## Introduction
Block-Diagonal LoRA (BD-LoRA) is a LoRA variant in which some LoRA factors are constrained to be block-diagonal. This allows faster serving by eliminating communication overhead when running inference on multiple GPUs. Despite the block-diagonal constraint, BD-LoRA performs on par with vanilla LoRA at similar parameter counts.

BD-LoRA is designed to be used with tensor parallelism, which means sharding the weights of a model among multiple GPUs. A popular sharding strategy is the [Megatron Sharding Strategy](https://arxiv.org/abs/1909.08053). For two linear layers $W_1$, $W_2$ that follow each other (for example the up and down projections in a transformer MLP module), we shard the first layer in a column-parallel way (which requires LoRA B to be block-diagonal) and the second layer in a row-parallel way (which requires LoRA A to be block-diagonal). For the attention module, the same can be achieved by treating the Q, K and V projections together as $W_1$ and the out projection as $W_2$, and sharding accordingly. This sharding allows a compatible inference engine to place each block of a block-diagonal factor on a different GPU, removing the need to communicate partial LoRA results among GPUs. The image below shows the exact sharding strategy and where it saves communication.

Paper: https://arxiv.org/html/2510.23346v1

<div>
<img src="bdlora-sharding.png" width="800"/>
</div>
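
The sharding described above can be checked numerically on toy matrices. The following is a minimal NumPy sketch, not the implementation used by PEFT or by any inference engine: the sizes, the variable names and the `block_diag` helper are made up for illustration. It shows that with a block-diagonal LoRA-B (column-parallel layer), each device can compute its slice of the LoRA output from its own shards alone, and that with a block-diagonal LoRA-A (row-parallel layer), each device produces a partial sum that can be added to the base layer's partial output before the all-reduce that a Megatron-style row-parallel layer performs anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                      # tensor-parallel degree = number of blocks
d_in, d_out, r = 8, 12, 4   # toy sizes, all divisible by tp


def block_diag(blocks):
    """Assemble a dense block-diagonal matrix from a list of 2-D blocks."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    dense = np.zeros((rows, cols))
    i = j = 0
    for b in blocks:
        dense[i:i + b.shape[0], j:j + b.shape[1]] = b
        i += b.shape[0]
        j += b.shape[1]
    return dense


# Column-parallel layer: LoRA-A is dense, LoRA-B is block-diagonal.
x = rng.normal(size=d_in)
A = rng.normal(size=(r, d_in))
B_blocks = [rng.normal(size=(d_out // tp, r // tp)) for _ in range(tp)]
full_col = block_diag(B_blocks) @ (A @ x)

# Device i holds a slice of A's rows and its own block of B; the input x is replicated.
A_shards = np.split(A, tp, axis=0)
col_shards = [B_blocks[i] @ (A_shards[i] @ x) for i in range(tp)]
col_ok = np.allclose(np.concatenate(col_shards), full_col)

# Row-parallel layer: LoRA-A is block-diagonal, LoRA-B is dense.
h = rng.normal(size=d_out)  # input of the second layer, sharded across devices
A2_blocks = [rng.normal(size=(r // tp, d_out // tp)) for _ in range(tp)]
B2 = rng.normal(size=(d_in, r))
full_row = B2 @ (block_diag(A2_blocks) @ h)

# Device i holds its slice of h, its block of A and a slice of B's columns.
h_shards = np.split(h, tp)
B2_shards = np.split(B2, tp, axis=1)
partials = [B2_shards[i] @ (A2_blocks[i] @ h_shards[i]) for i in range(tp)]
row_ok = np.allclose(sum(partials), full_row)

print(col_ok, row_ok)
```

In both cases the per-device computation only touches data that already lives on that device, which is the property the block-diagonal constraint is meant to provide.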

### Performance, rank and parameter count
BD-LoRA achieves similar performance to LoRA (see image below, or the `method_comparison` folder in the peft repository root) at the same parameter count. However, since one of the two LoRA factors in every adapted layer is block-diagonal, a BD-LoRA adapter has fewer parameters than a LoRA adapter of the same rank. BD-LoRA is therefore only competitive when its rank is increased to compensate. We provide example code for rank-matching at the end of this example notebook.

<div>
<img src="bdlora-performance.png" width="600"/>
</div>
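
To see where the parameter difference comes from, consider a single adapted layer with input dimension $d_\text{in}$, output dimension $d_\text{out}$, rank $r$ and $N$ blocks. These symbols are introduced here only for illustration. Counting the entries of the stored factors, and assuming only the blocks of a block-diagonal factor are stored, gives roughly:

$$
\begin{aligned}
\text{LoRA:} &\quad r\,d_\text{in} + r\,d_\text{out} \\
\text{BD-LoRA, B block-diagonal:} &\quad r\,d_\text{in} + \frac{r\,d_\text{out}}{N} \\
\text{BD-LoRA, A block-diagonal:} &\quad \frac{r\,d_\text{in}}{N} + r\,d_\text{out}
\end{aligned}
$$

At equal rank, the block-diagonal factor therefore contributes only $1/N$ of its dense parameter count, which is why the rank has to grow to reach the same total.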


In [1]:
from peft.tuners import BdLoraConfig, LoraConfig
from peft import get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch


## Quick Start
To use BD-LoRA, we can follow standard LoRA training procedures. We only need to change the `LoraConfig` to a `BdLoraConfig` and specify which LoRA factors should be block-diagonal. As an example, we will train a Llama model in such a way that it can later benefit from the inference speed-up described in the BD-LoRA paper. However, BD-LoRA can be used with any other model that follows a transformer architecture.

As explained in the introduction, we want to shard each module (MLP and attention) in an alternating fashion, first column-parallel with LoRA-B block-diagonal, then row-parallel with LoRA-A block-diagonal. Different from standard MLP modules, Llama also uses a gate projection, which we can fuse together with the up-projection.

Therefore, we want the following block-diagonal factors (following the naming convention from the Llama architecture):

- LoRA-A Block-Diagonal (Row-parallel sharding): Out (`o_proj`), Down (`down_proj`)
- LoRA-B Block-Diagonal (Column-parallel sharding): QKV (`q_proj, k_proj, v_proj`), Up+Gate (`up_proj, gate_proj`)

Additionally, we need to know before training on how many GPUs we want to serve the model, as this determines the number of blocks used for each block-diagonal factor. For this experiment, we will use 2 blocks (equivalent to a tensor-parallelism degree of 2). Caveat: a small model such as Llama 3.2-1B, which we use here, would normally be served on a single GPU; TP=2 or TP=8 would only be used for larger models such as Llama 3.1-8B or Llama 3.3-70B, respectively.

In [2]:
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [3]:
config = BdLoraConfig(
    r=16,
    # adjust target modules and the ...is_blockdiagonal attributes according to model architecture (for example renaming)
    target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj", "o_proj", "down_proj"],
    lora_a_is_blockdiagonal=["o_proj", "down_proj"],
    lora_b_is_blockdiagonal=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj"],
    # Set this equal to the number of GPUs you want to serve the model with later
    nblocks=2,
    lora_bias=False
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 7,471,104 || all params: 1,243,285,504 || trainable%: 0.6009


## Training
We train the model for 10 steps. This training block is only intended to showcase how BD-LoRA integrates with other Hugging Face tools.

In [4]:
dataset = load_dataset("imdb", split="train[:1%]")

tokenizer.pad_token = tokenizer.eos_token
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    logging_steps=1,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

peft_model.config.use_cache = False
trainer.train()

Step,Training Loss
1,3.4596
2,3.4596
3,3.3699
4,3.2575
5,3.1918
6,3.1439
7,3.1054
8,3.0747
9,3.0516
10,3.0361


TrainOutput(global_step=10, training_loss=3.2150065660476685, metrics={'train_runtime': 35.802, 'train_samples_per_second': 71.504, 'train_steps_per_second': 0.279, 'total_flos': 1882785054720000.0, 'train_loss': 3.2150065660476685, 'epoch': 10.0})

## Example Output

In [5]:
text = "The Batman Trilogy by Christopher Nolan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)  

outputs = peft_model.generate(**inputs, max_length=50)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The Batman Trilogy by Christopher Nolan
The Batman Trilogy by Christopher Nolan
The Batman Trilogy by Christopher Nolan is a movie that I have never seen. I have seen Batman Begins and The Dark Knight, but not this trilogy. I was not expecting this


## Investigating the shapes of LoRA Adapters
We can inspect the adapter shapes to see whether they follow the sharding patterns discussed above. To make the implementation more memory-efficient, the block-diagonal matrices are not stored as dense block-diagonal matrices; instead, only the blocks are stored, stacked along the non-rank dimension.

For example, if a layer is column-sharded, such as `q_proj` in Llama, then its LoRA-B factor is block-diagonal. If the layer weight has shape (out_features, in_features), then LoRA-A has shape (rank, in_features) and LoRA-B has shape (out_features, rank/TP), which corresponds to TP blocks of shape (out_features/TP, rank/TP) each. This can be checked by inspecting the weight shapes (here for `v_proj`, which is sharded the same way):

In [6]:
shape_base = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight'].shape)
shape_a = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight'].shape)
shape_b = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight'].shape)
print(f"Base layer has shape:    [{shape_base[0]}, {shape_base[1]}]\nLoRA-A (vanilla):        [{shape_a[0]},  {shape_a[1]}]\nLoRA-B (block-diagonal): [{shape_b[0]}, {shape_b[1]}   ]")

Base layer has shape:    [512, 2048]
LoRA-A (vanilla):        [16,  2048]
LoRA-B (block-diagonal): [512, 8   ]
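
To make the stacked storage concrete, the sketch below expands a stacked LoRA-B of the shape printed above into the dense block-diagonal matrix it represents. It is an illustration under assumptions: the helper `stacked_to_dense` is hypothetical and not part of PEFT, and it assumes that block `i` occupies rows `i * out_features / TP` to `(i + 1) * out_features / TP` of the stacked tensor, as the shape description above suggests.

```python
import numpy as np


def stacked_to_dense(stacked, nblocks):
    """Expand blocks stacked along dim 0 into a dense block-diagonal matrix.

    Hypothetical helper: `stacked` has shape (out_features, rank / nblocks),
    the result has shape (out_features, rank).
    """
    out_features, r_per_block = stacked.shape
    rows = out_features // nblocks
    dense = np.zeros((out_features, r_per_block * nblocks), dtype=stacked.dtype)
    for i in range(nblocks):
        row_slice = slice(i * rows, (i + 1) * rows)
        col_slice = slice(i * r_per_block, (i + 1) * r_per_block)
        dense[row_slice, col_slice] = stacked[row_slice]
    return dense


# Same shape as the LoRA-B factor of v_proj printed above: [512, 8] with 2 blocks.
stacked_b = np.ones((512, 8), dtype=np.float32)
dense_b = stacked_to_dense(stacked_b, nblocks=2)
print(dense_b.shape)  # (512, 16)
```

The dense form has twice as many entries as the stacked form for 2 blocks, and half of them are zeros, which is the memory the stacked layout avoids.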


## Matching the rank
Assuming we want to match the performance of a vanilla LoRA adapter of a given rank (32 in the cell below), at which rank do we have to train BD-LoRA? We can find this out by matching the number of trainable parameters. A simple iteration over the ranks of the BD-LoRA adapter is sufficient:

In [None]:
def rank_to_params(r: int, bd_lora: bool):
    model = AutoModelForCausalLM.from_pretrained(model_name)
    if bd_lora:
        config = BdLoraConfig(
            r=r,
            target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj", "o_proj", "down_proj"],
            lora_a_is_blockdiagonal=["o_proj", "down_proj"],
            lora_b_is_blockdiagonal=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj"],
            nblocks=2,
            lora_bias=False
        )
    else:
        config = LoraConfig(
            r=r,
            # If you use a model different from Llama, change the settings below
            target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj", "o_proj", "down_proj"],
            lora_bias=False
        )


    peft_model = get_peft_model(model, config)
    return peft_model.get_nb_trainable_parameters()[0]

lora_rank = 32
lora_nparams = rank_to_params(lora_rank, False)
r = lora_rank
bdlora_nparams = 0
# increase the rank in steps of 2 so that it stays divisible by nblocks=2
while bdlora_nparams < lora_nparams:
    r += 2
    bdlora_nparams = rank_to_params(r, True)
# subtract 2 again to be just under the parameter count of vanilla LoRA, following the original paper's methodology
print(f"BD-LoRA rank to match vanilla LoRA performance at rank {lora_rank}: {r-2} at {lora_nparams} vanilla LoRA params and {rank_to_params(r-2, True)} BD-LoRA params.")

BD-LoRA rank to match vanilla LoRA performance at rank 32: 48 at 22544384 vanilla LoRA params and 22413312 BD-LoRA params.
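
The loop above reloads the model for every candidate rank. As a lighter alternative, the same counts can be obtained in closed form from the layer shapes. The sketch below is an illustration, not part of PEFT: the dimensions are hard-coded assumptions for Llama-3.2-1B (hidden size 2048, MLP size 8192, key/value projection size 512, 16 layers) rather than read from the model config, and `count_params` and `matching_rank` are hypothetical helper names. It also assumes that a block-diagonal factor stores only its blocks.

```python
# Assumed Llama-3.2-1B dimensions (hard-coded, not read from the model config).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 2048, 8192, 512, 16

# module name -> (in_features, out_features, factor that is block-diagonal in BD-LoRA)
MODULES = {
    "q_proj": (HIDDEN, HIDDEN, "B"),
    "k_proj": (HIDDEN, KV_DIM, "B"),
    "v_proj": (HIDDEN, KV_DIM, "B"),
    "o_proj": (HIDDEN, HIDDEN, "A"),
    "up_proj": (HIDDEN, INTERMEDIATE, "B"),
    "gate_proj": (HIDDEN, INTERMEDIATE, "B"),
    "down_proj": (INTERMEDIATE, HIDDEN, "A"),
}


def count_params(r, nblocks=1):
    """Trainable adapter parameters; nblocks=1 corresponds to vanilla LoRA."""
    per_layer = 0
    for d_in, d_out, bd_factor in MODULES.values():
        a = r * d_in   # LoRA-A: (r, in_features)
        b = d_out * r  # LoRA-B: (out_features, r)
        if bd_factor == "A":
            a //= nblocks  # only the diagonal blocks of A are stored
        else:
            b //= nblocks  # only the diagonal blocks of B are stored
        per_layer += a + b
    return N_LAYERS * per_layer


def matching_rank(lora_rank, nblocks):
    """Largest multiple of nblocks whose BD-LoRA count stays below vanilla LoRA."""
    target = count_params(lora_rank)
    r = nblocks
    while count_params(r + nblocks, nblocks) < target:
        r += nblocks
    return r


print(count_params(32))      # 22544384
print(count_params(16, 2))   # 7471104
print(matching_rank(32, 2))  # 48
```

Under these assumptions the closed-form counts agree with the numbers printed by `print_trainable_parameters()` and by the rank-matching loop in this notebook.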
