# Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

## Introduction
Block-Diagonal LoRA (BD-LoRA) is a LoRA variant in which some LoRA factors are constrained to be block-diagonal. This allows faster serving by eliminating communication overhead
when running inference on multiple GPUs. Despite the block-diagonal constraint, BD-LoRA performs similarly to vanilla LoRA at similar parameter counts.

Following the [Megatron sharding strategy](https://arxiv.org/abs/1909.08053), for two linear layers that follow each other (e.g. the up and down projections), we shard the first layer in a column-parallel way (which requires LoRA B to be block-diagonal) and the second layer in a row-parallel way (which requires LoRA A to be block-diagonal). This sharding allows a compatible inference engine to place each block-diagonal shard on a different GPU, removing the need to communicate partial results among GPUs. The image below shows our exact sharding strategy and where the savings come from.

Paper: https://arxiv.org/html/2510.23346v1

![image.png](bdlora-sharding.png)
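
The column-parallel case can be checked numerically on a single device. The sketch below uses toy sizes chosen only for illustration and simulates the "devices" with a Python loop. It assumes that device `i` holds block `i` of LoRA B together with the matching `rank/tp` rows of LoRA A, and it is not how an inference engine would implement the sharding.

```python
import torch

torch.manual_seed(0)

tp = 2                        # tensor-parallel degree = number of blocks
rank, d_in, d_out = 4, 6, 8   # toy sizes, chosen only for illustration
x = torch.randn(3, d_in)      # a batch of 3 input vectors

# Dense LoRA A and a block-diagonal LoRA B assembled from tp blocks.
lora_a = torch.randn(rank, d_in)
b_blocks = [torch.randn(d_out // tp, rank // tp) for _ in range(tp)]
lora_b = torch.block_diag(*b_blocks)          # shape (d_out, rank)

# Reference: the unsharded LoRA update x @ A^T @ B^T.
reference = x @ lora_a.T @ lora_b.T           # shape (3, d_out)

# Simulated column-parallel execution: each "device" uses only its own
# slice of A and its own block of B, and produces its own output slice.
block_rank = rank // tp
per_device = []
for i, b_i in enumerate(b_blocks):
    a_i = lora_a[i * block_rank:(i + 1) * block_rank]   # (rank/tp, d_in)
    per_device.append(x @ a_i.T @ b_i.T)                # (3, d_out/tp)

# Concatenating here is only for the comparison; with column-parallel
# sharding the output slices would simply stay on their devices.
sharded = torch.cat(per_device, dim=-1)
assert torch.allclose(reference, sharded, atol=1e-5)
```

Because the off-diagonal blocks of LoRA B are zero, no output slice depends on another device's intermediate result, which is why no partial results have to be exchanged for the LoRA path.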


In [39]:
from peft.tuners import BdLoraConfig, LoraConfig
from peft import get_peft_model
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch

## Quick Start
To use BD-LoRA, we can follow the standard LoRA training procedure. We only need to replace the `LoraConfig` with a `BdLoraConfig` and specify which LoRA factors should be block-diagonal. In the following example, we train a Llama model so that it can later benefit from the inference speed-up described in the BD-LoRA paper.

In Llama, we need to think about how the attention and linear modules are sharded: attention consists of a QKV projection (in parallel) followed by an out projection, while the linear modules consist of parallel up and gate projections, followed by a down projection. Therefore, we want to shard the QKV, up and gate projections in a column-parallel manner (using a block-diagonal LoRA-B factor), and the down and out projections in a row-parallel manner (using a block-diagonal LoRA-A factor).

Additionally, we need to know how many GPUs we want to serve on before we start training, as this determines the number of blocks in each block-diagonal factor. For this experiment, we use 2 blocks (equivalent to a tensor-parallelism degree of 2). Caveat: a small model such as Llama 3.2-1B, which we use here, would normally be served on a single GPU. TP=2 or TP=8 would only be used for larger models, such as Llama 3.1-8B or Llama 3.3-70B respectively.
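
Since every block takes an equal share of the rank and of the sharded feature dimension, both need to be divisible by the number of blocks for the block shapes to be whole numbers. The helper below is a hypothetical sanity check written for this notebook, not part of PEFT, and whether PEFT performs a similar validation itself is not covered here.

```python
def block_shape(r, nblocks, sharded_features):
    """Return the (features, rank) shape of one block.

    `sharded_features` is out_features for a block-diagonal LoRA B
    (column-parallel layer) and in_features for a block-diagonal LoRA A
    (row-parallel layer). Hypothetical helper, for illustration only.
    """
    if r % nblocks != 0:
        raise ValueError(f"rank {r} is not divisible by nblocks {nblocks}")
    if sharded_features % nblocks != 0:
        raise ValueError(
            f"feature dimension {sharded_features} is not divisible by nblocks {nblocks}"
        )
    return (sharded_features // nblocks, r // nblocks)


# r=16 and nblocks=2 as in this notebook, with a 512-wide sharded dimension.
print(block_shape(r=16, nblocks=2, sharded_features=512))  # (256, 8)
```

Running such a check for the intended tensor-parallelism degree before training can catch an incompatible rank early, because the number of blocks is fixed once the adapter is trained.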

In [40]:
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [41]:
bd_config = BdLoraConfig(
    r=16,
    # If you use a model different from Llama, change the settings below
    target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj", "o_proj", "down_proj"],
    lora_a_is_blockdiagonal=["o_proj", "down_proj"],
    lora_b_is_blockdiagonal=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj"],
    # Set this equal to the number of GPUs you want to serve the model with later
    nblocks=2,
    lora_bias=False
)

peft_model = get_peft_model(model, bd_config)
peft_model.print_trainable_parameters()

trainable params: 7,471,104 || all params: 1,243,285,504 || trainable%: 0.6009
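
The trainable-parameter count above can be reproduced by hand. The sketch below makes two assumptions: the layer dimensions listed in `LAYERS` are those of Llama-3.2-1B with 16 decoder layers, and each block-diagonal factor is stored with its rank dimension reduced to `r / nblocks`, as described in the section on adapter shapes in this notebook. The function name is illustrative and not part of PEFT.

```python
def bdlora_param_count(in_features, out_features, r, nblocks, block_diagonal):
    """Parameters of one adapter; `block_diagonal` is "A" or "B"."""
    if block_diagonal == "B":      # column-parallel layer
        a = r * in_features
        b = out_features * (r // nblocks)
    elif block_diagonal == "A":    # row-parallel layer
        a = (r // nblocks) * in_features
        b = out_features * r
    else:
        raise ValueError("block_diagonal must be 'A' or 'B'")
    return a + b


# (in_features, out_features, block-diagonal factor); assumed Llama-3.2-1B sizes.
LAYERS = {
    "q_proj": (2048, 2048, "B"),
    "k_proj": (2048, 512, "B"),
    "v_proj": (2048, 512, "B"),
    "up_proj": (2048, 8192, "B"),
    "gate_proj": (2048, 8192, "B"),
    "o_proj": (2048, 2048, "A"),
    "down_proj": (8192, 2048, "A"),
}
NUM_DECODER_LAYERS = 16  # assumed for Llama-3.2-1B

per_layer = sum(
    bdlora_param_count(i, o, r=16, nblocks=2, block_diagonal=f)
    for i, o, f in LAYERS.values()
)
total = per_layer * NUM_DECODER_LAYERS
print(f"{total:,}")  # 7,471,104
```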


## Training
We train the model for 10 steps. This training block is only intended to showcase how BD-LoRA integrates with other Hugging Face tools.

In [42]:
dataset = load_dataset("imdb", split="train[:1%]")

tokenizer.pad_token = tokenizer.eos_token
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    logging_steps=1,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

peft_model.config.use_cache = False
trainer.train()

Step,Training Loss
1,3.4873
2,3.566
3,3.3288
4,3.5333
5,3.2361
6,3.1984
7,3.2596
8,2.9806
9,3.2335
10,3.2318


TrainOutput(global_step=10, training_loss=3.3055537939071655, metrics={'train_runtime': 18.3745, 'train_samples_per_second': 17.415, 'train_steps_per_second': 0.544, 'total_flos': 236477802872832.0, 'train_loss': 3.3055537939071655, 'epoch': 1.25})

## Example Output

In [45]:
text = "The Batman Trilogy by Christopher Nolan"
inputs = tokenizer(text, return_tensors="pt").to(model.device)  

outputs = peft_model.generate(**inputs, max_length=50)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The Batman Trilogy by Christopher Nolan
The Batman Trilogy is a trilogy of films that is a must see for any Batman fan. The first film is The Dark Knight, which is the best Batman film ever made. The second film is the 199


## Investigating the shapes of LoRA Adapters
We can inspect the adapter shapes to see whether they follow the sharding patterns discussed above. To make the implementation more memory efficient,
the block-diagonal matrices are not stored as full block-diagonal matrices. Instead, the blocks are stacked along the non-rank dimension.

For example, if a layer is column-sharded, such as `q_proj` in Llama, then the LoRA-B factor is block-diagonal. Assume that the layer's weight has shape (out_features, in_features).
Then LoRA-A has shape (rank, in_features), and LoRA-B has shape (out_features, rank/TP), which corresponds to TP blocks of shape (out_features/TP, rank/TP) each. We can check this by inspecting the weight shapes, here for `v_proj`, which is also column-sharded:

In [78]:
shape_base = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight'].shape)
shape_a = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight'].shape)
shape_b = list(peft_model.state_dict()['base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight'].shape)
print(f"Base layer has shape:    [{shape_base[0]}, {shape_base[1]}]\nLoRA-A (vanilla):        [{shape_a[0]},  {shape_a[1]}]\nLoRA-B (block-diagonal): [{shape_b[0]}, {shape_b[1]}   ]")

Base layer has shape:    [512, 2048]
LoRA-A (vanilla):        [16,  2048]
LoRA-B (block-diagonal): [512, 8   ]
