# BlockRank Quickstart

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nilesh2797/BlockRank/blob/main/examples/quickstart.ipynb)

This notebook demonstrates how to use **BlockRank** for fast, scalable document ranking with LLMs.

**BlockRank** makes LLMs efficient for in-context ranking through:
- 🚀 **Structured Sparse Attention**: Linear complexity (vs quadratic)
- ⚡ **Attention-based Inference**: 4.7× faster (no token generation)
- 🎯 **Auxiliary Contrastive Loss**: Better relevance signals

**Paper**: [Scalable In-context Ranking with Generative Models](https://arxiv.org/abs/2510.05396)

## Setup

First, install BlockRank and its dependencies:

In [None]:
# Install BlockRank from GitHub
!pip install -q git+https://github.com/nilesh2797/BlockRank.git

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")


## Sample Data

In [1]:
# Create two toy queries, each with five candidate documents and one relevant document
import json

sample_data = [
    {
        "query": "what is the capital of france",
        "query_id": "q1",
        "documents": [
            {"doc_id": "0", "title": "Paris", "text": "Paris is the capital and most populous city of France."},
            {"doc_id": "1", "title": "Berlin", "text": "Berlin is the capital and largest city of Germany."},
            {"doc_id": "2", "title": "London", "text": "London is the capital of England and the United Kingdom."},
            {"doc_id": "3", "title": "Rome", "text": "Rome is the capital city of Italy."},
            {"doc_id": "4", "title": "Madrid", "text": "Madrid is the capital and most populous city of Spain."},
        ],
        "answer_ids": ["0"]
    },
    {
        "query": "who wrote the novel 1984",
        "query_id": "q2",
        "documents": [
            {"doc_id": "0", "title": "George Orwell", "text": "George Orwell wrote the dystopian novel Nineteen Eighty-Four in 1949."},
            {"doc_id": "1", "title": "Aldous Huxley", "text": "Aldous Huxley wrote Brave New World in 1932."},
            {"doc_id": "2", "title": "Ray Bradbury", "text": "Ray Bradbury is the author of Fahrenheit 451."},
            {"doc_id": "3", "title": "J.R.R. Tolkien", "text": "J.R.R. Tolkien wrote The Lord of the Rings."},
            {"doc_id": "4", "title": "Ernest Hemingway", "text": "Ernest Hemingway was an American novelist and journalist."},
        ],
        "answer_ids": ["0"]
    }
]

# Save to file
with open("sample_data.jsonl", "w") as f:
    for item in sample_data:
        f.write(json.dumps(item) + "\n")

print(f"Created sample data with {len(sample_data)} queries")

Created sample data with 2 queries
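Before handing a JSONL file to the loader, it can help to check that every record has the fields used in the sample above and that each entry in `answer_ids` matches a `doc_id` in the same record. The following is a minimal sketch using only the standard library. The helper name `validate_jsonl` and the set of required keys are assumptions inferred from the sample data, not part of the BlockRank API.

```python
import json
import os
import tempfile

# Assumption: these are the fields used by the sample records in this notebook.
REQUIRED_KEYS = {"query", "query_id", "documents", "answer_ids"}


def validate_jsonl(path):
    """Return a list of human-readable problems found in the file (empty if none)."""
    problems = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append(f"line {line_no}: missing keys {sorted(missing)}")
                continue
            doc_ids = [doc["doc_id"] for doc in record["documents"]]
            if len(doc_ids) != len(set(doc_ids)):
                problems.append(f"line {line_no}: duplicate doc_id values")
            unknown = [a for a in record["answer_ids"] if a not in doc_ids]
            if unknown:
                problems.append(f"line {line_no}: answer_ids {unknown} not in documents")
    return problems


# Demo on a throwaway file: the second record points at a doc_id that does not exist.
docs = [{"doc_id": "0", "title": "Paris", "text": "Paris is the capital of France."}]
records = [
    {"query": "capital of france", "query_id": "q1", "documents": docs, "answer_ids": ["0"]},
    {"query": "capital of spain", "query_id": "q2", "documents": docs, "answer_ids": ["9"]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    for problem in validate_jsonl(demo_path):
        print(problem)
```

Calling `validate_jsonl("sample_data.jsonl")` on the file written above should return an empty list.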


## Load Pre-trained BlockRank Model

Load a BlockRank model fine-tuned on MS MARCO:

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from blockrank import blockrank_triton_kernel_attention, blockrank_std_attention
blockrank_triton_kernel_attention.register_triton_blockrank_attention()
blockrank_std_attention.register_blockrank_attention()

# Load model and tokenizer
model_name = "quicktensor/blockrank-msmarco-mistral-7b"

print(f"Loading {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    quantization_config=quantization_config,
    attn_implementation="triton_blockrank"
)

model.eval();

Loading quicktensor/blockrank-msmarco-mistral-7b...



## Run Attention-Based Inference

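BlockRank's attention-based inference scores documents from a single forward pass instead of generating tokens. First, load the sample data and build a dataloader: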

In [3]:
# Import evaluation utilities
from functools import partial

import torch
import torch.nn.functional as F

from blockrank.dataset import load_icr_dataset_hf, block_icr_collate_fn
from blockrank.utils import calculate_accuracy

# Load dataset
ds = load_icr_dataset_hf(
    data_path="sample_data.jsonl",
    tokenizer=tokenizer,
    num_documents=-1,
    eval_mode=True,
    use_blockrank=True,
)

eval_ds = ds["train"]
print(f"Loaded {len(eval_ds)} examples")

# Setup data collator
data_collator = partial(
    block_icr_collate_fn,
    tok=tokenizer,
    max_block_length=512,
    pad_to_multiple_of=16
)

# Create dataloader
dataloader = torch.utils.data.DataLoader(
    eval_ds,
    batch_size=1,
    collate_fn=data_collator,
    shuffle=False
)


num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.



Loaded 2 examples


In [5]:
# Run inference
all_predictions = []
attn_layer_idx = 20  # Index of the layer whose attention scores are used (20 for Mistral-7B)

print("Running inference...\n")

with torch.no_grad():
    for idx, batch in enumerate(dataloader):
        # Move to device
        device = model.device
        batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
        
        # Forward pass
        outputs = model(
            **batch,
            output_attentions=True,
            layers_to_return_scores=[attn_layer_idx],
            use_cache=False
        )
        
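        # Extract attention scores for the requested layer
        attention = outputs.attentions[0]
        # attention_mask is laid out as (batch, blocks, tokens per block)
        B, M, H = batch['attention_mask'].shape
        # attention is laid out as (batch, heads, query positions, key positions)
        _, N, H1, MH = attention.shape
        
        # Compute document scores from attention:
        # take the last query position, drop the first and last H key positions
        # (the first and last blocks), and normalize over the remaining positions
        attn = F.softmax(attention[:, :, -1, H:-H], dim=-1)
        # Regroup the key positions into blocks of H tokens, one block per document
        attn = attn.reshape(B, N, -1, H)
        # Average over heads, then sum the attention mass inside each block
        attn_scores = attn.mean(1).sum(-1)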
        
        # Get top-k predictions
        k = min(5, attn_scores.shape[-1])
        top_k_indices = torch.topk(attn_scores, k=k, dim=-1).indices[0].cpu().tolist()
        
        all_predictions.append(top_k_indices)
        
        # Display results
        print(f"Query {idx + 1}: {eval_ds[idx]['query']}")
        print(f"  Predicted ranking: {top_k_indices}")
        print(f"  Ground truth: {eval_ds[idx]['answer_ids'].tolist()}")
        print()

Running inference...

Query 1: what is the capital of france
  Predicted ranking: [4, 3, 1, 2, 0]
  Ground truth: [4]

Query 2: who wrote the novel 1984
  Predicted ranking: [4, 0, 2, 3, 1]
  Ground truth: [4]
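To make the score aggregation in the inference loop easier to follow, here is a plain-Python mirror of the same steps on toy numbers: drop the first and last block of key positions, apply a softmax over what remains, sum the probability mass inside each block, and average over heads. It is a sketch for illustration, not BlockRank library code. The function name `block_scores` and the toy logits are hypothetical. Summing per block before averaging over heads gives the same result as the notebook's mean-then-sum, because both operations are linear.

```python
import math


def block_scores(logits_per_head, block_len):
    """Aggregate one query position's attention logits into per-document scores.

    logits_per_head: one flat list per head, of length num_blocks * block_len.
    The first and last block are excluded, mirroring the H:-H slice in the notebook.
    """
    per_head = []
    for logits in logits_per_head:
        inner = logits[block_len:-block_len]  # keep only the middle (document) blocks
        peak = max(inner)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in inner]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Sum the probability mass inside each block of block_len tokens
        per_head.append(
            [sum(probs[i:i + block_len]) for i in range(0, len(probs), block_len)]
        )
    num_heads = len(per_head)
    num_docs = len(per_head[0])
    # Average the per-block mass over heads
    return [sum(head[d] for head in per_head) / num_heads for d in range(num_docs)]


# Toy input: 2 heads, 4 blocks of 2 tokens (first block, doc 0, doc 1, last block)
toy_logits = [
    [0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0],  # head 0 favors doc 0
    [5.0, 5.0, 1.0, 1.0, 2.0, 2.0, 9.0, 9.0],  # head 1 favors doc 1
]
scores = block_scores(toy_logits, block_len=2)
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
print("scores:", scores)
print("ranking:", ranking)
```

Because the softmax runs only over the document blocks, the scores for one query sum to 1, which makes them easy to compare across documents.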



## Evaluate Results

Calculate ranking metrics:

In [6]:
# Calculate metrics
results = calculate_accuracy(all_predictions, eval_ds)

print("="*60)
print("Evaluation Results")
print("="*60)
print(f"Accuracy (top-1): {results['accuracy']:.2f}%")
print(f"Exact matches: {results['exact_match']} / {results['total']}")
print("="*60)

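============================================================
Evaluation Results
============================================================
Accuracy (top-1): 100.00%
Exact matches: 2 / 2
============================================================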
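For intuition, a top-1 accuracy of this kind can be computed by checking whether the first entry of each predicted ranking is among that query's relevant document indices. The sketch below is an assumption about what such a metric looks like, not the implementation of `calculate_accuracy`. The function name `top1_accuracy` is hypothetical, and the result keys simply mirror the ones printed above.

```python
def top1_accuracy(rankings, answer_ids):
    """Percentage of queries whose top-ranked document is a relevant one."""
    hits = sum(
        1
        for ranking, answers in zip(rankings, answer_ids)
        if ranking and ranking[0] in answers
    )
    total = len(rankings)
    accuracy = 100.0 * hits / total if total else 0.0
    return {"accuracy": accuracy, "exact_match": hits, "total": total}


# Rankings and ground truth copied from the inference output above
rankings = [[4, 3, 1, 2, 0], [4, 0, 2, 3, 1]]
answers = [[4], [4]]
print(top1_accuracy(rankings, answers))
# → {'accuracy': 100.0, 'exact_match': 2, 'total': 2}
```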


## Learn More
- **Paper**: [arXiv:2510.05396](https://arxiv.org/abs/2510.05396)
- **GitHub**: [nilesh2797/BlockRank](https://github.com/nilesh2797/BlockRank)
- **Training Guide**: See `docs/TRAINING.md`
- **Data Format**: See `docs/DATA_FORMAT.md`

## Train Your Own Model
```bash
# Clone the repository
git clone https://github.com/nilesh2797/BlockRank.git
cd BlockRank

# Install dependencies
pip install -r requirements.txt

# Train on your data
python scripts/train.py --config configs/your-config.yaml
```

## Evaluate on BEIR Benchmarks
```bash
# Evaluate on a single BEIR dataset
python scripts/eval_attn.py \
    --config configs/eval_beir.yaml \
    --checkpoint outputs/your-model
```