### Flash ReRank

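Models are better at using relevant information that occurs at the very beginning (primacy bias) or end (recency bias) of their input context, and performance degrades significantly when they have to use information located in the middle of it. Reranking retrieved passages by relevance before they are placed in the prompt helps keep the most useful ones out of that weak middle region.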

- Cross Encoders
- Zero shot rerankers
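
One way to act on the positional bias described above, once a reranker has sorted the passages, is to interleave them so the strongest ones sit at the two ends of the context and the weakest in the middle. The following is a minimal sketch, not part of FlashRank; `reorder_for_long_context` is a hypothetical helper name, and it assumes its input is already sorted from most to least relevant.

```python
def reorder_for_long_context(ranked):
    """Move the most relevant items to the edges of the list.

    Assumes `ranked` is sorted from most to least relevant.
    Hypothetical helper for illustration; not a FlashRank function.
    """
    front, back = [], []
    for position, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5, ... fill the front,
        # ranks 2, 4, 6, ... fill the back.
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


ordered = reorder_for_long_context(["p1", "p2", "p3", "p4", "p5"])
print(ordered)
```

With five passages ranked `p1` (best) to `p5` (worst), the best passage stays first, the second best moves to the end, and the least relevant one lands in the middle.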

In [1]:
%pip install -qqU flashrank

Note: you may need to restart the kernel to use updated packages.


In [4]:
from flashrank import Ranker

ranker = Ranker()

#### Small Ranker (~34MB), slightly slower & best performance (ranking precision)

In [9]:
ranker_small = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="../.cache/")

#### Medium (~110MB), slower model with best zero-shot performance (ranking precision) on out-of-domain data

In [11]:
ranker_medium_t5 = Ranker(model_name="rank-T5-flan", cache_dir="../.cache/")

#### Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English)

In [12]:
ranker_medium_int = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="../.cache/")

Downloading ms-marco-MultiBERT-L-12...


ms-marco-MultiBERT-L-12.zip: 100%|██████████| 98.7M/98.7M [00:03<00:00, 26.0MiB/s]


- Metadata is optional.
- IDs can come from the retrieval DB or be simple numeric indices.
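
When the retrieved texts arrive as plain strings without database IDs, the passage dictionaries can be built with numeric indices. This is a small sketch under that assumption; `to_passages` is a hypothetical helper, and the `id` / `text` / `meta` keys simply mirror the passage layout used in this notebook.

```python
def to_passages(texts, metas=None):
    """Wrap plain strings in passage dicts with 1-based numeric IDs.

    Hypothetical helper for illustration. `metas`, if given, must be
    a list of the same length as `texts`; entries that are None are
    skipped because metadata is optional.
    """
    passages = []
    for idx, text in enumerate(texts, start=1):
        passage = {"id": idx, "text": text}
        if metas is not None and metas[idx - 1] is not None:
            passage["meta"] = metas[idx - 1]
        passages.append(passage)
    return passages


example = to_passages(
    ["vLLM is a fast library for LLM inference.", "Medusa removes the draft model."],
    metas=[{"source": "docs"}, None],
)
print(example)
```

The second passage in the example has no `meta` key at all, which reflects the point that metadata can be left out.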

In [13]:
query = "How to speedup LLMs?"

In [14]:
passages = [
    {
        "id": 1,
        "text": "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
        "meta": {"additional": "info1"},
    },
    {
        "id": 2,
        "text": "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
        "meta": {"additional": "info2"},
    },
    {
        "id": 3,
        "text": "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
        "meta": {"additional": "info3"},
    },
    {
        "id": 4,
        "text": "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
        "meta": {"additional": "info4"},
    },
    {
        "id": 5,
        "text": "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
        "meta": {"additional": "info5"},
    },
]

In [16]:
from flashrank import RerankRequest

rerank_request = RerankRequest(query=query, passages=passages)
results = ranker_medium_t5.rerank(rerank_request)
results

[{'id': 3,
  'text': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
  'meta': {'additional': 'info3'},
  'score': 0.54945964},
 {'id': 1,
  'text': 'Introduce *lookahead decoding*: - a parallel decoding algo to ac