# BM25 Retriever
In this guide, we define a BM25 retriever that searches documents using the BM25 method.
BM25 (Best Matching 25) is a ranking function that extends TF-IDF with term-frequency saturation and document-length normalization. It ranks documents by how often the query terms occur in each document and by how rare those terms are across the corpus.
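
To make the scoring idea concrete, here is a minimal sketch of BM25 in plain Python. It is not the implementation used by `BM25Retriever`; the helper name `bm25_scores`, the Lucene-style IDF variant, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a larger IDF weight (this variant is always positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 limits the gain from repeated terms; b scales the length penalty.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "hirschsprung disease is a genetic disorder".split(),
    "the disease the disease the disease".split(),
    "ganglion cells in the gut".split(),
]
scores = bm25_scores("hirschsprung disease".split(), corpus)
print(scores)
```

The first document matches both query terms, including the rare term "hirschsprung", so it ranks first. The second document repeats "disease" three times, but saturation keeps its score well below three times the single-occurrence contribution.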

## Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install --upgrade llama-index
%pip install --upgrade llama-index-retrievers-bm25

In [1]:
import os
from llama_index.core import Settings

## Load Data

We first load the `rag-mini-bioasq` passages and questions, keep the passages referenced by a subset of the test questions, wrap them as Documents, and split those Documents into Nodes.

In [2]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("rag-datasets/rag-mini-bioasq", "text-corpus")
ds = ds['passages'].to_pandas().set_index('id', drop=True)
query_set = load_dataset("rag-datasets/rag-mini-bioasq", "question-answer-passages")


In [3]:
queries = query_set['test'].take(5)

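# Create a subset of the documents for faster testing: keep only the
# passages referenced by the first 15 test questions.
passages_required = set()
for ids in query_set['test'].take(15)['relevant_passage_ids']:
    # Each entry is a string such as "[20598273, 6650562]"
    passages_required.update(int(pid) for pid in ids[1:-1].split(", "))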
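
The `relevant_passage_ids` values are strings that look like Python list literals, for example `'[20598273, 6650562]'`. As an alternative to slicing off the brackets and splitting on `", "`, the sketch below parses them with `ast.literal_eval`. The helper name `parse_passage_ids` is hypothetical, and the sketch assumes every entry is a well-formed list of integers.

```python
import ast


def parse_passage_ids(raw):
    """Turn a string such as '[20598273, 6650562]' into a list of ints."""
    ids = ast.literal_eval(raw)
    return [int(passage_id) for passage_id in ids]


example = "[20598273, 6650562, 15829955]"
parsed = parse_passage_ids(example)
print(parsed)
```

Unlike the slice-and-split approach, this also handles an empty list (`"[]"`) and irregular spacing between IDs.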


In [4]:
from llama_index.core import Document

docs = [Document(text=ds.loc[id].passage, metadata = {'id' : id}) for id in passages_required]
for x in docs:
    x.doc_id = str(x.metadata['id'])
    x.excluded_llm_metadata_keys = ['id']

In [22]:
from llama_index.core.node_parser import SentenceSplitter

# Initialize the node parser. For this dataset, a chunk size of 512 is larger
# than most documents, so most documents end up as a single node.
splitter = SentenceSplitter(chunk_size=512)

nodes = splitter.get_nodes_from_documents(docs)
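
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking with overlap. It is not how `SentenceSplitter` works internally (the real splitter uses a tokenizer and tries to respect sentence boundaries); the helper `chunk_tokens` is hypothetical and treats whitespace-separated words as tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# A document shorter than chunk_size stays in one piece.
short_doc = "a short abstract".split()
print(chunk_tokens(short_doc, chunk_size=512))

# A longer document is cut into overlapping windows.
long_doc = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(long_doc, chunk_size=4, chunk_overlap=1)
print(chunks)
```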

## BM25 Retriever + Disk Persistence

In [23]:
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer

# We can pass in the index, docstore, or list of nodes to create the retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5,
    # Optional: We can pass in the stemmer and set the language for stopwords
    # This is important for removing stopwords and stemming the query + text
    # The default is english for both
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)

In [24]:
bm25_retriever.persist("./bm25_retriever")

loaded_bm25_retriever = BM25Retriever.from_persist_dir("./bm25_retriever")


In [25]:
from llama_index.core.response.notebook_utils import display_source_node

print(queries[0])
retrieved_nodes = bm25_retriever.retrieve(
    queries[0]['question']
)
for node in retrieved_nodes[:2]:
    display_source_node(node, source_length=5000)

{'question': 'Is Hirschsprung disease a mendelian or a multifactorial disorder?', 'answer': "Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the development of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syndromic forms of Hirschsprung's disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model.", 'relevant_passage_ids': '[20598273, 6650562, 15829955, 15617541, 23001136, 8896569, 21995290, 12239580, 15858239]', 'id': 0}


**Node ID:** 8c7c283c-c0ce-4045-96ab-4a6e98c27b38<br>**Similarity:** 6.5452728271484375<br>**Text:** Hirschsprung's disease is characterized by the absence of ganglion cells in the 
myenteric and submucosal plexuses of the gastrointestinal tract. Genetic 
dissection was successful as nine genes and four loci for Hirschsprung's disease 
susceptibility were identified. Different approaches were used to find these 
loci such as classical linkage in large families, identity by descent mapping in 
an inbred kindred, candidate gene approaches based on naturally occurring mutant 
mice models, and finally the use of model-free linkage and association analyzes. 
In this study, we review the identification of genes and loci involved in the 
non-syndromic common form and syndromic Mendelian forms of Hirschsprung's 
disease. The majority of the identified genes are related to Mendelian syndromic 
forms of Hirschsprung's disease. The non-Mendelian inheritance of sporadic 
non-syndromic Hirschsprung's disease proved to be complex; involvement of 
multiple loci was demonstrated in a multiplicative model. We discuss the 
practical implications of the elucidation of genes associated with 
Hirschsprung's disease susceptibility for genetic counseling. Finally, we 
speculate on possible strategies to identify new genes for Hirschsprung's 
disease.<br>

**Node ID:** 2d3e6644-2cff-41e9-998c-1d9ec0ac964a<br>**Similarity:** 5.607312202453613<br>**Text:** Hirschsprung's disease (HSCR) is a fairly frequent cause of intestinal 
obstruction in children. It is characterized as a sex-linked heterogonous 
disorder with variable severity and incomplete penetrance giving rise to a 
variable pattern of inheritance. Although Hirschsprung's disease occurs as an 
isolated phenotype in at least 70% of cases, it is not infrequently associated 
with a number of congenital abnormalities and associated syndromes, 
demonstrating a spectrum of congenital anomalies. Certain of these syndromic 
phenotypes have been linked to distinct genetic sites, indicating underlying 
genetic associations of the disease and probable gene-gene interaction, in its 
pathogenesis. These associations with HSCR include Down's syndrome and other 
chromosomal anomalies, Waardenburg syndrome and other Dominant sensorineural 
deafness, the Congenital Central Hypoventilation and Mowat-Wilson and other 
brain-related syndromes, as well as the MEN2 and other tumour associations. A 
number of other autosomal recessive syndromes include the Shah-Waardenburg, the 
Bardet-Biedl and Cartilage-hair hypoplasia, Goldberg-Shprintzen syndromes and 
other syndromes related to cholesterol and fat metabolism among others. The 
genetics of Hirschsprung's disease are highly complex with the majority of known 
genetic sites relating to the main susceptibility pathways (RET an EDNRB). 
Non-syndromic non-familial, short-segment HSCR appears to represent a 
non-Mendelian condition with variable expression and sex-dependent penetrance. 
Syndromic and familial forms, on the other hand, have complex patterns of 
inheritance and being reported as autosomal dominant, recessive and polygenic 
patterns of inheritance. The phenotypic variability and incomplete penetrance 
observed in Hirschsprung's disease could also be explained by the involvement of 
modifier genes, especially in its syndromic forms. In this review, we look at 
the chromosomal and Mendelian associations and their underlying signalling 
pathways, to obtain a better understanding of the pathogenetic mechanisms 
involved in developing aganglionosis of the distal bowel.<br>

In [26]:
from ragas.dataset_schema import SingleTurnSample

samples = []
for row in queries:
    row["relevant_passages"] = [ ds.loc[int(id)].passage for id in row["relevant_passage_ids"][1:-1].split(', ')]
    retrieved_nodes = bm25_retriever.retrieve(row['question'])
    retrieved_nodes = [ str(node.text) for node in retrieved_nodes ]
    print(row)
    sample = SingleTurnSample(
        user_input=row["question"],
        retrieved_contexts=retrieved_nodes,
        reference_contexts=row["relevant_passages"],
    )
    samples.append(sample)

{'question': 'Is Hirschsprung disease a mendelian or a multifactorial disorder?', 'answer': "Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the development of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syndromic forms of Hirschsprung's disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model.", 'relevant_passage_ids': '[20598273, 6650562, 15829955, 15617541, 23001136, 8896569, 21995290, 12239580, 15858239]', 'id': 0, 'relevant_passages': ['The major gene for Hirschsprung disease (HSCR) encodes the receptor tyrosine \nkinase RET. In a study of 690 European- and 192 Chinese-descent probands and \ntheir parents or controls, we demonstrate the ubiquity of a >4-fold \nsusceptibility from a C-->T allele (rs2435357: p = 3.9 x 10(-43) in European \nancestry; p = 1.1 x 10(-21) in Chinese sam

In [27]:
import os
from dotenv import load_dotenv
load_dotenv()

from llama_index.core import Settings
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# You need to deploy your own embedding model as well as your own chat completion model
llm = AzureOpenAI(
    deployment_name='gpt-35-turbo16k',
    model='gpt-35-turbo',
    api_key=os.environ['AZURE_OPENAI_API_KEY'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],
)

# global settings
Settings.llm = llm

In [30]:
from llama_index.core.query_engine import RetrieverQueryEngine

# Build a query engine backed by a BM25 retriever over the given nodes
def get_query_engine_for_retriever(new_nodes=nodes, k=5):

    bm25_retriever = BM25Retriever.from_defaults(
        nodes=new_nodes,
        similarity_top_k=k,
        stemmer=Stemmer.Stemmer("english"),
        language="english",
    )

    query_engine = RetrieverQueryEngine(bm25_retriever)
    return query_engine

In [None]:
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core.evaluation.retrieval.metrics import resolve_metrics, HitRate, MRR
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

metric_dict = {}
metrics = ["precision", "recall", "ap", "ndcg"]
metrics = [x() for x in resolve_metrics(metrics)] + [HitRate(use_granular_hit_rate=True), MRR(use_granular_mrr=True)]

results_data = []
splitters = {'sentance_512_0': SentenceSplitter(chunk_size=512, chunk_overlap=100),
             'token_512': TokenTextSplitter(chunk_size=512, chunk_overlap=100)}

for splitter_name in splitters.keys():
    splitter = splitters[splitter_name]
    nodes = splitter.get_nodes_from_documents(docs)
    for k in [1, 3, 5, 10]:
        query_engine = get_query_engine_for_retriever(new_nodes = nodes, k=k)
        for row in queries:
            row['relevant_passage_ids'] = row['relevant_passage_ids'][1:-1].split(', ')
            query = row['question']
            retrieved_nodes = query_engine.retrieve(query)
            retrieved_passage_ids = [str(node.metadata['id']) for node in retrieved_nodes]
            
            for metric in metrics:
                eval_result = metric.compute(
                    query, row['relevant_passage_ids'], retrieved_passage_ids,
                )
                metric_dict[metric.metric_name] = eval_result.score
            
            results_data.append({
                'splitter': splitter_name,
                'k': k,
                'query': query,
                'retrieved_ids': retrieved_passage_ids,
                'relevant_ids': row['relevant_passage_ids'],
                **metric_dict
            })

results_df = pd.DataFrame(results_data)
results_df.drop(['query', 'retrieved_ids', 'relevant_ids'], axis=1).groupby(['k','splitter']).mean()

| k | splitter | precision | recall | ap | ndcg | hit_rate | mrr |
|---|---|---|---|---|---|---|---|
| 1 | sentance_512_0 | 1.0 | 0.108056 | 0.108056 | 1.0 | 0.108056 | 1.0 |
| 1 | token_512 | 1.0 | 0.108056 | 0.108056 | 1.0 | 0.108056 | 1.0 |
| 3 | sentance_512_0 | 0.933333 | 0.311667 | 0.3075 | 0.940784 | 0.311667 | 0.622222 |
| 3 | token_512 | 0.933333 | 0.311667 | 0.3075 | 0.940784 | 0.311667 | 0.622222 |
| 5 | sentance_512_0 | 0.8 | 0.439722 | 0.427986 | 0.849266 | 0.439722 | 0.506889 |
| 5 | token_512 | 0.84 | 0.439722 | 0.447986 | 0.875507 | 0.459722 | 0.494056 |
| 10 | sentance_512_0 | 0.58 | 0.582222 | 0.544512 | 0.679299 | 0.582222 | 0.418433 |
| 10 | token_512 | 0.591111 | 0.582222 | 0.564988 | 0.695321 | 0.602222 | 0.409306 |
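
To make some of these columns concrete, here is a minimal sketch of precision, recall, and the granular hit rate and MRR for a single query. The helper `retrieval_metrics` is hypothetical and not part of LlamaIndex. It assumes precision and recall are computed over sets of IDs, that the granular hit rate counts every relevant entry in the ranked list divided by the number of relevant IDs, and that the granular MRR averages the reciprocal ranks of all relevant entries. `ap` and `ndcg` are omitted.

```python
def retrieval_metrics(expected_ids, retrieved_ids):
    """Compute simple retrieval metrics for one query (illustrative helper)."""
    expected = set(expected_ids)
    retrieved = set(retrieved_ids)
    overlap = expected & retrieved

    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(expected) if expected else 0.0

    # Granular hit rate: relevant entries in the ranked list over relevant IDs.
    hits = sum(1 for doc_id in retrieved_ids if doc_id in expected)
    hit_rate = hits / len(expected) if expected else 0.0

    # Granular MRR: mean reciprocal rank over every relevant retrieved entry.
    reciprocal_ranks = [
        1 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected
    ]
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

    return {"precision": precision, "recall": recall, "hit_rate": hit_rate, "mrr": mrr}


example = retrieval_metrics(
    expected_ids=["a", "b", "c", "d"],
    retrieved_ids=["a", "x", "b"],
)
print(example)
```

Under these assumptions, a query whose single retrieved passage is relevant gets a precision and MRR of 1.0, while recall and hit rate equal one over the number of relevant passages, which matches the pattern in the `k = 1` rows.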


In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import (
    LLMContextRecall,
    LLMContextPrecisionWithReference,
    Faithfulness,
    SemanticSimilarity,
    NonLLMContextRecall,
    answer_correctness,
    FactualCorrectness,
)
from ragas import evaluate, EvaluationDataset
# AzureChatOpenAI is provided by langchain_openai; langchain.chat_models is deprecated
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
                openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'],
                azure_deployment='gpt-4o',
                model='gpt-4o',
            ))

# evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
#                 openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'],
#                 azure_deployment='gpt-35-turbo16k',
#                 model='gpt-35-turbo',
#             ))

evaluator_embeddings = LangchainEmbeddingsWrapper( AzureOpenAIEmbeddings(
                openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'],
                azure_deployment='text-embedding-ada-002',
                model='text-embedding-ada-002',
))


metrics = [
    LLMContextRecall(),  # share of claims in the reference that the retrieved contexts support, judged by an LLM
    LLMContextPrecisionWithReference(),  # whether the retrieved contexts relevant to the reference are ranked high, judged by an LLM
    FactualCorrectness(),  # F1 score of claims made in the response vs. those in the reference
    SemanticSimilarity(),  # embedding-based similarity between the generated answer and the ground truth
    answer_correctness,  # combines factual overlap and semantic similarity with the ground truth
    Faithfulness(),  # share of claims in the response that the retrieved contexts support
]

results_data = []
splitters = {'sentance_512_0': SentenceSplitter(chunk_size=512, chunk_overlap=100),
             'token_512': TokenTextSplitter(chunk_size=512, chunk_overlap=100)}
             
for splitter_name in splitters.keys():
    splitter = splitters[splitter_name]
    nodes = splitter.get_nodes_from_documents(docs)
    
    for k in [5, 10]: # removing others since this process is expensive
        query_engine = get_query_engine_for_retriever(new_nodes = nodes, k=k)
        samples = []
        for row in queries:
            query = row['question']
            response = query_engine.query(query)
            retrieved_nodes = response.source_nodes
            retrieved_passage_ids = [node.metadata['id'] for node in retrieved_nodes]
            retrieved_passages = [ ds.loc[int(id)].passage for id in retrieved_passage_ids ]
            relevant_passages = [ ds.loc[int(id)].passage for id in row["relevant_passage_ids"][1:-1].split(', ') ]
            
            sample = SingleTurnSample(
                user_input=query,
                reference=row["answer"],
                response=response.response,
                retrieved_contexts=retrieved_passages,
                reference_contexts=relevant_passages,
            )
            samples.append(sample)

        eval_dataset = EvaluationDataset(samples = samples)
        results = evaluate(dataset=eval_dataset, metrics=metrics, llm = evaluator_llm, embeddings = evaluator_embeddings)
        df = results.to_pandas()
        df['k'] = k
        df['splitter'] = splitter_name
        results_data.append(df)

results_df = pd.concat(results_data).reset_index(drop=True)
results_df.drop(['user_input','retrieved_contexts','reference_contexts','response','reference'], axis=1).groupby(['k','splitter']).mean()


KeyError: 'splitter'