# Hybrid Retrieval Concepts

1. Fusion approaches: Retrieving documents separately with each method, then combining results.
    - Simple fusion: Taking the union of results from multiple retrievers
    - Weighted fusion: Adjusting scores from each retriever and combining them
    - Reciprocal rank fusion: Scoring documents by their rank position in each result set rather than by their raw scores

2. Ensemble approaches: Using multiple retrievers in series or with more complex logic
    - Sequential retrieval: Using one retriever's output as input to another
    - Filter-then-rank: Using one method to create a candidate pool, another to rank
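
Reciprocal rank fusion is listed above but not implemented in this notebook, so here is a minimal sketch of the idea in plain Python. It assumes each retriever has already produced a ranked list of document ids. The function name, the ids, and the sample rankings are hypothetical, and the constant `k` (60 here, a commonly used value) is a tunable assumption rather than something taken from the notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into one ranking.

    rankings: dict mapping a retriever name to its ranked list of ids.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    fused = defaultdict(float)
    for ranking in rankings.values():
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank) to the document's score
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked ids from two retrievers
rankings = {
    "BM25": ["doc_backprop", "doc_ml", "doc_bm25"],
    "Vector": ["doc_ml", "doc_bm25", "doc_nn"],
}

fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because only rank positions are used, this approach does not depend on the two retrievers producing scores on comparable scales. Documents returned by both retrievers (`doc_ml`, `doc_bm25`) rise above documents returned by only one.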

In [1]:
# Imports
import pandas as pd
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.schema import QueryBundle, NodeWithScore
from llama_index.core.retrievers import BaseRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [2]:
# Create sample AI-related documents
documents = [
    Document(text="Machine learning is a subset of artificial intelligence that involves building systems that can learn from data. Common machine learning algorithms include linear regression, decision trees, and neural networks."),
    Document(text="BM25, or Best Match 25, also known as Okapi BM25, is a ranking algorithm for information retrieval and search engines that determines a document's relevance to a given query and ranks documents based on their relevance scores."),
    Document(text="Neural networks are computational models inspired by the human brain. They consist of layers of interconnected nodes or 'neurons' that process and transform input data to produce meaningful outputs."),
    Document(text="Transformers are a type of deep learning architecture introduced in the paper 'Attention is All You Need'. They have revolutionized natural language processing tasks such as translation, summarization, and question answering."),
    Document(text="Backpropagation is a key algorithm for training neural networks. It calculates the gradient of the loss function with respect to the network weights, allowing for efficient optimization."),
    Document(text="Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos. It aims to automate tasks that the human visual system can do."),
    Document(text="Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize a reward signal. It has been used to achieve superhuman performance in games like chess, Go, and various Atari games."),
    Document(text="Large Language Models (LLMs) like GPT-4 and Claude are transformer-based models trained on vast amounts of text data. They can generate human-like text, answer questions, and perform a variety of language tasks."),
    Document(text="Natural Language Processing (NLP) encompasses techniques for understanding, interpreting and generating human language. Modern NLP systems use transformer architectures to process and generate text with remarkable accuracy."),
    Document(text="Deep learning is a subset of machine learning that uses multi-layered neural networks to learn from data. It has achieved breakthroughs in computer vision, speech recognition, and natural language processing."),
]

In [6]:
# Create the individual BM25 and Vector retrievers
def create_retrievers(documents, top_k=5):
    """Create BM25 and Vector retrievers from documents."""

    # Split documents into nodes; chunk_size=2000 keeps each short document in a single node
    parser = SentenceSplitter(chunk_size=2000, chunk_overlap=0)
    nodes = parser.get_nodes_from_documents(documents)

    # Keyword (sparse) retriever built over the nodes
    bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=top_k)

    # Embedding (dense) retriever built over the same nodes
    embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vector_index = VectorStoreIndex(nodes, embed_model=embed_model)
    vector_retriever = vector_index.as_retriever(similarity_top_k=top_k)

    return {
        "BM25": bm25_retriever,
        "Vector": vector_retriever,
        "nodes": nodes
    }

In [4]:
# Create a simple hybrid retriever
class SimpleFusionRetriever(BaseRetriever):
    """Simple fusion retriever that combines results from multiple retrievers."""

    def __init__(
        self,
        retrievers: dict,
    ):
        """Initialize with retrievers dictionary."""
        self.retrievers = retrievers
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        """Retrieve nodes given query."""
        all_nodes = {}

        for name, retriever in self.retrievers.items():
            results = retriever.retrieve(query_bundle)

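            # Key by node_id to deduplicate; when several retrievers return the
            # same node, the entry (and score) from the later retriever is kept
            for node in results:
                all_nodes[node.node.node_id] = node

        # Return all unique nodes (not re-sorted, since scores from different
        # retrievers are not on a common scale)
        return list(all_nodes.values())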

In [5]:
# Create individual retrievers
retrievers_dict = create_retrievers(documents, top_k=5)

# Create simple fusion retriever
hybrid_retriever = SimpleFusionRetriever(
    retrievers={k: v for k, v in retrievers_dict.items() if k != "nodes"}
)

# Format results helper function
def format_results(results, name):
    """Format retrieval results for display."""
    output = [f"\n{name} Results:"]
    output.append("-" * 40)

    for i, node in enumerate(results):
        output.append(f"Result {i+1} [Score: {node.score:.4f}]:")
        output.append(f"{node.node.text}")
        output.append("-" * 50)

    return "\n".join(output)

# Test the retrievers
query_text = "algorithm"
query_bundle = QueryBundle(query_text)

print(f"Query: {query_text}")
print("=" * 30)

# Get results from each retriever
bm25_results = retrievers_dict["BM25"].retrieve(query_bundle)
vector_results = retrievers_dict["Vector"].retrieve(query_bundle)
hybrid_results = hybrid_retriever.retrieve(query_bundle)

# Print results
print(format_results(bm25_results, "BM25 Retriever"))
print(format_results(vector_results, "Vector Retriever"))
print(format_results(hybrid_results, "Simple Fusion Retriever"))

Query: algorithm

BM25 Retriever Results:
----------------------------------------
Result 1 [Score: 0.5282]:
Backpropagation is a key algorithm for training neural networks. It calculates the gradient of the loss function with respect to the network weights, allowing for efficient optimization.
--------------------------------------------------
Result 2 [Score: 0.4553]:
Machine learning is a subset of artificial intelligence that involves building systems that can learn from data. Common machine learning algorithms include linear regression, decision trees, and neural networks.
--------------------------------------------------
Result 3 [Score: 0.4465]:
BM25, or Best Match 25, also known as Okapi BM25, is a ranking algorithm for information retrieval and search engines that determines a document's relevance to a given query and ranks documents based on their relevance scores.
--------------------------------------------------
Result 4 [Score: 0.0000]:
Deep learning is a subset of machi

In [7]:
# Create a more advanced hybrid retriever with weighted scoring
class WeightedFusionRetriever(BaseRetriever):
    """Weighted fusion retriever that combines and rescores results."""

    def __init__(
        self,
        retrievers: dict,
        weights: dict
    ):
        """Initialize with retrievers and weights."""
        self.retrievers = retrievers
        self.weights = weights
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        """Retrieve nodes with weighted fusion approach."""
        # Get results from each retriever
        all_results = {}

        for name, retriever in self.retrievers.items():
            results = retriever.retrieve(query_bundle)
            weight = self.weights.get(name, 1.0)

            for node_with_score in results:
                node_id = node_with_score.node.node_id

                # Apply weight to score
                weighted_score = node_with_score.score * weight

                if node_id not in all_results:
                    all_results[node_id] = {
                        "node": node_with_score.node,
                        "scores": {}
                    }

                all_results[node_id]["scores"][name] = weighted_score

        # Combine scores and create final results
        final_results = []
        for node_id, data in all_results.items():
            # Sum up the scores from different retrievers
            combined_score = sum(data["scores"].values())

            # Create NodeWithScore object
            node_with_score = NodeWithScore(
                node=data["node"],
                score=combined_score
            )
            final_results.append(node_with_score)

        # Sort by combined score (descending)
        final_results.sort(key=lambda x: x.score, reverse=True)

        return final_results
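
The weighted sum above is applied to raw scores. BM25 scores are unbounded, while embedding similarities typically fall in a narrow range, so the weights may not have the effect they appear to have. One common remedy, sketched below in plain Python, is to min-max normalize each retriever's scores before weighting. This is an illustrative variant, not part of the notebook's `WeightedFusionRetriever`; the function names and sample scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {node_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every result as equally relevant
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - lo) / (hi - lo) for node_id, s in scores.items()}


def weighted_fusion(score_sets, weights):
    """Normalize each retriever's scores, then sum them with weights."""
    combined = {}
    for name, scores in score_sets.items():
        weight = weights.get(name, 1.0)
        for node_id, s in min_max_normalize(scores).items():
            combined[node_id] = combined.get(node_id, 0.0) + weight * s
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores on very different scales
score_sets = {
    "BM25": {"a": 3.6, "b": 0.6, "c": 0.0},
    "Vector": {"a": 0.55, "b": 0.35, "d": 0.15},
}

ranked = weighted_fusion(score_sets, {"BM25": 0.5, "Vector": 0.5})
for node_id, score in ranked:
    print(f"{node_id}: {score:.4f}")
```

Min-max scaling is sensitive to outliers and forces the lowest-scoring result of each retriever to zero, so it is one option among several rather than a default recommendation.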

In [8]:
# Test the weighted fusion retriever with different weights
# Create weighted fusion retrievers with different weight configurations
bm25_heavy = WeightedFusionRetriever(
    retrievers={k: v for k, v in retrievers_dict.items() if k != "nodes"},
    weights={"BM25": 0.8, "Vector": 0.2}  # BM25 heavy
)

vector_heavy = WeightedFusionRetriever(
    retrievers={k: v for k, v in retrievers_dict.items() if k != "nodes"},
    weights={"BM25": 0.2, "Vector": 0.8}  # Vector heavy
)

balanced = WeightedFusionRetriever(
    retrievers={k: v for k, v in retrievers_dict.items() if k != "nodes"},
    weights={"BM25": 0.5, "Vector": 0.5}  # Balanced weights
)

# Test with a semantic query
semantic_query = "how do computers understand images?"
semantic_bundle = QueryBundle(semantic_query)

print(f"\n\nQuery: {semantic_query}")
print("=" * 80)

# Get results from weighted retrievers
bm25_heavy_results = bm25_heavy.retrieve(semantic_bundle)
vector_heavy_results = vector_heavy.retrieve(semantic_bundle)
balanced_results = balanced.retrieve(semantic_bundle)

# Print results
print(format_results(bm25_heavy_results, "BM25-Heavy Weighted Fusion"))
print(format_results(vector_heavy_results, "Vector-Heavy Weighted Fusion"))
print(format_results(balanced_results, "Balanced Weighted Fusion"))



Query: how do computers understand images?

BM25-Heavy Weighted Fusion Results:
----------------------------------------
Result 1 [Score: 2.9622]:
Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos. It aims to automate tasks that the human visual system can do.
--------------------------------------------------
Result 2 [Score: 0.5407]:
Natural Language Processing (NLP) encompasses techniques for understanding, interpreting and generating human language. Modern NLP systems use transformer architectures to process and generate text with remarkable accuracy.
--------------------------------------------------
Result 3 [Score: 0.4552]:
Neural networks are computational models inspired by the human brain. They consist of layers of interconnected nodes or 'neurons' that process and transform input data to produce meaningful outputs.
--------------------------------------------------
Result 4 [Score: 0.

In [9]:
# Compare retriever performance across different query types
# Function to compare retrievers
def compare_retrievers(retrievers, queries):
    """Compare multiple retrievers across different query types."""
    results = []

    for query in queries:
        query_bundle = QueryBundle(query)
        query_result = {"Query": query}

        # Get top 3 results from each retriever
        for name, retriever in retrievers.items():
            retrieved = retriever.retrieve(query_bundle)
            # Keep the first sentence of each of the top 3 results for compact display
            texts = [node.node.text.split(".")[0] for node in retrieved[:3]]
            query_result[f"{name} Top 3"] = " | ".join(texts)

        results.append(query_result)

    return pd.DataFrame(results)


# Create dictionary of all retrievers to compare
all_retrievers = {
    "BM25": retrievers_dict["BM25"],
    "Vector": retrievers_dict["Vector"],
    "Fusion": hybrid_retriever,
    "BM25-Heavy": bm25_heavy,
    "Vector-Heavy": vector_heavy
}

# Test with a variety of queries
test_queries = [
    "CV libs",
    "neural networks backpropagation",
    "how do computers understand images?",
    "natural language processing",
    "machine learning applications",
    "reinforcement learning in games"
]

# Compare retrievers
comparison = compare_retrievers(all_retrievers, test_queries)

# Print comparison
print("\nRetriever Comparison Across Query Types:")
print(comparison)


Retriever Comparison Across Query Types:
                                 Query  \
0                              CV libs   
1      neural networks backpropagation   
2  how do computers understand images?   
3          natural language processing   
4        machine learning applications   
5      reinforcement learning in games   

                                          BM25 Top 3  \
0  Deep learning is a subset of machine learning ...   
1  Backpropagation is a key algorithm for trainin...   
2  Computer vision is an interdisciplinary field ...   
3  Natural Language Processing (NLP) encompasses ...   
4  Machine learning is a subset of artificial int...   
5  Reinforcement learning is a type of machine le...   

                                        Vector Top 3  \
0  Large Language Models (LLMs) like GPT-4 and Cl...   
1  Backpropagation is a key algorithm for trainin...   
2  Computer vision is an interdisciplinary field ...   
3  Natural Language Processing (NLP) encompass

# Practical Tips for Implementing Hybrid Retrievers

1. Choosing weights: Start with equal weights and adjust based on the query types you see in your application. If users frequently use keywords, increase BM25 weight; if they ask natural language questions, favor vector search.

2. Performance considerations: The fusion approach runs both retrievers for every query, so it costs more than using either one alone. In production, consider:

    - Running retrievers in parallel
    - Using a faster first-stage retriever (typically BM25) to filter candidates
    - Caching results for common queries

3. Beyond simple weighting: More sophisticated approaches include:

    - Using machine learning to learn optimal weights
    - Adjusting weights dynamically based on query type
    - Implementing reciprocal rank fusion, which considers result positions

4. Evaluation: Always evaluate your hybrid approach on a diverse set of queries to confirm that it performs better than either method alone.
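
To make the evaluation tip concrete, here is a minimal sketch that compares retrievers using hit rate at k and mean reciprocal rank (MRR). It assumes you have relevance judgments for each query, which the notebook does not provide. The ids, queries, judgments, and ranked lists below are hypothetical placeholders, not results from the retrievers above.

```python
def hit_rate_at_k(ranked_ids, relevant_ids, k=3):
    """Return 1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant id, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, judgments, k=3):
    """Average both metrics over all judged queries for each retriever.

    runs: {retriever_name: {query: ranked list of ids}}
    judgments: {query: set of relevant ids}
    """
    summary = {}
    for name, per_query in runs.items():
        hits = [hit_rate_at_k(per_query[q], rel, k) for q, rel in judgments.items()]
        rrs = [reciprocal_rank(per_query[q], rel) for q, rel in judgments.items()]
        summary[name] = {
            "hit_rate": sum(hits) / len(hits),
            "mrr": sum(rrs) / len(rrs),
        }
    return summary


# Hypothetical relevance judgments and ranked ids per retriever
judgments = {"q1": {"d1"}, "q2": {"d4"}}
runs = {
    "BM25": {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d3", "d5"]},
    "Vector": {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d2", "d1"]},
    "Fusion": {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d4", "d3"]},
}

summary = evaluate(runs, judgments)
for name, metrics in summary.items():
    print(name, metrics)
```

With real data, the ranked lists would come from calling each retriever's `retrieve` method and collecting the node ids, and the query set should cover both keyword-style and natural language queries.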