# RAG System with Hybrid Search Comparison

This notebook builds and evaluates a Retrieval-Augmented Generation (RAG) system using a news dataset. It implements and compares three distinct retrieval strategies:

1.  **BM25:** A classic keyword-based (lexical) search.
2.  **Semantic Search:** Using `SentenceTransformer` embeddings for meaning-based retrieval.
3.  **Reciprocal Rank Fusion (RRF):** A hybrid method that combines the ranks from the two methods above.

The final step integrates these retrievers into a RAG pipeline that provides context to a Llama 3 LLM. An interactive widget then lets you query the system and compare the answers generated with each retrieval method against a baseline non-RAG response.

In [1]:
import joblib
import numpy as np
import bm25s
import os
from sentence_transformers import SentenceTransformer

In [2]:
from utils import (
    read_dataframe,
    pprint, 
    generate_with_single_input, 
    cosine_similarity,
    display_widget
)
import unittests

## Loading the Dataset

In [3]:
NEWS_DATA = read_dataframe("news_data_dedup.csv")

In [4]:
NEWS_DATA[5]

{'guid': '18ba9f2676859f393a271d15692a9c6e',
 'title': 'WATCH: Would you pay a tourist fee to enter Venice?',
 'description': 'From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee.',
 'venue': 'BBC',
 'url': 'https://www.bbc.co.uk/news/world-europe-68898441',
 'published_at': '2024-04-25',
 'updated_at': '2024-04-26'}

In [5]:
len(NEWS_DATA)

870

## Retrieve Functions

### Query news by index

In [6]:
def query_news(indices):
    """
    Retrieves elements from a dataset based on specified indices.
    """
     
    output = [NEWS_DATA[index] for index in indices]
    return output

## BM25 Retrieve
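
The notebook relies on `bm25s` for scoring. For intuition about what a lexical scorer computes, here is a minimal pure-Python sketch of one common BM25 variant. The function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the parameters `k1=1.5` and `b=0.75` are illustrative assumptions; `bm25s` may use a different tokenizer, IDF variant, and defaults, so its scores will not match these.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a basic BM25 formula."""
    # Naive whitespace tokenization (assumption); real tokenizers do more cleanup
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms receive a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

toy_docs = [
    "gdp report shows inflation rising",
    "pop star announces world tour",
    "gdp growth slows while gdp forecasts fall",
]
toy_scores = bm25_scores("gdp report", toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

The first toy document matches both query terms and ranks first; the second shares no term with the query and scores zero, which is the typical weakness of purely lexical search.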

In [7]:
# The corpus used will be the title appended with the description
corpus = [x['title'] + " " + x['description'] for x in NEWS_DATA]

In [8]:
corpus[:2]

['Harvey Weinstein\'s 2020 rape conviction overturned Victims group describes the New York appeal court\'s decision to retry Hollywood mogul as "profoundly unjust".',
 'Police and activists clash on Atlanta campus amid Gaza protests Meanwhile, hundreds of students march in Washington DC, and congresswoman Ilhan Omar joins protesters at a New York campus.']

In [9]:
# Instantiate the retriever by passing the corpus data
BM25_RETRIEVER = bm25s.BM25(corpus=corpus)

# Tokenize the chunks
tokenized_data = bm25s.tokenize(corpus)

# Index the tokenized chunks within the retriever
BM25_RETRIEVER.index(tokenized_data)

# Tokenize the query
sample_query = "What are the recent news about GDP?"
tokenized_sample_query = bm25s.tokenize(sample_query)

# Get the retrieved results and their respective scores
results, scores = BM25_RETRIEVER.retrieve(tokenized_sample_query, k=3)

print(f"Results for query: {sample_query}\n")
for doc in results[0]:
    print(f"Document retrieved {corpus.index(doc)} : {doc}\n")

Results for query: What are the recent news about GDP?

Document retrieved 752 : GDP and the Dow Are Up. But What About American Well-Being? The standard ways of measuring economic growth don’t capture what life is like for real people. A new metric offers a better alternative, especially for seeing disparities across the country.

Document retrieved 673 : What the GDP Report Says About Inflation: A Hot First Quarter Thursday’s gross domestic product report suggests that a widely watched inflation reading due Friday could be worse than expected.




In [10]:
results[0]

array(['GDP and the Dow Are Up. But What About American Well-Being? The standard ways of measuring economic growth don’t capture what life is like for real people. A new metric offers a better alternative, especially for seeing disparities across the country.',
       'What the GDP Report Says About Inflation: A Hot First Quarter Thursday’s gross domestic product report suggests that a widely watched inflation reading due Friday could be worse than expected.',
      dtype='<U251')

In [11]:
# Globally defined BM25 retriever objects, used by the retrieval functions below

corpus = [x['title'] + " " + x['description'] for x in NEWS_DATA]
BM25_RETRIEVER = bm25s.BM25(corpus=corpus)
TOKENIZED_DATA = bm25s.tokenize(corpus)
BM25_RETRIEVER.index(TOKENIZED_DATA)

In [12]:
def bm25_retrieve(query: str, top_k: int = 5):
    """
    Retrieves the top k relevant documents for a given query using the BM25 algorithm.

    This function tokenizes the input query and uses a pre-indexed BM25 retriever to
    search through a collection of documents. It returns the indices of the top k documents
    that are most relevant to the query.

    Args:
        query (str): The search query for which documents need to be retrieved.
        top_k (int): The number of top relevant documents to retrieve. Default is 5.

    Returns:
        List[int]: A list of indices corresponding to the top k relevant documents
        within the corpus.
    """

    tokenized_query = bm25s.tokenize(query)
    results, scores = BM25_RETRIEVER.retrieve(tokenized_query, k=top_k)

    results = results[0]

    # Convert the retrieved documents into their corresponding indices in the corpus
    top_k_indices = [corpus.index(doc) for doc in results]
    
    return top_k_indices

In [13]:
bm25_retrieve("What are the recent news about GDP?")

[752, 673, 289, 626, 43]
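
`bm25_retrieve` maps each retrieved document back to its position with `corpus.index(doc)`, which scans the list once per hit. That is fine for 870 documents, but a precomputed lookup table is a possible alternative for larger corpora. The sketch below uses a toy corpus, and `build_index_lookup` is a hypothetical helper name, not part of the notebook.

```python
def build_index_lookup(corpus):
    """Map each document string to the index of its first occurrence."""
    lookup = {}
    for index, doc in enumerate(corpus):
        # setdefault keeps the first index when the same text appears twice,
        # which matches what list.index would return
        lookup.setdefault(doc, index)
    return lookup

toy_corpus = ["alpha story", "beta story", "alpha story", "gamma story"]
lookup = build_index_lookup(toy_corpus)

# Pretend these strings came back from the retriever
retrieved_docs = ["gamma story", "alpha story"]
indices = [lookup[doc] for doc in retrieved_docs]
print(indices)  # [3, 0]
```

Either approach returns the first matching position when two articles share identical text, so duplicated documents remain indistinguishable by content alone.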

## Semantic Search

In [14]:
EMBEDDINGS = joblib.load("embeddings.joblib")

In [15]:
EMBEDDINGS.shape

(870, 768)

In [16]:
model_name = os.path.join(os.environ['MODEL_PATH'], "BAAI/bge-base-en-v1.5")
model = SentenceTransformer(model_name)

In [17]:
query = "RAG is awesome"
model.encode(query)[:40]

array([ 0.00886302, -0.04775146, -0.00156089,  0.01309993, -0.00206938,
       -0.06157268,  0.01384688,  0.00101498, -0.04903949, -0.04762559,
       -0.03628184,  0.00478035, -0.03492182,  0.05323148,  0.02193964,
        0.03645132,  0.04029363, -0.00453639,  0.01883798, -0.03367384,
        0.02516192, -0.04843621, -0.04047944,  0.02590903,  0.02175229,
        0.03160364,  0.03937921, -0.03640463, -0.03113303, -0.01247228,
        0.03661649, -0.00458202, -0.00100169, -0.03188789,  0.02957137,
        0.01986158, -0.00737474,  0.02370178, -0.02151621, -0.07361361],
      dtype=float32)

In [18]:
query1 = "What are the primary colors"
query2 = "Yellow, red and blue"
query3 = "Cats are friendly animals"

query1_embed = model.encode(query1)
query2_embed = model.encode(query2)
query3_embed = model.encode(query3)

print(f"Similarity between '{query1}' and '{query2}' = {cosine_similarity(query1_embed, query2_embed)[0]}")
print(f"Similarity between '{query1}' and '{query3}' = {cosine_similarity(query1_embed, query3_embed)[0]}")

Similarity between 'What are the primary colors' and 'Yellow, red and blue' = 0.7377141714096069
Similarity between 'What are the primary colors' and 'Cats are friendly animals' = 0.4508620798587799
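
The `cosine_similarity` helper is imported from `utils`, and its implementation is not shown here. As a rough illustration of the quantity it reports, the following dependency-free sketch computes the standard cosine similarity on small hand-made vectors; the function name `cosine` and the vectors are assumptions for demonstration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # a scaled copy of query_vec
orthogonal = [0.0, 0.0, 3.0]      # shares no component with query_vec

sim_same = cosine(query_vec, same_direction)
sim_orth = cosine(query_vec, orthogonal)
print(sim_same, sim_orth)
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, which is why the metric suits comparing sentence embeddings.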


In [19]:
query = "Taylor Swift"
query_embed = model.encode(query)
similarity_scores = cosine_similarity(query_embed, EMBEDDINGS)
# Sort indices by decreasing similarity (argsort is ascending, so negate the scores)
similarity_indices = np.argsort(-similarity_scores)
# Top 2 indices
top_2_indices = similarity_indices[:2]
print(top_2_indices)

[350 176]


In [20]:
query_news(top_2_indices)

[{'guid': '927257674585bb6ef669cf2c2f409fa7',
  'title': '‘The working class can’t afford it’: the shocking truth about the money bands make on tour',
  'description': 'As Taylor Swift tops $1bn in tour revenue, musicians playing smaller venues are facing pitiful fees and frequent losses. Should the state step in to save our live music scene?When you see a band playing to thousands of fans in a sun-drenched festival field, signing a record deal with a major label or playing endlessly from the airwaves, it’s easy to conjure an image of success that comes with some serious cash to boot – particularly when Taylor Swift has broken $1bn in revenue for her current Eras tour. But looks can be deceiving. “I don’t blame the public for seeing a band playing to 2,000 people and thinking they’re minted,” says artist manager Dan Potts. “But the reality is quite different.”Post-Covid there has been significant focus on grassroots music venues as they struggle to stay open. There’s been less focus on

In [21]:
def semantic_search_retrieve(query, top_k=5):
    """
    Retrieves the top k relevant documents for a given query using semantic search and cosine similarity.

    This function generates an embedding for the input query and compares it against pre-computed document
    embeddings using cosine similarity. The indices of the top k most similar documents are returned.

    Args:
        query (str): The search query for which relevant documents need to be retrieved.
        top_k (int): The number of top relevant documents to retrieve. Default value is 5.

    Returns:
        List[int]: A list of indices corresponding to the top k most relevant documents in the corpus.
    """
    query_embedding = model.encode(query)
    similarity_scores = cosine_similarity(query_embedding, EMBEDDINGS)
    similarity_indices = np.argsort(-similarity_scores)
    top_k_indices_array = similarity_indices[:top_k]
    top_k_indices = [int(x) for x in top_k_indices_array]
    
    return top_k_indices

In [22]:
semantic_search_retrieve("What are the recent news about GDP?")

[743, 673, 626, 752, 326]

## RRF Retrieve

Reciprocal Rank Fusion (RRF) is an information retrieval technique that combines the results of multiple ranking systems. Each document receives a score based only on its rank in each result list, so systems whose raw scores are not comparable (such as BM25 scores and cosine similarities) can still be merged, leveraging the strengths of several retrieval approaches.


$$ 
\text{Score}(d) = \sum_{r=1}^{n} \frac{1}{k + \text{rank}_r(d)} 
$$

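- $n$ is the number of ranking systems,
- $\text{rank}_r(d)$ is the rank of document $d$ in the $r$-th result list (a document absent from a list contributes nothing for that list),
- $k$ is a smoothing constant that dampens the advantage of top-ranked documents; a common default is 60 (the `K` parameter in the code below).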
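
As a quick check of the formula, take $n = 2$ and $k = 60$ with illustrative ranks (not taken from the dataset). A document ranked 2nd by both systems scores

$$
\frac{1}{60 + 2} + \frac{1}{60 + 2} = \frac{2}{62} \approx 0.03226
$$

while a document ranked 1st by one system and 4th by the other scores

$$
\frac{1}{60 + 1} + \frac{1}{60 + 4} \approx 0.01639 + 0.01563 = 0.03202
$$

so consistent agreement between the systems slightly outweighs a single first place. A document that appears in only one list, even at rank 1, scores just $\frac{1}{61} \approx 0.01639$.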

In [23]:
def reciprocal_rank_fusion(list1, list2, top_k=5, K=60):
    """
    Fuse the rankings from two IR systems using Reciprocal Rank Fusion.

    Args:
        list1 (list[int]): Document indices ranked by the first retriever, best match first.
        list2 (list[int]): Document indices ranked by the second retriever, best match first.
        top_k (int): The number of fused documents to return. Defaults to 5.
        K (int): The smoothing constant in the RRF formula. Defaults to 60.

    Returns:
        list[int]: The indices of the top_k documents, sorted by decreasing RRF score.
    """

    rrf_scores = {}

    # Iterate over each document list
    for lst in [list1, list2]:
        # Calculate the RRF score for each document index
        for rank, item in enumerate(lst, start=1):
            # If the item is not in the dictionary, initialize its score to 0
            if item not in rrf_scores:
                rrf_scores[item] = 0
            # Update the RRF score for each document index using the formula 1 / (rank + K)
            rrf_scores[item] += 1 / (K + rank)

    # Sort the document indices based on their RRF scores in descending order
    sorted_items = sorted(rrf_scores, key=rrf_scores.get, reverse = True)

    # Slice the list to get the top-k document indices
    top_k_indices = [int(x) for x in sorted_items[:top_k]]

    return top_k_indices

In [24]:
list1 = semantic_search_retrieve('What are the recent news about GDP?')
list2 = bm25_retrieve('What are the recent news about GDP?')
rrf_list = reciprocal_rank_fusion(list1, list2)
print(f"Semantic Search List: {list1}")
print(f"BM25 List: {list2}")
print(f"RRF List: {rrf_list}")

Semantic Search List: [743, 673, 626, 752, 326]
BM25 List: [752, 673, 289, 626, 43]
RRF List: [673, 752, 626, 743, 289]
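
To see why the fused list comes out in this order, the sketch below recomputes the per-document RRF scores for the two lists printed above, using the same `K = 60`. The variable names are illustrative; the arithmetic follows the formula given earlier.

```python
from collections import defaultdict

semantic_list = [743, 673, 626, 752, 326]
bm25_list = [752, 673, 289, 626, 43]
K = 60

# Accumulate 1 / (K + rank) for every list in which a document appears
scores = defaultdict(float)
for ranked_list in (semantic_list, bm25_list):
    for rank, doc_id in enumerate(ranked_list, start=1):
        scores[doc_id] += 1 / (K + rank)

ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
for doc_id, score in ranked[:5]:
    print(f"{doc_id}: {score:.5f}")

top5 = [doc_id for doc_id, _ in ranked[:5]]
```

Documents 673, 752 and 626 appear in both lists, so each collects two contributions and scores roughly twice as much as 743 or 289, which appear in only one list.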


## RAG System

### Creating the final prompt

In [25]:
def generate_final_prompt(query, top_k, retrieve_function=None, use_rag=True):
    """
    Generates an augmented prompt for a Retrieval-Augmented Generation (RAG) system by retrieving the top_k most 
    relevant documents based on a given query.

    Parameters:
    query (str): The search query for which the relevant documents are to be retrieved.
    top_k (int): The number of top relevant documents to retrieve.
    retrieve_function (callable): The function used to retrieve relevant documents. If it is 
                                  `reciprocal_rank_fusion`, the results of semantic search and BM25 are combined.
    use_rag (bool): Whether to incorporate retrieved data into the prompt (default is True).

    Returns:
    str: The augmented prompt containing the top_k relevant documents, or the unmodified query if use_rag is False.
    """    
    prompt = query
    
    if not use_rag:
        return prompt


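    # A retrieval function is mandatory once RAG is enabled.
    if retrieve_function is None:
        raise ValueError("retrieve_function is required when use_rag=True")

    # Determine which retrieve function to use based on its name.
    if retrieve_function.__name__ == 'reciprocal_rank_fusion':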
        # Retrieve top documents using two different methods.
        list1 = semantic_search_retrieve(query, top_k)
        list2 = bm25_retrieve(query, top_k)
        # Combine the results using reciprocal rank fusion.
        top_k_indices = retrieve_function(list1, list2, top_k)
    else:
        # Use the provided retrieval function.
        top_k_indices = retrieve_function(query=query, top_k=top_k)
    
    
    # Retrieve documents from the dataset using the indices.
    relevant_documents = query_news(top_k_indices)
    
    formatted_documents = []

    # Iterate over each retrieved document.
    for document in relevant_documents:
        # Format each document into a structured string.
        formatted_document = (
            f"Title: {document['title']},\tDescription: {document['description']},\t"
            f"Published at: {document['published_at']}\nURL: {document['url']}"
        )
        # Collect the formatted string; the entries are joined with newlines below.
        formatted_documents.append(formatted_document)

    retrieve_data_formatted = "\n".join(formatted_documents)
    
    prompt = f"""
    **Instructions:**
    Answer the user's query by synthesizing your general knowledge with the provided "2024 News Context".
    This context is recent and should be prioritized. Do not simply repeat the context; integrate it into a comprehensive response.

    **User Query:**
    {query}

    **2024 News Context:**
    {retrieve_data_formatted}

    **Answer:**
    """
    
    return prompt

In [26]:
def llm_call(query, retrieve_function=None, top_k=5, use_rag=True):
    prompt = generate_final_prompt(query, top_k=top_k, retrieve_function=retrieve_function, use_rag=use_rag)
    generated_response = generate_with_single_input(prompt)
    generated_message = generated_response['content']
    return generated_message
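
`generate_final_prompt` special-cases `reciprocal_rank_fusion` by name because its signature differs from the other retrievers. One possible alternative, sketched here under assumptions, is to wrap the fusion step in a closure that exposes the same `(query, top_k)` signature. The names `rrf_fuse`, `make_hybrid_retriever`, `fake_semantic` and `fake_bm25` are hypothetical; the two stubs stand in for `semantic_search_retrieve` and `bm25_retrieve` so the block runs on its own.

```python
def rrf_fuse(ranked_lists, top_k=5, K=60):
    """Reciprocal Rank Fusion over any number of ranked index lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def make_hybrid_retriever(*retrievers, K=60):
    """Build a retriever with the same (query, top_k) signature as its parts."""
    def hybrid_retrieve(query, top_k=5):
        ranked_lists = [retrieve(query, top_k) for retrieve in retrievers]
        return rrf_fuse(ranked_lists, top_k=top_k, K=K)
    return hybrid_retrieve

# Stub retrievers (assumption): they return fixed rankings for any query
def fake_semantic(query, top_k=5):
    return [743, 673, 626, 752, 326][:top_k]

def fake_bm25(query, top_k=5):
    return [752, 673, 289, 626, 43][:top_k]

hybrid = make_hybrid_retriever(fake_semantic, fake_bm25)
fused = hybrid("What are the recent news about GDP?", top_k=5)
print(fused)  # [673, 752, 626, 743, 289]
```

With such a wrapper, every retriever could be called the same way, and the name check would no longer be needed. Whether that trade-off is worthwhile depends on how the widget passes its retrieval functions.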

In [27]:
query = "Recent news in technology. Provide sources."
print(llm_call(query, retrieve_function=semantic_search_retrieve))

**Recent News in Technology:**

The technology sector has been experiencing significant changes in recent times. One of the key areas of focus is the advancement of Artificial Intelligence (AI) and its impact on the chip industry. According to a recent article by El Pais, the unstoppable advance of AI is changing the rules, creating new winners and losers in the increasingly important semiconductor sector. This has led to a "Game of Thrones" scenario in the chip industry, with various players vying for dominance.

Another area of interest is the impact of AI on the advertising industry. A slower pace of business in the technology sector has continued to weigh on some ad holding companies in the first quarter, but things might be looking up, according to a recent article by The Wall Street Journal.

In terms of market trends, T-Mobile, Imax, Rogers Communications, and other companies have been making headlines in the latest Market Talks covering Technology, Media and Telecom. These comp

In [28]:
query = "Recent news in technology. Provide sources."
print(llm_call(query, retrieve_function=bm25_retrieve))

**Recent News in Technology:**

In recent news, ByteDance has expressed its reluctance to sell TikTok in the US, citing the app's 'secret source' algorithm as a core part of its operations. This has made a sale of the app highly unlikely, according to sources close to the parent company. (Source: The Guardian, April 25, 2024)

On the other hand, Microsoft has seen a rise in profit due to the increasing demand for its software and cloud services, which have been bolstered by the growing use of AI technology. This has prompted the company to invest heavily in infrastructure to accommodate the growing appetite for AI. (Source: The Wall Street Journal, April 26, 2024)

Additionally, the Biden administration has announced plans to consolidate approval authority over big power-grid projects to accelerate upgrades and provide access to new clean-energy projects. This move is aimed at promoting the development of clean energy and reducing the country's reliance on fossil fuels. (Source: The Wa

In [29]:
query = "Recent news in technology. Provide sources."
print(llm_call(query, retrieve_function=reciprocal_rank_fusion))

**Recent News in Technology:**

In recent news, ByteDance, the parent company of TikTok, has expressed its preference to shut down the app in the US rather than sell it. This decision is reportedly due to the algorithms used by TikTok being deemed core to ByteDance's overall operations, making a sale highly unlikely. (Source: The Guardian, April 25, 2024)

Additionally, the advancement of artificial intelligence (AI) is changing the rules in the semiconductor sector, creating new winners and losers. The unstoppable advance of AI is driving the demand for more powerful and efficient chips, leading to a "Game of Thrones" scenario in the industry. (Source: El Pais, April 12, 2024)

Furthermore, the Biden administration is consolidating approval authority over big power-grid projects to accelerate upgrades and provide access to new clean-energy projects. This move aims to support the transition to a cleaner and more sustainable energy sector. (Source: The Wall Street Journal, April 26, 202

In [30]:
display_widget(llm_call, semantic_search_retrieve, bm25_retrieve, reciprocal_rank_fusion)
