## Building a RAG Pipeline with BBC News Data

Have you ever wondered how Large Language Models (LLMs) can stay updated with recent events even after their training cut-off? That’s where **Retrieval-Augmented Generation (RAG)** steps in! In this project, we’ll take on the challenge of building a RAG pipeline using a dataset of news articles from **BBC News**.

The main objective? To retrieve the most relevant news details from the dataset and have our LLM weave that information into its responses. The model we’re working with is **Llama 3.1**, run locally through **Ollama**. While the model is impressive, it has no knowledge of events after its training cut-off, and that’s exactly the gap we’ll fill with our RAG system.

Here’s what this blog involves:

* **Query & Retrieval:** Implement a function that fetches relevant news snippets based on the user’s query.
* **Data Formatting:** Organize the retrieved information so that it’s clean and contextually useful.
* **Prompt Engineering:** Combine the original query and the retrieved data into a well-crafted prompt, then feed it into the LLM for a richer, more informed response.


#### Importing necessary libraries

In [4]:
import json
import numpy as np
import pandas as pd
from pprint import pprint
from dateutil import parser
import joblib

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

import warnings
warnings.filterwarnings('ignore')

#### Reading dataset

In [5]:
NEWS_DATA = pd.read_csv("news_data_dedup.csv")
NEWS_DATA.head(3)

| | guid | title | description | venue | url | published_at | updated_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | e3dc5caa18f9a16d7edcc09f8d5c2bb4 | Harvey Weinstein's 2020 rape conviction overtu... | Victims group describes the New York appeal co... | BBC | https://www.bbc.co.uk/news/world-us-canada-688... | 2024-04-25 18:24:04+00 | 2024-04-26 20:03:00.628113+00 |
| 1 | 297b7152cd95e80dd200a8e1997e10d9 | Police and activists clash on Atlanta campus a... | Meanwhile, hundreds of students march in Washi... | BBC | https://www.bbc.co.uk/news/live/world-us-canad... | 2024-04-25 13:40:25+00 | 2024-04-26 20:03:00.654819+00 |
| 2 | 170bd18d1635c44b9339bdbaf1e62123 | Haiti PM resigns as transitional council sworn in | The council will try to restore order and form... | BBC | https://www.bbc.co.uk/news/world-latin-america... | 2024-04-25 18:11:02+00 | 2024-04-26 20:03:00.663393+00 |


**Defining a utility function that reads the dataset from the given path, formats the date columns, and converts the pandas DataFrame into a list of record dictionaries.**

In [6]:
def format_date(date_string):
    date_object = parser.parse(date_string)  #parsing string into datetime object
    formatted_date = date_object.strftime("%Y-%m-%d")
    return formatted_date


def read_dataframe(path):
    df = pd.read_csv(path)
    
    df['published_at'] = df['published_at'].apply(format_date)
    df['updated_at'] = df['updated_at'].apply(format_date)

    records = df.to_dict(orient='records')  # Convert the DataFrame to a list of record dictionaries after formatting
    return records

In [7]:
NEWS_DATA = read_dataframe("news_data_dedup.csv")
NEWS_DATA[:3]

[{'guid': 'e3dc5caa18f9a16d7edcc09f8d5c2bb4',
  'title': "Harvey Weinstein's 2020 rape conviction overturned",
  'description': 'Victims group describes the New York appeal court\'s decision to retry Hollywood mogul as "profoundly unjust".',
  'venue': 'BBC',
  'url': 'https://www.bbc.co.uk/news/world-us-canada-68899382',
  'published_at': '2024-04-25',
  'updated_at': '2024-04-26'},
 {'guid': '297b7152cd95e80dd200a8e1997e10d9',
  'title': 'Police and activists clash on Atlanta campus amid Gaza protests',
  'description': 'Meanwhile, hundreds of students march in Washington DC, and congresswoman Ilhan Omar joins protesters at a New York campus.',
  'venue': 'BBC',
  'url': 'https://www.bbc.co.uk/news/live/world-us-canada-68898923',
  'published_at': '2024-04-25',
  'updated_at': '2024-04-26'},
 {'guid': '170bd18d1635c44b9339bdbaf1e62123',
  'title': 'Haiti PM resigns as transitional council sworn in',
  'description': 'The council will try to restore order and form a new government in 

#### Initiating the embedding model

To perform retrieval, we need to embed both the data and the query. To generate the embeddings, we use the `SentenceTransformer` class from the sentence-transformers library with the `BAAI/bge-base-en-v1.5` model, a pre-trained transformer that produces dense embeddings suitable for semantic search, retrieval-augmented generation (RAG), and other NLP tasks. Initializing the model with `SentenceTransformer(embedding_model_name)` lets us convert text into vector representations that capture semantic meaning.

In [8]:
embedding_model_name = "BAAI/bge-base-en-v1.5"
embed_model = SentenceTransformer(embedding_model_name)

In [9]:
embed_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
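
The `Normalize()` layer at the end of the printout means the model returns unit-length vectors. As a small sketch of why that matters for retrieval, the NumPy snippet below uses two made-up 3-dimensional vectors (not real model outputs) to show that, once vectors are normalized, cosine similarity is just a dot product.

```python
import numpy as np

# Two made-up vectors standing in for embeddings (real ones have 768 dimensions).
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])


def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Scale each vector to unit length, which is what a normalization layer does.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit-length vectors, the plain dot product equals the cosine similarity.
dot_of_units = float(np.dot(a_unit, b_unit))

print(round(cosine(a, b), 4))   # 0.7333
print(round(dot_of_units, 4))   # 0.7333
```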

The news dataset was already embedded with this same model, and the resulting embeddings were saved to disk.

In [10]:
# Loading the embeddings of the dataset
EMBEDDINGS = joblib.load("embeddings.joblib")

**The process below was used to generate and save the embeddings for the loaded dataset**

The `concatenate_fields` function below **concatenates multiple fields from each record in a dataset into a single text string**. It loops through each record, retrieves the specified fields (using an empty string when a field is missing), joins their values with spaces, trims the extra whitespace, and stores the combined text in a list, which it returns at the end. The `generate_and_save_embeddings` function then encodes these strings and saves the result; use it when you need to embed the complete dataset.

In [29]:
def concatenate_fields(dataset, fields):
    concatenated_data = []   # list where texts will be stored

    for data in dataset:
        text = ''
        for field in fields:
            context = data.get(field, '') # get the desired field if found, else use an empty string
            if context:
                text += f"{context} "  # add context to text if context is available

        text = text.strip() # strip whitespace from text
        concatenated_data.append(text)

    return concatenated_data


def generate_and_save_embeddings(dataset, fields, model_name="BAAI/bge-base-en-v1.5", output_path='embeddings.joblib'):
    # Concatenate fields into text
    concatenated_texts = concatenate_fields(dataset, fields)

    # Load the embedding model
    model = SentenceTransformer(model_name)

    # Generate embeddings
    embeddings = model.encode(concatenated_texts, batch_size=32, show_progress_bar=True)

    # Save embeddings using joblib
    joblib.dump(embeddings, output_path)
    print(f"Embeddings saved to {output_path}")

    return embeddings


# Example usage:
# dataset = This is the list of dictionaries that we created above
# fields = The column names
# embeddings = generate_and_save_embeddings(dataset, fields, output_path='news_embeddings.joblib')
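
To make the concatenation step concrete, here is a small sketch on two toy records. It uses a condensed re-implementation with the hypothetical name `concatenate_fields_compact`, and the field list `["title", "description"]` is an assumption: the post does not say which columns were actually embedded.

```python
def concatenate_fields_compact(dataset, fields):
    """Join the chosen fields of every record into one string per record."""
    texts = []
    for record in dataset:
        # Keep only fields that are present and non-empty, then join with spaces.
        parts = [str(record[field]) for field in fields if record.get(field)]
        texts.append(" ".join(parts).strip())
    return texts


toy_dataset = [
    # This record has no 'description', so only the title is used.
    {"title": "Haiti PM resigns as transitional council sworn in", "venue": "BBC"},
    {
        "title": "Large tornado seen touching down in Nebraska",
        "description": "Severe and powerful storms have moved across several US states, "
                       "leaving many experiencing power shortages.",
    },
]

fields = ["title", "description"]  # assumed field list, for illustration only
texts = concatenate_fields_compact(toy_dataset, fields)
for text in texts:
    print(text)
```

Each resulting string is what would be passed to `model.encode`, so the choice of fields determines what the similarity search can match on.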


#### Defining model calling function

Now that we have the dataset in dictionary format and the embeddings ready, we can define the LLM model calling function.

In [11]:
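# Helper for pretty-printing JSON-serializable objects.
# Note: this replaces the `pprint` imported from the standard library above.
def pprint(*args, **kwargs):
    print(json.dumps(*args, indent = 2))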

**Initiating the model call with a single prompt input**

In [12]:
def generate_with_single_input(
                            prompt: str,
                            role: str = 'user',
                            top_p: float = 0.1,
                            temperature: float = 0.1,
                            max_tokens: int = 500,
                            model: str = "llama3.1:8b",
                            **kwargs
                                ):


    llm = ChatOllama(
        model = model,
        temperature = temperature,
        top_p= top_p,
        frequency_penalty = 0.5,
        presence_penalty = 0.3,
        # ChatOllama's documented parameter for capping the number of generated tokens is `num_predict`
        num_predict = max_tokens
        )

    
    role_map = {"user" : HumanMessage,
                "system" : SystemMessage,
                "assistant" : AIMessage}


    content_passed = prompt
    role = role_map[role]
    messages = [role(content=content_passed)]

    response = llm.invoke(messages)

    response_role = "assistant" if response.type == "ai" else response.type
    response_type = {'Role':response_role}

    # Convert to dictionary
    json_dict = response.model_dump()
    json_dict.update(response_type)  # Adding response type with the response dictionary

    return json_dict

#### Building the retriever

We now have the dataset as a list of dictionaries, the embeddings of the dataset, and the LLM calling function. Next, we need to supply the LLM with the data that matches the user query, which is the retriever's job. The function **`query_news`** takes a list of indices and returns the documents in the dataset at those positions. It is a simple lookup: for each index in the list, it fetches the associated document (here, a news article). You use it once you already know which records you need, typically after a similarity search.

In [13]:
def query_news(indices: list) -> list:
    output = [NEWS_DATA[index] for index in indices]
    return output

In [14]:
# Let's say you need to see the values at index 4 and 10 in the news dataset
indices = [4,10]
pprint(query_news(indices))

[
  {
    "guid": "733f744b006fb13033d264efcaf8edad",
    "title": "Prosecutors ask for halt to case against Spain PM's wife",
    "description": "Pedro S\u00e1nchez is deciding whether to resign after a case against his wife by an anti-corruption group.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895727",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "d2c3ff79d4e068911d05416ca061cd51",
    "title": "Ukraine uses longer-range US missiles for first time",
    "description": "Missiles secretly delivered this month have been used to strike Russian targets in Crimea, US media say.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68893196",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  }
]


**The `retrieve` function is a crucial component of the RAG system. It is designed to identify and return the most relevant documents from a given corpus based on a provided query.**

This function takes two input parameters: `query` and `top_k`. The `query` parameter is a string representing the search query for which the system needs to find the most relevant documents. The `top_k` parameter is an integer that specifies how many of the top similar documents should be retrieved. Upon execution, the function processes the query and compares it against the corpus to determine similarity scores, then returns a list of indices corresponding to the top `k` most relevant documents. These indices serve as references to the original documents, which can later be used for generating precise and contextually relevant responses.


In [16]:
def retrieve(query, top_k = 5): # Default value for document to return is 5

    # Convert the input query into a numerical embedding vector using the pre-trained embedding model.
    query_embedding = embed_model.encode(query)

    # Compute the cosine similarity between the query embedding (reshaped into a 2D array) and all stored embeddings.
    # This returns an array of similarity scores for each stored document.
    similarity_score = cosine_similarity(query_embedding.reshape(1,-1), EMBEDDINGS)[0]

    # Sort the indices of documents in descending order of similarity (highest similarity first).
    # Negative sign (-similarity_score) ensures sorting in descending order.
    similarity_indices = np.argsort(-similarity_score)

    # Select the first 'top_k' indices from the sorted list, which correspond to the most similar documents.
    top_k_indices = similarity_indices[:top_k]

    return top_k_indices
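
To see the ranking step in isolation, here is a minimal sketch on four made-up 3-dimensional "document" vectors. It follows the same idea as `retrieve` (score every document, sort in descending order, keep the first `top_k`), but the helper name `top_k_by_cosine` and all the vectors are hypothetical, and the similarity is computed with plain NumPy instead of scikit-learn.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, top_k=2):
    """Return the indices of the top_k rows of doc_matrix most similar to query_vec."""
    # Normalize the query and every document row to unit length.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

    # One cosine similarity score per document.
    scores = d @ q

    # argsort is ascending, so negate the scores to get highest similarity first.
    order = np.argsort(-scores)
    return order[:top_k], scores


# Made-up vectors; in the real pipeline these would be rows of EMBEDDINGS.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

top_indices, scores = top_k_by_cosine(query_vec, docs, top_k=2)
print(top_indices)  # [0 2]
```

Document 0 points almost in the same direction as the query, document 2 is halfway between the first two axes, and document 3 is orthogonal to the query, so it scores zero.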


In [17]:
query = "Concerts in North America"
indices = retrieve(query, top_k = 1) # Passing the query to get the top index similar to the query
print(indices)

[350]


In [18]:
retrieved_documents = query_news(indices) # Passing the top index to get the data from news dataset at that index
pprint(retrieved_documents)

[
  {
    "guid": "927257674585bb6ef669cf2c2f409fa7",
    "title": "\u2018The working class can\u2019t afford it\u2019: the shocking truth about the money bands make on tour",
    "description": "As Taylor Swift tops $1bn in tour revenue, musicians playing smaller venues are facing pitiful fees and frequent losses. Should the state step in to save our live music scene?When you see a band playing to thousands of fans in a sun-drenched festival field, signing a record deal with a major label or playing endlessly from the airwaves, it\u2019s easy to conjure an image of success that comes with some serious cash to boot \u2013 particularly when Taylor Swift has broken $1bn in revenue for her current Eras tour. But looks can be deceiving. \u201cI don\u2019t blame the public for seeing a band playing to 2,000 people and thinking they\u2019re minted,\u201d says artist manager Dan Potts. \u201cBut the reality is quite different.\u201dPost-Covid there has been significant focus on grassroots mus

**We will encapsulate the above logic into a function called `get_relevant_data`. This function accepts a query and a `top_k` parameter, then retrieves and returns the top `top_k` most relevant documents.**

In [19]:
def get_relevant_data(query: str, top_k: int = 5) -> list[dict]:
    relevant_indices = retrieve(query, top_k)  # Getting the top_k similar indices with the prompt via similarity search

    # We are using the embeddings to fetch the most similar data and are not passing those embeddings to the LLM
    relevant_data = query_news(relevant_indices)  # Passing the retrieved indices into the news dataset to get relevant data from news dataset
    return relevant_data

In [20]:
query = "Greatest storms in the US"
relevant_data = get_relevant_data(query, top_k = 1)
pprint(relevant_data)

[
  {
    "guid": "3ca548fe82c3fcae2c4c0c635d03eb2e",
    "title": "Large tornado seen touching down in Nebraska",
    "description": "Severe and powerful storms have moved across several US states, leaving many experiencing power shortages.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68860070",
    "published_at": "2024-04-26",
    "updated_at": "2024-04-28"
  }
]


Now that we have the essential retrieved data, we need to create a function that takes a list of documents as input and produces a well-structured string containing the details of each document. The output includes the following fields for each news item:

* **News Title**
* **News Description**
* **News Published Date**
* **News URL**


In [21]:
def format_relevant_data(relevant_data: list[dict]) -> str:
    formatted_documents = []
    
    for document in relevant_data:
        # Format each document as a structured string; the documents are joined with newlines at the end.
        formatted_document = f"""
        Title:{document['title']}, Description:{document['description']}, Published At: {document['published_at']},\n URL: {document['url']} 
        """

        formatted_documents.append(formatted_document)

    return "\n".join(formatted_documents)
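
The triple-quoted f-string above keeps the indentation of the source code in its output. As one possible variation (not the formatter used in the rest of this post), the sketch below emits the same four fields with one field per line and no leading whitespace. The function name `format_documents_compact` is hypothetical, and the sample record reuses the tornado article retrieved earlier.

```python
def format_documents_compact(documents):
    """Format each document as four labeled lines, separated by blank lines."""
    blocks = []
    for doc in documents:
        # .get() with a default avoids a KeyError if a record lacks a field.
        block = (
            f"Title: {doc.get('title', '')}\n"
            f"Description: {doc.get('description', '')}\n"
            f"Published At: {doc.get('published_at', '')}\n"
            f"URL: {doc.get('url', '')}"
        )
        blocks.append(block)
    return "\n\n".join(blocks)


sample_documents = [
    {
        "title": "Large tornado seen touching down in Nebraska",
        "description": "Severe and powerful storms have moved across several US states, "
                       "leaving many experiencing power shortages.",
        "published_at": "2024-04-26",
        "url": "https://www.bbc.co.uk/news/world-us-canada-68860070",
    }
]

compact_text = format_documents_compact(sample_documents)
print(compact_text)
```

Whether the model answers differently with one layout or the other is not something this post measures; the compact form mainly keeps the prompt shorter and easier to read when debugging.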

In [22]:
# Checking what the output of format_relevant_data looks like
example_data = NEWS_DATA[4:8]
print(format_relevant_data(example_data))


        Title:Prosecutors ask for halt to case against Spain PM's wife, Description:Pedro Sánchez is deciding whether to resign after a case against his wife by an anti-corruption group., Published At: 2024-04-25,
 URL: https://www.bbc.co.uk/news/world-europe-68895727 
        

        Title:WATCH: Would you pay a tourist fee to enter Venice?, Description:From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee., Published At: 2024-04-25,
 URL: https://www.bbc.co.uk/news/world-europe-68898441 
        

        Title:Supreme Court divided on whether Trump has immunity, Description:The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy., Published At: 2024-04-25,
 URL: https://www.bbc.co.uk/news/world-us-canada-68901817 
        

        Title:More than 150 killed as heavy rains pound Tanzania, Description:The prime minister warns that El Niño-triggered heavy rains are likely to continue into

#### Generating prompt and calling the model

The next function, **`generate_final_prompt`**, creates the complete prompt by combining the user’s query with the retrieved and formatted data, so that the final prompt carries the context the model needs to generate an accurate and informative response.

In [23]:
def generate_final_prompt(query, top_k=5, use_rag=True, prompt=None):

    # If RAG is not being used, return the query unchanged as the prompt
    if not use_rag:
        return query

    # Getting the relevant data as per the query passed
    relevant_data = get_relevant_data(query, top_k = top_k)

    # Formatting the relevant data fetched
    retrieve_data_formatted = format_relevant_data(relevant_data)

    # If no prompt is given then use this default
    if prompt is None:
        prompt = (
            f"Answer the user query below. There will be provided additional information for you to compose your answer. "
            f"The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, "
            f"you should not rely only on this information to answer the query, but add it to your overall knowledge."
            f"Query: {query}\n"
            f"2024 News: {retrieve_data_formatted}"
        )
    else:
        prompt = prompt.format(query = query, documents = retrieve_data_formatted)

    return prompt
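
When a custom `prompt` is passed, the function calls `prompt.format(query=..., documents=...)`, so the template must contain `{query}` and `{documents}` placeholders. The snippet below is a minimal sketch of such a template; the template wording and the sample strings are made up for illustration.

```python
# A hypothetical custom template with the two placeholders the function fills in.
custom_template = (
    "Use the news items below as additional context when answering.\n"
    "Question: {query}\n"
    "News items:\n"
    "{documents}"
)

# Stand-ins for the user query and for the output of the formatting step.
sample_query = "What happened in Nebraska?"
sample_documents = (
    "Title: Large tornado seen touching down in Nebraska, "
    "Published At: 2024-04-26"
)

# This mirrors the prompt.format(...) call made inside the function.
final_prompt = custom_template.format(query=sample_query, documents=sample_documents)
print(final_prompt)
```

A template like this would be passed through the `prompt` argument. Any literal braces in a template (for example a JSON snippet) need to be doubled as `{{` and `}}`, otherwise `str.format` treats them as placeholders.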

In [24]:
print(generate_final_prompt(query = "Tell me about the US GDP in the past 3 years.", top_k = 1))

Answer the user query below. There will be provided additional information for you to compose your answer. The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, you should not rely only on this information to answer the query, but add it to your overall knowledge.Query: Tell me about the US GDP in the past 3 years.
2024 News: 
        Title:America's Economy Is No. 1. That Means Trouble, Description:If you want a single number to capture America’s economic stature, here it is: This year, the U.S. will account for 26.3% of the global gross domestic product, the highest in almost two decades. That’s based on the latest projections from the International Monetary Fund. According to the IMF, Europe’s share of world GDP has dropped 1.4 percentage points since 2018, and Japan’s by 2.1 points. The U.S. share, by contrast, is up 2.3 points., Published At: 2024-04-26,
 URL: https://www.wsj.com/articles/americas-economy-is-no-1-that-

**Calling the model with the prompt**

The **LLM Call** step involves integrating the previously defined function to generate the final prompt and pass it to a language model for processing.

In [25]:
def llm_call(query, top_k=5, use_rag=True, prompt=None):
    prompt = generate_final_prompt(query, top_k, use_rag, prompt)  # Getting the required prompt as per query
    generate_response = generate_with_single_input(prompt)  # Calling the model with the prompt having the required data from dataset
    generated_message = generate_response['content']
    return generated_message

**Calling the model using the RAG approach. Observe the output returned with RAG.**

In [26]:
query = "Tell me about the US GDP in the past 3 years."
result_with_rag = llm_call(query, use_rag=True, top_k=3)

In [27]:
print(result_with_rag)

Based on the provided news articles from 2024, here's an overview of the US GDP in the past three years:

1. **2022**: Unfortunately, there is no specific information about the US GDP for 2022 in the provided articles. However, I can provide general knowledge that the US economy has been growing steadily over the past few years.
2. **2023**: The exact numbers are not mentioned in the articles, but it's reported that the US will account for 26.3% of the global gross domestic product (GDP) in 2024, which is the highest in almost two decades. This suggests a strong growth trend in the US economy.
3. **2024**: According to the International Monetary Fund (IMF), the US share of world GDP has increased by 2.3 percentage points since 2018.

It's worth noting that while the US GDP and Dow Jones Industrial Average have been rising, there are concerns about whether these metrics accurately reflect American well-being. The articles mention solid growth, big deficits, and a strong dollar, which st

**Calling the model without the RAG approach. Observe that the output is quite generic and is not based on the data we have.**

In [28]:
result_without_rag = llm_call(query, use_rag=False, top_k=3)

print(result_without_rag)

Here's an overview of the US Gross Domestic Product (GDP) for the past three years, based on data from the Bureau of Economic Analysis (BEA):

**2020:**

* The COVID-19 pandemic had a significant impact on the US economy.
* GDP contracted by 3.4% in 2020, marking the first recession since 2009.
* The decline was largely due to lockdowns and social distancing measures that reduced consumer spending and business activity.
* However, government stimulus packages and monetary policy support helped mitigate the downturn.

**2021:**

* After a slow start, the US economy rebounded strongly in 2021.
* GDP grew by 5.7% in 2021, driven by:
	+ A rapid recovery in consumer spending, as lockdowns were lifted and vaccination rates improved.
	+ Strong growth in business investment, particularly in technology and healthcare sectors.
	+ Government stimulus packages continued to support the economy.

**2022:**

* The US economy experienced a slowdown in 2022, due to:
	+ Rising inflation (4.7% annual rat

### Conclusion

Using Retrieval-Augmented Generation (RAG) when querying a Large Language Model (LLM) can improve factual accuracy and contextual relevance compared to querying the model alone. Without RAG, the LLM relies solely on its pre-trained knowledge, which can lead to outdated or hallucinated responses when dealing with domain-specific or recent information. With RAG, relevant documents or chunks are retrieved from an external knowledge base and fed into the model as context, enabling more precise, up-to-date, and source-grounded answers. This makes RAG particularly useful for dynamic domains like news, legal, and enterprise data, whereas relying on an LLM alone limits the model to static knowledge and increases the risk of incomplete or incorrect answers.

There are several more advanced search techniques than the basic approach used here. Cosine similarity over embeddings is simple and widely adopted, but comparing the query against every stored embedding, as the `retrieve` function does, scales poorly to very large datasets, and a single similarity score offers no further semantic refinement. Upcoming posts will show other search methods for the retriever.

### Acknowledgement

This blog draws inspiration from the RAG course by **deeplearning.ai**. However, instead of using the OpenAI API with Together.ai, as demonstrated in the course, I’ve implemented the solution using a Llama 3.1 model served through **Ollama**, an open-source tool for running quantized models locally, even on your CPU. While responses might take a few seconds depending on your input data, this approach eliminates the need for external APIs.

You can check out the original course here: [Retrieval-Augmented Generation (RAG) on Coursera](https://www.coursera.org/learn/retrieval-augmented-generation-rag/).