In [20]:
from utils import (
    retrieve,
    pprint,
    generate_with_single_input,
    read_dataframe,
    display_widget
)

NEWS_DATA Samples: 870
EMBEDDINGS Shape: (870, 768)


In [2]:
NEWS_DATA = read_dataframe("news_data.csv")

In [3]:
len(NEWS_DATA)

870

In [4]:
pprint(NEWS_DATA[9:11])

[
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "d2c3ff79d4e068911d05416ca061cd51",
    "title": "Ukraine uses longer-range US missiles for first time",
    "description": "Missiles secretly delivered this month have been used to strike Russian targets in Crimea, US media say.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68893196",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  }
]


In [5]:
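def query_news(indices):
    """
    Retrieves elements from NEWS_DATA based on specified indices.

    Args:
        indices (iterable of int): Positions of the documents in NEWS_DATA.

    Returns:
        list: The documents at the given positions, in the same order.
    """
    output = [NEWS_DATA[index] for index in indices]
    return output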

In [6]:
indices = [3, 6, 9]
pprint(query_news(indices))

[
  {
    "guid": "e696224ac208878a5cec8bdc9f97c632",
    "title": "Europe risks dying and faces big decisions - Macron",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68898887",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "4f585bad8f61b715fbafe2f022ab0ae8",
    "title": "Supreme Court divided on whether Trump has immunity",
    "description": "The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68901817",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "

In [7]:
indices = retrieve("Concerts in North America", top_k=2)
print(indices)

[350 329]
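
`retrieve` is imported from `utils`, so its implementation is not shown in this notebook. The following is a minimal sketch of what such a function might do, assuming it ranks documents by cosine similarity between a query embedding and the stored document embeddings (one row per article, as the EMBEDDINGS shape above suggests). The name `retrieve_sketch` and the toy 3-dimensional vectors are hypothetical; a real helper would also have to embed the query text first.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_sketch(query_embedding, embeddings, top_k=5):
    """Return the indices of the top_k most similar rows, best match first."""
    scores = [cosine_similarity(query_embedding, row) for row in embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Toy data (assumption): four "documents" embedded in 3 dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

top_indices = retrieve_sketch(toy_query, toy_embeddings, top_k=2)
print(top_indices)
```

The returned indices can be passed straight to `query_news`, which is how the cells below use the output of `retrieve`.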


In [8]:
print(NEWS_DATA[350]['description'])

As Taylor Swift tops $1bn in tour revenue, musicians playing smaller venues are facing pitiful fees and frequent losses. Should the state step in to save our live music scene?When you see a band playing to thousands of fans in a sun-drenched festival field, signing a record deal with a major label or playing endlessly from the airwaves, it’s easy to conjure an image of success that comes with some serious cash to boot – particularly when Taylor Swift has broken $1bn in revenue for her current Eras tour. But looks can be deceiving. “I don’t blame the public for seeing a band playing to 2,000 people and thinking they’re minted,” says artist manager Dan Potts. “But the reality is quite different.”Post-Covid there has been significant focus on grassroots music venues as they struggle to stay open. There’s been less focus on the actual ability of artists to tour these venues. David Martin, chief executive officer of the Featured Artists Coalition (FAC), says we’re in a “cost-of-touring cris

In [9]:
retrieved_documents = query_news(indices)
pprint(retrieved_documents)

[
  {
    "guid": "927257674585bb6ef669cf2c2f409fa7",
    "title": "\u2018The working class can\u2019t afford it\u2019: the shocking truth about the money bands make on tour",
    "description": "As Taylor Swift tops $1bn in tour revenue, musicians playing smaller venues are facing pitiful fees and frequent losses. Should the state step in to save our live music scene?When you see a band playing to thousands of fans in a sun-drenched festival field, signing a record deal with a major label or playing endlessly from the airwaves, it\u2019s easy to conjure an image of success that comes with some serious cash to boot \u2013 particularly when Taylor Swift has broken $1bn in revenue for her current Eras tour. But looks can be deceiving. \u201cI don\u2019t blame the public for seeing a band playing to 2,000 people and thinking they\u2019re minted,\u201d says artist manager Dan Potts. \u201cBut the reality is quite different.\u201dPost-Covid there has been significant focus on grassroots mus

In [10]:
def get_relevant_data(query: str, top_k: int = 5) -> list[dict]:
    """
    Retrieve and return the top relevant data items based on a given query.
    """
    relevant_indices = retrieve(query, top_k)
    relevant_data = query_news(relevant_indices)
    
    return relevant_data

In [11]:
query = "Greatest storms in the US"
relevant_data = get_relevant_data(query, top_k = 1)
pprint(relevant_data)

[
  {
    "guid": "3ca548fe82c3fcae2c4c0c635d03eb2e",
    "title": "Large tornado seen touching down in Nebraska",
    "description": "Severe and powerful storms have moved across several US states, leaving many experiencing power shortages.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68860070",
    "published_at": "2024-04-26",
    "updated_at": "2024-04-28"
  }
]


In [12]:
def format_relevant_data(relevant_data):
    """
    Formats a list of retrieved documents into a single string for use as context in a RAG prompt.

    Parameters:
    relevant_data (list): A list of documents (dicts) with 'title', 'description', 'published_at' and 'url' keys.

    Returns:
    str: The formatted documents, one block per document, separated by blank lines.
    """
    formatted_documents = []
    
    for document in relevant_data:
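        # The backslash continues the string onto the next line, so that line's
        # leading spaces become part of the output (visible before "Published").
        formatted_document = f"Title: {document['title']}, Description: {document['description']}, \
        Published: {document['published_at']}\nURL: {document['url']}\n"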
        
        formatted_documents.append(formatted_document)

    return "\n".join(formatted_documents)

In [13]:
example_data = NEWS_DATA[4:8]
print(format_relevant_data(example_data))

Title: Prosecutors ask for halt to case against Spain PM's wife, Description: Pedro Sánchez is deciding whether to resign after a case against his wife by an anti-corruption group.,         Published: 2024-04-25
URL: https://www.bbc.co.uk/news/world-europe-68895727

Title: WATCH: Would you pay a tourist fee to enter Venice?, Description: From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee.,         Published: 2024-04-25
URL: https://www.bbc.co.uk/news/world-europe-68898441

Title: Supreme Court divided on whether Trump has immunity, Description: The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.,         Published: 2024-04-25
URL: https://www.bbc.co.uk/news/world-us-canada-68901817

Title: More than 150 killed as heavy rains pound Tanzania, Description: The prime minister warns that El Niño-triggered heavy rains are likely to continue into May.,         Published: 2024-04-25
URL: http
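
The record at index 3 printed earlier appears to have no `description` field, in which case `document['description']` would raise a `KeyError` if that record were retrieved. Below is a minimal sketch of a more tolerant formatter, assuming a missing field should be rendered as a placeholder rather than stop the run. The name `format_relevant_data_safe` and the `"N/A"` placeholder are hypothetical choices, not part of the notebook's `utils`.

```python
def format_relevant_data_safe(relevant_data, missing="N/A"):
    """Format documents for a prompt, tolerating missing fields."""
    formatted_documents = []
    for document in relevant_data:
        # Adjacent string literals are joined without extra whitespace.
        formatted_documents.append(
            f"Title: {document.get('title', missing)}, "
            f"Description: {document.get('description', missing)}, "
            f"Published: {document.get('published_at', missing)}\n"
            f"URL: {document.get('url', missing)}\n"
        )
    return "\n".join(formatted_documents)

# A record shaped like the one at index 3 (no 'description' key).
sample = [
    {
        "title": "Europe risks dying and faces big decisions - Macron",
        "url": "https://www.bbc.co.uk/news/world-europe-68898887",
        "published_at": "2024-04-25",
    }
]

formatted = format_relevant_data_safe(sample)
print(formatted)
```

Using adjacent f-string literals instead of a backslash continuation also avoids the run of spaces that shows up before `Published` in the output above.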

In [14]:
def generate_final_prompt(query, top_k=5, use_rag=True, prompt=None):
    """
    Generates a final prompt based on a user query, optionally incorporating relevant data using retrieval-augmented generation (RAG).

    Args:
        query (str): The user query for which the prompt is to be generated.
        top_k (int, optional): The number of top relevant data pieces to retrieve and incorporate. Default is 5.
        use_rag (bool, optional): A flag indicating whether to use retrieval-augmented generation (RAG)
                                  by including relevant data in the prompt. Default is True.
        prompt (str, optional): A template string for the prompt. It can contain placeholders {query} and {documents}
                                for formatting with the query and formatted relevant data, respectively.

    Returns:
        str: The generated prompt, either consisting solely of the query or expanded with relevant data
             formatted for additional context.
    """
    
    if not use_rag:
        return query

    # Retrieve the top_k relevant data pieces based on the query
    relevant_data = get_relevant_data(query, top_k=top_k)

    # Format the retrieved relevant data
    retrieve_data_formatted = format_relevant_data(relevant_data)

    # If no custom prompt is provided, use the default prompt template
    if prompt is None:
        prompt = (
            f"Answer the user query below. There will be provided additional information for you to compose your answer. "
            f"The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, "
            f"you should not rely only on this information to answer the query, but add it to your overall knowledge.\n"
            f"Query:\n{query}\n"
            f"2024 News:\n{retrieve_data_formatted}"
        )
    else:
        # If a custom prompt is provided, format it with the query and formatted relevant data
        prompt = prompt.format(query=query, documents=retrieve_data_formatted)

    return prompt

In [15]:
print(generate_final_prompt("Tell me about the US GDP in the past 3 years."))

Answer the user query below. There will be provided additional information for you to compose your answer. The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, you should not rely only on this information to answer the query, but add it to your overall knowledge.
Query:
Tell me about the US GDP in the past 3 years.
2024 News:
Title: America's Economy Is No. 1. That Means Trouble, Description: If you want a single number to capture America’s economic stature, here it is: This year, the U.S. will account for 26.3% of the global gross domestic product, the highest in almost two decades. That’s based on the latest projections from the International Monetary Fund. According to the IMF, Europe’s share of world GDP has dropped 1.4 percentage points since 2018, and Japan’s by 2.1 points. The U.S. share, by contrast, is up 2.3 points.,         Published: 2024-04-26
URL: https://www.wsj.com/articles/americas-economy-is-no-1-that-mea
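
The `prompt` argument of `generate_final_prompt` accepts a custom template that is filled with `str.format` using the `{query}` and `{documents}` placeholders. Here is a small sketch of how such a template is filled; the template wording and the stand-in `documents` string are made up for illustration.

```python
# Hypothetical custom template; only the two placeholder names come from the notebook.
custom_template = (
    "Use only the news items below to answer.\n"
    "News:\n{documents}\n"
    "Question: {query}"
)

# Stand-in for the string that format_relevant_data(...) would return.
documents = (
    "Title: Example headline, Description: Example text, Published: 2024-04-26\n"
    "URL: https://example.com\n"
)
query = "Tell me about the US GDP in the past 3 years."

final_prompt = custom_template.format(query=query, documents=documents)
print(final_prompt)
```

In the notebook, the same template would be passed as `generate_final_prompt(query, prompt=custom_template)`. Literal braces inside a template have to be doubled (`{{` and `}}`), otherwise `str.format` treats them as placeholders.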

In [16]:
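def llm_call(query, top_k=5, use_rag=True, prompt=None):
    """
    Calls the LLM to generate a response based on a query, optionally using retrieval-augmented generation.

    Args:
        query (str): The user query.
        top_k (int, optional): The number of documents to retrieve when use_rag is True. Default is 5.
        use_rag (bool, optional): Whether to add retrieved documents to the prompt. Default is True.
        prompt (str, optional): A custom prompt template with {query} and {documents} placeholders.

    Returns:
        str: The content of the response generated by the language model.
    """
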
    # Get the prompt with the query + relevant documents
    prompt = generate_final_prompt(query, top_k, use_rag, prompt)

    # Call the LLM
    generated_response = generate_with_single_input(prompt)

    # Get the content
    generated_message = generated_response['content']
    
    return generated_message

In [17]:
query = "Tell me about the US GDP in the past 3 years."
print(llm_call(query, use_rag=True))

Based on the provided information from 2024, I can give you an overview of the US GDP in the past 3 years. However, please note that the information provided is from 2024 and might not reflect the current situation.

According to the International Monetary Fund (IMF) projections in 2024, the US GDP share of the global gross domestic product has been increasing. In 2024, the US is expected to account for 26.3% of the global GDP, the highest in almost two decades.

As for the actual GDP growth rates, the information provided does not give a clear picture of the past 3 years. However, it mentions that the US economic growth slowed in the first quarter of 2024, with a 1.6% seasonally- and inflation-adjusted annual rate.

To give a more comprehensive answer, I would need more information on the GDP growth rates for the past 3 years. However, based on the available information, here is a rough estimate of the US GDP growth rates for the past 3 years:

- 2022: The US GDP growth rate was aroun

In [18]:
query = "Tell me about the US GDP in the past 3 years."
print(llm_call(query, use_rag=False))

The US GDP (Gross Domestic Product) for the past 3 years (2021-2023) is as follows:

1. **2021**: The US GDP in 2021 was $22.67 trillion. The economy experienced a significant rebound from the COVID-19 pandemic, with a growth rate of 5.7% in 2021. This growth was driven by government stimulus packages, vaccination efforts, and a strong labor market.

2. **2022**: The US GDP in 2022 was $25.43 trillion. The economy continued to grow, but at a slower pace than in 2021, with a growth rate of 2.1% in 2022. This slowdown was due to various factors, including rising inflation, supply chain disruptions, and the impact of the Russian invasion of Ukraine on global energy markets.

3. **2023 (Q1)**: The US GDP in Q1 2023 was $25.73 trillion. The economy experienced a slight contraction in Q1 2023, with a growth rate of -1.6%. This contraction was largely due to a decline in business investment and a decrease in government spending.

Please note that these figures are subject to revision and may 
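
Running the same query with and without retrieval, as the two cells above do, can be wrapped in a small helper. This is a sketch, assuming a function with the same signature as `llm_call`; `compare_rag` and the `fake_llm_call` stub are hypothetical and exist only so the example runs without a model.

```python
def compare_rag(llm_fn, query, top_k=5):
    """Return the responses for one query with and without RAG."""
    return {
        "with_rag": llm_fn(query, top_k=top_k, use_rag=True),
        "without_rag": llm_fn(query, top_k=top_k, use_rag=False),
    }

def fake_llm_call(query, top_k=5, use_rag=True, prompt=None):
    """Stub standing in for llm_call; it only reports which mode was used."""
    mode = f"RAG, top_k={top_k}" if use_rag else "no RAG"
    return f"[{mode}] {query}"

responses = compare_rag(
    fake_llm_call, "Tell me about the US GDP in the past 3 years.", top_k=3
)
for label, text in responses.items():
    print(f"{label}: {text}")
```

In the notebook, `compare_rag(llm_call, query)` would return both answers for side-by-side reading.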

In [19]:
display_widget(llm_call)

[Interactive widget output: a "Query:" text box, an "Augmented prompt layout:" text area, a "Top K:" slider (1 to 20, default 5), and a "Get Responses" button that shows the "With RAG" and "Without RAG" responses side by side.]