# Building a RAG from scratch

In this tutorial, we'll see how to build a RAG (retrieval-augmented generation) system, without any RAG libraries, using OpenAI. The RAG will provide a summary of today's news.

RAGs allow LLMs to process external information. To make the case for them, let's first attempt to build a news summarizer without providing any external information. As we'll see shortly, this won't work, so we'll gradually add more code until we have a RAG that can summarize today's news.

## Approach 1: no external information

Let's call OpenAI's API to "retrieve" today's news. We'll do `n=3` to see a few different responses.

In [1]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful news assistant that can answer questions about today's news"},
    {"role": "user", "content": "Give me a short summary of today's news"},
  ],
  seed=42,
  n=3,
)

In [20]:
from IPython import display

def display_response(response):
    for i, choice in enumerate(response.choices, start=1):
        content = ''.join(f"\n> {line}" for line in choice.message.content.splitlines())
        
        display.display(display.Markdown(f"""
### Response {i}

{content}
"""))


display_response(response)



In responses 1 and 3, the model generated something that *sounds like news*, but it's hardly something that happened today. In response 2, the model (correctly) declined to answer the question, since it has no access to external information.

## Approach 2: Pass today's news in the model's prompt

Let's now pass our model information about today's news by using the [GNews](https://github.com/ranahaani/GNews) library.

This library allows us to get data from Google News.

In [10]:
from gnews import GNews

google_news = GNews()
news = google_news.get_top_news()

In [11]:
len(news)

38

In [16]:
news[0]

{'title': 'US readies retaliatory strikes for drone attack by Iran-backed militants - ABC News',
 'description': "US readies retaliatory strikes for drone attack by Iran-backed militants  ABC News'I hold them responsible': Biden says he's made a decision on response to attack in Jordan  USA TODAY‘Frog being boiled’: US troop deaths in Jordan incite Republican Iran hawks  Al Jazeera English2024 Presidential Election News: Live Updates  The New York TimesNetanyahu issues condolences to US troops killed in drone strike 2 days ago  The Times of Israel",
 'published date': 'Tue, 30 Jan 2024 20:45:25 GMT',
 'url': 'https://news.google.com/rss/articles/CBMiamh0dHBzOi8vYWJjbmV3cy5nby5jb20vUG9saXRpY3MvdXMtcmVhZGllcy1yZXRhbGlhdG9yeS1zdHJpa2VzLWRyb25lLWF0dGFjay1pcmFuLWJhY2tlZC9zdG9yeT9pZD0xMDY4MDA1NDPSAW5odHRwczovL2FiY25ld3MuZ28uY29tL2FtcC9Qb2xpdGljcy91cy1yZWFkaWVzLXJldGFsaWF0b3J5LXN0cmlrZXMtZHJvbmUtYXR0YWNrLWlyYW4tYmFja2VkL3N0b3J5P2lkPTEwNjgwMDU0Mw?oc=5&hl=en-US&gl=US&ceid=US:en',
 'publisher': 

Now, let's write some code to generate a prompt that contains today's news so the model can summarize it.

In [17]:
def get_descriptions(news):
    """Get news descriptions from a list of dictionaries, as returned by GNews
    """
    return [article["description"] for article in news]

descriptions = get_descriptions(news)
descriptions_text = '\n\n##\n\n'.join(descriptions)

In [18]:
system_prompt = f"""
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##):

{descriptions_text}
"""

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Give me a short summary of today's news"},
  ],
  seed=42,
  n=3,
)

01/30/2024 04:18:06 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [21]:
display_response(response)


### Response 1


> Today's news covers a wide range of topics. In international news, the US is preparing for retaliatory strikes after a drone attack by Iran-backed militants, while Israeli forces disguised as medical staff carried out a raid in a West Bank hospital. In US politics, there are attempts to impeach Secretary Mayorkas over border policies, and Rep. Cori Bush is under investigation for alleged misuse of security funds. California is bracing for heavy rainstorms, and there have been arrests made in connection with murders in the Mojave Desert. Tech giants Microsoft and Alphabet reported their earnings, and companies like UPS and Starbucks announced job cuts. The entertainment world mourns the loss of Broadway star Chita Rivera, while Taylor Swift finds herself in the midst of controversy. Other news includes updates on Northern Ireland power-sharing, the situation in Gaza, and the Super Bowl.



### Response 2


> Today's news covers a range of topics. 
> 
> In international news, the US is preparing retaliatory strikes for a drone attack by Iran-backed militants, undercover Israeli troops dressed as medical staff kill three militants in a West Bank hospital raid, and France's government announces new measures to calm farmers' protests. 
> 
> Domestically, there are reports of House Republicans pressing ahead on impeaching Secretary Mayorkas over border policies, Rep. Cori Bush being under investigation for alleged misuse of security funds, and California bracing for heavy rainstorms. 
> 
> In tech news, Microsoft's earnings beat expectations, UPS is eliminating about 12,000 jobs, Alphabet (Google's parent company) misses expectations on ad revenue, and Starbucks' earnings disappoint as sales outside the US lag. 
> 
> Other news includes Apple's new Stolen Device Protection vulnerability, the death of Broadway star Chita Rivera, Taylor Swift controversy surrounding explicit AI photos, a screening of a 7-hour film about Hitler, and various sports updates. 
> 
> Lastly, there are reports on cases of syphilis and measles on the rise in the US, and the potential spread of Alzheimer's disease through medical treatments.



### Response 3


> Here are the top news highlights from today:
> 
> 1. The US is preparing retaliatory strikes in response to a drone attack by Iran-backed militants in Jordan.
> 2. E. Jean Carroll is facing backlash for saying she will use settlement money from a case against Trump to buy a new wardrobe.
> 3. Israeli forces reportedly disguised as medical staff killed three militants in a raid on a West Bank hospital.
> 4. House Republicans are pushing for the impeachment of Secretary Mayorkas over border policies.
> 5. Democratic Rep. Cori Bush is under investigation by the Justice Department for alleged misuse of security funds.
> 6. California is bracing for heavy rainstorms and potential flooding due to atmospheric rivers.
> 7. Five individuals have been arrested in connection with the murder of six people in the Mojave Desert over a marijuana dispute.
> 8. An Ohio man has been sentenced to 18 years in prison for attempting to firebomb a church that hosted drag shows.
> 9. Trump can stay on the Illinois 2024 ballot despite a 14th Amendment challenge.
> 10. France has announced new measures to address farmers' protests and blockades around Paris.
> 11. After a two-year hiatus, power-sharing in Northern Ireland is set to resume.
> 12. Progressives are urging Biden to reverse the pause on Palestine relief funding and improve conditions in Gaza.
> 13. Microsoft reports strong earnings, driven by cloud and AI investments.
> 14. UPS is cutting 12,000 jobs as part of cost-saving measures.
> 15. Alphabet (Google's parent company) misses expectations on ad revenue, causing its stock to drop.
> 16. Starbucks reports disappointing earnings, with lagging sales outside the US.
> 17. Apple releases the Vision Pro, its augmented reality headset, which sells out after receiving 200,000 pre-orders.
> 18. Analysts predict a potential drop in iPhone shipments by 15% in 2024, causing Apple's stock to fall.
> 19. Apple introduces a new Stolen Device Protection feature on iOS 17.3, but it has a vulnerability.
> 20. Broadway star Chita Rivera passes away at the age of 91.
> 21. Taylor Swift becomes the target of AI-generated explicit images, sparking controversy and calls to ban "deep fakes."
> 22. A 7-hour epic film about Hitler receives a rare screening.
> 23. The NFL's Senior Bowl showcases top prospects for the upcoming draft.
> 24. Super Bowl LVIII tickets in Las Vegas could have a record-breaking average price of $12,700.
> 25. The NBA trade deadline looms, and teams are evaluating potential deals.
> 26. Bill Belichick faces a lack of interest from teams after parting ways with the New England Patriots.
> 27. The mystery of why insects are attracted to light at night gets a new explanation.
> 28. SpaceX successfully launches a Northrop Grumman cargo ship to the International Space Station.
> 29. Rare images potentially show a live newborn great white shark for the first time.
> 30. Japan's Moon Lander, thought to be lost, wakes up and shares new images of the lunar surface.
> 31. Syphilis cases in the US reach levels not seen since 1950, prompting concerns.
> 32. Measles cases resurge in the US due to vaccine skepticism, according to the CDC.
> 33. Obesity and alcohol consumption are contributing to a rise in colorectal cancer among younger generations.
> 34. Rare cases of possible Alzheimer's transmission have been discovered in recipients of contaminated HGH treatment.


Great! The model is now able to return a summary of today's news. Interestingly, it generated it in three formats: a single paragraph, a few short paragraphs, and a list. In a production scenario, we'd most likely want to control the format, so adding more specifics to the prompt would help. But let's leave it like this for now.
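As a minimal sketch of what "more specifics" could look like, the helper below appends a formatting instruction to the system prompt before building the `messages` list. The `build_messages` name and the wording of `format_hint` are illustrative assumptions, not part of the notebook, and the model may not always follow the instruction.

```python
def build_messages(system_prompt, user_query, format_hint=None):
    """Build the `messages` list for a chat completion request.

    `format_hint` is a hypothetical extra instruction asking the model
    for a specific output format (its wording is an assumption).
    """
    if format_hint:
        # Append the format instruction at the end of the system prompt
        system_prompt = f"{system_prompt.rstrip()}\n\n{format_hint}\n"

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]


messages = build_messages(
    "You are a helpful news assistant that can answer questions about today's news.",
    "Give me a short summary of today's news",
    format_hint="Answer with a bulleted list of at most five items.",
)
print(messages[0]["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, in the same way as in the cells above.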

An important topic when dealing with factual information is [hallucinations](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)), where models output made-up statements. For this exercise, we won't be checking whether the summaries are factually correct, but note that this is a critical concern for any LLM application.

## Interlude 1: Asking topic-specific news

GPT-3.5 is powerful enough to filter information for us. Let's check this by asking the model to summarize sports news.

In [22]:
display_response(client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Give me a short summary of today's sports news"},
  ],
  seed=42,
  n=3,
))

01/30/2024 04:21:59 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> In today's sports news, there are updates on various topics. The Senior Bowl, a showcase for NFL draft prospects, is taking place with top players trying to impress teams. The Super Bowl ticket prices are the most expensive in history, and there are predictions and storylines surrounding the San Francisco 49ers and Kansas City Chiefs. Additionally, the NBA trade deadline is approaching, and teams in the Western Conference are looking to make moves. In the NFL, there is speculation about Bill Belichick's coaching future and what Raheem Morris' return to Atlanta means for the Falcons. Finally, there are interesting findings in the world of insects and light, as well as updates on space missions and the alarming rise of syphilis and measles cases in the US.



### Response 2


> In today's sports news, the trade deadline in the NBA is approaching, with teams looking to make deals and improve their rosters. Bill Belichick, the head coach of the New England Patriots, is reportedly not in high demand for coaching positions in 2024. There is also speculation about the potential retirement of Travis Kelce, a tight end for the Kansas City Chiefs, after the Super Bowl. In other news, the Senior Bowl is underway, showcasing top prospects for the upcoming NFL Draft. Additionally, researchers may have captured the first image of a live newborn great white shark, and Japan's moon lander has come back to life, sharing new images of the lunar surface.



### Response 3


> Today's sports news highlights the upcoming Senior Bowl, where top prospects for the NFL Draft will showcase their skills. The trade deadline in the NBA is also a hot topic, with teams looking to make moves to improve their rosters. In football, there are discussions about the future of Bill Belichick and the return of Raheem Morris to the Atlanta Falcons. In the world of shark science, researchers may have captured the first-ever image of a live newborn great white shark. And in space news, Japan's moon lander has come back to life and is sending back images of the lunar surface.


Great! The model's output focuses on sports news, although a few unrelated items (such as the great white shark and Japan's moon lander) still slip into the responses.

So far, we've passed all the news information in the model's prompt. So why do we need RAGs?

LLMs can process a limited amount of information, defined by the model's maximum number of tokens. For example, if a model has a limit of 100 tokens, the input *and* the response combined cannot exceed 100 tokens. [Tokens](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them) roughly correspond to words, but they're not the same: one token is approximately 3/4 of a word, so 100 tokens is approximately 75 words.

For this example, we're using GPT-3.5, which can process up to 4,096 tokens, or roughly 3,072 words.

Imagine our API returned the top 1,000 news items, each with 20 sentences in its description. That's already 20,000 sentences, far more than the roughly 3,072 words the model can handle! It'd be impossible for GPT-3.5 to process that much data. That's why we need RAGs: they let us process data more efficiently, so that we only pass relevant information to the LLM.
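To make the arithmetic concrete, here is a back-of-the-envelope sketch that estimates token counts from word counts using the 3/4 rule of thumb. The helper names and the 500-token budget reserved for the response are assumptions made for illustration, and the word-based estimate is only an approximation of what a real tokenizer would report.

```python
import math

CONTEXT_LIMIT = 4096  # token limit quoted above for GPT-3.5


def estimate_tokens(text):
    """Estimate tokens from the word count (one token is about 3/4 of a word)."""
    n_words = len(text.split())
    return math.ceil(n_words * 4 / 3)


def fits_in_context(texts, reserved_for_response=500):
    """Check whether the texts, plus a budget for the response, fit in the limit.

    `reserved_for_response` is an arbitrary, assumed budget.
    """
    total = sum(estimate_tokens(text) for text in texts)
    return total + reserved_for_response <= CONTEXT_LIMIT


description = "word " * 75  # a made-up description of 75 words

print(estimate_tokens(description))           # 100
print(fits_in_context([description] * 10))    # True
print(fits_in_context([description] * 1000))  # False
```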

## Interlude 2: Embeddings and vector similarity

RAGs rank data to determine what's the most relevant to answer our question. For example, if we ask a question about *sports news*, we'd expect our RAG to rank news about *basketball* higher than the ones about *politics*.

In simple terms, this happens through a process called embeddings, where our text is converted into a vector. We're essentially converting text into numbers because it's easier to compare numbers! The trick here is that text that talks about the same topic is closer (in the vector space) than non-related text.
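As a toy illustration of "closer in the vector space", the sketch below uses made-up two-dimensional vectors (not real embeddings, which have many more dimensions) and finds the closest one by Euclidean distance. The `toy_embeddings` values and the `closest_to` helper are hypothetical.

```python
import math

# Hypothetical 2-dimensional "embeddings", made up for illustration.
# Real embeddings have hundreds or thousands of dimensions.
toy_embeddings = {
    "basketball": (0.9, 0.1),
    "soccer": (0.8, 0.2),
    "politics": (0.1, 0.9),
}


def closest_to(name, embeddings):
    """Return the other key whose vector is nearest (Euclidean) to `name`."""
    query = embeddings[name]
    others = {key: vec for key, vec in embeddings.items() if key != name}
    return min(others, key=lambda key: math.dist(query, others[key]))


print(closest_to("basketball", toy_embeddings))  # soccer
```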

OpenAI has an [API](https://platform.openai.com/docs/guides/embeddings/embedding-models) to create embeddings. Let's use it to create an embedding:

In [27]:
print(descriptions[10])

France's government announces new measures to calm farmers' protests, as barricades squeeze Paris  The Associated PressFrench Farmers Block Roads Around Paris in Growing Standoff  The New York TimesAngry Farmers Cause Highway Mayhem In France Amid Economic Frustration  HuffPostBelgian farmers block roads to Zeebrugge port as French protests spill over  Reuters


In [29]:
response = client.embeddings.create(
    input=descriptions[10],
    model="text-embedding-3-small"
)

embedding = response.data[0].embedding
print(embedding[:3])

01/30/2024 04:43:43 PM - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


[0.009948834, 0.018456765, -0.003657255]


I printed the first three numbers in the embedding. We won't care much about their values; all we care about is that they allow us to make comparisons!

Remember that for a given query, we want to return the most relevant data to answer our question. Hence, we need a way to compare our embeddings and get the most similar ones. One way to do so is via a [KDTree](https://en.wikipedia.org/wiki/K-d_tree). Given a list of vectors, a `KDTree` allows us to find the ones closest to another vector.
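A side note on what "most similar" means here: a KDTree ranks by Euclidean distance, while embedding similarity is often described in terms of cosine similarity. For vectors of length 1, the two produce the same ranking, because the squared distance between unit vectors `a` and `b` equals `2 - 2 * dot(a, b)`. The sketch below checks this with made-up unit vectors; whether a given embedding model returns unit-length vectors is an assumption to verify in its documentation.

```python
import math


def unit(angle_degrees):
    """Return a 2D vector of length 1 pointing at the given angle."""
    radians = math.radians(angle_degrees)
    return (math.cos(radians), math.sin(radians))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up unit vectors, for illustration only
query = unit(10)
candidates = [unit(0), unit(45), unit(170)]
indexes = range(len(candidates))

# Rank by increasing Euclidean distance and by decreasing cosine similarity
by_distance = sorted(indexes, key=lambda i: math.dist(query, candidates[i]))
by_cosine = sorted(indexes, key=lambda i: -dot(query, candidates[i]))

print(by_distance == by_cosine)  # True
```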

As an example, I'll create a KDTree with four vectors, and then query it to find the most similar ones:

In [33]:
from scipy.spatial import KDTree
import numpy as np

data = np.array([
    [1, 0],
    [0, 1],
    [-1, 0],
    [0, -1],
])

kdtree = KDTree(data)

In [34]:
_, i = kdtree.query([0.999, 0], k=1)
data[i]

array([1, 0])

In [37]:
_, i = kdtree.query([0, 0.9], k=1)
data[i]

array([0, 1])

In [38]:
_, i = kdtree.query([0, -0.7], k=1)
data[i]

array([ 0, -1])

In [39]:
_, i = kdtree.query([-0.8, 0.002], k=1)
data[i]

array([-1,  0])

If you've read about RAGs before, you've probably encountered the term *vector databases*. At their core, they serve the same purpose as a `KDTree`: they allow us to find similar vectors.

Note something critical: since we're comparing similarity, our `KDTree` will **always** return an answer (the `k` most similar vectors), even if the topic has nothing to do with our LLM query. For example, if we ask about sports but we created all our embeddings from politics news, we'll still get an answer.
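To illustrate this behavior, here is a small sketch that uses a plain brute-force search (a stand-in for the `KDTree`, not the notebook's approach) over the same four vectors. It shows that a far-away query still gets a neighbor, and that one possible mitigation is to discard neighbors beyond a distance cutoff. The `nearest` helper and the `max_distance` value are assumptions made for illustration.

```python
import math

data = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def nearest(query, vectors, k=1, max_distance=None):
    """Brute-force k-nearest-neighbor search by Euclidean distance.

    Returns (distance, index) pairs, closest first. If `max_distance` is
    set, neighbors farther away are dropped, so the result may be empty.
    """
    ranked = sorted((math.dist(query, vec), i) for i, vec in enumerate(vectors))
    top = ranked[:k]

    if max_distance is not None:
        top = [(dist, i) for dist, i in top if dist <= max_distance]

    return top


# A nearby query: the closest vector is data[0]
print(nearest((0.999, 0), data))

# A far-away query still gets a neighbor, even though nothing is close
print(nearest((5, 0.1), data))

# With a cutoff, the far-away query returns no results
print(nearest((5, 0.1), data, max_distance=0.5))  # []
```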

## Interlude 3: Caching embeddings

To reduce the number of API calls, we'll create a small embeddings store that will cache embeddings locally in a `.json` file.

In [40]:
from pathlib import Path
import json

class EmbeddingsStore:
    def __init__(self):
        self._path = Path("embeddings.json")

        if not self._path.exists():
            self._data = {}
        else:
            self._data = json.loads(self._path.read_text())

    def get_one(self, text):
        if text in self._data:
            return self._data[text]
        
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small"
        )

        embedding = response.data[0].embedding

        self._data[text] = embedding
        self._path.write_text(json.dumps(self._data))

        return embedding


    def get_many(self, content):
        return [self.get_one(text) for text in content]

    def __len__(self):
        return len(self._data)

In [41]:
store = EmbeddingsStore()
len(store)

0

## Interlude 4: Comparing embeddings

To illustrate the point that embeddings are closer to each other if their corresponding text talks about similar topics, we'll compute the embeddings for a few words:

In [43]:
words = ["Soccer", "Basketball", "Politics", "Movies"]
words_embeddings = store.get_many(words)
kdtree_words = KDTree(np.array(words_embeddings))

Now, let's find the most similar word (from the `words` list) to a few new words:

In [50]:
_, i = kdtree_words.query(store.get_one("Sports"), k=1)
words[i]

'Soccer'

In [52]:
_, i = kdtree_words.query(store.get_one("Films"), k=1)
words[i]

'Movies'

## Interlude 5: Computing embeddings for news descriptions

Now, let's compute the embeddings for all 38 news items returned by `GNews`.

In [56]:
embeddings_news = store.get_many(descriptions)
kdtree_news = KDTree(np.array(embeddings_news))

Let's test our embeddings by checking which news items are the most relevant for a given query.

In [59]:
_, i = kdtree_news.query(store.get_one("Mexico"), k=1)
descriptions[i]

"Can Biden ‘shut down’ the border right now?  The HillIn Biden’s pledge to ‘shut down’ border, a stunning political shift  CNNWhite House demands Speaker Johnson give Biden 'authority and funding' to 'secure the border'  Fox NewsSenate nears bipartisan border deal that Trump calls 'disaster'  ABC NewsTrump, House Republicans plot to kill border deal  Axios"

Interesting! Even though the description doesn't contain the word "Mexico", our embeddings captured the meaning of the news and returned a relevant result.

Let's try with a few more:

In [61]:
_, i = kdtree_news.query(store.get_one("NFL"), k=1)
descriptions[i]

"Giants Reporter Heard ‘Dirty Little Secret’ About Bill Belichick  NESNWhy There Was No Market for Bill Belichick to Coach in 2024  Sports IllustratedWhat Raheem Morris' return to Atlanta means for the Falcons  ESPNBill Belichick's next coaching job? No team wants to give him full control now, but don't count out 2025  CBS SportsNFL Rumors: Bill Belichick 'Believes Having No Job' Is Better Than Wrong Fit as HC  Bleacher Report"

In [64]:
_, i = kdtree_news.query(store.get_one("Tech"), k=1)
descriptions[i]

"Apple Vision Pro review: magic, until it's not  The VergeApple Vision Pro review: This is the future of computing and entertainment  CNBCI've tried Vision Pro and other top XR headsets and here's the one most people should buy  ZDNetApple sells out of Vision Pro headsets after getting 200K pre-orders  New York Post "

## Approach 3: Our first RAG

Now that we've explained all the pieces, we're ready to build our first RAG. This is what will happen:

1. We pre-fetch today's news
2. When a user enters a query, we find the top `k` news by looking at their embeddings
3. We pass the top `k` news along with the query so the LLM can answer the question

In [66]:
def news_rag(user_query, verbose=False):
    """A RAG that returns a summary of today's news based on a query
    """
    # note that we've arbitrarily fixed k=3
    _, indexes = kdtree_news.query(store.get_one(user_query), k=3)

    descriptions_relevant = [descriptions[i] for i in indexes]
    descriptions_text = '\n\n##\n\n'.join(descriptions_relevant)

    system_prompt = f"""
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##), use these to generate your answer,
and disregard news that are not relevant to answer the question:

{descriptions_text}
"""

    if verbose:
        print(f"System prompt: {system_prompt}")

    display_response(client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        seed=42,
        n=2,
    ))

In [67]:
news_rag("Tell me the most recent sports news")

01/30/2024 06:12:45 PM - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
01/30/2024 06:12:49 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> The most recent sports news is about the NBA trade deadline and the Senior Bowl.
> 
> In the NBA, teams are preparing for the trade deadline, which is on February 8th. The Western Conference teams are assessing their needs and exploring ways to make trades to improve their roster. This is a hot topic among fans and analysts, with many players being mentioned as potential trade candidates.
> 
> In the NFL, the Senior Bowl is taking place, featuring top prospects for the upcoming NFL Draft. The Senior Bowl is an important event where players can showcase their skills and impress scouts. There have been live updates on practices, highlighting players who are performing well and potentially boosting their draft stock.
> 
> In other news, there have been rumors and discussions about Bill Belichick, the head coach of the New England Patriots. There have been reports about a "dirty little secret" surrounding him, as well as speculation about his future coaching job prospects. It appears that there may be some hesitation from teams to give him full control, but it is believed that he will continue coaching in the future.



### Response 2


> The most recent sports news includes updates on the NBA trade deadline goals and players likely to be traded, as well as news related to the 2024 Senior Bowl and Bill Belichick's coaching prospects.
> 
> In the NBA, teams in the Western Conference have set their trade deadline goals and are exploring ways to achieve them. Various sources have provided lists of interesting players who could potentially be on the move before the deadline. The Chicago Bulls, Golden State Warriors, and other teams are being closely watched as the trade deadline approaches. 
> 
> For football fans, the 2024 Senior Bowl is taking place, featuring top prospects for the NFL Draft. The event has been generating excitement, and there are updates on practices, risers, and player measurements. One player to keep an eye on is Oregon quarterback Bo Nix. 
> 
> In other news, there have been discussions about New England Patriots head coach Bill Belichick. Reports have emerged about a "dirty little secret" concerning Belichick heard by a Giants reporter. There has also been speculation as to why there was no market for Belichick to coach in 2024. Additionally, there are discussions about the return of Raheem Morris to the Atlanta Falcons coaching staff and the possibility of Belichick finding a new coaching job in 2025.


It's working! We can see that the news is relevant to the user's query (sports). Let's pass `verbose=True` to see what the prompt is:

In [68]:
news_rag("What happened in US politics today?", verbose=True)

01/30/2024 06:13:20 PM - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


System prompt: 
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##), use these to generate your answer,
and disregard news that are not relevant to answer the question:

Can Biden ‘shut down’ the border right now?  The HillIn Biden’s pledge to ‘shut down’ border, a stunning political shift  CNNWhite House demands Speaker Johnson give Biden 'authority and funding' to 'secure the border'  Fox NewsSenate nears bipartisan border deal that Trump calls 'disaster'  ABC NewsTrump, House Republicans plot to kill border deal  Axios

##

House Republicans Press Ahead on Impeaching Mayorkas Over Border Policies  The New York TimesLawmakers attempt to impeach Secretary Mayorkas over border crisis  Fox NewsEvening Report — GOP presses ahead on Mayorkas impeachment  The Hill

##

Trump can stay on Illinois 2024 ballot after 14th Amendment challenge, officials say  ABC NewsView Full Coverage on Google News



01/30/2024 06:13:23 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> In US politics today, there were several significant events. 
> 
> Firstly, there is a debate surrounding President Biden's ability to "shut down" the border. This follows a stunning political shift in Biden's pledge to address border issues. The White House is demanding that Speaker Johnson provide Biden with the authority and funding needed to secure the border. Additionally, the Senate is nearing a bipartisan border deal that has been criticized by former President Trump.
> 
> In other news, House Republicans are pressing ahead with efforts to impeach Secretary Mayorkas over border policies. This comes as lawmakers attempt to address the ongoing border crisis.
> 
> Lastly, in a separate development, officials have stated that former President Trump can remain on the Illinois 2024 ballot despite a 14th Amendment challenge.
> 
> These are the key highlights from today's US politics news.



### Response 2


> Today in US politics, there were several key developments. 
> 
> Firstly, there were discussions surrounding President Biden's stance on the border. Biden pledged to "shut down" the border, representing a significant political shift. The White House demanded that Speaker Johnson grant Biden the necessary authority and funding to secure the border. Meanwhile, the Senate is nearing a bipartisan border deal that has been criticized by former President Trump.
> 
> In addition, House Republicans are pressing forward with attempts to impeach Secretary Mayorkas over the border crisis. This move has brought attention to ongoing concerns surrounding border policies.
> 
> Lastly, there was news regarding former President Trump's potential eligibility to be on the Illinois 2024 ballot. Officials have stated that Trump can remain on the ballot after a challenge based on the 14th Amendment.
> 
> These are the major events that occurred today in US politics.


As we can see, our `KDTree` is correctly retrieving news regarding US politics.

What happens if we ask about a topic that our RAG doesn't know about?

In [70]:
news_rag("What happened in Colombia today?", verbose=True)

System prompt: 
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##), use these to generate your answer,
and disregard news that are not relevant to answer the question:

France's government announces new measures to calm farmers' protests, as barricades squeeze Paris  The Associated PressFrench Farmers Block Roads Around Paris in Growing Standoff  The New York TimesAngry Farmers Cause Highway Mayhem In France Amid Economic Frustration  HuffPostBelgian farmers block roads to Zeebrugge port as French protests spill over  Reuters

##

Can Biden ‘shut down’ the border right now?  The HillIn Biden’s pledge to ‘shut down’ border, a stunning political shift  CNNWhite House demands Speaker Johnson give Biden 'authority and funding' to 'secure the border'  Fox NewsSenate nears bipartisan border deal that Trump calls 'disaster'  ABC NewsTrump, House Republicans plot to kill border deal  Axios

##

Northern Ireland pow

01/30/2024 06:15:52 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> I'm sorry, but I don't have any information on what happened in Colombia today.



### Response 2


> I'm sorry, but I don't have any information about events in Colombia today.


As we discussed earlier, our `KDTree` will *always* return an answer. Even though none of the news we retrieved is about Colombia, the vector similarity search still returned its nearest matches. Luckily, the LLM correctly disregarded them since they don't contain any relevant information.
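One way to avoid passing irrelevant news to the LLM is to drop matches that are too far from the query. The following is a minimal sketch of that idea in plain Python, not the code used in this tutorial: `nearest_within` and the `max_distance` cutoff are hypothetical names, and a sensible cutoff would have to be tuned on real embeddings.

```python
import math


def nearest_within(query, vectors, k=3, max_distance=1.0):
    """Return indexes of up to k vectors closest to query, dropping any
    whose Euclidean distance exceeds max_distance (a hypothetical cutoff)."""
    distances = [(math.dist(query, vector), i) for i, vector in enumerate(vectors)]
    distances.sort()
    return [i for distance, i in distances[:k] if distance <= max_distance]


# toy 2-D "embeddings"; real ones have many more dimensions
vectors = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]

# a query close to the first two vectors keeps both of them
print(nearest_within((0.0, 1.0), vectors, k=2, max_distance=0.5))

# a query far from everything returns no matches at all
print(nearest_within((-1.0, -1.0), vectors, k=2, max_distance=0.5))
```

If the `KDTree` in use is SciPy's, its `query` method accepts a `distance_upper_bound` argument that serves a similar purpose.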

## Approach 4: Our first agent

In our current RAG, we're using the LLM as a text summarizer: we retrieve relevant information and ask the LLM to summarize it. But LLMs are much more powerful than that, and there's no reason to limit ourselves. We can break the problem into multiple pieces and have the LLM solve each of those smaller (and presumably easier) problems.

In today's LLM jargon, this is known as an [agent](https://developer.nvidia.com/blog/introduction-to-llm-agents/):

> [an AI agent] is a system that can use an LLM to reason through a problem, create a plan to solve the problem, and execute the plan with the help of a set of tools. 

Let's build our first agent!

So far, we've only used `GNews.get_news()` to retrieve today's top news. This function returns today's top news across all topics, so it only contains a handful of news items for each topic. We can make our system more powerful by letting it figure out the topic first and then retrieve all the news for that topic. To do so, let's use an LLM as a topic classifier.

In [71]:
# these are the values that the GNews.get_news_by_topic function can take
TOPICS = {"WORLD", "NATION", "BUSINESS", "TECHNOLOGY", "ENTERTAINMENT",
          "SPORTS", "SCIENCE", "HEALTH"}

In [72]:
def topic_classifier(user_query):
    """Given a user query, return a topic that GNews can process
    """
    # sort so the prompt is identical on every run (sets have no fixed order)
    topics_ = ", ".join(sorted(TOPICS))
    system_prompt = f"""
You're a system that determines the topic of a news question.

You must classify a user prompt into one of the following values:

{topics_}
"""

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            # few-shot examples: the expected replies use the "assistant" role
            {"role": "user", "content": "What happened in soccer today?"},
            {"role": "assistant", "content": "SPORTS"},
            {"role": "user", "content": "I want to know biz news"},
            {"role": "assistant", "content": "BUSINESS"},
            {"role": "user", "content": user_query},
        ],
        seed=42,
        n=1,
    )
    return response.choices[0].message.content

Let's test our function:

In [73]:
topic_classifier("what's new in the movie industry")

01/30/2024 06:27:49 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'ENTERTAINMENT'

In [74]:
topic_classifier("How's the stock market going?")

01/30/2024 06:27:52 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'BUSINESS'
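
The classifier returns whatever text the model produces, so nothing guarantees that the reply is exactly one of the values in `TOPICS`. As a hedged sketch (not part of the original notebook), a small helper could clean up the reply before it is used. `normalize_topic` and its `"WORLD"` fallback are assumptions made for illustration.

```python
TOPICS = {"WORLD", "NATION", "BUSINESS", "TECHNOLOGY", "ENTERTAINMENT",
          "SPORTS", "SCIENCE", "HEALTH"}


def normalize_topic(raw, default="WORLD"):
    """Map a raw model reply to a valid topic, or fall back to default."""
    # remove surrounding whitespace, quotes and periods, then upper-case
    cleaned = raw.strip().strip(".'\"").upper()
    if cleaned in TOPICS:
        return cleaned
    # the reply may be a sentence that contains a topic, e.g. "Topic: SPORTS"
    for topic in sorted(TOPICS):
        if topic in cleaned:
            return topic
    # nothing recognizable: use the (assumed) default topic
    return default


print(normalize_topic(" sports.\n"))
print(normalize_topic("The topic is Business"))
print(normalize_topic("cooking"))
```

The substring check is deliberately naive, so a stricter version might reject ambiguous replies instead of guessing.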

In [80]:
def news_agent(user_query, verbose=False):
    """An agent that retrieves news by topic and summarizes them
    """
    # determine the topic based on the query
    topic = topic_classifier(user_query)

    if verbose:
        print(f"Topic: {topic}")

    # get news that correspond to the selected topic
    news = google_news.get_news_by_topic(topic)

    descriptions = get_descriptions(news)

    # compute the embeddings for the retrieved news
    embeddings = store.get_many(descriptions)

    # find the 10 most relevant news given the query
    kdtree = KDTree(np.array(embeddings))
    _, indexes = kdtree.query(store.get_one(user_query), k=10)

    descriptions_relevant = [descriptions[i] for i in indexes]

    descriptions_text = '\n\n##\n\n'.join(descriptions_relevant)

    system_prompt = f"""
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##), use these to generate your answer,
and disregard news that are not relevant to answer the question:

{descriptions_text}
"""


    if verbose:
        print(f"System prompt: {system_prompt}")


    display_response(client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        seed=42,
        n=1,
    ))
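
The function above chains a few smaller steps: classify the query, fetch news for that topic, rank the news against the query, and summarize. As a rough illustration of that decomposition (a sketch under assumptions, not the notebook's code), the same flow can be written with each step passed in as a plain function, which lets us exercise the control flow without calling OpenAI or GNews. `run_agent` and the `fake_*` helpers are hypothetical.

```python
def run_agent(user_query, classify, fetch, rank, summarize):
    """Chain the agent's steps; each argument is a plain function."""
    topic = classify(user_query)
    descriptions = fetch(topic)
    relevant = rank(user_query, descriptions)
    # same "##" separator that the system prompt uses
    context = "\n\n##\n\n".join(relevant)
    return summarize(user_query, context)


# stand-ins so the flow can be tried offline (all hypothetical)
def fake_classify(query):
    return "SPORTS"


def fake_fetch(topic):
    return [f"{topic} story A", f"{topic} story B"]


def fake_rank(query, descriptions):
    return descriptions[:1]


def fake_summarize(query, context):
    return f"Q: {query} | context: {context}"


result = run_agent("Any soccer news?", fake_classify, fake_fetch,
                   fake_rank, fake_summarize)
print(result)
```

The `news_agent` function defined above remains the version used in the rest of this tutorial.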

Let's run our agent! We'll run it with `verbose=True` so we get some extra feedback on what's happening:

In [82]:
news_agent("Tell me the latest news in the film industry", verbose=True)

01/30/2024 06:54:47 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Topic: ENTERTAINMENT
System prompt: 
You are a helpful news assistant that can answer questions about today's news.

Here are the top news from today (separated by ##), use these to generate your answer,
and disregard news that are not relevant to answer the question:

Melinda Wilson, Wife of Brian Wilson and Architect of His Comeback, Dead at 77  Rolling StoneBeach Boys' Brian Wilson's Wife Melinda Dead at 77  PEOPLEBeach Boys frontman Brian Wilson announces death of his wife: 'We are lost'  Fox NewsBrian Wilson's Wife Melinda Ledbetter Dead At 77  HuffPostThe Beach Boys frontman Brian Wilson makes heartbreaking announcement on social media  PennLive

##

Nintendo, Monsters of the Dark Universe, and More Collide at Universal's Epic Theme Park  GizmodoUniversal unveils Epic Universe details, rides, restaurants  Orlando SentinelUniversal's New Epic Universe Theme Park Revealed: Nintendo, Harry Potter, Dark Universe, and More  IGNSuper Nintendo World opens in Florida next year  The Verge

01/30/2024 06:55:05 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> One of the latest news in the film industry is the announcement of Universal's new Epic Universe theme park. It will feature attractions based on popular franchises such as Nintendo, Harry Potter, and the Dark Universe. The park is set to open in Florida next year and has generated much excitement among fans. Additionally, there have been reports of flooding X with graphic Taylor Swift AI images, creating a deepfake debacle. This incident has been deemed preventable and has sparked discussions around the issue of deepfake technology in the film industry.


We can see a few interesting things here!

First, there's an issue with `GNews`: it returns the same news multiple times, but the model is able to ignore the redundant information. Second, unlike the first RAG, this agent has more material to work with because it retrieves news for a specific topic instead of a handful of top stories across all topics. Lastly, we see a hallucination: one of the retrieved news items is about Taylor Swift's AI-generated images and doesn't mention the film industry, yet our model ties it to the film industry in its response.
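
The repeated news could also be removed before building the prompt, which saves tokens. Here is a minimal sketch, assuming the repeats are exact copies up to case and spacing. `deduplicate` is a hypothetical helper, and it won't catch near-duplicates that are worded differently.

```python
def deduplicate(descriptions):
    """Drop repeated descriptions, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for description in descriptions:
        # normalize spacing and case so trivial differences don't matter
        key = " ".join(description.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(description)
    return unique


print(deduplicate(["A  story", "a story", "Other"]))
```

Such a helper could be applied to the output of `get_descriptions` before computing embeddings.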

Let's try one more time:

In [84]:
news_agent("What happened in the US today?")

01/30/2024 06:58:49 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
01/30/2024 06:59:06 PM - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



### Response 1


> Today in the US, several notable events took place. Here are the highlights:
> 
> 1. A man in Ohio was sentenced to 18 years in prison for attempting to burn down a church that was hosting drag story hour events.
> 
> 2. The killer of pro cyclist Mo Wilson was captured with the help of a want ad for a yoga instructor in Costa Rica.
> 
> 3. In the Trump Georgia case, the prosecutor at the center of the allegations reached a temporary divorce agreement, which means she won't be asked about an alleged affair during the trial.
> 
> 4. Charges were filed in the mass murder case linked to illegal marijuana operations in the Mojave Desert in San Bernardino County, California.
> 
> 5. Fulton County, Georgia, experienced a cyberattack, resulting in a government outage that affected phones, court sites, and tax systems. However, the county's district attorney stated that the hack did not impact the Trump election interference case.
> 
> These are just some of the significant events that occurred today in the US.
