# How to build a RAG

## Install - import - env and global variables

In [None]:
# Install the packages used in this notebook
%pip install pandas
%pip install pyarrow
%pip install sentence_transformers
%pip install newspaper3k
%pip install openai
%pip install markdown

In [1]:
# Standard library import
import os
import requests
from datetime import datetime, timedelta
import time
from pathlib import Path
# Package for data transformation and embedding
import pandas as pd
from sentence_transformers import SentenceTransformer
import torch
from newspaper import Article
from openai import AzureOpenAI
import markdown


In [2]:
# Set environment variables
%env NEWS_API_KEY=9c40f9b0d5f0469ca9871366a8ace66b
%env AZURE_OPENAI_ENDPOINT=https://pdd-pilot.openai.azure.com/
%env AZURE_OPENAI_API_KEY=f9ca39007ce541ed9a377100ff0c3e1e

env: NEWS_API_KEY=9c40f9b0d5f0469ca9871366a8ace66b
env: AZURE_OPENAI_ENDPOINT=https://pdd-pilot.openai.azure.com/
env: AZURE_OPENAI_API_KEY=f9ca39007ce541ed9a377100ff0c3e1e


In [3]:
# !!!UPDATE!!!
PATH_DF_SPORTS = Path("/home", "nguyea37", "training", "RAG_WS_15Aug24","df_sports.parquet")
PATH_DF_SPORTS_US_WOMEN_BB = Path("/home", "nguyea37", "training", "RAG_WS_15Aug24","df_sports_us_women_bb.parquet")

## 0. !!DO NOT RUN THIS SECTION!! Getting data

This section fetches sports news from NewsAPI, an API that provides articles from various news sources.
To run it yourself, create an account and get your own API key.

We will scrape data with the following filters:
- Articles containing the keywords `euro 2024` and `uefa`
- Articles published between `2024-07-07` and `2024-07-25`
- Articles in English only (`en`)
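
The filters above can be expressed as request parameters, as in the following minimal sketch. It only builds the parameter dictionary and sends nothing. `build_newsapi_params` is a hypothetical helper that is not part of the notebook, and the `YYYY-MM-DD` date format is an assumption taken from the dates listed above. The API key is left out on purpose; it would be added from the `NEWS_API_KEY` environment variable.

```python
from datetime import date


def build_newsapi_params(keyword: str, start: date, end: date, page: int = 1) -> dict:
    """Build the query parameters for one page of results (hypothetical helper)."""
    if start > end:
        raise ValueError("start must not be later than end")
    return {
        "q": keyword,
        "from": start.isoformat(),  # oldest publication date, assumed YYYY-MM-DD
        "to": end.isoformat(),      # newest publication date, assumed YYYY-MM-DD
        "language": "en",
        "page": page,
    }


# Parameters for the first page of the `euro 2024` keyword
params = build_newsapi_params("euro 2024", date(2024, 7, 7), date(2024, 7, 25))
```

Checking that `start` is not later than `end` catches a swapped date window before any request is made.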

In the latest version of the provided notebook, Mathieu titled this section "0. !!DO NOT RUN THIS SECTION!! Getting data", mainly because he did not want to share his private API key with the rest of Roche ☺️. He saved the result as `df_sports.parquet` instead, so feel free to use that file directly.

If you would rather fetch the data yourself, register at https://newsapi.org and create your own API key. Once you have the key, update the `%env` cell near the top of the notebook to supply the value:
```
%env NEWS_API_KEY=your-secret-api-key-here
```

After that, you might need to adjust `five_days_ago` and `one_month_ago` so that the queries for "uefa" and "euro 2024" return a non-empty set of articles. For example, the following worked for me:
```
five_days_ago = '09-08-2024'
one_month_ago = '02-07-2024'
```

If successful, you should see response 200 like this:
```
<Response [200]>
https://newsapi.org/v2/everything?q=uefa&from=09-08-2024&to=02-07-2024&apiKey=8bd87073bb9b4cbeafb266b73cc53a91&language=en&page=1
```

In [4]:
# Get today's date
today = datetime.today()

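# Date sent as the `from` parameter (hard-coded; originally computed as 5 days ago)
five_days_ago = '09-08-2024'  # (today - timedelta(days=5)).strftime('%d-%m-%Y')
# Date sent as the `to` parameter (hard-coded; originally computed as 23 days ago)
one_month_ago = '02-07-2024'  # (today - timedelta(days=23)).strftime('%d-%m-%Y')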

In [5]:
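# Function to fetch sports articles
def fetch_sports_articles(
        key_word: str,
        nb_pages: int
    )-> list[dict]:
    """Fetch up to `nb_pages` pages of NewsAPI results for `key_word`.

    Relies on the globals `five_days_ago` and `one_month_ago` for the date window.
    """
    l_articles = []
    base_url = 'https://newsapi.org/v2/everything'
    for i in range(1, nb_pages+1):
        params = {
            'q': key_word,
            'from': five_days_ago,
            'to': one_month_ago,
            'apiKey': os.getenv("NEWS_API_KEY"),
            'language': 'en',
            'page': i
        }
        # A timeout keeps the cell from hanging on a stalled connection
        response = requests.get(base_url, params=params, timeout=30)
        # Print the status and the requested URL (note: the URL includes the API key)
        print(response)
        print(response.url)
        # Keep only the articles from successful responses
        if response.status_code == 200:
            data = response.json()
            articles = data.get('articles', [])
            l_articles += articles
    return l_articles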

In [6]:
# Results for each keyword as a list of dictionaries
# uefa = fetch_sports_articles(key_word = "uefa", nb_pages = 5)
# euro2024 = fetch_sports_articles(key_word = "euro 2024", nb_pages = 5)
wnba         = fetch_sports_articles(key_word = "wnba", nb_pages = 5)
olympics2024 = fetch_sports_articles(key_word = "olympics 2024", nb_pages = 5)

<Response [200]>
https://newsapi.org/v2/everything?q=wnba&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=1
<Response [200]>
https://newsapi.org/v2/everything?q=wnba&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=2
<Response [200]>
https://newsapi.org/v2/everything?q=wnba&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=3
<Response [200]>
https://newsapi.org/v2/everything?q=wnba&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=4
<Response [200]>
https://newsapi.org/v2/everything?q=wnba&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=5
<Response [200]>
https://newsapi.org/v2/everything?q=olympics+2024&from=09-08-2024&to=02-07-2024&apiKey=9c40f9b0d5f0469ca9871366a8ace66b&language=en&page=1
<Response [200]>
https://newsapi.org/v2/everything?q=olympics+2024&from=09-08-2024&to=02-07-2024&apiKey=9c40f

In [7]:
# Create a pandas dataframe from the concatenation of the 2 lists
# df = pd.DataFrame(uefa+euro2024)
df = pd.DataFrame(wnba+olympics2024)

In [8]:
df

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': None, 'name': 'Yahoo Entertainment'}",Anna Washenko,"NBA TV rights go to ESPN, NBC and Amazon as TN...",The NBA and WNBA have inked deals for where ga...,https://consent.yahoo.com/v2/collectConsent?se...,,2024-07-24T23:08:11Z,"If you click 'Accept all', we and our partners..."
1,"{'id': None, 'name': 'NPR'}",,Iowa basketball phenom Caitlin Clark plans to ...,"Caitlin Clark, who is on the verge of becoming...",https://www.npr.org/2024/02/29/1235041451/cait...,https://media.npr.org/assets/img/2024/02/29/ap...,2024-02-29T22:12:57Z,Iowa guard Caitlin Clark claps during the seco...
2,"{'id': None, 'name': 'NPR'}",Scott Simon,"Saturday Sports: WNBA All-Star game, Kansas Ci...",The WNBA All-Star game takes place Saturday. A...,https://www.npr.org/2024/07/20/nx-s1-5046381/s...,https://media.npr.org/include/images/facebook-...,2024-07-20T11:57:45Z,The WNBA All-Star game takes place Saturday. A...
3,"{'id': None, 'name': 'NPR'}",Gus Contreras,WNBA All-Star game will showcase Team USA play...,It's a big weekend for women's basketball. NPR...,https://www.npr.org/2024/07/19/nx-s1-5044781/w...,https://media.npr.org/include/images/facebook-...,2024-07-19T20:37:18Z,It's a big weekend for women's basketball. NPR...
4,"{'id': None, 'name': 'NPR'}",Becky Sullivan,"The WNBA, capturing excitement around Caitlin ...","With record attendance and viewership, the WNB...",https://www.npr.org/2024/06/11/g-s1-3807/wnba-...,https://npr.brightspotcdn.com/dims3/default/st...,2024-06-11T08:00:00Z,"More than 400,000 people attended WNBA games i..."
...,...,...,...,...,...,...,...,...
995,"{'id': 'time', 'name': 'Time'}",Sean Gregory/Paris and Alice Park/Paris,What It Was Like on the Seine During the Paris...,"In the end, Paris gave us an Olympic opener we...",https://time.com/7004283/paris-olympics-openin...,https://api.time.com/wp-content/uploads/2024/0...,2024-07-27T00:06:20Z,Lady Gaga opened up the artistic portion of th...
996,"{'id': 'bleacher-report', 'name': 'Bleacher Re...",Adam Wells,"Windhorst: LeBron James, Steph Curry Eye Teami...",After rumors earlier in the day that the Golde...,https://bleacherreport.com/articles/10109381-w...,https://media.bleacherreport.com/image/upload/...,2024-02-14T16:58:28Z,Ezra Shaw/Getty Images\r\nAfter rumors earlier...
997,"{'id': None, 'name': 'HYPEBEAST'}",info@hypebeast.com (Hypebeast),Dior Names Olympic Skater Aurélien Giraud Its ...,Champion skateboarder Aurélien Giraud is the n...,https://hypebeast.com/2024/2/dior-olympic-skat...,https://image-cdn.hypb.st/https%3A%2F%2Fhypebe...,2024-02-07T17:52:16Z,Champion skateboarder Aurélien Giraud is the n...
998,"{'id': 'time', 'name': 'Time'}",Katherine Pomerantz,Behind the Photo: How Olympic Photographer Jer...,Brouillet’s photo of Brazilian surfer Gabriel ...,https://time.com/7005239/olympics-surfing-phot...,https://api.time.com/wp-content/uploads/2024/0...,2024-07-30T02:00:00Z,"Every four years, a few lucky photographers co..."


In [9]:
# Utility function for extracting the content of the article using newspaper3k
def _extract_content_from_article(url: str) -> str:
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text
    except Exception:
        # Download or parsing failed: return an empty text so the row yields no chunks
        return ''

In [10]:
# Utility function for chunking
def _chunking(s: str, chunk_size: int):
    return [
        s[i:i+chunk_size] 
        for i in range(0, len(s), chunk_size)
    ]
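
`_chunking` cuts the text at fixed offsets, so a word or sentence can be split between two chunks. A common variant, which this notebook does not use, lets consecutive chunks share a few characters so that text near a boundary appears whole in at least one chunk. A minimal sketch, where `chunk_with_overlap` and its `overlap` parameter are hypothetical names:

```python
def chunk_with_overlap(s: str, chunk_size: int, overlap: int) -> list[str]:
    """Split `s` into chunks of at most `chunk_size` characters.

    Each chunk starts `chunk_size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(s), step):
        chunks.append(s[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(s):
            break
    return chunks


example_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
```

With `overlap=0` the function produces the same chunks as `_chunking`. Overlap increases the number of chunks, and therefore the number of embeddings to compute.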

In [11]:
def clean_and_chunking(df: pd.DataFrame)-> pd.DataFrame:
    # Filter out removed articles (title or description set to "[Removed]").
    # Work on a copy so the assignments below do not raise SettingWithCopyWarning.
    df = df[(df["title"] != "[Removed]") & (df["description"] != "[Removed]")].copy()
    # Scrape the full content of the article using newspaper3k (NewsAPI doesn't provide the full content)
    df["content"] = df["url"].apply(_extract_content_from_article)
    # Chunk the content with `chunk_size=500`
    df["chunked_content"] = (
        df["content"]
        .apply(lambda x: _chunking(s = x, chunk_size = 500))
    )
    # chunked_content contains lists of strings of size <= 500;
    # explode to one row per chunk and drop articles with no content
    df = (
        df
        .explode("chunked_content")
        .dropna(subset=['chunked_content'])
        .reset_index(drop=True)
    )
    return df[["author", "title", "url", "publishedAt", "chunked_content"]]

In [12]:
# Clean the dataframe and chunk the article content
df = clean_and_chunking(df = df)



In [13]:
# Save the dataframe to parquet
# df.to_parquet(PATH_DF_SPORTS)
df.to_parquet(PATH_DF_SPORTS_US_WOMEN_BB)

## 1. Generate the embeddings

This step generates the embeddings from the chunked article content.

Embeddings are vector representations of data, widely used in natural language processing and machine learning. They map high-dimensional data, such as words or images, into a lower-dimensional continuous vector space. The mapping captures semantic relationships, so similar items end up with vectors that are close together.

The column `"chunked_content"` in the DataFrame `df` contains chunks of articles with a size $\leq 500$ characters.

We will use the `sentence_transformers` library to generate an embedding for each chunk. The model selected below, `thenlper/gte-base`, produces a 768-dimensional vector per chunk.

In [14]:
# df = pd.read_parquet(PATH_DF_SPORTS)
df = pd.read_parquet(PATH_DF_SPORTS_US_WOMEN_BB)

In [15]:
df

Unnamed: 0,author,title,url,publishedAt,chunked_content
0,,Iowa basketball phenom Caitlin Clark plans to ...,https://www.npr.org/2024/02/29/1235041451/cait...,2024-02-29T22:12:57Z,Iowa basketball phenom Caitlin Clark plans to ...
1,,Iowa basketball phenom Caitlin Clark plans to ...,https://www.npr.org/2024/02/29/1235041451/cait...,2024-02-29T22:12:57Z,post on the X social media platform. She than...
2,,Iowa basketball phenom Caitlin Clark plans to ...,https://www.npr.org/2024/02/29/1235041451/cait...,2024-02-29T22:12:57Z,"tting the boards.\n\nThe guard, with one more ..."
3,,Iowa basketball phenom Caitlin Clark plans to ...,https://www.npr.org/2024/02/29/1235041451/cait...,2024-02-29T22:12:57Z,"t. Earlier this month, Clark broke Kelsey Plum..."
4,Scott Simon,"Saturday Sports: WNBA All-Star game, Kansas Ci...",https://www.npr.org/2024/07/20/nx-s1-5046381/s...,2024-07-20T11:57:45Z,"Saturday Sports: WNBA All-Star game, Kansas Ci..."
...,...,...,...,...,...
7301,Katherine Pomerantz,Behind the Photo: How Olympic Photographer Jer...,https://time.com/7005239/olympics-surfing-phot...,2024-07-30T02:00:00Z,"ia boats capturing the surfing events, which a..."
7302,Katherine Pomerantz,Behind the Photo: How Olympic Photographer Jer...,https://time.com/7005239/olympics-surfing-phot...,2024-07-30T02:00:00Z,"ing over 20 images per second, so Brouillet is..."
7303,Katherine Pomerantz,Behind the Photo: How Olympic Photographer Jer...,https://time.com/7005239/olympics-surfing-phot...,2024-07-30T02:00:00Z,"knows the waters of Teahupo'o well, having mo..."
7304,Katherine Pomerantz,Behind the Photo: How Olympic Photographer Jer...,https://time.com/7005239/olympics-surfing-phot...,2024-07-30T02:00:00Z,"much more comfortable on the media boat, takin..."


In [16]:
# Selected model from hugging face
model = SentenceTransformer('thenlper/gte-base')

In [17]:
# Create the embeddings for the article chunks
def create_embeddings(df: pd.DataFrame, model: SentenceTransformer):
    # Returns a NumPy array with one row per chunk
    embedding = model.encode(df["chunked_content"].tolist(), show_progress_bar = True)
    return embedding

In [18]:
news_embedding = create_embeddings(df = df, model = model)

Batches:   0%|          | 0/229 [00:00<?, ?it/s]

## 2. Calculate the cosine similarity between the query and the document vectors

1. Calculate the embedding for the `query` (e.g., question).
2. Compute the similarity between the `query` and each `document` using cosine similarity.
3. Identify the indices of the `top_k` similarity scores.
4. Retrieve the corresponding documents using these indices.
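
To make the four steps concrete, here is a minimal sketch in plain Python on toy 2-dimensional vectors. It is not how the notebook computes the scores (the cells below use `model.similarity` and `torch.topk`); `cosine_similarity` and `top_k_indices` are hypothetical helpers, and the vectors are made up for illustration and assumed to be non-zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(query: list[float], documents: list[list[float]], k: int) -> list[int]:
    """Indices of the k documents most similar to the query, best first."""
    scores = [cosine_similarity(query, doc) for doc in documents]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "document" embeddings and a toy "query" embedding
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]

indices = top_k_indices(query, documents, k=2)
retrieved = [documents[i] for i in indices]
```

The query points almost in the same direction as the first document, so that document ranks first, followed by the third.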

In [19]:
# 1. Calculate the embedding for the query
# QUERY = "Give multiple factors for why France lost euro 2024"
QUERY = "What was the main factor for US women basketball team to win the summer olympics 2024?"
query_embedding = model.encode([QUERY], show_progress_bar = True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [20]:
query_embedding

array([[-7.56371673e-03, -3.84708345e-02, -9.25905257e-03,
         1.82052311e-02,  4.38780747e-02,  2.06739604e-02,
         5.22018746e-02,  2.99704429e-02, -3.28086168e-02,
        -5.90499155e-02,  1.82319861e-02, -1.53562734e-02,
        -4.88898419e-02,  3.57510298e-02, -1.74108613e-02,
         6.75951689e-02,  3.89009751e-02,  1.48724942e-02,
         1.91479903e-02,  3.15222591e-02, -9.34780203e-03,
         8.36333632e-03,  2.82293209e-03,  1.87512655e-02,
         1.66486632e-02,  1.25942519e-03,  8.96374322e-03,
        -5.34010492e-03, -7.11187348e-02,  6.10290095e-03,
         3.97683680e-02, -5.01913428e-02, -2.26746183e-02,
         9.37555509e-04,  2.17567682e-02, -2.10640859e-02,
        -4.77420166e-03,  8.94072559e-03, -4.24933183e-04,
        -1.80902854e-02, -5.27624227e-02, -2.63359454e-02,
        -1.33294484e-03,  8.61034356e-03, -6.73992187e-02,
         8.52909964e-03, -1.63018946e-02,  4.62571383e-02,
        -2.09315568e-02, -5.06349839e-02, -4.38528955e-0

In [21]:
# 2. Compute the similarity vector between the `query` and each `document` by applying cosine similarity.
similarity_scores = model.similarity(query_embedding, news_embedding)[0]

In [22]:
# 3. Identify the indices of the `top_k` similarity scores.
scores, indices = torch.topk(similarity_scores, k=5)

In [23]:
indices

tensor([5047, 5549, 4676, 5622, 4788])

In [24]:
# 4. Retrieve the corresponding documents using these indices.
retrieved_context = df.iloc[indices.tolist()]["chunked_content"].to_list()

In [25]:
retrieved_context

["Check out some of the key players headlining the U.S. women's basketball team as it seeks to win its eighth straight gold medal including Dianna Taurasi, Brittney Griner and Breanna Stewart. (1:47)\n\nOpen Extended Reactions\n\nThe 2024 Olympics are almost finished. August 11th will bring final stretch developments in basketball and volleyball, in addition to the closing ceremony. Can the U.S. women's basketball team and U.S. women's volleyball team bring home gold medals in their respective finals?",
 'Basketball is consistently one of the most anticipated sports at every Olympics, and with LeBron James, Steph Curry, Nikola Jokic, Victor Wembanyama, Breanna Stewart, A’ja Wilson and dozens of other top NBA and WNBA stars competing in Paris, the 2024 Games will be no different. Team USA are heavy favorites on both the men’s and women’s sides, but as we’ve seen many times before in the international game, nothing can be taken for granted, and everything will be exciting.\n\nIf you live

## 3. Craft the prompt

A RAG prompt is built from three elements:
1. System prompt to specify the task
2. Query (e.g: user's question)
3. Context (e.g: retrieved information)
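
The three elements can be assembled into a chat-style message list, as in the following minimal sketch. `build_messages` is a hypothetical helper, and the `[USER]` and `[CONTEXT]` labels and the `Article n:` prefixes are illustrative choices rather than a required format.

```python
def build_messages(system_prompt: str, query: str, chunks: list[str]) -> list[dict]:
    """Combine the system prompt, the query and the retrieved chunks (hypothetical helper)."""
    # Label every retrieved chunk so the model can tell the articles apart
    context = "\n\n".join(
        f"Article {i}: {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    user_prompt = f"[USER]\n{query}\n\n[CONTEXT]\n{context}\n"
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    system_prompt="You answer questions based on sports news articles.",
    query="Who won the final?",
    chunks=["Team A won the final.", "Team B lost the final."],
)
```

Building the prompt in a function makes it easy to reuse the same layout for a different query or a different set of retrieved chunks.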

In [26]:
SYSTEM_PROMPT = """
You are a helpful assistant who answers questions based on sports news articles.
"""

In [27]:
USER_PROMPT = f"""
[USER]
{QUERY}

[CONTEXT]
{' Article: '.join(retrieved_context)}
"""

## 4. Generate the answer from GPT-4o

To generate an answer from GPT-4o, we will use the OpenAI client to send a request to Azure OpenAI. This involves setting up the client with the appropriate API keys and endpoint, and then making a request to the GPT-4o model with the desired input. The response from the model will contain the generated answer.
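
Because the client depends on the endpoint and API key being present in the environment, it can help to check them before creating it. A minimal sketch, where `require_env` is a hypothetical helper that is not part of the notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `require_env("AZURE_OPENAI_ENDPOINT")` could stand in for the bare `os.getenv` call, so that a missing variable is reported by name instead of surfacing later as a failed request.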

In [28]:
# Open AI client
client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2023-05-15"
)

In [29]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages= [
        {
            "role": "system",
            "content": SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": USER_PROMPT
        }
    ]
)

In [30]:
answer = response.choices[0].message.content

In [31]:
answer

"The main factor for the U.S. women's basketball team winning the gold medal at the 2024 Summer Olympics was their overall dominance and key performances from star players, particularly A'ja Wilson. Wilson's consistent high-level play, including a significant scoring contribution of 21 points in the final against France, was crucial in securing their eighth consecutive Olympic gold medal. The team's historic winning streak and depth of talent also played a significant role in their success."