# Retrieval-Augmented Generation (RAG)



---
**Credit**: Adapted from this [OpenAI notebook](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)

---

In this notebook we will demonstrate how to use RAG to enable `gpt-3.5-turbo` to answer questions using a set of documents as a reference.

We'll be using a dataset of Wikipedia articles about the 2022 Winter Olympic Games to demonstrate (but the approach is applicable to any set of documents).

Why the 2022 Winter Olympic Games?

Because the training cutoff date for `gpt-3.5-turbo` is **September 2021** ([details](https://platform.openai.com/docs/models/gpt-3-5-turbo#gpt-3-5-turbo)), so it "doesn't know" about the 2022 Winter Olympic Games. Therefore, to get it to correctly answer questions about the 2022 Winter Olympic Games, we need to provide it information "from the future" :). This is where RAG comes in.






## Setup

Let's get started by installing the `openai` Python package and `tiktoken`, a package that tokenizes inputs using BPE in a manner compatible with the OpenAI models.


In [None]:
!pip install --upgrade openai
!pip install tiktoken



Next, let's import our old friends :).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd



---
We will be using **pre-trained contextual embeddings** as well. In class, we discussed how a model like BERT can be used to generate contextual embeddings for any text passage. In this colab, we will use the `text-embedding-ada-002` model ([link](https://openai.com/blog/new-and-improved-embedding-model)) from OpenAI to do the same thing.


---

Finally, let's set the OpenAI API key. You can get yours [here](https://platform.openai.com/account/api-keys), and then enter it under `OPENAI_API_KEY` in your Colab secrets. We will create an OpenAI API client using this key.
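
If you run this notebook outside Colab, `google.colab.userdata` won't be available. Here is a minimal sketch of one alternative, assuming you have exported the key as an environment variable named `OPENAI_API_KEY`. The helper name `load_api_key` is made up for illustration and is not part of any library.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The returned string can then be passed as `api_key=` when creating the client.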





In [None]:
from openai import OpenAI # for calling the OpenAI API

from google.colab import userdata, drive

client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

In [None]:
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

To query the LLM, we will re-use the helper functions we defined in the [How to use the LLM API](https://colab.research.google.com/drive/1BUSmrCy8r11HJk-7H3BYM5kmvmMp92a-?usp=sharing) colab. But since we aren't interested in looking at token probabilities in this colab, I have simplified the function to return just the text response.


In [None]:
def ask_the_LLM(prompt,
                model="gpt-3.5-turbo",
                temperature=0):

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system",
                   "content": "You are a helpful assistant."},
                  {"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content

Before we try anything fancy, let's simply ask `gpt-3.5-turbo` a question about the 2020 Summer Olympics and see how it responds.

Since these Games happened **before** the model's September 2021 training cutoff date, it is likely to have been trained on information about the 2020 Summer Olympic Games. Let's find out.


In [None]:
query = 'Which athlete won the gold medal in the high jump at the 2020 Summer Olympics?'

OK, let's pop the question!

In [None]:
print(ask_the_LLM(query))

Mutaz Essa Barshim of Qatar and Gianmarco Tamberi of Italy both won the gold medal in the men's high jump at the 2020 Summer Olympics. They finished with the same best height and decided to share the gold medal rather than participate in a jump-off.


We can check that this answer is in fact correct [here](https://en.wikipedia.org/wiki/Athletics_at_the_2020_Summer_Olympics_%E2%80%93_Men%27s_high_jump). Impressive. But now let's change the query around and ask something about the **2022** Winter Olympics.

In [None]:
query = 'Which athletes won the gold medal in curling at the 2022 Winter Olympics?'

ask_the_LLM(query)

"The gold medal in curling at the 2022 Winter Olympics was won by the Swedish men's team and the South Korean women's team."

If we fact-check this, it turns out that ....
<br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br>













... Sweden did win the men's gold and the South Korean women's team did participate, but **Great Britain won the women's gold**.


<br>

<br>



`gpt-3.5-turbo` basically made this up 😀. Since this looks plausible but is factually wrong, it is called a "hallucination". But, as we discussed in class, this is just a label people use to describe "plausible but wrong" outputs rather than something deep about the model's inner workings (end of rant).


## Using custom data

To help the model answer questions involving data that it wasn't pretrained on (such as the one above), we can provide relevant custom data **in the prompt itself**. This extra information we provide in the prompt is referred to as **context**.



### Manually enriching the prompt with custom data

We will first show how to do this ***manually*** by finding and adding information (that's relevant to the question) to the prompt.

First, we will use the Wikipedia article for the 2022 Winter Olympics curling event as context.

Second, we will **explicitly tell the model to make use of the provided context**.

There's a deeper lesson here: **telling LLMs explicitly what you want them to do often helps**.

In [None]:
# text copied and pasted from: https://en.wikipedia.org/wiki/Curling_at_the_2022_Winter_Olympics
# Only the portion of the article up until the medalists is included.

wikipedia_article_on_curling = """Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
2022 Winter Olympics
Curling pictogram.svg
Qualification
Statistics
Tournament
Men
Women
Mixed doubles
vte
The curling competitions of the 2022 Winter Olympics were held at the \
Beijing National Aquatics Centre, one of the Olympic Green venues. Curling \
competitions were scheduled for every day of the games, from February 2 to \
February 20.[1] This was the eighth time that curling was part of the Olympic \
program.

In each of the men's, women's, and mixed doubles competitions, 10 nations \
competed. The mixed doubles competition was expanded for its second appearance \
in the Olympics.[2] A total of 120 quota spots (60 per sex) were distributed to \
the sport of curling, an increase of four from the 2018 Winter Olympics.[3] A \
total of 3 events were contested, one for men, one for women, and one mixed.[4]

Qualification
Main article: Curling at the 2022 Winter Olympics – Qualification
Qualification to the Men's and Women's curling tournaments at the Winter \
Olympics was determined through two methods (in addition to the host nation).\
 Nations qualified teams by placing in the top six at the 2021 World Curling \
 Championships. Teams could also qualify through Olympic qualification events \
 which were held in 2021. Six nations qualified via World Championship \
 qualification placement, while three nations qualified through qualification \
 events. In men's and women's play, a host will be selected for the Olympic \
 Qualification Event (OQE). They would be joined by the teams which competed \
 at the 2021 World Championships but did not qualify for the Olympics, and \
 two qualifiers from the Pre-Olympic Qualification Event (Pre-OQE). The \
 Pre-OQE was open to all member associations.[5]

For the mixed doubles competition in 2022, the tournament field was expanded \
from eight competitor nations to ten.[2] The top seven ranked teams at the \
2021 World Mixed Doubles Curling Championship qualified, along with two teams \
from the Olympic Qualification Event (OQE) – Mixed Doubles. This OQE was open \
to a nominated host and the fifteen nations with the highest qualification \
points not already qualified to the Olympics. As the host nation, China \
qualified teams automatically, thus making a total of ten teams per event \
in the curling tournaments.[6]

Summary
Nations	Men	Women	Mixed doubles	Athletes
 Australia			Yes	2
 Canada	Yes	Yes	Yes	12
 China	Yes	Yes	Yes	12
 Czech Republic			Yes	2
 Denmark	Yes	Yes		10
 Great Britain	Yes	Yes	Yes	10
 Italy	Yes		Yes	6
 Japan		Yes		5
 Norway	Yes		Yes	6
 ROC	Yes	Yes		10
 South Korea		Yes		5
 Sweden	Yes	Yes	Yes	11
 Switzerland	Yes	Yes	Yes	12
 United States	Yes	Yes	Yes	11
Total: 14 NOCs	10	10	10	114
Competition schedule

The Beijing National Aquatics Centre served as the venue of the curling \
competitions.
Curling competitions started two days before the Opening Ceremony and finished \
on the last day of the games, meaning the sport was the only one to have had a \
competition every day of the games. The following was the competition schedule \
for the curling competitions:

RR	Round robin	SF	Semifinals	B	3rd place play-off	F	Final
Date
Event
Wed 2	Thu 3	Fri 4	Sat 5	Sun 6	Mon 7	Tue 8	Wed 9	Thu 10	Fri 11	Sat 12	Sun 13	\
Mon 14	Tue 15	Wed 16	Thu 17	Fri 18	Sat 19	Sun 20
Men's tournament								RR	RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Women's tournament									RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Mixed doubles	RR	RR	RR	RR	RR	RR	SF	B	F
Medal summary
Medal table
Rank	Nation	Gold	Silver	Bronze	Total
1	 Great Britain	1	1	0	2
2	 Sweden	1	0	2	3
3	 Italy	1	0	0	1
4	 Japan	0	1	0	1
 Norway	0	1	0	1
6	 Canada	0	0	1	1
Totals (6 entries)	3	3	3	9
Medalists
Event	Gold	Silver	Bronze
Men
details	 Sweden
Niklas Edin
Oskar Eriksson
Rasmus Wranå
Christoffer Sundgren
Daniel Magnusson	 Great Britain
Bruce Mouat
Grant Hardie
Bobby Lammie
Hammy McMillan Jr.
Ross Whyte	 Canada
Brad Gushue
Mark Nichols
Brett Gallant
Geoff Walker
Marc Kennedy
Women
details	 Great Britain
Eve Muirhead
Vicky Wright
Jennifer Dodds
Hailey Duff
Mili Smith	 Japan
Satsuki Fujisawa
Chinami Yoshida
Yumi Suzuki
Yurika Yoshida
Kotomi Ishizaki	 Sweden
Anna Hasselborg
Sara McManus
Agnes Knochenhauer
Sofia Mabergs
Johanna Heldin
Mixed doubles
details	 Italy
Stefania Constantini
Amos Mosaner	 Norway
Kristin Skaslien
Magnus Nedregotten	 Sweden
Almida de Val
Oskar Eriksson
"""

In [None]:
query = f"""Use the below article on the 2022 Winter Olympics to answer the \
subsequent question.

Article:
```
{wikipedia_article_on_curling}
```

Question: Which teams won the gold medal in curling at the 2022 Winter Olympics?"""

print(query)

Use the below article on the 2022 Winter Olympics to answer the subsequent question.

Article:
```
Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
202

Take a moment to notice what the prompt has grown to.


OK, let's run it.

In [None]:
ask_the_LLM(query)

"The teams that won the gold medal in curling at the 2022 Winter Olympics were:\n\n- Men's Curling: Sweden\n- Women's Curling: Great Britain\n- Mixed Doubles Curling: Italy"

Let's pretty-print it, taking advantage of the fact that OpenAI LLMs generate output in [Markdown](https://www.markdownguide.org/) format.

In [None]:
from IPython.display import Markdown, display

display(Markdown(ask_the_LLM(query)))

The teams that won the gold medal in curling at the 2022 Winter Olympics were:

- Men's Curling: Sweden
- Women's Curling: Great Britain
- Mixed Doubles Curling: Italy

Nicely done, `gpt-3.5-turbo`!

---

But maybe it wasn't super hard since the answer was literally in the context we provided.


Let's make it a bit harder.


Oskar Eriksson actually won **two** medals in the curling event. So let's be tricky and ask the LLM if any athlete won multiple medals.


In [None]:
query = f"""Use the below article on the 2022 Winter Olympics to answer \
the subsequent question.

Article:
```
{wikipedia_article_on_curling}
```

Question: Did any athlete win multiple medals in curling at the 2022 Winter \
Olympics?"""

Notice that the question has changed. Everything else is unchanged.
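
Since only the question changes between the two prompts, the template can be wrapped in a small helper. This is just a sketch: `build_prompt` and `FENCE` are names invented here, and the wording of the template is copied from the prompts above.

```python
FENCE = "`" * 3  # the triple-backtick delimiter placed around the article

def build_prompt(article, question):
    """Wrap an article and a question in the same template used above."""
    return (
        "Use the below article on the 2022 Winter Olympics to answer "
        "the subsequent question.\n\n"
        f"Article:\n{FENCE}\n{article}\n{FENCE}\n\n"
        f"Question: {question}"
    )

# Toy article text, just to show the shape of the resulting prompt.
print(build_prompt("Sweden won the men's gold.", "Who won the men's gold?"))
```

With a helper like this, asking a new question about the same article only requires changing the second argument.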

In [None]:
display(Markdown(ask_the_LLM(query)))

Yes, at the 2022 Winter Olympics, Oskar Eriksson from Sweden won multiple medals in curling. He won a gold medal in the mixed doubles event and a silver medal in the men's event.

WHOAH!!!!

👏 👍

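Google cannot do this. In fact, poor Oskar doesn't show up anywhere on the results page summary.

One caveat: the model found the right athlete but mixed up his medals. According to the article we pasted in, Eriksson won **gold** in the men's event (with Sweden) and **bronze** in the mixed doubles event. So even with the right context in the prompt, it pays to check the details.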



---




### RETRIEVAL AUGMENTED GENERATION: *Automatically* enriching the prompt with custom data





**Manually** adding extra information into the prompt helps but obviously doesn't scale. So, we will now show how to **automatically** add the custom relevant data to the prompt.

The first thing to note is that we typically can't just include **all** the custom data in the prompt, for one important reason.

Every model has a limit (called the **context window**) on how many tokens you can send in and get out. For `gpt-3.5-turbo`,  the context window is 16,385 tokens ([link](https://platform.openai.com/docs/models/gpt-3-5-turbo)).

Note that the context window includes both the prompt and the response - **together**, they can't exceed 16,385 tokens. Furthermore, the response cannot exceed 4,096 tokens (even if the input + response are less than 16,385 tokens).

The size of the context window is a key reason we can't include ALL data in the prompt. Another reason is that the LLM can "get confused" if there's too much information in the prompt. Yet another reason is expense. OpenAI charges by the token and these charges can easily add up.

In this colab, we will limit our input to ~4000 tokens to keep things simple and inexpensive.
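
To make the arithmetic concrete, here is a small sketch that works out how much room is left for the prompt once we reserve space for the response. The two constants are the limits quoted above; the function name `max_prompt_tokens` is invented for this example.

```python
CONTEXT_WINDOW = 16_385  # total tokens (prompt + response) for gpt-3.5-turbo, as quoted above
MAX_RESPONSE = 4_096     # cap on response tokens, as quoted above

def max_prompt_tokens(response_tokens,
                      context_window=CONTEXT_WINDOW,
                      max_response=MAX_RESPONSE):
    """Largest prompt that still leaves room for a response of the requested size."""
    if response_tokens > max_response:
        raise ValueError("response budget exceeds the model's response cap")
    return context_window - response_tokens

print(max_prompt_tokens(4_096))  # 12289
print(max_prompt_tokens(500))    # 15885
```

Our self-imposed limit of ~4000 input tokens sits comfortably inside either figure.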

(BTW, GPT-4's context window is way bigger - it ranges up to 128K tokens, depending on the particular GPT-4 model)

If we can't include all the custom data, the logical thing to do is to only include data that's **relevant** to the question.

How can we measure the relevance between a question and a piece of (our custom) data?

Using pretrained contextual embeddings and our old friend, "cosine similarity"!
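
As a quick refresher, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here it is in plain Python on toy 2-dimensional vectors (real embeddings have far more dimensions; these numbers are purely illustrative).

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The closer the value is to 1, the more closely the two vectors point in the same direction, which is what we will treat as "relevant".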



---




This is our overall process.



**RETRIEVAL AUGMENTED GENERATION**

**One-time setup**
* Preprocess the custom dataset by splitting it into 'chunks' (e.g., paragraphs, sections etc.)
* We calculate an embedding vector for each chunk using the `text-embedding-ada-002` model and store it somewhere handy


**Each time we receive a question, we do this:**
* We calculate an embedding vector for the question (again using the same `text-embedding-ada-002` model)
* For each chunk in our custom dataset, we calculate the *cosine similarity* between that chunk's embedding vector and the question's embedding vector (for unit-length vectors, which OpenAI's embeddings are, this is just the dot product)
* We rank the chunks from most-cosine-similar to the question to least-cosine-similar
* Starting from the most-cosine-similar chunk, include as many chunks into the prompt as can fit into the context window
* Send the prompt into `gpt-3.5-turbo`.
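
The "include as many chunks as can fit" step can be sketched as a simple greedy loop. Everything here is an assumption made for illustration: `pack_chunks` and `estimate_tokens` are made-up names, and the 4-characters-per-token estimate is only a crude rule of thumb. In the notebook itself, a real tokenizer such as `tiktoken` would be the better way to count.

```python
def estimate_tokens(text):
    """Very rough token estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)

def pack_chunks(ranked_chunks, token_budget, count_tokens=estimate_tokens):
    """Take chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # the next most relevant chunk no longer fits
        selected.append(chunk)
        used += cost
    return selected, used

ranked = ["a" * 40, "b" * 40, "c" * 40]  # three toy chunks, about 10 tokens each
selected, used = pack_chunks(ranked, token_budget=25)
print(len(selected), used)  # 2 20
```

Stopping at the first chunk that doesn't fit keeps the most relevant chunks together; skipping it and trying smaller chunks further down the ranking would be another reasonable design.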

![RAG](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4063347e-8920-40c6-86b3-c520084b303c_1272x998.jpeg)

Credit for 👆image: https://magazine.sebastianraschka.com/p/finetuning-large-language-models

#### One-time setup

We first need to break up the custom dataset into "chunks".

Approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into headers, so these can be used to define the chunks.
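
To give a feel for header-based chunking, here is a simplified sketch that splits wiki-style text wherever it sees a `==Heading==` line. It is not the preprocessing used to build the dataset we load below; the function name `split_by_headings` and the sample text are invented for illustration.

```python
import re

# Matches wiki-style headings such as "==Medal summary==" or "===Medalists===".
HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1$")

def split_by_headings(title, wikitext):
    """Split wiki-style text into (heading, body) chunks, one per section."""
    chunks = []
    heading, lines = title, []
    for line in wikitext.splitlines():
        match = HEADING.match(line.strip())
        if match:
            body = "\n".join(lines).strip()
            if body:  # skip sections that have no text of their own
                chunks.append((heading, body))
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if body:
        chunks.append((heading, body))
    return chunks

sample = "Intro text.\n==Medal summary==\nTable here.\n===Medalists===\nNames here."
for heading, body in split_by_headings("Curling", sample):
    print(heading, "->", body)
```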

This preprocessing has already been done in [this notebook](https://github.com/openai/openai-cookbook/blob/main/examples/fine-tuned_qa/olympics-1-collect-data.ipynb), so we will simply load the results and use them.

In [None]:
# OpenAI has hosted the processed dataset, so we can download it directly without having to recreate it.
# This dataset has already been split into chunks (apparently one row for each section of the Wikipedia page)
# and a contextual embedding for each chunk has been computed.
# This file is ~200 MB, so may take a minute depending on your connection speed

embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

df = pd.read_csv(embeddings_path)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6059 entries, 0 to 6058
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       6059 non-null   object
 1   embedding  6059 non-null   object
dtypes: object(2)
memory usage: 94.8+ KB


This dataframe has just two columns: `text` and `embedding`.

In [None]:
df.embedding[0]

'[-0.005021067801862955, 0.00026050032465718687, -0.0046091326512396336, 0.016684994101524353, -0.029633380472660065, 0.03277317062020302, -0.016217919066548347, -0.01712612248957157, -0.0022461817134171724, -0.02706446312367916, 0.030411841347813606, 0.01922796480357647, -0.00896526500582695, -0.027972666546702385, -0.02758343517780304, -0.027479641139507294, 0.013921461068093777, 0.00766783207654953, 0.00015092799731064588, -0.01711314730346203, -0.003931223414838314, -0.006850448902696371, 0.010710312984883785, -0.011190364137291908, -0.017658069729804993, -0.0003863919118884951, 0.030333993956446648, -0.010009699501097202, 0.011871516704559326, -0.0029532830230891705, 0.0026273028925061226, -0.008024626411497593, -0.006370398215949535, -0.004768067970871925, -0.00601360434666276, -0.0364319309592247, 0.008011652156710625, -0.031216248869895935, 0.039130594581365585, -0.0006750708562321961, 0.010781671851873398, 0.0347193218767643, 0.000550192897208035, 0.012786206789314747, -0.0187

Embeddings are being stored as strings! Let's convert them to actual Python lists of numbers so that we can do some math operations on them.

In [None]:
import ast  # for converting embeddings saved as strings to Python lists
df['embedding'] = df['embedding'].apply(ast.literal_eval)
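
To see what `ast.literal_eval` does to a single value, here is a tiny standalone example. The three numbers are made up; a real embedding string has 1,536 of them.

```python
import ast

stored = "[-0.005, 0.00026, 0.0167]"  # how a vector looks after a round trip through CSV
parsed = ast.literal_eval(stored)      # safely evaluates the string as a Python literal

print(type(stored).__name__, type(parsed).__name__)  # str list
print(parsed[2])                                      # 0.0167
```

Unlike `eval`, `ast.literal_eval` only accepts literals (numbers, strings, lists and so on), so it won't execute arbitrary code hiding in a data file.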

Let's print out 5 randomly chosen rows.

In [None]:
df.sample(5)

| | text | embedding |
|---|---|---|
| 2280 | Jamie Lee Rattray\n\n==Career stats==\n\n===CW... | [-0.03914495185017586, -0.008968537673354149, ... |
| 2781 | Montenegro at the 2022 Winter Olympics\n\n==Cr... | [0.00504470756277442, -0.00036894812365062535,... |
| 2529 | Curling at the 2022 Winter Olympics\n\n==Resul... | [-0.02013818360865116, -0.0084348414093256, -0... |
| 5944 | Latvia at the 2022 Winter Olympics\n\n==Luge==... | [-0.007558159064501524, 0.0046723769046366215,... |
| 3053 | Ice hockey at the 2022 Winter Olympics – Men's... | [-0.0061609819531440735, -0.022345054894685745... |


The text chunks can be quite long. Let's look at a few.

In [None]:
for e in df.text.sample(5):
  display(Markdown(e))
  print("\n---\n")

Miloš Roman

==Career statistics==

===Regular season and playoffs===

{| border="0" cellpadding="1" cellspacing="0" style="text-align:center; width:60em"
|- bgcolor="#e0e0e0" 
! colspan="3" bgcolor="#ffffff" | 
! rowspan="99" bgcolor="#ffffff" | 
! colspan="5" | [[Regular season]] 
! rowspan="99" bgcolor="#ffffff" | 
! colspan="5" | [[Playoffs]] 
|- bgcolor="#e0e0e0" 
! [[Season (sports)|Season]] 
! Team 
! League 
! GP 
! [[Goal (ice hockey)|G]] 
! [[Assist (ice hockey)|A]] 
! [[Point (ice hockey)|Pts]] 
! [[Penalty (ice hockey)|PIM]] 
! GP
! G 
! A 
! Pts 
! PIM 
|-
| 2013–14
| [[HC Oceláři Třinec]]
| CZE U18
| 19
| 10
| 6
| 16
| 6
| 2
| 0
| 1
| 1
| 0
|- bgcolor="#f0f0f0"
| 2014–15
| HC Oceláři Třinec
| CZE U18
| 36
| 23
| 19
| 42
| 18
| 10
| 1
| 4
| 5
| 0
|-
| 2015–16
| HC Oceláři Třinec
| CZE U18
| 1
| 0
| 0
| 0
| 0
| 1
| 0
| 2
| 2
| 0
|- bgcolor="#f0f0f0"
| 2015–16
| HC Oceláři Třinec
| CZE U20
| 35
| 16
| 21
| 37
| 18
| 9
| 5
| 5
| 10
| 4
|-  
| 2016–17
| HC Oceláři Třinec
| CZE U20
| 2
| 1
| 2
| 3
| 2
| 10
| 6
| 6
| 12
| 4
|- bgcolor="#f0f0f0"
| [[2016–17 Czech Extraliga season|2016–17]]
| HC Oceláři Třinec
| [[Czech Extraliga|ELH]]
| 1
| 0
| 0
| 0
| 0
| —
| —
| —
| —
| —
|-  
| [[2016–17 Czech 1. Liga season|2016–17]]
| [[HC Frýdek–Místek]]
| [[1st Czech Republic Hockey League|Czech.1]]
| 29
| 4
| 2
| 6
| 24
| 10
| 1
| 1
| 2
| 2
|- bgcolor="#f0f0f0"
| [[2017–18 WHL season|2017–18]]
| [[Vancouver Giants]]
| [[Western Hockey League|WHL]]
| 39
| 10
| 22
| 32
| 10
| 7
| 3
| 3
| 6
| 4
|-  
| [[2018–19 WHL season|2018–19]]
| Vancouver Giants
| WHL
| 59
| 27
| 33
| 60
| 21
| 22
| 4
| 8
| 12
| 12
|- bgcolor="#f0f0f0"
| [[2019–20 WHL season|2019–20]]
| Vancouver Giants
| WHL
| 62
| 24
| 23
| 47
| 28
| —
| —
| —
| —
| —
|-  
| [[2020–21 Czech Extraliga season|2020–21]]
| HC Oceláři Třinec
| ELH
| 34
| 7
| 5
| 12
| 10
| 16
| 2
| 1
| 3
| 2
|- bgcolor="#f0f0f0"
| [[2021–22 Czech Extraliga season|2021–22]]
| HC Oceláři Třinec
| ELH
| 46
| 10
| 14
| 24
| 16
| 14
| 2
| 3
| 5
| 2
|- bgcolor="#e0e0e0"
! colspan="3" | ELH totals
! 81
! 17
! 19
! 36
! 26
! 30
! 4
! 4
! 8
! 4
|}


---



Mélodie Daoust

==Playing career==

===CWHL===

She was called up as an emergency fill-in with the [[Montreal Stars]], and scored three points in her CWHL debut on January 8 (versus the [[Burlington Barracudas]]).


---



2021 Canadian Olympic Curling Trials

==Qualification events==

===Pre-Trials Direct-Entry Event===

====Women====

=====Playoffs=====

{{3TeamBracket-PagePlayoff
| RD3-legs = 0

| RD1 = A vs. B
| RD2 = Second place game
| PRO = Qualify for [[2021 Canadian Olympic Curling Pre-Trials|Pre-Trials]]

| RD1-seed1 = A
| RD1-team1 = '''{{flagicon|NT}} [[Kerry Galusha]]'''
| RD1-score1 = '''10'''
| RD1-seed2 = B
| RD1-team2 = {{flagicon|SK}} [[Jessie Hunkin|Team Silvernagle]]
| RD1-score2 = 8

| RD2-seed1 = B
| RD2-team1 = {{flagicon|SK}} [[Jessie Hunkin|Team Silvernagle]]
| RD2-score1 = 6
| RD2-seed2 = C
| RD2-team2 = '''{{flagicon|NS}} [[Jill Brothers]]'''
| RD2-score2 = '''8'''

| RD3-team1 = {{flagicon|NT}} [[Kerry Galusha]]
| RD3-team2 = {{flagicon|NS}} [[Jill Brothers]]
}}


---



Marius Lindvik

{{short description|Norwegian ski jumper}}
{{Use dmy dates|date=March 2022}}
{{Infobox skier
| name              = Marius Lindvik
| image             = Marius Lindvik.jpg
| nationality       = Norway
| birth_date        = {{birth date and age|df=yes|1998|6|27}}
| birth_place       = [[Sørum]], Norway
| height            = 1.75 m
| club              = [[Rælingen SK]]
| personalbest      = {{convert|245.5|m|abbr=on}}<br />[[Letalnica bratov Gorišek|Planica]], [[2021–22 FIS Ski Jumping World Cup|27 March 2022]]
| seasons           = [[2015–16 FIS Ski Jumping World Cup|2016]]–present
| wins              = 8
| teamwins          = 3
| totalpodiums      = 20
| teampodiums       = 9
| individual_starts = 89
| team_starts       = 18
| updated           = 4 March 2023
| medaltemplates    = 
{{MedalCountry|{{NOR}}}}
{{MedalSport|Men's [[ski jumping]]}}
{{MedalCompetition|[[Ski jumping at the Winter Olympics|Olympic Games]]}}
{{MedalGold|[[2022 Winter Olympics|2022 Beijing]]|[[Ski jumping at the 2022 Winter Olympics – Men's large hill individual|LH individual]]}}
{{MedalCompetition|[[FIS Nordic World Ski Championships|World Championships]]}}
{{MedalSilver|[[FIS Nordic World Ski Championships 2023|2023 Planica]]|[[FIS Nordic World Ski Championships 2023 – Men's team large hill|Team LH]]}}
{{MedalSport|Men's [[ski flying]]}}
{{MedalCompetition|[[FIS Ski Flying World Championships|Ski Flying World Championships]]}}
{{MedalGold|[[FIS Ski Flying World Championships 2022|2022 Vikersund]]|[[FIS Ski Flying World Championships 2022 – Individual|Individual]]}}
{{MedalBronze|2022 Vikersund|[[FIS Ski Flying World Championships 2022 – Team|Team]]}}
}}
'''Marius Lindvik''' (born 27 June 1998) is a Norwegian [[Ski jumping|ski jumper]] and Olympic gold medalist.


---



Therese Johaug

==Cross-country skiing results==

===World Cup===

====Individual podiums====

| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|02|18|format=dmy}}
| style="text-align: left;" | {{flagicon|SWE}} [[Åre]], Sweden
| bgcolor="#BOEOE6" | 0.7&nbsp;km Sprint F
| bgcolor="#BOEOE6" | Stage World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 132
| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|02|20|format=dmy}}
| style="text-align: left;" | {{flagicon|NOR}} [[Meråker]], Norway
| bgcolor="#BOEOE6" | 34&nbsp;km Mass Start F
| bgcolor="#BOEOE6" | Stage World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 133
| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|02|23|format=dmy}}
| style="text-align: left;" | {{flagicon|NOR}} [[Trondheim]], Norway
| bgcolor="#BOEOE6" | 15&nbsp;km Pursuit C
| bgcolor="#BOEOE6" | Stage World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 134
| bgcolor="#BOEOE6" style="text-align: right;" | {{sort|000000002020-02-23-0001|15–23 February 2020}}
| style="text-align: left;" | {{flagicon|SWE}}{{flagicon|NOR}} [[FIS Ski Tour 2020]]
| bgcolor="#BOEOE6" | Overall Standings
| bgcolor="#BOEOE6" | World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 135
| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|02|29|format=dmy}}
| style="text-align: left;" | {{flagicon|FIN}} [[Lahti]], Finland
| bgcolor="#BOEOE6" | 10&nbsp;km Individual C
| bgcolor="#BOEOE6" | World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 136
| style="text-align: right;" | {{dts|2020|03|07|format=dmy}}
| style="text-align: left;" | {{flagicon|NOR}} [[Oslo]], Norway
| 30&nbsp;km Mass Start C
| World Cup
| 2nd
|-
| 137
| rowspan="6" | [[2020–21 FIS Cross-Country World Cup|2020–21]]
| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|11|28|format=dmy}}
| style="text-align: left;" | {{flagicon|FIN}} [[Rukatunturi]], Finland
| bgcolor="#BOEOE6" | 10&nbsp;km Individual C
| bgcolor="#BOEOE6" | Stage World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 138
| bgcolor="#BOEOE6" style="text-align: right;" | {{dts|2020|11|29|format=dmy}}
| style="text-align: left;" | {{flagicon|FIN}} [[Rukatunturi]], Finland
| bgcolor="#BOEOE6" | 10&nbsp;km Pursuit F
| bgcolor="#BOEOE6" | Stage World Cup
| bgcolor="#BOEOE6" | '''1st'''
|-
| 139
| bgcolor="#BOEOE6" style="text-align: right;" | {{sort|000000002020-11-29-0001|27–29 November 2020}}
| style="text-align: left;" | {{flagicon|FIN}} [[2020 Nordic Opening|Nordic Opening]]
| bgcolor="#BOEOE6" | Overall Standings


---



Next, we define a function to calculate the embedding using the `text-embedding-ada-002` model, given a piece of text. The API call is simple (see below). [Link](https://openai.com/blog/new-and-improved-embedding-model).

In [None]:
def get_embedding(text, model=EMBEDDING_MODEL):
    result = client.embeddings.create(
      model=model,  # which embedding model we want to use
      input=text,   # the text for which we want to calculate the embedding
    )
    return result.data[0].embedding

Let's try it on "HODL is amazing!!" 😃

In [None]:
e = get_embedding("HODL is amazing!!")

Let's see how long the embedding vector is.

In [None]:
len(e)

1536

In [None]:
f = get_embedding("HODL is incredible!!")

We will define a little function to calculate the cosine similarity. The `scipy.spatial.distance.cosine` function is handy here.

In [None]:
from scipy import spatial  # for calculating cosine similarities for search

def cosine_sim(x, y):
  return 1-spatial.distance.cosine(x, y)

In [None]:
cosine_sim(e, f)

0.9934409664350276

Given a dataframe like `df` with a column of text chunks, we can use the `get_embedding` function to calculate the embeddings for all the text chunks in the column.

In [None]:
def compute_doc_embeddings(df):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps each row's index to the embedding vector of that row's text.
    """
    return {
        idx: get_embedding(r.text) for idx, r in df.iterrows()
    }

To calculate the embeddings from scratch, uncomment the below line and run. Warning - it will take some time!

In [None]:
#document_embeddings = compute_doc_embeddings(df)


But happily for us, since OpenAI has already calculated the embeddings for us and made them available in the embedding column of the dataframe `df` we downloaded, we don't have to.
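
If you ever do need to compute embeddings yourself, one request per row (as in `compute_doc_embeddings`) is the slow way, since the embeddings endpoint also accepts a list of texts in a single request. Here is a sketch of the batching logic only. `embed_batch` is a hypothetical callable standing in for the real API call, and the batch size of 100 is an arbitrary choice rather than a documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` by calling `embed_batch` (a hypothetical callable) once per batch."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))  # one vector per text, in the same order
    return embeddings

# Stand-in for a real embedding call: maps each text to a 1-d "vector" holding its length.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))  # [[1.0], [2.0], [3.0]]
```

With the real API, `embed_batch` would wrap a call like `client.embeddings.create(model=EMBEDDING_MODEL, input=batch)` and collect the `.embedding` of each item in `result.data`.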

So we have a custom dataset split into chunks, and embedding vectors have been calculated for each.

We also have a function that can calculate the embedding for any query.

Next we will use these embeddings to answer our users' questions.



#### Each time we receive a question

* We calculate an embedding vector for the question with the `get_embedding` function we defined above.
* For each chunk in our custom dataset, we calculate the cosine similarity between that chunk's embedding vector and the question's embedding vector
* We rank the chunks from most-cosine-similar to the question to least-cosine-similar
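
Before wiring this up to the real embeddings, here is the ranking idea on toy data. The 2-dimensional vectors and the chunk names are invented. The sketch uses `heapq.nlargest`, which picks out the top `n` items without sorting the whole list; that is a swapped-in alternative, not what the helper below does.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n_chunks(query_vec, chunk_vecs, n=2):
    """Return the n (chunk, similarity) pairs most similar to the query, best first."""
    scored = ((text, cosine(query_vec, vec)) for text, vec in chunk_vecs.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

toy_chunks = {
    "curling medals": [0.9, 0.1],
    "ski jumping": [0.1, 0.9],
    "curling schedule": [0.7, 0.3],
}
for text, sim in top_n_chunks([1.0, 0.0], toy_chunks, n=2):
    print(f"{sim:.3f}  {text}")
```

The two "curling" chunks point in roughly the same direction as the query vector, so they come out on top, with "curling medals" first.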

We first define a couple of helper functions.

In [None]:
# function to retrieve relevant chunks

def chunks_ranked_by_similarity(query, df, n=100):

    """Returns a list of chunks and their cosine-similarity to the query,
    sorted from most similar to least."""

    query_embedding = get_embedding(query)

    chunks_and_similarities = [
        (row["text"], cosine_sim(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    chunks_and_similarities.sort(key=lambda x: x[1], reverse=True)
    chunks, similarities = zip(*chunks_and_similarities)
    return chunks[:n], similarities[:n]

Let's examine this function to see what it pulls up as documents most similar to the query string "curling gold medal"

In [None]:
chunks, similarities = chunks_ranked_by_similarity("curling gold medal", df, n=5)

for chunk, similarity in zip(chunks, similarities):
    print(f"{similarity=:.3f}")
    display(Markdown(chunk))
    print("\n" + 80*'*' + "\n")

similarity=0.879


Curling at the 2022 Winter Olympics

==Medal summary==

===Medal table===

{{Medals table
 | caption        = 
 | host           = 
 | flag_template  = flagIOC
 | event          = 2022 Winter
 | team           = 
 | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1
 | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0
 | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0
 | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2
 | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0
 | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0
}}


********************************************************************************

similarity=0.872


Curling at the 2022 Winter Olympics

==Results summary==

===Women's tournament===

====Playoffs====

=====Gold medal game=====

''Sunday, 20 February, 9:05''
{{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}}
{{Player percentages
| team1 = {{flagIOC|JPN|2022 Winter}}
| [[Yurika Yoshida]] | 97%
| [[Yumi Suzuki]] | 82%
| [[Chinami Yoshida]] | 64%
| [[Satsuki Fujisawa]] | 69%
| teampct1 = 78%
| team2 = {{flagIOC|GBR|2022 Winter}}
| [[Hailey Duff]] | 90%
| [[Jennifer Dodds]] | 89%
| [[Vicky Wright]] | 89%
| [[Eve Muirhead]] | 88%
| teampct2 = 89%
}}


********************************************************************************

similarity=0.869


Curling at the 2022 Winter Olympics

==Results summary==

===Mixed doubles tournament===

====Playoffs====

=====Gold medal game=====

''Tuesday, 8 February, 20:05''
{{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}}
{| class="wikitable"
!colspan=4 width=400|Player percentages
|-
!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|ITA|2022 Winter}}
!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|NOR|2022 Winter}}
|-
| [[Stefania Constantini]] || 83%
| [[Kristin Skaslien]] || 70%
|-
| [[Amos Mosaner]] || 90%
| [[Magnus Nedregotten]] || 69%
|-
| '''Total''' || 87%
| '''Total''' || 69%
|}


********************************************************************************

similarity=0.868


Curling at the 2022 Winter Olympics

==Medal summary==

===Medalists===

{| {{MedalistTable|type=Event|columns=1}}
|-
|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}
|{{flagIOC|SWE|2022 Winter}}<br>[[Niklas Edin]]<br>[[Oskar Eriksson]]<br>[[Rasmus Wranå]]<br>[[Christoffer Sundgren]]<br>[[Daniel Magnusson (curler)|Daniel Magnusson]]
|{{flagIOC|GBR|2022 Winter}}<br>[[Bruce Mouat]]<br>[[Grant Hardie]]<br>[[Bobby Lammie]]<br>[[Hammy McMillan Jr.]]<br>[[Ross Whyte]]
|{{flagIOC|CAN|2022 Winter}}<br>[[Brad Gushue]]<br>[[Mark Nichols (curler)|Mark Nichols]]<br>[[Brett Gallant]]<br>[[Geoff Walker (curler)|Geoff Walker]]<br>[[Marc Kennedy]]
|-
|Women<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}}
|{{flagIOC|GBR|2022 Winter}}<br>[[Eve Muirhead]]<br>[[Vicky Wright]]<br>[[Jennifer Dodds]]<br>[[Hailey Duff]]<br>[[Mili Smith]]
|{{flagIOC|JPN|2022 Winter}}<br>[[Satsuki Fujisawa]]<br>[[Chinami Yoshida]]<br>[[Yumi Suzuki]]<br>[[Yurika Yoshida]]<br>[[Kotomi Ishizaki]]
|{{flagIOC|SWE|2022 Winter}}<br>[[Anna Hasselborg]]<br>[[Sara McManus]]<br>[[Agnes Knochenhauer]]<br>[[Sofia Mabergs]]<br>[[Johanna Heldin]]
|-
|Mixed doubles<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Mixed doubles tournament}}
|{{flagIOC|ITA|2022 Winter}}<br>[[Stefania Constantini]]<br>[[Amos Mosaner]]
|{{flagIOC|NOR|2022 Winter}}<br>[[Kristin Skaslien]]<br>[[Magnus Nedregotten]]
|{{flagIOC|SWE|2022 Winter}}<br>[[Almida de Val]]<br>[[Oskar Eriksson]]
|}


********************************************************************************

similarity=0.867


Curling at the 2022 Winter Olympics

==Results summary==

===Men's tournament===

====Playoffs====

=====Gold medal game=====

''Saturday, 19 February, 14:50''
{{#lst:Curling at the 2022 Winter Olympics – Men's tournament|GM}}
{{Player percentages
| team1 = {{flagIOC|GBR|2022 Winter}}
| [[Hammy McMillan Jr.]] | 95%
| [[Bobby Lammie]] | 80%
| [[Grant Hardie]] | 94%
| [[Bruce Mouat]] | 89%
| teampct1 = 90%
| team2 = {{flagIOC|SWE|2022 Winter}}
| [[Christoffer Sundgren]] | 99%
| [[Rasmus Wranå]] | 95%
| [[Oskar Eriksson]] | 93%
| [[Niklas Edin]] | 87%
| teampct2 = 94%
}}


********************************************************************************



We can see that several relevant sections of the Wikipedia page for curling at the 2022 Winter Olympics were retrieved. Cool!
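
The helper above walks the DataFrame row by row, which is fine for a small dataset. For larger ones, the same ranking can be sketched with matrix operations. This is a minimal sketch, not the notebook's own method: `rank_chunks_vectorized` and `toy_df` are hypothetical names, the function takes an already-computed query embedding (in this notebook that would come from `get_embedding(query)`), and it assumes every entry in the `embedding` column is a numeric vector of the same length.

```python
import numpy as np
import pandas as pd


def rank_chunks_vectorized(query_embedding, df, n=100):
    """Rank chunks by cosine similarity to a query embedding using matrix ops.

    Assumes df has a "text" column and an "embedding" column holding
    equal-length numeric vectors (hypothetical layout for this sketch).
    """
    matrix = np.array(df["embedding"].tolist(), dtype=float)  # (num_chunks, dim)
    query = np.asarray(query_embedding, dtype=float)          # (dim,)

    # cosine similarity = dot product divided by the product of the norms
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = (matrix @ query) / norms

    top = np.argsort(-sims)[:n]  # row positions, most similar first
    return df["text"].iloc[top].tolist(), sims[top].tolist()


# toy data: 2-dimensional "embeddings" made up for illustration
toy_df = pd.DataFrame({
    "text": ["curling medals", "ski jumping", "curling rules"],
    "embedding": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
})

chunks, similarities = rank_chunks_vectorized([1.0, 0.0], toy_df, n=2)
print(chunks)
print([round(s, 3) for s in similarities])
```

On the toy data, the chunk whose vector points in the same direction as the query ranks first, and the orthogonal one is dropped by `n=2`.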

#### Starting from the most-cosine-similar section, include as many sections in the prompt as can fit into the context window


Once we've found the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. We write a few helper functions to do just this.

Since we don't want to exceed the context window, we will need to count the tokens in our prompt. We use the `tiktoken` package for this.

In [None]:
import tiktoken  # for counting tokens

def num_tokens(text, model=GPT_MODEL):
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

We build the prompt by starting with a `HEADER` string and then appending the related chunks in descending order of similarity to the query. We stop adding chunks as soon as the next one would push the prompt (header, chunks, and question together) over the token budget, and then append the question at the end.



In [None]:
HEADER = """
Answer the question using the provided context."\n\nContext:\n
"""

In [None]:
def build_prompt(query, df, model, token_budget):
    """Build a rich prompt, with relevant chunks pulled from a dataframe."""

    chunks, similarities = chunks_ranked_by_similarity(query, df)

    question = f"\n\nQuestion: {query}"

    message = HEADER

    for chunk in chunks:
        # label each chunk with the header 'Wikipedia article section:' so it is
        # clear where each new potentially relevant section starts
        next_article = f'\n\nWikipedia article section:\n"""\n{chunk}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break  # the next chunk would push the prompt over the budget
        message += next_article
    return message + question
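
To see how the budget check cuts off the ranked list, here is a toy run of the same greedy idea. It is only a sketch under stated assumptions: `count_words` is a stand-in that counts whitespace-separated words rather than real tokens (the notebook uses `num_tokens` with `tiktoken`), and `pack_chunks` and the sample strings are hypothetical.

```python
def count_words(text):
    """Stand-in token counter: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def pack_chunks(header, chunks, question, budget, count=count_words):
    """Append chunks in ranked order until the next one would exceed the budget."""
    message = header
    used = []
    for chunk in chunks:
        candidate = message + "\n\n" + chunk
        if count(candidate + question) > budget:
            break  # stop at the first chunk that does not fit
        message = candidate
        used.append(chunk)
    return message + question, used


# chunks are assumed to be sorted from most to least similar already
ranked = ["a b c", "d e f g h i j k", "l m"]
prompt, used = pack_chunks("Context:", ranked, "\n\nQuestion: who won?", budget=10)
print(used)
```

Because the loop breaks rather than skips, the short third chunk is never considered even though it would have fit on its own. That keeps the prompt limited to the highest-ranked chunks, at the cost of sometimes leaving part of the budget unused.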

In [None]:
prompt = build_prompt(
    "Which athletes won the gold medal in curling at the 2022 Winter Olympics?",
    df, GPT_MODEL, 3700)

print(prompt)


Answer the question using the provided context."

Context:



Wikipedia article section:
"""
List of 2022 Winter Olympics medal winners

==Curling==

{{main|Curling at the 2022 Winter Olympics}}
{|{{MedalistTable|type=Event|columns=1|width=225|labelwidth=200}}
|-valign="top"
|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}
|{{flagIOC|SWE|2022 Winter}}<br/>[[Niklas Edin]]<br/>[[Oskar Eriksson]]<br/>[[Rasmus Wranå]]<br/>[[Christoffer Sundgren]]<br/>[[Daniel Magnusson (curler)|Daniel Magnusson]]
|{{flagIOC|GBR|2022 Winter}}<br/>[[Bruce Mouat]]<br/>[[Grant Hardie]]<br/>[[Bobby Lammie]]<br/>[[Hammy McMillan Jr.]]<br/>[[Ross Whyte]]
|{{flagIOC|CAN|2022 Winter}}<br/>[[Brad Gushue]]<br/>[[Mark Nichols (curler)|Mark Nichols]]<br/>[[Brett Gallant]]<br/>[[Geoff Walker (curler)|Geoff Walker]]<br/>[[Marc Kennedy]]
|-valign="top"
|Women<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}}
|{{flagIOC|GBR|2022 Winter}}<br/>[[Eve Muirhead]]<br/>[[

We have now obtained the sections that are most relevant to the question and crafted a rich prompt. As a final step, let's pull it all together into a little function.


In [None]:
def ask(query, df=df, model=GPT_MODEL, token_budget=4096 - 500):
    """Build the prompt and send it to the model."""

    prompt = build_prompt(query, df, model=model, token_budget=token_budget)

    return ask_the_LLM(prompt)
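
The default `token_budget=4096 - 500` is not explained in the notebook. One plausible reading, and it is only an assumption, is a 4,096-token context window with about 500 tokens held back for the model's reply. The arithmetic can be sketched as follows; `prompt_token_budget` and its parameter names are hypothetical.

```python
def prompt_token_budget(context_window=4096, reserved_for_answer=500):
    """Tokens left for the prompt after reserving room for the model's reply.

    Both defaults mirror the numbers in `ask`; what they stand for is assumed.
    """
    if not 0 <= reserved_for_answer < context_window:
        raise ValueError("reserved_for_answer must be in [0, context_window)")
    return context_window - reserved_for_answer


budget = prompt_token_budget()
print(budget)  # 3596
```

Reserving more tokens for the reply leaves room for fewer retrieved chunks, so the two numbers trade off against each other.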

#### Send the query into `gpt-3.5-turbo`!


In [None]:
print(ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?'))

The athletes who won the gold medal in curling at the 2022 Winter Olympics are:

- Men's tournament: Team Sweden consisting of Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, and Daniel Magnusson.
- Women's tournament: Team Italy consisting of Stefania Constantini and Amos Mosaner.
- Mixed doubles tournament: Team Great Britain consisting of Bruce Mouat, Grant Hardie, Bobby Lammie, Hammy McMillan Jr., and Ross Whyte.


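Nice! The model answered from the retrieved context, using information it could not have seen during training. The answer is only partly right, though: according to the Medalists table retrieved earlier, the women's gold went to Great Britain (Eve Muirhead's team) and the mixed doubles gold went to Italy (Stefania Constantini and Amos Mosaner), while Bruce Mouat's team took silver in the men's tournament. It is worth checking generated answers against the retrieved chunks.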


## Conclusion
By combining pretrained contextual embeddings with `gpt-3.5-turbo`, we have built a question-answering system based on Retrieval-Augmented Generation that answers natural-language questions using a custom dataset. It also **tries** not to make stuff up and to say "I don't know" when it doesn't know the answer! **But this is not guaranteed.**

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more.

