# Retrieval-Augmented Generation (RAG)

Install the `transformers` and `wikipedia` packages to run this notebook (PyTorch and scikit-learn are also required).

In [40]:
!pip install transformers wikipedia


In [41]:
import torch
import torch.nn.functional as F

## Document ingestion

In [42]:
import wikipedia

def extract_wikipedia_pages(page_titles):
    """
    Extracts Wikipedia pages and stores them in a dictionary.

    Args:
        page_titles: A list of Wikipedia page titles to extract.

    Returns:
        A dictionary containing the text of each Wikipedia page.
    """

    page_data = {}
    for title in page_titles:
        try:
            page = wikipedia.page(title)
            content = page.content.strip()
            content = content.replace("\n", "")
            page_data[page.title] = content
        except wikipedia.exceptions.PageError:
            print(f"Page '{title}' not found.")
        except wikipedia.exceptions.DisambiguationError as e:
            print(f"Disambiguation error for '{title}': {e.options}")

    return page_data

In [43]:
page_titles = [
               "Roger Apéry",
               "Owen Willans Richardson",
               "Otto Sackur",
               "Ludvig Lorenz",
               "Klaus von Klitzing",
               "Henri Victor Regnault",
               "Erwin Madelung",
              ]

# Uncomment the next line to download the pages from Wikipedia
# wikipedia_data = extract_wikipedia_pages(page_titles)

Save the dictionary using `json.dump()`:

In [44]:
import json

# with open('wikipedia_data.json', 'w') as f:
#     json.dump(wikipedia_data, f, indent=4)

Load the dictionary using `json.load()`:

In [45]:
with open('wikipedia_data.json', 'r') as f:
    wikipedia_data = json.load(f)
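
The save/load round trip can be checked in isolation. The following is a minimal sketch, assuming a small hypothetical dictionary `sample` and a temporary file in place of `wikipedia_data.json`. By default `json.dump()` escapes non-ASCII characters such as the "é" in "Apéry", and `json.load()` restores them.

```python
import json
import os
import tempfile

# Hypothetical stand-in for wikipedia_data (title -> page text).
sample = {"Roger Apéry": "Roger Apéry was a French mathematician."}

# Write to a temporary directory so the notebook's own JSON file is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w") as f:
        json.dump(sample, f, indent=4)
    with open(path, "r") as f:
        raw = f.read()          # text as stored on disk (ASCII-escaped)
    with open(path, "r") as f:
        restored = json.load(f)  # parsed back into a dictionary

print(restored == sample)
```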

In [46]:
for doc in wikipedia_data:
    print(len(wikipedia_data[doc]))

3107
3455
1683
1873
1762
3431
1487


## Document pre-processing

We load just the tokenizer:

In [47]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model_max_length = tokenizer.model_max_length
model_max_length

8192

In [48]:
encoded_text = tokenizer.encode(["hello", "how are you?"])
tokenizer.decode(encoded_text)

'[CLS]hello[SEP]how are you?[SEP]'

In [49]:
def text_splitting(text, chunk_length = 300, chunk_overlap = 100):
    out = []
    for i in range(0,len(text), chunk_length - chunk_overlap):
        out.append(text[i: i + chunk_length])
    return out
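
To see what the overlap does, the splitter can be run on a short string. This is a minimal sketch with small, illustrative values for `chunk_length` and `chunk_overlap` (the defaults above are 300 and 100); the function is repeated so the block runs on its own.

```python
def text_splitting(text, chunk_length=300, chunk_overlap=100):
    # Same logic as above: windows of chunk_length characters,
    # starting every (chunk_length - chunk_overlap) characters.
    out = []
    for i in range(0, len(text), chunk_length - chunk_overlap):
        out.append(text[i: i + chunk_length])
    return out

text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = text_splitting(text, chunk_length=10, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))

# Consecutive full-length chunks share chunk_overlap characters.
print(chunks[0][-4:] == chunks[1][:4])
```

Note that the last chunk here (`'yz'`) is entirely contained in the one before it. The same happens with the defaults whenever the final window is no longer than `chunk_overlap` characters, so a few stored chunks may be redundant.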

In [50]:
wikipedia_data_splits = {}

for doc in wikipedia_data.keys():
    wikipedia_data_splits[doc] = text_splitting(wikipedia_data[doc])

first_key = page_titles[0]
wikipedia_data_splits[first_key][:2]

["Roger Apéry (French: [apeʁi]; 14 November 1916, Rouen – 18 December 1994, Caen) was a French mathematician most remembered for Apéry's theorem, which states that ζ(3) is an irrational number. Here, ζ(s) denotes the Riemann zeta function.== Biography ==Apéry was born in Rouen in 1916 to a French moth",
 's) denotes the Riemann zeta function.== Biography ==Apéry was born in Rouen in 1916 to a French mother and Greek father. His childhood was spent in Lille until 1926, when the family moved to Paris, where he studied at the Lycée Ledru-Rollin and the Lycée Louis-le-Grand.  He was admitted  at the Écol']

In [51]:
min_doc = min(len(wikipedia_data_splits[doc]) for doc in wikipedia_data_splits)
max_doc = max(len(wikipedia_data_splits[doc]) for doc in wikipedia_data_splits)
av_doc = sum(len(wikipedia_data_splits[doc]) for doc in wikipedia_data_splits) / len(wikipedia_data_splits)

min_doc,max_doc,av_doc

(8, 18, 12.571428571428571)
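
These counts follow from the page lengths printed earlier: a window starts every `chunk_length - chunk_overlap` = 200 characters, so a page of `n` characters yields `ceil(n / 200)` chunks. A quick check, using the lengths from the earlier output:

```python
import math

# Page lengths (in characters) printed earlier in the notebook.
lengths = [3107, 3455, 1683, 1873, 1762, 3431, 1487]
stride = 300 - 100  # chunk_length - chunk_overlap

# One chunk per window start position: 0, stride, 2 * stride, ...
counts = [math.ceil(n / stride) for n in lengths]
print(counts)
print(min(counts), max(counts), sum(counts) / len(counts), sum(counts))
```

The minimum, maximum and mean agree with the tuple above, and the total is the number of chunks that end up in the database.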

## Generating embeddings

Now we load the embedder:

In [52]:
from transformers import AutoModel

model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")
model

ModernBertModel(
  (embeddings): ModernBertEmbeddings(
    (tok_embeddings): Embedding(50368, 768, padding_idx=50283)
    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (drop): Dropout(p=0.0, inplace=False)
  )
  (layers): ModuleList(
    (0): ModernBertEncoderLayer(
      (attn_norm): Identity()
      (attn): ModernBertAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=False)
        (rotary_emb): ModernBertRotaryEmbedding()
        (Wo): Linear(in_features=768, out_features=768, bias=False)
        (out_drop): Identity()
      )
      (mlp_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): ModernBertMLP(
        (Wi): Linear(in_features=768, out_features=2304, bias=False)
        (act): GELUActivation()
        (drop): Dropout(p=0.0, inplace=False)
        (Wo): Linear(in_features=1152, out_features=768, bias=False)
      )
    )
    (1-21): 21 x ModernBertEncoderLayer(
      (attn_norm): LayerNorm((768,), eps=1e-05, e

In [53]:
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)

output_dim = outputs.last_hidden_state.size(2)
output_dim

768

In [54]:
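def embed(chunk_list, doc_type="document"):
    # Each chunk is prefixed with "search_document: " or "search_query: ".
    encoded_docs = tokenizer(["search_{}: {}".format(doc_type, chunk) for chunk in chunk_list],
                                 padding = True,
                                 return_tensors="pt")
    output = model(**encoded_docs)
    token_embeddings = output.last_hidden_state # (batch, input_length, output_dim)
    # Zero out padding positions so they do not leak into the pooled embedding
    # when the chunks in a batch have different lengths.
    mask = encoded_docs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    output_embeddings = torch.sum(token_embeddings * mask, 1)
    # After L2 normalization, sum pooling and mean pooling give the same vector.
    output_embeddings = F.normalize(output_embeddings, p=2, dim=1)
    return output_embeddings # (batch, output_dim)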

In [55]:
embed(["hello", "another document", "and another one"]).shape

torch.Size([3, 768])

In [56]:
def populate_database(dic_splits, batch_size = 1):
    n_chunks = sum(len(dic_splits[doc]) for doc in dic_splits)
    vectorial_database = torch.zeros([n_chunks, output_dim], requires_grad = False)
    chunk_list = []
    n = 0 # index of the next free row in the database
    for doc in dic_splits.keys():
        chunk_list_doc = dic_splits[doc]
        print(doc, len(chunk_list_doc))
        for i in range(0, len(chunk_list_doc), batch_size):
            batch = chunk_list_doc[i:i+batch_size]
            chunk_list += batch
            # Gradients are not needed when indexing; skipping them saves memory.
            with torch.no_grad():
                embeddings = embed(batch, doc_type = "document")
            vectorial_database[n:n+len(batch)] = embeddings
            n += len(batch)
    return chunk_list, vectorial_database

# Uncomment this to populate the database
# chunk_list, vectorial_database = populate_database(wikipedia_data_splits)

Save the vector database using `torch.save()` and the chunk list using `json.dump()`:

In [57]:
# torch.save(vectorial_database, 'vectorial_database.pth')

# with open('chunk_list.json', 'w') as f:
#     json.dump(chunk_list, f, indent=4)

Load the vector database using `torch.load()` and the chunk list using `json.load()`:

In [58]:
vectorial_database = torch.load('vectorial_database.pth')
vectorial_database.requires_grad_(False)

with open('chunk_list.json', 'r') as f:
    chunk_list = json.load(f)

In [59]:
len(chunk_list), vectorial_database.shape

(88, torch.Size([88, 768]))

In [60]:
for i, embedding_vector in enumerate(vectorial_database[:20]):
    print(embedding_vector[:5], chunk_list[i][:50])

tensor([ 0.0833,  0.0116,  0.0062, -0.0419,  0.0094]) Roger Apéry (French: [apeʁi]; 14 November 1916, Ro
tensor([ 0.0607,  0.0358, -0.0133, -0.0609, -0.0195]) s) denotes the Riemann zeta function.== Biography 
tensor([ 0.0097, -0.0033, -0.0019,  0.0252,  0.0276]) ere he studied at the Lycée Ledru-Rollin and the L
tensor([-0.0010, -0.0098,  0.0206,  0.0136, -0.0032]) ilized in September 1939, taken prisoner of war in
tensor([-0.0066,  0.0334,  0.0181, -0.0138, -0.0306]) irection of Paul Dubreil and René Garnier in 1947.
tensor([ 0.0232,  0.0123,  0.0306,  0.0091, -0.0022]) , where he remained until his retirement.In 1979 h
tensor([-0.0312,  0.0126, -0.0032, -0.0386, -0.0295])  the difficulty is that the corresponding problem 
tensor([ 0.0309,  0.0447, -0.0228, -0.0541, -0.0557])  that might apply to other odd powers (Frits Beuke
tensor([ 0.0506,  0.0487, -0.0094, -0.0016, -0.0635]) e 1960s was president of the Calvados Radical Part
tensor([ 0.0381,  0.0404,  0.0332, -0.0079, -0.0116]) i

## Retrieval

In [61]:
def similarity(query_embeddings, doc_embeddings):
    return query_embeddings @ doc_embeddings.T
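
Because `embed` L2-normalizes its outputs, this dot product is the cosine similarity between a query and a chunk. The following is a minimal pure-Python sketch on made-up 2-D vectors; the helper names `normalize` and `dot` are illustrative and not part of the notebook.

```python
import math

def normalize(v):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = normalize([1.0, 1.0])
docs = [normalize(d) for d in ([2.0, 2.0], [1.0, 0.0], [-1.0, 1.0])]

# For unit vectors, the dot product equals the cosine of the angle between them.
scores = [dot(query, d) for d in docs]

# Rank documents by decreasing similarity.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print([round(s, 4) for s in scores], ranking)
```

The first document points in the same direction as the query (score close to 1), the second is at 45 degrees, and the third is orthogonal (score 0); vector length plays no role once the vectors are normalized.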

In [62]:
query_embeddings = embed([
    "What is TSNE?",
    "Who is Laurens van der Maaten?",
], "query")

doc_embeddings = embed([
    "TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
], "document")

similarity(query_embeddings, doc_embeddings)

tensor([[0.6859],
        [0.3260]], grad_fn=<MmBackward0>)

In [63]:
def retrieve(query, 
             vectorial_database = vectorial_database, 
             chunk_list = chunk_list, 
             topk = 5):
    query_embedding = embed([query], "query")
    similarity_scores = similarity(query_embedding, vectorial_database) # (1, n_chunks)
    top = torch.topk(similarity_scores, k = topk)
    scores, indices = top.values[0].tolist(), top.indices[0].tolist()

    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}\nText:\n", chunk_list[idx], "\n")
    # The retrieved chunks are joined into a single context string.
    return "\n".join([chunk_list[idx] for idx in indices])

In [64]:
retrieve("When was Erwin Madelung born?")

Score: 0.7402
Text:
 Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist.He was born in 1881 in Bonn. His father was the surgeon Otto Wilhelm Madelung. He earned a doctorate in 1905 from the University of Göttingen, specializing in crystal structure, and eventually became a professor. It was during this  

Score: 0.5539
Text:
 erlag, Berlin 1922. subsequent editions: 1925, 1936, 1950, 1953, 1957, 1964.== References ==== External links ==Works by or about Erwin Madelung at the Internet ArchiveLiterature by and about Erwin Madelung in the German National Library cataloguePortrait drawing at Frankfurt University 

Score: 0.4500
Text:
 tirement in 1949. He specialized in atomic physics and quantum mechanics, and it was during this time he developed the Madelung equations, an alternative form of the Schrödinger equation.He is also known for the Madelung rule, which states that atomic orbitals are filled in order of increasing       

Score: 0.4431
Text:
 Göttingen, specializ

'Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist.He was born in 1881 in Bonn. His father was the surgeon Otto Wilhelm Madelung. He earned a doctorate in 1905 from the University of Göttingen, specializing in crystal structure, and eventually became a professor. It was during this \nerlag, Berlin 1922. subsequent editions: 1925, 1936, 1950, 1953, 1957, 1964.== References ==== External links ==Works by or about Erwin Madelung at the Internet ArchiveLiterature by and about Erwin Madelung in the German National Library cataloguePortrait drawing at Frankfurt University\ntirement in 1949. He specialized in atomic physics and quantum mechanics, and it was during this time he developed the Madelung equations, an alternative form of the Schrödinger equation.He is also known for the Madelung rule, which states that atomic orbitals are filled in order of increasing      \nGöttingen, specializing in crystal structure, and eventually became a professor. It was during this time h

### Alternative retrieval: SVM

In [65]:
import numpy as np
from sklearn import svm

def retrieve_SVM(query, 
             vectorial_database = vectorial_database, 
             chunk_list = chunk_list, 
             topk = 5):
    query_embedding = embed([query], "query")
    # Row 0 is the query; rows 1..n are the document chunks.
    x = np.concatenate([query_embedding.detach().numpy(), vectorial_database.detach().numpy()])
    y = np.zeros(vectorial_database.size(0) + 1)
    y[0] = 1 # we have a single positive example

    clf = svm.LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-6, C=0.1, dual="auto")
    clf.fit(x, y)
    similarities = clf.decision_function(x)
    # Rank by decreasing score and drop the query itself (row 0), wherever it lands.
    ranked = [k for k in np.argsort(-similarities) if k != 0][:topk]
    for k in ranked:
        print(f"Score: {similarities[k]:.4f}\nText:\n", chunk_list[k-1], "\n")
    return "\n".join([chunk_list[k-1] for k in ranked])

In [66]:
retrieve_SVM("When was Erwin Madelung born?")

Score: 0.0864
Text:
 Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist.He was born in 1881 in Bonn. His father was the surgeon Otto Wilhelm Madelung. He earned a doctorate in 1905 from the University of Göttingen, specializing in crystal structure, and eventually became a professor. It was during this  

Score: -0.1480
Text:
 erlag, Berlin 1922. subsequent editions: 1925, 1936, 1950, 1953, 1957, 1964.== References ==== External links ==Works by or about Erwin Madelung at the Internet ArchiveLiterature by and about Erwin Madelung in the German National Library cataloguePortrait drawing at Frankfurt University 

Score: -0.5072
Text:
 delung in the German National Library cataloguePortrait drawing at Frankfurt University 

Score: -0.5178
Text:
 tirement in 1949. He specialized in atomic physics and quantum mechanics, and it was during this time he developed the Madelung equations, an alternative form of the Schrödinger equation.He is also known for the Madelung rule, whi

'Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist.He was born in 1881 in Bonn. His father was the surgeon Otto Wilhelm Madelung. He earned a doctorate in 1905 from the University of Göttingen, specializing in crystal structure, and eventually became a professor. It was during this \nerlag, Berlin 1922. subsequent editions: 1925, 1936, 1950, 1953, 1957, 1964.== References ==== External links ==Works by or about Erwin Madelung at the Internet ArchiveLiterature by and about Erwin Madelung in the German National Library cataloguePortrait drawing at Frankfurt University\ndelung in the German National Library cataloguePortrait drawing at Frankfurt University\ntirement in 1949. He specialized in atomic physics and quantum mechanics, and it was during this time he developed the Madelung equations, an alternative form of the Schrödinger equation.He is also known for the Madelung rule, which states that atomic orbitals are filled in order of increasing      \nGöttingen, specia

## Full pipeline

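The reader model used here only does extractive question answering: it returns a span copied from the retrieved context rather than generating new text.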

In [67]:
from transformers import AutoModelForQuestionAnswering, pipeline

model_name = "deepset/tinyroberta-squad2"

QA = pipeline('question-answering', model=model_name, tokenizer=model_name)

Device set to use cpu


In [68]:
def query(prompt):
    topk_chunks = retrieve(prompt)
#     topk_chunks = retrieve_SVM(prompt)
    return QA(question=prompt, context=topk_chunks)

In [69]:
query("When was Erwin Madelung born?")

Score: 0.7402
Text:
 Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist.He was born in 1881 in Bonn. His father was the surgeon Otto Wilhelm Madelung. He earned a doctorate in 1905 from the University of Göttingen, specializing in crystal structure, and eventually became a professor. It was during this  

Score: 0.5539
Text:
 erlag, Berlin 1922. subsequent editions: 1925, 1936, 1950, 1953, 1957, 1964.== References ==== External links ==Works by or about Erwin Madelung at the Internet ArchiveLiterature by and about Erwin Madelung in the German National Library cataloguePortrait drawing at Frankfurt University 

Score: 0.4500
Text:
 tirement in 1949. He specialized in atomic physics and quantum mechanics, and it was during this time he developed the Madelung equations, an alternative form of the Schrödinger equation.He is also known for the Madelung rule, which states that atomic orbitals are filled in order of increasing       

Score: 0.4431
Text:
 Göttingen, specializ

{'score': 0.3447449505329132, 'start': 16, 'end': 27, 'answer': '18 May 1881'}
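
Extractive means the answer is a substring of the retrieved context: `start` and `end` are character offsets into it. A small check, using the beginning of the top-ranked chunk and the offsets from the output above:

```python
# Beginning of the context passed to the QA pipeline (the top-ranked chunk).
context = "Erwin Madelung (18 May 1881 – 1 August 1972) was a German physicist."

# Offsets and answer copied from the pipeline output above.
result = {"start": 16, "end": 27, "answer": "18 May 1881"}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```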