# Retrieval Augmented Generation

By Peter Nadel, Digital Humanities Natural Language Processing Specialist


## Background

Retrieval Augmented Generation, or RAG, is a technique for prompting a Large Language Model (LLM) to answer questions from an existing knowledge base. It does not involve any further training of the model; instead, it relies on the model's ability to follow instructions.

## Two-Step Process
Given a user query, RAG decomposes into two main steps. First, we go into the knowledge base, which practitioners sometimes call a *corpus*, and retrieve the sections of text that are relevant to the user query. This process is usually known as information retrieval and has been a common task since the inception of natural language processing (NLP). We will implement an algorithm called 'semantic search' to retrieve these text chunks from our corpus. Second, we pass the retrieved chunks to the LLM in the form of an elaborate prompt, telling the LLM to answer the user query by referring only to the information in that context.




In [None]:
%%capture
# install dependencies (a cell magic such as %%capture must be the first line of the cell)
import torch

if torch.cuda.is_available():
  !CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir -q
else:
  !pip install llama-cpp-python -U

!pip install sentence_transformers --no-deps -q
!pip install streamlit langchain pypdf python-docx pandas numpy tiktoken huggingface-hub -q
!pip install numpy==1.23.5 -q
!mkdir BAAI_bge-m3
!huggingface-cli download BAAI/bge-m3 --local-dir BAAI_bge-m3 --local-dir-use-symlinks False
!chmod -R 755 BAAI_bge-m3

In [None]:
# import dependencies
import streamlit as st
import os
from llama_cpp import Llama
import torch
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
import pypdf
import docx
from io import StringIO, BytesIO
from langchain.text_splitter import TokenTextSplitter
import re
import requests
import io
from IPython.display import display, HTML


## Step 1: Semantic Search

Semantic search uses a second language model to represent each text chunk as a vector in a high-dimensional space; this vector is often called an *embedding*. We will split our corpus into chunks of text so that we can *embed* each one with this model. We then embed the user query in the same way.

We will then have one matrix of shape (number_of_text_chunks, embedding_dimension) and one vector of shape (1, embedding_dimension). Taking the dot product between this vector and the transpose of this matrix gives us a (1, number_of_text_chunks) vector whose values represent the **similarity between the query and each text chunk**.
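To make these shapes concrete, here is a minimal pure-Python sketch with made-up 3-dimensional embeddings (the model used later in this notebook produces 1024 dimensions). The numbers and the helper name `similarity_scores` are illustrative assumptions and not part of the notebook's pipeline.

```python
# Toy illustration: one similarity score per text chunk.
# All values below are invented for the example.

def similarity_scores(chunk_matrix, query_vector):
    """Return the dot product of query_vector with every row of chunk_matrix."""
    return [
        sum(c * q for c, q in zip(row, query_vector))
        for row in chunk_matrix
    ]

# shape (number_of_text_chunks=3, embedding_dimension=3)
chunk_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
# one query embedding with embedding_dimension=3
query_vector = [0.6, 0.8, 0.0]

scores = similarity_scores(chunk_matrix, query_vector)
best = scores.index(max(scores))  # index of the most similar chunk
print(scores, best)
```

The third row points in the same direction as the query, so it receives the highest score. The tensor code below does the same computation for all chunks at once with a single `@`.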

In standard semantic search, we would then return the results back to the user, but in the case of RAG, we will use these text chunks as the context in a prompt.

**Nota Bene**: The term "large language model" was originally coined to describe these embedding models. In fact, they are remarkably similar to their "generative" counterparts. The main difference is that an embedding model is trained to guess randomly masked words from a larger sequence, whereas what is commonly referred to as an LLM is trained to guess the next word in a large sequence. The former is very good for modeling semantic meaning in texts, and the latter is very good for text completion.

### Data prep

In this example, we'll be looking at the first volume of *The Decline and Fall of the Roman Empire* by Edward Gibbon. This is a massive text about the Roman Empire from 200 AD to the Fall of Constantinople in 1453. This selection covers the migration and integration of Germanic tribes into the remnants of the Western Roman Empire around 400 AD.

It is a useful case to explore with RAG as it is very long and thus cannot be given in its totality to an LLM. Instead, we will have to employ RAG so that the LLM answers accurately and quickly.

Here we will prepare our data for RAG.

In [None]:
# the url to the gutenberg project page
display(HTML(
    """<iframe src="https://gutenberg.org/cache/epub/731/pg731.txt"></iframe>"""
))



In [None]:
res = requests.get('https://gutenberg.org/cache/epub/731/pg731.txt') # using requests to get the text
text = res.text
text[10000:10500]

' has cast the decay and ruin of the ancient civilization, the\r\n      formation and birth of the new order of things, will of itself,\r\n      independent of the laborious execution of his immense plan,\r\n      render “The Decline and Fall of the Roman Empire” an\r\n      unapproachable subject to the future historian: 101 in the\r\n      eloquent language of his recent French editor, M. Guizot:—\r\n\r\n      101 (return) [ A considerable portion of this preface has already\r\n      appeared before us public '

### Chunking

Now that we have our text, we need to embed it. Before we can do so, we need to split it into chunks that can be read by the embedding model. This process, known as chunking, can have a profound effect on the output of our RAG. If our chunks are too small or too big, our context will be useless, so it often comes down to experimentation.

We will use an off-the-shelf chunker from `langchain` for this example, but I encourage you to design your own. I've given a default `chunk_size` of 250 tokens and a `chunk_overlap` of 50 tokens; the overlap is how much of the previous chunk is carried over into the current chunk. Play around with these values and see the differences.

In [None]:
text_splitter = TokenTextSplitter(chunk_size=250, chunk_overlap=50) # langchain TokenTextSplitter; other splitters exist: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/
chunks = text_splitter.split_text(text)
len(chunks)

2807
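If you would like to design your own chunker, the sliding-window idea behind `chunk_size` and `chunk_overlap` can be sketched in a few lines. This is a simplified stand-in, not how `TokenTextSplitter` is implemented: it splits on whitespace rather than model tokens, and the name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=250, chunk_overlap=50):
    """Split a list of tokens into windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

# Tiny example with whitespace "tokens" so the overlap is easy to see.
words = "the quick brown fox jumps over the lazy dog again".split()
toy_chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(" ".join(chunk))
```

Each printed line starts with the last word of the previous one, which is the overlap at work. Larger overlaps reduce the chance that a sentence is cut off from its context, at the cost of more chunks to embed.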

### Embedding model: `bge-m3`

Below we begin the embedding process for our corpus. We are using this embedding model: [bge-m3](https://huggingface.co/BAAI/bge-m3). I like this one from experimentation, but there are many others that you can find on HuggingFace. I encourage you to experiment with this choice as well.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # use the gpu if one is available, otherwise the cpu
model = SentenceTransformer('/content/BAAI_bge-m3')

embeddings = model.encode(
    chunks, # text input
    batch_size=64, # batch size
    device=device, # gpu device
    show_progress_bar=True,
    convert_to_tensor=True, # converts to Pytorch tensor
    normalize_embeddings=True # allows us to compare embeddings
)

embeddings.shape # (number_of_text_chunks, embedding_dimension)


torch.Size([2807, 1024])
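The `normalize_embeddings=True` flag matters because, for vectors of length 1, the dot product equals the cosine similarity. A small sketch with invented 2-dimensional vectors shows the difference; the helper names are hypothetical and only the standard library is used.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                                     # depends on vector length
cosine = cosine_similarity(a, b)                        # depends only on direction
normalized_dot = dot(l2_normalize(a), l2_normalize(b))  # matches the cosine
print(raw_dot, cosine, normalized_dot)
```

Without normalization, a long vector could score highly against every query simply because of its magnitude, so the scores would not be comparable across chunks.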

### Information retrieval

Now we are ready to start the information retrieval process. As mentioned above, we will take a question from the user, here `query`, and search for similar text chunks with a matrix multiplication. We will then extract the indices of the most relevant chunks and compile them into a form to be read by our LLM.

In [None]:
retrieval_instruction = "Represent this sentence for searching relevant passages: " # need to prepend this string
query = "What were the major goals of the Antonines?" # user query
query_embedding = model.encode(retrieval_instruction+query, device=device, convert_to_tensor=True, normalize_embeddings=True)
sim_vector = (embeddings.to(device) @ query_embedding.to(device)) # matmul to compare query_embedding to embeddings
sim_vector.shape # similarity scores between each embedding and the query_embedding

torch.Size([2807])

In [None]:
sim_vector.argsort() # indices ordered from least to most similar; they index into `chunks`

tensor([2794, 2798, 2795,  ...,  131,  438,  430], device='cuda:0')

In [None]:
sim_vector.argsort().cpu().numpy() # takes off of gpu and converts to a numpy array

array([2794, 2798, 2795, ...,  131,  438,  430])

In [None]:
sim_vector.argsort().cpu().numpy()[::-1] # reverses list

array([ 430,  438,  131, ..., 2795, 2798, 2794])

In [None]:
top_10_indices = sim_vector.argsort().cpu().numpy()[::-1][:10] # get top 10, this number is arbitrary
top_10_indices

array([430, 438, 131, 585,   3,   4, 261,  93, 189,  94])
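The argsort-and-reverse steps above sort every score even though only the best ten are needed. As a hedged alternative, the same selection can be sketched with the standard library's `heapq.nlargest`; the toy scores and the name `top_k_indices` are assumptions for illustration. For tensors, `torch.topk(sim_vector, k=10).indices` should give an equivalent result.

```python
import heapq

def top_k_indices(scores, k=10):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

toy_scores = [0.12, 0.87, 0.45, 0.91, 0.30]
top_3 = top_k_indices(toy_scores, k=3)
print(top_3)  # → [3, 1, 2]
```

As in the notebook, the returned indices can be used directly to look up the matching entries in `chunks`.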

In [None]:
top_10_chunks = [chunks[i] for i in top_10_indices]
top_10_chunks # most relevant chunks to our search query

['   Antonines, who were themselves men of learning and curiosity. It\r\n      was diffused over the whole extent of their empire; the most\r\n      northern tribes of Britons had acquired a taste for rhetoric;\r\n      Homer as well as Virgil were transcribed and studied on the banks\r\n      of the Rhine and Danube; and the most liberal rewards sought out\r\n      the faintest glimmerings of literary merit. 110 The sciences of\r\n      physic and astronomy were successfully cultivated by the Greeks;\r\n      the observations of Ptolemy and the writings of Galen are studied\r\n      by those who have improved their discoveries and corrected their\r\n      errors; but if we except the inimitable Lucian, this age of\r\n      indolence passed away without having produced a single writer of\r\n      original genius, or who excelled in the arts of elegant\r\n      composition.1101 The authority of Plato and',
 ' the fierce giants of the north broke in, and mended\r\n      the puny breed. T

## Step 2: Generation

With our relevant text chunks, we can now move on to generating an answer to our user query. Prompt engineering can be deceptively difficult. I've provided a very simple prompt below, but feel free to change it and see how that affects the output of the model.

For this example, we are using `llama-cpp-python` to load a pre-quantized version of [this LLM](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B). This is a fine-tuned version of the base Llama-3-8B model and is particularly good at following user instructions.

In [None]:
# what can we do with these chunks?
base_prompt = """
# Assistant Task
Please answer the user query only with reference to the context passages below.

## Context
{chunks}

## User Query
{query}
""".strip()

In [None]:
filled_prompt = base_prompt.format(chunks='\n'.join(top_10_chunks), query=query) # filling our prompt with our text chunks
filled_prompt

'# Assistant Task\nPlease answer the user query only with reference to the context passages below.\n\n## Context\n   Antonines, who were themselves men of learning and curiosity. It\r\n      was diffused over the whole extent of their empire; the most\r\n      northern tribes of Britons had acquired a taste for rhetoric;\r\n      Homer as well as Virgil were transcribed and studied on the banks\r\n      of the Rhine and Danube; and the most liberal rewards sought out\r\n      the faintest glimmerings of literary merit. 110 The sciences of\r\n      physic and astronomy were successfully cultivated by the Greeks;\r\n      the observations of Ptolemy and the writings of Galen are studied\r\n      by those who have improved their discoveries and corrected their\r\n      errors; but if we except the inimitable Lucian, this age of\r\n      indolence passed away without having produced a single writer of\r\n      original genius, or who excelled in the arts of elegant\r\n      composition.110
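One prompt variation worth trying is to number the passages and tidy their whitespace, so that the model can point to the passage it relied on. The sketch below is only one possible design; `numbered_prompt` and `format_context` are hypothetical names, and the two toy passages are shortened from the retrieved chunks shown earlier.

```python
numbered_prompt = """
# Assistant Task
Please answer the user query only with reference to the numbered context passages below.
Cite the passages you rely on, for example [2].

## Context
{chunks}

## User Query
{query}
""".strip()

def format_context(passages):
    """Collapse runs of whitespace and prefix each passage with its number."""
    return "\n\n".join(
        f"[{i}] {' '.join(p.split())}" for i, p in enumerate(passages, start=1)
    )

toy_passages = [
    "   Antonines, who were themselves men of\r\n      learning and curiosity.",
    " the fierce giants of the north broke in, and mended\r\n      the puny breed.",
]
prompt = numbered_prompt.format(
    chunks=format_context(toy_passages),
    query="What were the major goals of the Antonines?",
)
print(prompt)
```

Removing the `\r\n` line breaks and indentation also saves a few tokens per chunk, which leaves more of the context window for the answer.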

In [None]:
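# loading the model onto the GPU
llm = Llama.from_pretrained(
        repo_id="NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF",
        filename="*Q4_K_M.gguf",
        verbose=False,
        n_gpu_layers=-1, # offload all layers to the gpu (the keyword is n_gpu_layers, not n_gpu)
        n_ctx=5000 # context window size in tokens
    )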

In [None]:
# creating a conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant who answers questions."}, # system level prompt, feel free to experiment with this too
    {"role":"user", "content":filled_prompt}
]

In [None]:
# streamed response
max_width = 70
current_length = 0

response_text = '' # separate name so the corpus stored in `text` is not overwritten
for token in llm.create_chat_completion(messages, max_tokens=-1, stream=True):
    delta = token['choices'][0]['delta']
    if 'content' in delta:
        piece = delta['content']
        if current_length + len(piece) + 1 > max_width: # start a new line once the width is reached
            print()
            current_length = 0
        response_text += piece
        print(piece, end='', flush=True)
        current_length += len(piece) + 1
print()
messages.append({"role": "assistant", "content": response_text})

The major goals of the Antonines, according to the context
 passage provided, were:

1. To maintain the dignity and
 stability of the Roman Empire without attempting to enlarge
 its limits.
2. To invite the friendship of the barbar
ians through honorable means.
3. To cultivate the sciences
, arts, and literature by promoting learning and rewarding
 merit.
4. To restore a manly spirit of freedom and
 encourage original genius in writing.

The Antonines,
 specifically Hadrian and Marcus Aurelius, worked together
 to achieve these goals and were known for their virtuous
 conduct and administration of the empire. They also shared
 a joint interest in various fields such as agriculture,
 military affairs, and civil administration.
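The streamed output above breaks lines by counting pieces, which is why words such as "barbarians" end up split across two lines. A possible alternative, sketched here under the assumption that words are shorter than `max_width`, is to buffer the stream and break only at spaces. The generator name `wrap_stream` is hypothetical.

```python
def wrap_stream(pieces, max_width=70):
    """Yield display lines from streamed text pieces, breaking only at spaces."""
    line = ""
    for piece in pieces:
        line += piece
        while True:
            newline = line.find("\n")
            if 0 <= newline <= max_width:
                # the model itself ended the line
                yield line[:newline]
                line = line[newline + 1:]
            elif len(line) > max_width:
                # break at the last space that keeps the line within max_width
                cut = line.rfind(" ", 0, max_width + 1)
                if cut <= 0:
                    break  # no usable space yet; wait for more text
                yield line[:cut]
                line = line[cut + 1:]
            else:
                break
    if line:
        yield line  # whatever is left when the stream ends

toy_pieces = ["The major", " goals", " of the", " Antonines", " were", " many."]
lines = list(wrap_stream(toy_pieces, max_width=20))
for out in lines:
    print(out)
```

The trade-off is that text appears a line at a time instead of a piece at a time, so the display feels slightly less "live" than the original loop.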


## Full process

Below I will refactor the code above into a couple of functions for your ease of use later on.

In [None]:
# semantic search functions
def get_text_chunks(chunk_size=250, chunk_overlap=50):
    res = requests.get('https://gutenberg.org/cache/epub/731/pg731.txt')
    text = res.text

    text_splitter = TokenTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_text(text)
    return chunks

def get_embedding(chunks):
    device = 'cuda' if torch.cuda.is_available() else 'cpu' # fall back to the cpu when no gpu is available
    model = SentenceTransformer('/content/BAAI_bge-m3')
    embeddings = model.encode(
        chunks,
        batch_size=64,
        device=device,
        show_progress_bar=True,
        convert_to_tensor=True,
        normalize_embeddings=True
    )
    return embeddings

def semantic_search(query, embeddings, chunks, model, device=None, k=10):
    device = device or str(embeddings.device) # default to wherever the embeddings already live
    retrieval_instruction = "Represent this sentence for searching relevant passages: " # need to prepend this string
    query_embedding = model.encode(retrieval_instruction+query, device=device, convert_to_tensor=True, normalize_embeddings=True)
    sim_vector = (embeddings.to(device) @ query_embedding.to(device))
    top_k_indices = sim_vector.argsort().cpu().numpy()[::-1][:k] # k most similar chunks, best first
    top_k_chunks = [chunks[i] for i in top_k_indices]
    return top_k_chunks

In [None]:
# prompts
base_prompt = """
# Assistant Task
Please answer the user query only with reference to the context passages below.

## Context
{chunks}

## User Query
{query}
""".strip()

system_prompt = """
You are a helpful AI assistant who answers questions.
""".strip()

# generation functions
def init_messages(base_prompt, system_prompt, query, text_chunks):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role":"user", "content":base_prompt.format(chunks='\n'.join(text_chunks), query=query)}
    ]
    return messages

def generate_response(messages, llm, max_width=70):
    current_length = 0

    text = ''
    for token in llm.create_chat_completion(messages, max_tokens=-1, stream=True):
        delta = token['choices'][0]['delta']
        if 'content' in delta:
            piece = delta['content']
            if current_length + len(piece) + 1 > max_width:
                print()
                current_length = 0
            text += piece
            print(piece, end='', flush=True)
            current_length += len(piece) + 1
    print()
    messages.append({"role": "assistant", "content": text})
    return messages

In [None]:
# example with functions, getting started
chunks = get_text_chunks()
embeddings = get_embedding(chunks)

query = "What were the major failures of the Antonines?"
top_10_chunks = semantic_search(query, embeddings, chunks, model)
messages = init_messages(base_prompt, system_prompt, query, top_10_chunks)
messages = generate_response(messages, llm)


The major failures of the Antonines, according to Edward
 Gibbon's "History of the Decline and Fall of Rome,"
 include:

1. Inability to produce original genius or
 excel in elegant composition: Although they encouraged
 learning and curiosity, the Antonines did not produce any
 notable writers of original genius.
2. Weakness in
 administration: The Antonines failed to address the
 underlying issues that led to the decline of the Roman
 Empire, such as economic problems, military weakness, and
 social decay.
3. Failure to reform the state: Pertinax
's attempts at reform were cut short by his assassination
 by the Praetorian Guards.
4. Inability to maintain
 unity: The empire was dismembered under Valerian and
 Gallienus, reducing it to a low point from which
 recovery seemed impossible.

These failures contributed to
 the decline of the Roman Empire during the Antonine period
.


In [None]:
# adding a new message
query = "Tell me about the rise of Christianity in the Empire."
top_10_chunks = semantic_search(query, embeddings, chunks, model)
messages.append({"role":"user", "content":base_prompt.format(chunks='\n'.join(top_10_chunks), query=query)})
messages = generate_response(messages, llm)

According to Edward Gibbon's "History of the Decline and
 Fall of Rome," the rise of Christianity in the Roman
 Empire can be attributed to several factors:

1. The
 teachings of Jesus Christ: The message of love, compassion
, and forgiveness preached by Jesus Christ resonated with
 many people, especially those who were marginalized or
 oppressed by society.
2. Persecution: The early
 Christians faced persecution under various Roman emperors,
 which only served to strengthen their faith and attract
 more converts.
3. Social conditions: The Roman Empire was
 experiencing a period of economic decline, political
 instability, and moral decay, creating an environment in
 which people were open to new ideas and beliefs.
4. The
 example of the martyrs: The willingness of Christians to
 die for their faith inspired many others to convert to
 Christianity.
5. The spread of Christianity through trade
 and commerce: As trade routes expanded, so did the spread
 of Christianity, as merchants and 

KeyboardInterrupt: 
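Each follow-up question appends another ten chunks to `messages`, so a longer conversation can outgrow the `n_ctx=5000` window set when the model was loaded. A rough sketch of one way to keep the history bounded is shown below. The four-characters-per-token estimate, the default budget, and the names `estimate_tokens` and `trim_messages` are all assumptions; counting with the model's own tokenizer would be more accurate.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return len(text) // chars_per_token + 1

def trim_messages(messages, max_tokens=4000):
    """Keep the system message plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    while kept and kept[0]["role"] != "user":
        kept.pop(0)  # do not start the history with an orphaned assistant reply
    return system + kept

toy_messages = [
    {"role": "system", "content": "You are a helpful AI assistant who answers questions."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
]
trimmed = trim_messages(toy_messages, max_tokens=250)
print([m["role"] for m in trimmed])
```

Calling such a helper on `messages` before `generate_response` would drop the oldest context passages first while keeping the system prompt and the newest question.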

## Problems with RAG

There are a few notable limitations with this simple implementation of RAG:


1.   Information retrieval is a key part of the process: if your information retrieval is inaccurate, then the RAG response will be likewise inaccurate. We used one technique for information retrieval, but there are many more that try to alleviate this problem.
2.   The prompt is incredibly important to the model. Writing it can feel very arbitrary and difficult to control.
3.  The LLM can still make mistakes and misunderstand the context of certain text chunks. Better models will do this less, but it is ultimately impossible to avoid completely.


That said, you should experiment with the code above and try to resolve these issues in this small example.
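As one starting point for the first limitation, you could discard retrieved chunks whose similarity score falls below a cutoff, so that weakly related passages never reach the prompt. This is a sketch only: the `min_score` value of 0.5 is arbitrary and would need tuning for a real corpus, and `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(scored_chunks, min_score=0.5, k=10):
    """Keep at most k chunks whose similarity is at least min_score, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score >= min_score]

# (similarity score, chunk text) pairs with invented scores
toy_scored = [(0.71, "chunk A"), (0.32, "chunk B"), (0.55, "chunk C")]
relevant = filter_by_score(toy_scored, min_score=0.5, k=10)
print(relevant)  # → ['chunk A', 'chunk C']
```

If the filtered list comes back empty, the query probably has no good match in the corpus, and it may be better to say so than to ask the LLM to answer from unrelated passages.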

For any questions, feel free to reach out to peter.nadel@tufts.edu.
