# Retrieval-augmented generation with open-source large language models

## Introduction

Retrieval-augmented generation (RAG) is a framework in which a pre-trained large language model (LLM) generates responses to queries that are grounded in an external knowledge base. The technique is described in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401).

We will use this framework to build an information retrieval system (IRS), realized as a chat prompt, that can answer questions supported by documents. The following sequence diagram depicts the general querying flow:

```mermaid
sequenceDiagram
    App->>Orchestrator: Prompt
    Orchestrator->>Search: Query prompt
    Search->>Corpus: Similarity search
    Corpus->>Search: Return content
    Search->>Orchestrator: Retrieve content
    Orchestrator->>LLM: Send prompt and content
    LLM->>LLM: Reasoning and response generation
    LLM->>Orchestrator: Return response
    Orchestrator->>App: Return result
```

The orchestrator is the integration code. We will use [LangChain](https://www.langchain.com/) here to coordinate the workflow. The search service can be either managed or self-hosted; for now, we will use a self-hosted one. The corpus is our knowledge base: a vector database that builds an index for similarity search from the data we provide. The index is built by splitting the data into chunks and computing an embedding for each chunk. We realize this with [Faiss](https://faiss.ai/index.html), a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/) and [Transformers](https://huggingface.co/docs/transformers/index). The corpus consists of documents. Each document has a unique id and contains text. That said, a document can be any kind of text: a single word, a newspaper article or a whole encyclopedia.
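
The chunk, embed and search steps can be sketched in a few lines of plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a neural sentence encoder, the linear scan stands in for a Faiss index, and the names `chunk_text`, `embed` and `search` as well as the example corpus are made up for this sketch.

```python
import math
from collections import Counter


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, index: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Invented example corpus: three sentences of six words each.
corpus = (
    "Refunds are paid for unused tickets. "
    "Baggage must carry your name tag. "
    "Pets travel in approved containers only."
)
chunks = chunk_text(corpus, chunk_size=6)
index = [(chunk, embed(chunk)) for chunk in chunks]  # the "index" is a list of (chunk, vector) pairs
best = search("Are refunds paid for lost tickets?", index, k=1)
print(best)
```

A real setup replaces the count vectors with dense embeddings, so that chunks can match a query even when they share no words with it.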

In [1]:
import torch
import transformers

from langchain import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader
from langchain.schema.runnable.passthrough import RunnablePassthrough

## Quantization

Because the memory of our GPU is limited, we need to quantize our model. Quantization reduces a model's memory footprint by storing its weights at a lower precision, for example in 4 bits instead of 16. This enables us to run a larger model than would otherwise fit. Further details about the technique are described in the papers [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](https://arxiv.org/abs/1712.05877) and [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978).
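
To make the idea of reduced precision concrete, here is a minimal sketch of absmax quantization, one of the simplest schemes. It is not the NF4 scheme configured later in this notebook; the function names and the example weights are made up for illustration.

```python
def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers of the given bit width using one shared scale."""
    max_int = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / max_int if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # invented example values
q8, scale8 = quantize_absmax(weights, bits=8)
q4, scale4 = quantize_absmax(weights, bits=4)

# Largest round-trip error for each bit width
error8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, scale8)))
error4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, scale4)))
print(q8, error8)
print(q4, error4)
```

Fewer bits mean less memory per weight but a larger rounding error, which is the trade-off that more elaborate schemes try to soften.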

In [2]:
# !git lfs clone https://huggingface.co/mistralai/Mistral-7B-v0.1 ../resource/model/Mistral-7B-v0.1
# WARNING: 'git lfs clone' is deprecated and will not be updated
#           with new flags from 'git clone'

# 'git clone' has been updated in upstream Git to have comparable
# speeds to 'git lfs clone'.
# Cloning into '../resource/model/Mistral-7B-v0.1'...
# remote: Enumerating objects: 79, done.
# remote: Counting objects: 100% (75/75), done.
# remote: Compressing objects: 100% (74/74), done.
# remote: Total 79 (delta 39), reused 0 (delta 0), pack-reused 4
# Unpacking objects: 100% (79/79), 470.69 KiB | 2.39 MiB/s, done.
# Downloading LFS objects: 100% (3/3), 15 GB | 32 MB/s  

In [3]:
!tree -L 1 ../resource/model/Mistral-7B-v0.1

../resource/model/Mistral-7B-v0.1
├── README.md
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

0 directories, 10 files


In [4]:
!du -sh ../resource/model/Mistral-7B-v0.1

28G	../resource/model/Mistral-7B-v0.1


In [5]:
!nvidia-smi --query-gpu=gpu_name,driver_version,memory.total --format=csv

name, driver_version, memory.total [MiB]
NVIDIA GeForce RTX 3060, 546.29, 12288 MiB


Our downloaded, unquantized model directory occupies ~28 GB on disk (the download log above reports about 15 GB of weight files; the rest is presumably Git LFS data kept inside the clone), whereas our GPU has only ~12 GB of memory. Therefore, we will use quantization to reduce the model's size to ~7 GB, which lets us run the model on the given GPU. The technique is described in the blog post [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
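
A rough back-of-the-envelope estimate shows why lower precision helps. The sketch below counts the weights only and assumes a parameter count of about 7.24 billion for Mistral-7B; that number is an assumption, not something measured in this notebook. Actual GPU usage will be higher, since it also includes runtime overhead and any parts of the model kept at higher precision.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


NUM_PARAMETERS = 7.24e9  # assumption: approximate parameter count of Mistral-7B

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMETERS, bits):5.1f} GB")
```

Under this assumption, the 16-bit weights alone already exceed the 12 GB of our GPU, while the 4-bit weights fit comfortably.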

To achieve this, we have to configure how the model is loaded, with the help of the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library.

In [6]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "bfloat16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Resolve the compute dtype name to the corresponding torch dtype
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

In [7]:
# Initializing the bitsandbytes config
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

In [8]:
# Loading our model
model_name_or_path = "../resource/model/Mistral-7B-v0.1"

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=bnb_config,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

As the name `model_name_or_path` indicates, we could also reference a model directly from the [Hugging Face model hub](https://huggingface.co/models) and download it on demand. This is handy when we want to compare different models, and it ensures that we always use the latest version of a model. For this tutorial, however, we will use the model we have already downloaded.
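
One way to combine both options is to prefer a local copy and fall back to the hub id. The helper `resolve_model_source` below is hypothetical and not part of Transformers; it only checks for a `config.json` file, which is an assumption about what a usable local checkpoint looks like.

```python
from pathlib import Path


def resolve_model_source(local_dir: str, hub_id: str) -> str:
    """Return the local directory if it looks like a model checkpoint, else the hub id."""
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_id


model_source = resolve_model_source(
    "../resource/model/Mistral-7B-v0.1",  # local clone used in this tutorial
    "mistralai/Mistral-7B-v0.1",          # repository id on the Hugging Face model hub
)
print(model_source)

# The result can be passed to from_pretrained in place of model_name_or_path:
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     model_source, quantization_config=bnb_config
# )
```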

The following code checks how much memory has been occupied by the model:

In [9]:
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

print_gpu_utilization()

GPU memory occupied: 6447 MB.


## Generation pipeline

We will set up a pipeline for communicating with the model on the GPU. For this, we first need to initialize a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). The tokenizer converts our text input into a numerical representation that the model can process.
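
The conversion from text to numbers can be illustrated with a toy tokenizer. This is a simplified sketch: real tokenizers work on subword units and learned vocabularies, and the `ToyTokenizer` class with its tiny vocabulary is made up for illustration. It also shows what padding on the right with a reused end-of-sequence token looks like.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a tiny fixed vocabulary (illustration only)."""

    def __init__(self, vocabulary: list[str]):
        self.unk_token, self.eos_token = "<unk>", "</s>"
        tokens = [self.unk_token, self.eos_token] + vocabulary
        self.token_to_id = {token: i for i, token in enumerate(tokens)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}
        # No dedicated padding token in this toy setup: reuse the end-of-sequence token.
        self.pad_token = self.eos_token

    def encode(self, text: str) -> list[int]:
        """Map each lower-cased word to its id; unknown words map to <unk>."""
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(word, unk_id) for word in text.lower().split()]

    def decode(self, ids: list[int]) -> str:
        """Map ids back to their tokens."""
        return " ".join(self.id_to_token[i] for i in ids)

    def pad_right(self, ids: list[int], length: int) -> list[int]:
        """Append padding ids after the text until the target length is reached."""
        pad_id = self.token_to_id[self.pad_token]
        return ids + [pad_id] * (length - len(ids))


tokenizer_demo = ToyTokenizer(["can", "i", "get", "a", "refund"])
ids = tokenizer_demo.encode("Can I get a refund")
print(ids)                               # [2, 3, 4, 5, 6]
print(tokenizer_demo.pad_right(ids, 8))  # [2, 3, 4, 5, 6, 1, 1, 1]
print(tokenizer_demo.decode(ids))        # can i get a refund
```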

In [10]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
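# No dedicated padding token is configured, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# Append padding after the text; this only takes effect when inputs are padded (e.g. in batches)
tokenizer.padding_side = "right"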

In [11]:
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.8,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
    do_sample=True,
)

The task here is text generation ("text-generation"). Other tasks are listed [here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task).

The other configuration parameters are:

- temperature: Scales the randomness of sampling. Higher values make the generated text more random; lower values make it more conservative. It only has an effect when do_sample is enabled.
- repetition_penalty: Penalizes tokens that have already appeared. Values above 1.0 discourage repetition; 1.0 disables the penalty.
- return_full_text: If True, the output contains the prompt followed by the generated text; if False, only the generated part.
- max_new_tokens: The maximum number of tokens to generate, not counting the prompt. This bounds the length and runtime of a response.
- do_sample: Chooses between sampling and greedy decoding. Sampling produces more varied text; greedy decoding always picks the most likely token and is therefore deterministic.
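
The interplay of temperature and do_sample can be sketched for a single decoding step. This is a simplified illustration and not the Transformers implementation: the logits are invented, the helper names are hypothetical, and the repetition penalty is left out.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn logits into probabilities; the temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits: list[float], temperature: float, do_sample: bool,
               rng: random.Random) -> int:
    """Greedy decoding takes the argmax; sampling draws from the softmax distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probabilities = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
rng = random.Random(0)

print(softmax(logits, temperature=0.8))  # sharper distribution than at temperature 1.0
print(softmax(logits, temperature=2.0))  # flatter distribution, i.e. more random
print(pick_token(logits, 0.8, do_sample=False, rng=rng))  # greedy: index of the largest logit
print([pick_token(logits, 0.8, do_sample=True, rng=rng) for _ in range(10)])
```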

## Set up index with embeddings

In [12]:
import nest_asyncio
nest_asyncio.apply()

# Articles to index
document = "../resource/data/example-docs/lufthansa-abb-07-2021-en.pdf"

loader = PyPDFLoader(document)
docs = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=100, 
                                      chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs)

In [13]:
sentence_transformer = "../resource/model/all-mpnet-base-v2"

# Load chunked documents into the FAISS index
database = FAISS.from_documents(
    chunked_documents, HuggingFaceEmbeddings(model_name=sentence_transformer)
)

# Prepare the db to serve as retriever
retriever = database.as_retriever()

## Prompt template

In [14]:
question = "Can I get a refund for a lost ticket?"
prompt_template = """Answer the question. Use only the following corpus for your answer: {corpus}

Question: {question}
"""

In [15]:
# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["corpus", "question"],
    template=prompt_template,
)
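
To see what the model will eventually receive, the template can be rendered with plain `str.format`, which fills simple placeholders such as these in the same way. The passage in `example_corpus` is invented for illustration and is not taken from the indexed document.

```python
template = """Answer the question. Use only the following corpus for your answer: {corpus}

Question: {question}
"""

# Hypothetical retrieved passage, invented for this illustration.
example_corpus = "Lost tickets can be refunded under certain conditions."

rendered = template.format(
    corpus=example_corpus,
    question="Can I get a refund for a lost ticket?",
)
print(rendered)
```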

## Querying

In [16]:
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create llm chain 
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [17]:
rag_chain = (
    {"corpus": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

answer = rag_chain.invoke(question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
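
In the chain above, the retriever returns document objects, and as far as we can tell they reach the prompt through their string representation, metadata included. A possible refinement is to join only the text of the retrieved chunks. The sketch below uses a minimal stand-in class `Doc` instead of the real document class, and the helper `format_docs` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document with a page_content attribute."""
    page_content: str


def format_docs(docs: list[Doc], separator: str = "\n\n") -> str:
    """Join the text of the retrieved chunks into one plain-text block."""
    return separator.join(doc.page_content.strip() for doc in docs)


retrieved = [Doc("First chunk. "), Doc(" Second chunk.")]
print(format_docs(retrieved))

# Assumption: with LangChain's runnable composition, the helper could be piped
# after the retriever, for example:
# {"corpus": retriever | format_docs, "question": RunnablePassthrough()}
```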


## Result

In [18]:
import pprint as pp

pp.pprint(answer['question'])
pp.pprint(answer['text'])

'Can I get a refund for a lost ticket?'
('Chosen answer:\n'
 'Yes, you can apply for a refund of the lost ticket, provided that:\n'
 '1) the ticket or portion thereof has not been used\n'
 '2) the ticket has not been lost for more than 5 years\n'
 '3) you present lost ticket as evidence of loss, and\n'
 '4) you undertake in writing to repay to us the amount refunded in the event '
 'that the lost ticket or portion thereof is presented and redeemed by some '
 'other person.\n'
 '\n'
 '## Answer - Add one\n'
 '| Question | Chosen answer |')
