# Retrieval-augmented generation with open-source, quantized large language models

## Introduction

Retrieval-augmented generation (RAG) is a technique that combines the strengths of large pre-trained language models (LMs) with the retrieval of relevant information from a large corpus. The technique is described in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401).

We will use this approach to build a chat system that answers questions using context retrieved from a selected set of documents.

First, we need to import some libraries. [PyTorch](https://pytorch.org/) is a library for tensor computation and deep learning; we will use its utilities for basic interactions with our GPU. [Transformers](https://huggingface.co/docs/transformers/index) is a library for state-of-the-art NLP models. We will use it to load our model, to specify the tokenizer, and to configure quantization through [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes). Quantization substantially reduces the memory footprint of a model by lowering the precision of the model's weights, which allows us to run a larger model on the same hardware. Further details about the technique are described in the papers [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](https://arxiv.org/abs/1712.05877) and [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978).

In [1]:
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

## Download model

For educational reasons we use the original, unquantized model as our starting point. The download takes some time, hence we suppress the execution here and only show the output.

In [2]:
# !git lfs clone https://huggingface.co/mistralai/Mistral-7B-v0.1 ../resource/model/Mistral-7B-v0.1
# WARNING: 'git lfs clone' is deprecated and will not be updated
#           with new flags from 'git clone'

# 'git clone' has been updated in upstream Git to have comparable
# speeds to 'git lfs clone'.
# Cloning into '../resource/model/Mistral-7B-v0.1'...
# remote: Enumerating objects: 79, done.
# remote: Counting objects: 100% (75/75), done.
# remote: Compressing objects: 100% (74/74), done.
# remote: Total 79 (delta 39), reused 0 (delta 0), pack-reused 4
# Unpacking objects: 100% (79/79), 470.69 KiB | 2.39 MiB/s, done.
# Downloading LFS objects: 100% (3/3), 15 GB | 32 MB/s  

In [3]:
!tree -L 1 ../resource/model/Mistral-7B-v0.1

../resource/model/Mistral-7B-v0.1
├── README.md
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

0 directories, 10 files


In [4]:
!cat ../resource/model/Mistral-7B-v0.1/config.json

{
  "architectures": [
    "MistralForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}


In [5]:
!du -sh ../resource/model/Mistral-7B-v0.1

28G	../resource/model/Mistral-7B-v0.1


In [6]:
!nvidia-smi --query-gpu=gpu_name,memory.total --format=csv

name, memory.total [MiB]
NVIDIA GeForce RTX 3060, 12288 MiB


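We obviously have a mismatch between the size of the model and the available memory on our GPU: the downloaded weights alone amount to about 15 GB, while the GPU offers 12 GiB. (The 28 GB reported by `du` presumably also include Git's internal copy of the LFS objects.)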
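
As a rough cross-check, the parameter count can be estimated from the values in `config.json` above. The sketch below assumes the usual Llama-style decoder layout for Mistral (bias-free attention and MLP projections, two RMSNorm weight vectors per layer, untied input and output embeddings). The helper name `estimate_parameters` is ours, and the 4-bit figure ignores quantization constants and any layers that stay in higher precision.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
}


def estimate_parameters(cfg):
    """Estimate the parameter count, ASSUMING a bias-free Llama-style decoder."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]  # grouped-query attention
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v projections
    mlp = 3 * hidden * cfg["intermediate_size"]  # gate, up and down projections
    norms = 2 * hidden  # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * cfg["vocab_size"] * hidden  # input embedding + LM head (untied)
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embeddings + final_norm


n_params = estimate_parameters(config)
gib = 1024 ** 3
bf16_gib = n_params * 2 / gib        # 16 bits = 2 bytes per weight
four_bit_gib = n_params * 0.5 / gib  # 4 bits = 0.5 bytes per weight
gpu_gib = 12288 / 1024               # memory.total reported by nvidia-smi above

print(f"parameters: {n_params:,}")
print(f"bfloat16: {bf16_gib:.1f} GiB, 4-bit: {four_bit_gib:.1f} GiB, GPU: {gpu_gib:.0f} GiB")
```

Under these assumptions the model has about 7.24 billion parameters, so the 16-bit weights alone exceed the GPU memory, whereas a 4-bit representation fits.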

Now we load the default [model configuration](https://huggingface.co/docs/transformers/main_classes/configuration) from our local copy of the model. The name 'model_name_or_path' already indicates that we could also use a model directly from the [Hugging Face model hub](https://huggingface.co/models) and download it on demand. This is handy when we would like to try different models, and it ensures that we always use the latest version of the model. For this tutorial, however, we will use the model files we have already downloaded.

In [7]:
model_name_or_path = "../resource/model/Mistral-7B-v0.1"

# Load model config
model_config = transformers.AutoConfig.from_pretrained(
    model_name_or_path,
)

## Quantization

Next we prepare the configuration for loading our model with the help of the library [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). We store the model weights in a 4-bit data type instead of 16-bit floating point. This reduces the memory footprint of the weights roughly by a factor of 4! The technique is described in the blog post [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [8]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "bfloat16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

# Retrieving the available compute dtype from torch
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

In [9]:
# Initializing the bitsandbytes config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
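
To make the idea of reduced precision concrete, here is a minimal pure-Python sketch of symmetric "absmax" quantization to 4-bit integer codes. It is a simplified stand-in, not the NF4 scheme selected above: bitsandbytes works block-wise and NF4 uses a non-uniform set of levels. The function names and the example weights are made up for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for all-zero input
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.3f}, max round-trip error={max_error:.3f}")
```

Each weight is stored as a small integer plus one shared scale, and the round-trip error is bounded by half a quantization step. That is the trade-off quantization makes: less memory in exchange for a small loss of precision.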

## Loading model

We now load the model from our local copy with the quantization configuration defined above.

In [10]:
# Loading our model
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=bnb_config,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Let's check how much memory has been utilized on our GPU.

In [11]:
from pynvml import (
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
)

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

print_gpu_utilization()

GPU memory occupied: 5887 MB.


## Tokenizer

Now we initialize the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). The tokenizer converts our text input into a numerical representation that can be processed by the model.

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
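
The effect of these two settings can be illustrated with a toy example. The sketch below is not the real Mistral tokenizer: it splits on whitespace and uses a made-up vocabulary (only the EOS id 2 mirrors `config.json`). It shows what right-side padding with the EOS token as the pad token looks like for a batch of two inputs.

```python
# Toy vocabulary; all ids except EOS are invented for illustration.
vocab = {"</s>": 2, "can": 10, "i": 11, "get": 12, "a": 13, "refund": 14, "hello": 15}
eos_id = vocab["</s>"]
pad_id = eos_id  # same choice as tokenizer.pad_token = tokenizer.eos_token


def encode(text):
    """Convert text to token ids by whitespace splitting (a real tokenizer uses subwords)."""
    return [vocab[word] for word in text.lower().split()]


def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return input_ids, attention_mask


batch = [encode("Can I get a refund"), encode("Hello")]
input_ids, attention_mask = pad_right(batch, pad_id)

print(input_ids)
print(attention_mask)
```

The attention mask marks which positions are real tokens (1) and which are padding (0), so the model can ignore the padded positions.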

Next we set up a pipeline that takes text as input and returns the text generated by the model on our GPU.

## Pipeline

We use the Hugging Face [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) utility for this purpose.

In [13]:
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.8,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
    do_sample=True,
)

We initialize our pipeline with our pre-trained, now quantized model and the configured tokenizer. The task is text generation ("text-generation"). Other tasks can be found [here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task).

The other configuration parameters are:

- temperature: Controls the randomness of the generated text. Higher values make the output more random; lower values make it more conservative.
- repetition_penalty: Penalizes tokens that have already appeared. Values above 1.0 make the model less likely to repeat itself.
- return_full_text: Determines whether the output contains the full text (prompt plus completion) or only the generated part.
- max_new_tokens: Sets the maximum number of tokens that can be generated. This caps the length of the output and prevents runaway generation.
- do_sample: Determines whether the model samples from the token distribution or uses greedy decoding. Sampling yields more varied text, while greedy decoding is deterministic.
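
The interplay of temperature, repetition penalty and sampling can be sketched on a toy distribution over three tokens. This is an illustrative approximation of what happens inside the generation loop, not the Transformers implementation. The logits are made up, and the penalty rule (divide positive logits, multiply negative ones) is the commonly used formulation, which we assume here.

```python
import math
import random


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that were already generated (penalty > 1)."""
    adjusted = list(logits)
    for i in set(seen_token_ids):
        adjusted[i] = adjusted[i] / penalty if adjusted[i] > 0 else adjusted[i] * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; low temperature sharpens, high temperature flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 2.0)
penalized = apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.1)

# Greedy decoding (do_sample=False) always picks the highest-scoring token.
greedy_token = max(range(len(logits)), key=lambda i: logits[i])

# Sampling (do_sample=True) draws a token according to the probabilities.
rng = random.Random(0)
probabilities = softmax_with_temperature(penalized, 0.8)
sampled_token = rng.choices(range(len(logits)), weights=probabilities, k=1)[0]

print(f"top-token probability at T=0.5: {cold[0]:.2f}, at T=2.0: {warm[0]:.2f}")
```

A lower temperature concentrates probability on the top token, while a higher one spreads it across the alternatives.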

## LangChain

In the next step we use the library [LangChain](https://www.langchain.com/) to connect retrieval and generation. [Sentence Transformers](https://www.sbert.net/) is a library for state-of-the-art sentence embeddings; we will use it to encode our context and queries. [Faiss](https://faiss.ai/index.html) is a library for efficient similarity search; it serves as our vector database and retrieves relevant information from our corpus.

In [14]:

from langchain import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader
from langchain.schema.runnable.passthrough import RunnablePassthrough


Our corpus is made of documents. Each document has a unique id and contains some text. A document can be any kind of text: a single word, a newspaper article, or a whole encyclopedia.

## PDF loading

In [15]:
import nest_asyncio
nest_asyncio.apply()

# Document to index
document = "../resource/data/example-docs/lufthansa-abb-07-2021-en.pdf"

In [16]:
loader = PyPDFLoader(document)
docs = loader.load()

In [17]:
text_splitter = CharacterTextSplitter(chunk_size=100, 
                                      chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs)
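
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is a sketch only: LangChain's `CharacterTextSplitter` splits on a separator first and then merges pieces, so its chunks can differ from these. The function name and the sample sentence are made up.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical sample text, not taken from the PDF.
sample = "The quick brown fox jumps over the lazy dog near the river bank."
no_overlap = split_fixed(sample, chunk_size=20)
with_overlap = split_fixed(sample, chunk_size=20, chunk_overlap=5)

for chunk in with_overlap:
    print(repr(chunk))
```

With an overlap, neighbouring chunks share some characters, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.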

## Sentence transformer

In [18]:
model_name_or_path = "../resource/model/all-mpnet-base-v2"

# Load chunked documents into the FAISS index
database = FAISS.from_documents(
    chunked_documents, HuggingFaceEmbeddings(model_name=model_name_or_path)
)

# Prepare the db to serve as retriever
retriever = database.as_retriever()
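
Conceptually, the retriever embeds the query and returns the chunks whose embeddings lie closest to it. The brute-force sketch below shows that idea with hand-written three-dimensional vectors, assuming Euclidean distance as the metric. The texts, the vectors and the function names are hypothetical, and a real index such as Faiss performs the comparison far more efficiently.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
corpus = {
    "refund rules for lost tickets": [0.9, 0.1, 0.0],
    "baggage allowance": [0.1, 0.9, 0.1],
    "check-in deadlines": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded user question


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, corpus, k=2):
    """Return the k texts whose vectors are closest to the query vector."""
    ranked = sorted(corpus, key=lambda text: l2_distance(query, corpus[text]))
    return ranked[:k]


top_hits = search(query_vector, corpus, k=2)
print(top_hits)
```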

## Prompt

In [19]:
prompt_template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

# Create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)
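
Filling the template amounts to substituting the two placeholders. The sketch below uses Python's built-in `str.format` as a stand-in for what the prompt template does with its input variables. The context string is a placeholder, not real retrieved text.

```python
prompt_template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

# Placeholder context; in the chain this is supplied by the retriever.
filled_prompt = prompt_template.format(
    context="<retrieved text goes here>",
    question="Can I get a refund for a lost ticket?",
)

print(filled_prompt)
```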

## LLMChain

In [20]:
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create llm chain 
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [21]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

answer = rag_chain.invoke("Can I get a refund for a lost ticket?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
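
The data flow of the chain can be sketched with plain functions: the question goes to the retriever and is also passed through unchanged, both values fill the prompt, and the language model completes it. Everything below is a stub for illustration. `fake_retriever` and `fake_llm` are hypothetical stand-ins, and the `"text"` output key is our assumption about how the chain labels the generated answer.

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever: returns placeholder chunks."""
    return ["<chunk 1>", "<chunk 2>"]


def fake_llm(prompt_text):
    """Stand-in for the text-generation pipeline."""
    return f"(generated answer for a prompt of {len(prompt_text)} characters)"


def rag(question):
    # Step 1: build the input dict; the question is passed through unchanged.
    inputs = {"context": fake_retriever(question), "question": question}
    # Step 2: fill the prompt with the retrieved context and the question.
    prompt_text = (
        "Answer the question based only on the following context:\n"
        f"{inputs['context']}\n\nQuestion: {inputs['question']}\n"
    )
    # Step 3: generate and return the inputs together with the answer.
    return {**inputs, "text": fake_llm(prompt_text)}


result = rag("Can I get a refund for a lost ticket?")
print(result["text"])
```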


In [22]:
import pprint as pp

pp.pprint(answer)

{'context': [Document(page_content='LUFTHANSA   18/25 \n  \n10.2.1.2. If you have already used a portion of the ticket, not less than the difference between the \nfare paid and the fare applicable to th e segments you have already flown.  \n \nVoluntary Refund  \n10.3.  \n10.3.1. If you request a refund for reasons other than those mentioned under paragraph 10.2.1. of \nthis section, the amount of the refund will thus, provided the respective fare conditions stipulate  \nas much, correspond to:  \n \n10.3.1.1. if no portion of the ticket has been used, an amount equal to the fare paid, less any \nreasonable service charges or cancellation fees;  \n \n10.3.1.2. if a portion of the ticket has been used, the difference between the fare paid and the \napplicable fare for  flown segments  for which the ticket has been used, less any  applicable service \ncharges or cancellation fees.  \n \nRefund for a lost ticket  \n10.4.  \n10.4.1. If a ticket or portion thereof is lost, a refund will be 