# RAG with Bitsandbytes Quantized LLM

In this notebook, we build a simplified RAG (Retrieval-Augmented Generation) pipeline.
We also quantize the LLM (Large Language Model) with bitsandbytes.

## Environment

OS

In [1]:
!cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


Python

In [2]:
!python -V

Python 3.11.12


CUDA

In [3]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


cuDNN

In [4]:
import torch
print(torch.backends.cudnn.version())

90300


In [5]:
!find / -name cudnn_version.h

/usr/include/cudnn_version.h
/usr/local/lib/python3.11/dist-packages/nvidia/cudnn/include/cudnn_version.h
/usr/local/lib/python3.11/dist-packages/tensorflow/include/external/cuda_cudnn/include/cudnn_version.h
/usr/local/lib/python3.11/dist-packages/tensorflow/include/third_party/gpus/cudnn/include/cudnn_version.h
find: ‘/proc/78/task/78/net’: Invalid argument
find: ‘/proc/78/net’: Invalid argument


In [6]:
!cat /usr/local/lib/python3.11/dist-packages/nvidia/cudnn/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

#define CUDNN_MAJOR 9
#define CUDNN_MINOR 3
#define CUDNN_PATCHLEVEL 0
--
#define CUDNN_VERSION (CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */


GPU

In [7]:
!nvidia-smi

Sun May 18 07:39:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Preparation

Install the following additional packages.
* bitsandbytes
* langchain-community
* langchain-huggingface
* pdfminer.six
* chromadb

In [8]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-

In [9]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [10]:
!pip install langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.2.0-py3-none-any.whl.metadata (941 bytes)
Downloading langchain_huggingface-0.2.0-py3-none-any.whl (27 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.2.0


In [11]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 46.4 MB/s eta 0:00:00
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250506


In [12]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-1.0.9-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting fastapi==0.115.9 (from chromadb)
  Downloading fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.2-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-4.0.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.33.1-py3-none-any.whl.metadata (1.6 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.33.1-py3-none-any.whl.metadata (2.5 kB)
Collecting opentelemetry-instrumentation-fastapi>=0.41b0 (from chromadb)


Fix the random number seeds for Python, NumPy, and PyTorch so that runs are easier to reproduce.

In [1]:
import random
import numpy as np
import torch

def set_seed(seed=0):
    # for Python
    random.seed(seed)

    # for NumPy
    np.random.seed(seed)

    # for PyTorch, CUDA and cuDNN
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [2]:
set_seed()
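
As a small illustration of what fixing a seed buys you, the sketch below re-seeds Python's built-in `random` module and checks that the same sequence comes back. It deliberately uses only the standard library so it runs anywhere; `draw_numbers` is a hypothetical helper, and the same idea applies to the NumPy and PyTorch generators seeded in `set_seed`.

```python
import random


def draw_numbers(seed, n=3):
    """Seed the generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw_numbers(seed=0)
second = draw_numbers(seed=0)
other = draw_numbers(seed=1)

print(first == second)  # compares two runs that used the same seed
print(first == other)   # compares runs that used different seeds
```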

Select the device (GPU if available, otherwise CPU).

In [3]:
import torch

if torch.cuda.is_available():
    device = torch.device('cuda')
    device_id = 0
else:
    device = torch.device('cpu')
    device_id = -1

print(device)
print(device_id)

cuda
0


## Text Generation Model

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig

B_INST = '[INST]'
E_INST = '[/INST]'
B_SYS = '<<SYS>>\n'
E_SYS = '\n<</SYS>>\n\n'
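# System prompt: "You are a sincere and excellent Japanese assistant.
# Answer the question as accurately as possible."
DEFAULT_SYSTEM_PROMPT = 'あなたは誠実で優秀な日本人のアシスタントです。質問にできるだけ正確に答えてください。'
# System prompt for RAG: same as above, but "answer based on the reference information."
DEFAULT_SYSTEM_RAG_PROMPT = 'あなたは誠実で優秀な日本人のアシスタントです。参考情報を元にして質問にできるだけ正確に答えてください。'
# User prompt templates: 質問 = "Question", 参考情報 = "Reference information"
PROMPT = '## 質問:\n{question}'
RAG_PROMPT = '## 参考情報:\n{context}\n\n## 質問:\n{question}'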


def make_context(results):
    """
    Convert the results of a vector search into a bulleted format for embedding as reference information in a prompt.
    """
    context = [doc for doc in results['documents'][0]]
    context = '\n* '.join(context)
    context = '* ' + context
    return context


class TextGenerator:
    def __init__(self, model_name_or_path, quantization_method=None):
        """
        Setup LLM.
        """
        print(f'model_name_or_path={model_name_or_path}, '
              f'quantization_method={quantization_method}')

        # Load pretrained tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        # Quantization settings
        if quantization_method == 'bitsandbytes':
            quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                                     bnb_4bit_use_double_quant=True,
                                                     bnb_4bit_quant_type='nf4',
                                                     bnb_4bit_compute_dtype=torch.bfloat16)
        else:
            quantization_config = None

        # Load pretrained model
        self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                          torch_dtype='auto',
                                                          quantization_config=quantization_config)

        return

    def make_prompt(self, query, context=None):
        """
        Make a prompt.
        """
        if context is None:
            # Without reference information
            _system_prompt = DEFAULT_SYSTEM_PROMPT
            _prompt = PROMPT.format(question=query)
        else:
            # With reference information (RAG)
            _system_prompt = DEFAULT_SYSTEM_RAG_PROMPT
            _prompt = RAG_PROMPT.format(context=context, question=query)

        prompt = '{bos_token}{b_inst} {system}{prompt} {e_inst} '.format(
            bos_token=self.tokenizer.bos_token,
            b_inst=B_INST,
            system=f'{B_SYS}{_system_prompt}{E_SYS}',
            prompt=_prompt,
            e_inst=E_INST,
        )
        print(prompt)
        return prompt

    def generate_answer(self, prompt):
        """
        Input a prompt to the LLM and generate an answer text.
        """
        # The prompt already contains the BOS token, so do not add special tokens again.
        token_ids = self.tokenizer.encode(prompt,
                                          add_special_tokens=False,
                                          return_tensors='pt')

        # Run generation without tracking gradients (inference only).
        with torch.no_grad():
            output_ids = self.model.generate(
                token_ids.to(self.model.device),
                max_new_tokens=256,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
            )

        # Decode only the newly generated tokens, skipping the prompt.
        output = self.tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):],
                                       skip_special_tokens=True)

        return output

    def run(self, query, context):
        """
        Generate an answer text to a query.
        """
        # Make prompt.
        prompt = self.make_prompt(query=query, context=context)

        # Generate answer text.
        answer = self.generate_answer(prompt=prompt)

        return answer
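
To see what `make_context` and `make_prompt` produce without loading the model, here is a minimal standalone sketch. The `mock_results` dictionary imitates the nesting that the code above reads (`results['documents'][0]`). Its two strings, the English prompts, and the helper names `to_bullets` and `build_rag_prompt` are illustrative assumptions; the `'<s>'` BOS token is taken from the prompt printed later in this notebook.

```python
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'


def to_bullets(results):
    """Join the documents of the first query into a '* '-prefixed list."""
    return '\n'.join(f'* {doc}' for doc in results['documents'][0])


def build_rag_prompt(system_prompt, context, question, bos_token='<s>'):
    """Assemble a Llama 2 chat-style prompt with reference information."""
    user = f'## Reference:\n{context}\n\n## Question:\n{question}'
    return f'{bos_token}{B_INST} {B_SYS}{system_prompt}{E_SYS}{user} {E_INST} '


# Hypothetical search results with the same nesting as the real ones
mock_results = {'documents': [['Individuals and interactions over processes and tools.',
                               'Working software over comprehensive documentation.']]}

context = to_bullets(mock_results)
prompt = build_rag_prompt(
    system_prompt='Answer the question using the reference information.',
    context=context,
    question='What is valued over processes and tools?',
)
print(prompt)
```

For a non-empty result list, `to_bullets` yields the same string as `make_context`; only the way the bullets are joined differs.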

We use [ELYZA-japanese-Llama-2-13b-fast-instruct](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-13b-fast-instruct), a model based on Llama 2 that ELYZA Inc. additionally pre-trained to extend its Japanese language capabilities.

In [5]:
# LLM for text generation
model_name = 'elyza/ELYZA-japanese-Llama-2-13b-fast-instruct'
quantization_method = 'bitsandbytes'
generator = TextGenerator(model_name_or_path=model_name, quantization_method=quantization_method)

model_name_or_path=elyza/ELYZA-japanese-Llama-2-13b-fast-instruct, quantization_method=bitsandbytes


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/983 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/705k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.40M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/29.9k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

pytorch_model-00001-of-00003.bin:   0%|          | 0.00/9.97G [00:00<?, ?B/s]

pytorch_model-00002-of-00003.bin:   0%|          | 0.00/9.87G [00:00<?, ?B/s]

pytorch_model-00003-of-00003.bin:   0%|          | 0.00/6.45G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/31.4k [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]
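
A rough back-of-the-envelope estimate suggests why 4-bit loading is needed on the 15,360 MiB Tesla T4 shown earlier. The sketch assumes about 13 billion parameters (inferred from the model name, not measured) and counts weights only, ignoring quantization metadata, activations, and the KV cache, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


n_params = 13e9                # assumption: roughly 13B parameters
gpu_memory_gib = 15360 / 1024  # Tesla T4 memory reported by nvidia-smi

for label, bits in [('16-bit', 16), ('4-bit (NF4)', 4)]:
    size = weight_memory_gib(n_params, bits)
    fits = 'fits' if size < gpu_memory_gib else 'does not fit'
    print(f'{label}: {size:.1f} GiB -> {fits} in {gpu_memory_gib:.0f} GiB')
```

The 16-bit figure is in line with the three checkpoint shards downloaded above, which total about 26 GB.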

## Vector DB

Define a helper class that builds and queries a vector database.

In [6]:
import chromadb
from chromadb.config import Settings
from langchain_community.document_loaders import PDFMinerLoader
from langchain_huggingface import HuggingFaceEmbeddings


class VectorStore:
    def __init__(self, embedding_model_name, db_path='./db', chunk_size=256, is_persist=True):
        """
        Initialization
        """
        self.collection = None
        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
        self.db_path = db_path
        self.chunk_size = chunk_size
        self.is_persist = is_persist

        # Initialize client
        if self.is_persist:
            # Persist in storage
            self.client = chromadb.PersistentClient(path=self.db_path)
        else:
            # In-memory
            settings = Settings(allow_reset=True)
            self.client = chromadb.EphemeralClient(settings=settings)
            self.client.reset()

        return

    def extract_text(self, target_file):
        """
        Read PDF files and extract the text for each page.
        """
        loader = PDFMinerLoader(target_file)
        text = loader.load()
        pages = text[0].page_content.split('\x0c')
        return pages

    def create_collection(self, collection_name):
        """
        Create a collection.
        """
        self.collection = self.client.create_collection(name=collection_name)
        return

    def get_collection(self, collection_name):
        """
        Get the collection.
        """
        self.collection = self.client.get_collection(name=collection_name)
        return

    def add_collection(self, target_files):
        """
        Add data to the collection.
        """
        for i, target_file in enumerate(target_files):
            pages = self.extract_text(target_file)
            print(f'{target_file}: # of pages={len(pages)}')

            for j, page in enumerate(pages, start=1):
                text = page.replace('\n', '')
                if text == '':
                    continue

                # Split text into chunks
                chunks = [page[idx:(idx + self.chunk_size)].replace('\n', '')
                          for idx in range(0, len(page), self.chunk_size)]

                # Convert chunk to embedding vector, and add it to vector DB
                for k, chunk in enumerate(chunks):
                    embedded_docs = self.embeddings.embed_documents([chunk])
                    self.collection.add(
                        embeddings=embedded_docs,
                        documents=[chunk],
                        metadatas=[{'source': target_file, 'page': j, 'chunk': k}],
                        # j is already 1-based (enumerate start=1), so it needs no offset
                        ids=[f'F{i + 1:03}-P{j:03}-C{k + 1:03}']
                    )

        print(f'# of entries={self.collection.count()}')
        return

    def retrieve(self, query, n_results=5):
        """
        Vector search.
        """
        embedded_query = self.embeddings.embed_query(query)
        results = self.collection.query(
            query_embeddings=embedded_query,
            n_results=n_results,
        )
        return results
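
The chunking in `add_collection` is plain fixed-width slicing by character count. The standalone sketch below reproduces that idea on a short string so the behavior is easy to inspect; `split_into_chunks` and `make_id` are hypothetical helpers, and the sample text is made up.

```python
def split_into_chunks(page, chunk_size):
    """Slice a page into fixed-width pieces, then drop newlines from each."""
    return [page[idx:idx + chunk_size].replace('\n', '')
            for idx in range(0, len(page), chunk_size)]


def make_id(file_no, page_no, chunk_no):
    """Build a zero-padded ID such as 'F001-P002-C003' (all 1-based)."""
    return f'F{file_no:03}-P{page_no:03}-C{chunk_no:03}'


page = 'Agile values\nindividuals and\ninteractions.'
chunks = split_into_chunks(page, chunk_size=16)

for k, chunk in enumerate(chunks, start=1):
    print(make_id(1, 1, k), repr(chunk), len(chunk))
```

Because newlines are removed after slicing, a chunk can be shorter than `chunk_size`, and because boundaries ignore sentence structure, a chunk can start or end mid-word. Both effects are visible in the passages retrieved later in this notebook.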

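Download the PDF file "[アジャイルソフトウェア開発宣言の読みとき方](https://www.ipa.go.jp/jinzai/skill-standard/plus-it-ui/itssplus/ps6vr70000001i7c-att/000065601.pdf)" ("How to Read the Manifesto for Agile Software Development"), published by IPA (Information-technology Promotion Agency, Japan).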

In [7]:
!wget https://www.ipa.go.jp/jinzai/skill-standard/plus-it-ui/itssplus/ps6vr70000001i7c-att/000065601.pdf

--2025-05-18 07:48:23--  https://www.ipa.go.jp/jinzai/skill-standard/plus-it-ui/itssplus/ps6vr70000001i7c-att/000065601.pdf
Resolving www.ipa.go.jp (www.ipa.go.jp)... 3.169.137.43, 3.169.137.71, 3.169.137.55, ...
Connecting to www.ipa.go.jp (www.ipa.go.jp)|3.169.137.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2019698 (1.9M) [application/pdf]
Saving to: ‘000065601.pdf’


2025-05-18 07:48:23 (14.6 MB/s) - ‘000065601.pdf’ saved [2019698/2019698]



In [8]:
target_files = [
    '000065601.pdf',
]

In [9]:
# Embedding model
embedding_model_name = 'intfloat/multilingual-e5-large'

# Settings
db_path = './db'
chunk_size = 256
is_persist = True
collection_name = 'my_collection'

Create a vector database.

In [10]:
vector_store = VectorStore(embedding_model_name=embedding_model_name,
                           db_path=db_path,
                           chunk_size=chunk_size,
                           is_persist=is_persist)
vector_store.create_collection(collection_name=collection_name)
vector_store.add_collection(target_files=target_files)

modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/160k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/690 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/201 [00:00<?, ?B/s]



000065601.pdf: # of pages=21
# of entries=75


If you already have a persisted collection, you can load it by name with `get_collection` instead of calling `create_collection` and `add_collection`.

In [11]:
# from google.colab import drive
# drive.mount('/content/drive')

In [12]:
# path_to_db_on_drive = '/content/drive/xxxxx'
# vector_store = VectorStore(embedding_model_name=embedding_model_name,
#                            db_path=path_to_db_on_drive,
#                            chunk_size=chunk_size,
#                            is_persist=is_persist)
# vector_store.get_collection(collection_name=collection_name)

## Ordinary Text Generation

In [13]:
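# Question (English: "In the Manifesto for Agile Software Development,
# what is valued more than processes and tools?")
query = 'アジャイルソフトウェア開発宣言で、プロセスやツールよりも重視していることは？'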

# Generate answer
answer = generator.run(query=query, context=None)
print(f'\n## 回答:\n{answer}')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。質問にできるだけ正確に答えてください。
<</SYS>>

## 質問:
アジャイルソフトウェア開発宣言で、プロセスやツールよりも重視していることは？ [/INST] 

## 回答:
アジャイルソフトウェア開発宣言では、プロセスやツールよりも重視していることが3つあります。

1. 顧客からのフィードバックを重視する
2. 短期的な計画を重視する
3. チームワークを重視する

アジャイルソフトウェア開発宣言は、これらの価値観を重視して開発を行っています。

English translation of the answer: "The Manifesto for Agile Software Development values three things more than processes and tools: (1) feedback from customers, (2) short-term planning, and (3) teamwork. The manifesto carries out development with an emphasis on these values." Without reference information, the model does not reproduce the manifesto's actual wording, which values "individuals and interactions over processes and tools".


## RAG: Text Generation with Reference to Documents Relevant to the Query
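
Conceptually, the retrieval step in this section embeds the query and returns the stored chunks whose vectors are closest to it. The toy sketch here uses hand-written 3-dimensional vectors and cosine similarity purely for illustration. The notebook itself uses `intfloat/multilingual-e5-large` embeddings and Chroma's own index and distance metric, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Made-up documents and embeddings (not produced by a real model)
store = [
    ('values of the manifesto', [0.9, 0.1, 0.0]),
    ('principles behind the manifesto', [0.7, 0.6, 0.1]),
    ('list of authors', [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]

hits = top_k(query_vec, store, k=2)
print(hits)
```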

In [14]:
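# Question (English: "In the Manifesto for Agile Software Development,
# what is valued more than processes and tools?")
query = 'アジャイルソフトウェア開発宣言で、プロセスやツールよりも重視していることは？'

# Search the vector database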
results = vector_store.retrieve(query=query, n_results=5)

# Process search results for use in LLM
context = make_context(results)

# Generate answer
answer = generator.run(query, context)
print(f'\n## 回答:\n{answer}')

<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。参考情報を元にして質問にできるだけ正確に答えてください。
<</SYS>>

## 参考情報:
* プロセスやツール、ドキュメント、契約、計画」にも価値があることを、明言しています。よってアジャイルソフトウェア開発でも“価値のある”必要なドキュメントは作成しますし、事前に計画を立てて作業を進めていくことは、言うまでもありません。また開発を効率的に進めるためには、有用なツールを活用することも重要です。アジャイルソフトウェア開発宣言で伝えようとしていることは、まずマインドセットがあって、そのうえで「プロセスやツール、ドキュメント、契約、計画」を考えるべきである、ということなのです。このマインドセット
* アジャイルソフトウェア開発宣言私たちは、ソフトウェア開発の実践あるいは実践を手助けをする活動を通じて、よりよい開発方法を見つけだそうとしている。この活動を通して、私たちは以下の価値に至った。プロセスやツールよりも個人と対話を、包括的なドキュメントよりも動くソフトウェアを、契約交渉よりも顧客との協調を、計画に従うことよりも変化への対応を、価値とする。すなわち、左記のことがらに価値があることを認めながらも、私たちは右記のことがらにより価値をおく。Kent BeckMike Bee
* 「アジャイルソフトウェア開発宣言」に対する誤解と真意「アジャイルソフトウェア開発宣言」のうち、価値について言及している文は「〜よりも」とあることから、一見すると左記のことがら「プロセスやツール、ドキュメント、契約、計画」は疎かにしてもよいと解釈されがちです。ここから、アジャイルソフトウェア開発ではドキュメントを作成しなくてもよいとか、計画は考えなくてもよいなどの誤解が生じることが、よくあります。ですが、見落とされがちな「左記のことがらにも価値があることを認めながらも」という一文にあるとおり、 「
* アジャイル宣言の背後にある原則私たちは以下の原則に従う:顧客満足を最優先し、価値のあるソフトウェアを早く継続的に提供します。要求の変更はたとえ開発の後期であっても歓迎します。変化を味方につけることによって、お客様の競争力を引き上げます。動くソフトウェアを、2-3週間から2-3ヶ月というできるだけ短い時間間隔でリリース