## Retrieval Augmented Generation with LanceDB  

**Objective:**
Use an LLM (an OpenAI chat model in the cells below; Llama 2 or another model can be swapped in), LangChain, and LanceDB to create a Retrieval Augmented Generation (RAG) system.

This will allow us to ask questions about our documents (that were not included in the training data) without fine-tuning the Large Language Model (LLM).

Splitting the text into chunks lets the retriever pass only the relevant passages to the LLM, which helps it give accurate answers and reduces hallucination.


## What is a Retrieval Augmented Generation (RAG) system?

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information. Implementing RAG in an LLM-based question answering system has two main benefits:
1. It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted.
2. By grounding an LLM on a set of external, verifiable facts, the model has fewer opportunities to pull information baked into its parameters. This reduces the chances that an LLM will leak sensitive data, or ‘hallucinate’ incorrect or misleading information.


The orchestration of the retriever and generator will be done using LangChain. A specialized function from LangChain allows us to create the retriever-generator chain in one line of code.
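
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not how LangChain implements it: the word-overlap scoring, the `rag_answer` helper, the sample passages, and the `echo_llm` stand-in are all hypothetical and exist only to show the order of operations (retrieve passages, build a prompt, call the generator).

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def rag_answer(
    query: str, passages: List[str], generate: Callable[[str], str], k: int = 2
) -> str:
    """Retrieve context first, then hand context plus question to the generator."""
    context = "\n".join(retrieve(query, passages, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)


# Hypothetical sample passages, for illustration only.
passages = [
    "LanceDB is a vector database.",
    "A text splitter breaks documents into chunks.",
    "An embedding model maps text to vectors.",
]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the prompt it was given."""
    return prompt


prompt_seen = rag_answer("What is LanceDB?", passages, echo_llm, k=1)
print(prompt_seen)
```

In the notebook, the embedding-based LanceDB retriever plays the role of `retrieve`, and the chat model plays the role of `generate`.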




In [3]:
# Installation

!pip install transformers accelerate einops langchain langchain_community xformers bitsandbytes lancedb sentence_transformers


In [4]:
# imports

from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
import lancedb
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import LanceDB

In [5]:
# Dataset (in txt format)

!wget https://gist.githubusercontent.com/PrashantDixit0/10fd4ab8a7d0d37de361af2a06ecfbe2/raw/indianEconomy.txt

--2024-12-04 19:20:46--  https://gist.githubusercontent.com/PrashantDixit0/10fd4ab8a7d0d37de361af2a06ecfbe2/raw/indianEconomy.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2620 (2.6K) [text/plain]
Saving to: ‘indianEconomy.txt’


2024-12-04 19:20:46 (46.0 MB/s) - ‘indianEconomy.txt’ saved [2620/2620]



In [6]:
# load data

dataloader = TextLoader("indianEconomy.txt", encoding="utf8")
documents = dataloader.load()

## Text Chunking

We have discussed various types of text chunking for LLM applications, along with related tips and tricks, in a separate blog post.

Refer to: https://medium.com/p/a420efc96a13/edit

You can try out different text splitting strategies to suit your data, using the tips and tricks from the blog.

For now, we use recursive text splitting from LangChain.

In [7]:
recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10000, chunk_overlap=200
)
all_splits = recursive_text_splitter.split_documents(documents)
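
To see what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain sliding window over characters. This is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not attempt.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into chunks of at most chunk_size characters,
    where each chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Note that the downloaded file is only 2620 bytes, so with `chunk_size=10000` the whole document fits in a single chunk. A smaller `chunk_size` is needed if you want the retriever to choose between several passages.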

## Embeddings Generator

Creating embeddings using Sentence Transformer with HuggingFace embeddings.

In [8]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
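
Embeddings are useful because texts with similar meaning map to vectors that point in similar directions. The sketch below shows one common way to measure that, cosine similarity, in plain Python. The three-dimensional vectors are made-up toy values (the real `all-mpnet-base-v2` model produces 768-dimensional vectors), and `cosine_similarity` is a helper written for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
economy_query = [0.9, 0.1, 0.0]
economy_chunk = [0.8, 0.2, 0.1]
weather_chunk = [0.0, 0.2, 0.9]

print(cosine_similarity(economy_query, economy_chunk))
print(cosine_similarity(economy_query, weather_chunk))
```

With real embeddings you would pass the lists returned by `embeddings.embed_query(...)` instead of the toy vectors.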


## LanceDB for vector storage and searching

Initialize LanceDB with the chunks produced by recursive text splitting. The *embeddings* Sentence Transformer object is used to extract an embedding from each text split.

In [10]:
db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
    "rag_table",
    data=[
        {
            "vector": embeddings.embed_query("Indian Economy"),
            "text": "Current and future details of Indian Economy",
            "id": "1",
        }
    ],
    mode="overwrite",
)

# Index the chunks from the text splitter rather than the unsplit documents
vectordb = LanceDB.from_documents(all_splits, embeddings, connection=db)
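
Conceptually, a vector store keeps rows like the one created above (`vector`, `text`, `id`) and returns the rows whose vectors lie closest to the query vector. The sketch below does this by brute force in plain Python using L2 (Euclidean) distance, which, as far as we know, is LanceDB's default metric. It is not LanceDB's implementation; the `search` helper, the row contents, and the two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(rows: List[Dict], query_vector: Sequence[float], k: int = 2) -> List[Dict]:
    """Brute-force nearest-neighbour search: sort all rows by distance, keep the top k."""
    ranked = sorted(rows, key=lambda row: l2_distance(row["vector"], query_vector))
    return ranked[:k]


# Toy rows with the same keys as the table above (values are made up).
rows = [
    {"id": "1", "text": "GDP growth figures", "vector": [0.9, 0.1]},
    {"id": "2", "text": "Monsoon forecast", "vector": [0.1, 0.9]},
    {"id": "3", "text": "Quarterly growth estimates", "vector": [0.8, 0.3]},
]

top = search(rows, query_vector=[1.0, 0.0], k=2)
print([row["id"] for row in top])
# ['1', '3']
```

A real vector database avoids scanning every row by using an index, but the input and output have the same shape as in this sketch.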

## Initialize RAG Chain


## Chat Models

You can swap in any other LLM as the chat model here.

LangChain integrates with several chat models that can be used to generate answers in a RAG pipeline; see:
https://python.langchain.com/docs/integrations/chat/

In [11]:
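# ChatOpenAI imported from langchain_community, consistent with the other imports above
from langchain_community.chat_models import ChatOpenAI
import os

# Placeholder value: replace with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
llm = ChatOpenAI()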



In [12]:
# Retriever
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, verbose=True
)
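
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The template wording and the `stuff_prompt` helper are hypothetical and are not LangChain's actual prompt.

```python
from typing import List

# Hypothetical template; LangChain's built-in wording differs.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join every retrieved chunk into one context block and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for retriever output.
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = stuff_prompt(retrieved_chunks, "What is growth of Indian Economy?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined chunks fit in the model's context window. That is one reason to keep `chunk_size` and the number of retrieved chunks modest.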

In [13]:
# Results of RAG

query = "What is growth of Indian Economy?"

result = qa.invoke({"query": query})

print(result)





> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is growth of Indian Economy?', 'result': 'The Indian economy grew at a healthy rate of 7.8 percent in the first quarter of the ongoing financial year. Economists at the Reserve Bank of India have pegged growth at 6.8 percent in the second quarter, which is marginally higher than expectations. Overall, the economy is showing signs of sustained momentum, with various economic indicators pointing towards growth in different sectors. However, there are some concerns such as sluggish global demand affecting exports, lack of broad-based pick-up in the investment cycle, and challenges in creating high-quality jobs for the increasing labor force. Additionally, there are concerns about rising household borrowings and potential implications of a credit slowdown on the overall economic growth.'}
