<a href="https://colab.research.google.com/github/helloworld-chhanda/RAG_with_Langchain/blob/main/recipes/RAG/RAG_with_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation (RAG) with LangChain
*Using IBM Granite Models*

## In this notebook
This notebook contains instructions for performing Retrieval Augmented Generation (RAG). RAG is an architectural pattern that improves the output of a language model by retrieving factual information from a knowledge base and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we take embeddings of the passages, and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
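
The three steps above can be sketched without any external service. This is a toy illustration only: the bag-of-words `embed` function stands in for a dense embedding model, a Python list stands in for the vector database, and the sample passages, `retrieve`, and `build_prompt` are hypothetical names that do not appear elsewhere in this recipe. The final prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1 (initial setup): index the knowledge-base passages.
passages = [
    "The warranty covers battery defects for two years.",
    "Shipping usually takes three to five business days.",
    "Returns are accepted within thirty days of purchase.",
]
index = [(p, embed(p)) for p in passages]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2 (per query): return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 3 (per query): combine retrieved passages and the query into one prompt."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


query = "How long does the warranty cover the battery?"
top = retrieve(query)
prompt = build_prompt(query, top)
print(prompt)
```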

## Setting up the environment

Install dependencies.

In [1]:
%pip install git+https://github.com/ibm-granite-community/utils \
    transformers \
    langchain_community \
    'langchain_huggingface[full]' \
    langchain_milvus \
    replicate \
    wget

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-s1e1v074
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-s1e1v074
  Resolved https://github.com/ibm-granite-community/utils to commit 19b0757d88cb6d052c53f4dbe04f3ea3f36cf011
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting langchain_community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain_milvus
  Downloading langchain_milvus-0.2.1-py3-none-any.whl.metadata (3.8 kB)
Collecting replicate
  Downloading replicate-1.0.7-py3-none-any.whl.metadata (29 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting langchain_huggingface[full]
  Dow

## Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text.

To use a model from a provider other than Hugging Face, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [2]:
%pip install git+https://github.com/ibm-granite-community/utils \
    'langchain_huggingface[full]' \
    langchain_ibm

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-7szlsvkb
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-7szlsvkb
  Resolved https://github.com/ibm-granite-community/utils to commit 19b0757d88cb6d052c53f4dbe04f3ea3f36cf011
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting langchain_ibm
  Downloading langchain_ibm-0.3.15-py3-none-any.whl.metadata (5.2 kB)
Collecting ibm-watsonx-ai<2.0.0,>=1.3.28 (from langchain_ibm)
  Downloading ibm_watsonx_ai-1.3.32-py3-none-any.whl.metadata (6.9 kB)
Collecting lomond (from ibm-watsonx-ai<2.0.0,>=1.3.28->langchain_ibm)
  Downloading lomond-0.3.3-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting ibm-cos-sdk<2.15.0,>=2.12.0 (from ibm-watsonx-ai<2.0.0,>=1.3.28->

In [8]:
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_path)
# embeddings_model = HuggingFaceEmbeddings(
#     model_name=embeddings_model_path,
# )
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, substitute this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [9]:
%pip install git+https://github.com/ibm-granite-community/utils \
    langchain_community \
    langchain_chroma \
    langchain_milvus

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-54esygq6
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-54esygq6
  Resolved https://github.com/ibm-granite-community/utils to commit 19b0757d88cb6d052c53f4dbe04f3ea3f36cf011
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting langchain_chroma
  Using cached langchain_chroma-0.2.5-py3-none-any.whl.metadata (1.1 kB)
Collecting chromadb>=1.0.9 (from langchain_chroma)
  Downloading chromadb-1.0.16-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting pybase64>=1.4.1 (from chromadb>=1.0.9->langchain_chroma)
  Downloading pybase64-1.4.2-cp311-cp311-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5

In [10]:
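from langchain_community.embeddings import FakeEmbeddings

# WARNING: this overrides the Granite embedding model selected earlier.
# FakeEmbeddings returns random vectors, so retrieval will not be semantically meaningful.
embeddings_model = FakeEmbeddings(size=384)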

In [11]:
from langchain_milvus import Milvus
import uuid

db_file = f"/tmp/milvus_{str(uuid.uuid4())[:8]}.db"
vector_db = Milvus(embedding_function=embeddings_model, connection_args={"uri": db_file}, auto_id=True)

### Choose your LLM
The LLM will be used for answering the question, given the retrieved text.

Select a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. Here we use the Replicate LangChain client to connect to the model.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [12]:
%pip install git+https://github.com/ibm-granite-community/utils \
    langchain_community \
    replicate \
    langchain_ollama \
    langchain_ibm \
    langchain_openai

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-bsd6e1kh
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-bsd6e1kh
  Resolved https://github.com/ibm-granite-community/utils to commit 19b0757d88cb6d052c53f4dbe04f3ea3f36cf011
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting langchain_ollama
  Downloading langchain_ollama-0.3.6-py3-none-any.whl.metadata (2.1 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.3.29-py3-none-any.whl.metadata (2.4 kB)
Collecting ollama<1.0.0,>=0.5.1 (from langchain_ollama)
  Downloading ollama-0.5.3-py3-none-any.whl.metadata (4.3 kB)
Collecting langchain_core (from ibm-granite-community-utils==0.1.dev83)
  Downloading langchain_core-0.3.74-py3-none-any.

In [13]:
%pip install git+https://github.com/ibm-granite-community/utils \
    langchain_community \
    replicate

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-gpnebara
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-gpnebara
  Resolved https://github.com/ibm-granite-community/utils to commit 19b0757d88cb6d052c53f4dbe04f3ea3f36cf011
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [14]:
from getpass import getpass
import os

os.environ["REPLICATE_API_TOKEN"] = getpass("Paste your Replicate token here: ")


Paste your Replicate token here: ··········


In [15]:
token = os.environ.get("REPLICATE_API_TOKEN")

In [16]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

# model = Replicate(
#     model="ibm-granite/granite-3.3-8b-instruct",
#     replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
# )
model = Replicate(
    model="ibm-granite/granite-3.3-8b-instruct",
    replicate_api_token=token
)


In [17]:
if token is None:
    raise ValueError("REPLICATE_API_TOKEN is not set.")
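
The `None` check above runs only after the `Replicate` client has already been constructed. One possible alternative is to validate the variable first, so that a missing token fails early with a clear message. The helper `require_env` and the variable name `EXAMPLE_API_TOKEN` below are hypothetical and used purely for illustration; they are not part of this recipe or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable, or raise before any client is built."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} is not set.")
    return value


# Demonstration with a placeholder variable, so no real secret is touched.
os.environ["EXAMPLE_API_TOKEN"] = "placeholder-value"
example_token = require_env("EXAMPLE_API_TOKEN")
```

In the notebook, the same pattern would be applied to `REPLICATE_API_TOKEN` before the `Replicate(...)` call.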


In [18]:
from getpass import getpass
from langchain_community.llms import Replicate

# Step 1: Securely get the token from user (you won't see what you type)
token = getpass("Paste your Replicate API token (starts with r8_): ")
print("Token prefix:", token[:5])  # Should print: r8_...


# Step 2: Pass token directly to Replicate model (do NOT use get_env_var)
model = Replicate(
    model="ibm-granite/granite-3.3-8b-instruct",
    replicate_api_token=token
)

# Step 3: Write the prompt
prompt = """\
<|start_of_role|>user<|end_of_role|>\
Tell a story about a duck who likes french fries.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""

# Step 4: Invoke the model and print the result
response = model.invoke(prompt)
print("\nResponse:\n", response)


Paste your Replicate API token (starts with r8_): ··········
Token prefix: r8_1z

Response:
 Once upon a time, in a bustling city park, lived a charming duck named Dimitri. Dimitri was no ordinary duck; he had an extraordinary liking for French fries. Every day, as the sun began to set, people would leave the park, often discarding their half-eaten French fries. Dimitri, with his keen sense of smell, would waddle over to the discarded treasures, quacking in delight.

One day, a kind-hearted girl named Lily noticed Dimitri's fondness for French fries. She decided to bring him some every day, ensuring they were fresh and crispy. Dimitri was overjoyed and would eagerly await Lily's arrival, waddling towards her with a happy quack.

Word spread about Dimitri and his peculiar taste in food. Soon, the park was filled with people who would save their French fries for Dimitri. He became a beloved figure in the park, a symbol of friendship between humans and animals.

And so, Dimitri, the Frenc

## Building the Vector Database

In this example, we take the State of the Union speech text, split it into chunks, derive embedding vectors for the chunks using the embedding model, and load them into the vector database for querying.

### Download the document

Here we use President Biden's State of the Union address from March 1, 2022.

In [19]:
import os
import wget

filename = 'state_of_the_union.txt'
url = 'https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
    wget.download(url, out=filename)

### Split the document into chunks

Split the document into text segments that fit within the embedding model's maximum input length, measured in tokens of the embedding model's tokenizer.

In [20]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader(filename)
documents = loader.load()
# Measure chunk size in tokens of the embedding model's tokenizer
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=embeddings_tokenizer,
    chunk_size=embeddings_tokenizer.max_len_single_sentence,
    chunk_overlap=0,
)
texts = text_splitter.split_documents(documents)
# Tag each chunk with a 1-based document ID
for doc_id, text in enumerate(texts, start=1):
    text.metadata["doc_id"] = doc_id
print(f"{len(texts)} text document chunks created")

19 text document chunks created
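
To make the chunking idea concrete, here is a simplified sketch of token-budget splitting. It is an approximation under stated assumptions, not the splitter's actual code: `count_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, and `chunk_paragraphs` is a hypothetical helper that greedily merges blank-line-separated paragraphs until the budget would be exceeded.

```python
def count_tokens(text: str) -> int:
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_paragraphs(text: str, max_tokens: int) -> list[str]:
    """Greedily merge paragraphs into chunks of at most max_tokens tokens."""
    chunks, current = [], []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = current + [para]
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count_tokens("\n\n".join(candidate)) > max_tokens:
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
chunks = chunk_paragraphs(sample, max_tokens=5)
print(chunks)
```

Note that in this sketch a single paragraph longer than the budget is kept whole rather than cut mid-paragraph.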


### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [21]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")

19 documents added to the vector database


## Querying the Vector Database

### Conduct a similarity search

Search the database for similar documents by proximity of the embedded vector in vector space.
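
As a rough illustration of what "proximity in vector space" means, the sketch below ranks a handful of made-up two-dimensional vectors against a query vector and keeps the closest four, matching the number of results the search in this notebook returns. Cosine similarity is an assumption here; the metric a real vector database uses depends on how its index is configured. The names `cosine_similarity`, `top_k`, and `stored` are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=4):
    """Return the k (score, doc_id) pairs most similar to query_vec, best first."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in stored.items())
    return heapq.nlargest(k, scored)


# Made-up vectors keyed by document ID.
stored = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
    "e": [0.5, 0.5],
}
results = top_k([1.0, 0.0], stored)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```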

In [22]:
query = "What did the president say about Ketanji Brown Jackson?"
docs = vector_db.similarity_search(query)
print(f"{len(docs)} documents returned")
for doc in docs:
    print(doc)
    print("=" * 80)  # Separator for clarity

4 documents returned
page_content='As Frances Haugen, who is here with us tonight, has shown, we must hold social media platforms accountable for the national experiment they’re conducting on our children for profit. 

It’s time to strengthen privacy protections, ban targeted advertising to children, demand tech companies stop collecting personal data on our children. 

And let’s get all Americans the mental health services they need. More people they can turn to for help, and full parity between physical and mental health care. 

Third, support our veterans. 

Veterans are the best of us. 

I’ve always believed that we have a sacred obligation to equip all those we send to war and care for them and their families when they come home. 

My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free.  

Our troops in Iraq and Afghanistan faced many dangers. 

One was stationed at bases and breathing in toxic smoke fro

## Answering Questions

### Automate the RAG pipeline

Build a RAG chain with the model and the document retriever.

First we create the prompt for the RAG query. We use a standard LangChain `ChatPromptTemplate` with `{context}` and `{input}` placeholders, which the RAG pipeline fills with the retrieved passages and the user question.

Next, we construct the RAG pipeline from the prompt template, the model, and the document retriever.

In [31]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

# Create a standard ChatPromptTemplate
prompt_template = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
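
Conceptually, the "stuff" step concatenates the retrieved chunks and substitutes them into the prompt. The sketch below imitates that with plain string formatting. It is a simplification, not LangChain's implementation: `stuff_documents` is a hypothetical helper, and the blank-line separator between chunks is an assumption.

```python
PROMPT = """Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}"""


def stuff_documents(docs, question, separator="\n\n"):
    """Join retrieved chunks and fill the template's two placeholders."""
    context = separator.join(docs)
    return PROMPT.format(context=context, input=question)


filled = stuff_documents(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What was said?",
)
print(filled)
```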

### Generate a retrieval-augmented response to a question

Use the RAG chain to process a question. The document chunks relevant to that question are retrieved and used as context.

In [32]:
output = rag_chain.invoke({"input": query})

print(output['answer'])

The provided context does not contain any information about Ketanji Brown Jackson. Therefore, I cannot answer this question based solely on the given context.
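
This answer is most likely a consequence of the `FakeEmbeddings` cell earlier in the notebook, which replaced the Granite embedding model before the vector database was created. Random embeddings carry no information about the text, so the retrieved chunks are unrelated to the question. The sketch below imitates that behavior with a hypothetical `random_embed` function (not the library's code): embedding the same text twice gives two different vectors.

```python
import random

rng = random.Random(0)


def random_embed(text: str, size: int = 8) -> list[float]:
    """Ignores `text` entirely, imitating an embedding model that returns noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


first = random_embed("Ketanji Brown Jackson")
second = random_embed("Ketanji Brown Jackson")
print(first == second)
```

To get a grounded answer, rebuild the vector database with the `HuggingFaceEmbeddings` model and rerun the query.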


In [33]:
import replicate
import os
from ibm_granite_community.notebook_utils import get_env_var

try:
    # Ensure the environment variable is set
    replicate_api_token = get_env_var('REPLICATE_API_TOKEN')
    os.environ['REPLICATE_API_TOKEN'] = replicate_api_token

    # List models using the replicate library to test authentication
    models = replicate.models.list()
    print("Successfully authenticated with Replicate API.")
    # print(f"Found {len(models)} models.") # Uncomment to see the number of models
except replicate.exceptions.ReplicateError as e:
    print(f"Error authenticating with Replicate API: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Successfully authenticated with Replicate API.
