# Retrieval Augmented Generation (RAG) with LangChain
*With IBM Granite Models*

## In this notebook
This notebook contains instructions for performing Retrieval Augmented Generation (RAG). RAG is an architectural pattern that can improve the performance of language models by retrieving factual information from a knowledge base and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we compute embeddings of the passages with a Hugging Face embedding model and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
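
The three steps above can be sketched in plain Python. This is only a toy illustration, not the method used later in this notebook: the bag-of-words `embed` function stands in for a real embedding model, `echo_llm` stands in for a real language model, and the sample passages are made up for the example.

```python
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    """Dot product of two sparse count vectors (higher means more shared words)."""
    return sum(count * b[word] for word, count in a.items())

# Step 1 (initial setup): index the knowledge-base passages.
passages = [
    "The warranty covers battery replacement for two years.",
    "Returns are accepted within thirty days of purchase.",
    "The device charges fully in about ninety minutes.",
]
index = [(embed(p), p) for p in passages]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: similarity(q, item[0]), reverse=True)
    return [passage for _, passage in ranked[:k]]

def answer(query: str, llm, k: int = 1) -> str:
    """Step 3: feed the retrieved passages and the query to a language model."""
    context = "\n".join(retrieve(query, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)

def echo_llm(prompt: str) -> str:
    """Hypothetical stub that returns the prompt instead of calling a model."""
    return prompt

query = "How long is the warranty on the battery?"
top_passages = retrieve(query)
response = answer(query, echo_llm)
print(response)
```

Because `answer` accepts any callable as `llm`, the stub could be swapped for a real model client without changing the retrieval code.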

## Setting up the environment

Ensure you are running Python 3.10 in a freshly created virtual environment.

In [1]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 11), "Use Python 3.10 to run this notebook."

### Install and import the dependencies

Install the dependencies in one `pip` command, so that pip's dependency resolver can resolve compatible versions of all of them together.

In [None]:
! pip install \
  "git+https://github.com/ibm-granite-community/utils.git" \
  "langchain-core" \
  "langchain" \
  "ibm-watsonx-ai" \
  "langchain_ibm" \
  "wget" \
  "pydantic" \
  "sqlalchemy" \
  "langchain-community" \
  "langchain-huggingface" \
  "pinecone"

In [19]:
from ibm_granite_community.langchain_utils import find_langchain_model, find_langchain_vector_db

## Selecting System Components

### Choose your Embeddings Model

LangChain retrievers use `embed_documents` and `embed_query` under the hood to generate embedding vectors for the uploaded documents and the user query, respectively.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
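
To make the roles of `embed_documents` and `embed_query` concrete, here is a minimal sketch of an object exposing those two methods. `ToyEmbeddings` is a hypothetical class written for this illustration and is not part of LangChain; its hashing scheme is an assumption chosen only to keep the example dependency-free, and it does not capture semantic similarity the way a real embedding model does.

```python
import hashlib

class ToyEmbeddings:
    """Hypothetical stand-in with the same two method names as a LangChain embeddings model."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        # Hash each word into one of `dim` buckets and count occurrences.
        vector = [0.0] * self.dim
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vector[digest[0] % self.dim] += 1.0
        return vector

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of documents: one vector per input text."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        """Embed a single query string: one vector."""
        return self._embed(text)

toy = ToyEmbeddings(dim=8)
doc_vectors = toy.embed_documents(["first passage", "second passage"])
query_vector = toy.embed_query("first passage")
print(len(doc_vectors), len(query_vector))
```

The batch method is used when documents are loaded into the vector database, and the single-text method is used for each incoming query; both must produce vectors of the same dimension so they can be compared.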

### Choose your Vector Database

In [None]:
import random, string
vector_db_provider = "milvus"

# Generate random filename for the milvus db
db_file = f"/tmp/milvus_{''.join(random.choices(string.ascii_lowercase + string.digits, k=10))}.db"
print(f"The vector database will be saved to {db_file}")

vector_db = find_langchain_vector_db(vector_db_provider, embeddings_model, connection_args={"uri": db_file}, auto_id=True)

### Choose your LLM
Specify the model that will be used for inference.

In [6]:
model_id = "ibm-granite/granite-8b-code-instruct-128k"

llm = find_langchain_model("replicate", model_id)

## Building the Vector Database

In this example, we take the State of the Union speech text, split it into chunks, derive embedding vectors for the chunks using the embedding model, and load them into the vector database for querying.

### Download the document

Here we use President Biden's State of the Union address from March 1, 2022.

In [7]:
import os, wget

filename = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
  wget.download(url, out=filename)

### Split the document into chunks

Split the document into text segments that can fit into the model's context window.

In [8]:
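# TextLoader is provided by the langchain-community package installed above
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the speech as a single LangChain document
loader = TextLoader(filename)
documents = loader.load()
# Split into chunks of roughly 1,000 characters with no overlap between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)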
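
The idea behind size-limited chunking can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: `split_into_chunks` is a hypothetical helper that splits on blank lines and greedily merges the pieces up to `chunk_size` characters, and it ignores overlap and other options that a real splitter supports.

```python
def split_into_chunks(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, so a chunk can exceed the limit.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start another
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "Paragraph one.\n\nParagraph two is here.\n\nParagraph three."
chunks = split_into_chunks(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Smaller chunks give more precise retrieval targets, while larger chunks carry more surrounding context per retrieved passage; the right value depends on the documents and the model.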

### Create and populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
# Embed each chunk with the embeddings model and store it in the vector database
vector_db.add_documents(texts)

## Querying the Vector Database

### Conduct a similarity search

Search the database for the chunks whose embedding vectors lie closest to the embedding of the query.

In [None]:
query = "What did the president say about Ketanji Brown Jackson?"
docs = vector_db.similarity_search(query)
print(docs[0].page_content)
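
As a rough picture of what "proximity in vector space" means, the sketch below ranks stored vectors by cosine similarity to a query vector. Cosine similarity is one common choice; the metric actually used depends on how the vector database is configured. The helper names and the three-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical pre-computed embeddings for three passages.
stored_vectors = [
    ("passage A", [1.0, 0.0, 0.0]),
    ("passage B", [0.7, 0.7, 0.0]),
    ("passage C", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.1, 0.0], stored_vectors, k=2)
for score, text in results:
    print(f"{text}: {score:.3f}")
```

A production vector database typically uses an approximate nearest-neighbor index rather than scoring every stored vector, but the ranking principle is the same.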

## Answering Questions

### Automate the RAG pipeline

Build a question-answering chain with the model and the document retriever.

In [11]:
from langchain.chains import RetrievalQA

# The "stuff" chain type places all retrieved chunks into a single prompt
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vector_db.as_retriever())
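
The `"stuff"` chain type passes every retrieved document to the model in one prompt. The sketch below illustrates that idea only; `stuff_prompt` is a hypothetical helper, its template wording is not LangChain's actual prompt, and `max_chars` is an assumed, rough stand-in for the model's context limit.

```python
def stuff_prompt(question: str, documents: list[str], max_chars: int = 4000) -> str:
    """Concatenate every retrieved document into one prompt (the 'stuff' idea).

    max_chars is a hypothetical, character-based approximation of a context limit.
    """
    context = "\n\n".join(documents)
    prompt = (
        "Use the context below to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    if len(prompt) > max_chars:
        # With this approach, everything must fit in a single model call.
        raise ValueError("Retrieved documents do not fit in a single prompt")
    return prompt

retrieved_docs = ["First retrieved chunk.", "Second retrieved chunk."]
stuffed = stuff_prompt("What do the chunks say?", retrieved_docs)
print(stuffed)
```

This is why the chunk size chosen earlier matters: the retrieved chunks, the question, and the instructions all share the same context window.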

### Generate a retrieval-augmented response to a question

Use the question-answering chain to process the query. The chain returns a dictionary containing the original `query` and the generated `result`.

In [None]:
query = "What did the president say about Ketanji Brown Jackson?"
qa.invoke(query)