# RAG with Langchain and ChromaDB

This notebook uses Langchain and ChromaDB to build a simple retrieval-augmented generation (RAG) example.

References:
- [What is RAG?](https://www.datacamp.com/blog/what-is-retrieval-augmented-generation-rag)
- [What is a Vector Database?](https://learn.microsoft.com/en-us/semantic-kernel/memories/vector-db)
- [Langchain and RAG](https://python.langchain.com/docs/use_cases/question_answering/)
- [Langchain Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

In [1]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
from dotenv import load_dotenv

# Load the environment variables
load_dotenv(override=True)

True

In [2]:
# Remove any existing data/vectordb folder so the index is rebuilt from scratch.
import shutil

shutil.rmtree("data/vectordb", ignore_errors=True)

In [3]:
# Load, chunk and index the contents of the blog.
# We only care about the post content, title and header.
bs_strainer = bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs_strainer},
)
docs = loader.load()

# Split the documents into chunks of 1000 characters with 200 characters overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
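
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this toy version ignores. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.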

## Embedding Model

A RAG pipeline needs a separate embedding model to turn the text chunks into vectors. The following cell supports several embedding models; uncomment the one you wish to use.
The available embeddings are:
- HuggingFace Sentence Transformer: [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). You can use any of the available Sentence Transformer models from https://huggingface.co/sentence-transformers
- Ollama embedding: [nomic-embed-text](https://ollama.com/library/nomic-embed-text), [llama3](https://ollama.com/library/llama3). Ollama also provides embedding models; more information can be found at https://ollama.com/blog/embedding-models
- GPT4All: a free embedding model that is downloaded the first time it is used. https://python.langchain.com/docs/integrations/text_embedding/gpt4all/
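
Whichever embedding model you pick, the vector store uses it the same way: each chunk becomes a vector, and retrieval ranks chunks by how close their vectors are to the query vector. The sketch below illustrates the idea with cosine similarity on hand-made three-dimensional vectors. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration; they are not output of any real model, and cosine is not necessarily the distance metric your Chroma collection is configured with.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made toy "embeddings"; a real model returns hundreds of dimensions.
toy_index = {
    "chunk-planning": [0.9, 0.1, 0.0],
    "chunk-memory": [0.1, 0.9, 0.0],
    "chunk-tools": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.3, 0.0]
nearest = top_k(query_vector, toy_index, k=2)
print(nearest)
```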

In [4]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
#from langchain_community.embeddings import OllamaEmbeddings
#from langchain_community.embeddings import GPT4AllEmbeddings


# Index all documents in a single vector store
# This one is using the all-MiniLM-L6-v2 model for embedding
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Alternative: Ollama embeddings (pick one of the models below).
# Pull the model with `ollama pull <model>` before using it.
#embedding = OllamaEmbeddings(model="nomic-embed-text")  # special embedding from Ollama
#embedding = OllamaEmbeddings(model="llama3:8b-instruct-q8_0")

# another open-source embedding function
# embedding = GPT4AllEmbeddings()

# Using the embeddings to index the documents in Chroma
vectorstore = Chroma.from_documents(documents=splits, embedding=embedding, persist_directory="data/vectordb")



## Language Model

You can use any LLM for the generation step. The cell below lists three providers (Ollama, OpenAI, and Groq); each offers several models, so uncomment the provider you want and select a model.

In [5]:
###### OLLAMA #####
from langchain_community.llms import Ollama
llm = Ollama(model="llama3:8b-instruct-q8_0", temperature=0)

###### OPENAI #####
# from langchain_openai.chat_models import ChatOpenAI
# openai_models = ["gpt-3.5-turbo-0125", "gpt-4-turbo", "gpt-4-turbo-preview"]
# llm = ChatOpenAI(
#     model_name=openai_models[0],
#     temperature=0,
#     api_key=os.environ["OPENAI_API_KEY"])

###### GROQ #####
# from langchain_groq.chat_models import ChatGroq
# groq_model = ["mixtral-8x7b-32768", "gemma-7b-it", "llama2-70b-4096", "llama3-70b-8192", "llama3-8b-8192"]
# llm = ChatGroq(
#     temperature=0,
#     max_tokens=4096,
#     model_name=groq_model[3], 
#     api_key = os.environ["GROQ_API_KEY"])
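
The OpenAI and Groq options read their API keys with `os.environ[...]`, which raises a bare `KeyError` when the variable is missing. A small guard can make that failure easier to diagnose. This is only a sketch: the helper name `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it before running."
        )
    return value


# Demonstration with a placeholder variable (not a real credential).
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
example_key = require_env("EXAMPLE_API_KEY")
print(example_key)
```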

In [16]:
# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
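
The `|` operator chains components so that each one's output becomes the next one's input, and the leading dict sends the question to several branches at once. The following sketch mimics that data flow in plain Python. It is not LangChain's implementation: the `Step` class, the `fan_out` helper, and the stand-in retriever, prompt, and model functions are all hypothetical.

```python
from typing import Any, Callable


class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Composition: run self first, then feed its result into `other`.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


def fan_out(branches: dict[str, Step]) -> Step:
    """Send the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in branches.items()})


# Stand-ins for the retriever, prompt template, and LLM (no real model is called).
fake_retriever = Step(lambda question: ["chunk A", "chunk B"])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
fake_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}")
fake_llm = Step(lambda text: text.upper())

toy_chain = (
    fan_out({"context": fake_retriever | join_docs, "question": passthrough})
    | fake_prompt
    | fake_llm
)
result = toy_chain.invoke("what is rag?")
print(result)
```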

In [None]:
# Use the chain to answer a question; the notebook displays the returned string.
rag_chain.invoke("What is Task Decomposition?")

## Review Documents
The code below returns the retrieved documents alongside the answer, so you can see which sources were used.

In [None]:
from langchain_core.runnables import RunnableParallel

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke("What is Task Decomposition?")
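
In terms of data flow, this chain first builds a dict holding the retrieved documents and the question, then adds an `answer` key computed from that dict, so the sources travel alongside the answer. The plain-Python sketch below traces those steps with stand-in functions; `fake_retrieve`, `fake_answer`, and `answer_with_sources` are hypothetical, and no retriever or model is called.

```python
def fake_retrieve(question: str) -> list[str]:
    """Stand-in for the retriever: always returns the same two chunks."""
    return [
        "Task decomposition splits a task into steps.",
        "CoT prompts step-by-step thinking.",
    ]


def fake_answer(state: dict) -> str:
    """Stand-in for prompt | llm | parser: reports how much context it saw."""
    return f"Answered '{state['question']}' using {len(state['context'])} chunks."


def answer_with_sources(question: str) -> dict:
    # Step 1: the parallel map builds {"context": ..., "question": ...}.
    state = {"context": fake_retrieve(question), "question": question}
    # Step 2: assign() keeps the existing keys and adds a new one.
    return {**state, "answer": fake_answer(state)}


output = answer_with_sources("What is Task Decomposition?")
print(sorted(output))
print(output["answer"])
```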

## Using Retriever and Ollama Directly

In [21]:
question = "What is Task Decomposition?"
docs_result = retriever.invoke(question)

In [11]:
from pprint import pprint 

pprint(docs_result)

[Document(page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.\nComponent One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}),
 Document(page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be

In [15]:
context = format_docs(docs_result)
pprint(context)

('Fig. 1. Overview of a LLM-powered autonomous agent system.\n'
 'Component One: Planning#\n'
 'A complicated task usually involves many steps. An agent needs to know what '
 'they are and plan ahead.\n'
 'Task Decomposition#\n'
 'Chain of thought (CoT; Wei et al. 2022) has become a standard prompting '
 'technique for enhancing model performance on complex tasks. The model is '
 'instructed to “think step by step” to utilize more test-time computation to '
 'decompose hard tasks into smaller and simpler steps. CoT transforms big '
 'tasks into multiple manageable tasks and shed lights into an interpretation '
 'of the model’s thinking process.\n'
 '\n'
 'Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple '
 'reasoning possibilities at each step. It first decomposes the problem into '
 'multiple thought steps and generates multiple thoughts per step, creating a '
 'tree structure. The search process can be BFS (breadth-first search) or DFS '
 '(depth-first search) wit

In [22]:
# Use the prompt template to provide the context and question to the model.
print(prompt)
prompt_value = prompt.invoke(
    {
        "context": context,
        "question": question
    }
)

In [26]:
print(type(prompt_value))
print(prompt_value.to_string())

<class 'langchain_core.prompt_values.ChatPromptValue'>
Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is Task Decomposition? 
Context: Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.

Tree of Thoughts (Yao et al. 2023) extends C
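
Conceptually, invoking the prompt substitutes the question and the context into a fixed template. The sketch below reproduces that substitution with `str.format`, using the instruction wording visible in the printed prompt value. The trailing `Answer:` line, the `build_prompt` helper, and the sample context are assumptions, and the real object is a chat prompt template rather than a plain string.

```python
# Instruction text copied from the printed prompt value;
# the trailing "Answer:" line is an assumption.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question: str, context: str) -> str:
    """Fill the template with the user's question and the retrieved context."""
    return RAG_TEMPLATE.format(question=question, context=context)


filled = build_prompt(
    question="What is Task Decomposition?",
    context="Task decomposition breaks a hard task into smaller steps.",  # sample text
)
print(filled)
```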

In [28]:
import ollama

# Generate a response using the Ollama model.
model_name = 'llama3:8b-instruct-q8_0'
ollama.generate(model=model_name, prompt=prompt_value.to_string())

{'model': 'llama3:8b-instruct-q8_0',
 'created_at': '2024-04-28T23:22:23.562461Z',
 'response': 'Task Decomposition is a technique that breaks down complex tasks into smaller and simpler steps, allowing an agent or model to plan ahead and make more manageable decisions. This can be achieved through techniques such as Chain of Thought (CoT) or Tree of Thoughts, which involve decomposing problems into multiple thought steps and generating multiple thoughts per step. Task decomposition can also be done through simple prompting, task-specific instructions, or human inputs.',
 'done': True,
 'context': [128006,
  882,
  128007,
  198,
  198,
  35075,
  25,
  1472,
  527,
  459,
  18328,
  369,
  3488,
  12,
  598,
  86,
  4776,
  9256,
  13,
  5560,
  279,
  2768,
  9863,
  315,
  31503,
  2317,
  311,
  4320,
  279,
  3488,
  13,
  1442,
  499,
  1541,
  956,
  1440,
  279,
  4320,
  11,
  1120,
  2019,
  430,
  499,
  1541,
  956,
  1440,
  13,
  5560,
  2380,
  23719,
  7340,
  323,
  25