# Intro to Retrieval Augmented Generation Systems, LangChain & ChromaDB

This notebook walks through building a question-answering system that retrieves information to formulate its responses, effectively grounding the LLM in specific information. A pre-trained LLM, or even a fine-tuned one, is not sufficient on its own when you want a system that understands specific, possibly private data or information that was not in its training dataset.

In this lab you will:
* Learn about the different components of a retrieval augmented system
* Build a simple retrieval augmented generation system 
* Use LangChain and ChromaDB to simplify and scale the process

This lab uses a special kernel with langchain dependencies. Run the cell below to create the kernel.

In [1]:
!cd ~/asl-ml-immersion && make langchain_kernel

./kernels/langchain.sh
Installed kernelspec langchain_kernel in /home/jupyter/.local/share/jupyter/kernels/langchain_kernel
Collecting langchain==0.0.217
  Downloading langchain-0.0.217-py3-none-any.whl.metadata (13 kB)
Collecting PyYAML>=5.4.1 (from langchain==0.0.217)
  Downloading PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain==0.0.217)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl.metadata (22 kB)
Collecting langchainplus-sdk>=0.0.17 (from langchain==0.0.217)
  Downloading langchainplus_sdk-0.0.20-py3-none-any.whl.metadata (8.7 kB)
Collecting numexpr<3.0.0,>=2.8.4 (from langchain==0.0.217)
  Downloading numexpr-2.8.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.9 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain==0.0.217)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)

Select the kernel `langchain_kernel` in the top right before going forward in the notebook.

### Setup

In [1]:
import pandas as pd
import scipy
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.document_loaders import WikipediaLoader
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel

### Build a simple retrieval augmented generation system

In this toy example, we want to ground an LLM on information that an off-the-shelf LLM would not know. For example, instructions left for a house sitter that will be watching two pets.

In [2]:
# List of things we want to ground the LLM on.
information = [
    "Estrella is a dog",
    "Finnegan is a cat",
    "Finnegan gets fed five times daily. Estrella gets fed three times daily.",
    "Estrella usually goes on one long walk per day, but needs to go outside every 4-6 hours",
    "Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!",
]

information_df = pd.DataFrame({"text": information})
information_df.head()

| | text |
|---|---|
| 0 | Estrella is a dog |
| 1 | Finnegan is a cat |
| 2 | Finnegan gets fed five times daily. Estrella g... |
| 3 | Estrella usually goes on one long walk per day... |
| 4 | Please play with Finnegan for 30 minutes each ... |


At the core of most retrieval augmented generation systems is a vector database. A vector database stores embedded representations of information.

Let's add a column to our information dataframe that is an embedded representation of the text. We will use the [Vertex AI text-embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).

In [3]:
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
information_df["vector"] = [
    x.values for x in embedding_model.get_embeddings(information)
]
information_df.head()

| | text | vector |
|---|---|---|
| 0 | Estrella is a dog | [0.020266957581043243, 0.010058210231363773, -... |
| 1 | Finnegan is a cat | [-0.027006812393665314, -0.029612787067890167,... |
| 2 | Finnegan gets fed five times daily. Estrella g... | [-0.02811681292951107, 0.006287926342338324, -... |
| 3 | Estrella usually goes on one long walk per day... | [0.00867326557636261, 0.03359326347708702, -0.... |
| 4 | Please play with Finnegan for 30 minutes each ... | [0.0019507030956447124, -0.017996784299612045,... |


Retrieval systems need a way of finding the most relevant information to answer a given query. This is done with a nearest neighbor (semantic similarity) search. Let's define a function that takes a query (text) as input and returns a distance for each text in our information. The function will need to:
* Embed the query with the same embedding model used for the information
* Compute a distance between the query vector and each information vector. We will use cosine distance (1 minus cosine similarity), one of many measures that can be used.
* Return a list of distances between the query vector and each information vector
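
To make the distance measure concrete, here is a minimal pure-Python sketch of cosine distance, the quantity that `scipy.spatial.distance.cosine` computes in this notebook. The helper name `cosine_distance` is hypothetical and used only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors pointing the same way have distance 0, orthogonal vectors 1,
# and opposite vectors 2.
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Because the measure depends only on the angle between vectors, a smaller distance means the two texts are closer in embedding space.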

In [4]:
def embed_and_compute_distances(query: str):
    # Get vector for query string
    query_embedding = embedding_model.get_embeddings([query])[
        0
    ].values  # Query embedding

    distances = []

    # Compute distances between query vector and all information vectors
    for _, row in information_df.iterrows():
        distances.append(
            {
                "information": row.text,
                "distance": scipy.spatial.distance.cosine(
                    query_embedding, row.vector
                ),
            }
        )

    return distances

Test this function out on an example.

In [5]:
embed_and_compute_distances(query="What type of animal is Estrella?")

[{'information': 'Estrella is a dog', 'distance': 0.1424507274732465},
 {'information': 'Finnegan is a cat', 'distance': 0.4717404010234435},
 {'information': 'Finnegan gets fed five times daily. Estrella gets fed three times daily.',
  'distance': 0.27034889233707404},
 {'information': 'Estrella usually goes on one long walk per day, but needs to go outside every 4-6 hours',
  'distance': 0.20111010858109857},
 {'information': 'Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!',
  'distance': 0.4042683765833356}]

Notice that the vector with the lowest cosine distance (meaning most similar) to the vector for "What type of animal is Estrella?" is the vector for "Estrella is a dog". This highlights the core assumption that underpins retrieval augmented systems: information relevant to answering a question will be close in vector space to the question itself.

Now all we have to do is write a function that incorporates the text corresponding to the closest information vectors in a prompt, then send that prompt to an LLM to answer the question with the information.

Start by writing a helper function to put together this prompt. `context` will be the relevant information strings (found via nearest neighbor search) and `query` will be the query string.

In [6]:
def get_prompt(query: str, context: list[str]):
    prompt = f"""
    Using only the provided context, answer the question.
    
    Context:
    {','.join(context)}
    
    Question: {query}.
    
    If you cannot answer the question using only the provided context, respond that you do not have the context needed to answer the question.
    """
    return prompt

Now put everything together in a function that:
* Embeds the query
* Computes the distance between the query vector and all information vectors
* Gets the k most relevant information texts by sorting by distance
* Uses the k most relevant information texts in a prompt to an LLM along with the query
* Prints the LLM response and the information used (citations)

In [7]:
model = TextGenerationModel.from_pretrained("text-bison@002")


def retrieval_chain(query: str, k: int = 2):
    # Compute distances for query and all information vectors
    distances = embed_and_compute_distances(query)

    # Sort the information from smallest distance to greatest distance
    sorted_distances = sorted(distances, key=lambda x: x["distance"])

    # Get the text corresponding to the k closest vectors
    closest_information_texts = [x["information"] for x in sorted_distances[:k]]

    # Incorporate the closest k information texts in a prompt to an LLM
    prompt = get_prompt(query, closest_information_texts)

    # Send prompt through LLM
    response = model.predict(prompt)
    print(f"Response: {response.text}")
    print(f"Information used: {closest_information_texts}")

In [8]:
retrieval_chain("What type of animal is Estrella?")

Response:  Estrella is a dog.
Information used: ['Estrella is a dog', 'Estrella usually goes on one long walk per day, but needs to go outside every 4-6 hours']


In [9]:
retrieval_chain("How many times a day do I need to feed Finnegan?")

Response:  Finnegan gets fed five times daily.
Information used: ['Finnegan gets fed five times daily. Estrella gets fed three times daily.', 'Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!']


In [10]:
retrieval_chain("What stock should I invest in this month?")

Response:  The provided context does not mention anything about stocks or investments, so I cannot answer this question.
Information used: ['Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!', 'Finnegan gets fed five times daily. Estrella gets fed three times daily.']


Notice that the prompt instructs the LLM to decline when a question cannot be answered from the provided context, as in the stock question above.

It is also worth noting that we are arbitrarily setting k=2 (including the closest 2 information texts in the prompt). Different use cases call for different values of k, and there is no one-size-fits-all choice.
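
One way to experiment with k is to separate the selection step into its own helper. The sketch below is illustrative only: `select_top_k` and its optional `max_distance` cutoff are hypothetical additions that are not part of the notebook's `retrieval_chain`, and the example distances are rounded from the output of the Estrella query shown earlier.

```python
import heapq


def select_top_k(distances, k=2, max_distance=None):
    """Return the k entries with the smallest distance.

    If max_distance is given (a hypothetical cutoff), entries farther
    away than that are dropped before the k closest are chosen.
    """
    if max_distance is not None:
        distances = [d for d in distances if d["distance"] <= max_distance]
    return heapq.nsmallest(k, distances, key=lambda d: d["distance"])


# Distances rounded from the "What type of animal is Estrella?" output.
example_distances = [
    {"information": "Estrella is a dog", "distance": 0.142},
    {"information": "Finnegan is a cat", "distance": 0.472},
    {"information": "Finnegan gets fed five times daily. Estrella gets fed three times daily.", "distance": 0.270},
    {"information": "Estrella usually goes on one long walk per day, but needs to go outside every 4-6 hours", "distance": 0.201},
    {"information": "Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!", "distance": 0.404},
]

top_two = select_top_k(example_distances, k=2)
only_close = select_top_k(example_distances, k=2, max_distance=0.15)
```

With `k=2` the helper selects the same two texts the notebook used for this query; with the cutoff of 0.15 only "Estrella is a dog" remains. Whether a cutoff helps depends on the embedding model and the data, so treat the threshold as something to tune rather than a recommended value.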

### Simplify and Scale with LangChain and Chroma
With only five pieces of grounding information, we could easily include all of them in the prompt. In other words, the extra retrieval step to identify *what* is needed in the prompt was unnecessary. In the real world, however, we may have thousands or millions of pieces of grounding information. As that number grows, computing a distance for every single vector becomes very inefficient. Production retrieval augmented generation systems therefore require:
* Scalable vector databases to store large amounts of information
* Efficient ways of performing nearest neighbor searches 
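
To give a feel for the second requirement, here is a minimal sketch of one approximate nearest neighbor idea: locality-sensitive hashing with random hyperplanes. It assumes cosine-style similarity and is not a description of how Chroma or Vertex AI Vector Search work internally; all function names are hypothetical.

```python
import random


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw n_planes random hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, hyperplanes):
    """Hash a vector to a tuple of booleans: which side of each plane it is on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    index = {}
    for i, vec in enumerate(vectors):
        index.setdefault(signature(vec, hyperplanes), []).append(i)
    return index


def candidate_ids(query, index, hyperplanes):
    """Return ids in the query's bucket; only these need an exact distance."""
    return index.get(signature(query, hyperplanes), [])


planes = make_hyperplanes(dim=3, n_planes=4, seed=0)
vectors = [[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0], [0.5, -0.2, 0.1]]
index = build_index(vectors, planes)

# A query pointing in the same direction as vectors[0] shares its bucket.
candidates = candidate_ids([2.0, 4.0, 6.0], index, planes)
```

Vectors separated by a small angle are likely to land in the same bucket, so a query is compared exactly against only a small candidate set instead of every stored vector. The trade-off is that a true nearest neighbor can occasionally fall in a different bucket, which is why such methods are called approximate.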

There are many options for a vector store, including managed and scalable offerings like [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview). For simplicity, in this lab we will use [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) as the vector store and [LangChain](https://github.com/langchain-ai/langchain) to orchestrate the retrieval system. LangChain provides classes and methods that simplify the steps we had to implement ourselves in the toy example above.

#### Document Loading

Langchain provides classes to load data from different sources. Some useful data loaders are [Google Cloud Storage Directory Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_cloud_storage_directory), [Google Drive Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_drive), [Recursive URL Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/recursive_url_loader), [PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf), [JSON Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/json), [Wikipedia Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/wikipedia), and [more](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

In this notebook we will use the Wikipedia loader to create a private knowledge base of Wikipedia articles about large language models, but the overall process is similar regardless of which document loader you use.

In [11]:
docs = WikipediaLoader(query="Large Language Models", load_max_docs=10).load()

# Take a look at a single document
docs[0]

Document(page_content='A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language co

#### Split text into chunks 
Now that we have the documents, we will split them into chunks. Each chunk will become one vector in the vector store. To do this we will define a chunk size (number of characters) and a chunk overlap (the number of characters shared between consecutive chunks, i.e. a sliding window). The ideal chunk size can be difficult to determine. A chunk size that is too large puts too much information in each chunk (individual chunks are not specific enough), while a chunk size that is too small leaves too little information per chunk. In both cases, a nearest neighbor lookup with a query/question embedding may struggle to retrieve the relevant chunks, and retrieval can fail altogether if the chunks are too large to use as context with an LLM query.

In this notebook we will use a chunk size of 800 characters and a chunk overlap of 400 characters, but feel free to experiment with other sizes! Note: you can specify a custom `length_function` with `RecursiveCharacterTextSplitter` if you want chunk size/overlap to be determined by something other than Python's `len` function. In addition to `RecursiveCharacterTextSplitter`, there are [other text splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) you can consider.
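
To see how chunk size and overlap interact, here is a minimal fixed-width sliding-window splitter in plain Python. It is a simplification for illustration: the function name `sliding_window_chunks` is hypothetical, and `RecursiveCharacterTextSplitter` behaves differently because it prefers to split on separators such as paragraph breaks and treats the chunk size as an upper bound.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=400):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


small_example = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# small_example == ["abcd", "cdef", "efgh", "ghij"]
```

With an overlap of half the chunk size, as in this notebook, every character away from the edges of a document appears in two chunks. That roughly doubles the number of vectors to store, in exchange for a lower chance that a relevant passage is cut in half at a chunk boundary.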

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=400,
    length_function=len,
)

chunks = text_splitter.split_documents(docs)

# Look at the first two chunks
chunks[0:2]

[Document(page_content='A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks typically built with a transformer-based architecture. Some recent implementations are based on alternative architectures such as recurrent neural network variants and Mamba (a state space model).LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such', metadata={'title': 'Large language model', 'summary': 'A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation.

In [13]:
print(f"Number of documents: {len(docs)}")
print(f"Number of chunks: {len(chunks)}")

Number of documents: 10
Number of chunks: 91


#### Embed Document Chunks 
Now we need to embed the document chunks and store them in a vector store. We can use any text embedding model for this, but we must use the same model when we embed our queries/questions at prediction time. To make things simple we will use the PaLM API for Embeddings. The LangChain library provides a wrapper class around the PaLM Embeddings API, `VertexAIEmbeddings()`.

Since Vertex AI Vector Search takes a while (~45 minutes) to create an index, we will use Chroma instead to keep things simple. Of course, in a real-world use case with a large private knowledge base, you may not be able to fit everything in memory. LangChain has a wrapper class for Chroma that lets us pass in a list of documents and an embedding class to create the vector store.

In [14]:
embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@001"
)  # PaLM embedding API

# set persist directory so the vector store is saved to disk
db = Chroma.from_documents(chunks, embedding, persist_directory="./vectorstore")

#### Putting it all together 

Now that everything is in place, we can tie it all together with a LangChain chain. A chain simply orchestrates the multiple steps required to use an LLM for a specific use case. In this case, the chain first embeds the query/question, then performs a nearest neighbor lookup to find the relevant chunks, and finally uses those chunks to formulate a response with an LLM. We will use the Chroma database as our vector store and PaLM as our LLM. LangChain provides a wrapper around PaLM, `VertexAI()`.

For this simple Q/A use case we can use LangChain's `RetrievalQA` to link the steps together.

In [15]:
# vector store
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},  # number of nearest neighbors to retrieve
)

# PaLM API
# You can also set temperature, top_p, top_k
llm = VertexAI(model_name="text-bison@001", max_output_tokens=1024)

# q/a chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
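
The `chain_type="stuff"` setting used above places ("stuffs") all retrieved chunks into a single prompt. The sketch below illustrates that idea in plain Python. The function `build_stuff_prompt`, its prompt wording, and the `max_chars` budget are hypothetical and chosen for illustration; they are not LangChain's actual template or API.

```python
def build_stuff_prompt(question, chunk_texts, max_chars=None):
    """Concatenate all retrieved chunks into one prompt for the LLM.

    max_chars is a hypothetical guard that flags prompts which would be
    too long for the model's context window.
    """
    context = "\n\n".join(chunk_texts)
    prompt = (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    if max_chars is not None and len(prompt) > max_chars:
        raise ValueError(
            f"Prompt has {len(prompt)} characters, over the budget of {max_chars}"
        )
    return prompt


example_prompt = build_stuff_prompt(
    "What technology underpins large language models?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
```

This makes the interaction between settings visible: with `k=10` retrieved chunks of up to 800 characters each, the context alone can reach about 8,000 characters, so k and chunk size need to be chosen with the LLM's input limit in mind.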

Now that everything is tied together we can send queries and get answers!

In [16]:
def ask_question(question: str):
    response = qa({"query": question})
    print(f"Response: {response['result']}\n")

    citations = {doc.metadata["source"] for doc in response["source_documents"]}
    print(f"Citations: {citations}\n")

    # uncomment below to print source chunks used
    # print(f"Source Chunks Used: {response['source_documents']}")

In [17]:
ask_question("What technology underpins large language models?")

Response: The technology that underpins large language models is the transformer architecture.

Citations: {'https://en.wikipedia.org/wiki/Large_language_model', 'https://en.wikipedia.org/wiki/GPT-3', 'https://en.wikipedia.org/wiki/Generative_pre-trained_transformer', 'https://en.wikipedia.org/wiki/Language_model', 'https://en.wikipedia.org/wiki/Open-source_artificial_intelligence'}



In [18]:
ask_question("When was the transformer introduced?")

Response: The transformer architecture was introduced in 2017.

Citations: {'https://en.wikipedia.org/wiki/Large_language_model', 'https://en.wikipedia.org/wiki/GPT-3', 'https://en.wikipedia.org/wiki/Generative_pre-trained_transformer', 'https://en.wikipedia.org/wiki/BERT_(language_model)', 'https://en.wikipedia.org/wiki/Prompt_engineering'}



Congrats! You have now built a toy retrieval augmented generation system from scratch and applied what you learned to build a more realistic system using a vector database and orchestration with LangChain.