# Tutorial 9 - Augment LLM with Retrieval Tool for Question-Answering

In this tutorial, we will use LangChain to create a typical RAG (Retrieval Augmented Generation) application for the Q&A task.

**LangChain** is an open-source package that aims to augment Large Language Models (LLMs) like GPT-3 with various tools to enhance their capabilities. It provides a framework for integrating LLMs with other tools such as search engines, databases, computational tools, and more.


**RAG** is a technique for augmenting LLM knowledge with additional data.
LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the model's knowledge with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).
LangChain has a number of components designed to help build Q&A applications, and RAG applications more generally.

In [1]:
! pip install langchain
! pip install sentence-transformers
! pip install langchain_community


#### Step 1. Indexing: Load

The first step in bringing private data into a RAG application is to load it with a DocumentLoader, a specialized object designed to retrieve and load data from a specific source. Here, we load our data from a text file stored on Google Drive.

A typical RAG application has two main components:


1.   Indexing: The indexing component of a RAG application involves a pipeline responsible for ingesting data from a given source and performing the necessary steps to prepare it for efficient retrieval. This process typically occurs offline, prior to the runtime of the application.


2.   Retrieval and generation: The retrieval and generation component forms the core of the RAG application. At runtime, when a user query is provided, this component is responsible for retrieving the relevant data from the previously indexed dataset. It leverages the indexing structure and algorithms to efficiently identify the most appropriate information for the given query. Once the relevant data has been retrieved, it is passed to the underlying model, which performs the generation step. The model uses the retrieved data, along with the query, to generate a coherent and contextually appropriate response. This response is then presented to the user as the output of the RAG application.
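
The split between the two components can be sketched in plain Python. This is a minimal illustration only: it uses a keyword index instead of the embedding-based index built later in this tutorial, and the function names (`tokenize`, `build_index`, `retrieve`) and the toy corpus are made up for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]

def build_index(chunks):
    """Offline step: map each word to the ids of the chunks that contain it."""
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for word in tokenize(chunk):
            index[word].add(chunk_id)
    return index

def retrieve(index, chunks, query, k=1):
    """Runtime step: rank chunks by how many distinct query words they share."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for chunk_id in index.get(word, ()):
            scores[chunk_id] += 1
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [chunks[cid] for cid in ranked[:k]]

# Toy corpus (hypothetical, for illustration only).
chunks = [
    "Dr. Kapoor prescribed a medication for headaches.",
    "Healthify stores prescriptions and sends reminders.",
]
index = build_index(chunks)  # built once, before any query arrives
top = retrieve(index, chunks, "Who prescribed the medication?")  # per query
print(top)
```

In a full application, the retrieved text in `top` would then be passed to the model together with the query for the generation step.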


In [2]:
import os
import textwrap

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


#### Step 2. Indexing: Split

The loaded document is often too long to fit in the context window of many models. Even models that could fit the full document in their context window can struggle to find information in very long inputs.

To handle this we’ll split the Document into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the text at run time.

In this case we’ll split our documents into chunks of 1000 characters with 50 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute “start_index”.
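
To make the roles of chunk size, overlap, and `start_index` concrete, here is a simplified sketch that slides a fixed-size window over a string. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter breaks on separators such as new lines); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record where the chunk begins, mirroring the "start_index" metadata.
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(text, chunk_size=10, chunk_overlap=3):
    print(chunk["start_index"], chunk["text"])
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.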

In [3]:
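# In current LangChain releases, TextLoader is provided by langchain_community.
from langchain_community.document_loaders import TextLoader
# Read the text file from the mounted Drive into a list of Document objects.
text_loader = TextLoader("/content/drive/MyDrive/data.txt")
document = text_loader.load()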
document

[Document(metadata={'source': '/content/drive/MyDrive/data.txt'}, page_content="Amit, a health-conscious man from suburban India, regularly visited his local doctor, Dr. Kapoor, for check-ups. After a routine check-up, Dr. Kapoor prescribed a medication to Amit for his recurring headaches. The doctor advised Amit to take the medicine only when needed, not exceeding a certain limit, as misuse could lead to side effects. Amit was concerned about how to keep track of his medication intake. That's when he learned about Healthify, a healthcare system that was highly accessible in his area.\n\nWith the help of Healthify, Amit's prescription and medication use became more manageable. The system allowed him to store his prescription digitally, complete with reminders on when to take the medicine. Healthify also kept track of the number of pills he had taken, making it impossible for him to exceed the recommended limit.\n\nHealthify not only helped Amit but also indirectly contributed to improv

In [4]:
def split_text_into_lines(text, width=110):
    """Wrap each line of `text` to at most `width` characters for display."""
    lines = text.split("\n")
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    wrapped_text = "\n".join(wrapped_lines)
    return wrapped_text

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=50, add_start_index=True
)

document_chunks = text_splitter.split_documents(document)

In [6]:
len(document_chunks)

3

#### Step 3. Indexing: Store

Now we need to index our document chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding.
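
As a rough illustration of the similarity step, the sketch below ranks stored vectors by cosine similarity to a query vector. Cosine similarity is one common choice; the metric a particular vector store uses may differ. The three-dimensional vectors and chunk labels are hand-made stand-ins, not output from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=1):
    """Return the k (text, score) pairs whose vectors are closest to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hand-made "embeddings"; a real model produces hundreds of dimensions.
stored = [
    ("chunk about the doctor", [0.9, 0.1, 0.0]),
    ("chunk about the app", [0.1, 0.9, 0.2]),
    ("chunk about public health", [0.0, 0.3, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend this is the embedded question
best = top_k(query_vec, stored, k=1)
print(best[0][0])
```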

We can embed and store all of our document splits in a single command using the chromadb vector store and HuggingFaceEmbeddings model.

The HuggingFaceEmbeddings component acts as a wrapper around a text embedding model, which is responsible for converting textual input into dense vector representations, also known as embeddings.

The VectorStore component serves as a wrapper around a vector database, specifically designed for storing and querying embeddings. It provides an interface to interact with the underlying database that efficiently manages the storage and retrieval of vectors. In the case of a RAG application, the chromadb VectorStore is commonly used.
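
The relationship between the two wrappers can be sketched with a toy in-memory store that accepts any embedding function. Everything here is hypothetical and for illustration only: `ToyVectorStore`, `count_embed`, and the tiny vocabulary are not part of LangChain or chromadb, and a real store would typically use a dedicated index rather than scanning every row.

```python
class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and searches them."""

    def __init__(self, embed):
        self.embed = embed  # any callable mapping text -> list of floats
        self.rows = []      # list of (text, vector) pairs

    def add_texts(self, texts):
        """Embed each text once and keep the vector next to the text."""
        for text in texts:
            self.rows.append((text, self.embed(text)))

    def similarity_search(self, query, k=1):
        """Embed the query and return the k texts with the highest dot product."""
        q = self.embed(query)
        scored = sorted(
            self.rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[1])),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]

# Stand-in embedding: counts of a few hand-picked words (not a real model).
VOCAB = ["doctor", "medicine", "reminder", "app"]

def count_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

store = ToyVectorStore(count_embed)
store.add_texts([
    "the doctor prescribed medicine",
    "the app sends a reminder",
])
print(store.similarity_search("which doctor prescribed the medicine", k=1))
```

The store never needs to know how the vectors are produced, which is why the embedding model and the vector database can be swapped independently.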



In [7]:
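# In current LangChain releases, HuggingFaceEmbeddings is provided by langchain_community.
from langchain_community.embeddings import HuggingFaceEmbeddings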

embeddings = HuggingFaceEmbeddings()


ChromaDB is a vector database designed for managing and querying high-dimensional vectors, which are commonly used in machine learning and artificial intelligence applications, particularly in areas like natural language processing. It provides efficient storage, indexing, and retrieval capabilities for vector data, enabling fast similarity searches and nearest neighbor queries. ChromaDB is optimized for handling large-scale datasets and supports real-time operations, making it suitable for use cases such as recommendation systems, semantic search, and clustering. Its architecture is designed to integrate seamlessly with modern AI workflows, offering scalability and flexibility for developers and researchers working with complex data.

In [8]:
!pip install chromadb
# In current LangChain releases, the Chroma wrapper is provided by langchain_community.
from langchain_community.vectorstores import Chroma
vector_store = Chroma.from_documents(
    documents=document_chunks,
    embedding=embeddings
)


#### Step 4. Retrieve Answers from Documents

Now let’s write the actual application logic. We want to create a simple application that takes a user question, and searches for documents relevant to that question.

We use the load_qa_chain function in the LangChain library, specifically in the question-answering module. It is used to load a chain that enables question-answering functionality over a set of documents.

We use an open-weight foundation model as the QA model. LLaMA is a family of foundation language models developed by Meta (formerly Facebook), designed to assist researchers in advancing their work in the field of artificial intelligence. The original LLaMA models range from 7 billion to 65 billion parameters, allowing for flexibility in terms of computational requirements and research applications. These models are trained on a diverse range of internet text, enabling them to generate coherent and contextually relevant text based on the input they receive. LLaMA is intended to be efficient and accessible, running on lower-resource hardware than some other large language models. Note that the code cell below loads `google/gemma-7b` from the Hugging Face Hub rather than a LLaMA checkpoint; a different hosted model can be selected by changing `repo_id`.

In [None]:
# Sign up on the Hugging Face website and use your access token to access the model.
# See https://huggingface.co/docs/hub/security-tokens for how to create a token.
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "YOUR_TOKEN"
from langchain_community.llms import HuggingFaceHub

qa_model = HuggingFaceHub(
    repo_id="google/gemma-7b",
    model_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 50,
    }
)

load_qa_chain provides the most generic interface for answering questions. It loads a chain that performs QA over the documents you pass in; with chain_type="stuff", ALL of the text in those documents is placed into a single prompt.

In [11]:
from langchain.chains.question_answering import load_qa_chain

qa_chain = load_qa_chain(qa_model, chain_type="stuff")

stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag


In [12]:
question = "What is the name of the doctor and the name of the patient?"

In [13]:
search_results = vector_store.similarity_search(question, k=1)
search_content = search_results[0].page_content

The qa_chain wraps the question in a prompt that begins with the instruction: "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer." The retrieved chunks in search_results are inserted as the context, and the LLM then generates its answer from this prompt.
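
The prompt assembly can be approximated in a few lines. This is a sketch under assumptions, not the chain's actual implementation: the instruction text is the one quoted above, but the function name `build_stuff_prompt` and the exact layout after the context (the `Question:` and `Helpful Answer:` labels) are assumed for illustration.

```python
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)

def build_stuff_prompt(context_chunks, question):
    """Place every retrieved chunk into one prompt (the "stuff" strategy)."""
    context = "\n\n".join(context_chunks)
    # Layout after the context is an assumption made for this sketch.
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

prompt = build_stuff_prompt(
    ["Dr. Kapoor prescribed a medication to Amit for his recurring headaches."],
    "What is the name of the doctor?",
)
print(prompt)
```

Because every chunk is concatenated into the same prompt, the total length grows with the number of retrieved chunks, which is one reason to keep `k` small.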


In [14]:
answers = qa_chain.run(input_documents=search_results, question=question)
print(answers)



Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Amit, a health-conscious man from suburban India, regularly visited his local doctor, Dr. Kapoor, for check-ups. After a routine check-up, Dr. Kapoor prescribed a medication to Amit for his recurring headaches. The doctor advised Amit to take the medicine only when needed, not exceeding a certain limit, as misuse could lead to side effects. Amit was concerned about how to keep track of his medication intake. That's when he learned about Healthify, a healthcare system that was highly accessible in his area.

With the help of Healthify, Amit's prescription and medication use became more manageable. The system allowed him to store his prescription digitally, complete with reminders on when to take the medicine. Healthify also kept track of the number of pills he had taken, making it impossible for him to exceed the recommended 