# RAG From Scratch: Indexing

![Indexing overview](indexing.png)

## Set Environment Vars and API Keys

In [1]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_PROJECT'] = 'advanced-rag'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
os.environ['GROQ_API_KEY'] = os.getenv("GROQ_API_KEY")

## Part 12: Multi-representation Indexing

Flow:

![Multi-representation indexing flow](multiindexing.png)

Docs:

https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector

- Load two articles from two web pages to use as docs.
- Summarize both articles.
- Store everything in Chroma: each doc is mapped to a `doc_id`, and its summary is embedded with `hf_embeddings`.
- Search by similarity.
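
The steps above can be sketched without any external library. This is only an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, the summaries are hand-written rather than produced by an LLM, and plain Python containers replace the Chroma vector store and the byte store used later in this notebook.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Full documents and their (hypothetical, hand-written) summaries.
full_docs = [
    "Long article about retrieval augmented generation (full text)",
    "Long article about AI agents (full text)",
]
summaries = [
    "RAG combines retrieval with a language model",
    "Agents use a language model to plan and call tools",
]

# One id per document links each summary back to its parent document.
doc_ids = [str(uuid.uuid4()) for _ in full_docs]

# "Vector store": embedded summaries tagged with their doc_id.
summary_index = [(embed(s), doc_id) for s, doc_id in zip(summaries, doc_ids)]
# "Docstore": doc_id -> full document.
docstore = dict(zip(doc_ids, full_docs))


def retrieve(query: str) -> str:
    """Search over the summaries, then return the linked full document."""
    q = embed(query)
    _, best_id = max(summary_index, key=lambda item: cosine(q, item[0]))
    return docstore[best_id]


print(retrieve("how do agents call tools"))
```

The point of the design is that the query is matched against short summaries, while the text that comes back is the full document stored under the same `doc_id`.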

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://medium.com/@pankaj_pandey/introduction-to-retrieval-augmented-generation-rag-9209bf8a076d")
docs = loader.load()

loader = WebBaseLoader("https://medium.com/humansdotai/an-introduction-to-ai-agents-e8c4afd2ee8f")
docs.extend(loader.load())

In [None]:
# !pip install langchain beautifulsoup4 requests

In [16]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://cloud.google.com/use-cases/retrieval-augmented-generation")
docs = loader.load()

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Retrieval-augmented_generation")
docs.extend(loader.load())

In [26]:
print(docs)
for doc in docs:
    print(doc.metadata)
    print(doc.page_content)

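# Keep only the first 1000 characters of each page so the summarization prompts stay short.
# Note: new_doc is the same object as doc, so docs is truncated in place as well.
new_docs = []
for doc in docs:
    new_doc = doc
    new_doc.page_content = doc.page_content[:1000]
    new_docs.append(new_doc)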

[Document(metadata={'source': 'https://cloud.google.com/use-cases/retrieval-augmented-generation', 'title': 'What is Retrieval-Augmented Generation (RAG)? | Google Cloud', 'description': 'Retrieval-augmented generation (RAG) combines LLMs with external knowledge bases to improve their outputs. Learn more with Google Cloud.', 'language': 'en-US'}, page_content='What is Retrieval-Augmented Generation (RAG)? | Google CloudPage ContentsTopicsRAGWhat is Retrieval-Augmented Generation (RAG)?RAG (Retrieval-Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative large language models (LLMs). By combining your data and world knowledge with LLM language skills, grounded generation is more accurate, up-to-date, and relevant to your specific needs. Check out this e-book to unlock your “Enterprise Truth.”Get started for free35:30Grounding for Gemini with Vertex AI Search and 

In [27]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatGroq()
    | StrOutputParser()
)

summaries = chain.batch(new_docs, {"max_concurrency": 1})

In [28]:
summaries

['Retrieval-Augmented Generation (RAG) is an AI framework that combines traditional information retrieval systems with generative large language models. RAG enhances the accuracy, timeliness, and relevance of AI-generated content by integrating data and world knowledge with language skills. The RAG process involves retrieving and pre-processing relevant information from external data sources, such as web pages, knowledge bases, and databases. This information is then utilized to generate more grounded and specific outputs for users. To learn more and get started for free, you can access the e-book "Unlock your \'Enterprise Truth\'" and explore Google Cloud\'s Vertex AI Search and DIY RAG options.',
 'Retrieval-augmented generation is a method of generating responses to queries that involves first retrieving relevant documents from a large corpus of text, and then using those documents to inform the response generation. The process involves four main steps: indexing, retrieval, augmenta

In [31]:
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
hf_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

# The vectorstore that indexes the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=hf_embeddings)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [30]:
# !pip install chromadb

In [32]:
doc_ids

['6d570b08-c111-47b5-a195-5d4f60265c72',
 '8c09b199-5afc-47d9-821c-616687d94d60']

In [33]:
retriever.docstore.mget(doc_ids)

[Document(metadata={'source': 'https://cloud.google.com/use-cases/retrieval-augmented-generation', 'title': 'What is Retrieval-Augmented Generation (RAG)? | Google Cloud', 'description': 'Retrieval-augmented generation (RAG) combines LLMs with external knowledge bases to improve their outputs. Learn more with Google Cloud.', 'language': 'en-US'}, page_content='What is Retrieval-Augmented Generation (RAG)? | Google CloudPage ContentsTopicsRAGWhat is Retrieval-Augmented Generation (RAG)?RAG (Retrieval-Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative large language models (LLMs). By combining your data and world knowledge with LLM language skills, grounded generation is more accurate, up-to-date, and relevant to your specific needs. Check out this e-book to unlock your “Enterprise Truth.”Get started for free35:30Grounding for Gemini with Vertex AI Search and 

In [34]:
query = "What is agent"
sub_docs = vectorstore.similarity_search(query,k=1)
sub_docs[0]

Document(metadata={'doc_id': '6d570b08-c111-47b5-a195-5d4f60265c72'}, page_content='Retrieval-Augmented Generation (RAG) is an AI framework that combines traditional information retrieval systems with generative large language models. RAG enhances the accuracy, timeliness, and relevance of AI-generated content by integrating data and world knowledge with language skills. The RAG process involves retrieving and pre-processing relevant information from external data sources, such as web pages, knowledge bases, and databases. This information is then utilized to generate more grounded and specific outputs for users. To learn more and get started for free, you can access the e-book "Unlock your \'Enterprise Truth\'" and explore Google Cloud\'s Vertex AI Search and DIY RAG options.')

In [35]:
# n_results is not a retriever argument (it was ignored); set k for the underlying vector store search instead
retriever.search_kwargs = {"k": 1}
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]


'What is Retrieval-Augmented Generation (RAG)? | Google CloudPage ContentsTopicsRAGWhat is Retrieval-Augmented Generation (RAG)?RAG (Retrieval-Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative large language models (LLMs). By combining your data and world knowledge with LLM language skills, grounded generation is more accurate, up-to-date, and relevant to your specifi'

## Part 13: ColBERT

RAGatouille makes it simple to use ColBERT.

ColBERT generates a contextually influenced vector for each token in the passages.

ColBERT similarly generates vectors for each token in the query.

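Then, the score of each document is the sum, over all query tokens, of the maximum similarity between that query embedding and any of the document embeddings:

$$\text{score}(q, d) = \sum_{i} \max_{j} \; q_i \cdot d_j$$

where $q_i$ is the embedding of query token $i$ and $d_j$ is the embedding of document token $j$.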
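
The scoring rule described above can be sketched in plain Python. This is a minimal illustration, not ColBERT's actual implementation: the two-dimensional vectors are made up, and the similarity is assumed to be a plain dot product between token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match for every query token
doc_b = [[0.5, 0.5]]                          # only a partial match for each query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because every query token picks its own best-matching document token, `doc_a` outscores `doc_b` even though both contain the vector `[0.5, 0.5]`.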

See [here](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag) and [here](https://python.langchain.com/docs/integrations/retrievers/ragatouille) and [here](https://til.simonwillison.net/llms/colbert-ragatouille).

- Use ColBERT directly as the pretrained RAG model.
- Use `index()` to build the index (the database).
- Use `search()` for retrieval.
- Use `invoke()` to return the output for a given input.
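
To make the `index()` / `search()` / `invoke()` flow concrete, here is a toy stand-in. Everything in it is hypothetical: `ToyIndex` is not part of RAGatouille, passages are ranked by simple word overlap rather than ColBERT scoring, and only the general shape of the calls and of the result dictionaries (`content`, `score`, `rank`, `passage_id`) follows what the cells below show.

```python
class ToyIndex:
    """Hypothetical stand-in that mimics an index() -> search() -> invoke() flow."""

    def __init__(self):
        self.passages = []

    @staticmethod
    def _tokens(text):
        """Lowercased word set with trailing punctuation stripped."""
        return {w.strip(".,?!").lower() for w in text.split()}

    def index(self, collection, max_document_length=8):
        """Split each document into passages of at most N words and store them."""
        for text in collection:
            words = text.split()
            for start in range(0, len(words), max_document_length):
                self.passages.append(" ".join(words[start:start + max_document_length]))

    def search(self, query, k=3):
        """Rank passages by word overlap with the query (not ColBERT scoring)."""
        q = self._tokens(query)
        scored = [
            (len(q & self._tokens(p)), pid, p)
            for pid, p in enumerate(self.passages)
        ]
        scored.sort(key=lambda item: -item[0])
        return [
            {"content": p, "score": float(s), "rank": rank, "passage_id": pid}
            for rank, (s, pid, p) in enumerate(scored[:k], start=1)
        ]

    def invoke(self, query, k=3):
        """Retriever-style call: return only the passage texts."""
        return [hit["content"] for hit in self.search(query, k=k)]


toy = ToyIndex()
toy.index(
    [
        "A suffix tree algorithm is an example for form based indexing. "
        "Content based indexing exploits semantic connections between documents."
    ],
    max_document_length=11,
)
results = toy.search("example for form based indexing", k=1)
print(results[0]["content"])
```

Splitting at a fixed word count plays the role that `max_document_length` and `split_documents=True` play in the real indexing call: long documents become several short passages, and search returns passages rather than whole documents.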

In [36]:
!pip install ragatouille

In [37]:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

In [38]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Document_retrieval")

In [None]:
RAG.index(
    collection=[full_document],
    index_name="Doc-1",
    max_document_length=180,
    split_documents=True,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Feb 06, 10:16:00] #> Creating directory .ragatouille/colbert/indexes/Doc-1 


#> Starting...
#> Starting...
nranks = 2 	 num_gpus = 2 	 device=0
nranks = 2 	 num_gpus = 2 	 device=1


  self.scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()
  self.scaler = torch.cuda.amp.GradScaler()
  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


[Feb 06, 10:16:06] [1] 		 #> Encoding 3 passages..
[Feb 06, 10:16:06] [0] 		 #> Encoding 4 passages..
[Feb 06, 10:16:08] [0] 		 avg_doclen_est = 114.16667175292969 	 len(local_sample) = 4
[Feb 06, 10:16:08] [1] 		 avg_doclen_est = 114.16667175292969 	 len(local_sample) = 3
[Feb 06, 10:16:08] [0] 		 Creating 256 partitions.
[Feb 06, 10:16:08] [0] 		 *Estimated* 799 embeddings.
[Feb 06, 10:16:08] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Doc-1/plan.json ..
Clustering 775 points in 128D to 256 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.03 s, search 0.02 s): objective=108.294 imbalance=1.456 nsplit=0       
[Feb 06, 10:16:08] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


  sub_sample = torch.load(sub_sample_path)
Process Process-2:
Traceback (most recent call last):
  File "/home/trung/.conda/envs/python310/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/trung/.conda/envs/python310/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trung/.conda/envs/python310/lib/python3.10/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
  File "/home/trung/.conda/envs/python310/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/trung/.conda/envs/python310/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists) # Trains centroids from selected passages
  File "/home/trung/.conda/envs/python310/lib/python3.10/site-packages/colbert/indexing/collection_i

In [5]:
results = RAG.search(query="What is an example for form based indexing?", k=3)
results

Loading searcher for index Doc-1 for the first time... This may take a few seconds
[Jul 06, 00:08:11] #> Loading codec...
[Jul 06, 00:08:11] #> Loading IVF...
[Jul 06, 00:08:11] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jul 06, 00:08:48] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3279.36it/s]

[Jul 06, 00:08:48] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 130.10it/s]

[Jul 06, 00:08:48] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Jul 06, 00:09:20] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is an example for form based indexing?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2054, 2003, 2019, 2742, 2005, 2433, 2241, 5950, 2075, 1029,
         102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': '== Variations ==\nThere are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.\n\n\n=== Form based ===\nForm based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example for form based indexing.',
  'score': 25.978090286254883,
  'rank': 1,
  'document_id': '83decb83-f58e-4d89-b9c1-51f09daadcfa',
  'passage_id': 2},
 {'content': '== Example: PubMed ==\nThe PubMed form interface features the "related articles" search which works through a comparison of words from the documents\' title, abstr

In [6]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What is an example for form based indexing?")



[Document(page_content='== Variations ==\nThere are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.\n\n\n=== Form based ===\nForm based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example for form based indexing.'),
 Document(page_content='== Example: PubMed ==\nThe PubMed form interface features the "related articles" search which works through a comparison of words from the documents\' title, abstract, and MeSH terms using a word-weighted algorithm.\n\n\n== See also ==\nCompound term processing\n