
## Practical Experiment with Multi-Vector Retrieval: Exploring 3 Advanced Techniques in Table-Heavy Documents

<a target="_blank" href="https://colab.research.google.com/drive/16ZMcOtHU0hjfXP2lKnZV8gRcGuSJ6aRC?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook implements and evaluates three MultiVectorRetriever approaches on table-rich documents from a practical standpoint; it is not an introductory tutorial on MultiVectorRetriever. The basic principles are covered in two earlier videos: [3 Advanced Indexing Methods That Effectively Improve RAG Performance](https://www.bilibili.com/video/BV1dH4y1C7Ck/) and [RAG in Practice: Implementing Three Advanced Indexing Methods with Multi-Vector-Retrieval (with Claude/GPT3-3.5 Evaluation Results)](https://www.bilibili.com/video/BV1Vu4y1H72s/).

Three Methods with [MultiVectorRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector):


**[Smaller chunks](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#smaller-chunks)**

**[Summaries](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary)**

**[Hypothetical Queries](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#hypothetical-queries)**
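
All three methods share one mechanism: small proxy texts (child chunks, summaries, or hypothetical questions) are embedded and searched, and each hit is mapped back to a larger parent document through a shared id. The sketch below illustrates only that lookup step without any library; `resolve_parents` and the sample data are hypothetical names made up for illustration, not part of LangChain's API.

```python
def resolve_parents(hits, docstore, id_key="doc_id"):
    """Map ranked proxy hits back to their parent documents, without duplicates."""
    seen = set()
    parents = []
    for hit in hits:
        parent_id = hit["metadata"][id_key]
        # Skip parents already returned and ids missing from the store
        if parent_id in seen or parent_id not in docstore:
            continue
        seen.add(parent_id)
        parents.append(docstore[parent_id])
    return parents

# Hypothetical parent store and ranked search hits
docstore = {"p1": "Full balance sheet section ...", "p2": "Full R&D expense section ..."}
hits = [
    {"text": "cash and cash equivalents", "metadata": {"doc_id": "p1"}},
    {"text": "accounts receivable, net", "metadata": {"doc_id": "p1"}},
    {"text": "research and development", "metadata": {"doc_id": "p2"}},
]
parents = resolve_parents(hits, docstore)
```

Two hits point at the same parent, so the parent is returned once, in the order of its best-ranked hit.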

**Tools we will use:**

HTML documents parser: UnstructuredHTMLLoader

MultiVectorRetriever

[LCEL](https://python.langchain.com/docs/expression_language/) (Langchain Expression Language)

Notes:

1. This notebook shows experiments from real work; it is not an entry-level tutorial.
2. Running all of these methods will cost about `$4-$8`. If cost matters, test them on a few small files first.

## Packages

In [None]:
!pip install langchain unstructured[all-docs] pydantic lxml langchainhub sentence_transformers chromadb openai -q


## Configuration

Configure the embeddings and the model. We will use `HuggingFaceEmbeddings` (`all-MiniLM-L6-v2`) and the `gpt-3.5-turbo-16k` model.

In [None]:
# create folder results/ if not exists to save the experiment results.

import os

# Define the folder path
results_folder = "results/"

# Check if the folder already exists
if not os.path.exists(results_folder):
    # Create the folder if it doesn't exist
    os.makedirs(results_folder)
    print(f"Created folder: {results_folder}")
else:
    print(f"Folder already exists: {results_folder}")


Created folder: results/


In [None]:
from langchain.document_loaders.html import UnstructuredHTMLLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

from IPython.display import Markdown
import warnings
import pandas as pd
import re


# Suppress warnings
warnings.filterwarnings("ignore")

# embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Equivalent to SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

# model = ChatOpenAI(temperature=0, model="gpt-4")
model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")



OpenAI API Key:··········


## Data Loading

We will use the [Tesla-10-K-2022-Filing](https://www.sec.gov/Archives/edgar/data/1318605/000095017023001409/tsla-20221231.htm), which contains many tables alongside text. The next cell downloads a copy.



In [None]:
# download data
!wget https://www.dropbox.com/scl/fi/fqyvmodovgk21p06giezu/Tesla-10-K-2022-Filing.html

--2023-12-14 16:00:24--  https://www.dropbox.com/scl/fi/fqyvmodovgk21p06giezu/Tesla-10-K-2022-Filing.html
Resolving www.dropbox.com (www.dropbox.com)... 162.125.13.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.13.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘Tesla-10-K-2022-Filing.html’

Tesla-10-K-2022-Fil     [ <=>                ]  68.51K  --.-KB/s    in 0.1s    

2023-12-14 16:00:24 (639 KB/s) - ‘Tesla-10-K-2022-Filing.html’ saved [70154]


In [None]:
from langchain.vectorstores import utils as chromautils

# load data
doc_path = "./Tesla-10-K-2022-Filing.html"

loader = UnstructuredHTMLLoader(doc_path, mode="paged")
docs = loader.load()
docs = chromautils.filter_complex_metadata(docs) # https://github.com/langchain-ai/langchain/issues/8556#issuecomment-1806835287

# # rag baseline
# data_texts = [element.page_content for element in docs]

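## Evaluation Dataset

The evaluation set contains eight question/answer pairs about figures reported in the filing; the expected answers are stated in millions of US dollars.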


In [None]:
qna_dict = {
        "What is the value of cash and cash equivalents in 2022?": "16,253 $ millions",
        "What is the value of cash and cash equivalents in 2021?": "17,576 $ millions",
        "What is the net value of accounts receivable in 2022?": "2,952 $ millions",
        "What is the net value of accounts receivable in 2021?": "1,913 $ millions",
        "What is the total stockholders' equity in 2022?": "44,704 $ millions",
        "What is the total stockholders' equity in 2021?": "30,189 $ millions",
        "What are total operational expenses for research and development in 2022?": "3,075 $ millions",
        "What are total operational expenses for research and development in 2021?": "2,593 $ millions",
    }

## (Function) Predict answer and evaluate

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from operator import itemgetter
from langchain.vectorstores import Chroma

def predict_answer_and_evaluate(question, expected_answer, retriever, model):
    """
      This function predicts an answer to a question using an LLM and then evaluates it against the expected answer.

      Args:
        question: The question to be answered.
        expected_answer: The expected answer to the question.
        retriever: The retriever used for retrieving relevant context.
        model: The LLM model used for prediction and evaluation.

      Returns:
        A dictionary containing the following information:
          - question: The original question.
          - llm_answer: The answer predicted by the LLM.
          - expected_answer: The expected answer to the question.
          - is_correct: A string indicating whether the predicted answer matches the expected answer ("Yes") or not ("No").
    """

    # Rag template for retrieving relevant context and prompting the LLM
    rag_template = """Answer the question based only on the following context:
    {context}

    Question: {question}
    """
    rag_prompt = ChatPromptTemplate.from_template(rag_template)
    # Parallel execution for context retrieval and question processing
    setup_and_retrieval = RunnableParallel(
        {"context": retriever, "question": RunnablePassthrough()}
    )
    # Chain the retrieval, prompting, prediction, and parsing steps
    rag_chain = setup_and_retrieval | rag_prompt | model | StrOutputParser()

    eval_template = ChatPromptTemplate.from_template("""
            Input:
            Question: {question}
            LLM-Answer: {llm_answer}
            Expected-Answer: {expected_answer}

            Task:
            Compare LLM-Answer and Expected-Answer and determine if they convey the same meaning or information.

            Output:
            Yes: If LLM-Answer and Expected-Answer convey the same meaning or information.
            No: If LLM-Answer and Expected-Answer do not convey the same meaning or information.

            Example:
            Question: What is the capital of France?
            LLM-Answer: Paris
            Expected-Answer: The City of Lights

            Output:
            Yes""")
    # Chain together the evaluation steps
    eval_chain = eval_template | model | StrOutputParser()

    # Generate the answer once, so the answer that is graded is the one that is reported
    # (the previous version ran the RAG chain a second time inside the evaluation chain)
    llm_answer = rag_chain.invoke(question)

    # Grade the generated answer against the expected answer
    eval_answer = eval_chain.invoke(
        {
            "question": question,
            "llm_answer": llm_answer,
            "expected_answer": expected_answer,
        }
    )
    # result
    result = {
        "question": question,
        "llm_answer": llm_answer.strip(),
        "expected_answer": expected_answer,
        "is_correct": eval_answer.strip(),
    }

    return result
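
Because the verdict in `is_correct` comes from an LLM, a deterministic cross-check can help flag rows worth reviewing by hand. The sketch below is an assumption-laden helper, not part of the original notebook: `extract_numbers` and `numbers_agree` are hypothetical names, and the check only compares the numbers that appear in the two strings, ignoring units and wording.

```python
import re

def extract_numbers(text):
    """Return the numbers found in text as floats, ignoring thousands separators."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]

def numbers_agree(llm_answer, expected_answer):
    """True if every number in the expected answer also appears in the LLM answer."""
    expected = extract_numbers(expected_answer)
    found = set(extract_numbers(llm_answer))
    # An expected answer without numbers cannot be verified this way
    return bool(expected) and all(n in found for n in expected)
```

A row where the LLM judge says "Yes" but `numbers_agree` returns `False` (or the reverse) is a good candidate for manual review.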

## RAG Baseline

In [None]:

# rag baseline
data_texts = [element.page_content for element in docs]

vectorstore = Chroma.from_texts(data_texts, collection_name="text_table",
                        embedding=embeddings)

# retriever
retriever = vectorstore.as_retriever()


In [None]:
import time
basic_answers = []
for question in qna_dict.keys():
    expected_answer = qna_dict[question]
    # result
    res = predict_answer_and_evaluate(question, expected_answer, retriever, model)
    time.sleep(60)  # pause between questions to avoid OpenAI rate-limit errors (HTTP 429)
    basic_answers.append(res)
basic_answers_df = pd.DataFrame(basic_answers) # you may need to verify the is_correct manually.
basic_answers_df.to_excel('results/rag_baseline_results.xlsx', index=False)
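
The fixed 60-second pause in the loop avoids rate-limit errors but waits even when no error occurs. One alternative, sketched here under assumptions, is to retry only on failure with exponentially growing pauses. `call_with_backoff` is a hypothetical helper; it catches every `Exception` for simplicity, whereas real code would more likely catch only the client's rate-limit error.

```python
import time

def call_with_backoff(fn, *args, max_retries=5, base_delay=2.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponentially growing pauses when it raises."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Last attempt: give up and surface the error
            if attempt == max_retries - 1:
                raise
            # Pause for base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

In the loop above this could be used as `res = call_with_backoff(predict_answer_and_evaluate, question, expected_answer, retriever, model)`.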

## Smaller Chunks

In [None]:
import uuid
# from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
import time
t0=time.time()

parent_chunk_size = 10000
child_chunk_size = 400
id_key = 'doc_id'
collection_name = 'split_documents'

# Create the parent documents
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=parent_chunk_size)
docs = parent_text_splitter.split_documents(docs)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name=collection_name, embedding_function=embeddings
)
# The storage layer for the parent documents
store = InMemoryStore()
# The retriever (empty to start)
smaller_chunk_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]
# The splitter to use to create smaller chunks from parent documents
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=child_chunk_size)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # print(f'_id: {_id}\n\n')
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
        # print(f'_doc: {_doc}\n\n')
    sub_docs.extend(_sub_docs)
# print(sub_docs[0])
# Add texts
smaller_chunk_retriever.vectorstore.add_documents(sub_docs)
smaller_chunk_retriever.docstore.mset(list(zip(doc_ids, docs)))
t1 = time.time()

print(t1-t0, 's')
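
The indexing step above boils down to cutting each parent into small pieces and stamping every piece with its parent's id. The library-free sketch below shows that bookkeeping; it assumes a plain fixed-width splitter, which is a simplification (`RecursiveCharacterTextSplitter` splits on separators), and `split_fixed` and `build_children` are hypothetical names.

```python
import uuid

def split_fixed(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_children(parents, child_chunk_size, id_key="doc_id"):
    """Return (parent_ids, children); each child records the id of its parent."""
    parent_ids = [str(uuid.uuid4()) for _ in parents]
    children = []
    for parent_id, parent in zip(parent_ids, parents):
        for piece in split_fixed(parent, child_chunk_size):
            children.append({"page_content": piece, "metadata": {id_key: parent_id}})
    return parent_ids, children

# Two toy parents of 1000 and 450 characters, split into children of at most 400
parents = ["a" * 1000, "b" * 450]
parent_ids, children = build_children(parents, child_chunk_size=400)
```

Only the children are embedded; the parent ids are what allow the retriever to hand the full parent text to the LLM.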

In [None]:
import time

smaller_chunk_answers = []
for question in qna_dict.keys():
    expected_answer = qna_dict[question]
    # result
    res = predict_answer_and_evaluate(question, expected_answer, smaller_chunk_retriever, model)
    time.sleep(60)  # pause between questions to avoid OpenAI rate-limit errors (HTTP 429)
    smaller_chunk_answers.append(res)

smaller_chunk_answers_df = pd.DataFrame(smaller_chunk_answers)
smaller_chunk_answers_df.to_excel('results/rag_smaller_chunk_results.xlsx', index=False)

## Summaries

In [None]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
import os
import uuid

# helper functions
# To save time, summaries are loaded from this CSV file when it already exists
TABLE_SUMMARIES_CSV = "./table_summaries_0.csv"

def summarize(texts):
    """
    This function summarizes the given texts using a GPT-3.5 model. It also checks if a CSV file with previous summaries exists,
    if it does, it loads the summaries from there instead of generating new ones.

    Args:
        texts (list): A list of texts to be summarized.

    Returns:
        list: A list of summarized texts.

    """
    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text. \
    Give a concise summary of the table or text. Table or text chunk: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Summary chain
    # model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k-0613", openai_api_key=open_ai_key)
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Reuse cached summaries when the CSV exists; otherwise generate and cache them
    if os.path.exists(TABLE_SUMMARIES_CSV):
        t_frame = pd.read_csv(TABLE_SUMMARIES_CSV)
        # Column 0 is the saved index; column 1 holds the summaries
        table_summaries = [elem[1] for elem in t_frame.values.tolist()]
    else:
        table_summaries = [summarize_chain.invoke(text) for text in texts]
        pd.DataFrame(table_summaries).to_csv(TABLE_SUMMARIES_CSV)

    return table_summaries

def setup_retriever(sections):
    """
    This function sets up a retriever for the given sections of text. It first summarizes the sections, then creates a
    Chroma vectorstore to index the summaries. It also sets up an InMemoryStore for the parent documents and a
    MultiVectorRetriever to retrieve the documents. Finally, it adds the summarized texts to the vectorstore and the
    original sections to the docstore.

    Args:
        sections (list): A list of sections of text to be indexed and retrieved.

    Returns:
        MultiVectorRetriever: A retriever set up with the given sections of text.
    """
    text_summaries = summarize(sections)
    # The vectorstore to use to index the summaries
    vectorstore = Chroma(
        collection_name="summaries",
        embedding_function=embeddings,
    )

    # The storage layer for the parent documents
    store = InMemoryStore()
    id_key = "doc_id"

    # The retriever (empty to start)
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Add texts
    doc_ids = [str(uuid.uuid4()) for _ in sections]
    summary_texts = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(text_summaries)]
    retriever.vectorstore.add_documents(summary_texts)
    retriever.docstore.mset(list(zip(doc_ids, sections)))

    return retriever
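
The summary cache is trusted whenever the CSV file exists, so a cache built from a different document or a different chunking would silently pair summaries with the wrong sections. The sketch below shows one possible guard, using the standard `csv` module instead of pandas: the cache is reused only when it holds exactly one entry per input. `load_or_compute` is a hypothetical helper, and matching on length alone is a deliberately weak check.

```python
import csv
import os

def load_or_compute(path, items, compute):
    """Return one cached result per item, recomputing when the cache is missing or stale."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            cached = [row[0] for row in csv.reader(f) if row]
        # Reuse the cache only if it has exactly one entry per item
        if len(cached) == len(items):
            return cached
    results = [compute(item) for item in items]
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([r] for r in results)
    return results
```

A stricter variant could also store a hash of each input text next to its summary and compare hashes on load.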


In [None]:
data_texts = [element.page_content for element in docs]

summary_retriever = setup_retriever(data_texts)

In [None]:
import time

summary_answers = []
for question in qna_dict.keys():
    expected_answer = qna_dict[question]
    # result
    res = predict_answer_and_evaluate(question, expected_answer, summary_retriever, model)
    time.sleep(60)  # pause between questions to avoid OpenAI rate-limit errors (HTTP 429)
    summary_answers.append(res)

summary_answers_df = pd.DataFrame(summary_answers)
summary_answers_df.to_excel('results/rag_summary_results.xlsx', index=False)


## Hypothetical Queries

In [None]:

import uuid

# from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser

import re

hypo_query_chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}")
    | model
    | StrOutputParser()
    | (lambda x: [re.sub(r"\d+\. ", "", i) for i in x.strip().split("\n\n")[1:]])
)
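
The last step of the chain above assumes the model replies with an introductory paragraph followed by questions separated by blank lines: it splits on `"\n\n"` and drops the first piece. If the model instead returns a compact numbered list with no introduction, that parse yields an empty list. The sketch below is one more tolerant alternative; `parse_numbered_questions` is a hypothetical helper, and it assumes each question sits on its own numbered or bulleted line.

```python
import re

def parse_numbered_questions(text):
    """Extract questions from a numbered or bulleted list, one per line."""
    questions = []
    for line in text.splitlines():
        # Accept "1. ...", "1) ...", "- ..." or "* ..." and keep the text after the marker
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions
```

It could replace the final lambda in the chain, for example `| parse_numbered_questions`, since LCEL accepts plain callables as chain steps.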


In [None]:

# %%timeit
# hypothetical_questions = hypo_query_chain.batch(docs, {"max_concurrency": 5}) # failed

# Workaround: invoke the chain one document at a time instead of batching
hypothetical_questions = []
for doc in docs:
    hypo_queries = hypo_query_chain.invoke(doc, {"max_concurrency": 5})
    hypothetical_questions.append(hypo_queries)

In [None]:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore


# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name='hypo-queries', embedding_function=embeddings
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
hypo_questions_retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

hypo_questions_retriever.vectorstore.add_documents(question_docs)
hypo_questions_retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
import time

hypo_query_answers = []
for question in qna_dict.keys():
    expected_answer = qna_dict[question]
    # result
    res = predict_answer_and_evaluate(question, expected_answer, hypo_questions_retriever, model)
    time.sleep(60)  # pause between questions to avoid OpenAI rate-limit errors (HTTP 429)
    hypo_query_answers.append(res)

hypo_query_answers_df = pd.DataFrame(hypo_query_answers)
hypo_query_answers_df.to_excel('results/rag_hypo_query_results.xlsx', index=False)


## Experiment Results


In [None]:
import pandas as pd
import os

# Define the results folder path
results_folder = "results/"

# List all Excel files in the folder
excel_files = [f for f in os.listdir(results_folder) if f.endswith(".xlsx")]

def add_method_column(filename):
    """
    Adds a new column named 'method' to the loaded dataframe with the filename as the value.

    Args:
    filename: The name of the Excel file to be loaded.

    Returns:
    The Pandas DataFrame with the 'method' column added.
    """

    df = pd.read_excel(os.path.join(results_folder, filename))
    method_name = os.path.splitext(filename)[0]
    df["method"] = method_name
    return df

# Combine all dataframes with added method column
combined_df = pd.concat([add_method_column(f) for f in excel_files])
# Save the combined dataframe to a new file
combined_df.to_excel("experiment_results(raw).xlsx", index=False)

# Group the verdicts by method
grouped = combined_df.groupby('method')['is_correct']

# Percentage of each verdict ("Yes" / "No") per method.
# Naming the series explicitly avoids a column-name clash in reset_index on older pandas versions.
method_accuracy = (grouped.value_counts(normalize=True) * 100).rename('percentage').reset_index()

# Count the occurrences of "Yes" and "No" per method
method_is_correct_count = grouped.value_counts().rename('count').reset_index()

experiment_results = method_is_correct_count.merge(method_accuracy, on=['method', 'is_correct'])

experiment_results.to_excel('experiment_results(agg).xlsx', index=False)
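
The aggregation counts verdicts by exact string, so replies such as `Yes`, `Yes.` and `yes` would land in separate buckets. A small normalization pass before counting can prevent that. The sketch below uses only the standard library; `normalize_verdict` and `accuracy_by_method` are hypothetical helpers, and the sample rows are made up for illustration, not experiment results.

```python
from collections import defaultdict

def normalize_verdict(raw):
    """Map a free-text judge reply to 'Yes', 'No', or 'Unclear'."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    if first == "yes":
        return "Yes"
    if first == "no":
        return "No"
    return "Unclear"

def accuracy_by_method(rows):
    """rows: iterable of (method, raw_verdict). Returns {method: percent judged 'Yes'}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for method, raw in rows:
        totals[method] += 1
        correct[method] += normalize_verdict(raw) == "Yes"
    return {m: 100.0 * correct[m] / totals[m] for m in totals}

# Made-up verdicts, only to show the mechanics
rows = [("baseline", "Yes"), ("baseline", "No."), ("summary", "yes\n"), ("summary", "Yes: same value")]
accuracy = accuracy_by_method(rows)
```

Replies that start with neither word are kept as `Unclear` rather than guessed, so they can be reviewed manually.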