# 04 MultiVector Retriever

In practice, storing multiple vectors per document can improve retrieval quality, and the approach has proved useful across multiple use cases.

LangChain provides a retriever component, `MultiVectorRetriever`, that supports this mechanism. The vectors for each document can be created in the following ways:

- Smaller Chunks
  
  Split a document into smaller chunks, and embed them.

- Summary

  Create a summary for each document, and embed it along with (or instead of) the document.

- Hypothetical questions

  Create hypothetical questions that each document would be appropriate to answer, and embed them along with (or instead of) the document.
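
All three methods share one idea: several searchable entries point back at a single parent document, and retrieval returns the parent rather than the entry that matched. The following is a minimal sketch of that idea in plain Python, not LangChain's implementation. The toy data, the `score` word-overlap function (a stand-in for embedding similarity), and the `retrieve` helper are all assumptions made for illustration.

```python
# Hypothetical toy data: parent documents keyed by ID.
parents = {
    "p1": "Gross margin rose to 70.1% on Data Center growth.",
    "p2": "Operating expenses were up 10% from a year ago.",
}

# Several searchable entries (chunks, summaries, or questions) point at one parent.
index = [
    ("gross margin rose", "p1"),
    ("data center growth", "p1"),
    ("operating expenses up", "p2"),
]


def score(query, text):
    """Word-overlap score; a stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, k=2):
    """Rank index entries, then return each matched parent once, in rank order."""
    ranked = sorted(index, key=lambda entry: score(query, entry[0]), reverse=True)[:k]
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


print(retrieve("gross margin growth"))
```

Both top-ranked entries in this example belong to `p1`, so the parent is returned only once even though two entries matched.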

In [5]:
!pip install langchain openai chromadb tiktoken pypdf



In [8]:
!wget -O nvidia_10q_2023.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf

--2023-10-19 20:09:53--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 18.155.204.142, 18.155.204.120, 18.155.204.129, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|18.155.204.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458891 (448K) [application/pdf]
Saving to: ‘nvidia_10q_2023.pdf’


2023-10-19 20:09:56 (412 KB/s) - ‘nvidia_10q_2023.pdf’ saved [458891/458891]



In [9]:
import os
os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"  # set your own key here; never publish a real one

In [10]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import PyPDFLoader

In [11]:
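# Load the 10-Q PDF; PyPDFLoader yields one Document per page
loaders = [PyPDFLoader('./nvidia_10q_2023.pdf')]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split into large "parent" chunks of up to 10,000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)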

In [12]:
len(docs)

51

In [13]:
print(docs[6])

page_content="NVIDIA CORPORATION AND SUBSIDIARIES\nCONDENSED CONSOLIDATED STATEMENTS OF SHAREHOLDERS’ EQUITY\nFOR THE SIX MONTHS ENDED JULY 30, 2023 AND JULY 31, 2022\n(Unaudited)\nCommon Stock\nOutstandingAdditional\nPaid-in\nCapitalAccumulated Other\nComprehensive LossRetained\nEarningsTotal\nShareholders'\nEquity (In millions, except per share data) Shares Amount\nBalances, January 29, 2023 2,466 $ 2 $ 11,971 $ (43) $ 10,171 $ 22,101 \nNet income — — — — 8,232 8,232 \nOther comprehensive loss — — — (8) — (8)\nIssuance of common stock from stock plans 14 — 247 — — 247 \nTax withholding related to vesting of restricted stock units (3) — (1,179) — — (1,179)\nShares repurchased (8) — (1) — (3,283) (3,284)\nCash dividends declared and paid ($0.08 per common share) — — — — (199) (199)\nStock-based compensation — — 1,591 — — 1,591 \nBalances, July 30, 2023 2,469 $ 2 $ 12,629 $ (51) $ 14,921 $ 27,501 \nBalances, January 30, 2022 2,506 $ 3 $ 10,385 $ (11) $ 16,235 $ 26,612 \nNet income — — —

## Smaller Chunks

In [14]:
import uuid

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
# Metadata key that links each child chunk to its parent document
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
# One unique ID per parent document
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [15]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [16]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
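
The loop above splits each parent into child chunks and stamps every child with its parent's ID. The same bookkeeping can be sketched without LangChain, as below. Here `split_fixed` is a hypothetical fixed-width splitter standing in for `RecursiveCharacterTextSplitter`, and the parent texts are made-up placeholders rather than the PDF's content.

```python
import uuid


def split_fixed(text, size):
    """Slice text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical parent texts standing in for the loaded documents.
parent_texts = ["a" * 1000, "b" * 450]
parent_ids = [str(uuid.uuid4()) for _ in parent_texts]

children = []
for parent_id, text in zip(parent_ids, parent_texts):
    for piece in split_fixed(text, 400):
        # Each child records which parent it came from.
        children.append({"page_content": piece, "metadata": {"doc_id": parent_id}})

print(len(children))
```

With a chunk size of 400, the 1000-character parent yields three children and the 450-character parent yields two, and each child's `doc_id` is enough to look its parent up again later.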

In [17]:
print(doc_ids)

['a18441ae-c055-4779-beed-06abc70302c2', 'bf0fcc26-3612-4913-bf5d-16f8b6b8bfe3', 'a871868f-4045-49f2-9a74-cfa1e2bddf29', 'd88b9dd7-d530-4ab7-9f5b-4da73520a796', 'db1b983c-8626-4f53-8445-cf54f8d21be3', '47acabdc-e30c-4eb5-b1b1-eb96fbae562c', '5437718b-9414-4008-b81e-7d602e094987', 'c7cb60e6-e6a5-4de1-a1d4-0b769fea4751', 'fa0f6bfa-a7b7-4890-8e49-6a50a3d780e1', 'b8be8d2a-3bbc-49df-a395-bcd3c4d4f75c', 'af906c9e-b9eb-4e9d-a1df-921381afe43d', 'd7e2dec4-75dd-4d87-9247-ef5b3ec21c1d', '29b488ad-33a5-495f-a6c7-9f3380e7c612', 'fc44e236-839e-4042-9f85-405d98544c9d', '64e81d25-8a8b-4768-a842-8e3857defe75', 'af47834d-2eb9-4c2a-9dce-75097d6495e3', '0d52302c-6300-4a9a-acf5-c1fe77e6ec75', '4c53e9c8-844f-4365-bd5c-64d0132a7561', '394fbd91-db87-4ec2-8675-4b5e399e70f1', '337ad6fc-635c-47ab-8cec-2c5a9fdcfd9c', '1764076c-5511-41e4-9f67-ff28b6a6b74f', '75a35c2f-e994-451c-93a7-6c5df17bf7c4', 'c24dee6a-4a20-494f-8ae0-e97e59b63c65', '2a87f1e0-1aa4-4501-a399-a50400a568c6', '06fbce28-0619-46c1-ac43-a96010b5bf97',

In [18]:
print(sub_docs)

[Document(page_content='UNITED ST ATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-Q\n☒ QUARTERL Y REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION', metadata={'source': './nvidia_10q_2023.pdf', 'page': 0, 'doc_id': 'a18441ae-c055-4779-beed-06abc70302c2'}), Document(page_content='For the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION\n(Exact name of registrant as specified in its charter)\nDelaware 94-3177549\n(State or other jurisdiction of (I.R.S. Employer\nincorporation or organization) Identification No.)', metadata={'source': './nvidia_10q_2023.pdf', 'page': 0, 'doc_id': 'a18441ae-c055-4779-beed-0

In [19]:
print(sub_docs[1])

page_content='For the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION\n(Exact name of registrant as specified in its charter)\nDelaware 94-3177549\n(State or other jurisdiction of (I.R.S. Employer\nincorporation or organization) Identification No.)' metadata={'source': './nvidia_10q_2023.pdf', 'page': 0, 'doc_id': 'a18441ae-c055-4779-beed-06abc70302c2'}


In [20]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [21]:
# Vectorstore alone retrieves the small chunks
similar_docs = retriever.vectorstore.similarity_search("What is the gross margin?")

In [22]:
similar_docs

[Document(page_content='Gross margin increased from a year ago and sequentially, primarily reflecting growth in Data Center sales. The year-on-year increase also\nreflects the impact on the year-ago gross margin from $1.34 billion in inventory provisions and related charges.\nOperating expenses were up 10% from a year ago and up 6% sequentially, primarily driven by compensation and benefits, including stock-based', metadata={'doc_id': 'ef4615a4-652f-4b43-9fcf-b3155997200d', 'page': 26, 'source': './nvidia_10q_2023.pdf'}),
 Document(page_content='distributors, and channel partners. We expect this concentration trend will continue.\nGross Margin\nOur overall gross margin increased to 70.1% and 68.2% for the second quarter and first half of fiscal year 2024, respectively, from 43.5% and\n55.7% for the second quarter and first half of fiscal year 2023, respectively. The increase in the second quarter and first half of fiscal year 2024', metadata={'doc_id': '669b0a8f-4971-4b53-84fe-b6efac6e

In [23]:
relevant_docs = retriever.get_relevant_documents("What is the gross margin?")

In [24]:
len(relevant_docs)

4

In [25]:
relevant_docs[0]

Document(page_content='Second Quarter of Fiscal Year 2024 Summary\nThree Months Ended\n July 30, 2023 April 30, 2023 July 31, 2022Quarter-over-Quarter\nChangeYear-over-Year\nChange\n($ in millions, except per share data)\nRevenue $ 13,507 $ 7,192 $ 6,704 88 % 101 %\nGross margin 70.1 % 64.6 % 43.5 % 5.5 pts 26.6 pts\nOperating expenses $ 2,662 $ 2,508 $ 2,416 6 % 10 %\nOperating income $ 6,800 $ 2,140 $ 499 218 % 1,263 %\nNet income $ 6,188 $ 2,043 $ 656 203 % 843 %\nNet income per diluted share $ 2.48 $ 0.82 $ 0.26 202 % 854 %\nWe specialize in markets where our computing platforms can provide tremendous acceleration for applications. These platforms incorporate\nprocessors, interconnects, software, algorithms, systems, and services to deliver unique value. Our platforms address four large markets where\nour expertise is critical: Data Center, Gaming, Professional Visualization, and Automotive.\nRevenue for the second quarter of fiscal year 2024 was $13.51 billion, up 101% from a year

In [26]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [27]:
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=memory)

In [28]:
result = qa({"question": "What is the gross margin?"})

In [29]:
result

{'question': 'What is the gross margin?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin was 70.1% for the second quarter of fiscal year 2024.')],
 'answer': ' The gross margin was 70.1% for the second quarter of fiscal year 2024.'}

In [30]:
result = qa({"question": "What is the main contribution to it?"})
result

{'question': 'What is the main contribution to it?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin was 70.1% for the second quarter of fiscal year 2024.'),
  HumanMessage(content='What is the main contribution to it?'),
  AIMessage(content=' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.')],
 'answer': ' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.'}

## Summary

In [31]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
import uuid
from langchain.schema.document import Document

In [32]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

In [35]:
pip install tenacity

Note: you may need to restart the kernel to use updated packages.


In [36]:
import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

In [41]:
import datetime


In [49]:
docss = docs[:2]

In [None]:
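# stop_after_attempt(1) allows a single attempt, so any failure surfaces immediately as RetryError
@retry(wait=wait_random_exponential(min=65, max=70), stop=stop_after_attempt(1))
def completion_with_backoff():
    now = datetime.datetime.now()  # get the current time
    print('Today is {}/{} {}:{}:{}'.format(now.month, now.day, now.hour, now.minute, now.second))
    # Summarize the first two documents and return the results to the caller
    return chain.batch(docs[:2], {"max_concurrency": 5})

In [47]:
len(docs)

51

In [50]:
summaries = completion_with_backoff()

Today is 10/19 21:22:36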


RetryError: RetryError[<Future at 0x7f55906473d0 state=finished raised RateLimitError>]
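
The `RetryError` above is what tenacity raises once its stop condition is met. With `stop_after_attempt(1)`, that happens after the first `RateLimitError`, so no retry actually takes place. Below is a minimal standard-library sketch of retrying with randomized exponential backoff. The names `call_with_backoff` and `flaky` and all parameter values are hypothetical, chosen for illustration; they are not part of tenacity or LangChain.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on failure, wait with randomized exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles on each attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))


calls = {"count": 0}


def flaky():
    """Fail twice with a simulated rate-limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


# Pass a no-op sleep so the demonstration does not actually wait.
print(call_with_backoff(flaky, sleep=lambda seconds: None))
```

In a real run the default `time.sleep` would be used, and the exception filter would likely be narrowed to the rate-limit error rather than catching every `Exception`.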

In [None]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

In [None]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
sub_docs = vectorstore.similarity_search("What is the gross margin?")
sub_docs[0]

In [None]:
retrieved_docs = retriever.get_relevant_documents("What is the gross margin?")
retrieved_docs[0]

In [None]:
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True))

In [None]:
result = qa({"question": "What is the gross margin?"})
result

In [None]:
result = qa({"question": "What is the main contribution to it?"})
result

## Hypothetical Questions

In [None]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

In [None]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template("Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{doc}")
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(functions=functions, function_call={"name": "hypothetical_questions"})
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
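
The `functions` schema asks the model to return a JSON object with a `questions` array of strings, and the parser at the end of the chain pulls that array out by key. The sketch below shows roughly what that extraction amounts to, using only the standard library. The `raw_arguments` string is a made-up example of a function-call payload, and `extract_questions` is a hypothetical helper, not LangChain's parser.

```python
import json

# Hypothetical raw `arguments` string, shaped as the schema requests.
raw_arguments = json.dumps({
    "questions": [
        "What was the gross margin?",
        "How did revenue change?",
        "What drove operating expenses?",
    ]
})


def extract_questions(arguments):
    """Parse the JSON arguments and return the validated `questions` list."""
    payload = json.loads(arguments)
    questions = payload["questions"]
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("`questions` must be a list of strings")
    return questions


for question in extract_questions(raw_arguments):
    print(question)
```

Each returned string later becomes its own searchable entry that points back at the document it was generated from.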

In [None]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

In [None]:
vectorstore = Chroma(
    collection_name="hypo-questions",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend([Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list])

In [None]:
question_docs

In [None]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True))

In [None]:
result = qa({"question": "What is the gross margin?"})
result

In [None]:
result = qa({"question": "What is the main contribution to it?"})
result