# Chapter 5 Retrieval
- [1. Vector database retrieval](#1. Vector database retrieval)
- [1.1 Similarity search](#1.1-Similarity search)
- [1.2 Solving diversity: Maximum marginal relevance (MMR)](#1.2-Solving diversity: Maximum marginal relevance (MMR))
- [1.3 Solving specificity: Using metadata](#1.3-Solving specificity: Using metadata)
- [1.4 Solving specificity: Using a self-query retriever to infer metadata](#1.4-Solving specificity: Using a self-query retriever to infer metadata)
- [1.5 Other techniques: Compression](#1.5-Other techniques: Compression)
- [2. Combining various techniques](#2. Combining various techniques)
- [3. Other types of retrieval](#3. Other types of retrieval)

Retrieval is at the heart of our Retrieval Augmented Generation (RAG) pipeline.

Let’s get the vector database (`VectorDB`) we stored in the previous lesson.

## 1. Vector database retrieval

Create a new `.env` file in the current folder, with the content `OPENAI_API_KEY = "sk-..."`

This chapter requires the `lark` package. Run the following command to install it:

In [1]:
!pip install -Uq lark

In [3]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']


### 1.1 Similarity Search

In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/cs229_lectures/'
persist_directory_chinese = 'docs/chroma/matplotlib/'

Load the vector database (`VectorDB`) saved in the previous lesson.

In [5]:
embedding = OpenAIEmbeddings()

In [6]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [7]:
print(vectordb._collection.count())

209


In [8]:
vectordb_chinese = Chroma(
    persist_directory=persist_directory_chinese,
    embedding_function=embedding
)

In [9]:
print(vectordb_chinese._collection.count())

27


Before introducing Maximum Marginal Relevance (MMR), let's look at a small example that motivates it. We will load a few short texts about mushrooms into a small database.

In [10]:
texts = [
"""The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
"""A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
"""A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [11]:
# Chinese versions of the three texts above
texts_chinese = [
"""毒鹅膏菌（Amanita phalloides）具有大型且引人注目的地上（epigeous）子实体（basidiocarp）""",
"""一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。""",
"""A. phalloides，又名死亡帽，是已知所有蘑菇中最有毒的一种。""",
]

We will create a small database from these texts to use in this example.

In [12]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

100%|██████████| 1/1 [00:00<00:00,  1.28it/s]


In [13]:
smalldb_chinese = Chroma.from_texts(texts_chinese, embedding=embedding)

100%|██████████| 1/1 [00:00<00:00,  2.30it/s]


Here is the question we will ask in this example:

In [14]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [15]:
question_chinese = "告诉我关于具有大型子实体的全白色蘑菇的信息"

Now, we can run a similarity search, setting k=2, to return only the two most relevant documents.

We can see that there is no mention of the fact that it is toxic.

In [16]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [17]:
smalldb_chinese.similarity_search(question_chinese, k=2)

[Document(page_content='一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。', metadata={}),
 Document(page_content='毒鹅膏菌（Amanita phalloides）具有大型且引人注目的地上（epigeous）子实体（basidiocarp）', metadata={})]
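
As a rough mental model, a similarity search embeds the query and returns the `k` stored vectors closest to it. The sketch below illustrates that idea with hand-written two-dimensional vectors; `toy_doc_vecs`, `toy_query_vec`, and `top_k` are hypothetical names introduced here, and cosine similarity is used only for illustration (the distance metric a real vector store uses may differ).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-D "embeddings"; real embeddings have many more dimensions.
toy_doc_vecs = [
    [0.9, 0.1],  # doc 0: mostly about appearance
    [0.8, 0.2],  # doc 1: also about appearance
    [0.1, 0.9],  # doc 2: about toxicity
]
toy_query_vec = [1.0, 0.0]  # a query about appearance

print(top_k(toy_query_vec, toy_doc_vecs, k=2))
```

In this toy setup the two appearance-related vectors rank highest, so the toxicity vector is never returned, which mirrors what happened in the mushroom example.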

Now, let's run Maximum Marginal Relevance (MMR).

Set k=2, since we still want two documents returned. Set fetch_k=3, so that MMR first fetches all three documents as candidates and then selects the final two from them.

In [18]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

In [19]:
smalldb_chinese.max_marginal_relevance_search(question_chinese, k=2, fetch_k=3)

[Document(page_content='一种具有大型子实体的蘑菇是毒鹅膏菌（Amanita phalloides）。某些品种全白。', metadata={}),
 Document(page_content='A. phalloides，又名死亡帽，是已知所有蘑菇中最有毒的一种。', metadata={})]

We can now see that the retrieved documents include the information about toxicity.

### 1.2 Solving Diversity: Maximum Marginal Relevance (MMR)

The example above illustrates a common problem: how to increase the diversity of search results.

Maximum marginal relevance (MMR) balances two goals: relevance to the query and diversity among the results.
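
To make the trade-off concrete, here is a minimal sketch of a greedy MMR selection in plain Python. It is an illustration under stated assumptions, not the library's implementation: the vectors are made up, cosine similarity is assumed as the similarity measure, and the `lambda_mult` weight (1.0 means pure relevance, 0.0 means pure diversity) is named to echo the similarly named LangChain parameter.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of k documents balancing relevance and diversity."""
    sims = [cosine_similarity(query_vec, d) for d in doc_vecs]
    # Step 1: fetch the fetch_k most similar candidates.
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:fetch_k]
    selected = []
    # Step 2: repeatedly pick the candidate with the best relevance/redundancy score.
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical vectors: docs 0 and 1 are near-duplicates, doc 2 covers another aspect.
mmr_doc_vecs = [[1.0, 0.0], [0.98, -0.05], [0.5, 0.85]]
mmr_query_vec = [1.0, 0.1]

print(mmr(mmr_query_vec, mmr_doc_vecs, k=2, fetch_k=3))
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the selection reduces to a plain similarity ranking, so the near-duplicate is chosen second instead of the more diverse document.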

Let's go back to an example from the last lesson, where we ran a similarity search on the vector database with a question.

We can look at the first two documents, just the first few characters, and see that they are identical.

In [20]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [21]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [22]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [23]:
question_chinese = "Matplotlib是什么？"
docs_ss_chinese = vectordb_chinese.similarity_search(question_chinese,k=3)

In [24]:
docs_ss_chinese[0].page_content[:100]

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

In [25]:
docs_ss_chinese[1].page_content[:100]

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

Note how the results differ when using `MMR`.

In [26]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [27]:
docs_mmr_chinese = vectordb_chinese.max_marginal_relevance_search(question_chinese,k=3)

After running MMR, we can see that the first result is the same as before, because it is the most similar document.

In [28]:
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [29]:
docs_mmr_chinese[0].page_content[:100]

'第⼀回：Matplotlib 初相识\n⼀、认识matplotlib\nMatplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种'

But when we get to the second one, we can see that it is different.

It gets some diversity in the responses.

In [30]:
docs_mmr[1].page_content[:100]

'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

In [31]:
docs_mmr_chinese[1].page_content[:100]

'By Datawhale 数据可视化开源⼩组\n© Copyright © Copyright 2021.y轴分为左右两个，因此 tick1 对应左侧的轴； tick2 对应右侧的轴。\nx轴分为上下两个'

### 1.3 Solving Specificity: Using Metadata

In the previous lesson, we showed a problem where we asked a question about a particular chapter in a document, but the results we got also included results from other chapters.

To solve this problem, many vector databases support operations on `metadata`.

`metadata` provides context for each embedded chunk.

In [32]:
question = "what did they say about regression in the third lecture?"

In [33]:
question_chinese = "他们在第二讲中对Figure说了些什么？"  

Now, we will solve this problem manually by passing a metadata filter through the `filter` argument.

In [34]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [35]:
docs_chinese = vectordb_chinese.similarity_search(
    question_chinese,
    k=3,
    filter={"source":"docs/matplotlib/第二回：艺术画笔见乾坤.pdf"}
)

Next, we can see that the results all come from the corresponding chapters.

In [36]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}


In [37]:
for d in docs_chinese:
    print(d.metadata)
    

{'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf', 'page': 9}
{'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf', 'page': 10}
{'source': 'docs/matplotlib/第二回：艺术画笔见乾坤.pdf', 'page': 0}
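
As a rough mental model of what the `filter` argument does, the sketch below keeps only chunks whose metadata matches every requested key and value before any ranking happens. The chunk dictionaries, the shortened file names, and the `filter_by_metadata` helper are all hypothetical and exist only for illustration; a real vector store applies the filter internally.

```python
def filter_by_metadata(chunks, where):
    """Keep only chunks whose metadata matches every key/value pair in `where`."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]

# Hypothetical chunks standing in for stored documents.
toy_chunks = [
    {"text": "intro to regression", "metadata": {"source": "Lecture01.pdf", "page": 3}},
    {"text": "regression revisited", "metadata": {"source": "Lecture03.pdf", "page": 0}},
    {"text": "locally weighted regression", "metadata": {"source": "Lecture03.pdf", "page": 4}},
]

lecture3_only = filter_by_metadata(toy_chunks, {"source": "Lecture03.pdf"})
for chunk in lecture3_only:
    print(chunk["metadata"])
```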


Of course, specifying the filter by hand for every query does not scale.

The next section shows how to solve this problem using an LLM.

### 1.4 Solving Specificity: Using a Self-Query Retriever to Infer Metadata

We have an interesting challenge: we often want to infer metadata from the query itself.

To solve this problem, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The query string for the vector search, i.e., the question

2. The metadata filter to pass in

Most vector databases support metadata filters, so no new database or index is required.
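
The two outputs listed above can be pictured with a toy stand-in. In the sketch below a regular expression plays the role that the LLM plays in `SelfQueryRetriever`; the `toy_self_query` function and the `ORDINAL_TO_SOURCE` mapping are hypothetical and handle only the phrasing used in this chapter's example question.

```python
import re

# Hypothetical mapping from ordinal words to lecture files (illustration only).
ORDINAL_TO_SOURCE = {
    "first": "docs/cs229_lectures/MachineLearning-Lecture01.pdf",
    "second": "docs/cs229_lectures/MachineLearning-Lecture02.pdf",
    "third": "docs/cs229_lectures/MachineLearning-Lecture03.pdf",
}

def toy_self_query(question):
    """Split a question into (search query, metadata filter).

    A regex stands in for the LLM here, so only one phrasing is recognized.
    """
    match = re.search(r"\bin the (first|second|third) lecture\b", question)
    if match is None:
        # No lecture mentioned: search with the full question and no filter.
        return question, {}
    # Remove the lecture phrase from the text used for the vector search.
    query = (question[:match.start()] + question[match.end():]).strip(" ?")
    return query, {"source": ORDINAL_TO_SOURCE[match.group(1)]}

query, where = toy_self_query("what did they say about regression in the third lecture?")
print(query)
print(where)
```

An LLM is used in practice precisely because hand-written rules like this break as soon as the user phrases the question differently.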

In [38]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [39]:
llm = OpenAI(temperature=0)

`AttributeInfo` is where we can specify the different fields in the metadata and what they correspond to.

In the metadata, we only have two fields, `source` and `page`.

We will fill in the name, description, and type of each attribute.

This information is passed to the LLM, so it should be as descriptive as possible.

In [40]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [41]:
metadata_field_info_chinese = [
    AttributeInfo(
        name="source",
        description="讲义来源于 `docs/matplotlib/第一回：Matplotlib初相识.pdf`, `docs/matplotlib/第二回：艺术画笔见乾坤.pdf`, or `docs/matplotlib/第三回：布局格式定方圆.pdf` 的其中之一",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="讲义的那一页",
        type="integer",
    ),
]

In [42]:
document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [43]:
document_content_description_chinese = "课堂讲义"
retriever_chinese = SelfQueryRetriever.from_llm(
    llm,
    vectordb_chinese,
    document_content_description_chinese,
    metadata_field_info_chinese,
    verbose=True
)

In [44]:
question = "what did they say about regression in the third lecture?"

In [45]:
question_chinese = "他们在第二讲中对Figure说了些什么？"  

When you first execute the next line, you will get a warning that `predict_and_parse` is deprecated. This can be safely ignored.

In [46]:
docs = retriever.get_relevant_documents(question)



query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None


In [47]:
docs_chinese = retriever_chinese.get_relevant_documents(question_chinese)

query='Figure' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/matplotlib/第二讲：艺术画解破.pdf') limit=None


In [48]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}


In [49]:
for d in docs_chinese:
    print(d.metadata)

No metadata is printed here. A likely reason is that the filter value the LLM generated (`docs/matplotlib/第二讲：艺术画解破.pdf`) does not match the stored `source` (`docs/matplotlib/第二回：艺术画笔见乾坤.pdf`), so the equality filter matches no documents.

### 1.5 Other Techniques: Compression

Another way to improve the quality of retrieved documents is compression.

The information most relevant to the query may be hidden in a document with a lot of irrelevant text.

Passing full documents through the application may result in more expensive LLM calls and poorer responses.

Contextual compression is designed to solve this problem.
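
The idea can be sketched without an LLM: keep only the parts of a retrieved document that relate to the question. In the toy version below, a simple keyword match stands in for the LLM-based extraction; `toy_compress`, the stopword list, and `toy_document` are hypothetical and far cruder than a real compressor.

```python
import re

def toy_compress(document, question):
    """Keep only the sentences that share a keyword with the question."""
    stopwords = {"what", "did", "they", "say", "about", "the", "a", "is", "in"}
    keywords = {
        word for word in re.findall(r"\w+", question.lower())
        if word not in stopwords
    }
    # Split into sentences at whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if keywords & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)

# Hypothetical document with one relevant sentence among irrelevant ones.
toy_document = (
    "Homework is due every two weeks. "
    "Assignments will be done in MATLAB or Octave. "
    "Office hours are on Fridays."
)

print(toy_compress(toy_document, "what did they say about matlab?"))
```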

In [50]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [51]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [52]:
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)  # the compressor

In [53]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [54]:
compression_retriever_chinese = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb_chinese.as_retriever()
)

In [55]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one o

In [56]:
question_chinese = "Matplotlib是什么？"
compressed_docs_chinese = compression_retriever_chinese.get_relevant_documents(question_chinese)
pretty_print_docs(compressed_docs_chinese)

Document 1:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。
----------------------------------------------------------------------------------------------------
Document 2:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。


Now, when we ask the question and look at the resulting documents, we can see two things:

1. They are much shorter than the original documents.

2. There are still some duplicates, because we are still using a semantic search algorithm under the hood.

This is the problem we solved earlier in this course using MMR.

This is a great example of how you can combine various techniques to get the best possible result.

## 2. Combining various techniques

To combine compression with MMR, we can set the search type to MMR when creating the retriever from the vector database.

We can then rerun the process and see that we are returned a filtered result set that does not contain any duplicate information.

In [57]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [58]:
compression_retriever_chinese = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb_chinese.as_retriever(search_type = "mmr")
)

In [59]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."


In [60]:
question_chinese = "Matplotlib是什么？"
compressed_docs_chinese = compression_retriever_chinese.get_relevant_documents(question_chinese)
pretty_print_docs(compressed_docs_chinese)

Document 1:

Matplotlib 是⼀个 Python 2D 绘图库，能够以多种硬拷⻉格式和跨平台的交互式环境⽣成出版物质量的图形，⽤来绘制各种静态，动态，交互式的图表。


## 3. Other types of retrieval

It is worth noting that a vector database is not the only tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways of retrieving documents, such as `TF-IDF` or `SVM`.
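
For intuition about how a TF-IDF retriever differs from embedding-based search, the sketch below ranks texts by the TF-IDF weight of the query terms they contain, using only the standard library. The `tfidf_rank` function and `toy_texts` are hypothetical, and the weighting formula is one common variant assumed for illustration; the vectorizer and weighting that `TFIDFRetriever` uses may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def tfidf_rank(query, texts):
    """Return text indices ordered from best to worst TF-IDF match for the query."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / len(tokens)  # term frequency in this text
                idf = math.log(n_docs / doc_freq[term]) + 1.0  # rarer terms weigh more
                total += tf * idf
        return total

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical texts standing in for document chunks.
toy_texts = [
    "the homework will be done in matlab or octave",
    "the class covers supervised learning and learning theory",
    "the project is done with partners",
]

ranking = tfidf_rank("matlab homework", toy_texts)
print(ranking[0])
```

Because this approach matches on exact words rather than meaning, a query phrased with synonyms can miss relevant chunks that an embedding-based search would find.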

In [61]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [62]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [77]:
# Load PDF
loader_chinese = PyPDFLoader("docs/matplotlib/第一回：Matplotlib初相识.pdf")
pages_chinese = loader_chinese.load()
all_page_text_chinese = [p.page_content for p in pages_chinese]
joined_page_text_chinese = " ".join(all_page_text_chinese)

# Split text
text_splitter_chinese = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits_chinese = text_splitter_chinese.split_text(joined_page_text_chinese)

In [64]:
# Search
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [66]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={})

In [67]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste

In [78]:
# Chinese Version
svm_retriever_chinese = SVMRetriever.from_texts(splits_chinese, embedding)
tfidf_retriever_chinese = TFIDFRetriever.from_texts(splits_chinese)

In [79]:
question_chinese = "这门课的主要主题是什么？" 
docs_svm_chinese = svm_retriever_chinese.get_relevant_documents(question_chinese)
docs_svm_chinese[0]

Document(page_content='fig, ax = plt.subplots()  \n# step4 绘制图像，  这⼀模块的扩展参考第⼆章进⼀步学习\nax.plot(x, y, label=\'linear\')  \n# step5 添加标签，⽂字和图例，这⼀模块的扩展参考第四章进⼀步学习\nax.set_xlabel(\'x label\') \nax.set_ylabel(\'y label\') \nax.set_title("Simple Plot")  \nax.legend() ;\n思考题\n请思考两种绘图模式的优缺点和各⾃适合的使⽤场景\n在第五节绘图模板中我们是以 OO 模式作为例⼦展示的，请思考并写⼀个 pyplot 绘图模式的简单模板', metadata={})

In [80]:
question_chinese = "Matplotlib是什么？"
docs_tfidf_chinese = tfidf_retriever_chinese.get_relevant_documents(question_chinese)
docs_tfidf_chinese[0]

Document(page_content='fig, ax = plt.subplots()  \n# step4 绘制图像，  这⼀模块的扩展参考第⼆章进⼀步学习\nax.plot(x, y, label=\'linear\')  \n# step5 添加标签，⽂字和图例，这⼀模块的扩展参考第四章进⼀步学习\nax.set_xlabel(\'x label\') \nax.set_ylabel(\'y label\') \nax.set_title("Simple Plot")  \nax.legend() ;\n思考题\n请思考两种绘图模式的优缺点和各⾃适合的使⽤场景\n在第五节绘图模板中我们是以 OO 模式作为例⼦展示的，请思考并写⼀个 pyplot 绘图模式的简单模板', metadata={})