# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's get our vectorDB from before.

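### Course supplement: retrieval method, maximum marginal relevance (MMR)
- MMR returns not just the "closest" answers but a varied set (both relevant and diverse), because similarity search returns the most relevant results, which may all repeat the same information.
- Formula:
    - `MMR = λ * Similarity(query, doc) - (1-λ) * max(Similarity(doc, selected_docs))`
    - λ: a number between 0 and 1 that decides whether relevance or diversity matters more.
    - The formula is essentially "relevance" minus "redundancy with what is already selected", so λ closer to 1 favors relevance and λ closer to 0 favors diversity.
    - query: the question's vector; doc: the vector of a candidate answer (usually there are several). The higher Similarity(query, doc) is, the more relevant the answer is to the question (for example, measured by a dot product).
    - How it runs:
        - 1. First select the fetch_k most similar answers (LangChain's default fetch_k is 20).
        - 2. Start the MMR selection. Round one:
                ```python
                selected_docs = []  # nothing has been selected yet

                # Compute the MMR score of every candidate document
                MMR_A = λ * Sim(query, A) - (1-λ) * 0  # no selected documents, so max = 0
                MMR_B = λ * Sim(query, B) - (1-λ) * 0
                MMR_C = λ * Sim(query, C) - (1-λ) * 0
                MMR_D = λ * Sim(query, D) - (1-λ) * 0
                MMR_E = λ * Sim(query, E) - (1-λ) * 0

                # Suppose A scores highest: select A
                selected_docs = [A]
                ```
        - 3. Round two, repeated until the number of selected documents reaches k:
                ```python
                selected_docs = [A]  # one document is now selected

                # Choose from the remaining candidates
                MMR_B = λ * Sim(query, B) - (1-λ) * Sim(B, A)
                MMR_C = λ * Sim(query, C) - (1-λ) * Sim(C, A)
                MMR_D = λ * Sim(query, D) - (1-λ) * Sim(D, A)
                MMR_E = λ * Sim(query, E) - (1-λ) * Sim(E, A)

                # Suppose C scores highest: select C
                selected_docs = [A, C]
                ```
- Document retrieval flow (relevance first, then MMR):
    - First find a pool of relevant candidate documents
    - Take the top fetch_k most relevant of them
    - Use MMR to pick the final k from those fetch_k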
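
The rounds above can be sketched as a small greedy loop. This is a minimal illustration in plain Python, not LangChain's implementation: it assumes cosine similarity, the toy vectors are invented for the example, and the parameter name `lambda_mult` is chosen to mirror the λ in the formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen greedily by MMR score."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is 0 while nothing has been selected (round one).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors (assumed for illustration): B is an exact duplicate of A.
query = [1.0, 1.0, 0.0]
docs = [
    [1.0, 0.2, 0.0],  # A: most relevant
    [1.0, 0.2, 0.0],  # B: duplicate of A
    [0.0, 1.0, 0.0],  # C: less relevant, but different from A
]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=0.5` the duplicate B is penalized for being identical to A, so C is picked second; with `lambda_mult=1.0` the redundancy term vanishes and the result matches plain similarity ranking.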

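### Course supplement: retrieval method, LLM-aided retrieval
- Definition: letting an LLM help make retrieval smarter
    - Traditional retrieval: user question → vectorize → similarity search → return documents
    - LLM retrieval: user question → LLM understands and rewrites it → search from several angles → LLM filters → return documents
- Advantages:
    - Understands the intent behind the user's question more deeply
    - Goes beyond traditional search, which only finds similar wording
- Some ways to implement it:
    - question expansion: expand one user question into n questions to help find a more complete answer
    - query rewriting: rewrite a vague question into a precise query better suited to searching the database
    - multi-step reasoning: let the LLM break down a complex question and retrieve step by step
    - SelfQuery (the course's example): splits the question into search content and filter conditions. For example, "What movies about aliens were made in 1980?" becomes the query "aliens" plus the filter `eq('year', 1980)`, which improves precision
    - [Advanced] HyDE (Hypothetical Document Embeddings): generate a hypothetical document, then use it to retrieve
    - [Advanced] Self-RAG (self-reflective retrieval): after retrieving, reflect on whether more information is still needed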
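
As a rough sketch of question expansion, the snippet below searches with several variants of one question and merges the unique hits. Everything in it is a stand-in: `expand_question` is a hypothetical placeholder that returns hard-coded rewrites where a real system would call an LLM, and `keyword_search` is a toy word-overlap retriever rather than a vector store.

```python
def expand_question(question):
    """Hypothetical stand-in for an LLM call that rephrases the question."""
    return [
        question,
        "MATLAB homework requirements",
        "Octave as a free alternative",
    ]

def keyword_search(query, corpus, k=2):
    """Toy retriever: rank documents by the number of shared lowercase words."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def multi_query_retrieve(question, corpus, k=2):
    """Search with every variant and merge results, keeping first-seen order."""
    merged = []
    for variant in expand_question(question):
        for doc in keyword_search(variant, corpus, k):
            if doc not in merged:
                merged.append(doc)
    return merged

corpus = [
    "homework will be done in matlab or octave",
    "octave is a free alternative you can download",
    "form study groups and find project partners",
]
question = "what did they say about matlab"
print(keyword_search(question, corpus))        # the original question alone
print(multi_query_retrieve(question, corpus))  # the expanded variants, merged
```

The original question alone only reaches the first document; the expanded variants also surface the Octave document, which is the kind of wider coverage question expansion aims for.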

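### Course supplement: optimization strategy, compression
- Definition: a technique that compresses documents after retrieval
    - Flow: when retrieval returns too much text, an LLM compresses the documents down to the most relevant core information
    - Only the compressed essentials are passed to the model that generates the final answer
- Problems it addresses:
    - LLMs have a limited context window
    - Processing input near that limit is expensive
    - Noise (a full chunk may contain a lot of information unrelated to the question)
    - Low efficiency: the LLM has to read everything before it can answer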
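
To make the idea concrete, here is a toy extractive compressor. It is only a sketch under stated assumptions: it keeps sentences that share a non-stopword term with the question, whereas the course's `LLMChainExtractor` asks an LLM to do the extraction. The stopword list and the sample chunk are made up for the example.

```python
import re

STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a"})

def compress(document, question):
    """Keep only the sentences that share a non-stopword term with the question."""
    terms = {
        word for word in re.findall(r"[a-z]+", question.lower())
        if word not in STOPWORDS
    }
    sentences = re.split(r"(?<=[.?!])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if terms & set(re.findall(r"[a-z]+", sentence.lower()))
    ]
    return " ".join(kept)

chunk = (
    "Homeworks will be done in MATLAB or Octave. "
    "Please form study groups today. "
    "MATLAB makes it easy to write code using matrices."
)
compressed = compress(chunk, "what did they say about matlab?")
print(compressed)
```

The sentence about study groups is dropped, so the text passed on to the answering model is shorter and contains less noise.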

## Vectorstore retrieval


In [1]:
# Load the environment
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
#!pip install lark

### Similarity Search

In [3]:
# Load the embedding database (the chunks processed earlier are already stored in it)
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [4]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [5]:
# Check the count: the collection already holds 209 chunks
print(vectordb._collection.count())

209


In [6]:
# Next, a walkthrough of MMR; first create some mushroom-related texts
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [7]:
# Build a small mushroom database
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [8]:
# A mushroom-related question, asked by someone who wants to cook
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [None]:
# Start with a similarity search demo
smalldb.similarity_search(question, k=2)

Result:
- [Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]



Neither of the two returned results mentions that the mushroom is poisonous, which is dangerous.

In [None]:
# Next, an MMR search. Note the extra fetch_k parameter: 3 candidates are fetched first, then MMR picks 2 of them
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

Result:
- One of the two returned items has been replaced: the second result is now the text about the Death Cap being poisonous

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.

In [11]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [12]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [13]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

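##### Recap
In the previous lesson, we deliberately loaded a duplicate PDF into the database. With similarity search, the first two results came back identical, which shows that similarity search cannot deduplicate.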

Note the difference in results with `MMR`.

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [15]:
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [16]:
docs_mmr[1].page_content[:100]

'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

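##### Result with MMR
Because MMR avoids returning texts that are too similar to each other, the duplicated chunk is filtered out even though both copies are in the database, so the two results above differ. This shows that MMR can achieve deduplication to some extent.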

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can include results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.

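##### Recap
Another retrieval failure from the previous lesson: when asking for content from the third lecture, the results also included other lectures. This shows that plain vector search does not understand logical conditions.

So here we try to handle it by supplying a filter manually.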

In [17]:
# Same question as before
question = "what did they say about regression in the third lecture?"

In [18]:
# Manually add a filter
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"} # same vector query, now restricted by a metadata filter
)

In [19]:
# Inspect the metadata to verify the source: all results do come from the third lecture
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}
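
Conceptually, a metadata filter narrows the candidate set to matching chunks and then ranks what is left by similarity. The sketch below illustrates that idea with plain dictionaries; it is not how Chroma implements filtering, and the similarity scores and shortened file names are invented for the example.

```python
def filtered_search(scored_chunks, metadata_filter, k=3):
    """Return metadata of the top-k chunks matching every key/value in the filter.

    scored_chunks is a list of (similarity, metadata) pairs; the similarity
    values are made-up numbers standing in for real vector similarities.
    """
    matches = [
        (score, meta) for score, meta in scored_chunks
        if all(meta.get(key) == value for key, value in metadata_filter.items())
    ]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [meta for _, meta in matches[:k]]

chunks = [
    (0.91, {"source": "Lecture02.pdf", "page": 3}),
    (0.88, {"source": "Lecture03.pdf", "page": 0}),
    (0.80, {"source": "Lecture03.pdf", "page": 14}),
    (0.75, {"source": "Lecture01.pdf", "page": 8}),
]
hits = filtered_search(chunks, {"source": "Lecture03.pdf"})
for meta in hits:
    print(meta)
```

Without the filter, the highest-scoring chunk would come from the second lecture; with it, only third-lecture chunks can be returned.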


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
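
To show the shape of what gets extracted, here is a deliberately simple stand-in that uses a regular expression instead of an LLM. The names `StructuredQuery` and `parse_question` are hypothetical and exist only for this illustration; `SelfQueryRetriever` itself prompts an LLM with the metadata description and can produce far richer filters.

```python
import re
from dataclasses import dataclass
from typing import Optional

ORDINALS = {"first": 1, "second": 2, "third": 3}

@dataclass
class StructuredQuery:
    query: str                    # text to use for the vector search
    source_filter: Optional[str]  # metadata value to match, if one was found

def parse_question(question):
    """Rule-based stand-in for the LLM step: split off a lecture filter."""
    match = re.search(r"\bin the (first|second|third) lecture\b", question.lower())
    source = None
    text = question
    if match:
        number = ORDINALS[match.group(1)]
        source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
        # Remove the filter phrase so only the topic is left for vector search.
        text = question[:match.start()] + question[match.end():]
    return StructuredQuery(query=text.strip(" ?"), source_filter=source)

parsed = parse_question("what did they say about regression in the third lecture?")
print(parsed.query)
print(parsed.source_filter)
```

The two fields correspond to the two items listed above: a query string for the vector search and a metadata filter to pass alongside it.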

In [20]:
# Next is SelfQueryRetriever, which uses an LLM to infer the filter
# Import SelfQueryRetriever and AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# AttributeInfo: describes the metadata schema to the LLM

In [21]:
# Goal: teach the LLM the structure of the database, like handing it a data dictionary
metadata_field_info = [
    AttributeInfo(
        name="source", # field name
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`", # what this field means
        type="string", # data type
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

**Note:** The default model for `OpenAI` ("from langchain.llms import OpenAI") is `text-davinci-003`. Due to the deprecation of OpenAI's model `text-davinci-003` on 4 January 2024, you'll be using OpenAI's recommended replacement model `gpt-3.5-turbo-instruct` instead.

In [22]:
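document_content_description = "Lecture notes" # tells the LLM what the documents contain (what the database is about)
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, # the model configured above
    vectordb, # the vector database
    document_content_description, # the document description above
    metadata_field_info, # the metadata description built above
    verbose=True  # print the structured query the LLM produces, shown in the output below
)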

In [23]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about predict_and_parse being deprecated the first time you execute the next line. This can be safely ignored.

In [24]:
docs = retriever.get_relevant_documents(question)

query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None




In [25]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 

In [26]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [27]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [28]:
# Wrap our vectorstore
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

In [29]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [30]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write 

## Combining various techniques

In [31]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [32]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------
Document 3:

- learning algorithms to teach a car how to drive at reasonably high 

## Other types of retrieval

It's worth noting that a vectordb is not the only kind of tool to retrieve documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
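
For intuition about what a TF-IDF retriever does, here is a small pure-Python sketch that ranks documents by TF-IDF cosine similarity. It is an illustration only, not the code behind `TFIDFRetriever`: the whitespace tokenizer, the smoothed IDF variant, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def tfidf_rank(query, docs):
    """Return document indices sorted by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

    def vectorize(tokens):
        counts = Counter(t for t in tokens if t in idf)
        return {t: count * idf[t] for t, count in counts.items()}

    def cosine(u, v):
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
            sum(w * w for w in v.values())
        )
        return dot / norm if norm else 0.0

    q_vec = vectorize(tokenize(query))
    doc_vecs = [vectorize(tokens) for tokens in tokenized]
    return sorted(
        range(n_docs), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True
    )

docs = [
    "homeworks will be done in matlab or octave",
    "form study groups and find project partners",
    "octave is free and works for this class",
]
ranking = tfidf_rank("what did they say about matlab", docs)
print(docs[ranking[0]])
```

Unlike embedding-based search, this approach only matches exact terms: a query word that never appears in the corpus contributes nothing to the score.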

In [33]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [34]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [35]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [36]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]



Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={})

In [37]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste