# Retrieval
In the previous section, we covered the basics of semantic search and saw that it works well for many use cases. We also saw cases where it does not work well and how things can go wrong.

This time we take a closer look at retrieval and at ways to mitigate those failure cases.

## Vectorstore retrieval

<img src="fig10.png" width="600">

* Accessing and indexing data in the vectorstore
  - Basic semantic similarity
  - Maximum marginal relevance (MMR)
  - Including metadata

* LLM-aided retrieval

Retrieval takes an incoming query and finds, among the split chunks, the ones most relevant to that query.

In [3]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

## Addressing Diversity: Maximum marginal relevance

The idea behind MMR (Maximum Marginal Relevance) is that if you always fetch the documents most similar to the query in embedding space, you can miss diverse information, as we saw in one of the edge cases.

<img src="fig11.png" width="450">

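Consider the example of a chef asking about all-white mushrooms.
The most similar results are the first two documents, which contain a lot of information matching the query: a fruiting body and being all white.

To also get information such as whether the mushroom is poisonous, using MMR is important because it selects a diverse set of documents.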

The basic idea of MMR is:

* Query the vectorstore
* Select the 'fetch_k' most similar responses
* From those responses, choose the 'k' most diverse

<img src="fig12.png" width="350">
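The selection steps of MMR can be sketched in plain Python. This is a minimal illustration, not LangChain's or Chroma's implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine`, `top_k_similar`, `mmr_select`, and the `lambda_mult` weight are names assumed for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, doc_vecs, k):
    """Indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Pick k indices from the fetch_k most similar, trading relevance for diversity."""
    candidates = top_k_similar(query_vec, doc_vecs, fetch_k)
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors (assumed for illustration): documents 0 and 1 are near-duplicates,
# document 2 is less similar to the query but covers different content.
query_vec = [1.0, 1.0, 0.0]
doc_vecs = [[1.0, 0.0, 0.0], [1.0, 0.05, 0.0], [0.0, 1.0, 1.0]]

similar = top_k_similar(query_vec, doc_vecs, k=2)
diverse = mmr_select(query_vec, doc_vecs, k=2, fetch_k=3)
print("similarity:", similar)
print("mmr:", diverse)
```

With these toy vectors, plain similarity search returns the two near-duplicates, while the MMR sketch keeps the best match and swaps the duplicate for the more distinct document.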

In [5]:
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

In [6]:
embedding = OpenAIEmbeddings()

In [7]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [8]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [9]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [10]:
smalldb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [11]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [112]:
persist_directory = 'docs/chroma/'

vectordb = Chroma(
    embedding_function=embedding,
    persist_directory=persist_directory
)

In [15]:
vectordb._collection.count()

208

In [16]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

In [17]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [18]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

Note the difference in the results when using MMR.

In [20]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [21]:
docs_mmr[0].page_content[:100]


'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [135]:
docs_mmr[1].page_content[:700]


'into his office and he said, "Oh, professor, professor, thank you so much for your \nmachine learning class. I learned so much from it. There\'s this stuff that I learned in your \nclass, and I now use every day. And it\'s helped me make lots of money, and here\'s a \npicture of my big house."  \nSo my friend was very excited. He said, "Wow. That\'s great. I\'m glad to hear this \nmachine learning stuff was actually useful. So what was it that you learned? Was it \nlogistic regression? Was it the PCA? Was it the data networks? What was it that you \nlearned that was so helpful?" And the student said, "Oh, it was the MATLAB."  \nSo for those of you that don\'t know MATLAB yet, I hope you do learn it. It\'s '

## Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can also return results from other lectures.

To address this, many vectorstores support operations on metadata.

Metadata provides context for each embedded chunk.
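As a rough mental model, a metadata filter keeps only the chunks whose metadata matches the requested values, so similarity ranking sees nothing from other lectures. The sketch below is an assumption-laden illustration rather than how any particular vectorstore implements filtering: the chunk texts, the short file names, and the helpers `matches` and `filter_chunks` are all hypothetical.

```python
# Hypothetical chunks; the texts and file names are made up for illustration.
chunks = [
    {"text": "Welcome to the class.", "metadata": {"source": "lecture01.pdf", "page": 0}},
    {"text": "Linear regression fits a line.", "metadata": {"source": "lecture02.pdf", "page": 3}},
    {"text": "Locally weighted regression.", "metadata": {"source": "lecture03.pdf", "page": 0}},
    {"text": "Logistic regression for classification.", "metadata": {"source": "lecture03.pdf", "page": 4}},
]


def matches(metadata, where):
    """True when every key in the filter has exactly the requested value."""
    return all(metadata.get(key) == value for key, value in where.items())


def filter_chunks(chunks, where):
    """Keep only the chunks whose metadata satisfies the exact-match filter."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


third_lecture = filter_chunks(chunks, {"source": "lecture03.pdf"})
for chunk in third_lecture:
    print(chunk["metadata"])
```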

In [24]:
question = "what did they say about regression in the third lecture?"

In [25]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [26]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 4, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


## Addressing Specificity: working with metadata using self-query retriever

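Another type of retrieval is self-query.
It is useful when a question contains not only the content to look up semantically but also a reference to metadata to filter on.

For example, the question "What are some movies about aliens made in 1980?" really has two components.

The semantic part is "aliens": we want to search the movie database for aliens.
But there is also a part that refers to metadata about each movie, namely that the year should be 1980.

What we can do is use the language model itself to split the original question into two separate things: a filter and a search term.

Most vectorstores support metadata filters, so it is easy to filter records based on metadata such as the year being 1980.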

To address this, we can use SelfQueryRetriever, which uses an LLM to extract:

* The query string to use for vector search
* A metadata filter to pass along

Most vector databases support metadata filters, so this requires no new databases or indexes.

<img src="fig13.png" width="350">
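To make the split concrete, here is a toy stand-in for the LLM step that handles only the movie example. It is a sketch under stated assumptions: a regular expression plays the role of the language model, and `StructuredQuery`, `split_question`, and the `year` metadata field are hypothetical names, not part of LangChain.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """The two pieces a self-query step produces: a search string and a filter."""
    query: str
    filter: dict = field(default_factory=dict)


def split_question(question: str) -> StructuredQuery:
    """Toy replacement for the LLM: move a four-digit year into a metadata filter."""
    match = re.search(r"\b(?:19|20)\d{2}\b", question)
    if match is None:
        # No metadata reference found, so the whole question is the search string.
        return StructuredQuery(query=question)
    year = match.group(0)
    # Drop the phrase that mentions the year, e.g. " made in 1980".
    remaining = re.sub(r"\s*(?:made\s+)?in\s+" + year, "", question)
    return StructuredQuery(query=remaining.strip(), filter={"year": int(year)})


structured = split_question("What are some movies about aliens made in 1980?")
print(structured.query)
print(structured.filter)
```

A real self-query retriever delegates this decision to the language model, which is why it needs a description of each metadata field.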

In [139]:
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [143]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [145]:
#!pip install lark

In [147]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [149]:
question = "what did they say about regression in the third lecture?"

In [151]:
docs = retriever.get_relevant_documents(question)

In [153]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 5, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


## Additional tricks: compression

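Another way to improve the quality of retrieved documents is compression.

Compression can be useful for extracting only the most relevant parts of the retrieved passages.

For example, when you ask a question, you get back the entire stored document, even if only the first one or two sentences are relevant.

With compression, you run every document through a language model, extract the most relevant segments, and pass only those segments to the final language model call.

This comes at the cost of more language model calls, but it focuses the final answer on only the most important material, so there is a trade-off.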

<img src="fig14.png" width="300">
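The shape of the compression step can be sketched without an LLM. In this illustration a simple keyword-overlap check stands in for the language model that decides which sentences are relevant, so it shows the data flow rather than the quality of a real extractor. The function `compress`, the `STOPWORDS` set, and the sample document are assumptions made for this sketch.

```python
import re

# Small hand-picked stopword list, assumed for this sketch only.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "in", "is", "to"}


def words(text):
    """Lowercased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def compress(document: str, question: str) -> str:
    """Keep only the sentences that share a non-stopword term with the question."""
    terms = words(question) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [sentence for sentence in sentences if terms & words(sentence)]
    return " ".join(kept)


document = (
    "Homeworks will be done in MATLAB or Octave. "
    "The class meets twice a week. "
    "Octave is a free alternative to MATLAB."
)
question = "what did they say about matlab?"
compressed = compress(document, question)
print(compressed)
```

Only the compressed text would then be passed to the final language model call, which is where the savings in context length come from.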

In [98]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [100]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [39]:
# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [40]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [41]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is

## Combining various techniques

In [43]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [44]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------


## Other types of retrieval

It is worth noting that vectordb is not the only tool for retrieving documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
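To show what a term-based retriever does differently from embedding search, here is a simplified TF-IDF ranking in plain Python. It is a sketch, not the scoring used by `TFIDFRetriever`: it sums smoothed TF-IDF weights of the query terms instead of comparing full TF-IDF vectors, and `tokenize`, `tfidf_rank`, and the sample texts are assumed for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices ordered by a simple TF-IDF score for the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency; rare terms weigh more.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        scores.append(sum(counts[t] / length * idf(t) for t in query_terms))
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


docs = [
    "Homeworks will be done in MATLAB or Octave.",
    "The class has a home page with lecture notes.",
    "There is a newsgroup for the class.",
]
ranking = tfidf_rank("what did they say about matlab?", docs)
print(docs[ranking[0]])
```

Because the score depends on exact word matches, this kind of retriever finds the MATLAB chunk without any embeddings, but it would miss a paraphrase that never uses the word.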

In [46]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import warnings

warnings.filterwarnings("ignore")

In [47]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

In [48]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [49]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question, dual=True)
docs_svm[0]

Document(metadata={}, page_content="Testing, testing. Okay, cool. Thanks.   So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, and I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the technical content of this class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in the class to ge

In [50]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(metadata={}, page_content="yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas \nwith us.  \nOkay. So one more organizational question. I'm curious, how many of you know \nMATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you \nknow Octave or have used Octave? Oh, okay, much smaller number.  \nSo as part of this class, especially in the homeworks, we'll ask you to implement a few \nprograms, a few machine learning algorithms as part of the homeworks. And most of  those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn't.  \nSo I guess for those of you that haven't seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's 