# Retrieval
Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

## Vectorstore retrieval

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

In [None]:
# !pip install lark

## Similarity Search

In [18]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [19]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

ValueError: You are using a deprecated configuration of Chroma. Please pip install chroma-migrate and run `chroma-migrate` to upgrade your configuration. See https://docs.trychroma.com/migration for more information or join our discord at https://discord.gg/8g5FESbj for help!

In [7]:
print(vectordb._collection.count())

AttributeError: 'SegmentAPI' object has no attribute '_collection'

In [8]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [20]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [21]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [22]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [23]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

## Addressing Diversity: Maximum marginal relevance

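In the last lesson we introduced one problem: how to enforce diversity in the search results.

Maximum marginal relevance (MMR) strives to achieve both relevance to the query and diversity among the results.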
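To make the trade-off concrete, here is a minimal pure-Python sketch of greedy MMR selection. It is an illustration, not LangChain's implementation: the function names, the toy 2-D vectors, and the choice of `lambda_mult=0.3` are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # Relevance: similarity to the query.
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors (assumed): documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

similarity_only = mmr(query, docs, k=2, lambda_mult=1.0)  # pure relevance ranking
picked = mmr(query, docs, k=2, lambda_mult=0.3)           # favors diversity
print(similarity_only, picked)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns both near-duplicates; lowering it penalizes the second near-duplicate enough that the more distinct document is chosen instead.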

In [24]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

AttributeError: 'SegmentAPI' object has no attribute 'similarity_search'

In [25]:
docs_ss[0].page_content[:100]

NameError: name 'docs_ss' is not defined

In [26]:
docs_ss[1].page_content[:100]


NameError: name 'docs_ss' is not defined

Note the difference in results with MMR.



In [27]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)


AttributeError: 'SegmentAPI' object has no attribute 'max_marginal_relevance_search'

In [28]:
docs_mmr[0].page_content[:100]


NameError: name 'docs_mmr' is not defined

In [29]:
docs_mmr[1].page_content[:100]


NameError: name 'docs_mmr' is not defined

## Addressing Specificity: working with metadata

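In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on metadata.

Metadata provides context for each embedded chunk.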
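As a rough sketch of the idea, the snippet below filters chunks on their metadata before ranking them. Everything here is assumed for illustration: the chunk dictionaries, the shortened file names, and the word-overlap score stand in for a real vectorstore and its embedding similarity.

```python
def matches(metadata, flt):
    """True when every key in the filter equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in flt.items())

def filtered_search(chunks, score_fn, flt, k=3):
    """Keep chunks whose metadata satisfies the filter, then rank by score."""
    allowed = [c for c in chunks if matches(c["metadata"], flt)]
    return sorted(allowed, key=score_fn, reverse=True)[:k]

# Hypothetical chunks; the file names are shortened placeholders.
chunks = [
    {"text": "regression with least squares",
     "metadata": {"source": "Lecture03.pdf", "page": 0}},
    {"text": "regression was mentioned briefly",
     "metadata": {"source": "Lecture01.pdf", "page": 8}},
    {"text": "locally weighted regression",
     "metadata": {"source": "Lecture03.pdf", "page": 4}},
]

query_words = {"regression", "least", "squares"}

def overlap(chunk):
    """Toy relevance score: number of query words present in the chunk."""
    return len(query_words & set(chunk["text"].split()))

results = filtered_search(chunks, overlap, {"source": "Lecture03.pdf"}, k=3)
for r in results:
    print(r["metadata"])
```

The chunk from the first lecture also mentions regression, but the filter removes it before any ranking takes place.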

In [30]:
question = "what did they say about regression in the third lecture?"

In [31]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

AttributeError: 'SegmentAPI' object has no attribute 'similarity_search'

In [33]:
for d in docs:
    print(d.metadata)

NameError: name 'docs' is not defined

## Addressing Specificity: working with metadata using self-query retriever

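But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The query string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.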
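The split into a query string and a metadata filter can be illustrated with a small rule-based stand-in. This is only a sketch: `SelfQueryRetriever` uses an LLM for this step, whereas the hypothetical `split_question` helper below relies on a hard-coded regular expression that recognizes just three ordinals.

```python
import re

# Hypothetical helper data: only these three ordinals are recognized.
ORDINALS = {"first": 1, "second": 2, "third": 3}
LECTURE_PATTERN = re.compile(r"\s*\bin the (first|second|third) lecture\b")

def split_question(question):
    """Return (query_string, metadata_filter) for a natural-language question."""
    match = LECTURE_PATTERN.search(question)
    if not match:
        # No lecture mentioned: search everything with the question as-is.
        return question, {}
    number = ORDINALS[match.group(1)]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Remove the lecture reference so only the semantic part is embedded.
    query = LECTURE_PATTERN.sub("", question).strip()
    return query, {"source": source}

query, flt = split_question(
    "what did they say about regression in the third lecture?"
)
print(query)
print(flt)
```

The resulting filter has the same shape as the `filter=` argument passed to `similarity_search` by hand in the previous section.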

In [34]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [35]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [36]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

ValueError: Self query retriever with Vector Store type <class 'chromadb.api.segment.SegmentAPI'> not supported.

In [37]:
question = "what did they say about regression in the third lecture?"

You will receive a warning about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [38]:
docs = retriever.get_relevant_documents(question)

NameError: name 'retriever' is not defined

In [39]:
for d in docs:
    print(d.metadata)

NameError: name 'docs' is not defined

## Additional tricks: compression

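Another approach for improving the quality of retrieved documents is compression.

The information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.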
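The effect of compression can be sketched without an LLM by keeping only the sentences that share a keyword with the question. This keyword match is an assumed, much cruder substitute for an LLM-based extractor, and the sample document and stopword list are made up for the example.

```python
import re

# Assumed stopword list, just large enough for this example.
STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the"})

def words(text):
    """Lowercase alphabetic tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())

def compress(document, question):
    """Keep only the sentences that share a non-stopword with the question."""
    keywords = {w for w in words(question) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords & set(words(s))]
    return " ".join(kept)

# Made-up document with one relevant sentence among irrelevant ones.
document = (
    "Homework is due on Friday. "
    "We will use MATLAB for the programming exercises. "
    "Office hours are posted online."
)
compressed = compress(document, "what did they say about matlab?")
print(compressed)
```

Only the relevant sentence is passed on, so a downstream LLM call would see far fewer tokens than the full document.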

In [40]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [41]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [42]:
# Build an LLM-based compressor to wrap around our vectorstore retriever
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [43]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

AttributeError: 'SegmentAPI' object has no attribute 'as_retriever'

In [44]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

NameError: name 'compression_retriever' is not defined

## Combining various techniques

In [45]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

AttributeError: 'SegmentAPI' object has no attribute 'as_retriever'

In [46]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

NameError: name 'compression_retriever' is not defined

## Other types of retrieval

It's worth noting that a vectordb is not the only kind of tool for retrieving documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
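For intuition about the TF-IDF option, the sketch below ranks a few toy documents with a plain term-frequency times inverse-document-frequency score. It is a simplified illustration under assumed inputs; production implementations typically add smoothing and vector normalization, so their scores will differ.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()

def tfidf_rank(query, docs):
    """Return document indices sorted by TF-IDF score for the query, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        tf = Counter(tokens)
        return sum(
            (tf[term] / len(tokens)) * math.log(n / df[term])
            for term in tokenize(query)
            if term in tf
        )

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Made-up documents for illustration.
docs = [
    "we will use matlab for the homework",
    "study groups help with the homework",
    "the class covers supervised learning",
]
ranking = tfidf_rank("matlab homework", docs)
print(ranking)
```

Because the score depends on exact word matches rather than embeddings, a document is only retrieved when it literally contains a query term.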

In [47]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [48]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

In [49]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)



In [50]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={})

In [51]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste