### Lesson 5: Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

#### Vectorstore retrieval

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

#### Similarity Search

In [2]:
from langchain_community.embeddings.cohere import CohereEmbeddings
from langchain_community.vectorstores.chroma import Chroma

In [3]:
persist_directory = "./.chroma/"

In [4]:
embedding = CohereEmbeddings(model="embed-multilingual-light-v3.0")

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding,
)

In [5]:
vectordb._collection.count()

208

In [6]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [7]:
# in-memory
smalldb = Chroma.from_texts(texts=texts, embedding=embedding)

In [8]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [9]:
smalldb.similarity_search(query=question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [10]:
smalldb.max_marginal_relevance_search(query=question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

#### Addressing Diversity: Maximum marginal relevance

In the last lesson we introduced one problem: how to enforce diversity in the search results.

**Maximum marginal relevance** strives to achieve both relevance to the query and diversity among the results.
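The trade-off between relevance and diversity can be sketched without any dependencies. This is a minimal illustration, not how Chroma or LangChain implement `max_marginal_relevance_search`; the names `cosine`, `mmr_select`, and `lambda_mult`, as well as the 2-D vectors, are assumptions made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices by maximum marginal relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0.0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical embeddings: documents 0 and 1 are near-duplicates.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5))  # [0, 2]
```

With pure relevance ranking the two near-duplicates are both returned; with a balanced weight the second slot goes to the less redundant document.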

In [11]:
question = "What did they say about matlab?"

doc_ss = vectordb.similarity_search(query=question, k=3)

In [12]:
print(doc_ss[0].page_content[:100])

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people 


In [13]:
print(doc_ss[1].page_content[:100])

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people 


Now with MMR:

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(query=question, k=3)

In [15]:
print(docs_mmr[0].page_content[:100])

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people 


In [16]:
print(docs_mmr[1].page_content[:100])

into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learnin


#### Addressing Specificity: working with metadata

In the last lesson, we showed that a question about the third lecture can include results from other lectures as well.

To address this, many vectorstores support operations on **metadata**.

**Metadata** provides context for each embedded chunk.
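Conceptually, a metadata filter narrows the candidate set to chunks whose metadata matches before (or while) similarity ranking happens. The sketch below shows only that matching step on plain dictionaries; `filter_by_metadata`, the `toy_chunks` list, and the placeholder texts are hypothetical and do not reflect how any particular vectorstore stores or filters data.

```python
def filter_by_metadata(chunks, where):
    """Keep chunks whose metadata equals every key/value pair in `where`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]


# Placeholder chunks; only the metadata layout mirrors the notebook.
toy_chunks = [
    {
        "text": "placeholder chunk A",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 4},
    },
    {
        "text": "placeholder chunk B",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0},
    },
    {
        "text": "placeholder chunk C",
        "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 2},
    },
]

lecture3 = filter_by_metadata(
    toy_chunks,
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
)
print([chunk["text"] for chunk in lecture3])  # ['placeholder chunk B', 'placeholder chunk C']
```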

In [17]:
question = "what did they say about regression in the third lecture?"

In [18]:
docs = vectordb.similarity_search(
    query=question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
)

In [19]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


#### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The query string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
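To make the two extracted pieces concrete, here is a toy stand-in for the LLM step. It uses a regular expression instead of a language model, so it only handles one phrasing; `StructuredQuery`, `parse_question`, and the `ORDINALS` table are hypothetical names for this sketch and are not part of `SelfQueryRetriever`.

```python
import re
from dataclasses import dataclass, field

# Assumption: only the three lectures used in this lesson are recognized.
ORDINALS = {"first": 1, "second": 2, "third": 3}


@dataclass
class StructuredQuery:
    query: str  # text to embed for vector search
    filter: dict = field(default_factory=dict)  # metadata filter to pass along


def parse_question(question):
    """Split a question into a search string and a metadata filter."""
    match = re.search(
        r"\bin the (first|second|third) lecture\b", question, re.IGNORECASE
    )
    if not match:
        return StructuredQuery(query=question.strip())
    number = ORDINALS[match.group(1).lower()]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Drop the lecture phrase so it does not skew the similarity search.
    remaining = (question[: match.start()] + question[match.end():]).strip(" ?")
    return StructuredQuery(query=remaining, filter={"source": source})


parsed = parse_question("what did they say about regression in the third lecture?")
print(parsed.query)   # what did they say about regression
print(parsed.filter)  # {'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
```

An LLM-based extractor generalizes this idea to arbitrary phrasing, at the cost of occasionally producing a wrong or malformed filter.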

In [20]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.llms.cohere import Cohere

In [21]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description=(
            "The lecture the chunk is from, should be one of "
            "`docs/cs229_lectures/MachineLearning-Lecture01.pdf`, "
            "`docs/cs229_lectures/MachineLearning-Lecture02.pdf`, "
            "or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`"
        ),
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [23]:
document_content_description = "Lecture notes"

llm = Cohere(model="command-light", temperature=0.1)

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,
)

In [24]:
question = "what did they say about regression in the third lecture?"

In [25]:
docs = retriever.get_relevant_documents(query=question)

In [26]:
for d in docs:
    print(d.metadata)

{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


#### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
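The idea can be illustrated with a deliberately crude extractor that keeps only the sentences sharing a keyword with the question. This is a keyword heuristic standing in for the LLM, not what `LLMChainExtractor` does internally; `extract_relevant`, `STOPWORDS`, and the sample text are made up for this sketch.

```python
import re

# Assumption: a tiny hand-picked stopword list is enough for the demo.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "in", "of"}


def words(text):
    """Lowercased alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_relevant(text, question):
    """Return only the sentences that share a non-stopword with the question."""
    terms = {w for w in words(question) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if terms & set(words(s))]
    return " ".join(kept)


sample_text = (
    "The homeworks will be done in MATLAB or Octave. "
    "Study groups meet on Fridays. "
    "A short MATLAB tutorial will be offered."
)
print(extract_relevant(sample_text, "what did they say about matlab?"))
# The homeworks will be done in MATLAB or Octave. A short MATLAB tutorial will be offered.
```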

In [27]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [28]:
def pretty_print_docs(docs: list) -> None:
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [29]:
llm = Cohere(model="command-light", temperature=0.1)

compressor = LLMChainExtractor.from_llm(llm=llm)

In [30]:
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(),
)

In [31]:
question = "what did they say about matlab?"

compressed_docs = compressor_retriever.get_relevant_documents(query=question)



In [32]:
compressed_docs

[Document(page_content='I extracted the following:\n\n"So one day, he was in his office, and an old student of his from, like, ten years ago came into his office and he said, \\"Oh, professor, thank you so much for your..."\n\nThis is the part of the context that might be relevant to answering the question.', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}),
 Document(page_content='I extracted the following:\n\n"So one day, he was in his office, and an old student of his from, like, ten years ago came into his office and he said, \\"Oh, professor, thank you so much for your..."\n\nThis is the part of the context that might be relevant to answering the question.', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}),
 Document(page_content='The relevant parts from the context are:\n\n- They talked about machine learning and how it helped with his work, and how he uses it daily. \n\n- They also mentioned a picture of his

In [33]:
pretty_print_docs(compressed_docs)

Document 1:

I extracted the following:

"So one day, he was in his office, and an old student of his from, like, ten years ago came into his office and he said, \"Oh, professor, thank you so much for your..."

This is the part of the context that might be relevant to answering the question.
----------------------------------------------------------------------------------------------------
Document 2:

I extracted the following:

"So one day, he was in his office, and an old student of his from, like, ten years ago came into his office and he said, \"Oh, professor, thank you so much for your..."

This is the part of the context that might be relevant to answering the question.
----------------------------------------------------------------------------------------------------
Document 3:

The relevant parts from the context are:

- They talked about machine learning and how it helped with his work, and how he uses it daily. 

- They also mentioned a picture of his big house, which cou

#### Combining various techniques

In [34]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr"),
)

In [35]:
question = "what did they say about matlab?"

compressed_docs = compression_retriever.get_relevant_documents(
    query=question,
)



In [36]:
pretty_print_docs(compressed_docs)

Document 1:

I extracted the following:

"So one day, he was in his office, and an old student of his from, like, ten years ago came into his office and he said, \"Oh, professor, thank you so much for your..."

This is the part of the context that might be relevant to answering the question.
----------------------------------------------------------------------------------------------------
Document 2:

The relevant parts of the context are:

- They are discussing a machine learning class and how the content was helpful. 

- They are talking about discussion sections and the purpose of them, which are to help students review topics they have not covered yet. 

- They are looking for a short MATLAB tutorial for students who are new to MATLAB.
----------------------------------------------------------------------------------------------------
Document 3:

The relevant part of context for answering this question is:

"mathematical work, he feels like he's disc overing truth and beauty in 

#### Other types of retrieval

It's worth noting that a vector database is not the only tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as **TF-IDF** or **SVM**.
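As a rough picture of what a TF-IDF retriever does, the sketch below ranks documents by cosine similarity between TF-IDF weighted word counts. It is a simplified, dependency-free illustration rather than the implementation behind `TFIDFRetriever`; the function names, the smoothed IDF formula, and the sample documents are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        counts = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(tokenize(query))
    scores = [cosine(q, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


sample_docs = [
    "form study groups and find project partners",
    "homeworks will be done in matlab or octave",
    "clustering can group an image into regions",
]
ranking = tfidf_rank("what did they say about matlab?", sample_docs)
print(sample_docs[ranking[0]])  # homeworks will be done in matlab or octave
```

Because the match is purely lexical, a query only scores against documents that contain its exact terms, which is one reason such retrievers can behave differently from embedding-based search.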

In [37]:
from langchain.retrievers import SVMRetriever, TFIDFRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

In [38]:
pages = PyPDFLoader(
    file_path="./docs/cs229_lectures/MachineLearning-Lecture01.pdf"
).load()

all_page_text = [p.page_content for p in pages]

joined_page_text = " ".join(all_page_text)

In [39]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

splits = text_splitter.split_text(text=joined_page_text)

In [None]:
svm_retriever = SVMRetriever.from_texts(texts=splits, embeddings=embedding)

In [42]:
tfidf_retriever = TFIDFRetriever.from_texts(texts=splits)

In [44]:
question = "What are major topics for this class?"

docs_svm = svm_retriever.get_relevant_documents(query=question)

print(docs_svm[0].page_content)

let me just check what questions you have righ t now. So if there are no questions, I'll just 
close with two reminders, which are after class today or as you start to talk with other 
people in this class, I just encourage you again to start to form project partners, to try to 
find project partners to do your project with. And also, this is a good time to start forming 
study groups, so either talk to your friends  or post in the newsgroup, but we just 
encourage you to try to star t to do both of those today, okay? Form study groups, and try 
to find two other project partners.  
So thank you. I'm looking forward to teaching this class, and I'll see you in a couple of 
days.   [End of Audio]  
Duration: 69 minutes




In [45]:
question = "what did they say about matlab?"

docs_tfidf = tfidf_retriever.get_relevant_documents(query=question)

print(docs_tfidf[0].page_content)

Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a 
picture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and 
group the picture into regions. Let me actually blow that up so that you can see it more 
clearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, 
grouping the image into [inaudible] regions.  
And what Ashutosh and Min did was they then  applied the learning algorithm to say can 
we take this clustering and us e it to build a 3D model of the world? And so using the 
clustering, they then had a lear ning algorithm try to learn what the 3D structure of the 
world looks like so that they could come up with a 3D model that you can sort of fly 
through, okay? Although many people used to th ink it's not possible to take a single 
image and build a 3D model, but using a lear ning algorithm and that sort of clustering 
algorithm is the first step. They were able to.  
I'll just 