# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

In [1]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
# from langchain_core.vectorstores import InMemoryVectorStore
from langchain_cohere import CohereEmbeddings
import numpy as np

os.environ["GOOGLE_API_KEY"] = os.environ["GEMINI_API_KEY"]
llm_model = "gemini-2.0-flash-lite"  # alternative: "gemma-3-27b-it"

llm = ChatGoogleGenerativeAI(
    model=llm_model,
    temperature=0.9,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

gembeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

# vector_store = InMemoryVectorStore(embeddings)

cembeddings = CohereEmbeddings(model="embed-english-v3.0")

In [2]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=gembeddings
)

print(vectordb._collection.count())


208


In [4]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=gembeddings)
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [11]:
ss = smalldb.similarity_search(question, k=2)
for s in ss:
    print(s.page_content)

A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.
The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).


In [10]:
ss = smalldb.max_marginal_relevance_search(question, k=2)
for s in ss:
    print(s.page_content)

A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.
The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).


In [4]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=cembeddings)
question = "Tell me about all-white mushrooms with large fruiting bodies"

ss = smalldb.similarity_search(question, k=2)
for s in ss:
    print(s.page_content)

ss = smalldb.max_marginal_relevance_search(question, k=2)
for s in ss:
    print(s.page_content)

A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.
A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.


### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
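
To make the trade-off concrete, here is a minimal NumPy sketch of a greedy MMR selection. It is an illustration under stated assumptions, not the vector store's actual implementation: the 2-D vectors are made-up stand-ins for real embeddings, and `lambda_mult` is assumed to weight relevance against redundancy (1.0 would be pure relevance, 0.0 pure diversity).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k documents in selection order."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Made-up embeddings: docs 0 and 1 are near-duplicates, doc 2 is different.
query = np.array([1.0, 0.3])
docs = np.array([[1.0, 0.12], [1.0, 0.1], [0.6, 0.8]])

plain_top2 = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))[:2]
mmr_top2 = mmr(query, docs, k=2, lambda_mult=0.5)
print(plain_top2, mmr_top2)
```

Plain similarity ranking returns the two near-duplicates, while the MMR pass keeps the best match and then prefers the document that adds something new.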

In [5]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

print(docs_ss[0].page_content[:100])
print("******************")
print(docs_ss[1].page_content[:100])

those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c
******************
those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c


In [6]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

print(docs_mmr[0].page_content[:100])
print("******************")
print(docs_mmr[1].page_content[:100])

those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people c
******************
Okay, and using this matrix vector notation, I think, I don't know, I think we did this 
whole thing


### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
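
Conceptually, a metadata filter narrows the candidate set before similarity ranking. The sketch below shows that idea on a toy in-memory list; the chunk vectors, file names, and the `filtered_search` helper are all hypothetical and only stand in for what a vector store does internally.

```python
import numpy as np

# Hypothetical chunks: a made-up embedding plus the metadata stored with it.
chunks = [
    {"vec": np.array([0.9, 0.1]), "metadata": {"source": "lecture01.pdf", "page": 4}},
    {"vec": np.array([0.8, 0.2]), "metadata": {"source": "lecture03.pdf", "page": 2}},
    {"vec": np.array([0.1, 0.9]), "metadata": {"source": "lecture03.pdf", "page": 7}},
]

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query_vec, chunks, where, k=3):
    """Keep chunks whose metadata equals every key/value in `where`, then rank them."""
    matching = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(matching, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

hits = filtered_search(np.array([1.0, 0.0]), chunks, {"source": "lecture03.pdf"})
pages = [h["metadata"]["page"] for h in hits]
print(pages)
```

Note that the most similar chunk overall (from `lecture01.pdf`) never appears, because it is excluded before ranking.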

In [5]:
question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/MachineLearning-Lecture03.pdf"}
)

for d in docs:
    print(d.metadata["source"])

docs/MachineLearning-Lecture03.pdf
docs/MachineLearning-Lecture03.pdf
docs/MachineLearning-Lecture03.pdf


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
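
To illustrate the two outputs listed above, here is a deliberately simple, rule-based stand-in for the LLM step. It is not how `SelfQueryRetriever` works internally; the `toy_self_query` function and its regular expression are hypothetical and only handle phrases such as "in the third lecture".

```python
import re

# Hypothetical mapping from ordinal words to lecture numbers.
ORDINALS = {"first": 1, "second": 2, "third": 3}

def toy_self_query(question):
    """Split a question into (query string, metadata filter or None)."""
    match = re.search(
        r"\b(?:in|from) the (first|second|third) lecture\b",
        question,
        flags=re.IGNORECASE,
    )
    if match is None:
        return question, None
    number = ORDINALS[match.group(1).lower()]
    metadata_filter = {"source": f"docs/MachineLearning-Lecture{number:02d}.pdf"}
    # Remove the lecture phrase so only the semantic part is embedded.
    query = question[:match.start()] + question[match.end():]
    query = re.sub(r"\s+", " ", query).strip()
    query = re.sub(r"\s+([?.!,])", r"\1", query)
    return query, metadata_filter

query, metadata_filter = toy_self_query(
    "what did they say about regression in the third lecture?"
)
print(query)
print(metadata_filter)
```

An LLM generalizes far beyond this pattern, but the contract is the same: one string for the vector search and one structured filter for the metadata.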

In [8]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/MachineLearning-Lecture01.pdf`, `docs/MachineLearning-Lecture02.pdf`, or `docs/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

question = "what did they say about regression in the third lecture?"

docs = retriever.invoke(question)

In [10]:
for d in docs:
    print(d.metadata["source"], d.metadata["page"])

docs/MachineLearning-Lecture03.pdf 2
docs/MachineLearning-Lecture03.pdf 14
docs/MachineLearning-Lecture03.pdf 10
docs/MachineLearning-Lecture03.pdf 2


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
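
As a rough intuition for what a compressor does, the sketch below keeps only the sentences that share a content word with the question. This keyword heuristic is a stand-in for the LLM-based extraction used later in this section; the `toy_extract` function, its stopword list, and the sample document are all hypothetical.

```python
import re

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "an", "is", "of", "in", "to"}

def toy_extract(document, question):
    """Return only the sentences that share a content word with the question."""
    keywords = {
        word for word in re.findall(r"[a-z]+", question.lower())
        if word not in STOPWORDS
    }
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if keywords & set(re.findall(r"[a-z]+", sentence.lower()))
    ]
    return " ".join(kept)

# Made-up document: only the middle sentence is relevant to the question.
document = (
    "Homework is due on Fridays. "
    "We will use MATLAB for the programming exercises. "
    "Office hours are posted on the course website."
)
compressed = toy_extract(document, "what did they say about matlab?")
print(compressed)
```

Only the relevant sentence survives, so the downstream LLM call receives a shorter, more focused context.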

In [12]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


# Wrap our vectorstore
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of 
learning algorithms.
----------------------------------------------------------------------------------------------------
Document 2:

MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of 
learning algorithms.
----------------------------------------------------------------------------------------------------
Document 3:

And the student said, "Oh, it was the MATLAB."
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutorial in one

### Combine with MMR

In [13]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of 
learning algorithms.


## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
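
For intuition about the TF-IDF option, here is a small pure-Python sketch of term-weighted retrieval. It is a simplified illustration, not the `TFIDFRetriever` implementation, which relies on scikit-learn and differs in tokenization and normalization; the smoothed IDF formula and the sample texts are assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(texts):
    """Return one sparse TF-IDF vector per text, plus the IDF table."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption): rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tfidf_search(question, texts, k=1):
    """Rank texts by TF-IDF cosine similarity to the question."""
    vectors, idf = build_index(texts)
    # Terms never seen in the corpus carry no weight and are dropped.
    q_tf = Counter(t for t in tokenize(question) if t in idf)
    q_vec = {term: count * idf[term] for term, count in q_tf.items()}
    ranked = sorted(range(len(texts)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [texts[i] for i in ranked[:k]]

# Made-up chunks standing in for the lecture splits.
texts = [
    "Homework will be done in MATLAB or Octave.",
    "The class covers supervised learning and regression.",
    "Form study groups and find project partners.",
]
top = tfidf_search("which topics does the class cover?", texts, k=1)
print(top)
```

Unlike embedding search, this approach matches on exact tokens only: "cover" in the question does not match "covers" in the text, so the hit comes from the shared words "the" and "class".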

In [14]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and join all pages into a single string
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split the text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

# Build the retrievers (the SVM retriever needs embeddings, TF-IDF does not)
svm_retriever = SVMRetriever.from_texts(splits, gembeddings)
tfidf_retriever = TFIDFRetriever.from_texts(splits)


In [15]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.invoke(question)
docs_svm[0]

Document(metadata={}, page_content="find project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends or post in the newsgroup, but we just \nencourage you to try to start to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays. [End of Audio]  \nDuration: 69 minutes")

In [16]:
question = "What are major topics for this class?"
docs_tfidf = tfidf_retriever.invoke(question)
docs_tfidf[0]

Document(metadata={}, page_content="personally could, and this is an instance of maybe computers learning to do things that \nthey were not programmed explicitly to do.  \nHere's a more recent, a more modern, more formal definition of machine learning due to \nTom Mitchell, who says that a well-posed learning problem is defined as follows: He \nsays that a computer program is set to learn from an experience E with respect to some \ntask T and some performance measure P if its performance on T as measured by P \nimproves with experience E. Okay. So not only is it a definition, it even rhymes.  \nSo, for example, in the case of checkers, the experience E that a program has would be \nthe experience of playing lots of games of checkers against itself, say. The task T is the \ntask of playing checkers, and the performance measure P will be something like the \nfraction of games it wins against a certain set of human opponents. And by this \ndefinition, we'll say that Arthur Samuel's checke