# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

This project explores and implements various document retrieval and optimization techniques using LangChain's vector storage and retrieval abstractions. The main objectives are to retrieve relevant documents effectively and enhance response quality by combining multiple advanced retrieval strategies. Here’s a summary of the key tasks and methods applied:

## Key Components and Techniques

1. **Vector-Based Similarity Search**:
   - Uses `Chroma` for vector storage, with `OpenAIEmbeddings` to convert text into embeddings.
   - Demonstrates similarity search to retrieve documents most similar to a query.
   - Introduces **Maximum Marginal Relevance (MMR)** to achieve diverse and relevant search results by balancing relevance with diversity.

2. **Metadata-Driven Retrieval**:
   - Adds metadata filtering to refine searches by limiting results to specific sources or attributes.
   - Explores **SelfQueryRetriever** to automatically infer metadata filters from natural language queries, enhancing retrieval specificity without needing additional manual filtering.

3. **Contextual Compression**:
   - Applies **ContextualCompressionRetriever** to extract only query-relevant sections from long documents.
   - Combines document compression with **MMR** to ensure results are concise, focused, and varied, especially useful in handling long and multi-topic documents.

4. **Alternative Retrieval Methods**:
   - Implements retrievers that do not rely on a vector store, such as **TF-IDF** and **SVM**, showing flexibility in retrieval strategies.
   - **TF-IDF**: Ranks documents based on term frequency and inverse document frequency, suitable for keyword-based matching.
   - **SVM**: Fits a Support Vector Machine over the embeddings at query time, treating the query as the positive example and the documents as negatives, then ranks documents by the classifier's score. No labeled training data is needed; `SVMRetriever.from_texts` takes only the texts and an embedding model.

## Summary of Project Goals

This project demonstrates how to effectively retrieve, filter, and compress relevant document content for optimized responses. It leverages vector embeddings, metadata filtering, compression techniques, and alternative retrieval methods to handle various types of retrieval tasks, balancing specificity, relevance, and response efficiency. This approach is particularly useful for applications that require precise and diverse information retrieval from large document collections.


## Vectorstore retrieval


In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [2]:
#!pip install lark

### Similarity Search

1. **Imports**: Load `Chroma` for vector storage and `OpenAIEmbeddings` to generate text embeddings.
2. **Initialize Vector DB**: Set up `vectordb` to persist vectors in `'docs/chroma/'` using `OpenAIEmbeddings`.
3. **Check Collection Count**: Display current entry count in `vectordb`.
4. **Define Text Data**: Create sample texts describing *Amanita phalloides* for testing.
5. **Create Small DB**: Use `Chroma.from_texts` to store sample texts in `smalldb` as embeddings.
6. **Define Query**: Write a question about "all-white mushrooms with large fruiting bodies."
7. **Similarity Search**: Use `smalldb.similarity_search` to find the top 2 relevant documents based on the query.
8. **Max Marginal Relevance Search**: Use `smalldb.max_marginal_relevance_search` to retrieve diverse yet relevant documents.
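
The similarity search in step 7 can be pictured as ranking stored vectors by how close they are to the query vector. The sketch below is a minimal illustration, assuming cosine similarity as the measure (the metric a real store uses depends on its configuration) and using made-up 3-dimensional vectors in place of real embeddings. The names `cosine`, `top_k`, `toy_docs`, and `toy_query` are hypothetical and not part of LangChain or Chroma.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_docs = [
    [0.9, 0.1, 0.0],  # doc 0: points almost the same way as the query
    [0.8, 0.2, 0.1],  # doc 1: close, but slightly further off
    [0.0, 0.1, 0.9],  # doc 2: unrelated direction
]
toy_query = [1.0, 0.0, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

Here `k` plays the same role as the `k=2` argument passed to `smalldb.similarity_search`.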


#### Explanation of `vectordb` and `smalldb` Count Issue

- **Issue**: After executing `smalldb = Chroma.from_texts(texts, embedding=embedding)`, the result of `print(vectordb._collection.count())` remains 209, without the 3 new text entries.
- **Reason**: `smalldb` is a separate database instance and is not connected to `vectordb`. The `from_texts` method creates a new database (`smalldb`) without modifying `vectordb`.
- **Solution**: To add the new texts directly to `vectordb`, use `vectordb.add_texts(texts)` and then check the count.

After that call, the document count in `vectordb` rises from 209 to 212.
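
The difference between building a new store and extending an existing one can be shown without any external service. The following is only a sketch: `ToyStore` is a hypothetical stand-in that keeps raw texts and mimics the two method names discussed above, not Chroma's actual implementation.

```python
class ToyStore:
    """Hypothetical stand-in for a vector store; it only keeps raw texts."""

    def __init__(self, texts=None):
        self._texts = list(texts or [])

    @classmethod
    def from_texts(cls, texts):
        # Builds a brand-new, independent store from the given texts.
        return cls(texts)

    def add_texts(self, texts):
        # Appends to this existing store.
        self._texts.extend(texts)

    def count(self):
        return len(self._texts)

big = ToyStore(["chunk"] * 209)       # plays the role of vectordb
new_texts = ["text a", "text b", "text c"]

small = ToyStore.from_texts(new_texts)  # plays the role of smalldb
count_after_from_texts = big.count()    # big is untouched

big.add_texts(new_texts)
count_after_add_texts = big.count()     # big now includes the new texts

print(count_after_from_texts, count_after_add_texts)
```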



In [49]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [50]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [51]:
print(vectordb._collection.count())

209


In [52]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [53]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [54]:
print(vectordb._collection.count())

209


In [55]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [56]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [57]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both **relevance** to the query and **diversity** among the results.
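
To make the trade-off concrete, here is a minimal sketch of the greedy MMR selection rule: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The function names, the parameter name `lambda_mult`, and the 2-dimensional toy vectors are assumptions made for illustration; this is not the vector store's own code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k documents, in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 if nothing chosen yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy vectors: docs 0 and 1 are exact duplicates.
vecs = [
    [0.8, 0.6],   # doc 0: relevant
    [0.8, 0.6],   # doc 1: duplicate of doc 0
    [0.6, -0.8],  # doc 2: less relevant, but orthogonal to doc 0
]
query = [1.0, 0.0]

picked = mmr(query, vecs, k=2, lambda_mult=0.5)
print(picked)
```

A plain similarity ranking would return the two duplicates. MMR keeps the first and then prefers the less similar document. Setting `lambda_mult=1.0` removes the redundancy penalty and reduces the rule to pure relevance ranking.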

In [58]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

In [59]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [60]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

Note the difference in results with `MMR`.

In [61]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [62]:
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [63]:
docs_mmr[1].page_content[:100]

'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
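
Conceptually, a metadata filter narrows the candidate set before ranking. The sketch below illustrates that idea with plain dictionaries. The chunk texts, the shortened file names, the scores, and the helper `filter_then_rank` are all hypothetical; a real vector store applies the filter internally.

```python
def filter_then_rank(chunks, scores, where, k=3):
    """Keep chunks whose metadata matches every key in `where`, then return the top k by score."""
    matching = [
        (score, chunk)
        for chunk, score in zip(chunks, scores)
        if all(chunk["metadata"].get(key) == value for key, value in where.items())
    ]
    # Sort on the score only, so ties never try to compare dictionaries.
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in matching[:k]]

# Hypothetical chunks with precomputed similarity scores.
chunks = [
    {"text": "regression intro", "metadata": {"source": "Lecture02.pdf", "page": 2}},
    {"text": "locally weighted regression", "metadata": {"source": "Lecture03.pdf", "page": 0}},
    {"text": "logistic regression", "metadata": {"source": "Lecture03.pdf", "page": 14}},
]
scores = [0.95, 0.80, 0.90]

hits = filter_then_rank(chunks, scores, where={"source": "Lecture03.pdf"}, k=3)
for hit in hits:
    print(hit["metadata"])
```

Even though the Lecture02 chunk has the highest score, it is excluded because its `source` does not match the filter.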

In [64]:
question = "what did they say about regression in the third lecture?"

In [65]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [66]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
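
The two outputs listed above can be illustrated with a toy parser. This sketch replaces the LLM with a regular expression, so it only handles the exact phrasing "in the first/second/third lecture". The function `toy_self_query` and the `ORDINALS` table are hypothetical, and the real retriever produces a structured query object rather than a tuple.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3}

def toy_self_query(question):
    """Split a question into (search_text, metadata_filter) using a regex, not an LLM."""
    match = re.search(r"\bin the (first|second|third) lecture\b", question)
    if not match:
        # No recognizable metadata hint: search with the full question, no filter.
        return question, {}
    number = ORDINALS[match.group(1)]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Remove the metadata phrase so it does not pollute the vector search text.
    search_text = (question[:match.start()] + question[match.end():]).strip(" ?")
    return search_text, {"source": source}

search_text, metadata_filter = toy_self_query(
    "what did they say about regression in the third lecture?"
)
print(search_text)
print(metadata_filter)
```

The resulting dictionary has the same shape as the `filter` argument passed manually to `similarity_search` in the previous section.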

In [67]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [68]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

**Note:** The default model for `OpenAI` (`from langchain.llms import OpenAI`) is `text-davinci-003`. Because OpenAI deprecated `text-davinci-003` on 4 January 2024, this notebook uses OpenAI's recommended replacement model, `gpt-3.5-turbo-instruct`, instead.

In [69]:
document_content_description = "Lecture notes"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [70]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [71]:
docs = retriever.get_relevant_documents(question)

query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None




In [72]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}


In [73]:
question = "what did they say about regression in the second lecture?"

In [74]:
docs = retriever.get_relevant_documents(question)

query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture02.pdf') limit=None


In [75]:
for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 2}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 5}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 12}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}


In [76]:
docs[0].page_content[:1000]

"Instructor (Andrew Ng) :All right, so who thought driving could be that dramatic, right? \nSwitch back to the chalkboard, please. I s hould say, this work was done about 15 years \nago and autonomous driving has come a long way. So many of you will have heard of the \nDARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, the winning \nteam's drive a car across a desert by itself.  \nSo Alvin was, I think, absolutely amazing wo rk for its time, but autonomous driving has \nobviously come a long way since then. So what  you just saw was an example, again, of \nsupervised learning, and in particular it was an  example of what they  call the regression \nproblem, because the vehicle is trying to predict a continuous value variables of a \ncontinuous value steering directions , we call the regression problem.  \nAnd what I want to do today is talk about our first supervised learning algorithm, and it \nwill also be to a regression task. So for the running example that I'm goi

### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
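
As a rough illustration of the idea, the sketch below keeps only the sentences of a document that share a keyword with the query. It is a deliberately crude stand-in: the notebook's compressor uses an LLM to extract relevant passages, whereas `toy_compress`, its stopword list, and the sample text are assumptions made for this example.

```python
import re

def toy_compress(document, query,
                 stopwords=frozenset({"what", "did", "they", "say", "about", "the"})):
    """Keep only sentences that share a non-stopword term with the query."""
    terms = {w for w in re.findall(r"[a-z]+", query.lower()) if w not in stopwords}
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if terms & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)

# Hypothetical sample text loosely modeled on the lecture transcript.
document = (
    "The homeworks will be done in MATLAB or Octave. "
    "Please form study groups early. "
    "Octave is free and has somewhat fewer features than MATLAB."
)

compressed = toy_compress(document, "what did they say about matlab?")
print(compressed)
```

The sentence about study groups is dropped, so less irrelevant text would be passed on to the LLM that answers the question.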

In [77]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [78]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [79]:
# Build an LLM-based compressor; it is wrapped around the vectorstore retriever in the next cell
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

In [80]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [81]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write 

## Combining various techniques
This example demonstrates how to optimize document retrieval by combining **contextual compression** and **Maximum Marginal Relevance (MMR)** techniques. This approach improves retrieval quality by focusing on query-relevant content while ensuring diverse results.

This combined technique is ideal for:

- **Long Document Retrieval**: When documents are lengthy and cover multiple topics, compression helps remove unrelated sections.
- **Diverse Information Needs**: For queries requiring information on multiple aspects, MMR ensures that returned results cover a range of relevant details instead of repetitive content.



In [82]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [83]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- "those homeworks will be done in either MATLA B or in Octave"
- "I know some people call it a free ve rsion of MATLAB"
- "MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data."
- "there's also a software package called Octave that you can download for free off the Internet."
- "it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything."
- "once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course."
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------
Document 3:

- learning algorithms to teach a car how to drive at reasonably high 

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.

- TF-IDF: Based on term frequency and inverse document frequency weighting, suitable for short texts and keyword matching.
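- SVM: Fits a Support Vector Machine over the embeddings at query time, with the query as the positive example and the document chunks as negatives, and ranks chunks by the classifier's score. It needs an embedding model but no labeled training data.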
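
To show what keyword-based ranking looks like, here is a small, self-contained TF-IDF scorer. It is a sketch under simplifying assumptions (whitespace-style tokenization, a smoothed `log(N / df) + 1` weighting, toy documents) and is not the implementation used by `TFIDFRetriever`. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, docs):
    """Return document indices sorted by a simple TF-IDF score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for term in set(tokenize(query)):
            if term in counts:
                tf = counts[term] / len(tokens)
                # The +1 keeps terms that occur in every document from scoring zero.
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                total += tf * idf
        return total

    return sorted(range(n_docs), key=lambda i: score(tokenized[i]), reverse=True)

# Hypothetical toy documents.
docs = [
    "homeworks will be done in matlab or octave",
    "form study groups and find project partners",
    "octave is a free alternative to matlab",
]

order = tfidf_rank("matlab homeworks", docs)
print(order)
```

Document 0 ranks first because it contains both query terms, and the rarer term ("homeworks") carries a higher weight than the one shared with document 2.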

In [84]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [85]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [86]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [87]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]



Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={})

In [88]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste