## Setting Up Our Environment
Don't forget to run ```export OPENAI_API_KEY=sk-...``` to set your API key as an environment variable before starting Jupyter. LangChain also works with API keys from Hugging Face and other providers, but that is beyond our scope.
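
If the variable was not exported before Jupyter started, the key can also be supplied from inside the notebook. The following is a minimal sketch: `ensure_api_key` is a hypothetical helper written for this guide, not a LangChain function, and the demonstration uses a stand-in dictionary and a placeholder value instead of a real key.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", env=os.environ, prompt=getpass):
    """Return the key stored under `name`, prompting for it only if it is missing."""
    if not env.get(name):
        # With the default getpass prompt, the typed key is not echoed to the screen.
        env[name] = prompt(f"Enter {name}: ")
    return env[name]


# Demonstration with a stand-in environment and prompt, so no real key is involved.
demo_env = {}
ensure_api_key(env=demo_env, prompt=lambda _: "sk-placeholder")
print(sorted(demo_env))
```

In a real notebook, calling `ensure_api_key()` with no arguments would read from and write to `os.environ`.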

In [1]:
# Uncomment and run this cell if you need to install the required packages
# !pip install langchain-community langchain-openai langchain-text-splitters faiss-cpu

# How to use a vectorstore as a retriever

A vector store retriever is a [retriever](/docs/concepts/retrievers/) that uses a [vector store](/docs/concepts/vectorstores/) to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever [interface](/docs/concepts/runnables/).
It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.

In this guide we will cover:

1. How to instantiate a retriever from a vectorstore;
2. How to specify the search type for the retriever;
3. How to specify additional search parameters, such as threshold scores and top-k.

## Creating a retriever from a vectorstore

You can build a retriever from a vectorstore using its [.as_retriever](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html#langchain_core.vectorstores.base.VectorStore.as_retriever) method. Let's walk through an example.

First we instantiate a vectorstore. We will use an in-memory [FAISS](https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html) vectorstore:

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Read the file as UTF-8 so punctuation such as dashes is decoded correctly
loader = TextLoader("../session2/some_data/FDR_State_of_Union_1944.txt", encoding="utf-8")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(texts, embeddings)

def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Created a chunk of size 1535, which is longer than the specified 1000


We can then instantiate a retriever:

In [3]:
retriever = vectorstore.as_retriever()

This creates a retriever (specifically a [VectorStoreRetriever](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStoreRetriever.html)), which we can use in the usual way:

In [4]:
docs = retriever.invoke("what did the president say about germany?")

## Similarity Metrics
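The core of the retriever is the <b>similarity calculation</b>.

When we specify ```search_type="similarity"```, the retriever ranks documents by comparing their embedding vectors with the query vector. The exact metric is chosen by the underlying vector store; for unit-length embeddings, ranking by Euclidean distance gives the same order as ranking by cosine similarity.

As of this date (3/18/2025), these are two of the three ```search_type``` values available in the LangChain ```VectorStoreRetriever``` class: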

- <b>Cosine Similarity</b> (```search_type='similarity'```):
    - Measures the cosine of the angle between two vectors:

      $$
      \text{similarity}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \times \|v_2\|}
      $$

      - Where:
         - ```v_1 · v_2``` is the dot product
         - ```||v||``` is the magnitude of vector ```v```

Cosine similarity ranges from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonal vectors (no similarity).
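
To make the formula concrete, here is a small pure-Python sketch of cosine similarity. It is for illustration only; the function name `cosine_similarity` and the toy vectors are assumptions of this example, and a vector store would use an optimized implementation instead.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Three toy cases: same direction, orthogonal, and opposite direction.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the first pair differs in length but not in direction, so its similarity is still at the top of the range: cosine similarity ignores vector magnitude.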


- <b>Maximum marginal relevance</b> (```search_type='mmr'```):
    - Maximizes relevance to the query while minimizing redundancy among the selected documents:

      $$
      \text{MMR} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \times \text{sim}(d_i, q) - (1-\lambda) \times \max_{d_j \in S} \text{sim}(d_i, d_j) \right]
      $$

      - Where:
        - ```R``` is the set of candidate document vectors (in LangChain, the ```fetch_k``` documents retrieved before re-ranking)
        - ```S``` is the set of already selected document vectors (initialized as empty)
        - ```q``` is the query vector
        - ```sim(x, y)``` is the similarity function (like cosine similarity)
        - ```λ``` is a parameter between 0 and 1 that controls the trade-off between relevance and diversity
          - λ = 1: <b>Pure relevance-based ranking</b> (equivalent to standard similarity search)
          - λ = 0: <b>Pure diversity-based ranking</b> (selecting documents most different from those already chosen)
          - 0 < λ < 1: <b>Balanced approach</b>, typically values like 0.5-0.7 work well
        - ```arg max``` selects the document that maximizes the expression
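
The greedy selection described by the formula can be sketched in pure Python as follows. This is an illustrative sketch, not LangChain's implementation: the names `cosine` and `mmr_select` and the toy vectors are assumptions of this example, and the redundancy term is taken to be 0 while ```S``` is still empty.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedily pick up to k candidate indices using the MMR criterion."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(candidates[i], query)
            # Redundancy: similarity to the closest already-selected vector
            # (0 while nothing has been selected yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.5, -0.5]]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
```

With λ = 1 the two near-duplicates are returned, while λ = 0.5 swaps the second one for the less relevant but more distinct candidate. With λ = 0 and an empty ```S```, every candidate scores 0 in this sketch, so the first pick is simply the first candidate.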

### Computational Complexity
The computational complexity of MMR is higher than standard similarity search:
   
- ```cosine```: O(n) where n is the number of documents
    
- ```MMR```: O(n × k) where k is the number of documents to return

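This is because at each of the k selection steps, MMR compares every remaining candidate with the documents already selected. The O(n × k) bound assumes that each candidate's highest similarity to the selected set is cached and updated after every pick; recomputing that maximum from scratch at every step costs more.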
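
As a rough check on these growth rates, the sketch below counts document-to-document similarity evaluations when selecting k documents out of n candidates. It assumes one similarity evaluation as the unit of cost and leaves out the n query-to-document evaluations that both search types need; `doc_doc_similarity_counts` is a hypothetical helper for this illustration.

```python
def doc_doc_similarity_counts(n, k):
    """Count document-to-document similarity evaluations for selecting k of n."""
    naive = 0   # recompute the max over all of S for every remaining candidate
    cached = 0  # keep a running max per candidate; compare with the newest pick only
    for already_selected in range(k):
        remaining = n - already_selected
        naive += remaining * already_selected
        if already_selected:
            cached += remaining
    return naive, cached


print(doc_doc_similarity_counts(20, 3))     # (55, 37)
print(doc_doc_similarity_counts(1000, 10))  # (44715, 8955)
```

The cached count grows roughly as n × k, while the naive count grows roughly as n × k² / 2.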

In [5]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},  # Number of documents to return
)
docs = retriever.invoke("what did the president say about germany?")
# docs
len(docs)

3

In [6]:
pretty_print_docs(docs)

Document 1:

Let us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:
----------------------------------------------------------------------------------------------------
Document 2:

The foreign policy that we have been following—the policy that guided us at Moscow, Cairo, and Teheran—is ba

In [7]:
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,
        "lambda_mult": 0.0,  # λ = 0 (maximum diversity)
    },
)
docs = retriever.invoke("what did the president say about germany?")
# docs
len(docs)

3

In [8]:
pretty_print_docs(docs)

Document 1:

Let us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:
----------------------------------------------------------------------------------------------------
Document 2:

The fact is the very contrary. It has been shown time and again that if the standard of living of any country go

## Passing search parameters

We can pass parameters to the underlying vectorstore's search methods using `search_kwargs`.

### The last ```search_type```: Similarity score threshold retrieval

For example, we can set a similarity score threshold and only return documents with a score above that threshold.

In [9]:
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold", 
    search_kwargs={"score_threshold": 0.72}
)
docs = retriever.invoke("what did the president say about germany?")
# docs
len(docs)

2

In [10]:
pretty_print_docs(docs)

Document 1:

Let us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:
----------------------------------------------------------------------------------------------------
Document 2:

The foreign policy that we have been following—the policy that guided us at Moscow, Cairo, and Teheran—is ba
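
As a rough mental model of threshold retrieval, the sketch below keeps the best-scoring candidates and then drops any whose score falls below the threshold. It is not LangChain's code: `filter_by_score` is a hypothetical helper, the scores are made up rather than taken from the run above, and treating a score equal to the threshold as a match is an assumption of this example.

```python
def filter_by_score(scored_docs, score_threshold, k=4):
    """Keep at most k (text, score) pairs, best first, whose score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Made-up relevance scores in [0, 1].
scored = [("chunk A", 0.81), ("chunk B", 0.74), ("chunk C", 0.69), ("chunk D", 0.55)]
kept = filter_by_score(scored, score_threshold=0.72)
print([text for text, _ in kept])  # ['chunk A', 'chunk B']
```

Raising the threshold returns fewer documents, and a threshold above the best score returns none, so it helps to inspect the scores for a few typical queries before fixing a value.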

### Specifying top k

We can also limit the number of documents `k` returned by the retriever (the default is 4).

In [11]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

In [12]:
docs = retriever.invoke("what did the president say about germany?")
len(docs)

1

In [13]:
docs[0].page_content

'Let us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.\n\nThat is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:'