# LangChain MongoDB Integration - Self-Querying Retrieval

This notebook is a companion to the [Self-Querying Retrieval](https://www.mongodb.com/docs/atlas/ai-integrations/langchain/parent-document-retrieval/) page. Refer to the page for setup instructions and detailed explanations.

<a target="_blank" href="https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/ai-integrations/langchain-self-query-retrieval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Prerequisites

To complete this tutorial, you must have the following:
- A MongoDB Atlas cluster
- A Voyage AI API key
- An OpenAI API key

## Set up the environment

In [None]:
pip install --quiet --upgrade langchain-mongodb langchain-voyageai langchain-openai langchain langchain-community langchain-core lark

In [None]:
import os

os.environ["VOYAGE_API_KEY"] = "<voyage-key>"
os.environ["OPENAI_API_KEY"] = "<openai-key>"
MONGODB_URI = "<connection-string>"
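
Before running the remaining cells, it can help to confirm that the placeholder values above were actually replaced. The following is a minimal sketch of such a check; `is_placeholder` and `missing_settings` are hypothetical helpers written for this illustration, and the example values are made up.

```python
def is_placeholder(value):
    """Return True if the value is empty or still looks like '<...>'."""
    value = (value or "").strip()
    return not value or (value.startswith("<") and value.endswith(">"))


def missing_settings(settings):
    """Return the names whose values are unset or still placeholders."""
    return [name for name, value in settings.items() if is_placeholder(value)]


# Illustrative values only: one key is still a placeholder
example = {
    "VOYAGE_API_KEY": "<voyage-key>",
    "OPENAI_API_KEY": "sk-example",
    "MONGODB_URI": "mongodb+srv://user:pass@cluster.example.mongodb.net",
}
print(missing_settings(example))
```

In the notebook, the same check could be applied to `os.environ` and `MONGODB_URI` before any client is created.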

## Instantiate the vector store

In [None]:
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings

# Use the voyage-3-large embedding model
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
   connection_string = MONGODB_URI,
   embedding = embedding_model,
   namespace = "langchain_db.self_query",
   text_key = "page_content"
)

## Add data to the vector store

In [None]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="A fight club that is not a fight club, but is a fight club",
        metadata={"year": 1994, "rating": 8.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 9.3},
    ),
    Document(
        page_content="The toys come together to save their friend from a kid who doesn't know how to play with them",
        metadata={"year": 1997, "genre": "animated", "rating": 9.1},
    ),
]

# Add data to the vector store, which automatically embeds the documents
vector_store.add_documents(docs)

## Create the Vector Search index with filters

In [None]:
# Use the LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
   dimensions = 1024, # The dimensions of the vector embeddings to be indexed
   filters = [ "genre", "rating", "year" ], # The metadata fields to be indexed for filtering
   wait_until_complete = 60 # Number of seconds to wait for the index to build (can take around a minute)
)
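
For reference, the helper call above corresponds roughly to an Atlas Vector Search index definition with one `vector` field plus one `filter` field per metadata path. The sketch below builds that definition as a plain dictionary. The `embedding` path and `cosine` similarity are assumptions about the vector store's defaults, and `build_index_definition` is a hypothetical name used only for illustration.

```python
def build_index_definition(dimensions, filters, path="embedding", similarity="cosine"):
    """Build an Atlas Vector Search index definition as a plain dict."""
    # The vector field holds the embeddings (path and similarity are assumed defaults)
    fields = [
        {
            "type": "vector",
            "path": path,
            "numDimensions": dimensions,
            "similarity": similarity,
        }
    ]
    # Each metadata field that should be usable for pre-filtering gets a filter entry
    fields.extend({"type": "filter", "path": name} for name in filters)
    return {"fields": fields}


definition = build_index_definition(1024, ["genre", "rating", "year"])
for entry in definition["fields"]:
    print(entry)
```

Only fields listed as `filter` entries can be used in pre-filters, which is why all three metadata fields are indexed here.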

## Create the Self-Querying Retriever

### Define metadata field and document information

In [None]:
from langchain.chains.query_constructor.schema import AttributeInfo

# Define the document content description
document_content_description = "Brief summary of a movie"

# Define the metadata fields to filter on
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
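
The declared field types only help if the stored metadata actually matches them. As a hedged sanity check, the sketch below compares a metadata dictionary against the declared types using plain Python. `declared_types` mirrors the `AttributeInfo` entries above, while the mapping to Python types and the `find_mismatches` helper are assumptions made for this illustration.

```python
# Plain-Python mirror of the AttributeInfo entries above (name -> declared type)
declared_types = {"genre": "string", "year": "integer", "rating": "float"}

# Assumed mapping from declared type names to Python types
python_types = {"string": str, "integer": int, "float": (int, float)}


def find_mismatches(metadata, declared):
    """List declared fields that are missing or have an unexpected type."""
    problems = []
    for name, type_name in declared.items():
        if name not in metadata:
            problems.append(f"{name}: missing")
        elif not isinstance(metadata[name], python_types[type_name]):
            problems.append(f"{name}: expected {type_name}")
    return problems


print(find_mismatches({"year": 1993, "rating": 7.7, "genre": "action"}, declared_types))
# []
print(find_mismatches({"year": "1993", "rating": 7.7}, declared_types))
# ['genre: missing', 'year: expected integer']
```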

### Initialize the self-querying retriever

In [None]:
from langchain_mongodb.retrievers import MongoDBAtlasSelfQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    metadata_field_info=metadata_field_info,
    document_contents=document_content_description
)

## Run Queries with the Self-Querying Retriever

### Queries with filters

In [None]:
# This example specifies a filter (rating > 9)
retriever.invoke("What are some highly rated movies (above 9)?")

In [None]:
# This example specifies a semantic search and a filter (rating > 9)
retriever.invoke("I want to watch a movie about toys rated higher than 9")

In [None]:
# This example specifies a composite filter (rating >= 9 and genre = thriller)
retriever.invoke("What's a highly rated (above or equal 9) thriller film?")

In [None]:
# This example specifies a query and composite filter (year > 1990 and year < 2005 and genre = action)
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about dinosaurs, " +
    "and preferably has the action genre"
)
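
To see which documents each filter should match, independent of the LLM and of Atlas, the following sketch evaluates MongoDB-style comparison filters against the sample metadata in plain Python. The filter dictionaries are an assumption about the general shape of the structured filter the retriever generates; the actual output depends on the model and on the library's translator. `matches`, `select`, and the short labels are hypothetical names for this illustration.

```python
import operator

# Metadata of the sample documents, keyed by a short label for readability
movies = {
    "dinosaurs": {"year": 1993, "rating": 7.7, "genre": "action"},
    "fight club": {"year": 1994, "rating": 8.7, "genre": "action"},
    "dream (2010)": {"year": 2010, "rating": 8.2, "genre": "thriller"},
    "wholesome women": {"year": 2019, "rating": 8.3, "genre": "drama"},
    "the Zone": {"year": 1979, "rating": 9.9, "genre": "science fiction"},
    "dreams (2006)": {"year": 2006, "rating": 9.0, "genre": "thriller"},
    "toys (1995)": {"year": 1995, "rating": 9.3, "genre": "animated"},
    "toys (1997)": {"year": 1997, "rating": 9.1, "genre": "animated"},
}

# Supported comparison operators, mapped to their Python equivalents
OPERATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, query):
    """Return True if the metadata satisfies the MongoDB-style query."""
    if "$and" in query:
        return all(matches(metadata, clause) for clause in query["$and"])
    return all(
        OPERATORS[op](metadata[field], value)
        for field, condition in query.items()
        for op, value in condition.items()
    )


def select(query):
    """Return the labels of all sample movies that satisfy the query."""
    return [label for label, metadata in movies.items() if matches(metadata, query)]


print(select({"rating": {"$gt": 9}}))
# ['the Zone', 'toys (1995)', 'toys (1997)']
print(select({"$and": [{"rating": {"$gte": 9}}, {"genre": {"$eq": "thriller"}}]}))
# ['dreams (2006)']
print(select({"$and": [
    {"year": {"$gt": 1990}},
    {"year": {"$lt": 2005}},
    {"genre": {"$eq": "action"}},
]}))
# ['dinosaurs', 'fight club']
```

The metadata filter only narrows the candidate set. In the last query, the semantic part ("all about dinosaurs") is what ranks the remaining documents.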

### Query with no filters

In [None]:
# This example only specifies a semantic search query
retriever.invoke("What are some movies about dinosaurs?")

## Use the Retriever in Your RAG Pipeline

In [None]:
import pprint
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Configure self-query retriever with a document limit
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    metadata_field_info=metadata_field_info,
    document_contents=document_content_description,
    enable_limit=True
)

# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: {question}
"""
prompt = PromptTemplate.from_template(template)

# Construct a chain to answer questions on your data
chain = (
   {"context": retriever, "question": RunnablePassthrough()}
   | prompt
   | llm
   | StrOutputParser()
)

# Prompt the chain
question = "What are two movies about toys after 1990?" # year > 1990 and document limit of 2
answer = chain.invoke(question)

print("Question: " + question)
print("Answer: " + answer)

# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)
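
In the chain above, the retriever's output (a list of `Document` objects) is passed to the prompt as-is. A common LangChain pattern is to join the documents into plain text first, so the model sees clean context. The sketch below shows one such formatter. `SimpleDoc` is a hypothetical stand-in for `Document` so the example runs without any packages installed, and `format_docs` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Stand-in for a LangChain Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join document text and selected metadata into one context string."""
    return "\n\n".join(
        f"{doc.page_content} "
        f"(year: {doc.metadata.get('year')}, rating: {doc.metadata.get('rating')})"
        for doc in docs
    )


sample = [
    SimpleDoc("Toys come alive and have a blast doing so", {"year": 1995, "rating": 9.3}),
    SimpleDoc(
        "The toys come together to save their friend from a kid "
        "who doesn't know how to play with them",
        {"year": 1997, "rating": 9.1},
    ),
]
print(format_docs(sample))
```

With the real retriever, the formatter could be wired in as `"context": retriever | format_docs` in the chain's input mapping.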