# Self-querying retriever with Elasticsearch and LangChain
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/langchain-self-query-retriever.ipynb)

This notebook demonstrates LangChain's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html), which converts an unstructured query into a structured query and applies that structured query to a vector store.

Before we begin, we create a list of documents with `langchain` and then build a `vectorstore` from them. The code in this notebook uses `Chroma.from_documents`; to index the data into Elasticsearch instead, [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents) is the corresponding call.


We will then run a few example queries that show what the self-query retriever can do.


## Install packages and import modules


In [1]:
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
import os

## Create documents 
Next, we will create a list of documents, each holding a short movie summary, using the [langchain Schema Document](https://api.python.langchain.com/en/latest/schema/langchain.schema.document.Document.html). Each document has a `page_content` and a `metadata` field.



In [2]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={
            "year": 1993,
            "rating": 7.7,
            "genre": "science fiction",
            "director": "Steven Spielberg",
            "title": "Jurassic Park",
        },
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={
            "year": 2010,
            "director": "Christopher Nolan",
            "rating": 8.2,
            "title": "Inception",
        },
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={
            "year": 2006,
            "director": "Satoshi Kon",
            "rating": 8.6,
            "title": "Paprika",
        },
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={
            "year": 2019,
            "director": "Greta Gerwig",
            "rating": 8.3,
            "title": "Little Women",
        },
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={
            "year": 1995,
            "genre": "animated",
            "director": "John Lasseter",
            "rating": 8.3,
            "title": "Toy Story",
        },
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "rating": 9.9,
            "director": "Andrei Tarkovsky",
            "genre": "science fiction",
            "title": "Stalker",
        },
    ),
]

## Connect to Chroma

In [3]:
from langchain_community.vectorstores.chroma import Chroma

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = Chroma.from_documents(
    docs,
    embeddings,
)



## Setup query retriever

Next, we will set up the self-query retriever by providing some information about our document attributes and a short description of the document contents.

We will then instantiate the retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html).

In [4]:
# Add details about metadata fields
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. Can be either 'science fiction' or 'animated'.",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

document_content_description = "Brief summary of a movie"

# Set up openAI llm with sampling temperature 0
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# instantiate retriever
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
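
Conceptually, the retriever asks the LLM to split a natural-language question into a semantic search string, an optional metadata filter, and an optional limit. The sketch below models that split with plain dataclasses. The names `Condition` and `StructuredQuerySketch`, and the example decomposition, are illustrative assumptions rather than LangChain's internal types or the exact output of the LLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    """One metadata comparison, e.g. year > 1992 (hypothetical helper)."""

    attribute: str
    comparator: str  # e.g. "eq", "gt", "lt"
    value: object


@dataclass(frozen=True)
class StructuredQuerySketch:
    """Illustrative stand-in for what gets extracted from a question."""

    query: str                   # text used for the similarity search
    conditions: tuple = ()       # metadata conditions, combined with AND
    limit: Optional[int] = None  # number of documents requested, if any


# "Has Andrei Tarkovsky directed any science fiction movies" might decompose into:
example = StructuredQuerySketch(
    query="science fiction",
    conditions=(
        Condition("director", "eq", "Andrei Tarkovsky"),
        Condition("genre", "eq", "science fiction"),
    ),
)
print(example)
```

The attributes named in the conditions correspond to the `name` values declared in `metadata_field_info`, which is why the descriptions and types given there matter.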



## Test retriever with simple query

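We will test the retriever with a simple query: `What are some movies about dream`.

The query contains no filter, so the retriever returns the closest matches by similarity. The output therefore includes documents that are only loosely related to the query, such as Toy Story and Little Women, alongside the two dream-themed movies.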

In [5]:
# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dream")

[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'director': 'Christopher Nolan', 'rating': 8.2, 'title': 'Inception', 'year': 2010}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'title': 'Paprika', 'year': 2006}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'director': 'John Lasseter', 'genre': 'animated', 'rating': 8.3, 'title': 'Toy Story', 'year': 1995}),
 Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'title': 'Little Women', 'year': 2019})]

## Test retriever with simple query and filter

We will now test the retriever with a query:  `Has Andrei Tarkovsky directed any science fiction movies`. 

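This query should produce a filter on the `genre` and `director` metadata fields. Note that the output also includes Jurassic Park, which matches the `genre` but not the `director`, so the two conditions were evidently not both enforced in this run.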


In [6]:
retriever.get_relevant_documents(
    "Has Andrei Tarkovsky directed any science fiction movies"
)

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'rating': 9.9, 'title': 'Stalker', 'year': 1979}),
 Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'director': 'Steven Spielberg', 'genre': 'science fiction', 'rating': 7.7, 'title': 'Jurassic Park', 'year': 1993})]
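
As a baseline for checking the retriever's output, the sketch below applies both conditions as a strict AND over the metadata from this notebook, using plain Python dicts rather than the vector store. The helper `matches_all` is hypothetical and is not part of LangChain.

```python
# Metadata copied from the documents defined earlier (title, genre, director only).
movies = [
    {"title": "Jurassic Park", "genre": "science fiction", "director": "Steven Spielberg"},
    {"title": "Inception", "director": "Christopher Nolan"},
    {"title": "Paprika", "director": "Satoshi Kon"},
    {"title": "Little Women", "director": "Greta Gerwig"},
    {"title": "Toy Story", "genre": "animated", "director": "John Lasseter"},
    {"title": "Stalker", "genre": "science fiction", "director": "Andrei Tarkovsky"},
]


def matches_all(metadata, required):
    """Return True when every required key is present with the required value."""
    return all(metadata.get(key) == value for key, value in required.items())


required = {"genre": "science fiction", "director": "Andrei Tarkovsky"}
strict_matches = [m["title"] for m in movies if matches_all(m, required)]
print(strict_matches)  # ['Stalker']
```

Under a strict AND, only Stalker satisfies both conditions, so any additional document in the retriever's output points to a looser filter than the one sketched here.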

## Instantiate retriever to filter k documents

We will now instantiate the retriever again so that it can return a specific number, k, of documents. We can do this by setting `enable_limit=True` when instantiating the retriever.

We will then test that the retriever limits its results to k documents.

In [7]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

## Test the retriever to filter k documents

We will now test the retriever with a query: `what are two movies about dream`.

The output shows exactly `2` documents.

In [8]:
retriever.get_relevant_documents("what are two movies about dream")

[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'director': 'Christopher Nolan', 'rating': 8.2, 'title': 'Inception', 'year': 2010}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'title': 'Paprika', 'year': 2006})]
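
The effect of the limit can be pictured as truncating the ranked results. The following is a minimal sketch in plain Python, not the retriever's actual implementation: `take_results` is a hypothetical helper, and `DEFAULT_K = 4` is an assumption chosen to mirror the four results returned by the earlier query that had no limit.

```python
from typing import Optional, Sequence

DEFAULT_K = 4  # assumption: mirrors the four results of the unlimited query


def take_results(ranked_titles: Sequence[str], limit: Optional[int] = None) -> list:
    """Keep the first `limit` ranked results, or DEFAULT_K when no limit is given."""
    k = limit if limit is not None else DEFAULT_K
    return list(ranked_titles[:k])


# Ranking taken from the output of the earlier "movies about dream" query.
ranked = ["Inception", "Paprika", "Toy Story", "Little Women"]

print(take_results(ranked))           # all four results
print(take_results(ranked, limit=2))  # ['Inception', 'Paprika']
```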

## Test retriever for complex queries

We will now try a complex query that combines filters with a limit of `1`.


Query: `Show that one movie which was about dream and was released after the year 1992 but before 2007?`. 


In [9]:
retriever.get_relevant_documents(
    "Show that one movie which was about dream and was released after the year 1992 but before 2007?"
)

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'director': 'Steven Spielberg', 'genre': 'science fiction', 'rating': 7.7, 'title': 'Jurassic Park', 'year': 1993})]
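
To see which documents the year range alone admits, the sketch below evaluates the condition "after 1992 and before 2007" over the metadata in plain Python. It only checks the filter, not how the retriever ranks results; the helper name `in_year_range` is hypothetical.

```python
# Release years copied from the documents defined earlier.
years = {
    "Jurassic Park": 1993,
    "Inception": 2010,
    "Paprika": 2006,
    "Little Women": 2019,
    "Toy Story": 1995,
    "Stalker": 1979,
}


def in_year_range(year, after, before):
    """Strict bounds: released after `after` and before `before`."""
    return after < year < before


candidates = [
    title
    for title, year in years.items()
    if in_year_range(year, after=1992, before=2007)
]
print(candidates)  # ['Jurassic Park', 'Paprika', 'Toy Story']
```

Three documents pass the year filter, so with a limit of `1` the single result depends on which of them the similarity search ranks first.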