# How to do "self-querying" retrieval with YandexGPTEmbeddings encoder and YandexGPT/ChatYandexGPT models

For demonstration purposes we'll use a Chroma vector store. We've created a small demo set of documents that contain summaries of movies.

### Install dependencies

In [None]:
%pip install langchain-chroma

In [None]:
%pip install langchain_core
%pip install langchain_community
%pip install yandexcloud
%pip install lark==1.1.7

### Import modules and packages for YandexGPT/ChatYandexGPT and LangChain

In [64]:
import os

from langchain_community.llms import YandexGPT
from langchain_community.chat_models import ChatYandexGPT
from langchain_community.embeddings.yandex import YandexGPTEmbeddings

from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

from langchain_chroma import Chroma
from langchain.retrievers.self_query.chroma import ChromaTranslator

import lark

In [65]:
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)

In [66]:
from langchain.chains.query_constructor.parser import (
    Comparator,
    Operator
)

### Create vector store

First we define some documents and a `YandexGPTEmbeddings` encoder, then we populate a vector store with them. We will use a Chroma vector store, but this guide is compatible with any LangChain vector store.

In [67]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

In [68]:
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

In [69]:
document_content_description = "Brief summary of a movie"
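
To make the role of `metadata_field_info` and `document_content_description` more concrete, here is a minimal sketch of how such descriptions could be rendered into the "Data Source" JSON block that shows up in the formatted prompt later in this guide. The helper `render_data_source` is hypothetical and uses plain dicts in place of `AttributeInfo`; the actual rendering happens inside `get_query_constructor_prompt` and may differ in detail.

```python
import json

# Plain-dict stand-ins for the AttributeInfo entries defined above
# (assumption: only name, description and type matter for the rendered schema).
attribute_info = [
    {
        "name": "genre",
        "description": "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        "type": "string",
    },
    {"name": "year", "description": "The year the movie was released", "type": "integer"},
    {"name": "director", "description": "The name of the movie director", "type": "string"},
    {"name": "rating", "description": "A 1-10 rating for the movie", "type": "float"},
]


def render_data_source(content_description, attributes):
    """Hypothetical helper: build a 'Data Source' JSON block for a prompt."""
    schema = {
        "content": content_description,
        "attributes": {
            attr["name"]: {"type": attr["type"], "description": attr["description"]}
            for attr in attributes
        },
    }
    return json.dumps(schema, indent=4)


data_source = render_data_source("Brief summary of a movie", attribute_info)
print(data_source)
```

The model only sees this textual description, so precise attribute names, types and descriptions are what allow it to produce filters that match the stored metadata.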

In [70]:
embeddings = YandexGPTEmbeddings(api_key="your-api-key", folder_id="your-folder-id")

In [71]:
vectorstore = Chroma.from_documents(docs, embeddings)

### Define YandexGPT and ChatYandexGPT models

Unfortunately, the lite version of YandexGPT doesn't produce the correct output format, no matter how much we tweak the prompt, so both models below use the full `yandexgpt` model.

In [72]:
llm1 = YandexGPT(
    temperature=0.0,
    api_key="your-api-key",
    folder_id="your-folder-id",
    model_name="yandexgpt",
)

In [73]:
llm2 = ChatYandexGPT(
    temperature=0.0,
    api_key="your-api-key",
    folder_id="your-folder-id",
    model_name="yandexgpt",
)

### Define Query Construction and Parsing Pipeline

Here we define a pipeline that builds a structured query from the user's request and parses the model output. The prompt is built from the metadata and document content descriptions, and both the prompt and the parser are restricted to the listed comparators and operators.

In [98]:
prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
    allowed_comparators=[
        Comparator.EQ,
        Comparator.NE,
        Comparator.GT,
        Comparator.GTE,
        Comparator.LT,
        Comparator.LTE,
        Comparator.CONTAIN,
        Comparator.LIKE,
        Comparator.IN,
        Comparator.NIN,
    ],
    allowed_operators=[Operator.AND, Operator.OR, Operator.NOT],

    schema_prompt="Return the answer in JSON format with the following fields: 'query', 'filter'.Return only JSON without any other comments. NEVER use 'ge' filter. Instead of it use 'gte'. Avoid using like, contain, or max in filter value. Filter values should only be derived solely from request data. Strong follow examples structure.",
)
output_parser = StructuredQueryOutputParser.from_components(
    allowed_comparators=[
        Comparator.EQ,
        Comparator.NE,
        Comparator.GT,
        Comparator.GTE,
        Comparator.LT,
        Comparator.LTE,
        Comparator.CONTAIN,
        Comparator.LIKE,
        Comparator.IN,
        Comparator.NIN,
    ],
    allowed_operators=[Operator.AND, Operator.OR, Operator.NOT],
)
query_constructor_1 = prompt | llm1 | output_parser
query_constructor_2 = prompt | llm2 | output_parser
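
The `schema_prompt` above tells the model never to use `ge`, because the parser only accepts the comparators and operators we listed. As a rough illustration of that constraint, the sketch below scans a filter expression for function names outside an allowed set. The helper `find_disallowed_functions` and its regular expressions are assumptions made for demonstration; the actual `StructuredQueryOutputParser` parses the expression with a `lark` grammar rather than with regexes.

```python
import re

# String values of the comparators and operators allowed in the pipeline above
ALLOWED_FUNCTIONS = {
    "eq", "ne", "gt", "gte", "lt", "lte", "contain", "like", "in", "nin",
    "and", "or", "not",
}


def find_disallowed_functions(filter_expression, allowed=ALLOWED_FUNCTIONS):
    """Hypothetical helper: list function names that are not in the allowed set."""
    # Blank out double-quoted values so text inside them is not mistaken for a call
    without_strings = re.sub(r'"[^"]*"', '""', filter_expression)
    # A function call is a name immediately followed by an opening parenthesis
    names = re.findall(r"([A-Za-z_]+)\s*\(", without_strings)
    return sorted({name for name in names if name not in allowed})


good = 'and(eq("genre", "science fiction"), gte("year", 1990), lt("year", 2000))'
bad = 'and(eq("genre", "science fiction"), ge("year", 1990))'

print(find_disallowed_functions(good))  # no disallowed names
print(find_disallowed_functions(bad))   # reports 'ge'
```

A check like this can be handy when debugging model output: if the parser rejects a response, the offending function name is usually the first thing to look for.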

Let's see what `prompt.format` produces for our input query.

In [84]:
prompt.format(query="What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers")

'Your goal is to structure the user\'s query to match the request schema provided below.\n\nReturn the answer in JSON format with the following fields: \'query\', \'filter\'.Return only JSON without any other comments. NEVER use \'ge\' filter. Instead of it use \'gte\'. Avoid using like, contain, or max in filter value. Filter values should only be derived solely from request data. Strong follow examples structure.\n\n<< Example 1. >>\nData Source:\n```json\n{\n    "content": "Lyrics of a song",\n    "attributes": {\n        "artist": {\n            "type": "string",\n            "description": "Name of the song artist"\n        },\n        "length": {\n            "type": "integer",\n            "description": "Length of the song in seconds"\n        },\n        "genre": {\n            "type": "string",\n            "description": "The song genre, one of "pop", "rock" or "rap""\n        }\n    }\n}\n```\n\nUser Query:\nWhat are songs by Taylor Swift or Katy Perry about teenage romance

If we pass the same input query to `query_constructor_1.invoke` and `query_constructor_2.invoke`, the LLM returns a structured query that contains a `query` value and a `filter` value. Both LLMs produce the same results.

In [85]:
query_constructor_1.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)

StructuredQuery(query='taxi driver', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2000), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)

In [87]:
query_constructor_2.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)

StructuredQuery(query='taxi driver', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2000), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)

In [89]:
query_constructor_1.invoke(
    {
        "query": "dinosaur movie with rating less than 8"
    }
)

StructuredQuery(query='dinosaur movie', filter=Comparison(comparator=<Comparator.LT: 'lt'>, attribute='rating', value=8), limit=None)

In [88]:
query_constructor_2.invoke(
    {
        "query": "dinosaur movie with rating less than 8"
    }
)

StructuredQuery(query='dinosaur movie', filter=Comparison(comparator=<Comparator.LT: 'lt'>, attribute='rating', value=8), limit=None)
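
A `StructuredQuery` like the ones above still has to be turned into a filter the vector store understands, which is the job of the `ChromaTranslator` passed to the retriever in the next section. The sketch below shows the idea with plain tuples standing in for `Comparison` and `Operation`. The helper `to_chroma_where` is hypothetical, and it assumes Chroma's `$`-prefixed `where` syntax (`$and`, `$eq`, `$gte`, and so on); it is not the translator's actual code.

```python
# Stand-ins for LangChain's Comparison / Operation objects (assumption):
#   ("comparison", comparator, attribute, value)
#   ("operation", operator, [arguments])
structured_filter = (
    "operation",
    "and",
    [
        ("comparison", "eq", "genre", "science fiction"),
        ("comparison", "gte", "year", 1990),
        ("comparison", "lt", "year", 2000),
        ("comparison", "eq", "director", "Luc Besson"),
    ],
)


def to_chroma_where(node):
    """Hypothetical helper: recursively convert a filter tree to a where-dict."""
    kind = node[0]
    if kind == "comparison":
        _, comparator, attribute, value = node
        return {attribute: {f"${comparator}": value}}
    if kind == "operation":
        _, operator, arguments = node
        return {f"${operator}": [to_chroma_where(arg) for arg in arguments]}
    raise ValueError(f"Unknown node type: {kind!r}")


where = to_chroma_where(structured_filter)
print(where)
```

The free-text part of the structured query (`'taxi driver'`) is used for the similarity search, while the filter restricts which documents are eligible to be returned.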

### Create a self-querying retriever with the YandexGPT query constructor

In [101]:
retriever = SelfQueryRetriever(
    query_constructor=query_constructor_1,
    vectorstore=vectorstore,
    structured_query_translator=ChromaTranslator(),
    search_kwargs={"k": 1},
)

Try using our retriever.

In [102]:
# The retriever takes the query as a plain string
# "Триллер" is Russian for "Thriller"
print(retriever.invoke("Триллер"))
print()
# "Фантастика" is Russian for "Science fiction"
print(retriever.invoke("Фантастика"))
print()
print(retriever.invoke("dinosaur movie with rating less than 8"))
print()
print(
    retriever.invoke(
        "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
    )
)

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979})]

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993})]

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993})]

[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]
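
To make the third result concrete: the query "dinosaur movie with rating less than 8" yields the filter `lt("rating", 8)`, and only one document in the demo set satisfies it. The sketch below applies that comparison to the demo metadata in plain Python. It only imitates the metadata filtering step and assumes that documents without a `rating` field do not match; the real retriever delegates both filtering and similarity search to Chroma.

```python
import operator

# Metadata of the six demo documents, copied from the `docs` list above
metadata = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]

# Map comparator names to Python comparison functions
COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(meta, comparator, attribute, value):
    """Hypothetical helper: check one comparison against one metadata dict."""
    if attribute not in meta:
        # Assumption: a missing field never satisfies a comparison
        return False
    return COMPARATORS[comparator](meta[attribute], value)


matching = [m for m in metadata if matches(m, "lt", "rating", 8)]
print(matching)
```

Because the filter already narrows the candidates down to the dinosaur movie, the similarity search with `k=1` has only that document left to return.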


### Create a self-querying retriever with the ChatYandexGPT query constructor

In [103]:
retriever = SelfQueryRetriever(
    query_constructor=query_constructor_2,
    vectorstore=vectorstore,
    structured_query_translator=ChromaTranslator(),
    search_kwargs={"k": 1},
)

Try using our retriever.

In [104]:
# The retriever takes the query as a plain string
# "Триллер" is Russian for "Thriller"
print(retriever.invoke("Триллер"))
print()
# "Фантастика" is Russian for "Science fiction"
print(retriever.invoke("Фантастика"))
print()
print(retriever.invoke("dinosaur movie with rating less than 8"))
print()
print(
    retriever.invoke(
        "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
    )
)

[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979})]

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993})]

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993})]

[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]
