# MyScale

>[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database with SQL, and also from here, through LangChain.
>`MyScale` can make use of [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). It can help your LLM app whether you are scaling up your data or expanding your system to broader applications.

In this notebook, we'll demo the `SelfQueryRetriever` wrapped around a `MyScale` vector store, with some extra pieces we contributed to LangChain.

In short, our additions come down to 4 points:
1. Add the `contain` comparator, which matches a list when any of its elements matches the given value
2. Add the `timestamp` data type for datetime matching (ISO format, or YYYY-MM-DD)
3. Add the `like` comparator for string pattern search
4. Add arbitrary function capability
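
To make the four points concrete, here is a minimal plain-Python sketch of what each addition means when applied to one metadata record. The helper names (`contain`, `like`, `after`) are hypothetical and are not LangChain or MyScale functions. Reading `like` as a case-insensitive substring match is an assumption.

```python
from datetime import datetime


def contain(values, item):
    """True when the list-valued attribute holds `item`."""
    return item in values


def like(text, pattern):
    """Case-insensitive substring match (an assumed reading of `like`)."""
    return pattern.lower() in text.lower()


def after(value, bound):
    """Compare two ISO-format strings (YYYY-MM-DD also works) as datetimes."""
    return datetime.fromisoformat(value) > datetime.fromisoformat(bound)


# One record shaped like the sample data used later in this notebook
movie = {
    "date": "1979-09-10",
    "director": "Andrei Tarkovsky",
    "genre": ["science fiction", "adventure"],
}

print(contain(movie["genre"], "adventure"))  # list membership
print(like(movie["director"], "andrei"))     # string pattern search
print(after(movie["date"], "1995-02-01"))    # timestamp comparison
print(len(movie["genre"]) > 1)               # an arbitrary function on a column
```

In the retriever itself these checks are expressed as filters and evaluated by the database, not in Python.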

## Creating a MyScale vector store
MyScale has been integrated into LangChain for a while, so you can follow [this notebook](/docs/integrations/vectorstores/myscale) to create your own vector store for a self-query retriever.

**Note:** All self-query retrievers require you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, note that `clickhouse-connect` is also needed to interact with your MyScale backend.

In [None]:
! pip install lark clickhouse-connect langchain-core

In this tutorial we follow the setup of the other examples and use `OpenAIEmbeddings`. Remember to get an OpenAI API key for valid access to LLMs.

In [1]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["MYSCALE_HOST"] = getpass.getpass("MyScale URL:")
os.environ["MYSCALE_PORT"] = getpass.getpass("MyScale Port:")
os.environ["MYSCALE_USERNAME"] = getpass.getpass("MyScale Username:")
os.environ["MYSCALE_PASSWORD"] = getpass.getpass("MyScale Password:")
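
Since `getpass` returns every value as a string, it can help to check the settings before connecting. The sketch below is one possible way to do that; `read_myscale_env` is a hypothetical helper, not part of LangChain, and the values in the usage example are placeholders.

```python
import os

REQUIRED = ("MYSCALE_HOST", "MYSCALE_PORT", "MYSCALE_USERNAME", "MYSCALE_PASSWORD")


def read_myscale_env(env=os.environ):
    """Collect the MyScale settings from a mapping and validate them."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return {
        "host": env["MYSCALE_HOST"],
        "port": int(env["MYSCALE_PORT"]),  # raises ValueError if not numeric
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


# Usage with placeholder values; pass nothing to read from os.environ instead
example = read_myscale_env(
    {
        "MYSCALE_HOST": "localhost",
        "MYSCALE_PORT": "8443",
        "MYSCALE_USERNAME": "user",
        "MYSCALE_PASSWORD": "secret",
    }
)
print(example["host"], example["port"])
```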

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import MyScale

embeddings = OpenAIEmbeddings()

## Create some sample data
As you can see, the data we created differs somewhat from that of other self-query retrievers. We replaced the keyword `year` with `date`, which gives you finer control over timestamps. We also changed the type of the keyword `genre` to a list of strings, so that an LLM can use the new `contain` comparator to construct filters. We also provide the `like` comparator and arbitrary function support for filters, which are introduced in the next few cells.

Now let's look at the data first.

In [None]:
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"date": "1993-07-02", "rating": 7.7, "genre": ["science fiction"]},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"date": "2010-12-30", "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"date": "2006-04-23", "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"date": "2019-08-22", "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"date": "1995-02-11", "genre": ["animation"]},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "date": "1979-09-10",
            "director": "Andrei Tarkovsky",
            "genre": ["science fiction", "adventure"],
            "rating": 9.9,
        },
    ),
]
vectorstore = MyScale.from_documents(
    docs,
    embeddings,
)

## Creating our self-querying retriever
Just like other retrievers... simple and nice.

We also introduce `VirtualColumnName`, which lets you prompt the LLM with column names that link to certain SQL functions or columns.

Self-query retrievers are powerful, and they can be made stronger still. Taking MyScale as an example, a user may sometimes want to wrap a column name in a complex SQL function. So we added `VirtualColumnName` to the self-query retriever, which reduces token usage and adds functionality:

1. Column names can be mapped to function calls, which is handy when the user wants to compare a column against the current time, a location, or other attributes.
2. A complex function call written as a plain string can be replaced with a shorter nickname in the prompt, which saves tokens for very long function names.
3. The logic in the MyScale translator was refined to meet this standard.

This change does not affect other self-query retrievers, as it preserves the original plain-string interface for comparisons. The actual difference lies in `QueryTransformer` in `query_constructor.parser`, which is used with the `Lark` parser.

`Comparison.attribute` can now be either a string or a `VirtualColumnName`. The default behavior prevents using `VirtualColumnName` with vector stores other than MyScale.

This functionality lets the self-query retriever act as a bridge between text and simple SQL queries, and we believe it will help users expand how they use this retriever. Other SQL vector databases are theoretically compatible with this feature as well.
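
The idea can be illustrated with a minimal sketch. The `VirtualColumn` class and `resolve_attribute` function below are hypothetical stand-ins written for this illustration, not the LangChain implementation: the LLM only sees the short `name`, and the longer `column` expression is substituted when the filter is built. The default metadata column name `metadata` is taken from the outputs shown later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class VirtualColumn:
    """Hypothetical stand-in: a prompt-facing nickname plus its SQL expression."""

    name: str
    column: Optional[str] = None

    def resolve(self, metadata_column: str = "metadata") -> str:
        # Fall back to `<metadata_column>.<name>` when no expression is given
        return self.column or f"{metadata_column}.{self.name}"


def resolve_attribute(
    attr: Union[str, VirtualColumn], metadata_column: str = "metadata"
) -> str:
    """Plain strings keep working; virtual columns expand to their expression."""
    if isinstance(attr, VirtualColumn):
        return attr.resolve(metadata_column)
    return f"{metadata_column}.{attr}"


date = VirtualColumn("date", "parseDateTime32BestEffort(metadata.date)")
n_genres = VirtualColumn("length(genre)", "length(metadata.genre)")

print(resolve_attribute("rating"))
print(resolve_attribute(date))
print(resolve_attribute(n_genres))
```

Keeping the plain-string branch is what lets existing attribute definitions continue to work unchanged.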


In [4]:
from langchain.chains.query_constructor.base import AttributeInfo, VirtualColumnName
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    # You can use a plain string to specify a column.
    # For a plain string, MyScale's self-query retriever maps it to
    # `metadata.<your-attribute-name>`.
    # This is the default behavior if you used LangChain to insert data into MyScale.
    # The plain `genre` AttributeInfo further below is equivalent to
    #   |||
    #   vvv
    # AttributeInfo(
    #     name=VirtualColumnName(
    #         name="genre", column=f"{vectorstore.metadata_column}.genre"
    #     ),
    #     description="The genres of the movie",
    #     type="list[string]",
    # )
    AttributeInfo(
        name="genre",
        description="The genres of the movie",
        type="list[string]",
    ),
    # Or, if you want to use a customized column name, use a virtual column name.
    # This expands what you can do with this self-query retriever.
    # If you want to include the length of a list, just define it as a new column.
    # This teaches the LLM to use it as a column when constructing filters.
    AttributeInfo(
        name=VirtualColumnName(
            name="length(genre)", column=f"length({vectorstore.metadata_column}.genre)"
        ),
        description="The length of genres of the movie",
        type="integer",
    ),
    # Virtual columns can also help you with SQL functions.
    # You can define a column as a timestamp simply by setting the type to timestamp.
    AttributeInfo(
        # Virtual column names map a short name onto a long column expression
        name=VirtualColumnName(
            name="date",
            column=f"parseDateTime32BestEffort({vectorstore.metadata_column}.date)",
        ),
        description="The date the movie was released",
        type="timestamp",
    ),
    AttributeInfo(
        name=VirtualColumnName(
            name="director", column=f"{vectorstore.metadata_column}.director"
        ),
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name=VirtualColumnName(
            name="rating", column=f"{vectorstore.metadata_column}.rating"
        ),
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

## Testing it out with self-query retriever's existing functionalities
And now we can try actually using our retriever!

In [5]:
# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")

query='dinosaur' filter=None limit=None


[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'date': '1993-07-02', 'director': '', 'genre': ['science fiction'], 'rating': 7.7}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'date': '1995-02-11', 'director': '', 'genre': ['animation'], 'rating': 0.0}),
 Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'date': '2006-04-23', 'director': 'Satoshi Kon', 'genre': [], 'rating': 8.6})]

In [6]:
# This example only specifies a filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")

query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='metadata.rating', value=8.5) limit=None


[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'date': '2006-04-23', 'director': 'Satoshi Kon', 'genre': [], 'rating': 8.6})]

In [7]:
# This example specifies a query and a filter
retriever.get_relevant_documents("Has Greta Gerwig directed any movies about women")

query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='metadata.director', value='Greta Gerwig') limit=None


[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'date': '2019-08-22', 'director': 'Greta Gerwig', 'genre': [], 'rating': 8.3})]

In [8]:
# This example specifies a composite filter
retriever.get_relevant_documents(
    "What's a highly rated (above 8.5) science fiction film?"
)

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='metadata.rating', value=8.5), Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='genre', value='science fiction')]) limit=None


[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9})]

In [9]:
# This example specifies a query and composite filter
retriever.get_relevant_documents(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animation"
)

query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='parseDateTime32BestEffort(metadata.date)', value={'date': '1990-01-01', 'type': 'date'}), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='parseDateTime32BestEffort(metadata.date)', value={'date': '2005-12-31', 'type': 'date'}), Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='genre', value='animation')]) limit=None


[Document(page_content='Toys come alive and have a blast doing so', metadata={'date': '1995-02-11', 'director': '', 'genre': ['animation'], 'rating': 0.0})]
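
To see how a structured filter like the ones printed above could end up as SQL, here is a simplified sketch. It is an illustration under stated assumptions, not the MyScale translator: the `Comparison` and `Operation` classes are hypothetical stand-ins, only scalar values and the `and`/`or` operators are handled, and mapping `contain` to a `has(...)` call is an assumption.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Comparison:
    comparator: str
    attribute: str
    value: object


@dataclass
class Operation:
    operator: str
    arguments: List[Union["Comparison", "Operation"]]


_SYMBOLS = {"eq": "=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def quote(value):
    """Quote strings as SQL literals; leave numbers as they are."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_sql(node):
    """Recursively turn a filter tree into a WHERE-clause fragment."""
    if isinstance(node, Operation):
        if node.operator not in ("and", "or"):
            raise ValueError(f"unsupported operator: {node.operator}")
        joiner = f" {node.operator.upper()} "
        return "(" + joiner.join(to_sql(arg) for arg in node.arguments) + ")"
    if node.comparator == "contain":
        return f"has({node.attribute}, {quote(node.value)})"
    return f"{node.attribute} {_SYMBOLS[node.comparator]} {quote(node.value)}"


highly_rated_scifi = Operation(
    "and",
    [
        Comparison("gt", "metadata.rating", 8.5),
        Comparison("contain", "metadata.genre", "science fiction"),
    ],
)
print(to_sql(highly_rated_scifi))
```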

## Wait a second... what else?

Self-query retriever with MyScale can do more! Let's find out.

In [10]:
# You can filter on length(genre), the virtual column we defined above
retriever.get_relevant_documents("What's a movie that have more than 1 genres?")

query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='length(metadata.genre)', value=1) limit=None


[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9})]

In [11]:
# Fine-grained datetime? You got it already.
retriever.get_relevant_documents("What's a movie that release after feb 1995?")

query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='parseDateTime32BestEffort(metadata.date)', value={'date': '1995-02-01', 'type': 'date'}) limit=None


[Document(page_content='Toys come alive and have a blast doing so', metadata={'date': '1995-02-11', 'director': '', 'genre': ['animation'], 'rating': 0.0}),
 Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'date': '2019-08-22', 'director': 'Greta Gerwig', 'genre': [], 'rating': 8.3}),
 Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'date': '2010-12-30', 'director': 'Christopher Nolan', 'genre': [], 'rating': 8.2}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'date': '2006-04-23', 'director': 'Satoshi Kon', 'genre': [], 'rating': 8.6})]

In [12]:
# Don't know what your exact filter should be? Use string pattern match!
retriever.get_relevant_documents("What's a movie whose name is like Andrei?")

query='Andrei' filter=Comparison(comparator=<Comparator.LIKE: 'like'>, attribute='metadata.director', value='Andrei') limit=None


[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9})]

In [13]:
# `contain` works on lists, so you can match list elements with the contain comparator!
retriever.get_relevant_documents(
    "What's a movie who has genres science fiction and adventure?"
)

query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.CONTAIN: 'contain'>, attribute='genre', value='adventure')]) limit=None


[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'date': '1979-09-10', 'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'adventure'], 'rating': 9.9})]

## Filter k

We can also use the self query retriever to specify `k`: the number of documents to fetch.

We can do this by passing `enable_limit=True` to the constructor.

In [14]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True,
)

In [15]:
# This example specifies a relevant query and a limit
retriever.get_relevant_documents("what are two movies about dinosaurs")

query='dinosaur' filter=None limit=2


[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'date': '1993-07-02', 'director': '', 'genre': ['science fiction'], 'rating': 7.7}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'date': '1995-02-11', 'director': '', 'genre': ['animation'], 'rating': 0.0})]