# High Cardinality

Query analysis often needs to produce a filter on a categorical column. The difficulty is that the filter usually has to specify the EXACT categorical value, which is hard when the column has MANY distinct values (that is, high cardinality).

In this notebook we take a look at how to approach this.

## Setup
#### Install dependencies

In [1]:
# %pip install -qU langchain langchain-community langchain-openai faker

#### Set environment variables

We'll use OpenAI in this example:

In [1]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

#### Set up data

We will generate a list of fake names.

In [3]:
from faker import Faker
fake = Faker()

names = [fake.name() for _ in range(10000)]

Let's look at some of the names.

In [14]:
names[0]

'Michelle Horton'

In [6]:
names[567]

'Andrew Fuller'

## Query Analysis

We can now set up a baseline query analyzer.

In [28]:
from langchain_core.pydantic_v1 import BaseModel, Field

In [29]:
class Search(BaseModel):
    query: str
    author: str

In [35]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm

We can see that if we spell the name exactly right, the analyzer handles it correctly.

In [36]:
query_analyzer.invoke("what are books about aliens by Andrew Fuller")

Search(query='books about aliens', author='Andrew Fuller')

The issue is that the values you want to filter on are often NOT spelled exactly right. Here the model returns a name that does not exist in our list:

In [37]:
query_analyzer.invoke("what are books about aliens by andy fuller")

Search(query='books about aliens', author='Andy Fuller')
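To see why this near miss matters, here is a minimal sketch of a downstream exact-match filter. The `catalog` records and the `filter_by_author` helper are hypothetical and not part of this notebook. They only illustrate that a filter keyed on `'Andy Fuller'` returns nothing when the stored value is `'Andrew Fuller'`.

```python
# Hypothetical catalog; in a real system this would be your document store.
catalog = [
    {"title": "Visitors from Afar", "author": "Andrew Fuller"},
    {"title": "Gardening Basics", "author": "Michelle Horton"},
]


def filter_by_author(records, author):
    """Return records whose author matches exactly (case-sensitive)."""
    return [r for r in records if r["author"] == author]


exact_hits = filter_by_author(catalog, "Andrew Fuller")    # one record
near_miss_hits = filter_by_author(catalog, "Andy Fuller")  # empty list
```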

### Add in all values

One way around this is to add ALL possible values to the prompt. That will generally guide the query in the right direction.

In [81]:
system = """Generate a relevant search query for a library system.

`author` attribute MUST be one of:

{authors}

Do NOT hallucinate author name!"""
base_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
# Fill the {authors} placeholder on the new prompt (not the earlier baseline prompt).
prompt = base_prompt.partial(authors=", ".join(names))

In [82]:
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm

However, if the list of categorical values is long enough, the request may fail with an error!

In [83]:
try:
    query_analyzer_all.invoke("what are books about aliens by andy fuller")
except Exception as e:
    print(e)

Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 33733 tokens (33702 in the messages, 31 in the functions). Please reduce the length of the messages or functions.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
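Before sending such a prompt, you can roughly check whether the joined value list is likely to fit. The sketch below is only an approximation: the 4-characters-per-token ratio is an assumed rule of thumb (an actual tokenizer would give exact counts), and `estimate_tokens`, `fits_in_context`, and the `reserved` budget are hypothetical names introduced for illustration. The 16385 limit is the one reported in the error above.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed heuristic."""
    return int(len(text) / chars_per_token)


def fits_in_context(values, context_limit=16385, reserved=500):
    """Check whether a comma-joined value list plausibly fits in the prompt.

    `reserved` is a hypothetical allowance for the instructions, the user
    question, and the function schema.
    """
    joined = ", ".join(values)
    return estimate_tokens(joined) + reserved <= context_limit


sample_names = ["Andrew Fuller", "Michelle Horton"]
small_ok = fits_in_context(sample_names)         # a handful of names fits
large_ok = fits_in_context(sample_names * 5000)  # 10,000 names does not
```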


We can try a model with a longer context window, but with so much information in the prompt, it is not guaranteed to pick out the right value reliably.

In [84]:
llm_long = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
structured_llm_long = llm_long.with_structured_output(Search)
query_analyzer_all = {"question": RunnablePassthrough()} | prompt | structured_llm_long

In [85]:
query_analyzer_all.invoke("what are books about aliens by andy fuller")

Search(query='aliens', author='Andy Fuller')

### Find all relevant values

Instead, we can create an index over the possible values and then query it for the N most relevant values.

In [87]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(
    names,
    embeddings,
    collection_name="author_names"
)

In [88]:
def select_names(question):
    _docs = vectorstore.similarity_search(question, k=50)
    _names = [d.page_content for d in _docs]
    return _names

In [89]:
query_analyzer_select = {
    "question": RunnablePassthrough(),
    "authors": lambda x: ", ".join(select_names(x))
} | base_prompt | structured_llm

In [90]:
query_analyzer_select.invoke("what are books about aliens by andy fuller")

Search(query='books about aliens', author='Andrew Fuller')
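To experiment with the selection step without calling an embeddings API, the following sketch swaps in character-level string similarity from Python's standard-library `difflib` in place of the vector store. This is a stand-in technique, not what the notebook uses: `select_names_offline` and `sample_names` are hypothetical, and string similarity works best on a short author guess rather than on a full question.

```python
import difflib


def select_names_offline(author_guess, names, k=3):
    """Return up to k names most similar to author_guess by character overlap."""
    # Compare case-insensitively, then map back to the original spelling.
    by_lower = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(
        author_guess.lower(), list(by_lower), n=k, cutoff=0.0
    )
    return [by_lower[m] for m in matches]


sample_names = ["Michelle Horton", "Andrew Fuller", "Anthony Fulton", "Sandra Miller"]
candidates = select_names_offline("andy fuller", sample_names, k=2)

# The shortlist can then be joined into the {authors} slot of the prompt.
authors_for_prompt = ", ".join(candidates)
```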

### Replace after selection

Another method is to let the LLM fill in any value and then convert that value to a valid one.
This can be done with the Pydantic class itself!

In [92]:
from langchain_core.pydantic_v1 import validator


class Search(BaseModel):
    query: str
    author: str

    @validator("author")
    def correct_author(cls, v: str) -> str:
        # Snap the LLM's guess to the closest known author name in the index.
        return vectorstore.similarity_search(v, k=1)[0].page_content

In [96]:
system = """Generate a relevant search query for a library system"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
corrective_structure_llm = llm.with_structured_output(Search)
corrective_query_analyzer = {"question": RunnablePassthrough()} | prompt | corrective_structure_llm

In [97]:
corrective_query_analyzer.invoke("what are books about aliens by andy fuller")

Search(query='books about aliens', author='Andrew Fuller')
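One caveat: a nearest-neighbor lookup with `k=1` returns some name whenever the index is non-empty, even if the requested author is not in the list at all. A possible guard is a similarity cutoff. The sketch below illustrates the idea offline with `difflib` string similarity instead of the vector store. The `snap_author` helper and the `0.6` cutoff are assumptions chosen for illustration, not part of the notebook.

```python
import difflib
from typing import List, Optional


def snap_author(value: str, names: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known name, or None if nothing is similar enough."""
    by_lower = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(
        value.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


sample_names = ["Michelle Horton", "Andrew Fuller"]
snapped = snap_author("Andy Fuller", sample_names)   # close enough to snap
unknown = snap_author("Zzyzx Qwerty", sample_names)  # no plausible match
```

With a vector store, a similar effect could be achieved by checking a similarity score before accepting the match. The right threshold depends on the embedding model and the data.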