# RAG from Scratch: Part 11 - Query Structuring

Resources:

- Video: [RAG from Scratch: Part 11](https://www.youtube.com/watch?v=kl6NwWYxvbM&list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&index=11)
- Notebook: [`rag_from_scratch_10_and_11.ipynb`](./notebooks/rag-from-scratch/rag_from_scratch_10_and_11.ipynb)

In [1]:
from dotenv import load_dotenv

In [2]:
load_dotenv(override=True, dotenv_path="../.env")

True

**WARNING**: `YoutubeLoader` currently seems to fail. I tried several fixes suggested in GitHub issues, but the cell below never ran properly.
**HOWEVER**, the rest of the notebook still works, because it only shows how to build structured queries from an expected class definition and does not use the loaded documents.

In [None]:
from langchain_community.document_loaders import YoutubeLoader

video_url = "https://www.youtube.com/watch?v=pbAd8O1Lvm4"

loader = YoutubeLoader.from_youtube_url(
    video_url, add_video_info=True, 
    use_oauth=True, allow_oauth_cache=True
)

docs = loader.load()

docs[0].metadata

Let’s assume we’ve built an index that:

1. Supports unstructured search over the `contents` and `title` of each document.
2. Supports range filtering on `view_count`, `publish_date`, and `length`.

We want to convert natural-language questions into structured search queries. To do that, we define a schema that describes such a query.

When the loader works, each document's metadata is accessible via `.metadata`:

```python
docs[0].metadata

# Output:
{'source': 'pbAd8O1Lvm4',
 'title': 'Self-reflective RAG with LangGraph: Self-RAG and CRAG',
 'description': 'Unknown',
 'view_count': 11922,
 'thumbnail_url': 'https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg',
 'publish_date': '2024-02-07 00:00:00',
 'length': 1058,
 'author': 'LangChain'}
```
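
The `publish_date` value above is a string, whereas range filters need to compare dates. A minimal sketch of normalizing one metadata record before indexing, assuming the `'%Y-%m-%d %H:%M:%S'` format shown above holds for every document. The helper name `normalize_metadata` is hypothetical and not part of LangChain:

```python
import datetime


def normalize_metadata(metadata: dict) -> dict:
    """Return a copy of `metadata` with `publish_date` parsed into a `datetime.date`.

    Hypothetical helper: assumes the 'YYYY-MM-DD HH:MM:SS' string format
    seen in the example metadata.
    """
    normalized = dict(metadata)  # shallow copy; the input is left untouched
    raw = normalized.get("publish_date")
    if isinstance(raw, str):
        normalized["publish_date"] = datetime.datetime.strptime(
            raw, "%Y-%m-%d %H:%M:%S"
        ).date()
    return normalized


example = {
    "source": "pbAd8O1Lvm4",
    "view_count": 11922,
    "publish_date": "2024-02-07 00:00:00",
    "length": 1058,
}
print(normalize_metadata(example)["publish_date"])  # 2024-02-07
```

With dates stored as `datetime.date` objects, they can be compared directly against the date bounds of a structured query.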

In [None]:
import datetime
from typing import Optional
# Pydantic v2 is imported directly instead of going through `langchain_core.pydantic_v1`:
# from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import BaseModel, Field

# Define the schema for a tutorial search request.
# The schema is a Pydantic model describing the fields of a structured search query.
# Later, a prompt asks the LLM to convert a natural-language question into a query
# that follows this schema; the schema is enforced via the
# `with_structured_output` method.
class TutorialSearch(BaseModel):
    """Search over a database of tutorial videos about a software library."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    title_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to video titles. "
            "Should be succinct and only include key words that could be in a video "
            "title."
        ),
    )
    min_view_count: Optional[int] = Field(
        None,
        description="Minimum view count filter, inclusive. Only use if explicitly specified.",
    )
    max_view_count: Optional[int] = Field(
        None,
        description="Maximum view count filter, exclusive. Only use if explicitly specified.",
    )
    earliest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Earliest publish date filter, inclusive. Only use if explicitly specified.",
    )
    latest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Latest publish date filter, exclusive. Only use if explicitly specified.",
    )
    min_length_sec: Optional[int] = Field(
        None,
        description="Minimum video length in seconds, inclusive. Only use if explicitly specified.",
    )
    max_length_sec: Optional[int] = Field(
        None,
        description="Maximum video length in seconds, exclusive. Only use if explicitly specified.",
    )

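    def pretty_print(self) -> None:
        # Print only the fields that are set and differ from their defaults.
        # `model_fields` replaces `__fields__`, which is deprecated in Pydantic v2
        # and emits a PydanticDeprecatedSince20 warning on every access.
        for name, field_info in type(self).model_fields.items():
            value = getattr(self, name)
            if value is not None and value != field_info.default:
                print(f"{name}: {value}")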

In [8]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(TutorialSearch)
query_analyzer = prompt | structured_llm
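
`with_structured_output` hides the step that turns the model's reply into a typed object. As a rough, library-free sketch of that idea (not LangChain's actual implementation), assume the model replies with JSON arguments that follow the schema. The function `parse_query_arguments` and the sample reply are hypothetical:

```python
import datetime
import json

# Fields that the schema declares as dates; JSON carries them as ISO strings.
DATE_FIELDS = ("earliest_publish_date", "latest_publish_date")


def parse_query_arguments(raw_json: str) -> dict:
    """Parse JSON arguments into a dict, converting date fields to `datetime.date`."""
    arguments = json.loads(raw_json)
    for name in DATE_FIELDS:
        if arguments.get(name) is not None:
            arguments[name] = datetime.date.fromisoformat(arguments[name])
    return arguments


# Hypothetical model reply, written by hand for illustration.
reply = (
    '{"content_search": "chat langchain", "title_search": "chat langchain", '
    '"earliest_publish_date": "2023-01-01", "latest_publish_date": "2024-01-01"}'
)
query = parse_query_arguments(reply)
print(query["earliest_publish_date"])  # 2023-01-01
```

Pydantic performs this kind of conversion and validation automatically when the arguments are loaded into `TutorialSearch`, so no manual parsing is needed in the notebook.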

In [9]:
query_analyzer.invoke({"question": "rag from scratch"}).pretty_print()

content_search: rag from scratch
title_search: rag
min_view_count: 1000


/var/folders/06/wdqtkk796gjfxfq9063zphx40000gn/T/ipykernel_85998/1936674047.py:47: PydanticDeprecatedSince20: The `__fields__` attribute is deprecated, use `model_fields` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  for field in self.__fields__:
/var/folders/06/wdqtkk796gjfxfq9063zphx40000gn/T/ipykernel_85998/1936674047.py:49: PydanticDeprecatedSince20: The `__fields__` attribute is deprecated, use `model_fields` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  self.__fields__[field], "default", None


In [10]:
query_analyzer.invoke(
    {"question": "videos on chat langchain published in 2023"}
).pretty_print()

content_search: chat langchain
title_search: 2023
earliest_publish_date: 2023-01-01
latest_publish_date: 2024-01-01




In [11]:
query_analyzer.invoke(
    {"question": "videos that are focused on the topic of chat langchain that are published before 2024"}
).pretty_print()

content_search: chat langchain
title_search: chat langchain
latest_publish_date: 2024-01-01




In [12]:
query_analyzer.invoke(
    {
        "question": "how to use multi-modal models in an agent, only videos under 5 minutes"
    }
).pretty_print()

content_search: multi-modal models agent
title_search: multi-modal models agent
max_length_sec: 300



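
The structured queries above only become useful once their fields are applied as filters. A minimal sketch of that step over plain metadata dicts, assuming an in-memory record instead of a real vector store. `RangeFilters`, `in_range`, and `matches` are hypothetical names; `RangeFilters` mirrors the optional fields of `TutorialSearch`, with inclusive lower bounds and exclusive upper bounds as stated in the field descriptions:

```python
import datetime
from dataclasses import dataclass
from typing import Optional


@dataclass
class RangeFilters:
    """Hypothetical stand-in for the optional range fields of TutorialSearch."""

    min_view_count: Optional[int] = None
    max_view_count: Optional[int] = None
    earliest_publish_date: Optional[datetime.date] = None
    latest_publish_date: Optional[datetime.date] = None
    min_length_sec: Optional[int] = None
    max_length_sec: Optional[int] = None


def in_range(value, lower, upper) -> bool:
    """Half-open check: lower <= value < upper; a bound of None is ignored."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True


def matches(metadata: dict, f: RangeFilters) -> bool:
    """Return True if a metadata record satisfies every filter that is set."""
    return (
        in_range(metadata["view_count"], f.min_view_count, f.max_view_count)
        and in_range(
            metadata["publish_date"], f.earliest_publish_date, f.latest_publish_date
        )
        and in_range(metadata["length"], f.min_length_sec, f.max_length_sec)
    )


# Illustrative record; `publish_date` is assumed to be parsed into a date already.
video = {
    "view_count": 11922,
    "publish_date": datetime.date(2024, 2, 7),
    "length": 1058,
}

published_in_2023 = RangeFilters(
    earliest_publish_date=datetime.date(2023, 1, 1),
    latest_publish_date=datetime.date(2024, 1, 1),
)
under_five_minutes = RangeFilters(max_length_sec=300)

print(matches(video, published_in_2023))   # False: published in February 2024
print(matches(video, under_five_minutes))  # False: 1058 seconds exceeds 300
print(matches(video, RangeFilters(min_view_count=1000)))  # True
```

In a real system, the same bounds would instead be translated into the metadata filter syntax of whichever vector store backs the index.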