In [2]:
import pprint
from dotenv import load_dotenv
load_dotenv()

False

## Retrieval Systems Overview

Retrieval systems help AI applications find relevant information in large datasets, supporting:

- **Unstructured text** (e.g., documents) via vector stores or search indexes  
- **Structured data** in relational or graph databases

Modern apps aim to access all data types through **natural language**: models translate user queries into formats the underlying system can understand, which makes data interaction more intuitive and flexible.



### Key concepts

 - **Query analysis**: A process where models transform or construct search queries to optimize retrieval.

 - **Information retrieval**: Search queries are used to fetch information from various retrieval systems.

![Query transform](assets/retrieval_concept-2bcff1b2518f194b34eaf472ac748ffa.png "Query transform")

### Query Analysis

Query analysis bridges user input and optimized search queries in retrieval systems.

#### Key Functions:
- **Query Re-writing**: Improve search results by rephrasing or expanding user input.
- **Query Construction**: Convert natural language into structured formats (e.g., SQL).
- **Model Use**: Leverage LLMs to transform or optimize queries.

#### Benefits:
- **Clarify** ambiguous input  
- **Understand semantics** and user intent  
- **Expand** queries with related terms  
- **Handle complex** multi-part questions

#### Techniques:

| **Name**        | **When to Use**                                           | **Description** |
|-----------------|-----------------------------------------------------------|-----------------|
| `Multi-query`   | Ensure high recall                                        | Generate multiple phrasings and merge results |
| `Decomposition` | Break down complex questions                              | Create and solve sub-questions sequentially or in parallel |
| `Step-back`     | Need for higher-level understanding                       | Ask broader grounding questions first |
| `HyDE`          | When raw queries retrieve poorly                          | Generate hypothetical docs to improve retrieval accuracy |
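
The `Multi-query` row can be sketched without a model: assume the alternative phrasings have already been generated, run each one against a search function, and merge the results while dropping duplicates. The `FAKE_INDEX` lookup and the `merge_results` helper below are hypothetical stand-ins, not LangChain APIs.

```python
from typing import Callable, Dict, List


def merge_results(
    queries: List[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run every query phrasing and return the de-duplicated union of results."""
    seen = set()
    merged = []
    for query in queries:
        for doc_id in search(query):
            # Keep only the first occurrence so earlier phrasings keep their order
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical stand-in for a real retriever: maps a phrasing to ranked doc ids
FAKE_INDEX: Dict[str, List[str]] = {
    "components of an LLM agent": ["doc-planning", "doc-memory"],
    "parts of an autonomous agent system": ["doc-memory", "doc-tools"],
}

phrasings = list(FAKE_INDEX)
merged = merge_results(phrasings, lambda q: FAKE_INDEX.get(q, []))
print(merged)  # ['doc-planning', 'doc-memory', 'doc-tools']
```

A real pipeline would typically re-rank the merged list; this sketch only preserves first-seen order.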


In [21]:
from typing import List

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Define a pydantic model to enforce the output structure
class Questions(BaseModel):
    questions: List[str] = Field(
        description="A list of sub-questions related to the input query."
    )

# Create an instance of the model and enforce the output structure
model = ChatOpenAI(model="gpt-4o-mini", temperature=0) 
structured_model = model.with_structured_output(Questions)

# Define the system prompt
system = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n"""

# Pass the question to the model
question = """What are the main components of an LLM-powered autonomous agent system?"""

system_message = SystemMessage(content=system)
human_message = HumanMessage(content=question)

messages = [system_message] + [human_message]

[pprint.pp(m) for m in messages]

questions = structured_model.invoke(messages)

SystemMessage(content='You are a helpful assistant that generates multiple sub-questions related to an input question. \n\nThe goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n', additional_kwargs={}, response_metadata={})
HumanMessage(content='What are the main components of an LLM-powered autonomous agent system?', additional_kwargs={}, response_metadata={})


In [4]:
[pprint.pp(q) for q in questions]

('questions',
 ['What is an LLM and how does it function within an autonomous agent system?',
  'What are the key components of an autonomous agent system?',
  'How does an LLM interact with other components in an autonomous agent '
  'system?',
  'What role does data play in the functioning of an LLM-powered autonomous '
  'agent system?',
  'How is decision-making handled in an LLM-powered autonomous agent system?',
  'What are the input and output mechanisms for an LLM in an autonomous agent '
  'system?',
  'How does an LLM-powered autonomous agent system learn and adapt over time?',
  'What are the challenges in integrating an LLM into an autonomous agent '
  'system?',
  'How does an LLM ensure the security and privacy of data in an autonomous '
  'agent system?',
  'What are some real-world applications of LLM-powered autonomous agent '
  'systems?'])


[None]

### Query Construction

Query construction translates natural language into query languages or filters, enabling effective interaction with structured and semi-structured data systems.

#### Structured Data Examples:
- **Text-to-SQL**: Convert natural language to SQL for relational databases  
- **Text-to-Cypher**: Convert natural language to Cypher for graph databases  

#### Semi-structured Data Examples:
- **Text to Metadata Filters**: Convert natural language into metadata filters for vector stores  

These methods use models to bridge user intent with system-specific query formats.

#### Techniques:

| **Name**          | **When to Use**                                                   | **Description** |
|-------------------|-------------------------------------------------------------------|-----------------|
| `Self Query`      | When answers rely on document metadata                            | Transforms input into (1) semantic query + (2) metadata filter |
| `Text-to-SQL`     | When querying relational databases                                | Converts user input into SQL queries |
| `Text-to-Cypher`  | When querying graph databases                                     | Converts user input into Cypher queries |
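
To make the `Text-to-SQL` row more concrete, here is a minimal sketch of the step that follows the model call. It assumes the model has already turned the user's request into a list of `(column, operator, value)` conditions; that intermediate format, the `movies` table, and the `build_query` helper are illustrative assumptions, not part of LangChain. Column names and operators are checked against allow-lists, and values are passed as bound parameters.

```python
import sqlite3

# In-memory table with the director/year/rating values used in this notebook
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (director TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Christopher Nolan", 2010, 8.2),
        ("Satoshi Kon", 2006, 8.6),
        ("Greta Gerwig", 2019, 8.3),
        ("Andrei Tarkovsky", 1979, 9.9),
    ],
)

ALLOWED_COLUMNS = {"director", "year", "rating"}
ALLOWED_OPERATORS = {"=", "!=", ">", ">=", "<", "<="}


def build_query(conditions):
    """Turn (column, operator, value) triples into parameterized SQL."""
    clauses, params = [], []
    for column, operator, value in conditions:
        # Only allow-listed identifiers are interpolated into the SQL text
        if column not in ALLOWED_COLUMNS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"Unsupported condition: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = "SELECT director FROM movies"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY rating DESC"
    return sql, params


# Hypothetical model output for "movies rated higher than 8.5"
sql, params = build_query([("rating", ">", 8.5)])
directors = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(directors)  # ['Andrei Tarkovsky', 'Satoshi Kon']
```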


### Self-Querying Retriever

A self-querying retriever uses an LLM to turn natural language into a structured query. It applies this to a vector store by:

- Performing **semantic search**
- Extracting and applying **metadata filters**

This allows for more accurate and targeted retrieval based on both content and metadata.

![Self Query Retriever](assets/self_querying-26ac0fc8692e85bc3cd9b8640509404f.jpg "Self Query Retriever")
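
The two steps can be sketched in plain Python. This is only an illustration of the idea: the word-overlap score is a toy stand-in for embedding similarity, and `filtered_search` is a hypothetical helper rather than anything the retriever exposes.

```python
from typing import Any, Callable, Dict, List, Tuple

Movie = Tuple[str, Dict[str, Any]]  # (summary, metadata)

MOVIES: List[Movie] = [
    ("Toys come alive and have a blast doing so",
     {"year": 1995, "genre": "animated"}),
    ("A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     {"year": 1993, "rating": 7.7, "genre": "science fiction"}),
    ("Three men walk into the Zone, three men walk out of the Zone",
     {"year": 1979, "genre": "thriller", "rating": 9.9}),
]


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def filtered_search(
    query: str,
    metadata_filter: Callable[[Dict[str, Any]], bool],
    k: int = 2,
) -> List[str]:
    # Step 1: keep only documents whose metadata passes the filter
    candidates = [movie for movie in MOVIES if metadata_filter(movie[1])]
    # Step 2: rank the survivors by similarity to the query text
    ranked = sorted(
        candidates, key=lambda movie: word_overlap(query, movie[0]), reverse=True
    )
    return [summary for summary, _ in ranked[:k]]


results = filtered_search("toys", lambda meta: 1990 < meta["year"] < 2005)
print(results[0])  # Toys come alive and have a blast doing so
```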

We use `Chroma` as an in-memory vector store. Let's initialize it and index some movie summaries.

In [5]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())


In [6]:
vectorstore.get()

{'ids': ['0f841284-069e-42bc-9230-54ef6f68c941',
  'bfd4fe89-05e0-483e-ac68-91531987bced',
  '4ef9a41a-f07d-4c07-bb1a-e6c29f890554',
  'eb471aa5-9cda-42e3-9f99-00507b0d43f4',
  '87f4ddce-ee48-45f2-a3b1-99be490db69b',
  '0034b17e-c7d2-4c92-a5c1-be6726e6a2db'],
 'embeddings': None,
 'documents': ['A bunch of scientists bring back dinosaurs and mayhem breaks loose',
  'Leo DiCaprio gets lost in a dream within a dream within a dream within a ...',
  'A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea',
  'A bunch of normal-sized women are supremely wholesome and some men pine after them',
  'Toys come alive and have a blast doing so',
  'Three men walk into the Zone, three men walk out of the Zone'],
 'uris': None,
 'data': None,
 'metadatas': [{'genre': 'science fiction', 'rating': 7.7, 'year': 1993},
  {'director': 'Christopher Nolan', 'rating': 8.2, 'year': 2010},
  {'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}

### Creating our self-querying retriever
Now we can instantiate our retriever. 
To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.

In [7]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,
)

In [8]:
# This example only specifies a filter
movies = retriever.invoke("I want to watch a movie rated higher than 8.5")
[pprint.pp(m) for m in movies]

Document(id='0034b17e-c7d2-4c92-a5c1-be6726e6a2db', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone')
Document(id='4ef9a41a-f07d-4c07-bb1a-e6c29f890554', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')


[None, None]

In [9]:
# This example specifies a query and a filter
movies = retriever.invoke("Has Greta Gerwig directed any movies about women")
[pprint.pp(m) for m in movies]

Document(id='eb471aa5-9cda-42e3-9f99-00507b0d43f4', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019}, page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them')


[None]

In [20]:
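# This example asks for a composite filter (rating and genre)
# Note: neither result below has genre 'science fiction', so the genre
# condition does not appear to have been applied in this run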
movies = retriever.invoke("What's a highly rated (above 8.5) science fiction film?")
[pprint.pp(m) for m in movies]

Document(id='4ef9a41a-f07d-4c07-bb1a-e6c29f890554', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')
Document(id='0034b17e-c7d2-4c92-a5c1-be6726e6a2db', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone')


[None, None]

In [15]:
# This example specifies a query and composite filter
movies = retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)
[pprint.pp(m) for m in movies]

Document(id='87f4ddce-ee48-45f2-a3b1-99be490db69b', metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')


[None]

### Filter k
We can also use the self-query retriever to specify `k`, the number of documents to fetch.

We can do this by passing `enable_limit=True` to the constructor.

In [16]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
)

# This example specifies a relevant query and a limit of two results
[pprint.pp(m) for m in retriever.invoke("What are two movies about dinosaurs")]


Document(id='0f841284-069e-42bc-9230-54ef6f68c941', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose')
Document(id='87f4ddce-ee48-45f2-a3b1-99be490db69b', metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')


[None, None]

### What's Happening Under the Hood?  
#### Constructing from Scratch with LCEL

To better understand the internals and gain finer control, we can rebuild the retriever from scratch.

1. **Create a Query-Construction Chain**  
   This chain takes a user query and generates a `StructuredQuery` object, capturing any user-specified filters.

2. **Use Helper Functions**  
   LangChain provides helper functions for:
   - **Creating the prompt**
   - **Parsing the output**

   These include various tunable parameters, which we'll skip here for simplicity.


In [17]:
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)

prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = prompt | llm | output_parser

In [18]:
print(prompt.format(query="dummy question"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not

And what our full chain produces:

In [21]:
query_constructor.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)

StructuredQuery(query='taxi drivers', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2000)]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)
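
To see what such a nested filter means, the sketch below evaluates an equivalent tree against a metadata dictionary. The tuple representation and the `matches` function are simplifications made up for illustration; they are not LangChain's `Operation` and `Comparison` classes.

```python
import operator
from typing import Any, Dict

COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(node: tuple, metadata: Dict[str, Any]) -> bool:
    """Recursively evaluate a filter tree against one document's metadata."""
    kind = node[0]
    if kind == "and":
        return all(matches(child, metadata) for child in node[1])
    if kind == "or":
        return any(matches(child, metadata) for child in node[1])
    if kind == "not":
        return not matches(node[1][0], metadata)
    # Otherwise the node is a comparison: (comparator, attribute, value)
    _, attribute, value = node
    if attribute not in metadata:
        return False  # a missing attribute cannot satisfy a comparison
    return COMPARATORS[kind](metadata[attribute], value)


# Same shape as the StructuredQuery filter shown above
besson_filter = ("and", [
    ("eq", "genre", "science fiction"),
    ("and", [("gte", "year", 1990), ("lt", "year", 2000)]),
    ("eq", "director", "Luc Besson"),
])

# Hypothetical metadata that satisfies every condition
print(matches(besson_filter, {"genre": "science fiction", "year": 1997,
                              "director": "Luc Besson"}))  # True
# Metadata of the dinosaur movie indexed earlier: it has no director field
print(matches(besson_filter, {"year": 1993, "rating": 7.7,
                              "genre": "science fiction"}))  # False
```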

### Key Components of a Self-Query Retriever

1. **Query Constructor**  
   The core of the self-query retriever. A good retrieval system depends on a well-tuned query constructor.  
   - Adjust the **prompt**, **examples**, and **attribute descriptions** for better results.  
   - For a walkthrough using hotel data, see the **Query Constructor Cookbook**.

2. **Structured Query Translator**  
   Translates the `StructuredQuery` object into a metadata filter compatible with your vector store.  
   - LangChain includes built-in translators.  
   - See the [**Integrations**](https://python.langchain.com/docs/integrations/retrievers/self_query/) section for available options.


In [19]:
from langchain_community.query_constructors.chroma import ChromaTranslator

retriever = SelfQueryRetriever(
    query_constructor=query_constructor,
    vectorstore=vectorstore,
    structured_query_translator=ChromaTranslator(),
)
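
As a rough picture of what a structured query translator does, the sketch below converts a small filter tree into an operator-style dictionary with `$and` and `$gt` keys, the style Chroma uses for its `where` filters. The tuple-based tree and the `to_where` function are hypothetical simplifications; this is not how `ChromaTranslator` is implemented.

```python
from typing import Any, Dict

LOGICAL = {"and", "or"}
COMPARISONS = {"eq", "ne", "gt", "gte", "lt", "lte"}


def to_where(node: tuple) -> Dict[str, Any]:
    """Translate a (kind, ...) filter tree into an operator-style dict."""
    kind = node[0]
    if kind in LOGICAL:
        # Logical nodes wrap the translated children in a list
        return {f"${kind}": [to_where(child) for child in node[1]]}
    if kind in COMPARISONS:
        _, attribute, value = node
        return {attribute: {f"${kind}": value}}
    raise ValueError(f"Unsupported operator: {kind}")


tree = ("and", [("gt", "rating", 8.5), ("eq", "genre", "science fiction")])
where = to_where(tree)
print(where)
# {'$and': [{'rating': {'$gt': 8.5}}, {'genre': {'$eq': 'science fiction'}}]}
```

Keeping the translation separate from query construction is what lets the same `StructuredQuery` target different vector stores.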

### Further Reading

- [Text-to-SQL](https://python.langchain.com/docs/tutorials/sql_qa/) tutorials
- [Metadata Filter](https://python.langchain.com/docs/tutorials/rag/#query-analysis) tutorials
- [RAG from Scratch – Query Construction Video](https://youtu.be/kl6NwWYxvbM?feature=shared)


# 🔎 Information Retrieval with LangChain

LangChain supports multiple types of retrieval systems depending on your data format and use case.

---

## 📘 Lexical Search Indexes

Lexical retrieval matches query words with document words using frequency-based algorithms like **BM25** and **TF-IDF**.  
This is often implemented via **inverted indexes**, which map each word to the documents where it appears.

✅ Best for: Exact term matching on unstructured text.
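
A minimal sketch of an inverted index, using made-up three-document data. It only does boolean matching (every query word must appear) and leaves out the BM25 or TF-IDF scoring a real lexical retriever would add.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toy corpus: document id -> text
documents = {
    0: "scientists bring back dinosaurs",
    1: "toys come alive",
    2: "dinosaurs and toys",
}


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return sorted ids of documents that contain every query word."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    if not word_sets:
        return []
    return sorted(set.intersection(*word_sets))


index = build_inverted_index(documents)
print(lookup(index, "dinosaurs"))       # [0, 2]
print(lookup(index, "dinosaurs toys"))  # [2]
```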

**Further reading:**
- [BM25 Retriever Integration](https://python.langchain.com/docs/integrations/retrievers/bm25/)
- [Elasticsearch Retriever Integration](https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/)

---

## 🧠 Vector Indexes

Instead of word matching, **vector indexes** use embedding models to represent documents in high-dimensional vector space.  
This enables semantic similarity search using operations like **cosine similarity**.

✅ Best for: Semantic search on unstructured data.
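
A small sketch of similarity search with cosine similarity. The three-dimensional vectors are invented for illustration; real embedding models produce hundreds or thousands of dimensions, and vector stores use approximate indexes rather than the brute-force sort shown here.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three documents
embeddings: Dict[str, List[float]] = {
    "dinosaur movie": [0.9, 0.1, 0.0],
    "toy movie": [0.1, 0.9, 0.0],
    "dream movie": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the user's query
query_vector = [0.8, 0.2, 0.1]

# Brute-force ranking: most similar document first
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query_vector, embeddings[name]),
    reverse=True,
)
print(ranked)  # ['dinosaur movie', 'toy movie', 'dream movie']
```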

**Further reading:**
- [Vectorstore Guide](https://python.langchain.com/docs/how_to/vectorstore_retriever/)
- [Vectorstore Integrations](https://python.langchain.com/docs/integrations/vectorstores/)
- [Cameron Wolfe's Blog on Vector Search](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2)

---

## 🗃️ Relational Databases

Structured data is stored in **tables** with defined schemas and queried with **SQL**.  
Relational databases are well suited to enforcing data integrity and handling complex relationships.

✅ Best for: Querying structured data via SQL.

**Further reading:**
- [SQL Tutorial](https://python.langchain.com/docs/tutorials/sql_qa/)
- [SQL Toolkit](https://python.langchain.com/docs/integrations/tools/sql_database/)

---

## 🕸️ Graph Databases

Graph databases model highly interconnected data using **nodes**, **edges**, and **properties**.  
Useful for domains like social networks, fraud detection, and supply chains.

✅ Best for: Querying complex relationships using flexible structures.
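
To illustrate nodes, edges, and properties, here is a tiny in-memory graph with a breadth-first traversal. The people and the `FOLLOWS` relation are invented for the example; a real graph database would answer this kind of question with a query language such as Cypher.

```python
from collections import deque
from typing import List

# Nodes carry properties; edges are (source, relation, target) triples
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
    "carol": {"label": "Person"},
    "dave": {"label": "Person"},
}
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "dave"),
]


def neighbors(node: str, relation: str) -> List[str]:
    """Targets of outgoing edges of the given relation type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


def reachable_within(start: str, relation: str, max_hops: int) -> List[str]:
    """Breadth-first search: nodes reachable in at most max_hops edges."""
    seen = {start}
    found = []
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors(node, relation):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, depth + 1))
    return found


print(reachable_within("alice", "FOLLOWS", 2))  # ['bob', 'carol']
```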

---

## 🧩 LangChain Retriever

LangChain provides a **unified retriever interface** for all of the above systems.

**Input:** Natural language query (`string`)  
**Output:** List of `Document` objects

You can use query analysis (e.g. **text-to-SQL**) to support natural language input even for databases that require structured queries.

In [24]:
docs = retriever.invoke("fantasy movies")
pprint.pp(docs)

[Document(id='4ef9a41a-f07d-4c07-bb1a-e6c29f890554', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),
 Document(id='bfd4fe89-05e0-483e-ac68-91531987bced', metadata={'director': 'Christopher Nolan', 'rating': 8.2, 'year': 2010}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...'),
 Document(id='0f841284-069e-42bc-9230-54ef6f68c941', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='87f4ddce-ee48-45f2-a3b1-99be490db69b', metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')]
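
The string-in, documents-out contract described above can be sketched with a small protocol. `SimpleDocument`, `SimpleRetriever`, and `KeywordRetriever` are hypothetical names used only for illustration; they mimic the shape of the interface and are not LangChain classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol


@dataclass
class SimpleDocument:
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SimpleRetriever(Protocol):
    """Anything with invoke(query) -> list of documents counts as a retriever."""

    def invoke(self, query: str) -> List[SimpleDocument]: ...


class KeywordRetriever:
    """Toy backend: returns documents sharing at least one word with the query."""

    def __init__(self, documents: List[SimpleDocument]) -> None:
        self.documents = documents

    def invoke(self, query: str) -> List[SimpleDocument]:
        words = set(query.lower().split())
        return [
            doc for doc in self.documents
            if words & set(doc.page_content.lower().split())
        ]


def answer_sources(retriever: SimpleRetriever, query: str) -> List[str]:
    # Calling code depends only on the interface, not on the backend
    return [doc.page_content for doc in retriever.invoke(query)]


toy_retriever = KeywordRetriever([
    SimpleDocument("Toys come alive", {"year": 1995}),
    SimpleDocument("Scientists bring back dinosaurs", {"year": 1993}),
])
print(answer_sources(toy_retriever, "toys"))  # ['Toys come alive']
```

Because callers rely only on `invoke`, a lexical, vector, SQL-backed, or graph-backed retriever could be swapped in without changing `answer_sources`.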
