# Overview
Retrieval systems are fundamental to many AI applications, efficiently identifying relevant information from large datasets. These systems accommodate various data formats:
- Unstructured text (e.g., documents) is often stored in vector stores or lexical search indexes.
- Structured data is typically housed in relational or graph databases with defined schemas.

# Key concepts
![image.png](attachment:8c3be012-0130-49a3-b6d1-9046d3d64da2.png)

(1) Query analysis: A process where models transform or construct search queries to optimize retrieval.

(2) Information retrieval: Search queries are used to fetch information from various retrieval systems.

# Query analysis
While users typically prefer to interact with retrieval systems using natural language, these systems may require specific query syntax or benefit from certain keywords. Query analysis serves as a bridge between raw user input and optimized search queries. Some common applications of query analysis include:

1. <b>Query Re-writing</b>: Queries can be re-written or expanded to improve semantic or lexical searches.
2. <b>Query Construction</b>: Search indexes may require structured queries (e.g., SQL for databases).


Query analysis employs models to transform or construct optimized search queries from raw user input.

## Query re-writing
Retrieval systems should ideally handle a wide spectrum of user inputs, from simple and poorly worded queries to complex, multi-faceted questions. To achieve this versatility, a popular approach is to use models to transform raw user queries into more effective search queries. This transformation can range from simple keyword extraction to sophisticated query expansion and reformulation. Here are some key benefits of using models for query analysis in unstructured data retrieval:

1. <b>Query Clarification</b>: Models can rephrase ambiguous or poorly worded queries for clarity.
2. <b>Semantic Understanding</b>: They can capture the intent behind a query, going beyond literal keyword matching.
3. <b>Query Expansion</b>: Models can generate related terms or concepts to broaden the search scope.
4. <b>Complex Query Handling</b>: They can break down multi-part questions into simpler sub-queries.


Various techniques have been developed to leverage models for query re-writing, including:

| Name          | When to use                                                  | Description |
|--------------|--------------------------------------------------------------|-------------|
| Multi-query  | When you want to ensure high recall in retrieval by providing multiple phrasings of a question. | Rewrite the user question with multiple phrasings, retrieve documents for each rewritten question, return the unique documents for all queries. |
| Decomposition | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems/questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer). |
| Step-back    | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. |
| HyDE         | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches. |

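As an illustration of the multi-query row above, here is a minimal sketch using only the standard library. The corpus, the `keyword_retrieve` function, and the hard-coded rewrites are all assumptions made for the example; in a real system a model would generate the rewrites and a vector store or search index would do the retrieval.

```python
from typing import Callable, Dict, List

# Toy corpus standing in for a real index (hypothetical data).
CORPUS: Dict[str, str] = {
    "d1": "Agents use planning to break tasks into steps",
    "d2": "Memory lets an agent recall earlier context",
    "d3": "Tool use lets an agent call external APIs",
}


def keyword_retrieve(query: str) -> List[str]:
    """Return ids of documents sharing at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in CORPUS.items()
            if words & set(text.lower().split())]


def multi_query_retrieve(rewrites: List[str],
                         retrieve: Callable[[str], List[str]]) -> List[str]:
    """Retrieve for every rewrite and return the unique ids in first-seen order."""
    seen: Dict[str, None] = {}
    for rewrite in rewrites:
        for doc_id in retrieve(rewrite):
            seen.setdefault(doc_id, None)
    return list(seen)


# In practice a model would produce these rewrites; they are hard-coded here.
rewrites = [
    "how do agents do planning",
    "what is agent memory",
    "planning steps for agents",
]
unique_ids = multi_query_retrieve(rewrites, keyword_retrieve)
print(unique_ids)
```

The first and third rewrites both hit `d1`, but it appears only once in the result, which is the de-duplication step the table describes.
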
### Query decomposition example 

Query decomposition can be accomplished with prompting and a structured output schema that enforces a list of sub-questions. The sub-questions can then be run sequentially or in parallel against a downstream retrieval system.


In [1]:
from typing import List

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Define a pydantic model to enforce the output structure
class Questions(BaseModel):
    questions: List[str] = Field(
        description="A list of sub-questions related to the input query."
    )

# Create an instance of the model and enforce the output structure
model = ChatOpenAI(model="gpt-4o", temperature=0) 
structured_model = model.with_structured_output(Questions)

# Define the system prompt
system = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n"""

# Pass the question to the model
question = """What are the main components of an LLM-powered autonomous agent system?"""
questions = structured_model.invoke([SystemMessage(content=system), HumanMessage(content=question)])

In [2]:
questions

Questions(questions=['What is an LLM and how does it function within an autonomous agent system?', 'What are the key components of an autonomous agent system?', 'How does an LLM integrate with other components of an autonomous agent system?', 'What role does natural language processing play in an LLM-powered autonomous agent system?', 'How do LLMs handle decision-making processes in autonomous agents?', 'What are the input and output mechanisms for an LLM in an autonomous agent system?', 'How do LLMs interact with external data sources or sensors in an autonomous agent system?', 'What are the challenges of using LLMs in autonomous agent systems?', 'How is the performance of an LLM-powered autonomous agent system evaluated?', 'What are some examples of LLM-powered autonomous agent systems in use today?'])
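
Once the sub-questions exist, they still have to be run against a retriever. The sketch below shows one possible shape for the parallel and sequential strategies, using only the standard library. `retrieve` and `stub_answer` are hypothetical placeholders for a real retriever and a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def retrieve(sub_question: str) -> List[str]:
    """Hypothetical stand-in for a real retriever call."""
    return [f"doc for: {sub_question}"]


def run_parallel(sub_questions: List[str]) -> List[List[str]]:
    """Retrieve for all sub-questions concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(retrieve, sub_questions))


def run_sequential(sub_questions: List[str],
                   answer: Callable[[str, List[str], str], str]) -> List[str]:
    """Answer sub-questions in order, passing earlier Q/A pairs as context."""
    history = ""
    answers: List[str] = []
    for q in sub_questions:
        a = answer(q, retrieve(q), history)
        answers.append(a)
        history += f"Q: {q}\nA: {a}\n"
    return answers


def stub_answer(question: str, docs: List[str], history: str) -> str:
    """Placeholder for a model call; reports how much context it was given."""
    return f"answer using {len(docs)} doc(s) and {history.count('Q:')} prior answer(s)"


sub_questions = [
    "What is an LLM and how does it function within an autonomous agent system?",
    "What are the key components of an autonomous agent system?",
]
parallel_docs = run_parallel(sub_questions)
sequential_answers = run_sequential(sub_questions, stub_answer)
print(parallel_docs)
print(sequential_answers)
```

Parallel execution suits independent sub-questions, while the sequential variant fits cases where a later sub-question depends on an earlier answer.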

### Hypothetical Document Embeddings (HyDE) example 
![image.png](attachment:da54e2cc-0350-46d5-b663-b59a84b13d56.png)

In [7]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma


# Load the document, split it into chunks
raw_documents = TextLoader('../test.txt', encoding='utf-8').load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(raw_documents)

# Create embeddings for the documents
embeddings_model = OpenAIEmbeddings()

db = Chroma.from_documents(
    documents, embeddings_model)

# Create a retriever that returns the 5 most similar chunks
retriever = db.as_retriever(search_kwargs={"k": 5})

prompt_hyde = ChatPromptTemplate.from_template(
    """Please write a passage to answer the question.\n Question: {question} \n Passage:""")

generate_doc = (prompt_hyde | ChatOpenAI(temperature=0) | StrOutputParser())

In [8]:
# Next, pipe the hypothetical document produced by generate_doc into the retriever,
# which embeds it and searches the vector store for similar documents.
retrieval_chain = generate_doc | retriever

query = "Who are some lesser known philosophers in the ancient greek history of philosophy?"

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context: {context} Question: {question} """
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


In [9]:
@chain
def qa(question):
    # fetch relevant documents from the hyde retrieval chain defined earlier
    docs = retrieval_chain.invoke(question)
    # format prompt
    formatted = prompt.invoke({"context": docs, "question": question})
    # generate answer
    answer = llm.invoke(formatted)
    return answer


print("Running hyde\n")
result = qa.invoke(query)
print("\n\n")
print(result.content)

Running hyde




Some lesser known philosophers in the ancient Greek history of philosophy include Anaximander, Heraclitus, Parmenides, Anaximenes, and Anaxagoras.
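
To make the premise behind HyDE concrete, here is a toy sketch that needs no API key. It assumes a bag-of-words "embedding" and a stubbed `generate_hypothetical_doc` in place of the model call. Both are simplifications for illustration, and the corpus is made up. The point is only that a hypothetical answer can share more vocabulary with the target document than the raw question does.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (an assumption; real systems use dense vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query_text: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Return the ids of the k documents most similar to query_text."""
    q = embed(query_text)
    ranked = sorted(corpus, key=lambda doc_id: cosine(q, embed(corpus[doc_id])), reverse=True)
    return ranked[:k]


corpus = {
    "thales": "thales of miletus held that water is the origin of all things",
    "zeno": "zeno of elea devised paradoxes about motion and infinity",
}


def generate_hypothetical_doc(question: str) -> str:
    """Stub for the LLM call in HyDE; a real system would prompt a model here."""
    return "a philosopher of elea devised paradoxes about motion"


question = "who argued that movement is impossible"
raw_hit = top_k(question, corpus)
hyde_hit = top_k(generate_hypothetical_doc(question), corpus)
print(raw_hit, hyde_hit)
```

In this contrived corpus the raw question matches the wrong document on filler words alone, while the hypothetical passage lands on the intended one. Real embeddings are far more robust than word counts, so the gap is usually less dramatic.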


## Query construction
Query analysis can also focus on translating natural language queries into specialized query languages or filters. This translation is crucial for effectively interacting with various types of databases that house structured or semi-structured data.

1. <b>Structured Data examples</b>: For relational and graph databases, Domain-Specific Languages (DSLs) are used to query data.

    - Text-to-SQL: Converts natural language to SQL for relational databases.
    - Text-to-Cypher: Converts natural language to Cypher for graph databases.

2. <b>Semi-structured Data examples</b>: For vector stores, queries can combine semantic search with metadata filtering.

    - Natural Language to Metadata Filters: Converts user queries into appropriate metadata filters.

These approaches leverage models to bridge the gap between user intent and the specific query requirements of different data storage systems. Here are some popular techniques:

| Name           | When to use                                                                 | Description |
|---------------|---------------------------------------------------------------------------|-------------|
| Self Query    | If users are asking questions that are better answered by filtering documents on metadata than by similarity with the text. | Uses an LLM to transform user input into two things: (1) a string to look up semantically and (2) a metadata filter to go along with it. This is useful because questions are often about the metadata of documents rather than their content. |
| Text to SQL   | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
| Text-to-Cypher | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |


![image.png](attachment:2ee07982-ee15-4e27-bf9a-f31989f1aa26.png)

### Text-to-Metadata Filter example
Here is how to use the `SelfQueryRetriever` to convert natural language queries into metadata filters.

In [3]:
# !pip install langchain-chroma
# !pip install lark

In [4]:
from langchain_chroma import Chroma
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document


docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]


# Create embeddings for the documents
embedding_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding_model)

# Define the fields for the query
fields = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]


description = "Brief summary of a movie"
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, description, fields)

# This example only specifies a filter
print(retriever.invoke("I want to watch a movie rated higher than 8.5"))

# This example specifies multiple filters
print(retriever.invoke(
    "What's a highly rated (above 8.5) science fiction film?"))

[Document(id='bfa9ced4-e47e-4bba-a05b-a18e271468bd', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'), Document(id='06ad2041-4294-4315-ada6-cc2fbaa7f4a9', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]
[Document(id='06ad2041-4294-4315-ada6-cc2fbaa7f4a9', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'), Document(id='bfa9ced4-e47e-4bba-a05b-a18e271468bd', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone')]
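
Conceptually, a self-query step turns a question into a search string plus a metadata filter. The sketch below illustrates only the filtering half, using plain Python. `StructuredQuery`, `matches`, and the comparator names are hypothetical and are not the internal representation used by `SelfQueryRetriever`. The metadata is a subset of the movie list above.

```python
import operator
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Comparator names are illustrative, not the retriever's internal vocabulary.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}

Filter = Tuple[str, str, Any]  # (attribute, comparator, value)


@dataclass
class StructuredQuery:
    """Hypothetical container for what a model would extract from a question."""
    query: str                                   # text to search semantically
    filters: List[Filter] = field(default_factory=list)


def matches(metadata: Dict[str, Any], filters: List[Filter]) -> bool:
    """True when every filter holds; a missing attribute never matches."""
    for attribute, comparator, value in filters:
        if attribute not in metadata:
            return False
        if not COMPARATORS[comparator](metadata[attribute], value):
            return False
    return True


movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]

# "I want to watch a movie rated higher than 8.5" -> empty search text, one filter
structured = StructuredQuery(query="", filters=[("rating", "gt", 8.5)])
hits = [m for m in movies if matches(m, structured.filters)]
print([m["year"] for m in hits])
```

Treating a missing attribute as a non-match is a design choice made for this sketch; actual vector stores differ in how they handle documents that lack a filtered field.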


### Text-to-SQL example

![image.png](attachment:d3613da2-1717-46fb-8a80-cebbc818de86.png)

In [None]:
"""
The example below uses a SQLite connection to the Chinook database, a sample database that represents a digital media store.
Create Chinook.db in the same directory as this notebook.
If sqlite3 is not installed, see these instructions for Windows: https://dev.to/dendihandian/installing-sqlite3-in-windows-44eb
You can then download and build the database from the command line:

```bash
curl -s https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql | sqlite3 Chinook.db
mv chink
```

Afterwards, place `Chinook.db` in the same directory where this code is running.

"""

In [None]:
from langchain_community.tools import QuerySQLDatabaseTool
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI

# Replace this URI with the connection details of your own database
db = SQLDatabase.from_uri("sqlite:///Chinook.db")
print(db.get_usable_table_names())
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

In [None]:
# convert question to sql query
write_query = create_sql_query_chain(llm, db)

# Execute SQL query
execute_query = QuerySQLDatabaseTool(db=db)

# Combine both steps: generate the SQL, then execute it
combined_chain = write_query | execute_query

In [None]:
# run the chain
result = combined_chain.invoke({"question": "How many employees are there?"})

print(result)
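
The same write-then-execute flow can be tried without an API key or the Chinook file. The sketch below uses the standard-library `sqlite3` module with a small in-memory table. `write_query_stub` and `is_read_only` are hypothetical names, the rows are made up, and the guard is deliberately crude: it only illustrates checking model-written SQL before running it and is not a security boundary.

```python
import sqlite3


def write_query_stub(question: str) -> str:
    """Stand-in for the model call; returns a fixed SQL string for this one question."""
    return "SELECT COUNT(*) FROM Employee"


def is_read_only(sql: str) -> bool:
    """Very rough guard: accept a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.upper().startswith("SELECT") and ";" not in stripped


# Small in-memory table with made-up rows (not the Chinook data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany("INSERT INTO Employee (LastName) VALUES (?)",
                 [("Adams",), ("Edwards",), ("Peacock",)])

sql = write_query_stub("How many employees are there?")
if not is_read_only(sql):
    raise ValueError(f"Refusing to run non-SELECT statement: {sql}")
employee_count = conn.execute(sql).fetchone()[0]
print(employee_count)
conn.close()
```

For anything beyond a demo, connecting with a database user that has read-only permissions is a more dependable safeguard than inspecting the SQL string.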