# LlamaIndex: QA over Structured Data (Text-to-Pandas Guide)

- Question-Answering (RAG)
  - https://docs.llamaindex.ai/en/stable/use_cases/q_and_a/
- QA over Structured Data (Text-to-Pandas Guide)<br>
  - **[Pandas Query Engine](https://docs.llamaindex.ai/en/stable/examples/query_engine/pandas_query_engine/)**

## SETUP

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables (for API key)
load_dotenv()

# Set up OpenAI API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable or add it to a .env file")

# Define the model to use
MODEL_GPT = "gpt-4o-mini"

## SETUP (LlamaIndex)
### Pandas Query Engine

In [2]:
# !pip install llama-index llama-index-experimental

In [3]:
import logging
import sys
from IPython.display import Markdown, display

import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

In [4]:
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Let's Start with a Toy DataFrame

In [12]:
# Test on some sample data
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)

In [13]:
# query_engine = PandasQueryEngine(df=df, verbose=True)

In [14]:
# from llama_index.llms.ollama import Ollama
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# from llama_index.core import Settings

# Settings.llm = Ollama(model="llama2", request_timeout=120.0)
# Settings.embed_model = HuggingFaceEmbedding(
#     model_name="BAAI/bge-small-en-v1.5"
# )

from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0.1, model="gpt-4o-mini")
query_engine = PandasQueryEngine(df=df, verbose=True, llm=llm)

In [15]:
response = query_engine.query(
    "What is the city with the highest population?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['population'].idxmax(), 'city']
```
> Pandas Output: Tokyo


In [16]:
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])

df.loc[df['population'].idxmax(), 'city']
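Because the generated expression is available in `response.metadata`, it can be replayed by hand to confirm the result without another LLM call. The following is a minimal sketch assuming only pandas and the toy DataFrame above; the `expression` string is copied from the output, and the plain `eval` call is an illustration, not a claim about how the library executes code.

```python
import pandas as pd

# Same toy data as in the cell above.
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2930000, 13960000, 3645000],
    }
)

# Expression copied from response.metadata["pandas_instruction_str"].
expression = "df.loc[df['population'].idxmax(), 'city']"

# Evaluate it with `df` as the only name supplied explicitly.
result = eval(expression, {"df": df})
print(result)
```

Replaying the expression this way is a quick check that the synthesized answer matches what pandas actually returns.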


In [17]:
# Pass the same llm as before so both calls use gpt-4o-mini rather than the default model
query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True, llm=llm)
response = query_engine.query(
    "What is the city with the highest population? Give both the city and population",
)
print(str(response))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['population'].idxmax()]
```
> Pandas Output: city             Tokyo
population    13960000
Name: 1, dtype: object
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
The city with the highest population is Tokyo, with a population of 13,960,000.


## Analyzing the Titanic Dataset
The Titanic dataset is one of the most popular tabular datasets in introductory machine learning<br>
- https://www.kaggle.com/c/titanic

### Download Data

In [18]:
# !wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

In [19]:
# df = pd.read_csv("./titanic_train.csv")
df = pd.read_csv("./data/titanic_train.csv")

In [20]:
# query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True, llm=llm)

In [21]:
response = query_engine.query(
    "What is the correlation between survival and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df['survived'].corr(df['age'])
```
> Pandas Output: -0.07722109457217755
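The `age` column contains missing values, and `Series.corr` computes the Pearson correlation over only the rows where both values are present. The sketch below illustrates that behavior on a small made-up frame (the numbers are hypothetical, not taken from the Titanic data) by comparing the pandas result with a manual calculation over the complete rows.

```python
import math

import pandas as pd

# Hypothetical toy data; one age is missing on purpose.
toy = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0, 1],
        "age": [22.0, 38.0, None, 35.0, 4.0],
    }
)

# pandas drops the incomplete pair before computing Pearson's r.
r_pandas = toy["survived"].corr(toy["age"])

# Manual Pearson's r over rows with a known age.
complete = toy.dropna(subset=["age"])
dx = complete["survived"] - complete["survived"].mean()
dy = complete["age"] - complete["age"].mean()
r_manual = (dx * dy).sum() / math.sqrt((dx**2).sum() * (dy**2).sum())

print(r_pandas, r_manual)
```

The two values agree, so a correlation reported by the engine describes only the passengers whose age is recorded.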


In [22]:
display(Markdown(f"<b>{response}</b>"))

<b>-0.07722109457217755</b>

In [23]:
# get pandas python instructions
print(response.metadata["pandas_instruction_str"])

df['survived'].corr(df['age'])


## Additional Steps

### Analyzing / Modifying prompts

In [24]:
from llama_index.core import PromptTemplate

In [25]:
# query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True, llm=llm)
prompts = query_engine.get_prompts()
print(prompts["pandas_prompt"].template)

You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}

Follow these instructions:
{instruction_str}
Query: {query_str}

Expression:


In [26]:
print(prompts["response_synthesis_prompt"].template)

Given an input question, synthesize a response from the query results.
Query: {query_str}

Pandas Instructions (optional):
{pandas_instructions}

Pandas Output: {pandas_output}

Response: 


In [27]:
new_prompt = PromptTemplate(
    """\
You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}

Follow these instructions:
{instruction_str}
Query: {query_str}

Expression: """
)

query_engine.update_prompts({"pandas_prompt": new_prompt})

In [28]:
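# Default instruction string, shown for reference. It is not applied to the
# existing engine here; customize it by passing `instruction_str=...` when
# constructing PandasQueryEngine.
instruction_str = """\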
1. Convert the query to executable Python code using Pandas.
2. The final line of code should be a Python expression that can be called with the `eval()` function.
3. The code should represent a solution to the query.
4. PRINT ONLY THE EXPRESSION.
5. Do not quote the expression.
"""
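Instruction 2 requires the final line to be an expression that `eval()` can handle. One way to honor that contract for multi-line code is to execute every statement except the last and then evaluate the last one. The helper below is a sketch of that idea using only the standard library; `eval_last_expression` and the `rows` data are hypothetical names for illustration, and this is not presented as the library's implementation.

```python
import ast


def eval_last_expression(code: str, namespace: dict):
    """Execute all statements but the last, then evaluate the final expression."""
    tree = ast.parse(code)
    if not tree.body or not isinstance(tree.body[-1], ast.Expr):
        raise ValueError("The final line must be a Python expression")
    *statements, last = tree.body
    if statements:
        module = ast.Module(body=statements, type_ignores=[])
        exec(compile(module, "<generated>", "exec"), namespace)
    expression = ast.Expression(body=last.value)
    return eval(compile(expression, "<generated>", "eval"), namespace)


# Plain-Python stand-in for the toy DataFrame.
rows = [
    {"city": "Toronto", "population": 2930000},
    {"city": "Tokyo", "population": 13960000},
    {"city": "Berlin", "population": 3645000},
]

# Expose only the names the generated code is allowed to use.
namespace = {"__builtins__": {"max": max}, "rows": rows}
code = "top = max(rows, key=lambda r: r['population'])\ntop['city']"
result = eval_last_expression(code, namespace)
print(result)
```

Restricting the namespace reduces what generated code can reach, but it is not a sandbox; code produced by a model should still be treated as untrusted.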

In [29]:
response = query_engine.query(
    "Who is the oldest survival by age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['survived'] == 1].loc[df['age'] == df['age'].max(), 'name'].values[0]
```
> Pandas Output: Barkworth, Mr. Algernon Henry Wilson
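Note that the generated expression compares against `df['age'].max()`, the maximum over all passengers. It returns the right name here only because the oldest passenger happens to be a survivor. The sketch below uses a hypothetical three-row frame (names and ages are made up) to show the difference between that approach and taking the maximum within the survivors.

```python
import pandas as pd

# Hypothetical data: the oldest passenger did not survive.
toy = pd.DataFrame(
    {
        "name": ["Passenger A", "Passenger B", "Passenger C"],
        "survived": [0, 1, 1],
        "age": [71.0, 63.0, 30.0],
    }
)

# Same shape as the generated expression: filter survivors, then match the global maximum.
naive = toy.loc[toy["survived"] == 1].loc[toy["age"] == toy["age"].max(), "name"]

# More robust: compute the maximum inside the survivor subset.
survivors = toy[toy["survived"] == 1]
oldest = survivors.loc[survivors["age"].idxmax(), ["name", "age"]]

print(naive.empty)
print(oldest["name"], oldest["age"])
```

On this toy frame the global-maximum version selects nothing, which is why it is worth inspecting the generated instructions rather than trusting the output alone.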


In [32]:
response = query_engine.query(
    "Who is the oldest survival by age. Give both the survival name and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['survived'] == 1].loc[df['age'] == df.loc[df['survived'] == 1, 'age'].max(), ['name', 'age']]
```
> Pandas Output:                                      name   age
630  Barkworth, Mr. Algernon Henry Wilson  80.0


In [33]:
response = query_engine.query(
    "Who is the youngest survival by age. Give both the survival name and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['survived'] == 1].nsmallest(1, 'age')[['name', 'age']]
```
> Pandas Output:                                 name   age
803  Thomas, Master. Assad Alexander  0.42


# Implementing Query Engine using Query Pipeline Syntax
**Query Pipeline over Pandas DataFrames**
- https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline_pandas/

## SETUP

In [34]:
import os
from dotenv import load_dotenv

# Load environment variables (for API key)
load_dotenv()

# Set up OpenAI API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable or add it to a .env file")

# Define the model to use
MODEL_GPT = "gpt-4o-mini"

## SETUP (LlamaIndex)

In [35]:
# %pip install llama-index-llms-openai llama-index-experimental

In [36]:
from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

## Download Data

In [37]:
# !wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

In [38]:
import pandas as pd

# df = pd.read_csv("./titanic_train.csv")
df = pd.read_csv("./data/titanic_train.csv")

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   name      891 non-null    object 
 3   sex       891 non-null    object 
 4   age       714 non-null    float64
 5   sibsp     891 non-null    int64  
 6   parch     891 non-null    int64  
 7   ticket    891 non-null    object 
 8   fare      891 non-null    float64
 9   cabin     204 non-null    object 
 10  embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
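The non-null counts above imply how many values are missing in each column. The short sketch below does that arithmetic from the numbers printed by `df.info()`; on the real DataFrame, `df.isna().sum()` would be the usual way to get the same counts.

```python
# Counts copied from the df.info() output above.
total_rows = 891
non_null = {"age": 714, "cabin": 204, "embarked": 889}

# Missing values per column = total rows minus non-null entries.
missing = {column: total_rows - count for column, count in non_null.items()}

for column, count in missing.items():
    print(f"{column}: {count} missing ({count / total_rows:.1%})")
```

Roughly a fifth of the ages are missing, which matters for any age-based query such as the correlation with survival.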


In [40]:
df.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [41]:
df

| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


## Define Modules

In [42]:
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
# llm = OpenAI(model="gpt-3.5-turbo")
llm = OpenAI(model="gpt-4o-mini")
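`partial_format` binds `instruction_str` and `df_str` up front so that only `query_str` remains to be supplied at query time. The standard-library sketch below mimics that idea with plain strings; the `partial_format` function and `_KeepMissing` class here are hypothetical stand-ins for illustration, not the LlamaIndex implementation, and the template is shortened.

```python
class _KeepMissing(dict):
    """Mapping that leaves unknown placeholders untouched."""

    def __missing__(self, key):
        return "{" + key + "}"


def partial_format(template: str, **known) -> str:
    """Fill the known placeholders and keep the rest for a later format call."""
    return template.format_map(_KeepMissing(known))


template = (
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)

# Bind the values that never change between queries.
bound = partial_format(
    template,
    df_str="<output of df.head()>",
    instruction_str="PRINT ONLY THE EXPRESSION.",
)

# Later, supply the per-query value.
final = bound.format(query_str="What is the correlation between survival and age?")
print(final)
```

This string-based version breaks if a bound value itself contains braces, so treat it only as an illustration of the binding order.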

## Build Query Pipeline

In [43]:
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")
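The chain and links above describe a small dataflow: the query feeds the pandas prompt, the first LLM call produces an expression, the parser evaluates it, and all three results feed the synthesis prompt for the second LLM call. The sketch below traces that flow with plain functions. `fake_llm` and `fake_parser` are hypothetical stand-ins that return canned values, the templates are shortened, and the three ages are the first three from the table preview, so nothing here calls OpenAI or LlamaIndex.

```python
PANDAS_PROMPT = (
    "Follow these instructions:\n{instruction_str}\n"
    "Query: {query_str}\n\nExpression:"
)
SYNTHESIS_PROMPT = (
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)


def run_pipeline(query_str, llm, parser):
    # Chain: input -> pandas_prompt -> llm1 -> pandas_output_parser
    prompt = PANDAS_PROMPT.format(
        instruction_str="PRINT ONLY THE EXPRESSION.", query_str=query_str
    )
    pandas_instructions = llm(prompt)
    pandas_output = parser(pandas_instructions)
    # Links: input, llm1, and the parser each fill one key of the synthesis prompt.
    synthesis = SYNTHESIS_PROMPT.format(
        query_str=query_str,
        pandas_instructions=pandas_instructions,
        pandas_output=pandas_output,
    )
    # Final link: response_synthesis_prompt -> llm2
    return llm(synthesis)


def fake_llm(prompt: str) -> str:
    """Canned replies keyed on which prompt the stand-in model receives."""
    if prompt.endswith("Expression:"):
        return "sum(ages) / len(ages)"
    return "Synthesized from: " + prompt.split("Pandas Output: ")[1].split("\n")[0]


def fake_parser(expression: str):
    """Evaluate the expression against a tiny list instead of a DataFrame."""
    allowed = {"__builtins__": {"sum": sum, "len": len}, "ages": [22.0, 38.0, 26.0]}
    return eval(expression, allowed)


answer = run_pipeline("What is the mean age?", fake_llm, fake_parser)
print(answer)
```

Reusing one `llm` object for both calls mirrors the `llm1` and `llm2` modules above, which point at the same model.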

## Run Query

In [44]:
response = qp.run(
    query_str="What is the correlation between survival and age?",
)

> Running module input with input: 
query_str: What is the correlation between survival and age?

> Running module pandas_prompt with input: 
query_str: What is the correlation between survival and age?

> Running module llm1 with input: 
messages: You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
   survived  pclass                                               name  ...

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Running module pandas_output_parser with input: 
input: assistant: df['survived'].corr(df['age'])

> Running module response_synthesis_prompt with input: 
query_s

In [45]:
print(response.message.content)

The correlation between survival and age is approximately -0.077. This indicates a very weak negative correlation, suggesting that there is little to no relationship between age and survival rates in the dataset analyzed.
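To put "very weak" in numbers, squaring the coefficient gives the share of variance in one variable that a linear relationship with the other would account for. A small worked check, using the value reported by the pandas output parser:

```python
# Pearson's r as reported in the pandas output.
r = -0.07722109457217755

# Coefficient of determination for a simple linear relationship.
r_squared = r**2

print(f"r = {r:.3f}, r^2 = {r_squared:.4f}")
```

An r² of about 0.006 means a linear fit on age would explain well under 1% of the variation in survival, which is consistent with the synthesized answer.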


In [46]:
response = query_engine.query(
    "Who is the oldest survival by age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df[df['survived'] == 1].loc[df['age'].idxmax(), 'name']
```
> Pandas Output: Barkworth, Mr. Algernon Henry Wilson


In [47]:
display(Markdown(f"<b>{response}</b>"))

<b>Barkworth, Mr. Algernon Henry Wilson</b>

In [48]:
response = query_engine.query(
    "Who is the oldest survival by age. Give both the survival name and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['survived'] == 1].loc[df['age'] == df['age'].max(), ['name', 'age']]
```
> Pandas Output:                                      name   age
630  Barkworth, Mr. Algernon Henry Wilson  80.0


In [49]:
display(Markdown(f"<b>{response}</b>"))

<b>                                     name   age
630  Barkworth, Mr. Algernon Henry Wilson  80.0</b>

In [50]:
response = query_engine.query(
    "Who is the youngest survival by age. Give both the survival name and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['survived'] == 1].nsmallest(1, 'age')[['name', 'age']]
```
> Pandas Output:                                 name   age
803  Thomas, Master. Assad Alexander  0.42


In [51]:
display(Markdown(f"<b>{response}</b>"))

<b>                                name   age
803  Thomas, Master. Assad Alexander  0.42</b>