In [None]:
! pip install --upgrade --quiet python-dotenv llama-index-llms-openai llama-index llama-index-embeddings-openai

In [1]:
from llama_index.core import SQLDatabase, Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import NLSQLTableQueryEngine

from sqlalchemy import (
    create_engine,
    text
)

import os
from dotenv import load_dotenv
load_dotenv()

True

### In this document
The first few cells cover the general SQL query engine provided by LlamaIndex. They show how to get started with SQLAlchemy and where `NLSQLTableQueryEngine` falls short.
With a large number of tables and/or columns, either the schema exceeds the model's context window or the generated query fails because the model gets confused by how much schema it has to consider.

### The dataset used
I chose [this database](https://www.kaggle.com/datasets/wyattowalsh/basketball) from Kaggle because it contains 16 tables with 682 columns. The dataset was last updated on Thurs, July 26th, 2023. Download the database and place the `.sqlite` file in the same directory as this notebook.

### Reference Documentation
I'm following [this](https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/duckdb_sql_query/) documentation from LlamaIndex for reference.

In [2]:
# Create a SQLAlchemy engine for the SQLite database file
db_file = "sqlite:///olist.sqlite"
engine = create_engine(db_file)

In [3]:
# Getting all the table names to populate the vector index
all_table_names = []
with engine.connect() as con:
    rows = con.execute(text("SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';"))
    for row in rows:
        all_table_names.append(row[0])
all_table_names

['product_category_name_translation',
 'sellers',
 'customers',
 'geolocation',
 'order_items',
 'order_payments',
 'order_reviews',
 'orders',
 'products',
 'leads_qualified',
 'leads_closed']
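To get a feel for how much schema the query engine has to fit into a prompt, it can help to count the columns per table. The following is a minimal sketch, not part of the original notebook: `schema_size` is a hypothetical helper, it uses the standard-library `sqlite3` module rather than the SQLAlchemy engine above, and the demo tables and their columns are made up so the snippet runs without the downloaded file.

```python
import sqlite3


def schema_size(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table in the database to its number of columns."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%';"
        )
    ]
    sizes = {}
    for name in tables:
        # Quote the table name so unusual identifiers do not break the PRAGMA.
        quoted = '"' + name.replace('"', '""') + '"'
        # PRAGMA table_info returns one row per column of the table.
        sizes[name] = len(conn.execute(f"PRAGMA table_info({quoted})").fetchall())
    return sizes


# Demo on an in-memory database with hypothetical columns.
# For the real data, connect to the downloaded file instead,
# e.g. sqlite3.connect("olist.sqlite").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, status TEXT)")
conn.execute(
    "CREATE TABLE order_reviews (review_id TEXT, order_id TEXT, review_score INTEGER)"
)

sizes = schema_size(conn)
print(sizes)                 # columns per table
print(sum(sizes.values()))   # total number of columns in the schema
```

The total column count is only a rough proxy for prompt size, but it makes it easy to see which tables contribute most of the schema.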

In [4]:
# Use GPT-4 Turbo as the default LLM for all LlamaIndex components
Settings.llm = OpenAI(model="gpt-4-turbo")

In [5]:
# Set up the SQLDatabase wrapper using the SQLAlchemy engine above.
sql_database = SQLDatabase(engine, include_tables=all_table_names)
query_engine = NLSQLTableQueryEngine(sql_database=sql_database)

In [6]:
# Natural-language question to send to the query engine
query_str = "What is the best and worst reviewed product category based on order reviews?"
simple_sql_response = query_engine.query(query_str)

In [7]:
print(f"""
AI Response
{simple_sql_response.response}

Query Output
----------------------
{simple_sql_response.source_nodes[0].text}
      
Attempted Query
----------------------
{simple_sql_response.metadata['sql_query']}

""")


AI Response
It appears there was an error in executing the SQL query provided. The error message indicates that the SQL statement is invalid. To assist further, I would need to correct the SQL query to ensure it is syntactically correct and properly structured to fetch the desired data. Here’s a revised version of the SQL query:

```sql
SELECT product_category_name_english, AVG(review_score) AS average_score
FROM order_reviews
JOIN orders ON order_reviews.order_id = orders.order_id
JOIN order_items ON orders.order_id = order_items.order_id
JOIN products ON order_items.product_id = products.product_id
JOIN product_category_name_translation ON products.product_category_name = product_category_name_translation.product_category_name
GROUP BY product_category_name_english
ORDER BY average_score DESC;
```

Please run this corrected query in your database environment. If it executes successfully, it will provide the average review scores for each product category, ordered from highest to low

In [9]:
from sqlalchemy.exc import OperationalError

# Run the generated SQL directly against the database to see why it failed
try:
    with engine.connect() as con:
        rows = con.execute(text(simple_sql_response.metadata['sql_query']))
        for row in rows:
            print(row)
except OperationalError as e:
    print(f"Error: {e.args[0]}")

Error: (sqlite3.OperationalError) near "sql": syntax error
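The `near "sql"` error suggests that the generated query string still carries part of a Markdown code fence (for example a leading `sql` language tag) rather than plain SQL. Below is a minimal sketch of one way to clean such a string before executing it. `strip_sql_fence` is a hypothetical helper, not part of LlamaIndex, and the demo uses the standard-library `sqlite3` module with a trivial query instead of the notebook's engine.

```python
import re
import sqlite3


def strip_sql_fence(raw: str) -> str:
    """Return the SQL statement without Markdown code-fence residue."""
    cleaned = raw.strip()
    # If the text contains a complete fenced block, keep only its contents.
    match = re.search(
        r"```(?:sql)?\s*(.*?)```", cleaned, flags=re.DOTALL | re.IGNORECASE
    )
    if match:
        cleaned = match.group(1).strip()
    # Drop a leftover language tag ("sql" alone on the first line).
    cleaned = re.sub(r"\A(?:```)?sql\s*\n", "", cleaned, flags=re.IGNORECASE)
    # Drop a dangling closing fence at the end.
    cleaned = re.sub(r"```\s*\Z", "", cleaned)
    return cleaned.strip()


# Demo: a fenced query, a partially stripped one, and an already clean one.
fenced = "```sql\nSELECT 1;\n```"
partial = "sql\nSELECT 1;\n```"
plain = "SELECT 1;"

conn = sqlite3.connect(":memory:")
for candidate in (fenced, partial, plain):
    query = strip_sql_fence(candidate)
    print(repr(query), conn.execute(query).fetchall())
```

In the notebook, the same helper could be applied to `simple_sql_response.metadata['sql_query']` before passing it to `text(...)`.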
