# 01 - Data Analysis with Text-to-SQL: Leveraging Anthropic Claude on Amazon Bedrock with SQLite

This notebook demonstrates a practical approach to natural language querying of structured data using Large Language Models (LLMs) and SQLite. We use Anthropic's Claude to translate plain English questions into SQL queries, which are then executed against a local SQLite database that stands in for a production SQL database without external dependencies. Combining LLM capabilities with SQL bridges the gap between non-technical users and data retrieval, enabling intuitive data exploration without SQL proficiency. Techniques for prompt engineering and query optimization are also explored.

In [1]:
!python --version

Python 3.10.13


In [2]:
%pip install --upgrade --quiet langchain langchain-community langchain-aws
%pip install -q sqlfluff

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


Restart the kernel after installing dependencies

In [3]:
# restart kernel (this JavaScript call works in the classic Jupyter Notebook front end)
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [4]:
import warnings
warnings.filterwarnings("ignore")

> This notebook was tested with a Python 3.10 kernel.

## 1. Explore data

The data was exported from the [country profile of Tunisia in Harvard's Atlas of Economic Complexity](https://atlas.cid.harvard.edu/explore?country=223&queryLevel=location&product=undefined&year=2021&productClass=HS&target=Product&partner=undefined&startYear=undefined).

In [5]:
import pandas as pd

In [6]:
ls data/

'What did Tunisia export in 2021_.csv'  'What did Tunisia import in 2021_.csv'


In [7]:
properties_data_df = pd.read_csv("data/What did Tunisia export in 2021_.csv")

In [8]:
properties_data_df.head()

Unnamed: 0,Name,Gross Export,Share,Code,Sector
0,Horses,109905,0.000539,101,Agriculture
1,Fowl,284920,0.001396,105,Agriculture
2,Other live animals,36664,0.00018,106,Agriculture
3,Poultry,3507712,0.017193,207,Agriculture
4,Other meat,1087,5e-06,208,Agriculture


In [9]:
properties_data_df.shape

(976, 5)

In [10]:
properties_data_df["year"] = 2021

## 2. Important Design Choices and Decisions

Allowing large language models to prepare SQL queries to run against a database should be done with great caution and appropriate safeguards in place. We also need to recognize and work around the variety of database systems and SQL syntax variations when developing AI assistants that may access databases.

To achieve this, please consider the following recommendations when designing agents/assistants that can access SQL databases:

1. **Never allow the agent to perform write operations.** This limits the damage that a faulty, malicious, or injected query can do.
2. The LLM agent should run with a user that has **read-only permission against a selection of tables**.
   - This reinforces and emphasizes point 1.
3. When possible, **use aggregated or materialized views** instead of querying the base tables directly. This will speed up the queries and reduce the load on the database.
4. **Limit the agent's access to only the required SQL tables** based on the expected user queries.

To control access to the database during the engagement, we will help you build a Lambda function that wraps the functionality of the LLM assistant. This Lambda function can be easily integrated with the remaining components of your system. By limiting the access permissions of the Lambda function to read-only from the SQL database, we can effectively limit the permissions of the LLM assistant.
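
For the SQLite proof of concept, one way to approximate recommendations 1 and 2 is to open the database file in read-only mode, so that the database engine itself rejects writes regardless of what SQL the model produces. The sketch below is a minimal illustration under assumptions: the temporary file, the `exports` table, and its single row are hypothetical stand-ins, not part of this notebook's data pipeline.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Hypothetical demo database; path and table name are illustrative only.
    demo_path = os.path.join(demo_dir, "demo.db")

    # Create and populate the database with a normal read-write connection.
    rw_conn = sqlite3.connect(demo_path)
    rw_conn.execute("CREATE TABLE exports (name TEXT, gross_export INTEGER)")
    rw_conn.execute("INSERT INTO exports VALUES ('Horses', 109905)")
    rw_conn.commit()
    rw_conn.close()

    # Reopen the same file read-only using SQLite's URI syntax.
    ro_conn = sqlite3.connect(f"file:{demo_path}?mode=ro", uri=True)

    # Reads succeed as usual.
    rows = ro_conn.execute("SELECT name, gross_export FROM exports").fetchall()

    # Writes are rejected by SQLite itself.
    try:
        ro_conn.execute("DELETE FROM exports")
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    finally:
        ro_conn.close()

print(rows, write_blocked)
```

In a production database the equivalent control would be a dedicated read-only database user, as described in the list above.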

## 3. Store data as SQLite db

In this demonstration, we use SQLite to mock the SQL database. You can validate your initial proof of concept using SQLite, then move to a production-grade database for the pilot.

In [11]:
import sqlite3

# Create an empty SQLite db
conn = sqlite3.connect("/tmp/tunisia-exports.db")
c = conn.cursor()
# Write the pandas dataframe data into the SQLite db
properties_data_df.to_sql("tunisia_exports_table", conn, if_exists="replace", index=False)
conn.close()
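
Column names such as `Gross Export` contain a space, so any generated SQL has to quote them. A quick way to check which columns and types a table exposes is SQLite's `PRAGMA table_info`. The sketch below uses an in-memory database with a hand-written, reduced version of the table (an assumption made to keep it self-contained), not the file created above.

```python
import sqlite3

# In-memory stand-in for the real database; the reduced schema is illustrative.
schema_conn = sqlite3.connect(":memory:")
schema_conn.execute(
    'CREATE TABLE tunisia_exports_table ("Name" TEXT, "Gross Export" INTEGER, "Sector" TEXT)'
)

# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [
    (row[1], row[2])
    for row in schema_conn.execute('PRAGMA table_info("tunisia_exports_table")')
]
schema_conn.close()

print(columns)
```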

## 4. Prepare an LLM

In [12]:
import boto3

bedrock_client = boto3.client("bedrock", region_name="us-west-2")
bedrock_runtime_client = boto3.client("bedrock-runtime", region_name="us-west-2")

In [13]:
available_foundation_models = bedrock_client.list_foundation_models()

Uncomment the following to see the full list of models on Amazon Bedrock

In [14]:
# available_foundation_models

Below we keep only models from the Anthropic Claude family.

In [15]:
claude_models_on_bedrock = [
    m for m in available_foundation_models["modelSummaries"]
    if "Claude" in m["modelName"]
]

Below we extract the model ids for models in the Anthropic Claude family.

In [16]:
[m["modelId"] for m in claude_models_on_bedrock]

['anthropic.claude-instant-v1:2:100k',
 'anthropic.claude-instant-v1',
 'anthropic.claude-v2:0:18k',
 'anthropic.claude-v2:0:100k',
 'anthropic.claude-v2:1:18k',
 'anthropic.claude-v2:1:200k',
 'anthropic.claude-v2:1',
 'anthropic.claude-v2',
 'anthropic.claude-3-sonnet-20240229-v1:0:28k',
 'anthropic.claude-3-sonnet-20240229-v1:0:200k',
 'anthropic.claude-3-sonnet-20240229-v1:0',
 'anthropic.claude-3-haiku-20240307-v1:0:48k',
 'anthropic.claude-3-haiku-20240307-v1:0:200k',
 'anthropic.claude-3-haiku-20240307-v1:0',
 'anthropic.claude-3-opus-20240229-v1:0']

We pick the Claude 3 Haiku model for the first experiment.

In [17]:
model_id = "anthropic.claude-3-haiku-20240307-v1:0"

In [18]:
from langchain_aws import ChatBedrock

In [19]:
model_kwargs = {
    "temperature": 0.0,
    "top_p": 0.99,
    "max_tokens": 1000,
}

llm = ChatBedrock(
    client=bedrock_runtime_client,
    model_id=model_id,
    model_kwargs=model_kwargs,
)

In [20]:
from langchain_community.utilities import SQLDatabase

In [21]:
# load db
tunisia_exports_db = SQLDatabase.from_uri("sqlite:////tmp/tunisia-exports.db")

## Ask the LLM to Generate SQL Queries then Handle the Execution Separately

### Generate the SQL Query from Natural Language

In [22]:
from langchain.chains import create_sql_query_chain

In [23]:
text_to_sql_chain = create_sql_query_chain(llm=llm, db=tunisia_exports_db)

In [24]:
user_question = "Which sector has the highest gross export in Tunisia?"

In [25]:
%%time

sql_query = text_to_sql_chain.invoke({"question": user_question})
sql_query

CPU times: user 56.1 ms, sys: 69 µs, total: 56.1 ms
Wall time: 831 ms


'Question: Which sector has the highest gross export in Tunisia?\nSQLQuery: SELECT "Sector", MAX("Gross Export") AS "Highest Gross Export"\nFROM tunisia_exports_table\nGROUP BY "Sector"\nORDER BY "Highest Gross Export" DESC\nLIMIT 1;'

In [26]:
if "SQLQuery:" in sql_query:
    sql_query = sql_query.split("SQLQuery:")[-1].strip()

In [27]:
sql_query

'SELECT "Sector", MAX("Gross Export") AS "Highest Gross Export"\nFROM tunisia_exports_table\nGROUP BY "Sector"\nORDER BY "Highest Gross Export" DESC\nLIMIT 1;'
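
The split on `SQLQuery:` in the cell above handles the response format seen here. A slightly more defensive variant might look like the sketch below. The helper name `extract_sql` is hypothetical, and the handling of a trailing `SQLResult:` section and of Markdown code fences is based on the assumption that some models add them. Neither case was observed in this notebook.

```python
import re

FENCE = "`" * 3  # three backticks, i.e. a Markdown code fence


def extract_sql(llm_output: str) -> str:
    """Return only the SQL text from a raw model response."""
    text = llm_output.strip()
    # Keep what follows the last "SQLQuery:" marker, as in the cell above.
    if "SQLQuery:" in text:
        text = text.split("SQLQuery:")[-1].strip()
    # Assumption: some models append a "SQLResult:" section; drop it if present.
    text = text.split("SQLResult:")[0].strip()
    # Assumption: some models wrap the query in a Markdown code fence; unwrap it.
    fenced = re.search(
        FENCE + r"(?:sql)?\s*(.*?)" + FENCE, text, flags=re.DOTALL | re.IGNORECASE
    )
    if fenced:
        text = fenced.group(1).strip()
    return text


raw = 'Question: Which sector exports most?\nSQLQuery: SELECT "Sector" FROM t LIMIT 1;'
cleaned = extract_sql(raw)
print(cleaned)
```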

### Apply a SQL Linter to Automatically Fix the Query if Needed

After receiving the SQL query, you can validate it, execute it, and potentially call an LLM with the original question and the query answer to formulate a natural language answer. This gives you full control over the query and question answering lifecycle.

You can, for example, run a SQL linter such as [sqlfluff](https://github.com/sqlfluff/sqlfluff) on the SQL query.

In [28]:
%%time
import sqlfluff

fixed_query = sqlfluff.fix(
    sql=sql_query,
    dialect="sqlite"  # match the dialect of the database that will run the query
)

print(fixed_query)

SELECT
    "Sector",
    MAX("Gross Export") AS "Highest Gross Export"
FROM tunisia_exports_table
GROUP BY "Sector"
ORDER BY "Highest Gross Export" DESC
LIMIT 1;

CPU times: user 557 ms, sys: 36.8 ms, total: 593 ms
Wall time: 713 ms
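
A linter fixes style but does not enforce the read-only policy from the design recommendations. A simple additional guard, sketched below, accepts only a single statement that starts with `SELECT`. The function name `is_single_select` is hypothetical, and the check is deliberately conservative: it rejects queries that start with a `WITH` clause and any query with a semicolon inside a string literal. It complements read-only database permissions rather than replacing them.

```python
import re


def is_single_select(query: str) -> bool:
    """Conservatively check that the text is one SELECT statement."""
    body = query.strip()
    # Drop one trailing semicolon, then reject anything that still contains one.
    if body.endswith(";"):
        body = body[:-1].rstrip()
    if ";" in body:
        return False
    # Only statements that begin with the SELECT keyword are accepted.
    return re.match(r"select\b", body, flags=re.IGNORECASE) is not None


checks = {
    "select": is_single_select('SELECT "Sector" FROM tunisia_exports_table LIMIT 1;'),
    "drop": is_single_select("DROP TABLE tunisia_exports_table"),
    "stacked": is_single_select("SELECT 1; DROP TABLE tunisia_exports_table"),
}
print(checks)
```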


### Execute the Query Against the Database Manually

In [29]:
%%time
conn = sqlite3.connect("/tmp/tunisia-exports.db")
c = conn.cursor()
c.execute(fixed_query)
result = c.fetchall()
conn.close()

CPU times: user 1.94 ms, sys: 0 ns, total: 1.94 ms
Wall time: 1.57 ms


In [30]:
result

[('Electronics', 2113516288)]

In [31]:
conn = sqlite3.connect("/tmp/tunisia-exports.db")


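def execute_sql_query(conn, query):
    """Run a query on an open connection and return all rows as a list of tuples."""
    c = conn.cursor()
    c.execute(query)
    # fetchall() returns [] when nothing matches, so do not index into the result here
    return c.fetchall()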
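
The helper above returns bare tuples, so the column names are lost before the result reaches the LLM. A possible variant, sketched below, pairs each value with its column name using `cursor.description`. The function name `execute_sql_query_as_dicts` and the small in-memory table are hypothetical and used only for illustration.

```python
import sqlite3


def execute_sql_query_as_dicts(conn, query):
    """Run a query and return each row as a {column_name: value} dict."""
    cursor = conn.execute(query)
    # description holds one 7-tuple per result column; item 0 is the column name.
    column_names = [column[0] for column in cursor.description]
    return [dict(zip(column_names, row)) for row in cursor.fetchall()]


# Made-up table, used only to exercise the helper.
dict_conn = sqlite3.connect(":memory:")
dict_conn.execute('CREATE TABLE demo_exports ("Sector" TEXT, "Gross Export" INTEGER)')
dict_conn.execute("INSERT INTO demo_exports VALUES ('Agriculture', 109905)")
labelled = execute_sql_query_as_dicts(
    dict_conn, 'SELECT "Sector", "Gross Export" AS total_export FROM demo_exports'
)
dict_conn.close()

print(labelled)
```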

### Formulate a Natural Language Answer from the Query and Result

In [32]:
import json
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate


system_prompt = """
You are an AI assistant helping non-technical business users understand data insights clearly and professionally. Your task is to take a business question and SQL query results, and provide a concise answer addressing the question directly, without delving into technical details.

Follow these guidelines:

1. Understand the context of the question and data.
2. Craft a clear, confident answer in a professional tone.
3. Focus on key insights and takeaways relevant to the question.
4. Use simple, non-technical language.
5. Highlight the most important points for quick understanding.

Your goal is to communicate actionable insights that empower informed business decisions based on the data, without overwhelming with technical jargon.

Provide a concise and professional answer based on the question and SQL results."""

messages = [
    ("system", system_prompt),
    ("user", "<question>{question}</question><sql_result>{sql_result}</sql_result>")
]

prompt = ChatPromptTemplate.from_messages(messages)
chain = prompt | llm | StrOutputParser()

In [33]:
%%time
response = chain.invoke(
    {
        "question": user_question,
        "sql_result": json.dumps(result)
    }
)

CPU times: user 12.6 ms, sys: 252 µs, total: 12.8 ms
Wall time: 2.09 s
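
The call above passes the whole query result to the model via `json.dumps(result)`. For queries that can return many rows, it may be worth capping what is sent. The sketch below shows one possible approach. The helper name `serialize_result` and the default limit of 50 rows are assumptions, not values used elsewhere in this notebook.

```python
import json


def serialize_result(rows, max_rows=50):
    """Serialize at most max_rows rows, recording whether rows were dropped."""
    payload = {
        "rows": [list(row) for row in rows[:max_rows]],
        "row_count": len(rows),
        "truncated": len(rows) > max_rows,
    }
    return json.dumps(payload)


# Five single-column rows, capped at two for the demonstration.
serialized = serialize_result([(i,) for i in range(5)], max_rows=2)
print(serialized)
```

Including `row_count` and `truncated` lets the answer-writing prompt acknowledge that it saw only part of the result.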


In [34]:
print(response)

Based on the SQL query results, the sector with the highest gross export in Tunisia is Electronics, with a gross export value of 2,113,516,288.

This indicates that the Electronics sector is a significant driver of Tunisia's exports and a key contributor to the country's economy. The high gross export value for this sector suggests it may be an area of strength and competitive advantage for Tunisia in the global market.
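
Note that the query generated for this question uses `MAX("Gross Export")` per sector, so the value 2,113,516,288 is the largest single product line within a sector, not the sector's total exports. Whether that matches the user's intent depends on how the question is read. The toy example below, with made-up rows in a hypothetical `demo_exports` table, shows how `MAX` and `SUM` can rank sectors differently.

```python
import sqlite3

# Made-up rows: Sector X has one large product, Sector Y has two mid-sized ones.
agg_conn = sqlite3.connect(":memory:")
agg_conn.execute(
    'CREATE TABLE demo_exports ("Name" TEXT, "Sector" TEXT, "Gross Export" INTEGER)'
)
agg_conn.executemany(
    "INSERT INTO demo_exports VALUES (?, ?, ?)",
    [
        ("Product A", "Sector X", 100),
        ("Product B", "Sector Y", 60),
        ("Product C", "Sector Y", 70),
    ],
)

# Same query shape as the generated one, with the aggregate swapped in.
template = (
    'SELECT "Sector", {agg}("Gross Export") AS top_value '
    'FROM demo_exports GROUP BY "Sector" ORDER BY top_value DESC LIMIT 1'
)
top_by_max = agg_conn.execute(template.format(agg="MAX")).fetchone()
top_by_sum = agg_conn.execute(template.format(agg="SUM")).fetchone()
agg_conn.close()

print(top_by_max, top_by_sum)
```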


## Evaluate on multiple questions

In [35]:
questions = [
    "Which sector has the highest total gross export in Tunisia?",
    "What are the top 3 sectors with the highest gross export in Tunisia?"
]

In [36]:
%%time

qa = []
for question in questions:
    sql_query = text_to_sql_chain.invoke({"question": question})
    if "SQLQuery:" in sql_query:
        sql_query = sql_query.split("SQLQuery:")[-1].strip()
    result = execute_sql_query(conn, sql_query)
    response = chain.invoke(
        {
            "question": question,
            "sql_result": json.dumps(result)
        }
    )
    qa.append(dict(question=question, answer=response.strip(), sql_query=sql_query))

CPU times: user 76.5 ms, sys: 3.01 ms, total: 79.5 ms
Wall time: 7.11 s


In [37]:
qa

[{'question': 'Which sector has the highest total gross export in Tunisia?',
  'answer': "Based on the SQL query results, the sector with the highest total gross export in Tunisia is Electronics, with a total export value of 4,533,943,713.\n\nThe key insight here is that the Electronics sector is the top exporting industry in Tunisia, significantly outpacing other sectors. This suggests that the Electronics industry is a major driver of Tunisia's export economy and likely plays a critical role in the country's overall economic performance.",
  'sql_query': 'SELECT "Sector", SUM("Gross Export") AS "Total Gross Export"\nFROM tunisia_exports_table\nGROUP BY "Sector"\nORDER BY "Total Gross Export" DESC\nLIMIT 1;'},
 {'question': 'What are the top 3 sectors with the highest gross export in Tunisia?',
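  'answer': "Based on the SQL query results, the top 3 sectors with the highest gross export in Tunisia are:\n\n1. Electronics - $4,533,943,713\n2. Textiles - $4,099,307,399 \n3. Services - $2,933,225,232\n\nThe electronics sector has the highest gross export value, followed by textiles and then services. These appear to be the key export-oriented industries driving Tunisia's economy.",
  'sql_query': 'SELECT "Sector", SUM("Gross Export") AS "Total Gross Export"\nFROM tunisia_exports_table\nGROUP BY "Sector"\nORDER BY "Total Gross Export" DESC\nLIMIT 3;'}]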

In [38]:
index = 0
print(qa[index]["question"] + "\n\n", qa[index]["answer"].strip() + "\n\n", qa[index]["sql_query"].strip())

Which sector has the highest total gross export in Tunisia?

 Based on the SQL query results, the sector with the highest total gross export in Tunisia is Electronics, with a total export value of 4,533,943,713.

The key insight here is that the Electronics sector is the top exporting industry in Tunisia, significantly outpacing other sectors. This suggests that the Electronics industry is a major driver of Tunisia's export economy and likely plays a critical role in the country's overall economic performance.

 SELECT "Sector", SUM("Gross Export") AS "Total Gross Export"
FROM tunisia_exports_table
GROUP BY "Sector"
ORDER BY "Total Gross Export" DESC
LIMIT 1;


In [39]:
index = 1
print(qa[index]["question"] + "\n\n", qa[index]["answer"].strip() + "\n\n", qa[index]["sql_query"].strip())

What are the top 3 sectors with the highest gross export in Tunisia?

 Based on the SQL query results, the top 3 sectors with the highest gross export in Tunisia are:

1. Electronics - $4,533,943,713
2. Textiles - $4,099,307,399 
3. Services - $2,933,225,232

The electronics sector has the highest gross export value, followed by textiles and then services. These appear to be the key export-oriented industries driving Tunisia's economy.

 SELECT "Sector", SUM("Gross Export") AS "Total Gross Export"
FROM tunisia_exports_table
GROUP BY "Sector"
ORDER BY "Total Gross Export" DESC
LIMIT 3;
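
Reading answers by eye works for two questions but does not scale. One common way to score text-to-SQL output is to compare the rows returned by the generated query with those returned by a hand-written reference query. The sketch below illustrates the idea on made-up data. The helper `same_result`, the `demo_exports` table, and the candidate queries are all hypothetical, and the comparison ignores row order, which may be too lenient for ranking questions.

```python
import sqlite3


def same_result(conn, generated_sql, reference_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        # Invalid or failing SQL counts as a wrong answer.
        return False
    reference_rows = conn.execute(reference_sql).fetchall()
    return sorted(generated_rows, key=repr) == sorted(reference_rows, key=repr)


# Made-up rows, used only to exercise the helper.
eval_conn = sqlite3.connect(":memory:")
eval_conn.execute('CREATE TABLE demo_exports ("Sector" TEXT, "Gross Export" INTEGER)')
eval_conn.executemany(
    "INSERT INTO demo_exports VALUES (?, ?)",
    [("Sector X", 10), ("Sector X", 20), ("Sector Y", 25)],
)

query_shape = (
    'SELECT "Sector", {agg}("Gross Export") AS total_export '
    'FROM demo_exports GROUP BY "Sector" ORDER BY total_export DESC LIMIT 1'
)
reference = query_shape.format(agg="SUM")
candidates = {
    "sum": query_shape.format(agg="SUM"),
    "max": query_shape.format(agg="MAX"),
    "broken": "SELEC nothing",
}
scores = {name: same_result(eval_conn, sql, reference) for name, sql in candidates.items()}
eval_conn.close()

print(scores)
```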


## Conclusion

In this notebook, we loaded Tunisia's 2021 export data into a dataframe, stored it in a SQLite database that stands in for a production SQL database, and used an LLM to generate SQL queries from natural language questions. We also showed how to clean up the generated queries with a SQL linter, execute them, and formulate natural language answers from the query results. The notebook provides a foundation for building AI assistants that query databases while following the security and performance recommendations in Section 2.

---

### Environment and Dependency Information

In [40]:
from utils.helper import package_imports
dict(package_imports(globals()))

{'builtins': None,
 'IPython.core.interactiveshell': '8.20.0',
 'IPython.core.autocall': '8.20.0',
 'io': None,
 'IPython.core.display': '8.20.0',
 'pandas': '2.1.4',
 'pandas.core.frame': '2.1.4',
 'sqlite3': None,
 'boto3': '1.34.93',
 'botocore.client': '1.34.93',
 'langchain_aws.chat_models.bedrock': None,
 'langchain_community.utilities.sql_database': '0.0.34',
 'langchain.chains.sql_database.query': '0.1.16',
 'langchain_core.runnables.base': '0.1.46',
 'sqlfluff': '3.0.5',
 '__main__': None,
 'json': '2.0.9',
 'langchain_core.output_parsers.string': '0.1.46',
 'langchain_core.prompts.chat': '0.1.46',
 'utils.helper': None}

In [41]:
!uname -a

Linux default 4.14.336-257.562.amzn2.x86_64 #1 SMP Sat Feb 24 09:50:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


In [42]:
!python --version

Python 3.10.13
