# Q&A against a SQL Database

Now that we know (from the prior Notebook) how to query tabular data in a CSV file, let's keep the data at its source and ask questions directly to a SQL Database.
The goal of this notebook is to demonstrate how an LLM as advanced as GPT-4 can understand a human question and translate it into a SQL query to get the answer. 

We will be using the Azure SQL Server that you created during the initial deployment. The server should be in the same Resource Group as the Azure Cognitive Search service.

Let's begin.

In [1]:
import os
import pandas as pd
import pyodbc
from langchain.chat_models import AzureChatOpenAI
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain import SQLDatabaseChain
from langchain.chains import SQLDatabaseSequentialChain

from app.prompts import MSSQL_PROMPT

from IPython.display import Markdown, HTML, display  

def printmd(string):
    display(Markdown(string))

# Don't mess with this unless you really know what you are doing
AZURE_OPENAI_API_VERSION = "2023-03-15-preview"

# Change these below with your own services credentials
AZURE_OPENAI_ENDPOINT = "https://openainlp1.openai.azure.com/"
AZURE_OPENAI_API_KEY = "9e64f10580c44989b8fabe9b23336344"
SQL_SERVER_ENDPOINT = "6fa2eetg2lasw.database.windows.net"
SQL_SERVER_DATABASE = "SampleDB"
SQL_SERVER_USERNAME = "sqladmin"
SQL_SERVER_PASSWORD = "Acar2K6089$"

In [3]:
# Set the ENV variables 
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_OPENAI_ENDPOINT
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"] = AZURE_OPENAI_API_VERSION
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["SQL_SERVER_ENDPOINT"] = SQL_SERVER_ENDPOINT
os.environ["SQL_SERVER_DATABASE"] = SQL_SERVER_DATABASE
os.environ["SQL_SERVER_USERNAME"] = SQL_SERVER_USERNAME
os.environ["SQL_SERVER_PASSWORD"] = SQL_SERVER_PASSWORD
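
The cells above keep the credentials directly in the notebook. As a minimal sketch of an alternative, assuming the values were exported in the shell (or loaded from a local, untracked file) before starting Jupyter, a small helper can read them and fail early when one is missing. `require_env` is a hypothetical helper name, not part of this repo, and the `setdefault` line exists only so the sketch runs on its own.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demo only: give the variable a value so this sketch is self-contained.
os.environ.setdefault("SQL_SERVER_DATABASE", "SampleDB")

database = require_env("SQL_SERVER_DATABASE")
print(database)
```

With this pattern, a missing variable raises immediately instead of surfacing later as a failed database connection.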

# Install the MS SQL DB driver on your machine

We need the driver installed on this compute in order to talk to the SQL DB, so run the cell below once.<br>
Reference: https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver16&tabs=ubuntu18-install%2Calpine17-install%2Cdebian8-install%2Credhat7-13-install%2Crhel7-offline

In [4]:
!sudo ./download_odbc_driver.sh

sudo: ./download_odbc_driver.sh: command not found


# Load Azure SQL DB with the Covid Tracking CSV Data

The Azure SQL Database is currently empty, so we need to fill it with data. Let's use the same data from the Covid CSV file we used in the prior Notebook, so that we can compare results and methods. 
For this, you will need to enter below the credentials you used at creation time.

In [28]:

from sqlalchemy.exc import OperationalError  # SQLAlchemy raises its own OperationalError, not sqlite3's
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

db_config = {
                'drivername': 'mssql+pyodbc',
                'username': os.environ["SQL_SERVER_USERNAME"] +'@'+ os.environ["SQL_SERVER_ENDPOINT"],
                'password': os.environ["SQL_SERVER_PASSWORD"],
                'host': os.environ["SQL_SERVER_ENDPOINT"],
                'port': 1433,
                'database': os.environ["SQL_SERVER_DATABASE"],
                'query': {'driver': 'ODBC Driver 18 for SQL Server'}
            }

# Create a URL object for connecting to the database
db_url = URL.create(**db_config)

# Print the resulting URL string
# print(db_url)

# Connect to the Azure SQL Database using the URL string
engine = create_engine(db_url)

# Test the connection
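from sqlalchemy import text  # wraps raw SQL strings for execution

try:
    # Run the query on the opened connection and close it automatically
    with engine.connect() as conn:
        print("Connection successful!")
        result = conn.execute(text("SELECT @@Version"))
        for row in result:
            print(row)

except OperationalError:
    print("Connection failed.")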

Connection successful!
('Microsoft SQL Azure (RTM) - 12.0.2000.8 \n\tMar 30 2023 16:38:37 \n\tCopyright (C) 2022 Microsoft Corporation\n',)


In [16]:
# Read CSV file into a pandas dataframe
csv_path = "./data/all-states-history.csv"
df = pd.read_csv(csv_path).fillna(value = 0)

# Infer column names and data types
column_names = df.columns.tolist()
column_types = df.dtypes.to_dict()

# Generate SQL statement to create table
table_name = 'covidtracking'

create_table_sql = f"CREATE TABLE {table_name} ("
for name, dtype in column_types.items():
    if dtype == 'object':
        create_table_sql += f"{name} VARCHAR(255), "
    elif dtype == 'int64':
        create_table_sql += f"{name} INT, "
    elif dtype == 'float64':
        create_table_sql += f"{name} FLOAT, "
    elif dtype == 'bool':
        create_table_sql += f"{name} BIT, "
    elif dtype == 'datetime64[ns]':
        create_table_sql += f"{name} DATETIME, "
create_table_sql = create_table_sql[:-2] + ")"

try:
    engine.execute(create_table_sql)
except Exception as e:
    print(e)
    
# Insert data into SQL Database
try:
    df.to_sql(table_name, con=engine, if_exists='fail', index=False)
except Exception as e:
    print(e)

Table 'covidtracking' already exists.
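
The loop in the cell above silently skips any column whose dtype it does not list, and it leaves column names unquoted. The following is a minimal sketch of a stricter variant, not the notebook's own code: it takes dtype names as plain strings (for example `str(dtype)` from `df.dtypes`), wraps identifiers in square brackets, and raises on an unmapped dtype. `SQL_TYPES`, `quote_ident` and `build_create_table` are hypothetical names introduced for illustration.

```python
# Same dtype-to-SQL mapping as the cell above, expressed as a lookup table.
SQL_TYPES = {
    "object": "VARCHAR(255)",
    "int64": "INT",
    "float64": "FLOAT",
    "bool": "BIT",
    "datetime64[ns]": "DATETIME",
}

def quote_ident(name: str) -> str:
    """Wrap an identifier in square brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"

def build_create_table(table_name: str, column_types: dict) -> str:
    """Build a CREATE TABLE statement from a {column name: dtype name} mapping."""
    columns = []
    for name, dtype in column_types.items():
        key = str(dtype)
        if key not in SQL_TYPES:
            # Fail loudly instead of dropping the column from the table
            raise ValueError(f"No SQL type mapping for column {name!r} with dtype {key!r}")
        columns.append(f"{quote_ident(name)} {SQL_TYPES[key]}")
    return f"CREATE TABLE {quote_ident(table_name)} ({', '.join(columns)})"

ddl = build_create_table("covidtracking", {"date": "object", "death": "float64"})
print(ddl)  # CREATE TABLE [covidtracking] ([date] VARCHAR(255), [death] FLOAT)
```

Separately, note that `if_exists='fail'` makes `DataFrame.to_sql` raise when the table is already present, which matches the message printed above. If the table is created first, `if_exists='append'` is the option that inserts rows into it, at the cost of duplicating data when the cell is rerun.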


# Query with LLM

In [29]:
# Create our LLM LangChain object. The deployment used here is named gpt-35-turbo;
# set deployment_name to your GPT-4 deployment if you want to use GPT-4
llm = AzureChatOpenAI(deployment_name="gpt-35-turbo", temperature=0, max_tokens=500)

In [30]:
# Let's use a type of Chain made for this type of SQL work.  
db = SQLDatabase.from_uri(db_url)
db_chain = SQLDatabaseChain(llm=llm, database=db, prompt=MSSQL_PROMPT, verbose=True)
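
Conceptually, a one-shot SQL chain asks the model for a query, runs it, and then asks the model to phrase an answer from the result. The sketch below illustrates that flow only; it is not LangChain's implementation. It assumes an in-memory SQLite table in place of Azure SQL, and `fake_llm` and `one_shot_sql_chain` are hypothetical stand-ins that return canned text instead of calling a model.

```python
import sqlite3

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat model: returns canned text."""
    if "SQLResult:" in prompt:
        # Second call: phrase an answer from the query result
        result = prompt.split("SQLResult:")[1].strip()
        return f"The query returned {result}."
    # First call: produce a SQL query for the question
    return "SELECT COUNT(*) FROM covidtracking"

def one_shot_sql_chain(question, conn, llm):
    """Question -> SQL -> rows -> answer, with no retries or reflection."""
    sql = llm(f"Question: {question}\nSQLQuery:")
    rows = conn.execute(sql).fetchall()
    answer = llm(f"Question: {question}\nSQLQuery: {sql}\nSQLResult: {rows}")
    return {"query": sql, "rows": rows, "result": answer}

# Toy data, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covidtracking (date TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO covidtracking VALUES (?, ?)",
    [("2020-07-01", "AA"), ("2020-07-02", "AA"), ("2020-07-01", "BB")],
)

out = one_shot_sql_chain("How many rows are in the table?", conn, fake_llm)
print(out["result"])
```

The important property is that the query is generated and executed exactly once; nothing in this flow checks whether the query actually answers the question.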

In [31]:
# Let's check the prompt we created
print(db_chain.prompt.template)


You are an MS SQL expert. Given an input question, first create a syntactically correct MS SQL query to run, then look at the results of the query and return the answer to the input question.

Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the TOP clause as per MS SQL. You can order the results to return the most informative data in the database.

Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in square brackets ([]) to denote them as delimited identifiers.

Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.

**Do not use double quotes on the SQL query**. 

Your response should be in Markdown.

** ALWAYS before giving the Final Answer, try another method**. Then reflect on the answers of th

In [32]:
# Natural Language question (query)
query_str = 'How may patients in total were hospitalized during July 2020 nationwide?'

In [33]:
printmd(db_chain(query_str)['result'])



> Entering new SQLDatabaseChain chain...
How may patients in total were hospitalized during July 2020 nationwide?
SQLQuery: SELECT SUM([hospitalizedCumulative]) FROM covidtracking WHERE date LIKE '2020-07%'
SQLResult: [(5535849,)]
Answer: There were 5,535,849 patients hospitalized during July 2020 nationwide.

Explanation:
I queried the covidtracking table for the hospitalizedCumulative column where the date starts with '2020-07'. The query returned a list with the number of patients hospitalized for each day in July 2020. To answer the question, I took the sum of all the hospitalized patients in the list, which is 5,535,849. 
I used the following query

```sql
SELECT SUM([hospitalizedCumulative]) FROM covidtracking WHERE date LIKE '2020-07%'
```
> Finished chain.


There were 5,535,849 patients hospitalized during July 2020 nationwide.

Explanation:
I queried the covidtracking table for the hospitalizedCumulative column where the date starts with '2020-07'. The query returned a list with the number of patients hospitalized for each day in July 2020. To answer the question, I took the sum of all the hospitalized patients in the list, which is 5,535,849. 
I used the following query

```sql
SELECT SUM([hospitalizedCumulative]) FROM covidtracking WHERE date LIKE '2020-07%'
```

### To use or not to use Agents

As you can see above, we achieved our goal of Question->SQL without using an Agent; we did it just by using a clever prompt. (If you want to see the different kinds of prompt templates that come with LangChain for the SQL chain, you can check them out [HERE](https://github.com/hwchase17/langchain/blob/master/langchain/chains/sql_database/prompt.py)). **So the question is, why do we need an Agent then?**

**This is why**: As we explained in the prior Notebook, an agent is a process in which the LLM asks itself what approach and steps to take, questions the validity of the results, and provides the answer only once it is sure. The SQLDatabaseChain doesn't do all this analysis; instead it tries a one-shot query to answer the question. That is good, but not enough for complex questions. That's why it couldn't solve the nationwide part: the problem needs multiple steps, and one query is not enough.
Notice that it didn't pay attention to the part of the prompt where we explicitly say to try two methods and reflect on the answers.
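
To make the limitation concrete, here is a small sketch of why the one-shot query overcounts, assuming (as its name suggests) that `hospitalizedCumulative` is a running total per state. The rows are invented toy data, the database is in-memory SQLite rather than Azure SQL, and the multi-step query is one plausible reading of the question, not necessarily the method used in the prior Notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE covidtracking (date TEXT, state TEXT, hospitalizedCumulative INTEGER)"
)
# Toy running totals for two made-up states
conn.executemany(
    "INSERT INTO covidtracking VALUES (?, ?, ?)",
    [
        ("2020-06-30", "AA", 100), ("2020-07-01", "AA", 110), ("2020-07-31", "AA", 150),
        ("2020-06-30", "BB", 40), ("2020-07-01", "BB", 45), ("2020-07-31", "BB", 60),
    ],
)

# One-shot approach: adds up running totals, so earlier patients are counted repeatedly
naive_total = conn.execute(
    "SELECT SUM(hospitalizedCumulative) FROM covidtracking WHERE date LIKE '2020-07%'"
).fetchone()[0]

# Multi-step approach: per state, latest July total minus latest pre-July total, then sum
monthly_total = conn.execute(
    """
    SELECT SUM(jul.latest - jun.latest)
    FROM (SELECT state, MAX(hospitalizedCumulative) AS latest
          FROM covidtracking WHERE date LIKE '2020-07%' GROUP BY state) AS jul
    JOIN (SELECT state, MAX(hospitalizedCumulative) AS latest
          FROM covidtracking WHERE date < '2020-07-01' GROUP BY state) AS jun
      ON jul.state = jun.state
    """
).fetchone()[0]

print(naive_total, monthly_total)  # 365 70
```

On this toy data the one-shot sum reports 365, while only 70 new hospitalizations occurred in July. Getting from the first query to the second requires noticing that the column is cumulative, which is the kind of intermediate reasoning an agent is meant to perform.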

As homework, try to use an agent instead of a chain and arrive at the same result as Notebook 5.

# Summary

In this notebook, we achieved our goal of asking a question in natural language about a dataset located in a SQL Database. We did this purely with prompt engineering (LangChain does it for us) and the cognitive power of GPT-4.

This process shows why it is NOT necessary to move the data from its original source, as long as the source has an API and a common language we can use to interface with it. GPT-4 has been trained on a large amount of public code, so it can understand most of the programming and database query languages in common use. 

# NEXT

The next Notebook will guide you through how we tie everything together: how to use the features of the past notebooks to create a brain agent that can respond to any request accordingly.