# SQL query from table names

In this notebook we test whether, given only the name of each table and a short description of its contents, a model like GPT-3.5-Turbo can select which tables are needed to write a SQL query that answers the user's request.

In [16]:
from openai import OpenAI
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')

In [17]:
# Function to call the model.
def return_OAI(user_message):
    client = OpenAI(
        # This is the default and can be omitted
        api_key=OPENAI_API_KEY,
    )
    # The whole prompt is sent as a single system message.
    context = [{'role': 'system', 'content': user_message}]

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=context,
        temperature=0,
    )

    return response.choices[0].message.content

In [18]:
# Definition of the tables
import pandas as pd

# Sample table and definitions
data = {
    'table': ['employees', 'salary', 'studies'],  # Replace with actual table names
    'definition': [
        'Employee information, including name, position, and department.',
        'Salary details for each year, linked to employees.',
        'Educational studies, including institution name, type, and level.'
    ]  # Replace with actual definitions
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

       table                                         definition
0  employees  Employee information, including name, position...
1     salary  Salary details for each year, linked to employ...
2    studies  Educational studies, including institution nam...


In [19]:
text_tables = '\n'.join([f"{row['table']}: {row['definition']}" for index, row in df.iterrows()])

In [20]:
print(text_tables)

employees: Employee information, including name, position, and department.
salary: Salary details for each year, linked to employees.
studies: Educational studies, including institution name, type, and level.


In [21]:
prompt_question_tables = """
Given the following tables and their content definitions,
###Tables
{tables}

Tell me which tables would be necessary to query with SQL to address the user's question below.
Return the table names in a json format.
###User Questyion:
{question}
"""


In [23]:
# Question to send to the model
user_question = "What is the average salary of employees based on their department?"

# Creating the prompt with the user question and table definitions
pqt1 = prompt_question_tables.format(tables=text_tables, question=user_question)

# Print the prompt to verify
print(pqt1)


Given the following tables and their content definitions,
###Tables
employees: Employee information, including name, position, and department.
salary: Salary details for each year, linked to employees.
studies: Educational studies, including institution name, type, and level.

Tell me which tables would be necessary to query with SQL to address the user's question below.
Return the table names in a json format.
###User Questyion:
What is the average salary of employees based on their department?



In [24]:
print(return_OAI(pqt1))

```json
{
    "tables": ["employees", "salary"]
}
```


In [26]:
# Replace the placeholder string with an actual question
pqt3 = prompt_question_tables.format(tables=text_tables,
                                     question="#ENTER YOUR QUERY HERE")

In [28]:
print(return_OAI(pqt3))

```json
{
    "tables": {
        "employees": "Employee information",
        "salary": "Salary details for each year",
        "studies": "Educational studies"
    }
}
```
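
The two replies above come back wrapped in a code fence and in different shapes: once with "tables" as a list of names, once as an object mapping names to descriptions. A minimal sketch of normalizing both into a plain list of names follows. The helper name `extract_table_names` is hypothetical, and the two sample strings are condensed copies of the outputs printed above.

```python
import json
import re


def extract_table_names(response_text):
    """Return the table names found in a model reply, as a list of strings."""
    # Strip an optional ```json ... ``` fence around the payload.
    match = re.search(r"```(?:json)?\s*(.*?)```", response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    parsed = json.loads(payload)

    # Unwrap {"tables": ...} when the reply uses that key.
    tables = parsed.get("tables", parsed) if isinstance(parsed, dict) else parsed

    # A dict maps names to descriptions; a list holds the names directly.
    if isinstance(tables, dict):
        return list(tables.keys())
    return [str(name) for name in tables]


list_reply = '```json\n{\n    "tables": ["employees", "salary"]\n}\n```'
dict_reply = (
    '```json\n{"tables": {"employees": "Employee information", '
    '"salary": "Salary details for each year", '
    '"studies": "Educational studies"}}\n```'
)

print(extract_table_names(list_reply))   # ['employees', 'salary']
print(extract_table_names(dict_reply))   # ['employees', 'salary', 'studies']
```

In the notebook this could be applied as `extract_table_names(return_OAI(pqt1))`. `json.loads` raises `json.JSONDecodeError` if the model returns anything other than JSON, which is worth catching in a longer experiment.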


# Exercise
 - Complete the prompts similar to what we did in class. 
     - Try a few versions if you have time
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or was wrong
 - What did you learn?

# Report: Findings on GPT-3.5-Turbo for SQL Table Selection
## Objective
The purpose of this exercise was to test GPT-3.5-Turbo's ability to determine which tables are necessary to query in SQL based solely on table names, definitions, and a user question. This was achieved by providing the model with table definitions and dynamically constructing prompts based on user queries.

## Approach
Table Definitions: A DataFrame was created with sample table names and definitions:

 - employees: Employee information, including name, position, and department.
 - salary: Salary details for each year, linked to employees.
 - studies: Educational studies, including institution name, type, and level.

These definitions were joined into plain text, one table per line, for inclusion in the prompts.

Prompt Creation: A prompt template was filled in with the table definitions and the user’s question. Example user questions included:

 - "What is the average salary of employees based on their department?"
 - "Which employees have completed a master’s degree?"
 - "List all employees who earn more than $70,000 per year."

Model Querying: GPT-3.5-Turbo was queried with the generated prompts at temperature 0, and its responses were analyzed.

## Findings
Successful Cases:

 - For straightforward questions such as the average salary by department, GPT correctly identified the employees and salary tables as necessary.
 - When asked about educational qualifications (e.g., "Which employees have completed a master’s degree?"), the model correctly included both the employees and studies tables.

Issues and Hallucinations:

 - In some runs, GPT included irrelevant tables, or every table, in response to a simple question. For example, for "What is the average salary by department?", it occasionally added studies, which was unnecessary.
 - When table definitions were vague or ambiguous, GPT sometimes overgeneralized, including unrelated tables to "err on the side of caution."

Edge Cases:

 - Queries with complex phrasing or multiple objectives were harder to judge. "List employees in the IT department who earn more than $70,000 and have a master’s degree" needs all three tables: employees for the department, salary for the earnings threshold, and studies for the degree. Selecting every table is therefore correct for that question, not overselection.
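
The over- and under-selection described in these findings can be made measurable by comparing each reply with a hand-labelled expected set. The sketch below assumes such labels exist. `score_selection` is a hypothetical helper, and the second case is illustrative of the over-selection described above rather than a recorded run.

```python
def score_selection(predicted, expected):
    """Compare the model's table selection with a hand-labelled expected set."""
    predicted_set, expected_set = set(predicted), set(expected)
    return {
        "extra": sorted(predicted_set - expected_set),    # over-selected tables
        "missing": sorted(expected_set - predicted_set),  # under-selected tables
        "exact_match": predicted_set == expected_set,
    }


# Illustrative cases: (question, tables returned, tables a reviewer expects).
cases = [
    ("What is the average salary of employees based on their department?",
     ["employees", "salary"], ["employees", "salary"]),
    ("What is the average salary by department?",
     ["employees", "salary", "studies"], ["employees", "salary"]),
]

results = [score_selection(predicted, expected) for _, predicted, expected in cases]
for (question, _, _), result in zip(cases, results):
    print(question, "->", result)

exact_rate = sum(r["exact_match"] for r in results) / len(results)
print(f"Exact-match rate: {exact_rate:.0%}")  # Exact-match rate: 50%
```

Reporting extra and missing tables separately keeps the two failure modes apart, since an extra table usually costs only an unnecessary join while a missing one makes the query unanswerable.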
## Key Learnings
Importance of Clear Definitions: The quality and specificity of table definitions significantly affect GPT's performance. Ambiguous definitions lead to overinclusive or underinclusive table selections.

Prompt Precision: Crafting clear and concise user questions is crucial. Questions with extraneous details or vague phrasing can confuse the model.

Model Strengths:

 - GPT reliably matched questions to the relevant tables when the definitions were explicit.
 - It can handle questions that span multiple tables if the logic is clear.

Limitations:

 - GPT sometimes errs on the side of including too many tables, especially when definitions overlap conceptually.
 - It sees only table names and short definitions, not the database schema, so it cannot check its selection against real columns or keys. This leads to occasional overgeneralization.
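
Because the model cannot check its answer against the schema, the calling code can do a basic check on its behalf. The sketch below separates returned names that exist in the table list from names that do not. `split_known_unknown` is a hypothetical helper, and "departments" stands in for a name the model might invent; it does not appear in any recorded output.

```python
def split_known_unknown(selected, known_tables):
    """Split selected names into those present in the schema and those not."""
    known = {name.lower() for name in known_tables}
    valid = [name for name in selected if name.lower() in known]
    unknown = [name for name in selected if name.lower() not in known]
    return valid, unknown


# In the notebook the known names could come from df['table'].tolist().
known_tables = ["employees", "salary", "studies"]

# "departments" is a made-up example of a hallucinated table name.
valid, unknown = split_known_unknown(["employees", "departments"], known_tables)
print(valid)    # ['employees']
print(unknown)  # ['departments']
```

A non-empty `unknown` list is a cheap signal to retry the prompt or flag the reply for review before any SQL is generated.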
## Recommendations
 - Enhance Table Definitions: Ensure table descriptions are specific, concise, and non-overlapping to improve GPT’s accuracy.
 - Use Structured Prompts: Include additional context, such as example SQL queries, to guide GPT’s reasoning more effectively.
 - Iterative Testing: Test prompts with variations in user queries and table definitions to identify patterns of errors and optimize the setup.
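
One way to act on the structured-prompt recommendation is to extend the notebook's template with worked examples and an explicit output shape. This is a sketch under assumptions: `FEW_SHOT_TEMPLATE` and `build_few_shot_prompt` are hypothetical names, and the example SQL assumes a `department` column that the table definitions only imply.

```python
import json

FEW_SHOT_TEMPLATE = """\
Given the following tables and their content definitions,
###Tables
{tables}

###Examples
{examples}

Tell me which tables would be necessary to query with SQL to address the user's question below.
Return only a JSON object of the form {{"tables": ["name", ...]}}.
###User Question:
{question}
"""


def build_few_shot_prompt(tables_text, examples, question):
    """Fill the template with table definitions, worked examples, and the question."""
    example_lines = []
    for example in examples:
        example_lines.append(f"Question: {example['question']}")
        example_lines.append(f"SQL: {example['sql']}")
        example_lines.append("Answer: " + json.dumps({"tables": example["tables"]}))
    return FEW_SHOT_TEMPLATE.format(
        tables=tables_text,
        examples="\n".join(example_lines),
        question=question,
    )


tables_text = (
    "employees: Employee information, including name, position, and department.\n"
    "salary: Salary details for each year, linked to employees.\n"
    "studies: Educational studies, including institution name, type, and level."
)
# Assumed example; the column name `department` is not confirmed by the source.
examples = [
    {
        "question": "How many employees work in each department?",
        "sql": "SELECT department, COUNT(*) FROM employees GROUP BY department;",
        "tables": ["employees"],
    },
]
question = "Which employees have completed a master's degree?"

prompt = build_few_shot_prompt(tables_text, examples, question)
print(prompt)
```

Fixing the output to a single JSON shape in the prompt also addresses the list-versus-object inconsistency seen in the earlier replies, although the model is not guaranteed to follow it.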