# SQL query from table names

In this notebook we test whether, given only the name of each table and a short definition of its contents, a model like GPT-3.5-Turbo can select which tables are needed to build a SQL query that answers the user's request.

In [1]:
from openai import OpenAI
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')

In [2]:
# Function to call the model.
def return_OAI(user_message):
    # Passing api_key explicitly is optional: the client reads OPENAI_API_KEY by default.
    client = OpenAI(api_key=OPENAI_API_KEY)

    # The whole prompt is sent as a single system message.
    context = [{'role': 'system', 'content': user_message}]

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=context,
        temperature=0,  # keep the output as repeatable as possible
    )

    return response.choices[0].message.content

In [3]:
#Definition of the tables.
import pandas as pd

# Table and definitions sample
data = {'table': ['employees', 'departments', 'salaries', 'titles', 'dept_emp', 'dept_manager'],
        'definition': ['Contains information about employees', 'Contains information about departments', 'Contains information about salaries', 'Contains information about titles', 'Contains information about the employees department', 'Contains information about the department managers']}
df = pd.DataFrame(data)
print(df)

          table                                         definition
0     employees               Contains information about employees
1   departments             Contains information about departments
2      salaries                Contains information about salaries
3        titles                  Contains information about titles
4      dept_emp  Contains information about the employees depar...
5  dept_manager  Contains information about the department mana...


In [4]:
text_tables = '\n'.join([f"{row['table']}: {row['definition']}" for index, row in df.iterrows()])

In [5]:
print(text_tables)

employees: Contains information about employees
departments: Contains information about departments
salaries: Contains information about salaries
titles: Contains information about titles
dept_emp: Contains information about the employees department
dept_manager: Contains information about the department managers


In [6]:
prompt_question_tables = """
Given the following tables and their content definitions,
###Tables
{tables}

Tell me which tables would be necessary to query with SQL to address the user's question below.
Return the table names in a json format.
###User Question:
{question}
"""


In [9]:
# Create the prompt from the user question and the table definitions.
pqt1 = prompt_question_tables.format(tables=text_tables, question='What is the department with the highest number of employees?')

In [10]:
print(return_OAI(pqt1))

```json
{
    "tables": ["employees", "departments", "dept_emp"]
}
```


In [11]:
pqt3 = prompt_question_tables.format(tables=text_tables,
                                     question="What is the department with the highest paid employees?")

In [12]:
print(return_OAI(pqt3))

{
    "tables": ["employees", "departments", "salaries", "dept_emp"]
}
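The two replies above differ in shape: the first is wrapped in a Markdown `json` code fence and the second is bare JSON. A minimal sketch of a parser that accepts both forms follows. The helper name `parse_tables_reply` is an assumption introduced here for illustration, and it assumes the reply always carries a top-level `"tables"` key, as both replies above do.

```python
import json
import re

# Markdown code-fence marker, built indirectly so it does not clash with this block's own fence.
FENCE = "`" * 3


def parse_tables_reply(reply):
    """Return the list under "tables" from a model reply, fenced or not."""
    # Look for an optional fenced payload, with or without the "json" language tag.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    # json.loads raises json.JSONDecodeError if the model did not return valid JSON.
    data = json.loads(payload)
    return list(data["tables"])
```

Calling `parse_tables_reply(return_OAI(pqt1))` would then give a plain Python list that later cells can work with, whichever form the model chose.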


# Exercise
 - Complete the prompts similar to what we did in class. 
     - Try a few versions if you have time
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well, i.e., where GPT either hallucinated or was wrong?
 - What did you learn?

# Report on Using GPT-3.5-Turbo for SQL Table Selection

## Introduction
In this notebook, we explored the capability of GPT-3.5-Turbo to determine the necessary SQL tables to query based on user questions. We provided the model with table names and their definitions and asked it to identify the relevant tables for specific queries.

## Findings
We tested the model with two different questions:
1. "What is the department with the highest number of employees?"
2. "What is the department with the highest paid employees?"

### Results
For the first question, the model correctly identified the necessary tables:
- `employees`
- `dept_emp`
- `departments`

For the second question, the model returned the following tables:
- `employees`
- `departments`
- `salaries`
- `dept_emp`

### Variations and Issues
While the model performed well in these instances, there were some variations where the model either hallucinated or provided incorrect table selections. For example:
- When the question was ambiguous or not well-defined, the model sometimes included irrelevant tables.
- In some cases, the model missed essential tables that were required to answer the query accurately.
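One of the issues above, a table name that does not exist in the schema, can be caught mechanically before any SQL is written. The sketch below is an illustration under assumptions: the helper `split_known_unknown` is a name introduced here, and `payroll` is a made-up table name, not an actual model output from this notebook.

```python
# Table names taken from the definitions used earlier in this notebook.
KNOWN_TABLES = ["employees", "departments", "salaries",
                "titles", "dept_emp", "dept_manager"]


def split_known_unknown(selected, known_tables):
    """Split the model's selection into names that exist and names that do not."""
    known = set(known_tables)
    valid = [name for name in selected if name in known]
    unknown = [name for name in selected if name not in known]
    return valid, unknown


# Hypothetical reply containing one invented table name.
valid, unknown = split_known_unknown(["employees", "payroll", "dept_emp"], KNOWN_TABLES)
print("valid:", valid)
print("unknown:", unknown)
```

A non-empty `unknown` list is a cheap signal that the reply should be rejected or retried. It does not detect a real but irrelevant table, which still needs a reference answer or a human check.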

## Lessons Learned
1. **Context Matters**: The quality of the input context significantly affects the model's performance. Clear and concise table definitions and user questions lead to better results.
2. **Model Limitations**: GPT-3.5-Turbo, while powerful, is not infallible. It can sometimes hallucinate or provide incorrect answers, especially with ambiguous queries.
3. **Human Oversight**: It is essential to have human oversight to verify the model's output, especially in critical applications like database querying.
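Human review can be made more systematic by writing down the expected tables for a few questions and comparing each model reply against them. The following is a minimal sketch: `compare_selection` is a hypothetical helper, the expected set is a reviewer's label (here it reuses the tables reported for the second question), and the predicted list is invented for illustration rather than taken from a real run.

```python
def compare_selection(predicted, expected):
    """Report which expected tables are missing and which selected tables are extra."""
    predicted_set, expected_set = set(predicted), set(expected)
    return {
        "missing": sorted(expected_set - predicted_set),
        "extra": sorted(predicted_set - expected_set),
    }


# Reviewer-provided label (assumption) and an invented model reply.
expected = ["employees", "departments", "salaries", "dept_emp"]
predicted = ["employees", "salaries", "departments", "titles"]

report = compare_selection(predicted, expected)
print(report)
```

An empty `missing` and an empty `extra` list mean the reply matches the label exactly. Anything else points the reviewer at the specific tables to inspect.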

## Conclusion
GPT-3.5-Turbo shows promise in assisting with SQL table selection based on user queries. However, it is crucial to provide clear input and verify the model's output to ensure accuracy. Further testing and refinement can help improve its reliability in real-world applications.