# SQL query from table names - Continued

In [1]:
from openai import OpenAI
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')

## The old Prompt

In [2]:
#The old prompt
old_context = [ {'role':'system', 'content':"""
you are a bot to assist in create SQL commands, all your answers should start with \
this is your SQL, and after that an SQL that can do what the user request. \
Your Database is composed by a SQL database with some tables. \
Try to maintain the SQL order simple.
Put the SQL command in white letters with a black background, and just after \
a simple and concise text explaining how it works.
If the user ask for something that can not be solved with an SQL Order \
just answer something nice and simple, maximum 10 words, asking him for something that \
can be solved with SQL.
"""} ]

old_context.append( {'role':'system', 'content':"""
first table:
{
  "tableName": "employees",
  "fields": [
    {
      "name": "ID_usr",
      "type": "int"
    },
    {
      "name": "name",
      "type": "varchar"
    }
  ]
}
"""
})

old_context.append( {'role':'system', 'content':"""
second table:
{
  "tableName": "salary",
  "fields": [
    {
      "name": "ID_usr",
      "type": "int"
    },
    {
      "name": "year",
      "type": "date"
    },
    {
      "name": "salary",
      "type": "float"
    }
  ]
}
"""
})

old_context.append( {'role':'system', 'content':"""
third table:
{
  "tableName": "studies",
  "fields": [
    {
      "name": "ID",
      "type": "int"
    },
    {
      "name": "ID_usr",
      "type": "int"
    },
    {
      "name": "educational_level",
      "type": "int"
    },
    {
      "name": "Institution",
      "type": "varchar"
    },
    {
      "name": "Years",
      "type": "date"
    },
    {
      "name": "Speciality",
      "type": "varchar"
    }
  ]
}
"""
})

## New Prompt.
We are going to improve it following the recommendations of a paper from The Ohio State University: [How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings](https://arxiv.org/abs/2305.11853). I recommend reading it.

For each table, we will define the structure using the same syntax as in a SQL create table command, and add the sample rows of the content.

Finally, at the end of the prompt, we'll include some example questions together with the SQL that the model should generate for them. This technique is called few-shot prompting: the examples in the prompt help the model generate the correct SQL.
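The schema part of such a prompt does not have to be typed by hand. The following is a minimal sketch, assuming the data lives in a SQLite database; `schema_prompt` and `sql_literal` are hypothetical helpers that are not part of this notebook. It reads each `CREATE TABLE` statement from `sqlite_master` and appends a few sample rows as an `INSERT` statement.

```python
import sqlite3

def sql_literal(value):
    """Format a Python value as a SQL literal (numbers, text, and NULL only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def schema_prompt(conn, sample_rows=3):
    """Render every user table as CREATE TABLE plus a few sample rows."""
    parts = []
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for name, create_sql in tables:
        parts.append(create_sql.strip() + ";")
        cursor = conn.execute(f'SELECT * FROM "{name}" LIMIT ?', (sample_rows,))
        columns = ", ".join(col[0] for col in cursor.description)
        rows = cursor.fetchall()
        if rows:
            values = ",\n".join(
                "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
            )
            parts.append(f"INSERT INTO {name} ({columns}) VALUES\n{values};")
    return "\n\n".join(parts)

# Tiny demo database (assumption: two rows copied from the footballers table).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE footballers (ID_usr INT, name VARCHAR(100));
INSERT INTO footballers VALUES (1, 'Lionel Messi'), (2, 'Cristiano Ronaldo');
""")
prompt = schema_prompt(conn)
print(prompt)
```

The returned string could be used as the `content` of the system message, so the prompt stays in sync with the actual database.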


In [11]:
context = [
  {
    "role": "system",
    "content": """
    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_usr, year, goals_scored) VALUES
    (1, '2022-01-01', 30),
    (2, '2022-01-01', 35),
    (3, '2022-01-01', 28),
    (4, '2022-01-01', 33),
    (5, '2022-01-01', 15),
    (6, '2022-01-01', 22),
    (7, '2022-01-01', 27),
    (8, '2022-01-01', 19),
    (9, '2022-01-01', 10),
    (10, '2022-01-01', 34);

    CREATE TABLE performance (
      ID INT,
      ID_usr INT,
      average_goals_per_match FLOAT,
      team VARCHAR(100),
      year DATE,
      field_position VARCHAR(100)
    );

    INSERT INTO performance (ID, ID_usr, average_goals_per_match, team, year, field_position) VALUES
    (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
    (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward'),
    (3, 3, 0.68, 'Paris Saint-Germain', '2022-05-10', 'Forward'),
    (4, 4, 0.83, 'Paris Saint-Germain', '2022-05-20', 'Forward'),
    (5, 5, 0.40, 'Manchester City', '2022-05-18', 'Midfielder'),
    (6, 6, 0.65, 'Liverpool', '2022-05-22', 'Forward'),
    (7, 7, 0.72, 'Tottenham Hotspur', '2022-05-17', 'Forward'),
    (8, 8, 0.55, 'Bayern Munich', '2022-05-13', 'Forward'),
    (9, 9, 0.25, 'Real Madrid', '2022-05-21', 'Midfielder'),
    (10, 10, 0.80, 'Barcelona', '2022-05-19', 'Forward');
    """
  }
]


In [12]:
#FEW SHOT SAMPLES
context.append( {'role':'system', 'content':"""
-- Get the names of all footballers.
SELECT name FROM footballers;

-- Get the total goals scored by all footballers for the year 2022.
SELECT SUM(goals_scored) FROM goals WHERE year = '2022-01-01';

-- Get the team and field position for a specific footballer.
SELECT team, field_position FROM performance WHERE ID_usr = 1;
"""})

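Because the model imitates the few-shot examples, it may be worth confirming that they actually run against the schema before they go into the prompt. The following is a sketch under assumptions: it uses SQLite and a trimmed copy of the tables above (two rows each), not the notebook's own setup.

```python
import sqlite3

# Trimmed copy of the schema above: same columns, two rows per table.
SCHEMA = """
CREATE TABLE footballers (ID_usr INT, name VARCHAR(100));
INSERT INTO footballers VALUES (1, 'Lionel Messi'), (2, 'Cristiano Ronaldo');
CREATE TABLE goals (ID_usr INT, year DATE, goals_scored INT);
INSERT INTO goals VALUES (1, '2022-01-01', 30), (2, '2022-01-01', 35);
CREATE TABLE performance (ID INT, ID_usr INT, average_goals_per_match FLOAT,
                          team VARCHAR(100), year DATE, field_position VARCHAR(100));
INSERT INTO performance VALUES
  (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
  (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward');
"""

# The three few-shot queries from the prompt.
FEW_SHOT = [
    "SELECT name FROM footballers;",
    "SELECT SUM(goals_scored) FROM goals WHERE year = '2022-01-01';",
    "SELECT team, field_position FROM performance WHERE ID_usr = 1;",
]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A typo in a table or column name would raise sqlite3.OperationalError here.
results = [conn.execute(query).fetchall() for query in FEW_SHOT]
for query, rows in zip(FEW_SHOT, results):
    print(query, "->", rows)
```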
In [13]:
#Function to call the model.
def return_CCRMSQL(user_message, context):
    # Passing api_key explicitly is optional: the client reads
    # the OPENAI_API_KEY environment variable by default.
    client = OpenAI(api_key=OPENAI_API_KEY)

    # Copy the context so the caller's message list is not modified.
    newcontext = context.copy()
    newcontext.append({'role':'user', 'content':"question: " + user_message})

    # temperature=0 keeps the generated SQL as repeatable as possible.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=newcontext,
        temperature=0,
    )

    return response.choices[0].message.content
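The function returns the full reply, which usually mixes explanation text with a fenced SQL block. To run or check the query, the SQL has to be pulled out first. The following is a minimal sketch; `extract_sql` is a hypothetical helper, and it assumes the model wraps the query in a triple-backtick fence, as it does in the outputs below.

```python
import re
from typing import Optional

# Matches the first fenced block, with or without the "sql" language tag.
SQL_FENCE = re.compile(r"`{3}(?:sql)?\s*\n(.*?)`{3}", re.DOTALL | re.IGNORECASE)

def extract_sql(response_text: str) -> Optional[str]:
    """Return the SQL inside the first fenced block, or None if there is none."""
    match = SQL_FENCE.search(response_text)
    return match.group(1).strip() if match else None

# Build a sample reply; the fence is assembled from single backticks so that
# this example can itself be shown inside a fenced block.
fence = "`" * 3
reply = (
    "This is your SQL:\n"
    f"{fence}sql\nSELECT name FROM footballers;\n{fence}\n"
    "This query lists every footballer."
)
print(extract_sql(reply))
```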

## NL2SQL Samples
We're going to review some examples generated with the old prompt and others with the new prompt.

In [14]:
#new
context_user = context.copy()
print(return_CCRMSQL("""What is the best player based on goals?""", context_user))

To determine the best player based on goals scored, we can look at the total number of goals scored by each player in the year 2022. Here is the query to find the player with the highest number of goals:

```sql
SELECT f.name AS player_name, g.goals_scored AS total_goals
FROM footballers f
JOIN goals g ON f.ID_usr = g.ID_usr
WHERE g.year = '2022-01-01'
ORDER BY total_goals DESC
LIMIT 1;
```

This query will return the player's name and the total number of goals they scored in the year 2022, with the player who scored the most goals appearing at the top.


In [15]:
#old
old_context_user = old_context.copy()
print(return_CCRMSQL("Retrieve average salary for all employees", old_context_user))

This is your SQL:
```sql
SELECT AVG(salary) AS average_salary FROM salary;
```

This SQL query retrieves the average salary for all employees from the "salary" table by calculating the average of the "salary" column.


In [17]:
#new
print(return_CCRMSQL("Who was the best player based in average goals scored in 2022?", context_user))

To determine the best player based on average goals scored in 2022, we can calculate the average goals per match for each player and then compare those averages. Here's the query to achieve this:

```sql
SELECT f.name AS player_name, p.average_goals_per_match
FROM footballers f
JOIN performance p ON f.ID_usr = p.ID_usr
WHERE p.year = '2022-05-12' -- You can change the date to the specific date you want to consider
ORDER BY p.average_goals_per_match DESC
LIMIT 1;
```

This query will retrieve the player's name and their average goals per match for the specified year (2022 in this case), and then order the results in descending order based on the average goals per match. The `LIMIT 1` clause ensures that only the player with the highest average goals per match is returned.
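Note that the generated `WHERE` clause matches a single day, not the whole year. The following sketch, assuming SQLite and a two-row copy of the tables, compares that filter with a year-based one built on SQLite's `strftime` function. The year-based condition is an illustrative alternative, not something the model produced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE footballers (ID_usr INT, name VARCHAR(100));
INSERT INTO footballers VALUES (1, 'Lionel Messi'), (2, 'Cristiano Ronaldo');
CREATE TABLE performance (ID INT, ID_usr INT, average_goals_per_match FLOAT,
                          team VARCHAR(100), year DATE, field_position VARCHAR(100));
INSERT INTO performance VALUES
  (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
  (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward');
""")

# Same query shape as the model's answer; only the WHERE condition varies.
QUERY = """
SELECT f.name, p.average_goals_per_match
FROM footballers f
JOIN performance p ON f.ID_usr = p.ID_usr
WHERE {condition}
ORDER BY p.average_goals_per_match DESC
LIMIT 1;
"""

# Condition generated by the model: only rows dated exactly 2022-05-12.
exact_date = conn.execute(QUERY.format(condition="p.year = '2022-05-12'")).fetchall()

# Assumed alternative: every row whose date falls in 2022.
whole_year = conn.execute(
    QUERY.format(condition="strftime('%Y', p.year) = '2022'")
).fetchall()

print("exact date:", exact_date)
print("whole year:", whole_year)
```

On this sample the two conditions name different players, which is the kind of silent error worth recording in the report.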


In [18]:
#old
print(return_CCRMSQL("What is the highest salary per institution?", old_context_user))

This is your SQL:
```sql
SELECT Institution, MAX(salary) AS highest_salary
FROM employees
JOIN salary ON employees.ID_usr = salary.ID_usr
GROUP BY Institution;
```

This SQL query retrieves the highest salary per institution by joining the "employees" and "salary" tables on the ID_usr column. It then groups the results by institution and calculates the maximum salary for each institution.


# Exercise
 - Complete the prompts similar to what we did in class. 
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or was wrong.
     - What did you learn?

In [22]:
#old
print(return_CCRMSQL("What specialization has the highest education level?", old_context_user))

This is your SQL:
```sql
SELECT Speciality, MAX(educational_level) AS Highest_Education_Level
FROM studies
GROUP BY Speciality;
```

This SQL query selects the specialization and the maximum education level from the "studies" table, grouping the results by specialization. It will show the specialization with the highest education level.


In [24]:
#old
print(return_CCRMSQL("On what year did the highest salary employees finished their studies?", old_context_user))

This is your SQL:
```sql
SELECT MAX(s.salary) AS highest_salary, e.name, MAX(st.Years) AS year_finished_studies
FROM employees e
JOIN salary s ON e.ID_usr = s.ID_usr
JOIN studies st ON e.ID_usr = st.ID_usr
GROUP BY e.name
ORDER BY highest_salary DESC
LIMIT 1;
```

This SQL query retrieves the employee with the highest salary, their name, and the year they finished their studies. It joins the tables employees, salary, and studies based on the employee ID, then groups the results by employee name, orders them by salary in descending order, and limits the output to the top result.


In [23]:
#new
print(return_CCRMSQL("What position has the highest average goals?", context_user))

To determine the position with the highest average goals, we can calculate the average goals per match for each field position and then identify the position with the highest average. Here's the query to achieve this:

```sql
SELECT field_position, AVG(average_goals_per_match) AS avg_goals_per_match
FROM performance
GROUP BY field_position
ORDER BY avg_goals_per_match DESC
LIMIT 1;
```

This query will group the data by field position, calculate the average goals per match for each position, and then order the results in descending order of average goals per match. The `LIMIT 1` at the end will ensure that only the position with the highest average goals is returned.


In [25]:
#new
print(return_CCRMSQL("What team has the highest average goals per match?", context_user))

```sql
SELECT team, MAX(average_goals_per_match) AS highest_average_goals_per_match
FROM performance
GROUP BY team;
```


## Report

We transitioned from a generic “employees” and “studies” dataset to a context focused on footballers, tracking their goals and performance metrics. This shift was driven by the need to create more relatable and domain-specific SQL queries. The new prompt structure included few-shot examples, which have been shown to guide the model toward more accurate outputs by providing context and samples of expected results.

### Key Improvements

1. Clear Table Definitions: Using precise table structures for footballers (including fields such as goals, average goals per match, team, and field position) helped the model better understand the relationships within the data, leading to more relevant SQL query generation.
2. Few-Shot Learning: Including example queries in the prompt significantly improved the quality of the responses. This technique guided the model to formulate SQL queries that were both correct and efficient.
3. Error Reduction: Compared to the old prompt, where the model occasionally produced incorrect queries, the updated method resulted in more consistent accuracy. The use of SQL syntax examples directly in the prompt reduced ambiguities, leading to fewer misunderstandings by the model.

### Variations that Didn’t Work Well

During the process, a few variations were noted where the model struggled or produced incorrect results:

- Ambiguous Natural Language Requests: When the query was phrased in a less specific manner, the model occasionally generated SQL that did not fully address the question or misunderstood the intent. For example, vague questions like “What is the best player?” sometimes led to incomplete or irrelevant SQL statements.
- Complex Queries with Multiple Conditions: Queries requiring multiple JOIN operations or nested subqueries were sometimes mishandled. In these cases, the model either omitted key conditions or misinterpreted the relationships between tables.

### Hallucinations and Inaccuracies

- Incorrect Table References: In a few instances, the model attempted to reference table names or columns that were not part of the defined schema. This “hallucination” issue was significantly reduced with the improved prompt but still occurred in cases of ambiguous or loosely defined user input.
- Order of Operations in SQL: Occasionally, the model generated SQL code with a less optimal order of conditions, leading to inefficiencies. While this did not make the SQL syntactically incorrect, it did reduce its performance.
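References to missing tables or columns can be caught automatically before a query is run. The following is a sketch under assumptions: it uses SQLite, a hand-written `CREATE TABLE` version of the old employees schema, and a hypothetical helper `check_sql`. Prefixing a statement with `EXPLAIN` makes SQLite compile it without executing it. The first test query is the one the old prompt produced for "highest salary per institution": it selects `Institution` but never joins the `studies` table that holds that column.

```python
import sqlite3

# Assumed SQL rendering of the old JSON schema (no rows needed for this check).
OLD_SCHEMA = """
CREATE TABLE employees (ID_usr INT, name VARCHAR);
CREATE TABLE salary (ID_usr INT, year DATE, salary FLOAT);
CREATE TABLE studies (ID INT, ID_usr INT, educational_level INT,
                      Institution VARCHAR, Years DATE, Speciality VARCHAR);
"""

def check_sql(conn, sql):
    """Return None if SQLite can compile the statement, else the error message."""
    try:
        # EXPLAIN compiles the statement but does not run it.
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as exc:
        return str(exc)
    return None

conn = sqlite3.connect(":memory:")
conn.executescript(OLD_SCHEMA)

# Query generated by the old prompt: Institution is not in employees or salary.
missing_join = """
SELECT Institution, MAX(salary) AS highest_salary
FROM employees
JOIN salary ON employees.ID_usr = salary.ID_usr
GROUP BY Institution;
"""
valid = "SELECT AVG(salary) AS average_salary FROM salary;"

error = check_sql(conn, missing_join)
print("missing join:", error)
print("valid query:", check_sql(conn, valid))
```

A check like this only catches references the database cannot resolve. A query that compiles can still answer the wrong question.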

### Learnings

1. Importance of Context and Examples: Providing a detailed prompt with a clear context and few-shot examples greatly enhances the performance of models like GPT in text-to-SQL tasks. It not only guides the model but also sets a standard for how to approach similar problems.
2. Need for Domain-Specific Prompts: Tailoring the prompt to a specific domain, such as football in this case, helps the model generate more relevant SQL queries. This approach ensures that the generated queries align more closely with the domain’s typical data structures and use cases.
3. Iterative Refinement: Prompt engineering is an iterative process. Continuous refinement and testing with real-world scenarios are essential to improve the accuracy and efficiency of the generated SQL queries.
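Iterative refinement is easier with a repeatable score. One common option is execution accuracy: run the generated query and a hand-written reference query on the same data and compare the rows. The following is a minimal sketch, assuming SQLite and a two-row copy of the football tables. The generated queries are taken from the outputs above; the reference queries and the helper `same_result` are assumptions that encode one reading of each question.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE footballers (ID_usr INT, name VARCHAR(100));
INSERT INTO footballers VALUES (1, 'Lionel Messi'), (2, 'Cristiano Ronaldo');
CREATE TABLE goals (ID_usr INT, year DATE, goals_scored INT);
INSERT INTO goals VALUES (1, '2022-01-01', 30), (2, '2022-01-01', 35);
CREATE TABLE performance (ID INT, ID_usr INT, average_goals_per_match FLOAT,
                          team VARCHAR(100), year DATE, field_position VARCHAR(100));
INSERT INTO performance VALUES
  (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
  (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward');
""")

def same_result(conn, generated_sql, reference_sql):
    """True when both queries return the same rows, ignoring row order."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that does not run counts as a miss
    expected = conn.execute(reference_sql).fetchall()
    return Counter(got) == Counter(expected)

# (generated by the model, hand-written reference) pairs.
CASES = [
    (   # "What is the best player based on goals?"
        "SELECT f.name AS player_name, g.goals_scored AS total_goals "
        "FROM footballers f JOIN goals g ON f.ID_usr = g.ID_usr "
        "WHERE g.year = '2022-01-01' ORDER BY total_goals DESC LIMIT 1;",
        "SELECT f.name, g.goals_scored FROM goals g "
        "JOIN footballers f ON f.ID_usr = g.ID_usr "
        "ORDER BY g.goals_scored DESC LIMIT 1;",
    ),
    (   # "What team has the highest average goals per match?"
        "SELECT team, MAX(average_goals_per_match) AS highest_average_goals_per_match "
        "FROM performance GROUP BY team;",
        "SELECT team, average_goals_per_match FROM performance "
        "ORDER BY average_goals_per_match DESC LIMIT 1;",
    ),
]

matches = [same_result(conn, generated, reference) for generated, reference in CASES]
accuracy = sum(matches) / len(matches)
print(matches, accuracy)
```

The second generated query returns one row per team instead of only the top team, so under this reference it counts as a miss. Re-running the same cases after each prompt change shows whether a variation actually helped.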

### Conclusion

The shift to a domain-specific dataset, combined with few-shot learning, has resulted in more precise and contextually relevant SQL queries. The updated prompt method provides a robust foundation for future text-to-SQL tasks, although there is still room for improvement in handling more complex queries and ambiguous user inputs.