# Natural language to SQL

**Run in [Google Colab](https://colab.research.google.com/) For GPU.**

This model have  Mistral as a base and it has been fine-tuned to excel in SQL code generation.

In [1]:
from google.colab import userdata
userdata.get('HF_TOKEN')

'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'

In [2]:
#Install the lastest versions of peft & transformers library recommended
#if you want to work with the most recent models
!pip install -q git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install bitsandbytes

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting git+https://github.com/huggingface/accelerate.git
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-kv3d3thi
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-kv3d3thi
  Resolved https://github.com/huggingface/accelerate.git to commit fd9880da9123e595806a44e00536280d009fed99
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-sc33vxpe
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import accelerate

In [4]:
model_name = "defog/sqlcoder-7b"

We need to create the Quantization configuration to load the Model.

It is a large model and I want it to fit in a 16GB GPU, I'm going to use a 4 bits quantization.

If you want to learn more about quantization, refer to this article: [QLoRA: Training a Large Language Model on a 16GB GPU.](https://medium.com/towards-artificial-intelligence/qlora-training-a-large-language-model-on-a-16gb-gpu-00ea965667c1)

You can try to use this model in a 8 bit quantizations and check in you see any improvements in the results.

In [5]:
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16
)


To load the model I pass to the AutoModelForCasualLM teh quantization configurations, and HuggingFace take care of all the hard work.

In [6]:
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map='auto',
                    use_cache = True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
eos_token_id = tokenizer.convert_tokens_to_ids(["```"])[0]

This function wraps the call to *model.generate*

In [8]:
#this function returns the outputs from the model received, and inputs.
def get_outputs(model, inputs, max_new_tokens=400):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=5
    )
    return outputs

# Prompt without Shots.
In this first PROMPT we are going to give Instructions to the model and pass the structure of the Database.

The instructions are significantly different from those we are passing to GPT-3.5-Turbo. This model is really well fine-tuned, but it is smaller than GPT-3.5.

We need to be more clear with the instructions, as it does not have the same capacity to understand our orders as GPT-3.5.

In [9]:
sp_nl2sql = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_usr, year, goals_scored) VALUES
    (1, '2022-01-01', 30),
    (2, '2022-01-01', 35),
    (3, '2022-01-01', 28),
    (4, '2022-01-01', 33),
    (5, '2022-01-01', 15),
    (6, '2022-01-01', 22),
    (7, '2022-01-01', 27),
    (8, '2022-01-01', 19),
    (9, '2022-01-01', 10),
    (10, '2022-01-01', 34);

    CREATE TABLE performance (
      ID INT,
      ID_usr INT,
      average_goals_per_match FLOAT,
      team VARCHAR(100),
      year DATE,
      field_position VARCHAR(100)
    );

    INSERT INTO performance (ID, ID_usr, average_goals_per_match, team, year, field_position) VALUES
    (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
    (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward'),
    (3, 3, 0.68, 'Paris Saint-Germain', '2022-05-10', 'Forward'),
    (4, 4, 0.83, 'Paris Saint-Germain', '2022-05-20', 'Forward'),
    (5, 5, 0.40, 'Manchester City', '2022-05-18', 'Midfielder'),
    (6, 6, 0.65, 'Liverpool', '2022-05-22', 'Forward'),
    (7, 7, 0.72, 'Tottenham Hotspur', '2022-05-17', 'Forward'),
    (8, 8, 0.55, 'Bayern Munich', '2022-05-13', 'Forward'),
    (9, 9, 0.25, 'Real Madrid', '2022-05-21', 'Midfielder'),
    (10, 10, 0.80, 'Barcelona', '2022-05-19', 'Forward');


    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """

In [10]:
sp_nl2sql = sp_nl2sql.format(question="Find the Footballer with the Most Goals in 2022")
print(sp_nl2sql)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_usr, year, goals_scored) VALUES
    (1, '2022-01-01', 30),
    (2, '2022-01-01', 35),
    (3, '20

In [11]:
input_sentences = tokenizer(sp_nl2sql, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)

In [12]:
#Empty the cache in orde to do more calls without problems.
torch.cuda.empty_cache()

In [13]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT c.name, COUNT(g.ID_usr) AS goals_scored FROM players c JOIN goals g ON c.ID_usr = g.ID_usr WHERE EXTRACT(YEAR FROM g.year) = 2022 GROUP BY c.name ORDER BY goals_scored DESC NULLS LAST LIMIT 1;


The SQL Order is correct.

#Prompt with shots OpenAI Style.
In this second prompt we are going to add some Shots with samples to see if our SQL style affects the model.

In [14]:
sp_nl2sql2 = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to clearn more about teh Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_usr, year, goals_scored) VALUES
    (1, '2022-01-01', 30),
    (2, '2022-01-01', 35),
    (3, '2022-01-01', 28),
    (4, '2022-01-01', 33),
    (5, '2022-01-01', 15),
    (6, '2022-01-01', 22),
    (7, '2022-01-01', 27),
    (8, '2022-01-01', 19),
    (9, '2022-01-01', 10),
    (10, '2022-01-01', 34);

    CREATE TABLE performance (
      ID INT,
      ID_usr INT,
      average_goals_per_match FLOAT,
      team VARCHAR(100),
      year DATE,
      field_position VARCHAR(100)
    );

    INSERT INTO performance (ID, ID_usr, average_goals_per_match, team, year, field_position) VALUES
    (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
    (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward'),
    (3, 3, 0.68, 'Paris Saint-Germain', '2022-05-10', 'Forward'),
    (4, 4, 0.83, 'Paris Saint-Germain', '2022-05-20', 'Forward'),
    (5, 5, 0.40, 'Manchester City', '2022-05-18', 'Midfielder'),
    (6, 6, 0.65, 'Liverpool', '2022-05-22', 'Forward'),
    (7, 7, 0.72, 'Tottenham Hotspur', '2022-05-17', 'Forward'),
    (8, 8, 0.55, 'Bayern Munich', '2022-05-13', 'Forward'),
    (9, 9, 0.25, 'Real Madrid', '2022-05-21', 'Midfielder'),
    (10, 10, 0.80, 'Barcelona', '2022-05-19', 'Forward');

    ### Response
    Query: Fetch All Footballers’ Names and Their Teams
    SELECT name, team
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;
    Query: Fetch All Footballers’ Names and Their Average Goals Per Match
    SELECT name, average_goals_per_match
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;
    Query: Query to List All Footballers and Their Field Positions
    SELECT name, field_position
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;

    `{question}`:
    ```sql3
    """


In [15]:
sp_nl2sql2 = sp_nl2sql2.format(question="Find the Footballer with the Most Goals in 2022")
(print(sp_nl2sql2))


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to clearn more about teh Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_

In [16]:
input_sentences = tokenizer(sp_nl2sql2, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [17]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT name, COUNT(goals_scored) AS total_goals
    FROM goals
    WHERE year = '2022-01-01'
    GROUP BY name
    ORDER BY total_goals DESC NULLS LAST
    LIMIT 1;


The Order is really different from the one obtained with the first prompt.

The first difference is the format. But The SQL is realy more simple, at least it is my sensation.

#Prompt with Shots in Sample Style.

In this prompt, we will place the examples in a separate section, and in the instructions, we will instruct the model to pay attention to them in order to generate the SQL commands.

In [18]:
sp_nl2sql3b = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (ID_usr, year, goals_scored) VALUES
    (1, '2022-01-01', 30),
    (2, '2022-01-01', 35),
    (3, '2022-01-01', 28),
    (4, '2022-01-01', 33),
    (5, '2022-01-01', 15),
    (6, '2022-01-01', 22),
    (7, '2022-01-01', 27),
    (8, '2022-01-01', 19),
    (9, '2022-01-01', 10),
    (10, '2022-01-01', 34);

    CREATE TABLE performance (
      ID INT,
      ID_usr INT,
      average_goals_per_match FLOAT,
      team VARCHAR(100),
      year DATE,
      field_position VARCHAR(100)
    );

    INSERT INTO performance (ID, ID_usr, average_goals_per_match, team, year, field_position) VALUES
    (1, 1, 0.75, 'Paris Saint-Germain', '2022-05-12', 'Forward'),
    (2, 2, 0.85, 'Al-Nassr', '2022-05-15', 'Forward'),
    (3, 3, 0.68, 'Paris Saint-Germain', '2022-05-10', 'Forward'),
    (4, 4, 0.83, 'Paris Saint-Germain', '2022-05-20', 'Forward'),
    (5, 5, 0.40, 'Manchester City', '2022-05-18', 'Midfielder'),
    (6, 6, 0.65, 'Liverpool', '2022-05-22', 'Forward'),
    (7, 7, 0.72, 'Tottenham Hotspur', '2022-05-17', 'Forward'),
    (8, 8, 0.55, 'Bayern Munich', '2022-05-13', 'Forward'),
    (9, 9, 0.25, 'Real Madrid', '2022-05-21', 'Midfielder'),
    (10, 10, 0.80, 'Barcelona', '2022-05-19', 'Forward');

    ### Samples

    Query: Fetch All Footballers’ Names and Their Teams
    SELECT name, team
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;
    Query: Fetch All Footballers’ Names and Their Average Goals Per Match
    SELECT name, average_goals_per_match
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;
    Query: Query to List All Footballers and Their Field Positions
    SELECT name, field_position
    FROM footballers
    JOIN performance ON footballers.ID_usr = performance.ID_usr;

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """


In [19]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Get the Footballers’ Names and Average Goals Per Match for Players with More Than 0.7 Goals Per Match")
print (sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

        CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (

In [20]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [21]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT c.name, AVG(p.average_goals_per_match) AS average_goals_per_match FROM players c JOIN performance p ON c.ID_usr = p.ID WHERE p.average_goals_per_match > 0.7 GROUP BY c.name;


#Now the question in spanish.


In [22]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Quien fue el jugador con más goles en 2022?")
print (sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

        CREATE TABLE footballers (
      ID_usr INT,
      name VARCHAR(100)
    );

    INSERT INTO footballers (ID_usr, name) VALUES
    (1, 'Lionel Messi'),
    (2, 'Cristiano Ronaldo'),
    (3, 'Neymar Jr.'),
    (4, 'Kylian Mbappé'),
    (5, 'Kevin De Bruyne'),
    (6, 'Mohamed Salah'),
    (7, 'Harry Kane'),
    (8, 'Sadio Mané'),
    (9, 'Luka Modrić'),
    (10, 'Robert Lewandowski');

    CREATE TABLE goals (
      ID_usr INT,
      year DATE,
      goals_scored INT
    );

    INSERT INTO goals (

In [23]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [24]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT c.name, COUNT(g.ID_usr) AS goals_scored
    FROM players c
    JOIN goals g ON c.ID_usr = g.ID_usr
    WHERE g.year = '2022-01-01'
    GROUP BY c.name
    ORDER BY goals_scored DESC
    LIMIT 1;


The generated SQL command is the same regardless of where we have placed the examples.

#Conclusions.

Let's see the three SQL's together.

* SELECT employees.name, MAX(salary.salary) AS max_salary FROM employees JOIN salary ON employees.ID_Usr = salary.ID_Usr GROUP BY employees.name ORDER BY max_salary DESC NULLS LAST LIMIT 1;

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* Spanish Question: SELECT e.name
     FROM employees e
     JOIN salary s ON e.ID_Usr = s.ID_Usr
     WHERE s.salary = (SELECT MAX(salary) FROM salary)
     GROUP BY e.name
     ORDER BY COUNT(studies.ID_study) DESC
     LIMIT 1;


**The model has demonstrated that it is highly efficient in crafting SQL.** Additionally, it pays a lot of attention, perhaps too much, to the examples we provide. Clearly, these examples should be crafted by one of the best SQL programmers we have access to, though their use may not be essential.

On the other hand, although the model is clearly very proficient in SQL generation, during the creation of the notebook, I have encountered several issues because the commands need to be extremely clear. It doesn't handle typos well (which should not exist).

It appears to have some issues when it receives commands in Spanish. I assume this problem would be present in any language other than English. Therefore, since it's a tool that could be used by non-technical personnel, this should be considered in environments where English is not the primary language.

# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or wrong
 - What did you learn?

In [26]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Which footballers scored more than 25 goals in the year 2022?")
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT c.name, COUNT(g.goals_scored) AS goals_scored
    FROM players c
    JOIN goals g ON c.ID_usr = g.ID_usr
    WHERE g.year = '2022-01-01'
    GROUP BY c.name
    HAVING COUNT(g.goals_scored) > 25;


In [27]:
sp_nl2sql3 = sp_nl2sql3b.format(question="What is the average number of goals per match for each footballer in Paris Saint-Germain?")
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT c.name, AVG(p.average_goals_per_match) AS average_goals_per_match FROM players c JOIN performance p ON c.ID_usr = p.ID AND p.team = 'Paris Saint-Germain' WHERE c.team = 'Paris Saint-Germain' GROUP BY c.name;


In [28]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Who are the midfielders in the database, and which teams do they play for?")
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

CREATE TABLE midfielders (
      ID_usr INT,
      name VARCHAR(100)
    );


REPORT

Several challenges and limitations were encountered while generating SQL queries:

1.	Ambiguity in Natural Language Queries: The model occasionally struggled with generating correct SQL queries when the input query was ambiguous. It often returned incomplete or inaccurate SQL statements.
2.	Model Hallucinations: The model sometimes produced SQL queries with references to nonexistent columns or tables.
3.	Complex SQL Queries: When generating complex SQL commands involving multiple joins or nested subqueries, the model occasionally misordered the operations or failed to include necessary conditions. This affected the performance and accuracy of the queries.

Learnings

1.	Importance of Prompt Engineering: Properly structuring prompts and providing specific examples of SQL queries improved the model’s accuracy. Few-shot learning techniques were essential in guiding the model to produce better SQL outputs.
2.	Impact of Quantization: The use of 4-bit quantization was crucial for fitting the model into limited memory resources without sacrificing much accuracy. However, exploring different quantization levels (like 8-bit) could help fine-tune the trade-off between model size and performance.
3.	Error Handling: Handling ambiguities and refining natural language inputs are necessary steps to enhance the model’s reliability. Teaching the model to identify and ask clarifying questions when faced with vague input could significantly reduce errors.
4.	Model Adaptation: Adapting the model to specific domains enhances contextual understanding, leading to more relevant and accurate responses.

Conclusion

This project demonstrates that while fine-tuned models for SQL generation can achieve high accuracy with structured prompts and quantization techniques, there are areas where improvements are still needed, especially in handling ambiguities and complex query structures. 