# Natural language to SQL

**Run in [Google Colab](https://colab.research.google.com/) to get a GPU.**

This model has Mistral as its base and has been fine-tuned to excel at SQL code generation.

In [3]:
from google.colab import userdata
token = userdata.get('HF_TOKEN')
print(f"Key successfully loaded: {len(token) > 0}")

Key successfully loaded: True


In [4]:
# Installing the latest versions of the peft & transformers libraries is recommended
# if you want to work with the most recent models.
!pip install -q git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install bitsandbytes

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for peft (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/accelerate.git
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-psaye0q1
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-psaye0q1
  Resolved https://github.com/huggingface/accelerate.git to commit 55136b8dc4a1f5bf8a33f38f25b279debdabcc00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: accelerate
  Building wheel for accelerate (pyproject.toml) ... done
  Created wheel for accelerate: filename=accelerate-0.35.0.dev0-py3-none-any.whl size=33

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import accelerate

In [6]:
model_name = "defog/sqlcoder-7b"

We need to create the quantization configuration to load the model.

It is a large model and I want it to fit in a 16GB GPU, so I'm going to use 4-bit quantization.

If you want to learn more about quantization, refer to this article: [QLoRA: Training a Large Language Model on a 16GB GPU.](https://medium.com/towards-artificial-intelligence/qlora-training-a-large-language-model-on-a-16gb-gpu-00ea965667c1)

You can also try loading this model with 8-bit quantization and check whether you see any improvement in the results.
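To get a feeling for why quantization matters on a 16GB GPU, here is a rough back-of-the-envelope sketch. It assumes about 7 billion parameters (an assumption taken from the "7b" in the model name) and counts only the weights, ignoring activations, the KV cache, and quantization overhead, so the real footprint will be higher. The helper `weight_memory_gb` is a hypothetical name used only for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7e9 parameters, inferred from the "7b" in the model name.
N_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone come close to the 16GB limit, while 8-bit and 4-bit leave room for the rest of the generation process.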

In [7]:
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16
)


To load the model, I pass the quantization configuration to AutoModelForCausalLM, and Hugging Face takes care of all the hard work.

In [8]:
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map='auto',
                    use_cache = True)

config.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
eos_token_id = tokenizer.convert_tokens_to_ids(["```"])[0]

tokenizer_config.json:   0%|          | 0.00/915 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

This function wraps the call to *model.generate*

In [10]:
# Returns the token ids generated by the given model for the given tokenized inputs.
def get_outputs(model, inputs, max_new_tokens=400):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=5
    )
    return outputs

# Prompt without Shots.
In this first prompt we are going to give instructions to the model and pass it the structure of the database.

The instructions are significantly different from those we are passing to GPT-3.5-Turbo. This model is really well fine-tuned, but it is smaller than GPT-3.5.

We need to be more clear with the instructions, as it does not have the same capacity to understand our orders as GPT-3.5.

In [42]:
table = """
     create table (
        ID_Hotel INT primary key,
        name VARCHAR,
        rooms INT);

    /*3 example rows
    select * from hotels limit 3;
    ID_Hotel    name                              rooms
    153         ESTREL Berlin                     1200
    8           Steigenberger Airport Frankfurt   933
    59          SI-Suites Stuttgart               52
    */

    create table guests(
        ID_Guest INT primary key,
        name VARCHAR,
        date_of_birth DATE,
        city VARCHAR,
        foreign key(ID_Hotel) references hotel (ID_Hotel));

    /*3 example rows
    select * from guests limit 3;
    ID_Guest    name            date_of_birth   city
    12          Adam Mayer      22/09/1983      Radolfszell
    15          Jasmin Müller   04/01/1992      Mannheim
    19          Rene Nagel      30/08/1995      Oberdupfingen
    */

    create table reservations(
        ID_Res INT primary key,
        arrival DATE,
        departure DATE;
        room_number INT,
        status VARCHAR,
        foreign key(ID_Guest) references guest (ID_Guest),
        foreign key(ID_Hotel) references hotel (ID_Hotel));

    /*3 example rows
    select * from reservations limit 3;
    ID_Res    arrival      departure     room_number  status
    123784    01/01/2025   05/01/2025    NULL         confirmed
    93676     07/10/2024   19/10/2024    210          checked_in
    74673     19/06/2024   22/06/2024    143          checked_out
    */
"""

In [16]:
sp_nl2sql = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    {table}

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """

In [17]:
sp_nl2sql = sp_nl2sql.format(question="Guest names with more then 3 reservations", table=table)
print(sp_nl2sql)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    
     create table (
        ID_Hotel INT primary key,
        name VARCHAR,
        rooms INT);

    /*3 example rows
    select * from hotels limit 3;
    ID_Hotel    name                              rooms
    153         ESTREL Berlin                     1200
    8           Steigenberger Airport Frankfurt   933
    59          SI-Suites Stuttgart               52
    */

    create table guests(
        ID_Guest INT primary key,
        name VARCHAR,
        date_of_birth DATE,
        city VARCHAR,
        foreign key(ID_Hotel) references hotel (ID_Hotel));

    /*3 example rows
    select

In [13]:
input_sentences = tokenizer(sp_nl2sql, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)

In [None]:
# Empty the cache in order to make more calls without problems.
torch.cuda.empty_cache()

In [14]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT g.name FROM guests g JOIN reservations r ON g.id_guest = r.id_guest GROUP BY g.name HAVING COUNT(r.id_res) > 3;


The generated SQL query is correct.
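The chain of `split` calls used to print the result is repeated for every prompt in this notebook, so it could be wrapped in a small helper. This is only a sketch: `extract_sql` is a hypothetical name, and the `sql3` fence label is the one used at the end of the prompts here. The decoded string in the example is made up for illustration, not a real model output.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example easy to read


def extract_sql(generated_text: str, label: str = "sql3") -> str:
    """Return the first SQL statement that follows the last opening fence."""
    after_fence = generated_text.split(FENCE + label)[-1]  # drop the prompt
    body = after_fence.split(FENCE)[0]                      # cut at the closing fence
    first_statement = body.split(";")[0].strip()            # keep one statement only
    return first_statement + ";"


# Made-up decoded text: prompt tail, opening fence, query, closing fence.
decoded = f"### Response\n{FENCE}sql3\nSELECT g.name FROM guests g;\n{FENCE} trailing text"
print(extract_sql(decoded))
```

Note that splitting on `;` also cuts any query that contains a semicolon inside a string literal, so this is a convenience for the notebook rather than a real SQL parser.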

# Prompt with shots OpenAI Style.
In this second prompt we are going to add a shot (a sample question with its SQL) to see whether our SQL style affects the model.

In [32]:
sp_nl2sql2 = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to clearn more about teh Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    {table}

    ### Response
    Question: Return the names of guests which are born between 10. October 1999 und 10. October 2000?
    SELECT g.name
    FROM guests g
    WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10';

    `{question}`:
    ```sql3
    """


In [33]:
sp_nl2sql2 = sp_nl2sql2.format(question="Guest names with more then 3 reservations", table=table)
print(sp_nl2sql2)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to clearn more about teh Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    
     create table (
        ID_Hotel INT primary key,
        name VARCHAR,
        rooms INT);

    /*3 example rows
    select * from hotels limit 3;
    ID_Hotel    name                              rooms
    153         ESTREL Berlin                     1200
    8           Steigenberger Airport Frankfurt   933
    59          SI-Suites Stuttgart               52
    */

    create table guests(
        ID_Guest INT primary key,
        name VARCHAR,
        date_of_birth DATE,
        city VARC

In [20]:
input_sentences = tokenizer(sp_nl2sql2, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [21]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT g.name, COUNT(r.id_res) AS number_of_reservations
    FROM guests g
    JOIN reservations r ON g.id_;


The query is really different from the one obtained with the first prompt.

The first difference is the format. The SQL also looks simpler, at least that is my impression.

# Prompt with Shots in Sample Style.

In this prompt, we will place the examples in a separate section, and in the instructions, we will instruct the model to pay attention to them in order to generate the SQL commands.
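If the number of examples grows, the samples section can be generated from a list instead of being typed into the template by hand. The following is a minimal sketch of that idea; `render_samples` is a hypothetical helper, and the layout (a `### Samples` header, then `Question:` followed by the SQL) simply mirrors the style used in this notebook.

```python
def render_samples(samples):
    """Build a '### Samples' section from (question, sql) pairs."""
    blocks = []
    for question, sql in samples:
        blocks.append(f"Question: {question}\n{sql.strip()}")
    return "### Samples\n\n" + "\n\n".join(blocks) + "\n"


# Example pairs written for illustration, in the style of this notebook.
samples = [
    (
        "Return the names of guests born between 10 October 1999 and 10 October 2000",
        "SELECT g.name\nFROM guests g\n"
        "WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10';",
    ),
    (
        "Return the hotels with more than 1000 rooms",
        "SELECT h.name\nFROM hotels h\nWHERE h.rooms > 1000;",
    ),
]

print(render_samples(samples))
```

The resulting string can be passed to the template as one more placeholder, in the same way as the schema and the question.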

In [34]:
sp_nl2sql3b = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    {table}

    ### Samples

    Question: Return the names of guests which are born between 10. October 1999 und 10. October 2000?
    SELECT g.name
    FROM guests g
    WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10';

    Question: Return a List of checked_in reservations, for a specific named hotel
    SELECT r.ID_Res
    FROM reservations r
    JOIN hotels h ON r.ID_Hotel = h.ID_Hotel
    WHERE r.status = 'checked_in' AND h.name = 'Hotel am Schillerpark';

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """


In [35]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Guest names with more then 3 reservations", table=table)
print(sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    
     create table (
        ID_Hotel INT primary key,
        name VARCHAR,
        rooms INT);

    /*3 example rows
    select * from hotels limit 3;
    ID_Hotel    name                              rooms
    153         ESTREL Berlin                     1200
    8           Steigenberger Airport Frankfurt   933
    59          SI-Suites Stuttgart               52
    */

    create table guests(
        ID_Guest INT primary key,
        name VARCHAR,
        date_of_birth DATE,
        city VARCH

In [25]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [26]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT g.name
    FROM guests g
    JOIN reservations r ON g.ID_Guest = r.ID_Guest
    GROUP BY g.name
    HAVING COUNT(r.ID_Res) > 3;


# Now the question in Spanish.


In [36]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Nombres de invitados con más de 3 reservas", table=table)
print(sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    
     create table (
        ID_Hotel INT primary key,
        name VARCHAR,
        rooms INT);

    /*3 example rows
    select * from hotels limit 3;
    ID_Hotel    name                              rooms
    153         ESTREL Berlin                     1200
    8           Steigenberger Airport Frankfurt   933
    59          SI-Suites Stuttgart               52
    */

    create table guests(
        ID_Guest INT primary key,
        name VARCHAR,
        date_of_birth DATE,
        city VARCH

In [39]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [40]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

SELECT COUNT(*) AS total_reservations FROM reservations WHERE room_number > 3;


~~The generated SQL command is the same regardless of where we have placed the examples.~~

The output is completely different. Either the model handles Spanish poorly or my Spanish translation of the question is wrong.

# Conclusions.

Let's see the generated SQL queries together.

* SELECT employees.name, MAX(salary.salary) AS max_salary FROM employees JOIN salary ON employees.ID_Usr = salary.ID_Usr GROUP BY employees.name ORDER BY max_salary DESC NULLS LAST LIMIT 1;

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* Spanish Question: SELECT e.name
     FROM employees e
     JOIN salary s ON e.ID_Usr = s.ID_Usr
     WHERE s.salary = (SELECT MAX(salary) FROM salary)
     GROUP BY e.name
     ORDER BY COUNT(studies.ID_study) DESC
     LIMIT 1;


**The model has demonstrated that it is highly efficient in crafting SQL.** Additionally, it pays a lot of attention, perhaps too much, to the examples we provide. Clearly, these examples should be crafted by one of the best SQL programmers we have access to, though their use may not be essential.

On the other hand, although the model is clearly very proficient in SQL generation, during the creation of the notebook, I have encountered several issues because the commands need to be extremely clear. It doesn't handle typos well (which should not exist).
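Since the model is so sensitive to unclear input, one cheap safeguard is to check the schema string, and later the generated query, against a real SQL engine before trusting either. The sketch below uses an in-memory SQLite database as a stand-in. That is an assumption: the target database may use a different dialect, so a query that SQLite accepts may still fail elsewhere, and the other way round. `schema_error`, `query_error` and `DEMO_SCHEMA` are hypothetical names, and the schema is a simplified example rather than the one used above.

```python
import sqlite3


def schema_error(ddl):
    """Return None if SQLite accepts the DDL, otherwise the error message."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


def query_error(ddl, query):
    """Return None if the query compiles against the schema, else the error."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)
        # EXPLAIN compiles the statement without needing any rows in the tables.
        conn.execute("EXPLAIN " + query)
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


# Hypothetical, simplified schema used only for this illustration.
DEMO_SCHEMA = """
create table hotels(
    ID_Hotel INT primary key,
    name VARCHAR,
    rooms INT);
create table reservations(
    ID_Res INT primary key,
    ID_Hotel INT,
    status VARCHAR,
    foreign key(ID_Hotel) references hotels (ID_Hotel));
"""

print(schema_error(DEMO_SCHEMA))                                # accepted
print(schema_error("create table (ID_Hotel INT primary key);")) # table name missing
print(query_error(DEMO_SCHEMA, "SELECT r.name FROM reservations r;"))  # unknown column
```

A check like this only catches syntax errors and unknown tables or columns. It cannot tell whether a query that compiles actually answers the question.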

It appears to have some issues when it receives commands in Spanish. I assume this problem would be present in any language other than English. Therefore, since it's a tool that could be used by non-technical personnel, this should be considered in environments where English is not the primary language.

# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or was wrong
 - What did you learn?

In [41]:
questions = ["Find the names of all hotels with more than 500 rooms.",
             "List the reservations that are currently 'checked_in' and are scheduled to depart before '2024-12-31'.",
             "Get the names of all guests who live in Munich have a reservation at the hotel named 'ESTREL Berlin'."]

for text in questions:
  payload = sp_nl2sql3b.format(question=text, table=table)
  input_sentences = tokenizer(payload, return_tensors="pt").to('cuda')
  response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
  SQL = tokenizer.batch_decode(response, skip_special_tokens=True)

  print("Question: " + text)
  print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")
  print("")

Question: Find the names of all hotels with more than 500 rooms.
SELECT h.name, h.rooms
    FROM hotels h
    WHERE h.rooms > 500;

Question: List the reservations that are currently 'checked_in' and are scheduled to depart before '2024-12-31'.
SELECT r.ID_Res
    FROM reservations r
    JOIN guests g ON r.ID_Guest = g.ID_Guest
    JOIN hotels h ON r.ID_Hotel = h.ID_Hotel
    WHERE r.status = 'checked_in' AND g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10' AND r.departure BETWEEN CURRENT_DATE AND '2024-12-31';

Question: Get the names of all guests who live in Munich have a reservation at the hotel named 'ESTREL Berlin'.
SELECT g.name
    FROM guests g
    JOIN reservations r ON g.ID_Guest = r.ID_Guest
    JOIN hotels h ON r.ID_Hotel = h.ID_Hotel
    WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10' AND h.name = 'ESTREL Berlin';



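**Findings:**

Some responses are great.
But for the Spanish question, the model filtered on the room_number of reservations (room_number > 3) and returned a reservation count, instead of the names of guests with more than 3 reservations.

My test with 3 questions:
1. Test: easy and correct.
2. Test: a filter on the guest's date_of_birth was added for no apparent reason. It is identical to the one in the first sample of the prompt.
3. Test: the same date_of_birth filter from the sample was added again, and the condition on the city (Munich) is missing.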
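The date_of_birth problem in tests 2 and 3 can be detected automatically, at least in a rough way: look for quoted literals that appear both in the generated SQL and in the samples, but not in the question. The sketch below does exactly that with a regular expression. `copied_literals` is a hypothetical helper, it only understands simple single-quoted literals, and the sample strings are abbreviated to their WHERE clauses.

```python
import re

# Matches simple single-quoted SQL literals such as 'checked_in'.
_LITERAL = re.compile(r"'([^']*)'")


def copied_literals(question, generated_sql, sample_sqls):
    """Literals shared by the output and the samples that the question never mentions."""
    from_samples = set()
    for sql in sample_sqls:
        from_samples.update(_LITERAL.findall(sql))
    from_output = set(_LITERAL.findall(generated_sql))
    return {lit for lit in from_output & from_samples if lit not in question}


# Sample queries from the prompt, abbreviated to their WHERE clauses.
sample_sqls = [
    "WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10';",
    "WHERE r.status = 'checked_in' AND h.name = 'Hotel am Schillerpark';",
]

question = ("Get the names of all guests who live in Munich have a reservation "
            "at the hotel named 'ESTREL Berlin'.")
generated = ("SELECT g.name FROM guests g "
             "JOIN reservations r ON g.ID_Guest = r.ID_Guest "
             "JOIN hotels h ON r.ID_Hotel = h.ID_Hotel "
             "WHERE g.date_of_birth BETWEEN '1999-10-10' AND '2000-10-10' "
             "AND h.name = 'ESTREL Berlin';")

print(sorted(copied_literals(question, generated, sample_sqls)))
# prints ['1999-10-10', '2000-10-10']
```

A non-empty result does not prove the query is wrong, but it is a good hint that the model copied a condition from the samples instead of reading the question.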