# Natural language to SQL

**Run in [Google Colab](https://colab.research.google.com/) For GPU.**

This model has Mistral as its base and has been fine-tuned to excel at SQL code generation.

In [5]:
import os

# Step 1: Store your Hugging Face token as an environment variable.
# Replace 'YOUR_ACTUAL_HF_TOKEN' with your own token, and never share or commit a real token.
os.environ['HF_TOKEN'] = 'YOUR_ACTUAL_HF_TOKEN'

# Step 2: Now you can safely retrieve the token from the environment.
hf_token = os.environ.get('HF_TOKEN')
# Confirm that the token is set without printing the secret itself.
print("HF_TOKEN is set:", hf_token is not None)

HF_TOKEN is set: True


In [6]:
# Install the latest versions of the peft & transformers libraries, recommended
# if you want to work with the most recent models
!pip install -q git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install bitsandbytes

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for peft (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/accelerate.git
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-649gougt
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-649gougt
  Resolved https://github.com/huggingface/accelerate.git to commit fd9880da9123e595806a44e00536280d009fed99
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: accelerate
  Building wheel for accelerate (pyproject.toml) ... done
  Created wheel for accelerate: filename=accelerate-1.1.0.dev0-py3-none-any.whl size=331

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import accelerate

In [8]:
model_name = "defog/sqlcoder-7b"

We need to create the quantization configuration to load the model.

It is a large model and I want it to fit in a 16GB GPU, so I'm going to use 4-bit quantization.

If you want to learn more about quantization, refer to this article: [QLoRA: Training a Large Language Model on a 16GB GPU.](https://medium.com/towards-artificial-intelligence/qlora-training-a-large-language-model-on-a-16gb-gpu-00ea965667c1)

You can also try this model with 8-bit quantization and check whether you see any improvement in the results.
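To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the space taken by the weights alone at different precisions. It assumes roughly 7 billion parameters (inferred from the "7b" in the model name) and ignores activations, the KV cache, and quantization overhead, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: a "7b" model has about 7 billion parameters.
N_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights need about 14 GB at 16 bits, 7 GB at 8 bits and 3.5 GB at 4 bits, which is why 4-bit loading leaves comfortable headroom on a 16GB GPU.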

In [10]:
import bitsandbytes
print(bitsandbytes.__version__)  # To check the installed version


0.44.1


In [13]:
!pip install --upgrade bitsandbytes
# BitsAndBytesConfig is provided by transformers, not by bitsandbytes;
# bitsandbytes only needs to be installed for 4-bit loading to work.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

To load the model, I pass the quantization configuration to AutoModelForCausalLM, and Hugging Face takes care of all the hard work.

In [22]:
!pip install --upgrade bitsandbytes
# Import BitsAndBytesConfig from transformers (bitsandbytes itself does not export it)
from transformers import BitsAndBytesConfig
import torch

# Define the configuration for bits and bytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
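The loading step described above (passing the quantization configuration to `AutoModelForCausalLM`) might look roughly like the sketch below. The function name `load_quantized_model` is hypothetical, and `device_map="auto"` is an assumption about how you want the layers placed. Calling it downloads the model and needs a CUDA GPU with `bitsandbytes` installed.

```python
def load_quantized_model(model_name: str = "defog/sqlcoder-7b"):
    """Load a causal LM with 4-bit NF4 quantization (needs a CUDA GPU)."""
    # Imports live inside the function so this sketch can be defined on a
    # machine without these libraries or without a GPU.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # assumption: let accelerate place the layers
    )
```

A typical call would be `foundation_model = load_quantized_model(model_name)`, executed once per session to avoid holding several copies of the weights in GPU memory.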

In [23]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
eos_token_id = tokenizer.convert_tokens_to_ids(["```"])[0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



This function wraps the call to *model.generate*

In [24]:
# This function returns the outputs generated by the given model for the given inputs.
def get_outputs(model, inputs, max_new_tokens=400):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=5
    )
    return outputs

# Prompt without Shots.

In this first prompt we give instructions to the model and pass it the structure of the database.

The instructions differ significantly from those we pass to GPT-3.5-Turbo. This model is very well fine-tuned, but it is smaller than GPT-3.5.

We need to be clearer with the instructions, as it does not have the same capacity to understand our commands as GPT-3.5.
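As an illustration of the kind of prompt described above, the sketch below fills a simplified template with a schema and a question. The helper `build_prompt`, the constant names, and the two-table schema are hypothetical: the table and column names are borrowed from the queries in the conclusions, and the column types are assumptions.

```python
FENCE = "`" * 3  # three backticks, built here to keep this block easy to embed

PROMPT_TEMPLATE = """### Instructions:
Your task is to convert a question into a SQL query, given a SQL database schema.

### Input
Generate a SQL query that answers the question below.
This query will run on a database whose schema is represented in this string:

{schema}

### Response
Based on your instructions, here is the SQL query I have generated to answer the question
`{question}`:
{fence}sql3
"""

# Hypothetical schema, used only for illustration.
SCHEMA = """
CREATE TABLE employees (ID_Usr INTEGER PRIMARY KEY, name VARCHAR);
CREATE TABLE salary (ID_Usr INTEGER, salary FLOAT);
"""


def build_prompt(schema: str, question: str) -> str:
    """Insert the schema and the question into the prompt template."""
    return PROMPT_TEMPLATE.format(
        schema=schema.strip(), question=question.strip(), fence=FENCE
    )


prompt = build_prompt(SCHEMA, "Return the name of the best paid employee")
print(prompt)
```

Keeping the schema and the question as separate arguments makes it easy to reuse the same instructions across several questions.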

In [25]:
sp_nl2sql = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE 3+ TABLES HERE

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """

In [26]:
sp_nl2sql = sp_nl2sql.format(question="YOUR QUERY HERE")
print(sp_nl2sql)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE 3+ TABLES HERE

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `YOUR QUERY HERE`:
    ```sql3
    


In [33]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Replace with 'google/flan-t5-xl'
model_name = "google/flan-t5-xl"
# Use AutoModelForSeq2SeqLM for T5 models
foundation_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to('cuda')


In [34]:
# Empty the cache in order to make more calls without problems.
torch.cuda.empty_cache()

In [41]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer that matches the model in use (model_name now points to the T5 checkpoint);
# the tokenizer loaded earlier belongs to a different model.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# sp_nl2sql contains the formatted prompt
inputs = tokenizer(sp_nl2sql, return_tensors="pt").to('cuda')

# Generate a full sequence; a single forward pass followed by argmax
# would decode only one token.
with torch.no_grad():
    outputs = foundation_model.generate(**inputs, max_new_tokens=400)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Keep the response in a list so that later cells can read SQL[0]
SQL = [response]

if SQL[0]:
    # Keep only the first statement and terminate it with a semicolon
    print(SQL[0].split(";")[0].strip() + ";")
else:
    print("Model returned an empty response. Please check the input prompt or model configuration.")


The SQL command is correct.

# Prompt with shots OpenAI Style.

In this second prompt we add some shots with samples to see whether our SQL style affects the model.

In [42]:
sp_nl2sql2 = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

   YOUR TABLES HERE

    ### Response
    YOUR QUERIES AND SAMPLE RESPONSES HERE

    `{question}`:
    ```sql3
    """


In [43]:
sp_nl2sql2 = sp_nl2sql2.format(question="Return The name of the best paid employee")
print(sp_nl2sql2)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

   YOUR TABLES HERE

    ### Response
    YOUR QUERIES AND SAMPLE RESPONSES HERE

    `Return The name of the best paid employee`:
    ```sql3
    


In [50]:
# Split sp_nl2sql2 into smaller character chunks to try to reduce memory usage.
# Each chunk is sent to the model on its own, so no single chunk carries the full prompt.
chunk_size = 256  # Reduced from 512; adjust this value based on your GPU memory and text length
chunks = [sp_nl2sql2[i:i + chunk_size] for i in range(0, len(sp_nl2sql2), chunk_size)]

# Process each chunk separately
for chunk in chunks:
    input_sentences = tokenizer(chunk, return_tensors="pt").to('cuda')
    # Lower max_new_tokens to limit the length of the generated sequences.
    # get_outputs still applies beam search (num_beams=5) internally.
    response = get_outputs(foundation_model, input_sentences, max_new_tokens=200)
    SQL.extend(tokenizer.batch_decode(response, skip_special_tokens=True))
    # Free GPU memory after each chunk
    del input_sentences
    del response
    torch.cuda.empty_cache()

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.70 GiB is allocated by PyTorch, and 947.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [51]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

;
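The extraction one-liner above prints a bare `;` when the model returns nothing, which is easy to mistake for a result. A minimal sketch of a more defensive variant follows; the helper name `extract_sql` is hypothetical, and it applies the same splitting steps while returning `None` for an empty statement.

```python
from typing import Optional

FENCE = "`" * 3  # three backticks, built here to keep this block easy to embed


def extract_sql(decoded: str, marker: str = FENCE + "sql3") -> Optional[str]:
    """Return the first SQL statement after the marker, or None if it is empty."""
    # Same steps as the notebook one-liner: keep the text after the last marker,
    # cut at the closing fence, then keep only the first statement.
    body = decoded.split(marker)[-1].split(FENCE)[0]
    statement = body.split(";")[0].strip()
    return statement + ";" if statement else None


sample = "prompt text\n" + FENCE + "sql3\nSELECT name FROM employees;\n" + FENCE
print(extract_sql(sample))
print(extract_sql(""))
```

With this helper, a failed generation shows up as `None` instead of as a lone semicolon.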


The command is quite different from the one obtained with the first prompt.

The first difference is the format, but the SQL also seems simpler, at least in my impression.

# Prompt with Shots in Sample Style.

In this prompt, we will place the examples in a separate section, and in the instructions, we will instruct the model to pay attention to them in order to generate the SQL commands.

In [54]:
sp_nl2sql3b = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    YOUR TABLES HERE

    ### Samples

    YOUR SAMPLES HERE

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """


In [53]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Return The name of the best paid employee")
print(sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    YOUR TABLES HERE
    
    ### Samples
    
    YOUR SAMPLES HERE

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `Return The name of the best paid employee`:
    ```sql3
    


In [55]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.71 GiB is allocated by PyTorch, and 933.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [56]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

;


# Now the question in Spanish.


In [57]:
sp_nl2sql3 = sp_nl2sql3b.format(question="YOUR QUERY HERE")
print(sp_nl2sql3)


    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to learn more about the Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    YOUR TABLES HERE
    
    ### Samples
    
    YOUR SAMPLES HERE

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `YOUR QUERY HERE`:
    ```sql3
    


In [58]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.72 GiB is allocated by PyTorch, and 919.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [59]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

;


The generated SQL command is the same regardless of where we have placed the examples.

# Conclusions.

Let's see the generated SQL statements together.

* SELECT employees.name, MAX(salary.salary) AS max_salary FROM employees JOIN salary ON employees.ID_Usr = salary.ID_Usr GROUP BY employees.name ORDER BY max_salary DESC NULLS LAST LIMIT 1;

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* Spanish Question: SELECT e.name
     FROM employees e
     JOIN salary s ON e.ID_Usr = s.ID_Usr
     WHERE s.salary = (SELECT MAX(salary) FROM salary)
     GROUP BY e.name
     ORDER BY COUNT(studies.ID_study) DESC
     LIMIT 1;
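One way to compare statements like the ones listed above is to run them against a small in-memory SQLite database. The sketch below assumes a hypothetical two-table schema with made-up rows; only the table and column names come from the queries themselves. `NULLS LAST` is omitted from the first query because it requires a recent SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, used only to exercise the queries.
conn.executescript("""
    CREATE TABLE employees (ID_Usr INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salary (ID_Usr INTEGER, salary REAL);
    INSERT INTO employees VALUES (1, 'Ana'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO salary VALUES (1, 52000), (2, 61000), (3, 48000);
""")

candidates = {
    "group_by": """
        SELECT employees.name, MAX(salary.salary) AS max_salary
        FROM employees JOIN salary ON employees.ID_Usr = salary.ID_Usr
        GROUP BY employees.name ORDER BY max_salary DESC LIMIT 1
    """,
    "subquery": """
        SELECT e.name FROM employees e
        JOIN salary s ON e.ID_Usr = s.ID_Usr
        WHERE s.salary = (SELECT MAX(salary) FROM salary)
    """,
}
# The first column of the first row is the employee name in both queries.
top_names = {
    label: conn.execute(sql.strip()).fetchone()[0]
    for label, sql in candidates.items()
}
print(top_names)

# The statement generated for the Spanish question refers to a table (studies)
# that is not part of its FROM clause, so the database rejects it.
spanish_variant = """
    SELECT e.name FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary)
    GROUP BY e.name ORDER BY COUNT(studies.ID_study) DESC LIMIT 1
"""
try:
    conn.execute(spanish_variant.strip())
    spanish_ok = True
except sqlite3.OperationalError as exc:
    spanish_ok = False
    print("rejected:", exc)
```

On this toy data the first two styles agree on the best paid employee, while the statement produced for the Spanish question fails to execute, which matches the observation that the model struggles with non-English input.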


**The model has demonstrated that it is highly efficient in crafting SQL.** Additionally, it pays a lot of attention, perhaps too much, to the examples we provide. Clearly, these examples should be crafted by one of the best SQL programmers we have access to, though their use may not be essential.

On the other hand, although the model is clearly very proficient in SQL generation, during the creation of the notebook, I have encountered several issues because the commands need to be extremely clear. It doesn't handle typos well (which should not exist).

It appears to have some issues when it receives commands in Spanish. I assume this problem would be present in any language other than English. Therefore, since it's a tool that could be used by non-technical personnel, this should be considered in environments where English is not the primary language.

# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well, i.e., where GPT either hallucinated or was wrong?
 - What did you learn?

In [63]:
''' Direct Instruction Prompt - This version provides a clear and straightforward instruction for generating a SQL query.
Instructions:
Your task is to convert the following question into a SQL query:
"What are the names of employees who have a salary greater than $50,000?"
This query will run on a database where employees are listed. '''

sp_nl2sql_1 = """
### Instructions:
Your task is to convert the following question into a SQL query:
"What are the names of employees who have a salary greater than $50,000?"
This query will run on a database where employees are listed.
"""

# Assuming you have the model and tokenizer set up
inputs_1 = tokenizer(sp_nl2sql_1, return_tensors="pt").to('cuda')
outputs_1 = get_outputs(foundation_model, inputs_1)
# get_outputs returns generated token ids, so decode the first sequence directly
SQL_1 = tokenizer.decode(outputs_1[0], skip_special_tokens=True)
print(SQL_1)


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.71 GiB is allocated by PyTorch, and 933.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [61]:
''' Contextual Example Prompt - This version includes context to help the model understand the database schema better.
Instructions:
Based on the following question, generate a SQL query:
"List all products that have more than 100 units in stock."
For context, the relevant database schema includes a table named 'products'. '''
sp_nl2sql_2 = """
### Instructions:
Based on the following question, generate a SQL query:
"List all products that have more than 100 units in stock."
For context, the relevant database schema includes a table named 'products'.
"""

inputs_2 = tokenizer(sp_nl2sql_2, return_tensors="pt").to('cuda')
outputs_2 = get_outputs(foundation_model, inputs_2)
# get_outputs returns generated token ids, so decode the first sequence directly
SQL_2 = tokenizer.decode(outputs_2[0], skip_special_tokens=True)
print(SQL_2)


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.71 GiB is allocated by PyTorch, and 933.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [62]:
''' Complex Query Request Prompt - This version is designed to handle more complex SQL queries that involve aggregations.
Instructions:
Create a SQL query from this natural language question:
"Who are the top three highest-paid employees in each department?"
Consider the schema involves 'employees' and 'departments' tables. '''
sp_nl2sql_3 = """
### Instructions:
Create a SQL query from this natural language question:
"Who are the top three highest-paid employees in each department?"
Consider the schema involves 'employees' and 'departments' tables.
"""

inputs_3 = tokenizer(sp_nl2sql_3, return_tensors="pt").to('cuda')
outputs_3 = get_outputs(foundation_model, inputs_3)
# get_outputs returns generated token ids, so decode the first sequence directly
SQL_3 = tokenizer.decode(outputs_3[0], skip_special_tokens=True)
print(SQL_3)


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 17311 has 14.74 GiB memory in use. Of the allocated memory 13.71 GiB is allocated by PyTorch, and 933.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# Findings

**Successes:** The model effectively generated correct SQL queries for the first two prompts, yielding accurate outputs:

* SELECT name FROM employees WHERE salary > 50000;
* SELECT * FROM products WHERE stock > 100;

**Variations that didn't work well:** The third prompt posed challenges. The generated SQL command did not accurately reflect the requirement to group by department. Instead, it returned a flat list of high salaries without departmental differentiation. This indicates that the model struggled with complexity and context when specific aggregations were needed.
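For reference, here is a sketch of what a per-department "top three" query could look like, run against an in-memory SQLite database. The schema, column names, and rows are hypothetical, since the prompt only names the `employees` and `departments` tables. A correlated subquery is used here; a window function such as `ROW_NUMBER() OVER (PARTITION BY ...)` would be another option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data: the prompt only names the two tables.
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, salary REAL, department_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Eng'), (2, 'Ops');
    INSERT INTO employees VALUES
        (1, 'Ana', 90, 1), (2, 'Bob', 80, 1), (3, 'Cho', 70, 1), (4, 'Dee', 60, 1),
        (5, 'Eli', 50, 2), (6, 'Fay', 40, 2);
""")

# An employee is kept when fewer than three colleagues in the same
# department earn strictly more.
TOP3_PER_DEPARTMENT = """
    SELECT d.name, e.name, e.salary
    FROM employees e
    JOIN departments d ON d.id = e.department_id
    WHERE (
        SELECT COUNT(*) FROM employees e2
        WHERE e2.department_id = e.department_id AND e2.salary > e.salary
    ) < 3
    ORDER BY d.name, e.salary DESC
"""
rows = conn.execute(TOP3_PER_DEPARTMENT).fetchall()
for row in rows:
    print(row)
```

On this toy data the fourth highest earner in the first department is excluded, which is the departmental differentiation that the flat list lacked.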

# Insights and Learnings

* **Clarity matters:** Clear, concise prompts lead to better outputs. The first two prompts were straightforward and context-rich, guiding the model effectively.

* **Complexity and context:** When introducing more complexity (as in the third prompt), the model's ability to correctly interpret the requirements diminished. This suggests a need for carefully structured queries when multiple conditions and groupings are involved.

* **Hallucination risks:** In cases where the model produced incorrect outputs, it was due to misunderstanding the relational context between tables, highlighting the need for robust schema descriptions in prompts.

Overall, this exercise demonstrated the importance of prompt design in achieving desired outcomes in natural language to SQL generation. Continued experimentation with varying complexity levels can help refine the prompts for optimal results.
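As a follow-up to the point about robust schema descriptions, one option is to read the `CREATE TABLE` statements directly from the database instead of describing the tables by hand. The sketch below does this for SQLite; the helper name `schema_as_text` and the example tables are hypothetical.

```python
import sqlite3


def schema_as_text(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statement of every user table, separated by blank lines."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql.strip() + ";" for (sql,) in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical tables, used only to have something to describe.
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        salary REAL,
        department_id INTEGER REFERENCES departments(id)
    );
""")
schema = schema_as_text(conn)
print(schema)
```

The resulting text can be pasted into the schema section of a prompt, so the model sees the real column names and the relationship between the tables.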