# Natural language to SQL

**Run in [Google Colab](https://colab.research.google.com/) For GPU.**

This model have  Mistral as a base and it has been fine-tuned to excel in SQL code generation.

In [None]:
from google.colab import userdata
userdata.get('HF_TOKEN')

In [None]:
#Install the lastest versions of peft & transformers library recommended
#if you want to work with the most recent models
!pip install -q git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install bitsandbytes

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import accelerate

In [4]:
model_name = "defog/sqlcoder-7b"

We need to create the Quantization configuration to load the Model.

It is a large model and I want it to fit in a 16GB GPU, I'm going to use a 4 bits quantization.

If you want to learn more about quantization, refer to this article: [QLoRA: Training a Large Language Model on a 16GB GPU.](https://medium.com/towards-artificial-intelligence/qlora-training-a-large-language-model-on-a-16gb-gpu-00ea965667c1)

You can try to use this model in a 8 bit quantizations and check in you see any improvements in the results.

In [5]:
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_use_double_quant=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16
)


To load the model I pass to the AutoModelForCasualLM teh quantization configurations, and HuggingFace take care of all the hard work.

In [None]:
foundation_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map='auto',
                    use_cache = True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
eos_token_id = tokenizer.convert_tokens_to_ids(["```"])[0]

This function wraps the call to *model.generate*

In [8]:
#this function returns the outputs from the model received, and inputs.
def get_outputs(model, inputs, max_new_tokens=400):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=5
    )
    return outputs

# Prompt without Shots.
In this first PROMPT we are going to give Instructions to the model and pass the structure of the Database.

The instructions are significantly different from those we are passing to GPT-3.5-Turbo. This model is really well fine-tuned, but it is smaller than GPT-3.5.

We need to be more clear with the instructions, as it does not have the same capacity to understand our orders as GPT-3.5.

In [9]:
sp_nl2sql = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """

In [None]:
sp_nl2sql = sp_nl2sql.format(question="Find all employees who make more than $50,000")
print(sp_nl2sql)

In [12]:
input_sentences = tokenizer(sp_nl2sql, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)

In [13]:
#Empty the cache in orde to do more calls without problems.
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

The SQL Order is correct.

#Prompt with shots OpenAI Style.
In this second prompt we are going to add some Shots with samples to see if our SQL style affects the model.

In [15]:
sp_nl2sql2 = """
    ### Instructions:
Your task is convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use the samples SQL In the ### Samples section to clearn more about teh Databases structure


    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Response
    ### Samples
    Question: "Show me all employees in the IT department"
    ```sql
    SELECT name
    FROM employees
    WHERE department = 'IT';
    ```

    Question: "What is the average salary in each department?"
    ```sql
    SELECT e.department, AVG(s.salary) as avg_salary
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    GROUP BY e.department;
    ```

    Question: "Find employees who have both a Bachelor's and Master's degree"
    ```sql
    SELECT e.name
    FROM employees e
    JOIN studies s1 ON e.ID_Usr = s1.ID_Usr
    JOIN studies s2 ON e.ID_Usr = s2.ID_Usr
    WHERE s1.degree = 'Bachelor' AND s2.degree = 'Master'
    GROUP BY e.name;
    ```

    Question: "Who are the top 3 highest paid employees?"
    ```sql
    SELECT e.name, s.salary
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    ORDER BY s.salary DESC
    LIMIT 3;
    ```

    `{question}`:
    ```sql3
    """


In [None]:
sp_nl2sql2 = sp_nl2sql2.format(question="Return The name of the best paid employee")
(print(sp_nl2sql2))

In [17]:
input_sentences = tokenizer(sp_nl2sql2, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

The Order is really different from the one obtained with the first prompt.

The first difference is the format. But The SQL is realy more simple, at least it is my sensation.

#Prompt with Shots in Sample Style.

In this prompt, we will place the examples in a separate section, and in the instructions, we will instruct the model to pay attention to them in order to generate the SQL commands.

In [19]:
sp_nl2sql3b = """
        ### Instructions:
Your task is to convert a question into a SQL query, given a SQL database schema.
Adhere to these rules:
- **Carefully analyze the question and database schema word by word**
- **Study the sample queries in the Samples section to understand common patterns:**
  - How to use JOINs between tables
  - How to apply filtering with WHERE clauses
  - How to use aggregations (COUNT, AVG, etc.)
  - How to sort and limit results
  - How to use table aliases effectively
- **Choose the most similar sample query pattern to your task**
- **Adapt the chosen pattern to match the specific requirements**

    ### Input
    Generate a SQL query that answers the question below.
    This query will run on a database whose schema is represented in this string:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Response
    ### Samples
    Question: "Show me all employees in the IT department"
    ```sql
    SELECT name
    FROM employees
    WHERE department = 'IT';
    ```

    Question: "What is the average salary in each department?"
    ```sql
    SELECT e.department, AVG(s.salary) as avg_salary
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    GROUP BY e.department;
    ```

    Question: "Find employees who have both a Bachelor's and Master's degree"
    ```sql
    SELECT e.name
    FROM employees e
    JOIN studies s1 ON e.ID_Usr = s1.ID_Usr
    JOIN studies s2 ON e.ID_Usr = s2.ID_Usr
    WHERE s1.degree = 'Bachelor' AND s2.degree = 'Master'
    GROUP BY e.name;
    ```

    Question: "Who are the top 3 highest paid employees?"
    ```sql
    SELECT e.name, s.salary
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    ORDER BY s.salary DESC
    LIMIT 3;
    ```

    ### Response
    Based on your instructions, here is the SQL query I have generated to answer the question
    `{question}`:
    ```sql3
    """


In [None]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Return The name of the best paid employee")
print (sp_nl2sql3)

In [21]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

#Now the question in spanish.


In [None]:
sp_nl2sql3 = sp_nl2sql3b.format(question="Muestra los empleados del departamento de IT que ganan más de 50000 y tienen un título de Master")
print (sp_nl2sql3)

In [24]:
input_sentences = tokenizer(sp_nl2sql3, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

The generated SQL command is the same regardless of where we have placed the examples.

#Conclusions.

Let's see the three SQL's together.

* SELECT employees.name, MAX(salary.salary) AS max_salary FROM employees JOIN salary ON employees.ID_Usr = salary.ID_Usr GROUP BY employees.name ORDER BY max_salary DESC NULLS LAST LIMIT 1;

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* SELECT e.name
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_usr
    WHERE s.salary = (SELECT MAX(salary) FROM salary);

* Spanish Question: SELECT e.name
     FROM employees e
     JOIN salary s ON e.ID_Usr = s.ID_Usr
     WHERE s.salary = (SELECT MAX(salary) FROM salary)
     GROUP BY e.name
     ORDER BY COUNT(studies.ID_study) DESC
     LIMIT 1;


**The model has demonstrated that it is highly efficient in crafting SQL.** Additionally, it pays a lot of attention, perhaps too much, to the examples we provide. Clearly, these examples should be crafted by one of the best SQL programmers we have access to, though their use may not be essential.

On the other hand, although the model is clearly very proficient in SQL generation, during the creation of the notebook, I have encountered several issues because the commands need to be extremely clear. It doesn't handle typos well (which should not exist).

It appears to have some issues when it receives commands in Spanish. I assume this problem would be present in any language other than English. Therefore, since it's a tool that could be used by non-technical personnel, this should be considered in environments where English is not the primary language.

# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or wrong
 - What did you learn?

In [None]:
sp_nl2sql_hr = """
    ### Instructions:
Your task is to convert HR-related questions into SQL queries.
Adhere to these rules:
- **Think like an HR analyst when interpreting the question**
- **Consider common HR metrics and KPIs**
- **Focus on employee data analysis patterns**

    ### Input
    This query will run on our HR database schema:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Samples
    Question: "Who are our highest performers based on education and salary?"
    ```sql
    SELECT e.name, COUNT(s.degree) as qualifications, s.salary
    FROM employees e
    JOIN studies s ON e.ID_Usr = s.ID_Usr
    JOIN salary sal ON e.ID_Usr = sal.ID_Usr
    GROUP BY e.name, s.salary
    ORDER BY qualifications DESC, s.salary DESC
    LIMIT 5;
    ```

    Question: "Show me department skill distribution"
    ```sql
    SELECT e.department, s.degree, COUNT(*) as count
    FROM employees e
    JOIN studies s ON e.ID_Usr = s.ID_Usr
    GROUP BY e.department, s.degree
    ORDER BY e.department, count DESC;
    ```

    Question: "Identify departments with high education but below-average salaries"
    ```sql
    SELECT e.department,
           AVG(sal.salary) as avg_dept_salary,
           COUNT(s.degree) as total_degrees
    FROM employees e
    JOIN salary sal ON e.ID_Usr = sal.ID_Usr
    JOIN studies s ON e.ID_Usr = s.ID_Usr
    GROUP BY e.department
    HAVING AVG(sal.salary) < (SELECT AVG(salary) FROM salary)
    ORDER BY total_degrees DESC;
    ```

    ### Response
    Based on the HR analysis patterns above, here is the SQL query for:
    `{question}`:
    ```sql3
    """

# Test the HR-focused prompt
test_query = "Find departments that need upskilling based on education levels"
sp_nl2sql_hr_test = sp_nl2sql_hr.format(question=test_query)
print(sp_nl2sql_hr_test)

input_sentences = tokenizer(sp_nl2sql_hr_test, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()


In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

## version focusing on multilingual support:

In [None]:
sp_nl2sql_multi = """
    ### Instructions:
Your task is to convert questions in any language into SQL queries.
Rules to follow:
- **Understand the intent regardless of language**
- **Focus on key entities and relationships**
- **Maintain SQL best practices across languages**

    ### Input
    Database schema:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Samples
    Question (English): "Show top paid employees in IT"
    Question (Spanish): "Muestra los empleados mejor pagados en IT"
    Question (French): "Montrez les employés les mieux payés en IT"
    ```sql
    SELECT e.name, s.salary
    FROM employees e
    JOIN salary s ON e.ID_Usr = s.ID_Usr
    WHERE e.department = 'IT'
    ORDER BY s.salary DESC
    LIMIT 5;
    ```

    Question (English): "Count employees by degree type"
    Question (Spanish): "Cuenta empleados por tipo de título"
    Question (French): "Comptez les employés par type de diplôme"
    ```sql
    SELECT s.degree, COUNT(DISTINCT e.ID_Usr) as employee_count
    FROM employees e
    JOIN studies s ON e.ID_Usr = s.ID_Usr
    GROUP BY s.degree
    ORDER BY employee_count DESC;
    ```

    ### Response
    Based on the multilingual patterns above, here is the SQL query for:
    `{question}`:
    ```sql3
    """

# Test the multilingual prompt
test_query = "Muestra los departamentos donde el salario promedio es mayor a 50000"
sp_nl2sql_multi_test = sp_nl2sql_multi.format(question=test_query)
print(sp_nl2sql_multi_test)

input_sentences = tokenizer(sp_nl2sql_multi_test, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

## Version focusing on analytical patterns:

In [None]:
sp_nl2sql_analytics = """
    ### Instructions:
Your task is to convert analytical questions into SQL queries.
Focus on these patterns:
- **Aggregation patterns (COUNT, AVG, MAX, MIN)**
- **Grouping and filtering patterns**
- **Complex joins for data relationships**
- **Window functions for advanced analysis**

    ### Input
    Analytics database schema:

    CREATE TABLE employees (
        ID_Usr INT PRIMARY KEY,
        name VARCHAR(100),
        department VARCHAR(100)
    );

    CREATE TABLE salary (
        ID_Usr INT,
        salary DECIMAL(10,2),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    CREATE TABLE studies (
        ID_study INT PRIMARY KEY,
        ID_Usr INT,
        degree VARCHAR(100),
        FOREIGN KEY (ID_Usr) REFERENCES employees(ID_Usr)
    );

    ### Samples
    Question: "Analyze salary distribution by education level"
    ```sql
    SELECT s.degree,
           MIN(sal.salary) as min_salary,
           AVG(sal.salary) as avg_salary,
           MAX(sal.salary) as max_salary,
           COUNT(DISTINCT e.ID_Usr) as employee_count
    FROM employees e
    JOIN studies s ON e.ID_Usr = s.ID_Usr
    JOIN salary sal ON e.ID_Usr = sal.ID_Usr
    GROUP BY s.degree
    ORDER BY avg_salary DESC;
    ```

    Question: "Compare department performance by education and salary metrics"
    ```sql
    SELECT
        e.department,
        COUNT(DISTINCT e.ID_Usr) as total_employees,
        COUNT(DISTINCT s.degree) as total_degrees,
        AVG(sal.salary) as avg_salary,
        MAX(sal.salary) as max_salary
    FROM employees e
    LEFT JOIN studies s ON e.ID_Usr = s.ID_Usr
    LEFT JOIN salary sal ON e.ID_Usr = sal.ID_Usr
    GROUP BY e.department
    ORDER BY avg_salary DESC;
    ```

    Question: "Find salary percentiles by department"
    ```sql
    SELECT
        e.department,
        sal.salary,
        NTILE(4) OVER (PARTITION BY e.department ORDER BY sal.salary) as salary_quartile
    FROM employees e
    JOIN salary sal ON e.ID_Usr = sal.ID_Usr
    ORDER BY e.department, sal.salary;
    ```

    ### Response
    Based on the analytical patterns above, here is the SQL query for:
    `{question}`:
    ```sql3
    """

# Test the analytics prompt
test_query = "Show departments ranked by average employee qualifications and salary"
sp_nl2sql_analytics_test = sp_nl2sql_analytics.format(question=test_query)
print(sp_nl2sql_analytics_test)

input_sentences = tokenizer(sp_nl2sql_analytics_test, return_tensors="pt").to('cuda')
response = get_outputs(foundation_model, input_sentences, max_new_tokens=400)
SQL = tokenizer.batch_decode(response, skip_special_tokens=True)
torch.cuda.empty_cache()

In [None]:
print(SQL[0].split("```sql3")[-1].split("```")[0].split(";")[0].strip() + ";")

 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well? i.e., where GPT either hallucinated or wrong

## Natural Language to SQL Conversion: Analysis Report
## Test Methodology
We tested three distinct prompt variations using a database schema with employee information:
- Basic Prompt (No examples)
- OpenAI Style (With integrated examples)
- Sample Style (With separated examples and enhanced instructions)

## Key Findings
## Successful Patterns
1) Sample Style with Clear Instructions
Correctly interpreted relationships
Proper join conditions
Accurate filtering
2) Multilingual Queries
Successfully handled Spanish input
Maintained correct SQL syntax
Appropriate use of ordering and limits

## Problematic Variations
1) Basic Prompt Hallucinations
Added unnecessary GROUP BY columns
Included irrelevant NULLS LAST clause
Over-complicated simple queries
2) Incorrect Join Conditions
Missing join conditions
Incomplete relationship definitions
Potential cross joins
3) Complex Query Hallucinations
Meaningless aggregations
Incorrect interpretation of "education level"
Unnecessary conditions

## Recommendations for Improvement
#Prompt Structure
- Include explicit table relationships
- Provide clear examples of common patterns
- Add error cases to avoid

## Query Validation
- Verify join conditions
- Check for unnecessary clauses
- Validate aggregation logic

## Best Practices
- Use table aliases consistently
- Include proper join conditions
- Avoid unnecessary complexity

## Conclusion
- Sample Style prompt with enhanced instructions proved most reliable
- Basic Prompt was most prone to hallucinations
- Model handles simple queries well but can struggle with complex aggregations
- Clear examples and structured prompts significantly improve performance

