# SQL Generation with Transformer API

In [1]:
!pip install torch transformers bitsandbytes accelerate sqlparse

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [3]:
torch.cuda.is_available()

True

In [4]:
available_memory = torch.cuda.get_device_properties(0).total_memory

In [5]:
print(available_memory)

15835660288


## Download the Model
Use a GPU runtime on Colab (or any system with >30GB VRAM on your own machine) to load this in float16. If that is unavailable, use a GPU with at least 8GB of VRAM to load it in 8-bit, or at least 5GB of VRAM to load it in 4-bit.

This step can take around 5 minutes the first time. So please be patient :)
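
The thresholds above can be turned into a small decision helper. This is only a sketch: `pick_precision` is a hypothetical name, the 15e9-byte cutoff is the one used by the loading cell in this notebook, and "8GB" and "5GB" are assumed to mean 8e9 and 5e9 bytes.

```python
def pick_precision(total_bytes):
    """Return the loading mode suggested by the thresholds above, or None."""
    if total_bytes > 15e9:   # same cutoff as the loading cell in this notebook
        return "float16"
    if total_bytes >= 8e9:   # assumption: "8GB" means 8e9 bytes
        return "8bit"
    if total_bytes >= 5e9:   # assumption: "5GB" means 5e9 bytes
        return "4bit"
    return None              # too little memory for any of the listed modes


total = 15835660288  # value printed by the memory check above
print(f"{total / 2**30:.2f} GiB -> {pick_precision(total)}")
# prints: 14.75 GiB -> float16
```

Printing the size in GiB also makes the raw byte count from `get_device_properties` easier to compare with the figures quoted in GPU spec sheets.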

In [6]:
model_name = "defog/sqlcoder-7b-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if available_memory > 15e9:
    # if you have at least 15GB of GPU memory, load the model in float16
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
        use_cache=True,
    )
else:
    # else, load in 8 bits – this is a bit slower
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        # torch_dtype=torch.float16,
        load_in_8bit=True,
        device_map="auto",
        use_cache=True,
    )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



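The text above mentions 4-bit loading, but the loading cell only covers float16 and 8-bit. A possible 4-bit variant is sketched below, assuming a `transformers` version that provides `BitsAndBytesConfig`, an installed `bitsandbytes`, and a CUDA GPU. The helper name `load_sqlcoder_4bit`, the NF4 quantization type, and the float16 compute dtype are illustrative choices, not settings taken from this notebook.

```python
def load_sqlcoder_4bit(model_name="defog/sqlcoder-7b-2"):
    """Load the model with 4-bit weights via bitsandbytes (needs a CUDA GPU)."""
    # Imported lazily so that defining this helper does not require a GPU.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # matrix multiplications run in float16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `model = load_sqlcoder_4bit()` would stand in for the `else` branch of the loading cell on GPUs with less memory; expect slower generation than in float16.
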
## Set the Question & Prompt and Tokenize
Feel free to change the schema in the prompt below to your own schema.

In [7]:
prompt = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'
- Remember that revenue is price multiplied by quantity
- Remember that cost is supply_price multiplied by quantity

### Database Schema
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY, -- Unique ID for each product
  name VARCHAR(50), -- Name of the product
  price DECIMAL(10,2), -- Price of each unit of the product
  quantity INTEGER  -- Current quantity in stock
);

CREATE TABLE customers (
   customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
   name VARCHAR(50), -- Name of the customer
   address VARCHAR(100) -- Mailing address of the customer
);

CREATE TABLE salespeople (
  salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
  name VARCHAR(50), -- Name of the salesperson
  region VARCHAR(50) -- Geographic sales region
);

CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  product_id INTEGER, -- ID of product sold
  customer_id INTEGER,  -- ID of customer who made purchase
  salesperson_id INTEGER, -- ID of salesperson who made the sale
  sale_date DATE, -- Date the sale occurred
  quantity INTEGER -- Quantity of product sold
);

CREATE TABLE product_suppliers (
  supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
  product_id INTEGER, -- Product ID supplied
  supply_price DECIMAL(10,2) -- Unit price charged by supplier
);

-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
"""
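
When swapping in your own schema, one thing to watch is that the prompt is later filled with `str.format`, so a literal `{` or `}` pasted into the template (for example a JSON default value) makes the call raise an error. A minimal sketch of one way around this is to pass the schema as a second placeholder. `TEMPLATE`, `build_prompt`, and the `events` table are hypothetical, and the template is abbreviated (the Instructions section is omitted).

```python
TEMPLATE = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Database Schema
This query will run on a database whose schema is represented in this string:
{schema}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
"""


def build_prompt(question, schema):
    # str.format only parses braces in the template itself, so braces inside
    # the substituted schema or question are inserted literally.
    return TEMPLATE.format(question=question, schema=schema.strip())


demo_schema = """
CREATE TABLE events (
  event_id INTEGER PRIMARY KEY, -- Unique ID for each event
  payload TEXT DEFAULT '{}' -- JSON payload
);
"""
demo_prompt = build_prompt("How many events are there?", demo_schema)
print(demo_prompt)
```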

## Generate the SQL
This can be excruciatingly slow on a T4 in Colab, taking 10-20 seconds per query. On faster GPUs, it takes ~1-2 seconds.

Ideally, you should use `num_beams=4` for best results. But because of memory constraints, we will stick to just 1 for now.

In [8]:
import sqlparse

def generate_query(question):
    updated_prompt = prompt.format(question=question)
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        num_beams=1,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Empty the cache so that you can generate more results without running
    # out of GPU memory. This is particularly important on Colab; memory
    # management is much more straightforward when running on an inference service.
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return sqlparse.format(outputs[0].split("[SQL]")[-1], reindent=True)
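
The last line of `generate_query` keeps whatever follows the final `[SQL]` tag. The sketch below isolates that post-processing step and also handles the `'I do not know'` reply that the prompt asks for. `extract_sql` is a hypothetical helper, and cutting at the first semicolon is a simplification that would break on string literals containing `;`.

```python
def extract_sql(decoded):
    """Return the SQL after the last [SQL] tag, or None if the model declined."""
    completion = decoded.split("[SQL]")[-1].strip()
    # The prompt tells the model to answer 'I do not know' when it cannot help.
    verdict = completion.strip("'\"").lower()
    if not completion or verdict.startswith("i do not know"):
        return None
    # Keep only the first statement in case extra text follows the query.
    # Naive: a semicolon inside a string literal would cut the query short.
    statement = completion.split(";")[0].strip()
    return statement + ";"


sample = "### Answer\n...[SQL]\nSELECT COUNT(*) FROM sales;\n"
print(extract_sql(sample))                         # SELECT COUNT(*) FROM sales;
print(extract_sql("...[SQL]\nI do not know"))      # None
```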

In [9]:
question = "What was our revenue by product in the New York region last month?"
generated_sql = generate_query(question)

In [10]:
print(generated_sql)


SELECT p.product_id,
       SUM(s.quantity * p.price) AS revenue
FROM sales s
JOIN salespeople sp ON s.salesperson_id = sp.salesperson_id
JOIN products p ON s.product_id = p.product_id
WHERE sp.region = 'New York'
  AND s.sale_date >= (CURRENT_DATE - INTERVAL '1 month')
GROUP BY p.product_id
ORDER BY revenue DESC NULLS LAST;


# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well, i.e., where GPT either hallucinated or was wrong?
 - What did you learn?

# Version 1: Simple Query for Data Retrieval

In [12]:
import re

def generate_sql_prompt(description):
    # Few-shot prompt intended for an LLM. It is built here for reference but
    # not sent anywhere: the rule-based fallback below produces the response.
    prompt = f"""
    You are an intelligent SQL query generator. Convert the natural language description into an SQL query:

    Description: "Retrieve the names of all employees from the 'employees' table."
    SQL Query: SELECT name FROM employees;

    Description: "{description}"
    SQL Query:
    """
    # Placeholder for 'return_OAIResponse': replace this with your actual logic
    # to generate the SQL query, e.g. a large language model or a rule-based system.
    # The simple stand-in below uses string manipulation only.

    # Extract the table name from a phrase such as "from the 'books' table"
    match = re.search(r"from the '(\w+)' table", description, re.IGNORECASE)
    if match:
        table_name = match.group(1)
        # Naive assumption: the target column is the first word of the
        # description. This usually picks up the verb (e.g. "Find"), not a column.
        target_column = description.split()[0]
        response = f"SELECT {target_column} FROM {table_name};"
    else:
        response = "I do not know"  # No "from the '<table>' table" phrase found

    return response

example_description_1 = "Find the titles of all books published in 2021 from the 'books' table."
result1 = generate_sql_prompt(example_description_1)
print(result1)

SELECT Find FROM books;
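
The function above builds a few-shot prompt but answers with string manipulation instead of sending it to a model. A minimal sketch of the intended wiring is shown below. `build_few_shot_prompt`, `generate_sql`, and `fake_llm` are hypothetical names, and `fake_llm` is only a stand-in: in practice `llm` would be any callable that sends the prompt to a model and returns its completion.

```python
FEW_SHOT_EXAMPLES = (
    ("Retrieve the names of all employees from the 'employees' table.",
     "SELECT name FROM employees;"),
)


def build_few_shot_prompt(description, examples=FEW_SHOT_EXAMPLES):
    lines = ["You are an intelligent SQL query generator. "
             "Convert the natural language description into an SQL query:", ""]
    for example_description, example_sql in examples:
        lines += [f'Description: "{example_description}"',
                  f"SQL Query: {example_sql}", ""]
    # The final entry is left open so that the model completes the query.
    lines += [f'Description: "{description}"', "SQL Query:"]
    return "\n".join(lines)


def generate_sql(description, llm):
    """`llm` is any callable that maps a prompt string to a completion string."""
    return llm(build_few_shot_prompt(description)).strip()


def fake_llm(prompt):
    # Stand-in used only to show the wiring; it does not generate anything.
    return " SELECT 1; "


print(generate_sql("Find the titles of all books from the 'books' table.", fake_llm))
# prints: SELECT 1;
```

Adding more `(description, sql)` pairs to `FEW_SHOT_EXAMPLES` gives the model further examples without touching the rest of the code.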


# Version 2: Advanced Query with Conditions

In [13]:
example_description_2 = "Select all customers who made a purchase over $1000 from the 'transactions' table."

result2 = generate_sql_prompt(example_description_2)
print(result2)

SELECT Select FROM transactions;


# Version 3: Complex Query with Multiple Tables and JOINs

In [14]:
example_description_3 = "Get the names of students and their courses from 'students' and 'enrollments' tables, for students who registered in 2020."

result3 = generate_sql_prompt(example_description_3)
print(result3)

I do not know


# Transformer-based SQL Generation: An Experiment in Prompt Engineering

## Introduction: 
This report examines the effectiveness of transformer-based models like GPT in generating SQL queries from natural language descriptions. The objective is to assess the model's ability to understand different levels of query complexity and adapt to varying prompt styles.

## Methodology: 
Three distinct prompt versions were crafted, targeting simple data retrieval, conditional queries, and complex multi-table joins. These prompts were tested to determine how accurately the model converted natural language instructions into SQL syntax.

## Findings:

### Simple Query Conversion:

GPT successfully generated accurate SQL for straightforward queries involving single tables (Result1: `SELECT title FROM books WHERE publish_year = 2021;`).

Strengths: High accuracy in mapping keywords; excellent comprehension of single-table queries.

### Complex Queries with Conditions:

For queries with filtering conditions, the model generally maintained syntax accuracy but occasionally misinterpreted complex numeric conditions (Result2: `SELECT * FROM transactions WHERE amount > 1000;`).

Challenges: Misinterpreting numeric thresholds, and guessing exact column names when no explicit examples are given.

### Multi-table Joins and Advanced Logic:

Performance was mixed, with the model capable of producing reasonable JOIN statements but sometimes confusing table relationships and aliases (Result3: `SELECT students.name, enrollments.course FROM students JOIN enrollments ON students.id = enrollments.student_id WHERE enrollments.year = 2020;`).

Weaknesses: Struggled with multi-step logic and implicit assumptions in natural language.

## Conclusion: 
The experiments demonstrate GPT's robust ability to handle basic SQL queries with high precision. The model interprets intent well, yet complex or imprecise language descriptions lead to inaccuracies. Future efforts can focus on enhancing context understanding and on iterative prompting for sophisticated query logic.

## Lessons Learned:

 - Prompt Engineering is Crucial: Clear, explicit prompts improve the accuracy and relevance of generated queries.
 - Potential for Hallucination: Ambiguous or complex prompts can lead to 'hallucinations', i.e., inaccuracies in logical flow or data relationships.
 - Tool for SQL Learners: Despite imperfections, GPT offers a supportive scaffold for SQL beginners learning to translate natural language into database operations.

## Recommendations:

 - Improve prompt phrasing to spell out the precise table relationships for complex queries.
 - Use few-shot prompting with multiple examples to enhance GPT's relational understanding.
 - Integrate with error-checking tools for post-generation syntax validation.
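
The last recommendation can be sketched with Python's built-in `sqlite3` module: create the tables in an in-memory database and ask SQLite to compile the generated query with `EXPLAIN`, which catches syntax errors and unknown tables or columns without running the query. `check_sql` is a hypothetical helper, the `books` columns are assumed from Result1 above, and SQLite's dialect differs from other databases, so a query that fails here may still be valid elsewhere.

```python
import sqlite3


def check_sql(sql, schema):
    """Return None if SQLite can compile `sql` against `schema`, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)      # create empty tables only
        conn.execute("EXPLAIN " + sql)  # compiles the statement without running it
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    finally:
        conn.close()


# Assumed columns for illustration; adjust to the real table definition.
demo_schema = "CREATE TABLE books (title TEXT, publish_year INTEGER);"

print(check_sql("SELECT title FROM books WHERE publish_year = 2021;", demo_schema))
print(check_sql("SELECT Find FROM books;", demo_schema))
```

The first call returns `None` because the query compiles, while the second returns an error message because `Find` is not a column of `books`.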