# SQL Generation with Transformer API

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install torch transformers bitsandbytes accelerate sqlparse

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.m

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [4]:
torch.cuda.is_available()

True

In [5]:
available_memory = torch.cuda.get_device_properties(0).total_memory

In [6]:
print(available_memory)

15828320256


## Download the Model
Use any GPU on Colab (or any system with >30GB VRAM on your own machine) to load this in float16. If that is unavailable, use a GPU with at least 8GB of VRAM to load it in 8-bit, or at least 5GB of VRAM to load it in 4-bit.

This step can take around 5 minutes the first time. So please be patient :)
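
As a rough sanity check on the VRAM figures above, the weights alone need about (parameters × bits per parameter ÷ 8) bytes. The sketch below assumes roughly 7 billion parameters (inferred from the model name, not measured), treats 1 GB as 10^9 bytes, and ignores activations and the KV cache, so real usage will be higher. `weight_memory_gb` and `choose_precision` are hypothetical helpers; the latter uses the 8GB and 5GB minimums quoted above and the 15e9-byte float16 threshold from the loading cell that follows.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def choose_precision(total_memory_bytes):
    """Pick a loading precision label from total GPU memory.

    Thresholds (assumptions taken from this notebook): more than 15e9 bytes
    for float16, at least 8e9 for 8-bit, at least 5e9 for 4-bit.
    """
    if total_memory_bytes > 15e9:
        return "float16"
    if total_memory_bytes >= 8e9:
        return "8bit"
    if total_memory_bytes >= 5e9:
        return "4bit"
    return None  # not enough memory for any of the listed options


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(7e9, bits):.1f} GB")
    # 15828320256 is the total_memory value printed earlier in this notebook.
    print(choose_precision(15828320256))
```

The label is only a decision aid; how it maps to `from_pretrained` arguments depends on the installed transformers and bitsandbytes versions.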

In [7]:
model_name = "defog/sqlcoder-7b-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if available_memory > 15e9:
    # if you have at least 15GB of GPU memory, load the model in float16
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
        use_cache=True,
    )
else:
    # else, load in 8 bits – this is a bit slower
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        # torch_dtype=torch.float16,
        load_in_8bit=True,
        device_map="auto",
        use_cache=True,
    )

tokenizer_config.json:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/515 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/691 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/3.59G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

## Set the Question & Prompt and Tokenize
Feel free to replace the schema in the prompt below with your own schema.

In [8]:
prompt = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'
- Remember that revenue is price multiplied by quantity
- Remember that cost is supply_price multiplied by quantity

### Database Schema
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY, -- Unique ID for each product
  name VARCHAR(50), -- Name of the product
  price DECIMAL(10,2), -- Price of each unit of the product
  quantity INTEGER  -- Current quantity in stock
);

CREATE TABLE customers (
   customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
   name VARCHAR(50), -- Name of the customer
   address VARCHAR(100) -- Mailing address of the customer
);

CREATE TABLE salespeople (
  salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
  name VARCHAR(50), -- Name of the salesperson
  region VARCHAR(50) -- Geographic sales region
);

CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  product_id INTEGER, -- ID of product sold
  customer_id INTEGER,  -- ID of customer who made purchase
  salesperson_id INTEGER, -- ID of salesperson who made the sale
  sale_date DATE, -- Date the sale occurred
  quantity INTEGER -- Quantity of product sold
);

CREATE TABLE product_suppliers (
  supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
  product_id INTEGER, -- Product ID supplied
  supply_price DECIMAL(10,2) -- Unit price charged by supplier
);

-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
"""
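
Swapping in your own schema is easier when the schema and instructions are separate fields rather than part of one long string. The following is a minimal sketch of that idea, not the notebook's actual approach: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the section layout simply mirrors the prompt above.

```python
PROMPT_TEMPLATE = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
{instructions}

### Database Schema
This query will run on a database whose schema is represented in this string:
{schema}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
"""


def build_prompt(question, schema, instructions):
    """Fill the template with a question, a DDL schema string, and instruction lines."""
    bullet_list = "\n".join(f"- {line}" for line in instructions)
    return PROMPT_TEMPLATE.format(
        question=question,
        schema=schema.strip(),
        instructions=bullet_list,
    )


if __name__ == "__main__":
    demo_schema = """CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  quantity INTEGER -- Quantity of product sold
);"""
    print(build_prompt(
        "How many units were sold?",
        demo_schema,
        ["If you cannot answer the question with the available database schema, return 'I do not know'"],
    ))
```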

## Generate the SQL
This can be excruciatingly slow on a T4 in Colab, taking 10-20 seconds per query. On faster GPUs, it takes roughly 1-2 seconds.

Ideally, you should use `num_beams=4` for best results. Because of memory constraints, we stick to 1 for now.

In [9]:
import sqlparse

def generate_query(question):
    updated_prompt = prompt.format(question=question)
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        num_beams=1,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Empty the cache so that further queries can be generated without running out of memory.
    # This is particularly important on Colab; memory management is much more
    # straightforward when running on an inference service.
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return sqlparse.format(outputs[0].split("[SQL]")[-1], reindent=True)
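
Since `num_beams` and the prompt are the settings most likely to change between experiments, one option is to pass them as arguments instead of redefining the function each time. This is a hedged sketch rather than the notebook's method: `make_generate_query` is a hypothetical helper, it expects an already loaded `model` and `tokenizer`, and it leaves out the `sqlparse` formatting and cache clearing shown above.

```python
def make_generate_query(model, tokenizer, device="cuda"):
    """Return a generate_query function bound to an already loaded model and tokenizer."""

    def generate_query(question, prompt_template, num_beams=1, max_new_tokens=400):
        text = prompt_template.format(question=question)
        inputs = tokenizer(text, return_tensors="pt").to(device)
        generated_ids = model.generate(
            **inputs,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            num_beams=num_beams,
        )
        decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        # The decoded text still contains the prompt, so keep only
        # what follows the last [SQL] tag.
        return decoded[0].split("[SQL]")[-1].strip()

    return generate_query


# Intended use in the notebook (not executed here):
#   generate_query = make_generate_query(model, tokenizer)
#   sql = generate_query(question, prompt, num_beams=1)
```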

In [10]:
question = "What was our revenue by product in the New York region last month?"
generated_sql = generate_query(question)

In [12]:
print(generated_sql)


SELECT p.product_id,
       SUM(s.quantity * p.price) AS revenue
FROM sales s
JOIN salespeople sp ON s.salesperson_id = sp.salesperson_id
JOIN products p ON s.product_id = p.product_id
WHERE sp.region = 'New York'
  AND s.sale_date >= (CURRENT_DATE - INTERVAL '1 month')
GROUP BY p.product_id
ORDER BY revenue DESC NULLS LAST;


# Exercise
 - Complete the prompts similar to what we did in class.
     - Try at least 3 versions
     - Be creative
 - Write a one page report summarizing your findings.
     - Were there variations that didn't work well, i.e., where the model either hallucinated or was wrong?
 - What did you learn?

# **Tryout No. 01**

*Simple Query (Baseline)*

Purpose:
To evaluate how the model handles a simple query with straightforward calculations and a single date range.

In [13]:
question = "What is the total quantity of products sold last week?"
generated_sql = generate_query(question)

print(generated_sql)


SELECT SUM(s.quantity) AS total_quantity_sold
FROM sales s
WHERE s.sale_date >= (CURRENT_DATE - INTERVAL '1 week');


# **Tryout No. 02**

*Derived calculations (intermediate complexity)*

Optimized Prompt: In order for the model to "know" about the category column, we must explicitly mention it in the prompt schema.

Purpose:
To evaluate how the model handles additional calculations (revenue and cost) and grouping by categories.

In [14]:
prompt_02 = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Instructions
- If you cannot answer the question with the available database schema, return 'I do not know'
- Remember that revenue is price multiplied by quantity
- Remember that cost is supply_price multiplied by quantity

### Database Schema
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY, -- Unique ID for each product
  name VARCHAR(50), -- Name of the product
  category VARCHAR(50), -- Product category (e.g., Electronics, Clothing)
  price DECIMAL(10,2), -- Price of each unit of the product
  quantity INTEGER  -- Current quantity in stock
);

CREATE TABLE customers (
   customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
   name VARCHAR(50), -- Name of the customer
   address VARCHAR(100) -- Mailing address of the customer
);

CREATE TABLE salespeople (
  salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
  name VARCHAR(50), -- Name of the salesperson
  region VARCHAR(50) -- Geographic sales region
);

CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  product_id INTEGER, -- ID of product sold
  customer_id INTEGER,  -- ID of customer who made purchase
  salesperson_id INTEGER, -- ID of salesperson who made the sale
  sale_date DATE, -- Date the sale occurred
  quantity INTEGER -- Quantity of product sold
);

CREATE TABLE product_suppliers (
  supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
  product_id INTEGER, -- Product ID supplied
  supply_price DECIMAL(10,2) -- Unit price charged by supplier
);

-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]
"""

In [15]:
import sqlparse

def generate_query(question):
    updated_prompt = prompt_02.format(question=question)
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        num_beams=1,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # empty cache so that you do generate more results w/o memory crashing
    # particularly important on Colab – memory management is much more straightforward
    # when running on an inference service
    return sqlparse.format(outputs[0].split("[SQL]")[-1], reindent=True)

In [16]:
question = "What was the total revenue and cost by product category last quarter?"
generated_sql = generate_query(question)

print(generated_sql)


SELECT p.category,
       SUM(p.price * s.quantity) AS total_revenue,
       SUM(ps.supply_price * s.quantity) AS total_cost
FROM products p
JOIN sales s ON p.product_id = s.product_id
JOIN product_suppliers ps ON p.product_id = ps.product_id
WHERE EXTRACT(QUARTER
              FROM s.sale_date) = 4
  AND EXTRACT(YEAR
              FROM s.sale_date) = EXTRACT(YEAR
                                          FROM CURRENT_DATE)
GROUP BY p.category;
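
The generated filter above compares the quarter number with the literal 4 rather than deriving it from the run date. One possible workaround, which is not something the model produced, is to compute the quarter boundaries in Python and pass them to the query as parameters. The helper name `previous_quarter_range` is hypothetical.

```python
from datetime import date


def previous_quarter_range(today):
    """Return (start, end) of the quarter before the one containing `today`.

    `start` is inclusive and `end` is exclusive (the first day of the current quarter).
    """
    current_quarter = (today.month - 1) // 3  # 0 for Jan-Mar, ..., 3 for Oct-Dec
    current_start = date(today.year, current_quarter * 3 + 1, 1)
    if current_quarter == 0:
        previous_start = date(today.year - 1, 10, 1)
    else:
        previous_start = date(today.year, (current_quarter - 1) * 3 + 1, 1)
    return previous_start, current_start


if __name__ == "__main__":
    start, end = previous_quarter_range(date.today())
    # These bounds could feed a filter such as:
    #   WHERE s.sale_date >= <start> AND s.sale_date < <end>
    print(start, end)
```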


# **Tryout No. 03**

*Advanced Filters (Numeric Conditions)*

Purpose:
To evaluate how the model handles advanced filters including numeric conditions (> $10,000) and filters by region and date.

Uses prompt_02 and the same configuration as Tryout No. 02.

In [17]:
question = "Which salespeople in the California region generated more than $10,000 in revenue last month?"
generated_sql = generate_query(question)
print(generated_sql)


SELECT s.salesperson_id,
       s.name,
       SUM(p.price * s.quantity) AS total_revenue
FROM sales s
JOIN products p ON s.product_id = p.product_id
WHERE s.salesperson_id IN
    (SELECT salesperson_id
     FROM sales
     WHERE sale_date >= (CURRENT_DATE - INTERVAL '1 month')
       AND quantity >= 10)
  AND s.region = 'California'
GROUP BY s.salesperson_id,
         s.name
HAVING SUM(p.price * s.quantity) > 10000
ORDER BY total_revenue DESC NULLS LAST;
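
The query above refers to `s.region` and `s.name`, although `s` is the alias for `sales`, which has neither column. A check of this kind can be partly automated. The sketch below is a regex heuristic under stated assumptions, not a SQL parser: `SCHEMA` is transcribed by hand from prompt_02, `find_bad_column_refs` is a hypothetical helper, and only simple `FROM table alias` / `JOIN table alias` forms are recognized.

```python
import re

# Columns per table, transcribed from the prompt_02 schema.
SCHEMA = {
    "products": {"product_id", "name", "category", "price", "quantity"},
    "customers": {"customer_id", "name", "address"},
    "salespeople": {"salesperson_id", "name", "region"},
    "sales": {"sale_id", "product_id", "customer_id", "salesperson_id",
              "sale_date", "quantity"},
    "product_suppliers": {"supplier_id", "product_id", "supply_price"},
}

# Words that may follow a table name but are not aliases.
SQL_KEYWORDS = {"where", "join", "on", "group", "order", "having", "inner",
                "left", "right", "full", "cross", "union", "limit"}

TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)
ALIAS_PATTERN = re.compile(r"\s+(?:AS\s+)?(\w+)", re.IGNORECASE)
REF_PATTERN = re.compile(r"\b(\w+)\.(\w+)\b")


def find_bad_column_refs(sql, schema=SCHEMA):
    """Return qualifier.column references whose column is missing from the aliased table."""
    aliases = {}
    for match in TABLE_PATTERN.finditer(sql):
        table = match.group(1).lower()
        if table not in schema:
            continue  # e.g. EXTRACT(YEAR FROM s.sale_date)
        aliases[table] = table
        following = ALIAS_PATTERN.match(sql, match.end())
        if following and following.group(1).lower() not in SQL_KEYWORDS:
            aliases[following.group(1).lower()] = table

    bad = []
    for qualifier, column in REF_PATTERN.findall(sql):
        table = aliases.get(qualifier.lower())
        if table and column.lower() not in schema[table]:
            reference = f"{qualifier}.{column}"
            if reference not in bad:
                bad.append(reference)
    return bad
```

Running it on a generated query before execution would flag references like `s.region` early; references with unknown qualifiers are ignored rather than reported.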


# **Tryout No. 04**

*Complex grouping (num_beams=2 attempted)*

Purpose:
Challenge the model with:

1. Complex grouping by multiple columns (region and sale_id).
2. Additional calculations (average revenue and cost).
3. Setting num_beams=2 to see whether it improves the quality of the result.

Uses prompt_02, as in Tryout No. 02.

Notes:
1. The original intention was to use num_beams=4, but it would not run because of memory limitations.
2. In addition, max_new_tokens was reduced to 300.
3. After the last attempt, it was observed that the proposed changes (num_beams=2, max_new_tokens=300) also did not run because of the same memory limitations. The configuration was therefore changed back to its original design (num_beams=1, max_new_tokens=400).
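
The manual trial and error described in these notes can be wrapped in a small fallback loop. This is a sketch under assumptions: `generate_with_beam_fallback` is a hypothetical helper, `generate_fn` stands for any callable that accepts `num_beams`, and the out-of-memory condition is detected by catching `RuntimeError` and inspecting its message (PyTorch's CUDA out-of-memory error is assumed to be a `RuntimeError` subclass whose message contains "out of memory").

```python
def generate_with_beam_fallback(generate_fn, question, beam_options=(4, 2, 1)):
    """Try each num_beams value in order, falling back on out-of-memory errors.

    Returns a (sql, num_beams_used) tuple.
    """
    last_error = None
    for num_beams in beam_options:
        try:
            return generate_fn(question, num_beams=num_beams), num_beams
        except RuntimeError as error:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(error).lower():
                raise
            last_error = error
    raise RuntimeError("all num_beams options ran out of memory") from last_error
```

In the notebook it would also make sense to call `torch.cuda.empty_cache()` between attempts, as the cells here already do before generation.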

In [23]:
import sqlparse

def generate_query(question):
    updated_prompt = prompt_02.format(question=question)
    inputs = tokenizer(updated_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        num_beams=1,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # empty cache so that you do generate more results w/o memory crashing
    # particularly important on Colab – memory management is much more straightforward
    # when running on an inference service
    return sqlparse.format(outputs[0].split("[SQL]")[-1], reindent=True)

In [24]:
# To avoid CUDA out of memory issue
torch.cuda.empty_cache()

In [25]:
question = "What was the average revenue and cost per sale in each region during the last 6 months?"
generated_sql = generate_query(question)
print(generated_sql)


SELECT s.salesperson_id,
       s.region,
       AVG(p.price * s.quantity) AS average_revenue,
       AVG(ps.supply_price * s.quantity) AS average_cost
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN product_suppliers ps ON p.product_id = ps.product_id
WHERE s.sale_date >= (CURRENT_DATE - INTERVAL '6 months')
GROUP BY s.salesperson_id,
         s.region
ORDER BY average_revenue DESC,
         average_cost DESC NULLS LAST;


# **Tryout No. 05**

*What-if Exploration (Maximum Complexity)*

Purpose:
To evaluate how the model handles what-if calculations and advanced mathematical operations.

Uses prompt_02 and the same configuration as Tryouts No. 02 and No. 04.

In [26]:
question = "What was the average revenue and cost per sale in each region during the last 6 months?"
generated_sql = generate_query(question)
print(generated_sql)


SELECT s.salesperson_id,
       s.region,
       AVG(p.price * s.quantity) AS average_revenue,
       AVG(ps.supply_price * s.quantity) AS average_cost
FROM sales s
JOIN products p ON s.product_id = p.product_id
JOIN product_suppliers ps ON p.product_id = ps.product_id
WHERE s.sale_date >= (CURRENT_DATE - INTERVAL '6 months')
GROUP BY s.salesperson_id,
         s.region
ORDER BY average_revenue DESC,
         average_cost DESC NULLS LAST;


# **Lab Report**

**Key Learning Points**

1. Prompt Engineering:

- Changes in the database schema within the prompt can significantly improve query accuracy. For instance, adding the category column in Tryout 2 allowed the model to generate queries with grouping based on product categories.

- Explicit instructions about calculations (e.g., revenue = price * quantity) guided the model to compute derived metrics correctly.

2. Handling Query Complexity:

- The model performs well with straightforward queries and basic filtering (Tryout 1).

- It successfully handled derived calculations (Tryout 2) and advanced filters with numeric conditions (Tryout 3).

- Complex scenarios such as grouping by multiple columns and hypothetical calculations (Tryouts 4 and 5) revealed some limitations, particularly in mathematical projections.

3. Resource Management:

- Using num_beams=4 for better results caused out-of-memory errors in the Colab environment, so generation fell back to num_beams=1.

- Adjusting max_new_tokens helped balance memory usage and output length.

**Challenges**

1. Memory Constraints:

- Increasing num_beams for improved results led to GPU memory errors. Solutions involved lowering num_beams and clearing the GPU cache before execution.

2. Hypothetical Scenarios:

- The model struggled to interpret what-if scenarios (e.g., Tryout 5) that involved hypothetical transformations of data, such as a 10% price increase.

3. Schema Dependencies:

- The prompt changes from Tryout 2 introduced new schema columns (e.g., category), which required precise descriptions in the prompt to ensure correct query generation.

**Tryouts Overview**

1. Tryout 1: Simple Query (Baseline)

- Question: "What is the total quantity of products sold last week?"
- Prompt: Default schema.
- Result: The query correctly summed the quantities sold in the last week using a single table (sales), demonstrating strong baseline performance.
- Impact: This validated the model's ability to handle simple queries with direct filtering.

2. Tryout 2: Derived Metrics with Prompt Update
- Question: "What was the total revenue and cost by product category last quarter?"
- Prompt Change: Added the category column to the products table.
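- Result: The query calculated revenue and cost grouped by category. Its date filter, however, hard-codes quarter 4 of the current year (EXTRACT(QUARTER FROM s.sale_date) = 4), so it does not track "last quarter" relative to the date the query is run.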
- Impact: The schema modification enabled the model to handle category-based aggregations, showcasing how schema changes in the prompt directly influence query generation.

3. Tryout 3: Advanced Filters
- Question: "Which salespeople in the California region generated more than $10,000 in revenue last month?"
- Prompt: Same as Tryout 2.
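- Result: The query included a region filter and a revenue threshold (HAVING ... > 10000). However, it referenced region and name through the sales alias (s.region, s.name), although both columns belong to the salespeople table, which the query never joins. It also added an unrequested quantity >= 10 condition in the subquery.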
- Impact: The query demonstrated the model's ability to implement advanced numeric filters, although column references need to be explicitly clarified in the schema.

4. Tryout 4: Grouping with num_beams=1
- Question: "What was the average revenue and cost per sale in each region during the last 6 months?"
- Prompt: Same as Tryout 2.
- Result: The query calculated averages for revenue and cost over the last 6 months, but it grouped by salesperson_id as well as region and referenced s.region without joining the salespeople table, where region is defined.
- Impact: Highlighted the model's ability to handle complex grouping but reinforced the need for schema clarity for column references.

5. Tryout 5: Hypothetical Scenario
- Question: "If we increase the price of all products by 10%, what would be the projected total revenue for each region this year?"
- Prompt: Same as Tryout 2.
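- Result: The output contains no hypothetical projection (price * 1.10) and is identical to the Tryout 4 query. Note that cell In [26], as recorded, passes the Tryout 4 question to generate_query, so this what-if question may not have been submitted in that run.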
- Impact: Revealed the model's limitations in interpreting hypothetical scenarios, suggesting the need for explicit examples of such transformations in the prompt.
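
Because several cells reuse the same `question` variable, the question listed in an overview can drift from the one that was actually run. A hedged way to keep them paired is to record each question next to its output. `run_tryouts` and `find_repeated_questions` are hypothetical helpers, and `generate_fn` stands for any callable that maps a question to SQL.

```python
def run_tryouts(questions, generate_fn):
    """Run each (label, question) pair and keep the question next to its output."""
    results = []
    for label, question in questions:
        results.append({
            "tryout": label,
            "question": question,
            "sql": generate_fn(question),
        })
    return results


def find_repeated_questions(results):
    """Return (label, earlier_label) pairs for questions that were already asked."""
    seen = {}
    repeated = []
    for row in results:
        key = row["question"].strip().lower()
        if key in seen:
            repeated.append((row["tryout"], seen[key]))
        else:
            seen[key] = row["tryout"]
    return repeated
```

A non-empty result from `find_repeated_questions` signals that two tryouts submitted the same question text and therefore cannot be compared as separate experiments.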

**What Worked and What Didn't**

1. Worked:

- Queries with basic filters and derived calculations (Tryouts 1-3).
- Handling advanced grouping and aggregations (Tryout 4).

2. Didn't Work:
- Reference errors for specific columns (e.g., region in Tryout 3).
- Mathematical projections for hypothetical scenarios (Tryout 5).

**Observations on Hallucinations**

- The model did not invent tables, but it attributed columns to the wrong table (s.region and s.name in Tryouts 3 and 4) and added an unrequested quantity >= 10 filter in Tryout 3. The Tryout 5 output repeated the Tryout 4 query instead of producing new logic.

**Conclusions**

1. Capabilities:

- The model excels in generating SQL queries for standard use cases with clear schema definitions and straightforward logic.
- More explicit guidance is required for hypothetical or mathematically transformed scenarios.

2. Improvements Needed:

- Include specific examples in the prompt for projections or what-if scenarios.
- Ensure precise column references in the schema to avoid misinterpretations.
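
The first improvement can be tried without rewriting the whole prompt. The sketch below appends one bullet to the `### Instructions` section of a prompt laid out like the ones in this notebook. `add_instruction` is a hypothetical helper, the example wording is only an illustration, and whether such an instruction actually changes the model's output has not been tested here.

```python
def add_instruction(prompt_template, instruction):
    """Append one bullet to the '### Instructions' section of a prompt template.

    Assumes the instructions section is immediately followed by a blank line
    and the '### Database Schema' heading, as in this notebook's prompts.
    The instruction must not contain curly braces, because the template is
    later filled with str.format.
    """
    marker = "\n\n### Database Schema"
    head, separator, tail = prompt_template.partition(marker)
    if not separator:
        raise ValueError("prompt has no '### Database Schema' section")
    return f"{head}\n- {instruction}{separator}{tail}"


if __name__ == "__main__":
    demo_template = (
        "### Instructions\n"
        "- Remember that revenue is price multiplied by quantity\n"
        "\n"
        "### Database Schema\n"
        "CREATE TABLE t (x INTEGER);\n"
    )
    print(add_instruction(
        demo_template,
        "For a hypothetical price increase of 10%, multiply price by 1.10",
    ))
```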
