
Several strategies can make the English-to-SQL process more efficient, accurate, and context-aware when using LLMs to generate SQL code. The improvements and best practices below can enhance the workflow:

### 1. **Schema Embedding for Context-Awareness**

Instead of just including the table structure in the prompt, you can dynamically retrieve and include more detailed schema information, such as data types, primary keys, and relationships between tables. This way, GPT has more context to generate accurate queries.

#### Example Code

```python
def get_detailed_table_schema(df):
    columns_info = ", ".join([f"{col} ({str(df[col].dtype)})" for col in df.columns])
    return f"### Table structure: Sales ({columns_info})\n"

def create_full_prompt(df, user_query):
    schema_info = get_detailed_table_schema(df)
    prompt = f"{schema_info}\n### User Request: {user_query}\n### SQL Query:\nSELECT"
    return prompt
```

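Including data types helps GPT decide how to treat each column, for example whether a `date` column needs date-based operations or a numeric field should be aggregated. Note that a DataFrame carries only dtypes; primary keys and relationships have to come from the database itself or be supplied separately.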
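To show how primary keys and relationships could reach the prompt as well, here is a minimal sketch that assumes the metadata is written out by hand as a plain dictionary. The `Customers` and `Orders` tables, their columns, and the `format_schema` helper are hypothetical and not part of the original dataset.

```python
# Hypothetical schema metadata; table and column names are illustrative only.
SCHEMA = {
    "Customers": {
        "columns": {"customer_id": "INTEGER", "city": "TEXT"},
        "primary_key": "customer_id",
        "foreign_keys": {},
    },
    "Orders": {
        "columns": {"order_id": "INTEGER", "customer_id": "INTEGER", "sales": "REAL"},
        "primary_key": "order_id",
        # column -> "referenced_table.referenced_column"
        "foreign_keys": {"customer_id": "Customers.customer_id"},
    },
}

def format_schema(schema):
    """Render each table as one prompt line with types, keys, and relationships."""
    lines = []
    for table, meta in schema.items():
        cols = []
        for name, dtype in meta["columns"].items():
            desc = f"{name} {dtype}"
            if name == meta["primary_key"]:
                desc += " PRIMARY KEY"
            if name in meta["foreign_keys"]:
                desc += f" REFERENCES {meta['foreign_keys'][name]}"
            cols.append(desc)
        lines.append(f"### Table structure: {table} ({', '.join(cols)})")
    return "\n".join(lines)

print(format_schema(SCHEMA))
```

The resulting text can be placed at the top of the prompt in the same position as the single-table schema line, so the model sees which columns can be joined.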

### 2. **Example Queries in System Prompt for Few-Shot Learning**

You can steer GPT toward better results by showing it example pairs of natural language requests and SQL queries written for similar table structures. This approach is called few-shot learning (or few-shot prompting). The model's weights are not updated; the examples in the conversation simply show it which SQL syntax and patterns are expected.

#### Example Code

```python
example_queries = [
    {"role": "system", "content": "You are an assistant that writes SQL queries."},
    {"role": "user", "content": "Get total sales by product."},
    {"role": "assistant", "content": "SELECT product_id, SUM(sales) AS total_sales FROM Sales GROUP BY product_id;"},
    {"role": "user", "content": "Calculate the average sales by city."},
    {"role": "assistant", "content": "SELECT city, AVG(sales) AS avg_sales FROM Sales GROUP BY city;"}
]

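def get_sql_query_from_gpt(df, user_query):
    # Assumes `import openai` (version 1.0 or later) and a MODEL name defined elsewhere
    prompt = create_full_prompt(df, user_query)
    # Few-shot examples come first, followed by the actual request
    response = openai.chat.completions.create(
        model=MODEL,
        messages=example_queries + [{"role": "user", "content": prompt}],
        temperature=0,  # keep output variation low for code generation
        max_tokens=150
    )
    return response.choices[0].message.content.strip()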
```

This method uses example queries as part of the conversation history, guiding GPT to produce similar responses.

### 3. **Prompt Refinement and Iterative Query Refinement**

If the initial response from GPT is not correct, you can iteratively refine the query by analyzing the initial response and adjusting the prompt. For example, if the LLM didn’t group data correctly, you could add feedback to the prompt, such as "Ensure the query includes a GROUP BY clause."
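One way to automate part of this loop is to check each candidate query against an empty copy of the table and feed any database error back to the model as the next message. The sketch below assumes SQLite as the validation engine. `generate` is a hypothetical callable standing in for the LLM call (it takes a list of chat messages and returns a SQL string), and `fake_llm` is a canned stub so the example runs offline.

```python
import sqlite3

def validate_sql(sql, create_table_sql):
    """Return None if SQLite can plan the query, else the error message."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_table_sql)   # empty table with the right columns
        conn.execute(f"EXPLAIN {sql}")   # compiles the query without needing data
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

def refine_query(generate, user_query, create_table_sql, max_attempts=3):
    """Ask for a query; on failure, append the error as feedback and retry."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_attempts):
        sql = generate(messages)
        error = validate_sql(sql, create_table_sql)
        if error is None:
            return sql
        messages.append({"role": "assistant", "content": sql})
        messages.append({"role": "user",
                         "content": f"That query failed with: {error}. Please fix it."})
    return None  # no valid query within the attempt budget

# Offline stand-in for the model (hypothetical): wrong column first, then corrected.
def fake_llm(messages):
    if len(messages) == 1:
        return "SELECT town, AVG(sales) FROM Sales GROUP BY town"
    return "SELECT city, AVG(sales) FROM Sales GROUP BY city"

result = refine_query(fake_llm, "Average sales by city.",
                      "CREATE TABLE Sales (city TEXT, sales REAL)")
print(result)
```

This check only catches syntax errors and references to missing tables or columns. Logical problems, such as a missing `GROUP BY`, still need hand-written feedback like the hint quoted above.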

### 4. **Using Specialized SQL Models**

Some LLMs are fine-tuned specifically for SQL or programming tasks. Using a specialized model, such as OpenAI Codex or similar models from other platforms, can improve accuracy and reduce the need for prompt engineering.

### 5. **Use Structured Data to Automate Column Extraction and Formatting**

Automate the process of extracting column names and formatting them as SQL-friendly prompts. You can integrate this with SQLAlchemy or other database libraries to inspect the schema directly from a live database and format it dynamically.
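As a rough illustration of reading the schema from a live database, the sketch below uses Python's built-in `sqlite3` module instead of SQLAlchemy so that it runs without extra dependencies. The in-memory database, its two tables, and the `describe_database` helper are hypothetical.

```python
import sqlite3

def describe_database(conn):
    """Build one prompt line per table from SQLite's own metadata."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    lines = []
    for table in tables:
        # foreign_key_list rows: (id, seq, ref_table, from_col, to_col, ...)
        fks = {row[3]: f"{row[2]}.{row[4]}"
               for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')}
        cols = []
        # table_info rows: (cid, name, type, notnull, default, pk)
        for _, name, dtype, _, _, pk in conn.execute(f'PRAGMA table_info("{table}")'):
            desc = f"{name} {dtype}"
            if pk:
                desc += " PRIMARY KEY"
            if name in fks:
                desc += f" REFERENCES {fks[name]}"
            cols.append(desc)
        lines.append(f"### Table structure: {table} ({', '.join(cols)})")
    return "\n".join(lines)

# Small demo database (hypothetical tables)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES Customers(customer_id),
        sales REAL
    );
""")
print(describe_database(conn))
```

With SQLAlchemy, the same information should be available through the inspector returned by `inspect(engine)`, which would make the helper work across database backends.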

### 6. **Multi-Step Reasoning for Complex Queries**

For more complex queries involving joins, subqueries, or conditional logic, you can use multi-step reasoning. This involves breaking down a complex natural language query into parts, generating partial SQL code, and combining the parts.

For example:
1. Generate a basic query to retrieve specific columns.
2. Generate additional clauses (e.g., `WHERE`, `GROUP BY`).
3. Combine each part into a final query.
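
The three steps above can be sketched as a small pipeline in which each clause is requested separately and then assembled. `ask` is a hypothetical callable wrapping the LLM call (it takes an instruction string and returns a SQL fragment), and `fake_ask` returns canned fragments so the example runs without an API key.

```python
def build_query_in_steps(ask, table, request):
    """Generate the SELECT list and optional clauses separately, then combine them."""
    # Step 1: the basic query (which columns or aggregates to retrieve).
    select_list = ask(f"List only the SELECT expressions for: {request}")
    # Step 2: additional clauses, each requested on its own.
    where = ask(f"Give only the WHERE condition (or nothing) for: {request}")
    group_by = ask(f"Give only the GROUP BY columns (or nothing) for: {request}")
    # Step 3: combine the non-empty parts into the final query.
    parts = [f"SELECT {select_list.strip()}", f"FROM {table}"]
    if where.strip():
        parts.append(f"WHERE {where.strip()}")
    if group_by.strip():
        parts.append(f"GROUP BY {group_by.strip()}")
    return " ".join(parts)

# Canned answers standing in for model responses (hypothetical).
def fake_ask(instruction):
    if "SELECT expressions" in instruction:
        return "city, AVG(sales) AS avg_sales"
    if "WHERE" in instruction:
        return "year = 2004"
    return "city"

sql = build_query_in_steps(fake_ask, "Sales", "Average 2004 sales by city.")
print(sql)
```

Keeping each step small makes it easier to see which part of a complex query went wrong, at the cost of several model calls per request.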

### Example: Putting It All Together

Here’s a more comprehensive example that combines a schema extraction function, dynamic prompt generation, and few-shot examples into a complete workflow, with the aim of improving the accuracy of the SQL code generated by the LLM.

In [4]:
# !pip install python-dotenv openai

In [9]:
# Essential imports
import os
from dotenv import load_dotenv
import pandas as pd  # For data loading and manipulation
import openai         # For accessing the OpenAI API to generate SQL queries

# Optional, for database schema inspection if connected to a database
from sqlalchemy import create_engine, inspect  # To connect to a database and inspect schema

In [10]:
# Load the environment variables from the .env file
load_dotenv('/content/API_KEYS.env')

# Get the API key from the environment
api_key = os.getenv("OPENAI_API_KEY")

# Confirm the key loaded without printing the whole secret
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not found in the .env file")
print(f"API Key loaded from .env: {api_key[:7]}...")

# Set the API key for the OpenAI library
openai.api_key = api_key

API Key loaded from .env: sk-proj...


In [14]:
MODEL = 'gpt-4'  # Use a model that supports code or SQL generation effectively

def get_schema_info(df):
    columns_info = ", ".join([f"{col} ({str(df[col].dtype)})" for col in df.columns])
    return f"### Table structure: Sales ({columns_info})\n"

def create_prompt_with_examples(df, user_query):
    schema_info = get_schema_info(df)
    examples = [
        {"role": "system", "content": "You are an assistant that generates SQL queries."},
        {"role": "user", "content": "Get total sales by product."},
        {"role": "assistant", "content": "SELECT product_id, SUM(sales) AS total_sales FROM Sales GROUP BY product_id;"},
        {"role": "user", "content": "Calculate the average sales by city."},
        {"role": "assistant", "content": "SELECT city, AVG(sales) AS avg_sales FROM Sales GROUP BY city;"}
    ]
    prompt = f"{schema_info}\n### User Request: {user_query}\n### SQL Query:\nSELECT"
    return examples + [{"role": "user", "content": prompt}]

def get_sql_query(df, user_query):
    """
    Generates a SQL query using GPT based on the provided table structure and natural language query.

    Parameters:
        df (pd.DataFrame): DataFrame to define the SQL table structure.
        user_query (str): The user's natural language query.

    Returns:
        str: The SQL query generated by GPT, ensuring it starts with 'SELECT'.
    """
    messages = create_prompt_with_examples(df, user_query)

    try:
        # Send the prompt to GPT and retrieve the response
        response = openai.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=0,
            max_tokens=150,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0,
            stop=["#", ";"]
        )

        # Retrieve and clean the generated query
        query = response.choices[0].message.content.strip()

        # Check if 'SELECT' is at the start of the query; if not, add it
        if not query.upper().startswith("SELECT"):
            query = "SELECT " + query

        return query

    except Exception as e:
        print("An error occurred while generating the SQL query:", e)
        return None


# Load data and test query generation
df = pd.read_csv("/content/sales_data_sample.csv")
query = "Calculate the average sales by city."
generated_sql = get_sql_query(df, query)
print("Generated SQL Query:\n", generated_sql)

Generated SQL Query:
 SELECT CITY, AVG(SALES) AS avg_sales FROM Sales GROUP BY CITY
