<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-sql.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎨 Navigator Data Designer SDK: Text-to-SQL

This notebook demonstrates how to use the Gretel Navigator SDK to create a synthetic data generation pipeline for SQL code examples. We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses.


In [None]:
%%capture
!pip install -U git+https://github.com/gretelai/gretel-python-client

In [None]:
from gretel_client.navigator import DataDesigner

## 📘 Setting Up the Data Designer

First, we'll initialize the Data Designer with appropriate system instructions.

In [None]:
data_designer = DataDesigner(
    api_key="prompt",
    model_suite="apache-2.0",  # Use apache-2.0 or llama-3.x based on your licensing needs
    endpoint="https://api-dev.gretel.cloud",
    special_system_instructions="""\
You are an expert at writing, analyzing, and editing SQL queries. You know what
high-quality, clean, efficient, and maintainable SQL code looks like. You
excel at transforming natural language into SQL, as well as SQL back into
natural language. Your job is to assist the user with their SQL-related tasks.
"""
)

## 🌱 Define Categorical Seed Columns

We'll set up seed columns for industry sectors, SQL complexity, task types, and instruction phrases. These will help generate diverse and relevant code examples.

In [None]:
data_designer.add_categorical_seed_column(
    name="industry_sector",
    values=["Healthcare", "Finance", "Technology"],
    subcategories=[
        {
            "name": "topic",
            "values": {
                "Healthcare": [
                    "Electronic Health Records (EHR) Systems", 
                    "Telemedicine Platforms", 
                    "AI-Powered Diagnostic Tools",
                ],
                "Finance": [
                    "Fraud Detection Software", 
                    "Automated Trading Systems", 
                    "Personal Finance Apps",
                ],
                "Technology": [
                    "Cloud Computing Platforms", 
                    "Artificial Intelligence and Machine Learning Platforms", 
                    "DevOps and Continuous Integration/Continuous Deployment (CI/CD) Tools",
                ]
            }
        }
    ]
)

data_designer.add_categorical_seed_column(
    name="sql_complexity",
    values=["Beginner", "Intermediate", "Advanced"],
    subcategories=[
        {
            "name": "sql_concept",
            "values": {
                "Beginner": [
                    "Basic SQL", 
                    "SELECT Statements", 
                    "WHERE Clauses", 
                    "Basic JOINs", 
                    "INSERT, UPDATE, DELETE",
                ],
                "Intermediate": [
                    "Aggregation", 
                    "Single JOIN", 
                    "Subquery", 
                    "Views", 
                    "Stored Procedures",
                ],
                "Advanced": [
                    "Multiple JOINs", 
                    "Window Functions", 
                    "Common Table Expressions (CTEs)", 
                    "Triggers", 
                    "Query Optimization",
                ]
            }
        },
        {
            "name": "sql_complexity_description",
            "description": "A description of the given SQL complexity level and SQL concept.",
            "num_new_values_to_generate": 1
        }
    ]
)

data_designer.add_categorical_seed_column(
    name="sql_task_type",
    values=[
        "Data Retrieval",
        "Data Definition",
        "Data Manipulation",
        "Analytics and Reporting",
        "Database Administration",
        "Data Cleaning and Transformation",
    ],
    subcategories=[
        {
            "name": "sql_task_type_description",
            "description": "A brief description of the SQL task type.",
            "num_new_values_to_generate": 1
        }
    ]
)

data_designer.add_categorical_seed_column(
    name="instruction_phrase",
    values=[
        "Construct an SQL query to",
        "Formulate an SQL statement that",
        "Implement an SQL view that",
    ]
)
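
To make the relationship between a seed column and its subcategories concrete, here is a minimal plain-Python sketch of conditional sampling: a sector is drawn first, then a topic that belongs to that sector. This is only an illustration under our own assumptions. The names `SECTOR_TOPICS`, `INSTRUCTION_PHRASES`, and `sample_seed_row` are hypothetical, the value lists are trimmed copies of the ones above, and the Data Designer's actual sampling strategy may differ.

```python
import random

# Hypothetical, trimmed-down copy of the seed values defined above.
SECTOR_TOPICS = {
    "Healthcare": ["Telemedicine Platforms", "AI-Powered Diagnostic Tools"],
    "Finance": ["Fraud Detection Software", "Personal Finance Apps"],
    "Technology": ["Cloud Computing Platforms"],
}
INSTRUCTION_PHRASES = [
    "Construct an SQL query to",
    "Formulate an SQL statement that",
    "Implement an SQL view that",
]


def sample_seed_row(rng: random.Random) -> dict:
    """Draw one seed row: pick a sector, then a topic valid for that sector."""
    sector = rng.choice(list(SECTOR_TOPICS))
    return {
        "industry_sector": sector,
        "topic": rng.choice(SECTOR_TOPICS[sector]),
        "instruction_phrase": rng.choice(INSTRUCTION_PHRASES),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_seed_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the topic is always drawn from the list keyed by the sampled sector, a row can never pair, say, "Finance" with a healthcare topic.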


## ✨ Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation.

In [None]:
data_designer.add_generated_data_column(
    name="sql_prompt",
    generation_prompt="""\
Generate a clear and specific natural language instruction for creating an SQL query tailored 
to the {industry_sector} sector, focusing on the {topic} topic and the {sql_task_type} task.
The instruction should begin with the following phrase: "{instruction_phrase}".

Important Guidelines:
* Industry Relevance: Ensure the instruction is directly related to the {industry_sector} sector and the {topic} topic.
* Task Specificity: Clearly define the SQL task type ({sql_task_type}) to provide focused and actionable requirements.
* Complexity Alignment: Align the instruction with the appropriate SQL complexity level by implicitly incorporating relevant SQL concepts.
* Clarity and Precision: Craft the instruction to be unambiguous and straightforward, providing all necessary context without unnecessary verbosity.
* Response Formatting: Exclude any markers or similar formatting cues from the instruction.
""",
    columns_to_list_in_prompt=["industry_sector", "topic", "sql_task_type", "instruction_phrase"]
)


data_designer.add_generated_data_column(
    name="sql_context",
    generation_prompt="""\
Generate a set of database tables and views that are pertinent to the SQL instruction in {sql_prompt} and the 
task type {sql_task_type} within the {industry_sector} sector and {topic} topic.

Important Guidelines:
* Relevance: Ensure that all generated tables and views are directly related to the {industry_sector} sector and the {topic} topic.
* Completeness: Include all essential columns with appropriate data types, primary/foreign keys, and necessary constraints.
* Realism: Design realistic and practical table schemas that reflect typical structures used in the specified industry sector.
* Executable SQL: Provide complete and executable statements. Ensure that there are no syntax errors and that the statements can be run without modification.
* Consistency: Maintain consistent naming conventions for tables and columns, adhering to best practices (e.g., snake_case for table and column names).
* Response Formatting: Exclude any markers or similar formatting cues from the response.
""",
    llm_type="code",
    data_config={"type": "code", "params": {"syntax": "sql"}},
    columns_to_list_in_prompt=["industry_sector", "topic", "sql_prompt", "sql_task_type"], 
)


data_designer.add_generated_data_column(
    name="sql",
    generation_prompt="""\
Write an SQL query that fulfills the following instruction, using the given database context.
Instruction: {sql_prompt}\n
DB Context: {sql_context}\n

Important Guidelines:
* SQL Quality: Write self-contained and modular SQL code.
* SQL Validity: Please ensure that your SQL code is executable and does not contain any errors.
* Context: Base the SQL query on the provided database context. Ensure that all referenced tables, views, and columns exist within this context.
* Complexity & Concepts: The SQL should be written at a {sql_complexity} level and relate to {sql_concept}.
""",
    llm_type="code",
    data_config={"type": "code", "params": {"syntax": "sql"}},
    columns_to_list_in_prompt=["sql_prompt", "sql_context", "sql_complexity", "sql_concept"],
)
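
When prompts reference many columns, it can help to check that every `{placeholder}` in a template has a matching column. The sketch below does this with Python's standard `string.Formatter`. It assumes the templates follow `str.format`-style placeholder syntax, which is our assumption rather than a documented guarantee of the SDK. The helper names `placeholders` and `missing_columns`, the example template, and the example row are all hypothetical.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of {field} names referenced in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_columns(template: str, listed_columns: list) -> set:
    """Placeholders used in the template but absent from the listed columns."""
    return placeholders(template) - set(listed_columns)


# Hypothetical template and seed row, loosely mirroring the prompts above.
template = "Write a query at the {sql_complexity} level about {topic} using {sql_concept}."
row = {
    "sql_complexity": "Advanced",
    "topic": "Telemedicine Platforms",
    "sql_concept": "Window Functions",
}

# A column list that forgets one placeholder is reported as incomplete.
print(missing_columns(template, ["sql_complexity", "topic"]))

# Filling the template with a full row shows the prompt an LLM would see.
print(template.format(**row))
```

Running a check like this over each prompt before generation is a cheap way to catch a placeholder that was renamed or left out of a column list.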

## 🔍 Add Validation and Evaluation

Let's add post-processing steps to validate the generated code and evaluate the text-to-SQL conversion.

In [None]:
data_designer.add_validator(
    validator="code",
    code_lang="ansi",
    code_columns=["sql_context", "sql"],
)

data_designer.add_evaluator(
    eval_type="text_to_sql",
    instruction_column_name="sql_prompt",
    context_column_name="sql_context",
    response_column_name="sql"   
)
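
For intuition about what validating generated SQL against its context involves, here is a minimal local sketch using Python's built-in `sqlite3` module. It is not the validator that `add_validator` runs: it checks against SQLite's dialect rather than ANSI SQL, and the function name `check_sql` and the example schema are hypothetical. The idea is to create the context tables in an in-memory database, then ask SQLite to compile the query with `EXPLAIN` so that syntax errors and references to missing tables or columns surface without executing the query itself.

```python
import sqlite3


def check_sql(context_sql: str, query_sql: str) -> tuple:
    """Return (ok, message) after loading the DDL and compiling the query in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context_sql)       # create the context tables and views
        conn.execute(f"EXPLAIN {query_sql}")  # compile the query; it is not run
        return True, "ok"
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical context and queries for illustration only.
context = """
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE visits (
    visit_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(patient_id),
    visit_date TEXT
);
"""
good_query = (
    "SELECT p.full_name, COUNT(*) AS n_visits "
    "FROM patients p JOIN visits v ON v.patient_id = p.patient_id "
    "GROUP BY p.full_name"
)
bad_query = "SELECT full_name FROM patient"  # table name is misspelled

print(check_sql(context, good_query))
print(check_sql(context, bad_query))
```

A check like this only confirms that the query compiles against the schema. It says nothing about whether the query actually answers the instruction, which is the gap the `text_to_sql` evaluator is meant to cover.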

In [None]:
data_designer

## 👀 Generate Preview Dataset

Let's generate a small preview dataset to check that the generated columns look as expected.

In [None]:
preview = data_designer.generate_dataset_preview()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.head()

## 🔎 Easily Inspect Individual Records

- Run the cell below to display individual records for inspection.

- Run the cell multiple times to cycle through the 10 preview records.

- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record.

In [None]:
preview.display_sample_record()

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [None]:
# Submit batch job
batch_job = data_designer.submit_batch_workflow(num_records=100)
df = batch_job.fetch_dataset(wait_for_completion=True)

# Download evaluation report
path = batch_job.download_evaluation_report(wait_for_completion=True)