<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/notebooks/demo/navigator/text-to-code/navigator-data-designer-yaml-text-to-sql.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌅 Early Preview: Data Designer

> **Note:** The [Data Designer](https://gretel.ai/navigator/data-designer) functionality demonstrated in this notebook is currently in **Early Preview**.
>
> To access these features and run this notebook, please [join the waitlist](https://gretel.ai/navigator/data-designer#waitlist).


# 🎨 Navigator Data Designer SDK: Text-to-SQL

In [None]:
%%capture
!pip install -U git+https://github.com/gretelai/gretel-python-client

In [None]:
from gretel_client.navigator import DataDesigner

## 📘 Text-to-SQL Configuration

In this example, we ask an LLM to generate _values_ for some of the data seed categories and subcategories. The `num_new_values_to_generate` parameter controls this:

- `num_new_values_to_generate` sets how many new values to generate, in addition to any values already listed in the config.

- If both `values` and `num_new_values_to_generate` are present, the existing values are used as examples for generation.
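
To make the two rules above concrete, here is a minimal sketch in plain Python. It is not how Data Designer is implemented; `plan_seed_values` is a hypothetical helper that only restates the rules for a column given as a dict. The third column is also hypothetical (the config in this notebook never combines `values` with `num_new_values_to_generate`) and is there only to show the "existing values as examples" case.

```python
def plan_seed_values(column):
    """Summarize what a seed column asks for (illustrative helper, not a Gretel API)."""
    values = column.get("values") or []
    if isinstance(values, dict):
        # Subcategory values are keyed by the parent category value; flatten them.
        existing = [v for group in values.values() for v in group]
    else:
        existing = list(values)
    num_new = column.get("num_new_values_to_generate", 0)
    return {
        "name": column["name"],
        "num_existing": len(existing),
        "num_to_generate": num_new,
        # Existing values only serve as examples when generation is requested.
        "existing_used_as_examples": bool(existing) and num_new > 0,
    }


columns = [
    # Taken from the config below: values only, nothing to generate.
    {
        "name": "instruction_phrase",
        "values": [
            "Construct an SQL query to",
            "Formulate an SQL statement that",
            "Implement an SQL view that",
        ],
    },
    # Taken from the config below: no values, one value to generate.
    {"name": "sql_task_type_description", "num_new_values_to_generate": 1},
    # Hypothetical: both keys present, so the listed values act as examples.
    {
        "name": "industry_sector",
        "values": ["Healthcare", "Finance", "Technology"],
        "num_new_values_to_generate": 2,
    },
]

plans = [plan_seed_values(c) for c in columns]
for plan in plans:
    print(plan)
```

Only the last column reports `existing_used_as_examples` as `True`, because it is the only one that has both listed values and a request for new ones.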



In [None]:
config_string = """
model_suite: apache-2.0

special_system_instructions: >-
  You are an expert at writing, analyzing and editing SQL queries. You know what
  high-quality, clean, efficient, and maintainable SQL code looks like. You
  excel at transforming natural language into SQL, as well as SQL back into
  natural language. Your job is to assist the user with their SQL-related tasks.

categorical_seed_columns:
  - name: industry_sector
    values: [Healthcare, Finance, Technology]
    subcategories:
      - name: topic
        values:
          Healthcare:
            - Electronic Health Records (EHR) Systems
            - Telemedicine Platforms
            - AI-Powered Diagnostic Tools
          Finance:
            - Fraud Detection Software
            - Automated Trading Systems
            - Personal Finance Apps
          Technology:
            - Cloud Computing Platforms
            - Artificial Intelligence and Machine Learning Platforms
            - DevOps and Continuous Integration/Continuous Deployment (CI/CD) Tools

  - name: sql_complexity
    values: [Beginner, Intermediate, Advanced]
    subcategories:
      - name: sql_concept
        values:
          Beginner: ["Basic SQL", "SELECT Statements", "WHERE Clauses", "Basic JOINs", "INSERT, UPDATE, DELETE"]
          Intermediate: ["Aggregation", "Single JOIN", "Subquery", "Views", "Stored Procedures"]
          Advanced: ["Multiple JOINs", "Window Functions", "Common Table Expressions (CTEs)", "Triggers", "Query Optimization"]
      - name: sql_complexity_description
        description: The complexity level of the given SQL complexity and SQL concept.
        num_new_values_to_generate: 1

  - name: sql_task_type
    values:
      - "Data Retrieval"
      - "Data Definition"
      - "Data Manipulation"
      - "Analytics and Reporting"
      - "Database Administration"
      - "Data Cleaning and Transformation"
    subcategories:
      - name: sql_task_type_description
        description: A brief description of the SQL task type.
        num_new_values_to_generate: 1

  - name: instruction_phrase
    values:
      - "Construct an SQL query to"
      - "Formulate an SQL statement that"
      - "Implement an SQL view that"

generated_data_columns:
  - name: sql_prompt
    generation_prompt: >-
      Generate a clear and specific natural language instruction for creating an SQL query tailored to the {industry_sector} sector, focusing on the {topic} topic and the {sql_task_type} task. 
      Each instruction should begin with one of the following phrases: "{instruction_phrase}".
      
      Important Guidelines:
        * Industry Relevance: Ensure the instruction is directly related to the {industry_sector} sector and the {topic} topic.
        * Task Specificity: Clearly define the SQL task type ({sql_task_type}) to provide focused and actionable requirements.
        * Complexity Alignment: Align the instruction with the appropriate SQL complexity level by implicitly incorporating relevant SQL concepts.
        * Clarity and Precision: Craft the instruction to be unambiguous and straightforward, providing all necessary context without unnecessary verbosity.
        * Response Formatting: Exclude any markers or similar formatting cues in the instruction.
    columns_to_list_in_prompt: [industry_sector, topic, sql_task_type, instruction_phrase]

  - name: sql_context
    generation_prompt: >-
      Generate a set of database tables and views that are pertinent to the SQL instruction in {sql_prompt} and the task type {sql_task_type} within the {industry_sector} sector and {topic} topic.
      
      Important Guidelines:
        * Relevance: Ensure that all generated tables and views are directly related to the {industry_sector} sector and the {topic} topic. They should provide the necessary structure to support the SQL instruction effectively.
        * Completeness: Include all essential columns with appropriate data types, primary keys, foreign keys, and necessary constraints to accurately represent real-world database schemas.
        * Realism: Design realistic and practical table schemas that reflect typical structures used in the specified industry sector. Avoid overly simplistic or excessively complex schemas unless required by the task.
        * Executable SQL: Provide complete and executable statements. Ensure that there are no syntax errors and that the statements can be run without modification.
        * Consistency: Maintain consistent naming conventions for tables and columns, adhering to best practices (e.g., snake_case for table and column names).
        * Response Formatting: Exclude any markers or similar formatting cues in the instruction.
    columns_to_list_in_prompt: [industry_sector, topic, sql_prompt, sql_task_type]
    llm_type: code
    data_config:
      type: code
      params:
        syntax: sql
  
  - name: sql
    generation_prompt: >-
      Write an SQL query to answer/execute the following instruction and sql context.
      Instruction: {sql_prompt}\n
      Context: {sql_context}\n
      
      Important Guidelines:
        * SQL Quality: Write self-contained and modular SQL code.
        * SQL Validity: Please ensure that your SQL code is executable and does not contain any errors.
        * Context: Base the SQL query on the provided database context in "{sql_context}". Ensure that all referenced tables, views, and columns exist within this context.
        * Complexity & Concepts: The SQL should be written at a {sql_complexity} level, making use of concepts such as {sql_concept}.
    columns_to_list_in_prompt: [sql_prompt, sql_context, sql_complexity, sql_concept]
    llm_type: code
    data_config:
      type: code
      params:
        syntax: sql

post_processors:
  - validator: code
    settings:
      code_lang: ansi
      code_columns: [sql_context, sql]
  
  - evaluator: text_to_sql
    settings:
      text_column: sql_prompt
      code_column: sql
      context_column: sql_context
"""

data_designer = DataDesigner.from_config(config_string, api_key="prompt")
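
Each `generation_prompt` in the config refers to other columns through `{column_name}` placeholders, and each generated column also carries a `columns_to_list_in_prompt` list. A quick local check can catch a placeholder that is missing from that list. The sketch below is not part of the Gretel client: `placeholders` and `unlisted_placeholders` are hypothetical helpers, and it assumes that the placeholders follow Python `str.format` field syntax and that every placeholder is meant to appear in `columns_to_list_in_prompt`. The sample column is a shortened, deliberately incomplete variant of `sql_prompt`.

```python
from string import Formatter


def placeholders(prompt):
    """Return the set of {field} names used in a prompt template."""
    return {field for _, field, _, _ in Formatter().parse(prompt) if field}


def unlisted_placeholders(column):
    """Placeholders used in the prompt but absent from columns_to_list_in_prompt."""
    listed = set(column.get("columns_to_list_in_prompt", []))
    return placeholders(column["generation_prompt"]) - listed


# Shortened variant of the sql_prompt column; instruction_phrase is left out
# of the list on purpose so the check has something to report.
column = {
    "name": "sql_prompt",
    "generation_prompt": (
        "Generate an instruction for the {industry_sector} sector, focusing on "
        "the {topic} topic and the {sql_task_type} task. "
        'Begin with "{instruction_phrase}".'
    ),
    "columns_to_list_in_prompt": ["industry_sector", "topic", "sql_task_type"],
}

missing = unlisted_placeholders(column)
print(missing)  # {'instruction_phrase'}
```

Running the same check over every entry under `generated_data_columns` before calling `DataDesigner.from_config` is a cheap way to spot a misspelled or mismatched column reference.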

In [None]:
data_designer

## 🌱 Generating categorical seed _values_

If some or all of your categorical data seeds have values that need to be generated (as is the case in this example), you have two choices:

1. Generate them every time you generate a preview dataset and/or batch workflow. In this case, you simply call `data_designer.generate_dataset_preview` or `data_designer.submit_batch_workflow` without providing `data_seeds` as input.

2. Generate them once using `data_designer.run_data_seeds_step` and then pass the resulting `data_seeds` as input when generating a preview / batch workflow, as we will show below.
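
The difference between the two choices is only in when seed values get generated. The toy class below illustrates that control flow; it is a stand-in written for this explanation, not `gretel_client`'s `DataDesigner`, and it makes no LLM calls. The method names mirror the ones used in this notebook, but everything inside them (the call counter, the returned dicts) is an assumption made for illustration.

```python
class ToyDesigner:
    """Toy stand-in for illustration only; NOT gretel_client's DataDesigner."""

    def __init__(self):
        # Counts how many times seed values were (pretend) generated.
        self.seed_generation_calls = 0

    def run_data_seeds_step(self):
        # Stand-in for asking an LLM for seed values; here we only count the call.
        self.seed_generation_calls += 1
        return {"generated_in_call": self.seed_generation_calls}

    def generate_dataset_preview(self, data_seeds=None):
        # Choice 1: no seeds supplied, so they are generated as part of this call.
        if data_seeds is None:
            data_seeds = self.run_data_seeds_step()
        # Choice 2: seeds supplied, so they are reused as-is.
        return {"seeds_used": data_seeds}


designer = ToyDesigner()

# Choice 1: every preview triggers a fresh seed generation.
first = designer.generate_dataset_preview()
second = designer.generate_dataset_preview()
calls_after_choice_1 = designer.seed_generation_calls
print(calls_after_choice_1)  # 2

# Choice 2: generate once, then reuse the same seeds for every preview.
data_seeds = designer.run_data_seeds_step()
third = designer.generate_dataset_preview(data_seeds=data_seeds)
fourth = designer.generate_dataset_preview(data_seeds=data_seeds)
calls_after_choice_2 = designer.seed_generation_calls
print(calls_after_choice_2)  # 3
```

With the second choice, both previews share the same seed values, which is what makes runs comparable while you iterate on prompts.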

In [None]:
data_seeds = data_designer.run_data_seeds_step()

In [None]:
data_seeds

In [None]:
data_seeds.inspect()

## 👀 Generating a dataset preview

- You can run `run_data_seeds_step` multiple times.

- Once you are happy with the results, you can pass `data_seeds` as input to the preview / batch generation methods.

- Notice that Step 1 now loads the data seeds rather than generating them.

In [None]:
preview = data_designer.generate_dataset_preview(data_seeds=data_seeds)

In [None]:
preview.dataset

## 🔎 Taking a closer look at single records

In [None]:
# Provide an index to display a specific record, or omit it
# to cycle through records each time you run the cell.
preview.display_sample_record(index=5)

In [None]:
preview.dataset

## 🤔 Like what you see?

- Submit a batch workflow!

- Notice we pass `data_seeds` as an argument to `data_designer.submit_batch_workflow` so we use the same data seeds any time we run this workflow.

In [None]:
batch_job = data_designer.submit_batch_workflow(num_records=25, data_seeds=data_seeds)

In [None]:
# Check to see if the Workflow is still active.
batch_job.workflow_run_status

In [None]:
df = batch_job.fetch_dataset(wait_for_completion=True)
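
Once records are available, it can be useful to spot-check locally that a generated schema and query actually run. The sketch below uses Python's built-in `sqlite3` module as a rough stand-in; it is not the `code` validator configured under `post_processors`, and SQLite's dialect differs from ANSI SQL, so a failure here is a hint rather than a verdict. `runs_in_sqlite` is a hypothetical helper, and the schema and queries are hand-written samples, not output from this workflow.

```python
import sqlite3


def runs_in_sqlite(sql_context, sql):
    """Return True if the schema and the query both execute in an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # create the tables and views
        conn.execute(sql)                # run the single query against them
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hand-written sample in the shape of the sql_context / sql columns.
sql_context = """
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE visits (
    visit_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(patient_id),
    visit_date TEXT
);
"""
good_sql = (
    "SELECT p.full_name, COUNT(v.visit_id) AS num_visits "
    "FROM patients p LEFT JOIN visits v ON v.patient_id = p.patient_id "
    "GROUP BY p.full_name"
)
bad_sql = "SELECT full_name FROM patient"  # references a table that does not exist

print(runs_in_sqlite(sql_context, good_sql))  # True
print(runs_in_sqlite(sql_context, bad_sql))   # False
```

To apply this to the fetched dataset, you would pass the `sql_context` and `sql` values of a record to the helper, assuming the dataset exposes the generated columns under those names.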

In [None]:
path = batch_job.download_evaluation_report(wait_for_completion=True)