<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/tree/main/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-sql.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎨 Navigator Data Designer SDK: Text-to-SQL

In [None]:
%%capture
!pip install -U git+https://github.com/gretelai/gretel-python-client

In [None]:
from gretel_client.navigator import DataDesigner

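# Session settings passed through to the Gretel client.
session_kwargs = {
    "api_key": "prompt",  # "prompt" asks for the API key interactively
    "endpoint": "https://api.gretel.cloud",
}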

## 📘 Text-to-SQL Configuration

In this example, an LLM generates _values_ for some of the categorical seed columns and their subcategories, as controlled by the `num_new_values_to_generate` parameter.

- `num_new_values_to_generate` sets how many new values to generate, in addition to any `values` already listed in the config.

- If both `values` and `num_new_values_to_generate` are present, the existing values are used as examples for generation.
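
As a rough illustration of the two rules above, the sketch below counts how many values each top-level seed column should end up with. It uses the three seed columns from the config in the next cell. `expected_value_count` is a hypothetical helper, not part of the SDK, and it assumes the service returns exactly the number of new values requested.

```python
from math import prod


def expected_value_count(column: dict) -> int:
    """Existing values plus the number of new values requested (hypothetical helper)."""
    existing = len(column.get("values", []))
    requested = column.get("num_new_values_to_generate", 0)
    return existing + requested


# Top-level seed columns, copied from the config used in this notebook.
seed_columns = [
    {
        "name": "domain",
        "values": [
            "Healthcare", "Finance", "Education",
            "Science and Technology", "Environmental Science", "Government",
        ],
        "num_new_values_to_generate": 5,
    },
    {
        "name": "sql_complexity",
        "values": [
            "Basic SQL", "Aggregation", "Single Join",
            "Subquery", "Multiple Join", "Window Functions",
        ],
    },
    {
        "name": "sql_task_type",
        "values": [
            "Data Retrieval", "Data Definition", "Data Manipulation",
            "Analytics and Reporting", "Database Administration",
            "Data Cleaning and Transformation",
        ],
    },
]

counts = {column["name"]: expected_value_count(column) for column in seed_columns}
print(counts)  # {'domain': 11, 'sql_complexity': 6, 'sql_task_type': 6}

# Number of distinct combinations of top-level seed values (assumption: exact counts).
print(prod(counts.values()))  # 396
```

Only `domain` requests new values, so it grows from 6 to 11 while the other two columns keep their 6 listed values.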



In [None]:
config = """
model_suite: llama-3.x

special_system_instructions: >-
  You are an expert at writing, analyzing and editing SQL queries. You know what
  high-quality, clean, efficient, and maintainable SQL code looks like. You
  excel at transforming natural language into SQL, as well as SQL back into
  natural language. Your job is to assist the user with their SQL-related tasks.
  Leverage T-SQL only.

categorical_seed_columns:
  - name: domain
    description: Major industry domain or sector that relies on robust data solutions
    values: [Healthcare, Finance, Education, Science and Technology, Environmental Science, Government]
    num_new_values_to_generate: 5
    subcategories:
      - name: domain_description
        description: High-level description of the domain, highlighting various types of data relevant to writing SQL
        num_new_values_to_generate: 1
      - name: topic
        description: Key topics that professional SQL developers care about in the given domain
        num_new_values_to_generate: 15

  - name: sql_complexity
    description: Complexity of the SQL query, ranging from basic operations to advanced data processing techniques
    values:
      - "Basic SQL"
      - "Aggregation"
      - "Single Join"
      - "Subquery"
      - "Multiple Join"
      - "Window Functions"
    subcategories:
      - name: sql_complexity_description
        description: Description of the complexity level of the SQL query
        num_new_values_to_generate: 1

  - name: sql_task_type
    description: Type of SQL task that the query represents
    values:
      - "Data Retrieval"
      - "Data Definition"
      - "Data Manipulation"
      - "Analytics and Reporting"
      - "Database Administration"
      - "Data Cleaning and Transformation"
    subcategories:
      - name: sql_task_type_description
        description: Description of the type of SQL task
        num_new_values_to_generate: 1

generated_data_columns:
  - name: sql_prompt
    generation_prompt: >-
        Create a natural language prompt to generate SQL in the field of {domain},
        specifically about the topic of {topic}. Feel free to ask for data that
        focus on a smaller subject within the scope of {domain_description}.
    columns_to_list_in_prompt: all_categorical_seed_columns
    llm_type: natural_language

  - name: sql_context
    generation_prompt: >-
        Write SQL statements that create tables and views in a database and are
        pertinent to the natural language prompt in {sql_prompt}.

        Include complete executable SQL table CREATE statements and/or view CREATE statements.
        Provide up to five tables/views that are relevant to the user's natural language prompt.
        Table names and schemas should correspond to the {domain} domain and focus on {domain_description}
    columns_to_list_in_prompt: [domain, domain_description, topic, sql_prompt]
    llm_type: code

  - name: sql
    generation_prompt: >-
        Write an SQL query to answer/execute the natural language prompt in
        {sql_prompt}.

        SQL should be based on the database context generated in {sql_context}.
        SQL should leverage {sql_complexity}.
    columns_to_list_in_prompt: [domain, topic, sql_complexity, sql_task_type]
    llm_type: code


post_processors:
    - validator: code
      settings:
        code_lang: tsql
        code_columns: [sql_context, sql]

    - evaluator: text_to_sql
      settings:
        text_column: sql_prompt
        code_column: sql
        context_column: sql_context
"""

data_designer = DataDesigner.from_config(config, **session_kwargs)
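
Each `generation_prompt` in the config refers to other columns through `{column_name}` placeholders, and later columns build on earlier ones (`sql` uses `{sql_prompt}` and `{sql_context}`). The sketch below is one way to check, before calling the API, that every placeholder names a seed column or a column generated earlier. It assumes the placeholders can be parsed like Python `str.format` fields. `placeholders` and `undefined_references` are hypothetical helpers, not SDK functions, and the prompts are abbreviated versions of the ones in the config.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the field names referenced as {name} in a prompt template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def undefined_references(seed_names, generated_columns) -> dict:
    """Map each generated column to the placeholders not defined before it.

    generated_columns is a list of (name, generation_prompt) pairs in config order.
    """
    known = set(seed_names)
    problems = {}
    for name, prompt in generated_columns:
        missing = placeholders(prompt) - known
        if missing:
            problems[name] = sorted(missing)
        # Later prompts may refer to this column.
        known.add(name)
    return problems


# Seed columns and subcategories named in the config above.
seed_names = [
    "domain", "domain_description", "topic",
    "sql_complexity", "sql_complexity_description",
    "sql_task_type", "sql_task_type_description",
]

# Abbreviated versions of the three generation prompts.
generated = [
    ("sql_prompt", "Prompt in the field of {domain} about {topic}, within {domain_description}."),
    ("sql_context", "Tables and views pertinent to {sql_prompt} for the {domain} domain."),
    ("sql", "Answer {sql_prompt} using {sql_context} and {sql_complexity}."),
]

print(undefined_references(seed_names, generated))  # {}
```

An empty result means every placeholder resolves. A typo such as `{sql_contxt}` would show up under the column whose prompt contains it.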

In [None]:
data_designer

## 🌱 Generating categorical seed _values_

If some or all of your categorical data seeds have values that need to be generated (as is the case in this example), you have two choices:

1. Generate them every time you generate a preview dataset and/or batch workflow. In this case, you simply call `data_designer.generate_dataset_preview` or `data_designer.submit_batch_workflow` without providing `data_seeds` as input.

2. Generate them once using `data_designer.generate_seed_category_values` and then pass the resulting `data_seeds` as input when generating a preview / batch workflow, as we will show below.
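
The difference between the two choices can be sketched with a stand-in object that makes no API calls. `StubDesigner` is hypothetical. It mimics only the behavior described above (seeds are generated whenever `data_seeds` is not supplied) and says nothing about how the real `DataDesigner` is implemented.

```python
class StubDesigner:
    """Hypothetical stand-in for DataDesigner; it makes no API calls."""

    def __init__(self):
        self.seed_generation_calls = 0

    def generate_seed_category_values(self):
        # Pretend an LLM proposed one new domain value on each call.
        self.seed_generation_calls += 1
        new_value = f"Generated-{self.seed_generation_calls}"
        return {"domain": ["Healthcare", "Finance", new_value]}

    def generate_dataset_preview(self, data_seeds=None):
        # Without data_seeds, the seeds are generated again for this run.
        if data_seeds is None:
            data_seeds = self.generate_seed_category_values()
        return [{"domain": value} for value in data_seeds["domain"]]


# Choice 1: seeds are regenerated on every call.
choice_1 = StubDesigner()
choice_1.generate_dataset_preview()
choice_1.generate_dataset_preview()

# Choice 2: generate once, then reuse the same seeds.
choice_2 = StubDesigner()
stub_seeds = choice_2.generate_seed_category_values()
first = choice_2.generate_dataset_preview(data_seeds=stub_seeds)
second = choice_2.generate_dataset_preview(data_seeds=stub_seeds)

print(choice_1.seed_generation_calls)  # 2
print(choice_2.seed_generation_calls)  # 1
print(first == second)  # True
```

Reusing the seeds keeps the categories identical across runs, which is why the rest of this notebook follows the second choice.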

In [None]:
data_seeds = data_designer.generate_seed_category_values()

In [None]:
data_seeds

In [None]:
data_seeds.inspect()

## 👀 Generating a dataset preview

- You can run `generate_seed_category_values` as many times as you like until the generated seed values look right.

- Once you are happy with the results, pass `data_seeds` as input to the preview / batch generation methods.

- Notice that Step 1 now loads the data seeds rather than generating them.

In [None]:
preview = data_designer.generate_dataset_preview(data_seeds=data_seeds)

In [None]:
preview.dataset

## 🔎 Taking a closer look at single records

In [None]:
preview.display_sample_record(5)

## 🤔 Like what you see?

- Submit a batch workflow!

- Notice we pass `data_seeds` as an argument to `data_designer.submit_batch_workflow` so we use the same data seeds any time we run this workflow.

In [None]:
batch_job = data_designer.submit_batch_workflow(num_records=25, data_seeds=data_seeds)

In [None]:
batch_job.status

In [None]:
df = batch_job.fetch_dataset(wait_for_completion=True)

In [None]:
path = batch_job.download_evaluation_report()