# 🎨 Gretel - Navigator Data Designer SDK: Multi-Turn Conversations with Toxicity Labels


In [None]:
%%capture
%pip install -U gretel_client

In [None]:
from gretel_client.navigator import DataDesigner

## 📘 Multi-Turn Conversation Configuration

Below we show an example `DataDesigner` configuration that generates two-turn user-assistant conversations, each with a toxicity label. The main sections are as follows:

- **model_suite:** Use `apache-2.0` or `llama-3.x`, depending on the license you want associated with the data you generate. With `apache-2.0`, all models used by Data Designer comply with the `apache-2.0` license; with `llama-3.x`, the models used by Data Designer fall under the `Llama 3` license.

- **special_system_instructions:** This is an optional use-case-specific instruction to be added to the system prompt of all LLMs used during synthetic data generation.

- **categorical_seed_columns:** Specifies the categorical columns used to seed the synthetic data generation process. Here we list seed categories and subcategories explicitly, and we also set `num_new_values_to_generate` so that additional category and subcategory values are generated.

- **generated_data_columns:** Specifies data columns that are fully generated using LLMs, seeded by the categorical seed columns. The `generation_prompt` field is the prompt template that will be used to generate the data column. All data seeds and previously defined data columns can be used as template keyword arguments.

- **post_processors:** Specifies validation, evaluation, and processing steps that are applied to the dataset after generation, such as a code validator or the `text_to_python` evaluation suite. The configuration below does not define any post-processors.
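
To make the relationship between seed columns and `generation_prompt` templates concrete, here is a minimal sketch in plain Python. It assumes that template placeholders behave like `str.format` fields; how Data Designer samples seeds and renders prompts internally is not shown in this notebook, so `sample_seed_row` and `render_prompt` are hypothetical helpers used only for illustration.

```python
import random

# Seed values copied from the configuration in this notebook (shortened).
TOPICS_BY_DOMAIN = {
    "Tech Support": ["Troubleshooting a Laptop", "Installing Software Updates"],
    "Personal Finances": ["Budgeting Advice", "Understanding Taxes"],
}
COMPLEXITY_LEVELS = ["Basic", "Intermediate", "Advanced"]

PROMPT_TEMPLATE = (
    "The user is seeking help or information in the {domain} domain, "
    "specifically about the topic of {topic}, at a {complexity} complexity level."
)


def sample_seed_row(rng):
    """Hypothetical helper: pick a domain, a topic of that domain, and a complexity."""
    domain = rng.choice(sorted(TOPICS_BY_DOMAIN))
    return {
        "domain": domain,
        "topic": rng.choice(TOPICS_BY_DOMAIN[domain]),
        "complexity": rng.choice(COMPLEXITY_LEVELS),
    }


def render_prompt(template, row):
    """Hypothetical helper: fill placeholders, assuming str.format semantics."""
    return template.format(**row)


seed_row = sample_seed_row(random.Random(0))
rendered_prompt = render_prompt(PROMPT_TEMPLATE, seed_row)
print(rendered_prompt)
```

The point of the sketch is that a subcategory value such as `topic` is always drawn from the list that belongs to the sampled `domain`, and that every placeholder in a prompt must correspond to a column that already has a value.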

To run the code, you need a Gretel API key, which you can get from https://console.gretel.ai/users/me/key

In [None]:
config_string = """
model_suite: apache-2.0

special_system_instructions: >-
  You are an expert conversation designer and domain specialist. Your job is to
  produce realistic user-assistant dialogues for fine-tuning a model. Always ensure:
  - Responses are factually correct and contextually appropriate.
  - Communication is clear, helpful, and matches the complexity level.
  - Avoid disallowed content and toxicity.
  - After the two-turn conversation, provide a single toxicity assessment for the user's messages in the entire conversation.

categorical_seed_columns:
  - name: domain
    values: [Tech Support, Personal Finances, Educational Guidance]
    subcategories:
      - name: topic
        values:
          Tech Support:
            - Troubleshooting a Laptop
            - Setting Up a Home Wi-Fi Network
            - Installing Software Updates
          Personal Finances:
            - Budgeting Advice
            - Understanding Taxes
            - Investment Strategies
          Educational Guidance:
            - Choosing a College Major
            - Effective Studying Techniques
            - Learning a New Language
        num_new_values_to_generate: 3
    num_new_values_to_generate: 3

  - name: complexity
    values: [Basic, Intermediate, Advanced]

generated_data_columns:
  # Turn 1: User initial message
  - name: user_message
    generation_prompt: >-
      The user is seeking help or information in the {domain} domain, specifically about the topic of {topic}, at a {complexity} complexity level.
      
      The user_message should:
      - Sound natural and realistic.
      - Avoid disallowed content (e.g., harassment, hate, extremely sensitive requests).
      - Reflect the specified domain, topic, and complexity level.

  # Turn 1: Assistant responds
  - name: assistant_message
    generation_prompt: >-
      As a helpful assistant, write a response to the user's query below:
      Query: {user_message}\n
      
      Instructions:
      - Provide a clear, accurate, and contextually relevant response.
      - Be correct, non-toxic, and helpful.
      - Avoid disallowed content. If the request is disallowed, provide a safe refusal.
      - To encourage variety, do not always start your response with the same phrase (e.g., avoid always beginning with "Certainly!"). 
      - You can start directly addressing the user's question or use different acknowledgments like "Of course,", "Sure,", "Let's see," or simply begin explaining.
      - Only provide the message, do not add headers, explanations, or assessments.

  # Turn 2: User follows up
  - name: user_message_2
    generation_prompt: >-
      The user now follows up on the assistant_message.
      Previous User Query: {user_message}\n
      Previous Assistant Response: {assistant_message}\n
      
      The second user_message should:
      - Be a logical follow-up or request for clarification or more detail based on the assistant's prior response.
      - Maintain the same domain, topic, and complexity.
      - Sound natural and realistic.
      - Avoid disallowed content.
    columns_to_list_in_prompt: all_categorical_seed_columns

  # Turn 2: Assistant responds again
  - name: assistant_message_2
    generation_prompt: >-
      The user has followed up with another query:
      Previous User Query: {user_message}\n
      Previous Assistant Response: {assistant_message}\n
      New User Query: {user_message_2}\n
      
      Instructions:
      - Provide a clear, accurate, and contextually relevant follow-up response.
      - Maintain the complexity level and continue to be non-toxic and helpful.
      - Avoid disallowed content. If the request is disallowed, provide a safe refusal.
      - Again, do not always start with the same phrase. Vary your opening or start directly.
      - Only provide the message, do not add headers, explanations, or assessments.
    columns_to_list_in_prompt: all_categorical_seed_columns

  # Toxicity Label for the entire 2-turn conversation
  - name: toxicity_label
    generation_prompt: >-
      Analyze all user messages in this two-turn conversation and determine the overall toxicity level of the user's behavior.
      
      Consider:
      - First User Message: {user_message}
      - Assistant Response: {assistant_message}
      - Second User Message: {user_message_2}
      - Assistant Response: {assistant_message_2}
      
      Focus primarily on the user's messages to detect any disallowed content, harassment, hate, or harmful language.
      If any user message exhibits toxicity, reflect it in the final label.
      
      Possible categories:
      - "None": No toxicity detected in user messages.
      - "Mild": Slightly rude or sarcastic but not hateful or harmful.
      - "Moderate": Some disrespectful or harassing language.
      - "Severe": Overt hate, harassment, or harmful content.
      
      Provide only one label. If all user messages are polite and safe, label "None".
      Do not elaborate beyond providing the label.

"""

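# Build the designer from the YAML string; api_key="prompt" asks for your Gretel API key interactively.
data_designer = DataDesigner.from_config(config_string, api_key="prompt")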
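
Because each `generation_prompt` may only reference seed columns and previously defined data columns, a misspelled placeholder is an easy mistake to make. The following is a minimal sketch of a pre-flight check using only the standard library. `find_unknown_placeholders` is a hypothetical helper, not part of `gretel_client`, and it assumes placeholders follow `str.format` syntax.

```python
from string import Formatter


def find_unknown_placeholders(seed_columns, generated_columns):
    """Return {column_name: [unknown placeholders]} for prompts that reference
    a name that is neither a seed column nor an earlier generated column."""
    known = set(seed_columns)
    problems = {}
    for name, prompt in generated_columns:
        # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
        fields = [field for _, field, _, _ in Formatter().parse(prompt) if field]
        unknown = [field for field in fields if field not in known]
        if unknown:
            problems[name] = unknown
        known.add(name)  # later prompts may reference this column
    return problems


seed_columns = ["domain", "topic", "complexity"]
generated_columns = [
    ("user_message", "Help in {domain} about {topic} at {complexity} level."),
    ("assistant_message", "Query: {user_message}"),
    # The second placeholder is misspelled on purpose.
    ("user_message_2", "Previous: {assistant_message} and {assistant_mesage}"),
]
problems = find_unknown_placeholders(seed_columns, generated_columns)
print(problems)  # {'user_message_2': ['assistant_mesage']}
```

The check walks the columns in definition order, so a prompt that refers to a column defined later is also reported.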

## 👀 Generating a dataset preview

- Preview mode allows you to quickly iterate on your data design.

- Each preview generation call creates 10 records for inspection.

In [None]:
preview = data_designer.generate_dataset_preview()

In [None]:
# The preview dataset is accessible as a DataFrame
preview.dataset[["domain","topic"]].value_counts()
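
The `toxicity_label` prompt asks for exactly one of four labels, but generated text can still arrive with quotes, trailing punctuation, or extra words. A preview is a convenient place to check for that. The sketch below is one possible normalization in plain Python; `normalize_label` and the `UNRECOGNIZED` marker are hypothetical names introduced here, and the raw values are made-up examples rather than real output.

```python
from collections import Counter

# Labels taken from the toxicity_label prompt in the configuration.
ALLOWED_LABELS = ("None", "Mild", "Moderate", "Severe")
UNRECOGNIZED = "<unrecognized>"  # hypothetical marker for values that need review


def normalize_label(raw):
    """Map a raw generated value to one of ALLOWED_LABELS, else UNRECOGNIZED."""
    cleaned = str(raw).strip().strip("\"'.").strip().lower()
    for label in ALLOWED_LABELS:
        if cleaned == label.lower():
            return label
    return UNRECOGNIZED


# Made-up raw values that illustrate typical formatting noise.
raw_labels = ["None", '"Mild"', " severe. ", "The label is Moderate"]
counts = Counter(normalize_label(value) for value in raw_labels)
print(counts)
```

With the preview DataFrame, the same function could be applied as `preview.dataset["toxicity_label"].map(normalize_label).value_counts()`, assuming the column name matches the configuration.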

## 🔎 Easily inspect individual records

- Run the cell below to display individual records for inspection.

- Run the cell multiple times to cycle through the 10 preview records.

- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record.

In [None]:
preview.display_sample_record()

## 🤔 Like what you see?

- Submit a batch workflow!

In [None]:
batch_job = data_designer.submit_batch_workflow(num_records=25)

In [None]:
# Check to see if the Workflow is still active.
batch_job.workflow_run_status

In [None]:
# Wait for the workflow to complete, then fetch the generated dataset.
df = batch_job.fetch_dataset(wait_for_completion=True)
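
Since the conversations are intended for fine-tuning, you may want each record as an ordered list of chat turns. The sketch below shows one way to do that with the column names from the configuration. The `role`/`content` layout is an assumption (a common chat format, not something this notebook prescribes), `to_chat_messages` is a hypothetical helper, and the example record is made up.

```python
def to_chat_messages(record):
    """Convert one generated record (a dict) into an ordered list of chat turns."""
    turns = [
        ("user", "user_message"),
        ("assistant", "assistant_message"),
        ("user", "user_message_2"),
        ("assistant", "assistant_message_2"),
    ]
    return [{"role": role, "content": record[column]} for role, column in turns]


# Made-up record with the columns defined in the configuration.
example_record = {
    "domain": "Tech Support",
    "topic": "Installing Software Updates",
    "complexity": "Basic",
    "user_message": "How do I update my laptop?",
    "assistant_message": "Open the system settings and look for the update section.",
    "user_message_2": "What if the update gets stuck?",
    "assistant_message_2": "Restart the laptop and run the update again.",
    "toxicity_label": "None",
}
messages = to_chat_messages(example_record)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

If `df` is a pandas DataFrame, `[to_chat_messages(r) for r in df.to_dict(orient="records")]` would apply the same conversion to every row.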

In [None]:
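# Download the evaluation report. This assumes the workflow produced one; the
# configuration above defines no evaluation post-processor, so it may not be available.
path = batch_job.download_evaluation_report(wait_for_completion=True)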