# 🎨 Navigator Data Designer SDK: Text-to-Python

This notebook demonstrates how to use the Gretel Navigator SDK to create a synthetic data generation pipeline for Python code examples. We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses.

In [None]:
%%capture
%pip install -U gretel_client

In [9]:
from gretel_client.navigator import DataDesigner
from gretel_client.navigator.tasks.types import ValidatorType, LLMJudgePromptTemplateType
from typing import Literal
from pydantic import BaseModel, Field

## 📘 Setting Up the Data Designer

First, we'll define our structured output model and initialize the Data Designer with appropriate system instructions.

In [2]:
# Define structured output model for code generation
class PythonCode(BaseModel):
    """A Python code example with documentation."""
    code: str = Field(..., description="The Python code implementation")
    docstring: str = Field(..., description="Documentation explaining the code")

# Initialize the Data Designer
data_designer = DataDesigner(
    api_key="prompt",  # "prompt" asks for your Gretel API key interactively; you can also pass the key directly
    model_suite="apache-2.0",  # Use apache-2.0 or llama-3.x based on your licensing needs
    endpoint="https://api.gretel.cloud",
    special_system_instructions="""
    You are an expert at writing, analyzing, and editing Python code. You know what
    high-quality, clean, efficient, and maintainable Python code looks like. You
    excel at transforming natural language into Python, as well as Python back into
    natural language. Your job is to assist the user with their Python-related tasks.
    """
)

[12:13:05] [INFO] 🦜 Using apache-2.0 model suite
Logged in as kirit.thadaka@gretel.ai ✅


## 🌱 Define Categorical Seed Columns

We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples.

In [3]:
# Add industry sector categories
data_designer.add_categorical_seed_column(
    name="industry_sector",
    description="The industry sector for the code example",
    values=["Healthcare", "Finance", "Technology"],
    subcategories=[
        {
            "name": "topic",
            "values": {
                "Healthcare": [
                    "Electronic Health Records (EHR) Systems",
                    "Telemedicine Platforms", 
                    "AI-Powered Diagnostic Tools"
                ],
                "Finance": [
                    "Fraud Detection Software",
                    "Automated Trading Systems",
                    "Personal Finance Apps"
                ],
                "Technology": [
                    "Cloud Computing Platforms",
                    "Artificial Intelligence and Machine Learning Platforms",
                    "DevOps and CI/CD Tools"
                ]
            }
        }
    ]
)

# Add code complexity and concepts
data_designer.add_categorical_seed_column(
    name="code_complexity",
    description="The complexity level of the code",
    values=["Beginner", "Intermediate", "Advanced"],
    subcategories=[
        {
            "name": "code_concept",
            "values": {
                "Beginner": [
                    "Variables",
                    "Data Types",
                    "Functions",
                    "Loops",
                    "Classes"
                ],
                "Intermediate": [
                    "List Comprehensions",
                    "Object-oriented programming",
                    "Lambda Functions",
                    "Web frameworks",
                    "Pandas"
                ],
                "Advanced": [
                    "Multithreading",
                    "Context Managers",
                    "Generators"
                ]
            }
        }
    ]
)

# Add instruction phrases
data_designer.add_categorical_seed_column(
    name="instruction_phrase",
    description="Starting phrase for the code instruction",
    values=[
        "Write a function that",
        "Create a class that",
        "Implement a script",
        "Can you create a function",
        "Develop a module that"
    ]
)
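
To make the role of subcategories concrete, here is a minimal sketch of how one seed row could be drawn: pick a top-level value, then pick a subcategory value conditioned on it. This is plain Python for illustration only, not the SDK's sampling code. The `sample_seed_row` helper, the trimmed value lists, and the uniform sampling are all assumptions.

```python
import random

# Trimmed copies of the seed definitions above (illustrative only).
INDUSTRY_TOPICS = {
    "Healthcare": ["Telemedicine Platforms", "AI-Powered Diagnostic Tools"],
    "Finance": ["Fraud Detection Software", "Personal Finance Apps"],
}
COMPLEXITY_CONCEPTS = {
    "Beginner": ["Variables", "Loops"],
    "Advanced": ["Multithreading", "Generators"],
}
INSTRUCTION_PHRASES = ["Write a function that", "Create a class that"]


def sample_seed_row(rng: random.Random) -> dict:
    """Hypothetical helper: draw one seed row, assuming uniform sampling."""
    sector = rng.choice(list(INDUSTRY_TOPICS))
    complexity = rng.choice(list(COMPLEXITY_CONCEPTS))
    return {
        "industry_sector": sector,
        # The subcategory is drawn only from values listed under its parent.
        "topic": rng.choice(INDUSTRY_TOPICS[sector]),
        "code_complexity": complexity,
        "code_concept": rng.choice(COMPLEXITY_CONCEPTS[complexity]),
        "instruction_phrase": rng.choice(INSTRUCTION_PHRASES),
    }


row = sample_seed_row(random.Random(0))
print(row)
```

Because `topic` and `code_concept` are always drawn from the list under their parent value, a row can never pair, say, "Finance" with a Healthcare topic.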

## ✨ Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation.

In [4]:
# Generate instruction for the code
data_designer.add_generated_data_column(
    name="instruction",
    generation_prompt="""
    Generate an instruction to create Python code that solves a specific problem. 
    Each instruction should begin with the following phrase: {instruction_phrase}.
    
    Important Guidelines:
    * Industry Relevance: Ensure the instruction pertains to the {industry_sector} sector and {topic} topic.
    * Code Complexity: Tailor the instruction to the {code_complexity} level. Utilize relevant {code_concept} where appropriate to match the complexity level.
    * Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.
    * Response Formatting: Do not include any markers such as ### Response ### in the instruction.
    """
)

# Generate the Python code
data_designer.add_generated_data_column(
    name="code_implementation",
    generation_prompt="""
    Write Python code for the following instruction:
    Instruction: {instruction}

    Important Guidelines:
    * Code Quality: Your code should be clean, complete, self-contained and accurate.
    * Code Validity: Please ensure that your Python code is executable and does not contain any errors.
    * Packages: Remember to import any necessary libraries, and to use all libraries you import.
    * Complexity & Concepts: The code should be written at a {code_complexity} level, making use of concepts such as {code_concept}.
    """,
    llm_type="code",
    data_config={"type": "code", "params": {"syntax": "python"}}
)
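
The `{...}` placeholders in the prompts refer to columns defined earlier, which is why `code_implementation` can use `{instruction}` only after the `instruction` column exists. The sketch below shows the idea with Python's built-in `str.format`. It assumes placeholders are filled per record from earlier column values; the SDK's actual templating mechanism may differ, and the shortened template and `seed_row` values are made up for illustration.

```python
# Assumption: placeholders are resolved per record from earlier columns,
# similar to Python's str.format. The SDK's real templating may differ.
PROMPT_TEMPLATE = (
    "Generate an instruction to create Python code for the {industry_sector} "
    "sector and {topic} topic at a {code_complexity} level, using "
    "{code_concept}. Begin with: {instruction_phrase}."
)

# One hypothetical seed row, as it might come out of the seed columns.
seed_row = {
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
    "code_complexity": "Beginner",
    "code_concept": "Loops",
    "instruction_phrase": "Write a function that",
}

# Fill every placeholder with the matching column value.
rendered_prompt = PROMPT_TEMPLATE.format(**seed_row)
print(rendered_prompt)
```

With `str.format`, a placeholder whose column is missing raises a `KeyError`, which is a quick way to catch a misspelled column name in a prompt.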

## 🔍 Add Validation and Evaluation

Let's add post-processing steps to validate the generated code and evaluate the text-to-Python conversion.

In [14]:
# Add code validator
data_designer.add_validator(
    validator=ValidatorType.CODE,
    code_lang="python",
    code_columns=["code_implementation"]
)

# Add text-to-python evaluator
data_designer.add_evaluator(
    eval_type=LLMJudgePromptTemplateType.TEXT_TO_PYTHON,
    instruction_column_name="instruction",
    response_column_name="code_implementation"
)
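
This notebook does not show which checks the code validator runs. As a rough stand-in, the sketch below performs only a syntax check with the standard library's `ast.parse`. The `is_valid_python` helper is hypothetical, and the real validator may check more than syntax.

```python
import ast


def is_valid_python(source: str) -> tuple[bool, str]:
    """Hypothetical helper: return (is_valid, message) for a code string."""
    try:
        # ast.parse only checks syntax; it does not execute the code.
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

print(is_valid_python(good))
print(is_valid_python(bad))
```

A syntax check like this catches malformed code but not runtime errors such as undefined names or missing packages, so it is best read as a lower bound on what a validator can report.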

## 👀 Generate Preview Dataset

Let's generate a small preview of 5 records to inspect the output before committing to a full run.

In [15]:
# Generate preview dataset
preview = data_designer.generate_dataset_preview(num_records=5)
print("\nPreview of generated records:")

[14:11:09] [INFO] 🚀 Generating dataset preview
[14:11:09] [INFO] 📥 Step 1: Load data seeds
[14:11:10] [INFO] 🎲 Step 2: Sample data seeds
[14:11:10] [INFO] 🦜 Step 3: Generate column from template >> generating instruction
[14:11:12] [INFO] 🦜 Step 4: Generate column from template >> generating code implementation
[14:12:10] [INFO] 🔍 Step 5: Validate code
[14:12:13] [INFO] ⚖️ Step 6: Judge with llm
[14:12:20] [INFO] 🧐 Step 7: Evaluate dataset
[14:12:20] [INFO] 👀 Your dataset preview is ready for a peek!

Preview of generated records:

In [17]:
preview.display_sample_record()

In [19]:
# Save the preview records to a CSV file
preview.dataset.to_csv('test.csv')

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [None]:
# Submit batch job
batch_job = data_designer.submit_batch_workflow(num_records=100)
df = batch_job.fetch_dataset(wait_for_completion=True)
print("\nGenerated dataset shape:", df.shape)

# Download evaluation report
path = batch_job.download_evaluation_report(wait_for_completion=True)