<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/text-to-code/text-to-python.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Navigator Data Designer SDK: Text-to-Python

This notebook demonstrates how to use the Gretel Navigator SDK to create a synthetic data generation pipeline for Python code examples. We'll build a system that generates Python code based on natural language instructions, with varying complexity levels and industry focuses.

In [9]:
%%capture
# Install the latest version of Gretel client and dependencies
%pip install -U gretel_client 

In [10]:
from gretel_client.navigator_client import Gretel
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

## 📘 Setting Up the Data Designer

First, we'll initialize the Gretel client and create a Data Designer instance, choosing the model suite that matches our licensing needs.

In [11]:
# Initialize Gretel client and Data Designer
gretel = Gretel(api_key="prompt", endpoint="https://api.dev.gretel.ai")
aidd = gretel.data_designer.new(
    model_suite="apache-2.0"  # Use apache-2.0 or llama-3.x based on your licensing needs
)

Found cached Gretel credentials
Logged in as kirit.thadaka@gretel.ai ✅
Using project: default-sdk-project-1b613ec72030408
Project link: https://console-eng.gretel.ai/proj_2uY0cfM0kjiegpyEZvCHNKZYxGf


## 🌱 Define Categorical Seed Columns

We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant code examples.

In [12]:
# Add industry sector categories
aidd.add_column(C.SamplerColumn(
    name="industry_sector",
    type=P.SamplerType.CATEGORY,
    params=P.CategorySamplerParams(
        values=["Healthcare", "Finance", "Technology"],
        description="The industry sector for the code example"
    )
))

# Add topic as a subcategory of industry_sector
aidd.add_column(C.SamplerColumn(
    name="topic",
    type=P.SamplerType.SUBCATEGORY,
    params=P.SubcategorySamplerParams(
        category="industry_sector",
        values={
            "Healthcare": [
                "Electronic Health Records (EHR) Systems",
                "Telemedicine Platforms", 
                "AI-Powered Diagnostic Tools"
            ],
            "Finance": [
                "Fraud Detection Software",
                "Automated Trading Systems",
                "Personal Finance Apps"
            ],
            "Technology": [
                "Cloud Computing Platforms",
                "Artificial Intelligence and Machine Learning Platforms",
                "DevOps and CI/CD Tools"
            ]
        }
    )
))

# Add code complexity levels
aidd.add_column(C.SamplerColumn(
    name="code_complexity",
    type=P.SamplerType.CATEGORY,
    params=P.CategorySamplerParams(
        values=["Beginner", "Intermediate", "Advanced"],
        description="The complexity level of the code"
    )
))

# Add code_concept as a subcategory of code_complexity
aidd.add_column(C.SamplerColumn(
    name="code_concept",
    type=P.SamplerType.SUBCATEGORY,
    params=P.SubcategorySamplerParams(
        category="code_complexity",
        values={
            "Beginner": [
                "Variables",
                "Data Types",
                "Functions",
                "Loops",
                "Classes"
            ],
            "Intermediate": [
                "List Comprehensions",
                "Object-oriented programming",
                "Lambda Functions",
                "Web frameworks",
                "Pandas"
            ],
            "Advanced": [
                "Multithreading",
                "Context Managers",
                "Generators"
            ]
        }
    )
))

# Add instruction phrases
aidd.add_column(C.SamplerColumn(
    name="instruction_phrase",
    type=P.SamplerType.CATEGORY,
    params=P.CategorySamplerParams(
        values=[
            "Write a function that",
            "Create a class that",
            "Implement a script",
            "Can you create a function",
            "Develop a module that"
        ],
        description="Starting phrase for the code instruction"
    )
))
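To make the category/subcategory relationship concrete, here is a minimal plain-Python sketch of the idea: pick a category first, then pick a subcategory only from the values listed under it. This is an illustration under our own assumptions, not how Data Designer implements its samplers; the `TOPICS` mapping is a trimmed copy of the values above, and `sample_seed_row` is a hypothetical helper.

```python
import random

# Trimmed copy of the seed values defined above (illustration only).
TOPICS = {
    "Healthcare": ["Telemedicine Platforms", "AI-Powered Diagnostic Tools"],
    "Finance": ["Fraud Detection Software", "Personal Finance Apps"],
}


def sample_seed_row(rng: random.Random) -> dict:
    """Pick a category, then a subcategory that belongs to that category."""
    sector = rng.choice(list(TOPICS))
    topic = rng.choice(TOPICS[sector])
    return {"industry_sector": sector, "topic": topic}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_seed_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the subcategory is drawn conditionally on the category, a row can never pair, say, "Finance" with "Telemedicine Platforms".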

## ✨ Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction and code implementation.

In [13]:
# Generate instruction for the code
aidd.add_column(
    C.LLMTextColumn(
        name="instruction",
        system_prompt="You are an expert at generating clear and specific programming tasks.",
        prompt="""\
Generate an instruction to create Python code that solves a specific problem. 
Each instruction should begin with the following phrase: {{instruction_phrase}}.

Important Guidelines:
* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.
* Code Complexity: Tailor the instruction to the {{code_complexity}} level. Utilize relevant {{code_concept}} where appropriate to match the complexity level.
* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.
* Response Formatting: Do not include any markers such as ### Response ### in the instruction.
"""
    )
)

# Generate the Python code
aidd.add_column(
    C.LLMCodeColumn(
        name="code_implementation",
        output_format="python",
        system_prompt="You are an expert Python programmer who writes clean, efficient, and well-documented code.",
        prompt="""\
Write Python code for the following instruction:
Instruction: {{instruction}}

Important Guidelines:
* Code Quality: Your code should be clean, complete, self-contained and accurate.
* Code Validity: Please ensure that your Python code is executable and does not contain any errors.
* Packages: Remember to import any necessary libraries, and to use all libraries you import.
* Complexity & Concepts: The code should be written at a {{code_complexity}} level, making use of concepts such as {{code_concept}}.
"""
    )
)
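The prompts above reference earlier columns through `{{column_name}}` placeholders, which appear to be filled in per record with that record's values. The sketch below mimics this substitution with a small regular expression so you can see what a rendered prompt might look like. It is an assumption-laden illustration: `render` is a hypothetical helper, and the real SDK may use a full template engine with more features.

```python
import re


def render(template: str, row: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from row."""

    def substitute(match: re.Match) -> str:
        # group(1) is the column name between the double braces
        return str(row[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical sampled record, using values from the seed columns above.
row = {
    "instruction_phrase": "Write a function that",
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
}
template = (
    "Begin with: {{instruction_phrase}}. "
    "Sector: {{industry_sector}}. Topic: {{ topic }}."
)
prompt = render(template, row)
print(prompt)
```

A placeholder that names a column missing from the record raises `KeyError` in this sketch, which is a useful reminder that columns must be defined before the prompts that reference them.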

## 🔍 Add Validation and Evaluation

Let's add post-processing steps to validate the generated code and evaluate the text-to-Python conversion.

In [14]:
# Add validators and evaluators
from gretel_client.data_designer.judge_rubrics import TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE, PYTHON_RUBRICS

aidd.add_column(C.CodeValidationColumn(
    name="code_validity_result",
    code_lang="python",
    target_column="code_implementation"
))

aidd.add_column(C.LLMJudgeColumn(
    name="code_judge_result",
    prompt=TEXT_TO_PYTHON_LLM_JUDGE_PROMPT_TEMPLATE,
    rubrics=PYTHON_RUBRICS
))
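For intuition about what a code validity check involves, the sketch below tests whether a string parses as Python using the standard library's `ast` module. This is only a syntax check and is our own assumption about a reasonable minimum; `CodeValidationColumn` may well perform deeper analysis such as linting. The helper name `is_valid_python` is hypothetical.

```python
import ast
from typing import Tuple


def is_valid_python(source: str) -> Tuple[bool, str]:
    """Return (True, "") if source parses, else (False, error message)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, ""


good = "def add_one(x):\n    return x + 1\n"
bad = "def add_one(x) return x + 1\n"  # missing colon

good_result = is_valid_python(good)
bad_result = is_valid_python(bad)
print(good_result)
print(bad_result)
```

Parsing never executes the code, so this kind of check is safe to run on untrusted model output, but it cannot catch runtime errors such as undefined names.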

## 👀 Generate Preview Dataset

Let's enable the evaluation report and generate a small preview to inspect sample records before committing to a full run.

In [15]:
aidd.with_evaluation_report()

In [16]:
# Generate a preview
preview = aidd.preview()

[10:26:17] [INFO] 🚀 Generating preview
[10:26:19] [INFO] 🎲 Step 1: Using samplers to generate 5 columns
[10:26:20] [INFO] 🦜 Step 2: Generating text column `instruction`
[10:26:27] [INFO] 🦜 Step 3: Generating code column `code_implementation`
[10:26:46] [INFO] ⚖️ Step 4: Using llm to judge column `code_judge_result`
[10:27:04] [INFO] 🔍 Step 5: Validating code in column `code_implementation`
[10:27:19] [INFO] 🧐 Step 6: Evaluating dataset
[10:27:19] [INFO] 🎉 Your dataset preview is ready!


## 🔎 Inspect Individual Records

- Run the cell below to display individual records for inspection.

- Run the cell multiple times to cycle through the 10 preview records.

- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record.

In [17]:
preview.display_sample_record()

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [18]:
# Submit batch job
workflow_run = aidd.create(
    num_records=100,
    name="text_to_python_examples"
)

workflow_run.wait_until_done()
print("\nGenerated dataset shape:", workflow_run.dataset.df.shape)

[10:27:20] [INFO] 🚀 Submitting batch workflow
▶️ Creating Workflow: w_2vjHcUaS48TcGdBk2Uz3Ths0Ya3
▶️ Created Workflow Run: wr_2vjHcbIDQnSmg8p00S2PKleY4Xl
🔗 Workflow Run console link: https://console-dev.gretel.ai/workflows/w_2vjHcUaS48TcGdBk2Uz3Ths0Ya3/runs/wr_2vjHcbIDQnSmg8p00S2PKleY4Xl
Fetching task logs for workflow run wr_2vjHcbIDQnSmg8p00S2PKleY4Xl
Workflow run is now in status: RUN_STATUS_CREATED
Got task wt_2vjHce3qhg9hoLBwEBDIfNEq9CL
Workflow run is now in status: RUN_STATUS_ACTIVE
[using-samplers-to-generate-5-columns] Task Status is now: RUN_STATUS_ACTIVE
[using-samplers-to-generate-5-columns] 2025-04-14 17:27:48.787889+00:00 Preparing step 'using-samplers-to-generate-5-columns'
[using-samplers-to-generate-5-columns] 2025-04-14 17:28:07.124836+00:00 Starting 'generate_columns_using_samplers' task execution
[using-samplers-to-generate-5-columns] 2025-04-14 17:28:07.126302+00:00 🎲 Using numerical samplers to generate 100 records across 5 columns
[using-samplers-to-generate-5-co

MaxRetryError: HTTPSConnectionPool(host='api.dev.gretel.ai', port=443): Max retries exceeded with url: /v1/workflows/runs/tasks/search?query=workflow_run_id%3Awr_2vjHcbIDQnSmg8p00S2PKleY4Xl (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x11d8ef6e0>: Failed to resolve 'api.dev.gretel.ai' ([Errno 8] nodename nor servname provided, or not known)"))

[evaluating-dataset] Task Status is now: RUN_STATUS_COMPLETED


In [19]:
# Download evaluation report
path = workflow_run.download_report(format="html")