In [1]:
import json
from navigator_helpers import NL2CodeTaskSuite, NL2CodePipeline
from navigator_helpers.tasks.text_to_code.utils import display_nl2code_sample
from navigator_helpers.tasks.prompt_templates.template_suite import load_prompt_template_suite
from navigator_helpers.tasks.prompt_templates.text_to_code import (
    NL_TYPE_PYTHON,
    NL_TYPE_SQL,
)

### Configure your data generation pipeline

Gretel's NL2Code Pipelines are driven by a configuration that controls which characteristics your dataset should have. These characteristics are called Contextual Tags, and they include:
1. Domains: These are broad categories of knowledge. Think of them as the departments you might find in a university. Examples include:
  - Science and Technology
  - Social Sciences
  - Arts and Humanities
  - Business and Economics

2. Topics: Within each Domain, we have more specific areas of focus. These are like the specific courses or research areas within a department. For example:
  - Under "Science and Technology":
    - Artificial Intelligence
    - Neuroscience
    - Quantum Physics
  - Under "Social Sciences":
    - Psychology
    - Anthropology
    - Sociology

3. Complexity Levels: For each prompt, you can define the complexity of the code you want generated.

When you're defining your interests, you can be as broad or as specific as you like. 


When defining your Pipeline, you can let Gretel create a configuration for you with automatically generated Domains and Topics, or you can specify them yourself. Here is an example of a configuration that makes Gretel generate Domains, Topics, and complexity levels for you.

```
code_lang: python
llm_as_a_judge: true
llm_suite_type: open_license
num_complexity_levels: 4
num_domains: 10
num_topics_per_domain: 10
syntax_validation: true
```
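
As a rough guide to how much variety this configuration asks for, the sketch below counts the tag combinations implied by the three `num_*` settings. It assumes a full cross product of domains, topics, and complexity levels; the notebook does not describe how the pipeline actually samples tags, and `count_tag_combinations` is a hypothetical helper, not part of `navigator_helpers`.

```python
# Hypothetical helper for illustration; not part of navigator_helpers.
auto_config = {
    "num_domains": 10,
    "num_topics_per_domain": 10,
    "num_complexity_levels": 4,
}


def count_tag_combinations(config: dict) -> int:
    """Count (domain, topic, complexity) triples, assuming a full cross product."""
    return (
        config["num_domains"]
        * config["num_topics_per_domain"]
        * config["num_complexity_levels"]
    )


print(count_tag_combinations(auto_config))  # 400
```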

You can also write your own configuration that lists the exact Domains, Topics, and complexity levels you want.

```
code_lang: python
llm_suite_type: open_license
domain_and_topics: 
    Astronomy: 
        - Exoplanets
    Artificial Intelligence: 
        - Deep Learning
    Web App Development:
        - Frontend Development
        - Backend Development
complexity_levels:
    - "Novice: Basic syntax, variables, and data types"
    - "Intermediate: Control structures, loops, and functions"
    - "Expert: Asynchronous programming, decorators, and metaclasses"
llm_as_a_judge: true
syntax_validation: true
```
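
To see what a manual configuration like this one spans, the sketch below mirrors its Domains, Topics, and complexity levels as plain Python data and enumerates every combination. It assumes each (domain, topic) pair may be combined with every complexity level; the variable names are illustrative and the pipeline's own sampling logic is not shown in this notebook.

```python
from itertools import product

# Same values as the manual configuration, written as Python data.
domain_and_topics = {
    "Astronomy": ["Exoplanets"],
    "Artificial Intelligence": ["Deep Learning"],
    "Web App Development": ["Frontend Development", "Backend Development"],
}
complexity_levels = [
    "Novice: Basic syntax, variables, and data types",
    "Intermediate: Control structures, loops, and functions",
    "Expert: Asynchronous programming, decorators, and metaclasses",
]

# Flatten the mapping into (domain, topic) pairs.
domain_topic_pairs = [
    (domain, topic)
    for domain, topics in domain_and_topics.items()
    for topic in topics
]

# Assumption: every pair can be combined with every complexity level.
tag_combinations = [
    (domain, topic, level)
    for (domain, topic), level in product(domain_topic_pairs, complexity_levels)
]

print(len(tag_combinations))  # 12
```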

In [2]:
### Use the auto_config.yml file to let Gretel generate the contextual tags.
pipeline = NL2CodePipeline(config="configs/auto_config.yml")


2024-09-05 11:56:55.889 - INFO - ⚙️ Setting up Text-to-Python pipeline
2024-09-05 11:56:55.889 - INFO - 📦 Artifact path: nl2code-artifacts/python
2024-09-05 11:56:56.607 - INFO - 🦜 Initializing LLM suite
2024-09-05 11:56:56.609 - INFO - 📖 Natural language LLM: gretelai/gpt-mixtral-8x-22b
2024-09-05 11:56:57.062 - INFO - 💻 Code LLM: gretelai/gpt-codestral-mamba
2024-09-05 11:56:57.465 - INFO - ⚖️ Judge LLM: gretelai/gpt-groq-llama-3-1-8b


### Run Specific Tasks

You can run individual Tasks within the Pipeline on their own. For example, the next cell generates only the contextual tags.

In [3]:
contextual_tags = pipeline.create_contextual_tags()

2024-09-05 11:56:57.975 - INFO - 🏷️ Generating domains
2024-09-05 11:56:59.920 - INFO - 🏷️ Generating topics for each domain
2024-09-05 11:57:17.548 - INFO - 🏷️ Generating levels of Python complexity


### Generate data
You can run the full pipeline and set the number of data samples to generate with `num_samples`.

In [4]:
results = pipeline.run(num_samples=5)

2024-09-05 11:57:18.975 - INFO - 🚀 Starting Text-to-Python synthetic data pipeline


⌛️ Running Pipeline [current task: syntax validation]: 100%|██████████| 5/5 [00:46, 0.11sample/s]       

2024-09-05 11:58:05.883 - INFO - 🥳 Synthetic dataset generation complete!
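
The progress bar above ends on a syntax validation task. The pipeline's own validator is not shown in this notebook, so the following is only a sketch of one common way to check Python syntax: parse the source with the standard-library `ast` module without executing it. The `is_valid_python` helper is hypothetical.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; the code is never executed."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


print(is_valid_python("total = sum(range(10))"))  # True
print(is_valid_python("total = sum(range(10)"))   # False (unclosed parenthesis)
```

A parse check like this catches malformed code but says nothing about whether the code runs correctly or does what the prompt asked.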





In [5]:
results.display_sample(index=0)

In [7]:
print(results.dataframe.code[0])

import pandas as pd

# 1. Reads the CSV file and loads the data into a suitable data structure (like a list of dictionaries or a pandas DataFrame)
data = pd.read_csv('aircraft_propulsion_data.csv')

# 2. Calculates and adds a new column to the data structure containing the 'Efficiency' of each propulsion system
data['Efficiency'] = data['Thrust (kN)'] / data['Fuel Consumption (kg/h)']

# 3. Filters the data structure to include only turbofan engines and stores the filtered data in a new data structure
turbofan_data = data[data['Engine Type'] == 'Turbofan']

# 4. Calculates the average 'Specific Fuel Consumption' of turbofan engines and prints the result
average_sfc = turbofan_data['Specific Fuel Consumption (kg/kN*hr)'].mean()
print("Average Specific Fuel Consumption of Turbofan Engines:", average_sfc)

# 5. Sorts the filtered data structure based on 'Efficiency' in descending order and prints the top 5 aircraft with the most efficient turbofan engines
top_5_efficient_engines = turbofa