# Navigator Text-SQL Notebook

In [20]:
%%capture
!pip install git+https://github.com/gretelai/gretel-python-client@dev/data-designer-m1

session_kwargs = {
    "api_key": "prompt",
    "endpoint": "https://api-dev.gretel.cloud",
    "cache": "yes",
}

In [21]:
from gretel_client.navigator import DataDesigner

### 📘 Text-to-SQL Blueprint
The blueprint below is inspired by our [synthetic text-to-SQL dataset](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql).  

Data Designer supports SQL dialects such as ANSI, T-SQL, BigQuery, MySQL, and PostgreSQL. This example uses ANSI.

In [22]:
text2sql_blueprint_string = """
model_suite: Apache-2.0

special_system_instructions: >-
  You are an expert at writing SQL queries and technical documentation. You are obsessed with writing
  clean, efficient, and maintainable SQL code. You are tasked with generating SQL queries and natural
  language text that will be used to train a language model that will be used to generate SQL queries.

seed_categories:
  - name: domain
    description: Major industry domain or sector that relies on robust data solutions
    values: [Healthcare, Finance, Retail, Manufacturing, Education, Public Health, Science and Technology, Environmental Science, Government, Media and Entertainment, 
            Transportation, Energy, Agriculture, Food and Beverage, Wellness, Construction, Automotive, Telecommunications, Public Services, Financial Services,
            Medicine, Social Services, Education and Training, Information and Communications, Environment, Textiles, Startups, Legal and Law, Entertainment, Pharmaceuticals, 
            Food Service, Advertising, Financial Planning, Travel and Tourism, Waste Management, E-commerce, Hospitality, Philanthropy, Sports, Social Media,
            Venture Capital, Arts and Culture, Economics, Artificial Intelligence, Biotechnology, Renewable Energy and Sustainability, Business and Entrepreneurship, 
            Defense and Aerospace, Logistics, Oil and Gas, Fashion and Apparel, Human Resources, Music, Nonprofit, Gaming, Insurance, Space Exploration, Banking, Smart Cities, 
            Recreation, Maritime, Electricity, Gas & Water Services, Wholesale Trade, Hotel and Resorts, Rental Services, Fitness, Agricultural Technology, Consulting, Analytics, 
            Chemicals, Urban Planning, Internet of Things, Global Trade, Automation Technology, Journalism, Engineering, Psychology, Scientific Research, Publishing, Cybersecurity, 
            Credit Cards & Loans, Robotics & Computing, Digital Health, Consumer Electronics, Business Intelligence, Market Research, Sales Forecasting, Data Governance, Digital Marketing]
    subcategories:
      - name: topic
        description: Key topics that professional SQL developers care about in the given domain
        num_values_to_generate: 25

  - name: sql_complexity
    description: Complexity of the SQL query, ranging from basic operations to advanced data processing techniques
    values:
      - "Basic SQL"
      - "Aggregation"
      - "Single Join"
      - "Subquery"
      - "Multiple Join"
      - "Window Functions"

  - name: sql_task_type
    description: Type of SQL task that the query represents
    num_values_to_generate: 25
    values:
      - "Data Retrieval"
      - "Data Definition"
      - "Data Manipulation"
      - "Analytics and Reporting"
      - "Database Administration"
      - "Data Cleaning and Transformation"

  - name: natural_language_type
    description: Type of natural language that will be paired with an SQL query
    num_values_to_generate: 25
    values:
      - "a natural language prompt for an SQL query"
      - "a question about how to extract information from a database using SQL"
      - "an explanation of what an SQL query is doing"
      - "an instruction that tells a user to write an SQL query for a specific task"

  - name: data_generation_mode
    description: Mode of data generation for different contexts
    values:
      - "Creation"
      - "Editing"
      - "Augmentation"
      - "Sequential"
      - "Time Series"

data_columns:
  - name: domain_description
    description: Detailed description of the {domain}
    specific_instructions: "Provide a detailed description of the {domain} that includes one key area of focus for a subdomain for writing SQL."

  - name: sql_complexity_description
    description: Description of the complexity level of the SQL query
    specific_instructions: "Provide a description for the {sql_complexity} level, highlighting the types of SQL operations involved."
    relevant_columns: [sql_complexity]

  - name: sql_task_type_description
    description: Description of the type of SQL task
    specific_instructions: "Provide a description for the {sql_task_type}, including its typical use cases in the {domain} industry."
    relevant_columns: [sql_task_type, domain]

  - name: sql_prompt
    description: Natural language prompt that will be paired with an SQL query
    specific_instructions: "You are an expert engineer, well versed in prompt tuning for LLMs.
    Create a natural language prompt to generate SQL in the field of {domain}, specifically about the domain of {topic}. 
    Feel free to ask for data that focus on a smaller subject within the scope of {natural_language_type}.

    Make sure to use the following guidelines:
        * Just return the generated text without any additional preface or content
        * The generated text must include diverse words / sentences and must always specify all required column names
    "
    relevant_columns: [domain, topic, natural_language_type]

  - name: sql_context
    description: SQL context that provides additional information, such as table or view creation statements
    specific_instructions: "Generate an SQL statement as context that is relevant to the {domain} industry and aligns with the {topic} and {sql_prompt}."
    relevant_columns: [domain, topic, sql_prompt]
    output_type: code
    llm_type: code

  - name: sql
    description: SQL query that will be paired with natural language text
    specific_instructions: "Write an SQL query to accomplish the task described by {sql_prompt}. Use the provided {sql_context} if applicable."
    relevant_columns: [domain, topic, sql_complexity, sql_task_type, sql_context, sql_prompt]
    output_type: code
    llm_type: code

  - name: sql_explanation
    description: Natural language explanation of what the SQL query is doing
    specific_instructions: "Provide a detailed explanation of the SQL query, including the purpose of each clause and how it contributes to achieving the task described by {sql_prompt}."
    relevant_columns: [sql_prompt, sql, sql_context]

data_validators:
  - validator: code
    code_lang: ansi
    code_columns: [sql]
"""
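
The `specific_instructions` templates refer to other columns through `{name}` placeholders, so each placeholder should name a seed category, a subcategory, or a data column defined earlier in the blueprint. The sketch below checks that ordering using only the standard library. It is an illustration rather than part of Data Designer: the `find_unresolved` helper is hypothetical, the templates are abbreviated copies of the ones above, and it assumes the placeholders follow Python's `str.format` syntax.

```python
from string import Formatter

# Names available before any data column is generated: the seed categories
# and the `topic` subcategory from the blueprint above.
seed_names = {"domain", "topic", "sql_complexity", "sql_task_type",
              "natural_language_type", "data_generation_mode"}

# (column name, template) pairs in blueprint order; templates are abbreviated.
data_columns = [
    ("domain_description", "Provide a detailed description of the {domain}."),
    ("sql_prompt", "Create a prompt about {topic} in {domain} as {natural_language_type}."),
    ("sql_context", "Generate SQL context for {domain}, {topic} and {sql_prompt}."),
    ("sql", "Write an SQL query for {sql_prompt}. Use {sql_context} if applicable."),
    ("sql_explanation", "Explain the query for {sql_prompt}."),
]


def find_unresolved(columns, available):
    """Return (column, placeholder) pairs that reference an undefined name."""
    known = set(available)
    problems = []
    for name, template in columns:
        # Formatter.parse yields (literal, field_name, format_spec, conversion).
        for _, field, _, _ in Formatter().parse(template):
            if field and field not in known:
                problems.append((name, field))
        known.add(name)  # later columns may reference this one
    return problems


unresolved = find_unresolved(data_columns, seed_names)
print(unresolved)
```

An empty list means every placeholder resolves to something defined earlier; a non-empty list points at the column and placeholder to fix before submitting the blueprint.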

In [23]:
# Create a DataDesigner instance from the blueprint string
designer = DataDesigner.from_config(text2sql_blueprint_string, **session_kwargs)
designer


Found cached Gretel credentials
Using endpoint https://api-dev.gretel.cloud
Logged in as dhruv@gretel.ai ✅


DataDesigner(
    seed_categories (needs generation): ['domain:topic', 'sql_complexity', 'sql_task_type', 'natural_language_type', 'data_generation_mode']
    data_columns: ['domain_description', 'sql_complexity_description', 'sql_task_type_description', 'sql_prompt', 'sql_context', 'sql', 'sql_explanation']
    validators: ['code']
)

### 👀 Generating a dataset preview

In [24]:
preview = designer.generate_dataset_preview()

[20:56:47] [INFO] 🚀 Generating dataset preview
[20:56:48] [INFO] 🦜 Step 1: Generate seed category values
[21:00:13] [INFO] 🌱 Step 2: Sample data seeds
[21:00:13] [INFO] 🦜 Step 3: Generate column from template >> generating domain description
[21:00:16] [INFO] 🦜 Step 4: Generate column from template >> generating sql complexity description
[21:00:19] [INFO] 🦜 Step 5: Generate column from template >> generating sql task type description
[21:00:22] [INFO] 🦜 Step 6: Generate column from template >> generating sql prompt
[21:00:24] [INFO] 🦜 Step 7: Generate column from template >> generating sql context
[21:00:28] [INFO] 🦜 Step 8: Generate column from template >> generating sql
[21:00:31] [INFO] 🦜 Step 9: Generate column from template >> generating sql explanation
[21:00:36] [INFO] 🔍 Step 10: Validate code
[21:00:38] [INFO] 👀 Your dataset preview is ready for a peek!
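
Step 2 in the log samples one value per seed category for each record. One rough way to picture this is sketched below, assuming independent uniform sampling per category and a `topic` drawn conditionally on the sampled `domain`. The notebook does not show how Data Designer actually samples, so treat the `sample_seed` helper and the sampling strategy as assumptions.

```python
import random

# Excerpt of the seed categories; values copied from the blueprint.
seed_values = {
    "sql_complexity": ["Basic SQL", "Aggregation", "Single Join",
                       "Subquery", "Multiple Join", "Window Functions"],
    "data_generation_mode": ["Creation", "Editing", "Augmentation",
                             "Sequential", "Time Series"],
}

# Topics are generated per domain (25 each in the blueprint); these few
# come from the preview output and stand in for the full lists.
topics_by_domain = {
    "Consulting": ["Predictive Analytics Applications"],
    "Engineering": ["Data Quality Assurance"],
    "Medicine": ["Drug Interaction Studies"],
}


def sample_seed(rng):
    """Draw one seed row: independent categories plus a domain-specific topic."""
    row = {name: rng.choice(values) for name, values in seed_values.items()}
    row["domain"] = rng.choice(sorted(topics_by_domain))
    # The subcategory is sampled conditionally on its parent category.
    row["topic"] = rng.choice(topics_by_domain[row["domain"]])
    return row


rng = random.Random(0)  # fixed seed so the sketch is repeatable
seeds = [sample_seed(rng) for _ in range(3)]
for seed in seeds:
    print(seed)
```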


In [25]:
preview.display_dataframe_in_notebook()

Unnamed: 0,sql_complexity,data_generation_mode,sql_task_type,natural_language_type,domain,topic,domain_description,sql_complexity_description,sql_task_type_description,sql_prompt,sql_context,sql,sql_explanation,sql_is_valid,sql_messages
0,Basic SQL,Editing,Role Management,a query to insert new data,Consulting,Predictive Analytics Applications,"The domain of Consulting focuses on providing expert advice and strategic guidance to clients to improve their business operations and performance. A key area of focus is Role Management, which involves defining, creating, and maintaining user roles within a database system to ensure proper access control and data security. This subdomain requires writing SQL queries to manage user roles, such as inserting new roles, updating role permissions, and deleting roles as needed.","The `sql_complexity_description` for Basic SQL level is: Involves simple operations such as SELECT, INSERT, UPDATE, and DELETE on single tables without complex conditions, joins, or subqueries.","The `sql_task_type_description` for Role Management is: Managing user roles and permissions within a database to ensure secure access and compliance with organizational policies. Typical use cases in the Consulting industry include setting up role-based access control for clients, ensuring data privacy, and facilitating secure data sharing among team members.","Please provide the necessary columns (customer_id, purchase_date, product_category, quantity, price) and any additional filters or conditions you would like to apply, so I can generate the SQL query to insert new data into the Predictive Analytics Applications domain.","CREATE TABLE Predictive_Analytics (  customer_id INT,  purchase_date DATE,  product_category VARCHAR(100),  quantity INT,  price DECIMAL(10, 2) ); INSERT INTO Predictive_Analytics (customer_id, purchase_date, product_category, quantity, price) SELECT customer_id, purchase_date, product_category, quantity, price FROM Sales_Data WHERE product_category IN ('Consulting Services', 'IT Solutions', 'Data Analytics') AND purchase_date >= DATEADD(year, -1, GETDATE());","INSERT INTO Predictive_Analytics (customer_id, purchase_date, product_category, quantity, price) SELECT 
customer_id, purchase_date, product_category, quantity, price FROM Sales_Data WHERE product_category IN ('Consulting Services', 'IT Solutions', 'Data Analytics') AND purchase_date >= DATEADD(year, -1, GETDATE());","The SQL query inserts new data into the Predictive_Analytics table from the Sales_Data table. It selects records where the product_category is either 'Consulting Services', 'IT Solutions', or 'Data Analytics' and the purchase_date is within the last year. The query ensures that only relevant data is inserted by applying filters on the product_category and purchase_date.",True,[]
1,Subquery,Creation,Alter Table,a request to recover lost data,Engineering,Data Quality Assurance,"The Engineering domain focuses on the design, development, and maintenance of complex systems. A key area of focus is on writing SQL to manage and manipulate database structures. This includes using subqueries and altering tables to recover lost data, ensuring data integrity and consistency. This task requires a deep understanding of relational database management systems and the ability to craft efficient, error-free SQL commands.","The Subquery level involves SQL queries that incorporate one or more subqueries within the main query. These subqueries can be used in various clauses such as SELECT, FROM, WHERE, and JOIN. Subqueries at this level can perform operations like filtering, aggregation, and joining data from multiple tables, increasing the complexity of the query.","The `sql_task_type_description` for Alter Table is a SQL command used to modify the structure of an existing table. Common use cases in the Engineering industry include adding or removing columns, altering column data types, adding or dropping constraints, and modifying table properties to better fit the database schema requirements.","Can you provide me with the SQL query to retrieve all records from the `sensor_readings` table where the `temperature` column value is NULL or outside the expected range of 0 to 100 degrees Celsius, along with the corresponding `sensor_id` and `reading_time`?","SELECT sensor_id, reading_time FROM sensor_readings WHERE temperature IS NULL OR temperature < 0 OR temperature > 100;","SELECT sensor_id, reading_time FROM sensor_readings WHERE temperature IS NULL OR temperature < 0 OR temperature > 100;","The SQL query retrieves records from the `sensor_readings` table where the `temperature` is either NULL or outside the range of 0 to 100 degrees Celsius. It selects the `sensor_id` and `reading_time` for these records. 
The WHERE clause filters the rows based on the temperature condition, ensuring only relevant data is included in the result set.",True,[]
2,Basic SQL,Augmentation,Backup Database,an explanation of using aggregate functions,Robotics & Computing,Autonomous System Control,"The domain of Robotics & Computing encompasses the integration of computational systems with mechanical and electrical systems to create intelligent machines. A key subdomain focuses on database management within robotics, where SQL is crucial for handling the vast amounts of data generated by robotic systems. This includes tasks such as backup and recovery of databases to ensure data integrity and availability.","The `sql_complexity_description` for Basic SQL level is: Involves simple SELECT, INSERT, UPDATE, and DELETE operations without the use of advanced features such as JOINs, subqueries, or complex expressions.","The `sql_task_type_description` for Backup Database is a process used to create copies of database files for the purpose of disaster recovery, data restoration, and maintaining historical data. In the Robotics & Computing industry, this task is crucial for ensuring data integrity and system reliability, allowing for quick recovery in case of hardware failures, software errors, or data corruption.","Please provide the table name and column names you would like to use for generating the SQL query related to Autonomous System Control in Robotics & Computing. Specifically, include columns such as `system_id`, `sensor_data`, `actuator_response`, and `control_algorithm`.","CREATE TABLE autonomous_system_control (  system_id INT PRIMARY KEY,  sensor_data VARCHAR(255),  actuator_response VARCHAR(255),  control_algorithm VARCHAR(255) );","CREATE TABLE autonomous_system_control (  system_id INT PRIMARY KEY,  sensor_data VARCHAR(255),  actuator_response VARCHAR(255),  control_algorithm VARCHAR(255) );","The SQL query is creating a table named `autonomous_system_control` to store data related to the control mechanisms of an autonomous system. 
It includes columns for `system_id` (to uniquely identify each system), `sensor_data` (to store data collected by sensors), `actuator_response` (to record the system's response to actuator commands), and `control_algorithm` (to document the algorithm used for control).",True,[]
3,Multiple Join,Time Series,Alter Table,an instruction to delete outdated records,Medicine,Drug Interaction Studies,"The `domain_description` column for the Medicine subdomain focuses on detailing the various aspects of a specific medicine, including its composition, usage, side effects, and storage conditions. A key area of focus is writing SQL to delete outdated records from the Medicine table, ensuring that only the most current information is retained. This involves joining tables such as Medicine, Stock, and Sales to identify and remove entries that no longer meet the criteria for being up-to-date.","The `sql_complexity_description` for the Multiple Join level is: This query involves joining three or more tables, requiring the use of JOIN, ON, and sometimes USING clauses to link multiple data sources, thereby increasing the complexity due to potential data inconsistencies and performance issues.","The `sql_task_type_description` for Alter Table is a SQL command used to modify the structure of an existing table. In the Medicine industry, it is commonly used to add or remove columns to accommodate new data types or to adjust existing data models as the requirements evolve. This command is also used to rename columns, modify column data types, and add or drop constraints, ensuring that the database schema aligns with the latest medical research, patient data management needs, or regulatory requirements.","Please provide the SQL query to delete outdated drug interaction records from the `drug_interactions` table where the `interaction_date` is earlier than January 1, 2010, and the `status` is marked as 'inactive'. 
The table has the following columns: `id`, `drug_a`, `drug_b`, `interaction_date`, `status`.",DELETE FROM drug_interactions WHERE interaction_date < '2010-01-01' AND status = 'inactive';,DELETE FROM drug_interactions WHERE interaction_date < '2010-01-01' AND status = 'inactive';,"The SQL query deletes outdated drug interaction records from the `drug_interactions` table where the `interaction_date` is earlier than January 1, 2010, and the `status` is marked as 'inactive'. The query checks each record to ensure that both conditions (`interaction_date < '2010-01-01'` and `status = 'inactive'`) are met before deleting it.",True,[]
4,Subquery,Time Series,Revoke Permissions,a request to retrieve customer data,Pharmaceuticals,Pharmaceutical Market Access,"The domain description for Pharmaceuticals includes a detailed overview of drug development, manufacturing processes, regulatory compliance, and market analysis. A key area of focus for writing SQL in this domain is managing and querying time-series data related to drug approvals and recalls. This involves creating and updating tables that track the historical timeline of drug approvals, recalls, and other regulatory actions, ensuring accurate and timely data retrieval for compliance and operational purposes.","The Subquery level involves SQL queries that include one or more subqueries, which are nested within the main query. These subqueries can be used in the FROM, WHERE, or JOIN clauses and can perform operations such as selection, aggregation, or joining with other tables. This level of complexity allows for more advanced data manipulation and retrieval, enabling queries to be more dynamic and flexible.","The `sql_task_type_description` for Revoke Permissions is a SQL command used to remove existing permissions from users or roles within a database. 
In the Pharmaceuticals industry, this task is typically used to ensure compliance with strict data access regulations, revoking unnecessary or expired permissions to maintain secure and controlled data access.","Please provide the SQL query to retrieve the list of all pharmaceutical products along with their respective market access status, approval date, and the name of the regulatory body for each product in the year 2022.","CREATE TABLE pharmaceutical_products (  product_id INT PRIMARY KEY,  product_name VARCHAR(255),  market_access_status VARCHAR(50),  approval_date DATE,  regulatory_body VARCHAR(100) ); INSERT INTO pharmaceutical_products (product_id, product_name, market_access_status, approval_date, regulatory_body) SELECT p.product_id,  p.product_name,  m.market_access_status,  a.approval_date,  r.regulatory_body FROM products p JOIN market_access m ON p.product_id = m.product_id JOIN approvals a ON p.product_id = a.product_id JOIN regulatory_bodies r ON p.product_id = r.product_id WHERE YEAR(a.approval_date) = 2022;","SELECT product_name, market_access_status, approval_date, regulatory_body FROM pharmaceutical_products WHERE YEAR(approval_date) = 2022;","The SQL query retrieves a list of all pharmaceutical products along with their respective market access status, approval date, and the name of the regulatory body for each product in the year 2022. The query filters the records to include only those where the approval_date falls within the year 2022 using the YEAR() function. It selects the columns product_name, market_access_status, approval_date, and regulatory_body from the pharmaceutical_products table.",True,[]
5,Basic SQL,Editing,Window Functions,an explanation of what an SQL query is doing,Space Exploration,Space Tourism Regulation,"The domain description for Space Exploration involves the study and practice of traveling through and utilizing space. Key areas of focus include astronautics, astrophysics, and the development of spacecraft and launch vehicles. For writing SQL, a subdomain could be the tracking and management of spacecraft trajectories and mission data. This involves using window functions to analyze real-time and historical data for optimizing flight paths, ensuring safety, and managing fuel consumption.","The `sql_complexity_description` for Basic SQL level is: Involves simple SQL operations such as SELECT, FROM, WHERE, and JOIN for retrieving and filtering data from one or more tables.","The `sql_task_type_description` for Window Functions in the context of Space Exploration is: Window Functions allow for the application of aggregate functions across a set of table rows related to the current row. Typical use cases include calculating moving averages of telemetry data over time to monitor spacecraft health, or ranking mission data to prioritize tasks based on real-time metrics.","Can you provide a list of all space tourism companies along with the number of approved flights and the total revenue generated from each company? Please include the columns: company_name, approved_flights, and total_revenue.","CREATE VIEW space_tourism_stats AS SELECT company_name, COUNT(*) AS approved_flights, SUM(revenue) AS total_revenue FROM space_tourism_data GROUP BY company_name;","SELECT company_name, approved_flights, total_revenue FROM space_tourism_stats;","The SQL query selects the company name, the number of approved flights, and the total revenue generated from each space tourism company. It accomplishes this by querying a view named `space_tourism_stats`, which itself is derived from the `space_tourism_data` table. 
The view counts the number of approved flights for each company and calculates the total revenue generated by summing up the revenue for each company, then groups the results by company name.",True,[]
6,Aggregation,Sequential,Select Data,a query to find products in stock,Cybersecurity,Zero Trust Architecture,"The domain description for Cybersecurity in this context focuses on protecting and securing databases, particularly those handling sensitive information. A key area of focus is the management and protection of SQL queries, especially those involving aggregation to ensure data integrity and confidentiality. For instance, the task involves selecting data on products in stock, which requires careful handling to prevent unauthorized access or data leakage. The sequential data generation mode implies that the dataset is generated in a structured, ordered manner, further emphasizing the importance of secure and efficient SQL query execution.","The `sql_complexity_description` for the Aggregation level indicates that the SQL query involves operations such as COUNT, SUM, AVG, MIN, and MAX to summarize data across one or more columns. These operations are typically used to provide a concise summary of the dataset, such as calculating the total sales, average price, or the minimum and maximum values of a particular metric.","The `sql_task_type_description` for Select Data is a SQL query used to retrieve specific data from one or more tables in a database. 
In the Cybersecurity industry, this type of query is commonly used to extract user activity logs, network traffic data, or security event records for analysis and monitoring purposes.","Please provide the SQL query that retrieves the list of devices and their statuses in our network, focusing on those that have not undergone multi-factor authentication within the last 30 days, along with the last authentication timestamp and device location.","CREATE VIEW device_auth_status AS SELECT device_id, device_status, last_auth_timestamp, device_location FROM network_devices WHERE last_auth_timestamp < NOW() - INTERVAL '30 days' AND multi_factor_auth = 'false'; SELECT * FROM device_auth_status;",SELECT * FROM device_auth_status;,"The SQL query retrieves a list of devices and their statuses from the network, focusing on devices that have not undergone multi-factor authentication within the last 30 days. It includes the last authentication timestamp and device location for these devices. This is achieved by querying the `device_auth_status` view, which selects `device_id`, `device_status`, `last_auth_timestamp`, and `device_location` from the `network_devices` table where `last_auth_timestamp` is older than 30 days and `multi_factor_auth` is set to 'false'.",True,[]
7,Single Join,Augmentation,Index Management,an explanation of joining tables,Transportation,Train Scheduling Algorithms,The domain description for Transportation focuses on the movement of goods and people. A key area of focus for writing SQL in this domain is managing the join operations between the `vehicles` and `routes` tables to ensure efficient routing and scheduling. This involves optimizing the join conditions to minimize the time and resources required for querying the transportation network.,"The `sql_complexity_description` for the Single Join level is: This query involves a single join operation to combine data from two tables based on a related column, enhancing data retrieval and analysis without the complexity of multiple joins.","The `sql_task_type_description` for Index Management in the Transportation industry typically involves creating, maintaining, and optimizing indexes on database tables to improve query performance. This includes tasks such as ensuring fast retrieval of data for tracking shipments, managing large volumes of transportation records, and supporting real-time analytics for logistics operations.","Can you provide me with the SQL query to join the `trains`, `schedules`, and `stations` tables to show the train number, station name, and arrival time for each train at each station, along with the scheduled departure time from the previous station? 
Please include column names in your query.","CREATE TABLE trains (  train_id INT PRIMARY KEY,  train_number VARCHAR(10) ); CREATE TABLE schedules (  schedule_id INT PRIMARY KEY,  train_id INT,  station_id INT,  arrival_time TIME,  departure_time TIME,  FOREIGN KEY (train_id) REFERENCES trains(train_id) ); CREATE TABLE stations (  station_id INT PRIMARY KEY,  station_name VARCHAR(50) ); SELECT t.train_number,  s.station_name,  s.arrival_time,  s.departure_time FROM trains t JOIN schedules s ON t.train_id = s.train_id JOIN stations st ON s.station_id = st.station_id ORDER BY t.train_id, s.station_id;","SELECT t.train_number,  st.station_name,  s.arrival_time,  s.departure_time FROM trains t JOIN schedules s ON t.train_id = s.train_id JOIN stations st ON s.station_id = st.station_id ORDER BY t.train_id, s.station_id;","sql_explanation: The SQL query joins the `trains`, `schedules`, and `stations` tables to retrieve the train number, station name, arrival time, and departure time for each train at each station. It starts by selecting the `train_number` from the `trains` table, the `station_name` from the `stations` table, and the `arrival_time` and `departure_time` from the `schedules` table. The query then uses JOIN clauses to match records from these tables based on their respective IDs. Finally, the results are ordered by `train_id` and `station_id` to ensure the data is presented in a logical sequence.",True,[]
8,Window Functions,Sequential,Window Functions,an explanation of using subqueries,Artificial Intelligence,AI in Marketing,"The Artificial Intelligence in focus here is specialized in the domain of writing SQL queries, particularly with an emphasis on subqueries and window functions. A key area of focus for this AI is optimizing the use of subqueries to enhance query performance and readability. Subqueries are used to break down complex queries into more manageable parts, allowing for more precise data retrieval and manipulation. The AI is designed to generate and optimize subqueries in a sequential data generation mode, ensuring that each subquery contributes effectively to the overall query.","The Window Functions level describes SQL queries that utilize window functions, such as ROW_NUMBER(), RANK(), DENSE_RANK(), SUM(), AVG(), MIN(), MAX(), and other aggregate functions. These functions operate on a set of rows related to the current row within a result set partition, allowing for complex calculations and data analysis without the need for self-joins or subqueries.","The `sql_task_type_description` for Window Functions is: Window Functions allow for the application of aggregate functions across a set of table rows that are related to the current row. They are commonly used in the Artificial Intelligence industry for tasks such as ranking, moving averages, and calculating cumulative values, which are essential for data analysis and feature engineering in AI models.","Generate a SQL query to find the top 5 AI marketing campaigns based on engagement rate, including the campaign name, start date, end date, and total number of interactions. 
Also, include the average engagement rate per day for each campaign.","CREATE VIEW ai_marketing_campaigns AS SELECT campaign_name,  start_date,  end_date,  COUNT(interaction_id) AS total_interactions,  AVG(engagement_rate) AS avg_engagement_rate_per_day FROM ai_campaign_interactions GROUP BY campaign_name, start_date, end_date ORDER BY total_interactions DESC LIMIT 5;","SELECT campaign_name,  start_date,  end_date,  total_interactions,  avg_engagement_rate_per_day,  AVG(engagement_rate) OVER (PARTITION BY campaign_name) AS avg_engagement_rate FROM ai_marketing_campaigns ORDER BY total_interactions DESC LIMIT 5;","The SQL query aims to identify the top 5 AI marketing campaigns based on engagement rate. It selects the campaign name, start date, end date, and total number of interactions. The query also calculates the average engagement rate per day for each campaign. The query is constructed in two main parts: a view `ai_marketing_campaigns` that aggregates data from `ai_campaign_interactions`, and the main query that selects from this view, orders the results by total interactions in descending order, and limits the output to the top 5 campaigns.",True,[]
9,Multiple Join,Editing,Join Tables,an instruction to optimize query performance,Electricity,Renewable Integration,"The domain description for Electricity focuses on the generation, transmission, and distribution processes. A key area of focus is optimizing query performance by efficiently managing large datasets related to electricity usage and grid operations. This involves joining tables from different sources such as meter readings, transformer statuses, and customer consumption patterns to provide real-time insights and predictive analytics. Efficient indexing and partitioning strategies are crucial for handling high-volume data and ensuring quick query execution.","The `sql_complexity_description` for the Multiple Join level indicates a query that involves joining three or more tables to retrieve data. This level of complexity typically requires the use of INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL OUTER JOIN operations to combine rows from two or more tables based on a related column between them. Such queries are more intricate and may require careful handling of join conditions and multiple data relationships to ensure accurate and efficient data retrieval.","The `sql_task_type_description` for Join Tables in the Electricity industry is: To combine data from two or more tables based on a related column between them, facilitating the retrieval of integrated information such as customer details and electricity consumption records. Common use cases include generating billing statements, analyzing usage patterns, and ensuring accurate meter readings and customer billing accuracy.","Generate a SQL query to retrieve the total energy production from solar and wind sources for each month in the year 2022, along with the average production for each source. 
Include the month name, total solar production, total wind production, and average production for each source.","CREATE VIEW Energy_Production AS SELECT MONTH(date) AS month_number,  MONTHNAME(date) AS month_name,  SUM(CASE WHEN source = 'solar' THEN production ELSE 0 END) AS total_solar_production,  SUM(CASE WHEN source = 'wind' THEN production ELSE 0 END) AS total_wind_production,  AVG(CASE WHEN source = 'solar' THEN production ELSE 0 END) AS avg_solar_production,  AVG(CASE WHEN source = 'wind' THEN production ELSE 0 END) AS avg_wind_production FROM energy_data WHERE YEAR(date) = 2022 GROUP BY MONTH(date), MONTHNAME(date);","SELECT month_number, month_name, total_solar_production, total_wind_production, avg_solar_production, avg_wind_production FROM Energy_Production;","The SQL query retrieves monthly energy production data for the year 2022, specifically focusing on total and average production from solar and wind sources. It first creates a view named `Energy_Production` by selecting the month number and name from the `date` column. It then calculates the total solar and wind production by summing up the `production` values where the `source` is 'solar' or 'wind', respectively. Additionally, it computes the average solar and wind production by averaging the `production` values under the same conditions. The results are grouped by month number and month name, providing a detailed summary of energy production for each month in 2022.",True,[]


### 🔎 Taking a closer look at single records

In [26]:
designer.display_sample_record(preview.output.sample(1))

### 🤔 Like what you see? Generate an entire dataset

In [27]:
# Submit a batch workflow to generate records
results = designer.submit_batch_workflow(num_records=50)

[21:00:39] [INFO] 🛜 Connecting to your Gretel Project:
[21:00:39] [INFO] 🔗 -> https://console-dev.gretel.ai/proj_2o7zwNweGeCVqAYglVkYgmCMwet
[21:00:43] [INFO] ▶️ Starting your workflow run to generate 50 records:
[21:00:43] [INFO] 🔗 -> https://console-dev.gretel.ai/workflows/w_2o7zwUp8hHds6wIMBpVOrHbjDW6/runs/wr_2o7zwjG52YmzVYb1vpN3EUENz8u


In [28]:
# Fetch the dataset, blocking until the workflow run completes
df = results.fetch_dataset(wait_for_completion=True)

[21:00:44] [INFO] 🏗️ We are still building your dataset. Workflow status: CREATED.
[21:00:44] [INFO] ⏳ Waiting for workflow run to complete...
[21:00:44] [INFO] 👀 Follow along -> https://console-dev.gretel.ai/workflows/w_2o7zwUp8hHds6wIMBpVOrHbjDW6/runs/wr_2o7zwjG52YmzVYb1vpN3EUENz8u
[21:10:55] [INFO] ✅ Fetching dataset from completed workflow run


In [29]:
# Inspect the first few rows of the dataset
df.head()

Unnamed: 0,sql_complexity,data_generation_mode,sql_task_type,natural_language_type,domain,topic,domain_description,sql_complexity_description,sql_task_type_description,sql_prompt,sql_context,sql,sql_explanation,sql_is_valid,sql_messages
0,Single Join,Creation,Drop Table,an explanation of using transactions,Artificial Intelligence,AI in Cybersecurity,The Artificial Intelligence in this domain foc...,The Single Join level involves a SQL query tha...,The `Drop Table` SQL task is used to remove a ...,Please provide the specific details or column ...,CREATE TABLE ai_cybersecurity_transactions (\n...,DROP TABLE ai_cybersecurity_transactions;,sql_explanation: The SQL query is dropping the...,True,[]
1,Aggregation,Augmentation,Analytics and Reporting,an explanation of creating a view,Public Services,Public Library Management,The domain description for Public Services inv...,The Aggregation level describes SQL queries th...,The `sql_task_type_description` for Analytics ...,Generate a SQL query to create a view named `l...,CREATE VIEW library_book_count_by_genre AS\nSE...,CREATE VIEW library_book_count_by_genre AS\nSE...,The SQL query creates a view named `library_bo...,True,[]
2,Multiple Join,Augmentation,Drop Table,an instruction to delete outdated records,Construction,Site Safety Standards,The domain description for the Construction su...,The `sql_complexity_description` for the Multi...,The `sql_task_type_description` for Drop Table...,Generate a SQL query to delete outdated record...,-- Create table statement for SiteSafetyInspec...,DELETE FROM SiteSafetyInspection WHERE inspect...,The SQL query deletes records from the SiteSaf...,True,[]
3,Window Functions,Creation,View Deletion,a command to filter data by date,Digital Health,Electronic Prescriptions,Digital Health encompasses the integration of ...,The Window Functions level describes SQL queri...,The `sql_task_type_description` for View Delet...,Please provide the SQL query to retrieve all e...,CREATE VIEW ElectronicPrescriptions AS\nSELECT...,DROP VIEW ElectronicPrescriptions;,The SQL query first creates a view named `Elec...,True,[]
4,Multiple Join,Editing,Delete Data,an instruction that tells a user to write an S...,Renewable Energy and Sustainability,Eco-Friendly Transportation,The domain description for Renewable Energy an...,The `sql_complexity_description` for the Multi...,The `sql_task_type_description` for Delete Dat...,Please write an SQL query to find the total nu...,"SELECT city, COUNT(*) AS total_charging_statio...",DELETE FROM electric_vehicle_charging_stations...,The SQL query deletes all records from the `el...,True,[]
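
Each record in the output above carries an `sql_is_valid` flag. As a rough, independent spot check, one could run a record's `sql_context` followed by its `sql` against an in-memory SQLite database. The sketch below makes several assumptions: `runs_in_sqlite` is a hypothetical helper, not part of the Gretel client; the two records are made-up stand-ins that only reuse the `sql_context` and `sql` column names; and SQLite's dialect differs from ANSI, so a failure here does not necessarily mean the generated SQL is wrong. With the real dataset, `df.to_dict("records")` should yield rows in the same shape.

```python
import sqlite3


def runs_in_sqlite(sql_context: str, sql: str) -> bool:
    """Return True if the context and the query both execute in a fresh in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # set up tables/views the query depends on
        conn.executescript(sql)          # run the generated statement(s)
        return True
    except sqlite3.Error:
        # Covers syntax errors, missing tables/columns, and other database errors.
        return False
    finally:
        conn.close()


# Hypothetical stand-in records; not taken from the generated dataset.
records = [
    {
        "sql_context": "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL);",
        "sql": "DROP TABLE transactions;",
    },
    {
        "sql_context": "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL);",
        "sql": "SELECT missing_column FROM transactions;",
    },
]

checks = [runs_in_sqlite(r["sql_context"], r["sql"]) for r in records]
print(checks)
```

Each record gets its own connection so that a `DROP` or `DELETE` in one record cannot affect the next. Treat the result as a second opinion alongside `sql_is_valid` rather than a replacement for it.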
