## Synthetic Dataset Generator

- 🌍 **Task**: Generate realistic synthetic datasets
- 🎯 **Supported Data Types**: Tabular, Text, Time-series
- 🧠 **Models**: GPT (OpenAI), Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference)
- 🚀 **Tools**: Python, pandas, numpy, Gradio UI, OpenAI / Anthropic / HuggingFace APIs
- 📤 **Output Formats**: JSON and CSV files
- 🧑‍💻 **Skill Level**: Intermediate
- ⚙️ **Hardware**: ✅ CPU is sufficient — no GPU required

🎯 **How It Works**

1️⃣ Define your business problem or dataset topic.  
2️⃣ Choose the dataset type, output format, model, and number of samples.  
3️⃣ Get a ready-to-use synthetic dataset, instantly downloadable in your preferred format!

🛠️ **Requirements**  
- 🔑 OpenAI API Key (for GPT)  
- 🔑 Anthropic API Key (for Claude)  
- 🔑 Hugging Face Token + Endpoint (for CodeQwen1.5-7B-Chat via HF Inference)  
  e.g. `https://w0e2ze0xyrjnhx0y.us-east-1.aws.endpoints.huggingface.cloud`
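
The credentials listed above are read from environment variables (typically via a `.env` file). As a minimal sketch, assuming the variable names `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `HF_TOKEN` that the Initialization cell reads, a pre-flight check could look like this. The helper name `missing_credentials` is hypothetical and not part of the notebook.

```python
import os

# Environment variable names read in the Initialization cell of this notebook.
REQUIRED_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "HF_TOKEN")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All credentials found.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.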

⚙️ **Customizable by user**  
- 🤖 Selected model: GPT / Claude / CodeQwen  
- 📜 `system_prompt`: Controls model behavior (concise, accurate, structured)  
- 💬 `user_prompt`: Dynamic, built from the business problem, dataset type, output format, and number of samples


### Imports

In [149]:
import re
import os
import sys
import subprocess
import gradio as gr
import anthropic
from dotenv import load_dotenv
from openai import OpenAI
from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

### Initialization

In [170]:
load_dotenv(override=True)

openai_api_key = os.getenv('OPENAI_API_KEY')
if not openai_api_key:
    print("❌ OpenAI API Key is missing!")

anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
if not anthropic_api_key:
    print("❌ Anthropic API Key is missing!")

OPENAI_MODEL = "gpt-4o-mini"
CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
openai = OpenAI()
claude = anthropic.Anthropic()

## Hugging Face Models
hf_token = os.getenv('HF_TOKEN')
if not hf_token:
    print("❌ Hugging Face Token is missing!")
    
code_qwen = "Qwen/CodeQwen1.5-7B-Chat"
CODE_QWEN_URL = "https://w0e2ze0xyrjnhx0y.us-east-1.aws.endpoints.huggingface.cloud"

### Prompts definition

In [189]:
system_message = """
You are a helpful assistant whose main purpose is to generate datasets for business problems.

Be less verbose.
Be accurate and concise.

The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.

The dataset should be saved in a specific format such as CSV or JSON; the user will specify the desired format.

The Python code should depend only on numpy, pandas, and the Python standard library.

When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.

Return only the Python code that generates and saves the dataset.
After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.
"""



In [190]:
def user_prompt(**input_data):
    """Build the user prompt from the options selected in the UI."""
    # Use a local name that does not shadow the function itself.
    prompt = f"""
        Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
        Business problem: {input_data["business_problem"]}
        Samples: {input_data["num_samples"]}
        """
    return prompt
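
To see what the model actually receives, the template can be rendered on its own. The sketch below mirrors the three lines of `user_prompt` but joins them explicitly, so the result carries no leading indentation. The function name `build_prompt` and the sample inputs are hypothetical.

```python
def build_prompt(business_problem, dataset_type, output_format, num_samples):
    """Render the user prompt line by line (hypothetical variant of user_prompt)."""
    lines = [
        f"Generate a synthetic {dataset_type.lower()} dataset in {output_format.upper()} format.",
        f"Business problem: {business_problem}",
        f"Samples: {num_samples}",
    ]
    return "\n".join(lines)


# Sample inputs are made up for illustration.
print(build_prompt("Pricing of clothes and cosmetics", "Tabular", "csv", 10))
```

Because the dataset type is lowercased and the format uppercased inside the template, the labels shown in the UI do not need to match that casing.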


### Call API for Closed Models

In [191]:
def stream_gpt(user_prompt):
    stream = openai.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user","content": user_prompt},
        ],
        stream=True,
    )

    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ""
        yield response

    return response


def stream_claude(user_prompt):
    result = claude.messages.stream(
        model=CLAUDE_MODEL,
        max_tokens=2000,
        system=system_message,
        messages=[
            {"role": "user","content": user_prompt}
        ]
    )
    reply = ""
    with result as stream:
        for text in stream.text_stream:
            reply += text
            yield reply
            print(text, end="", flush=True)
    return reply
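
Both functions yield the text accumulated so far rather than the individual chunks, so each yielded value can replace the content of the output box. The pattern can be tried without an API key. In this sketch `fake_stream` is a stand-in for a real API stream and its contents are made up.

```python
def fake_stream():
    """Stand-in for an API stream: yields small text deltas (hypothetical data)."""
    yield from ["import ", "pandas ", "as pd"]


def accumulate(deltas):
    """Yield the text received so far after each delta, as stream_gpt does."""
    response = ""
    for delta in deltas:
        response += delta or ""  # a delta may be None, so fall back to ""
        yield response


for partial in accumulate(fake_stream()):
    print(repr(partial))
```

Each printed value is a prefix of the next one, which is why the UI appears to type the answer out progressively.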


### Call API for Open Models (Hugging Face)

In [192]:
def stream_code_qwen(user_prompt):
    tokenizer = AutoTokenizer.from_pretrained(code_qwen)
    messages=[
            {"role": "system", "content": system_message},
            {"role": "user","content": user_prompt},
        ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    client = InferenceClient(CODE_QWEN_URL, token=hf_token)
    stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)
    result = ""
    for r in stream:
        result += r.token.text
        yield result    

### Select the model and generate the output

In [193]:
def generate_from_inputs(model, **input_data):
    # print("🔍 input_data received:", input_data)
    user_prompt_str = user_prompt(**input_data)
    
    if model == "GPT":
        result = stream_gpt(user_prompt_str)
    elif model == "Claude":
        result = stream_claude(user_prompt_str)
    elif model == "Code Qwen":
        result = stream_code_qwen(user_prompt_str)
    else:
        raise ValueError("Unknown model")
    
    for stream_so_far in result:
        yield stream_so_far
    
    return result
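
The `if`/`elif` chain could also be written as a dictionary lookup, which keeps the list of supported models in one place. This is only a sketch of that alternative: `stream_a` and `stream_b` are placeholders standing in for `stream_gpt` and `stream_claude`, and they make no API calls.

```python
def stream_a(prompt):
    """Placeholder for stream_gpt (hypothetical stand-in, no API call)."""
    yield f"A: {prompt}"


def stream_b(prompt):
    """Placeholder for stream_claude (hypothetical stand-in, no API call)."""
    yield f"B: {prompt}"


# Map the dropdown label to the function that streams from that model.
STREAMERS = {"GPT": stream_a, "Claude": stream_b}


def generate(model, prompt):
    """Look up the streamer for `model` and relay its partial outputs."""
    try:
        streamer = STREAMERS[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model!r}") from None
    yield from streamer(prompt)


print(list(generate("GPT", "hello")))
```

As in the original function, `generate` is itself a generator, so the `ValueError` for an unknown model is raised when the result is first iterated, not when the function is called.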


In [194]:
def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):
    input_data = {
        "business_problem": business_problem,
        "dataset_type": dataset_type,
        "output_format": dataset_format,
        "num_samples": num_samples,
    }

    response = generate_from_inputs(model, **input_data)
    for chunk in response:
        yield chunk


### Extract python code from the LLM output and execute it locally

In [195]:
def extract_code(text):
    """Return the code inside the first python-fenced block of text, or "" if there is none."""
    match = re.search(r"```python(.*?)```", text, re.DOTALL)

    if not match:
        print("No matching substring found.")
        return ""

    # group(1) is the code between the fences, without the fence markers.
    return match.group(1).strip()


def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
    if not python_interpreter:
        raise EnvironmentError("Python interpreter not found in the specified virtual environment.")

    code_str = extract_code(text)
    command = [python_interpreter, '-c', code_str]

    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        stdout = result.stdout
        return stdout

    except subprocess.CalledProcessError as e:
        # e.stderr holds the traceback printed by the generated script.
        return f"Execution error:\n{e.stderr}"
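
Model-generated code can hang or fail, so it may help to bound the run time and surface the error output. The following is a sketch under assumptions, not the notebook's implementation: the helper names `extract_python_block` and `run_snippet` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # three backticks, built indirectly so this example can itself be fenced


def extract_python_block(text):
    """Return the body of the first python-fenced block, or "" if there is none."""
    match = re.search(FENCE + r"python\s*\n(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else ""


def run_snippet(code, timeout=30):
    """Run `code` in a fresh interpreter and return its stdout, or an error report."""
    if not code:
        return "No Python code block found."
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            check=True, capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        return f"Execution error:\n{e.stderr}"
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout} seconds."


# A made-up model reply containing one fenced block.
reply = f"Here is the code:\n{FENCE}python\nprint(1 + 1)\n{FENCE}\nDone."
print(run_snippet(extract_python_block(reply)))
```

A subprocess with a timeout limits how long the code runs, but it does not restrict what the code can do; the generated script still has the same file and network access as the notebook user.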


## Gradio interface

In [196]:
def update_output_format(dataset_type):
    if dataset_type in ["Tabular", "Time-series"]:
        return gr.update(choices=["JSON", "csv"], value="JSON")
    elif dataset_type == "Text":
        return gr.update(choices=["JSON"], value="JSON")
        
with gr.Blocks() as ui:
    gr.Markdown("## Create a dataset for a business problem")
    
    with gr.Column():
        business_problem = gr.Textbox(label="Business problem", lines=2)
        dataset_type = gr.Dropdown(
            ["Tabular", "Time-series", "Text"], label="Dataset type"
        )
        
        output_format = gr.Dropdown(choices=["JSON", "csv"], value="JSON", label="Output Format")
        
        num_samples = gr.Number(label="Number of samples (for tabular and time-series data)", value=10, precision=0)
        
        model = gr.Dropdown(["GPT", "Claude", "Code Qwen"], label="Select model", value="GPT")
        
        dataset_type.change(update_output_format, inputs=[dataset_type], outputs=[output_format])
    
    with gr.Row():
        dataset_run = gr.Button("Create a dataset")
        code_run = gr.Button("Execute code for a dataset")
   
    with gr.Row():
        dataset_out = gr.Textbox(label="Generated Dataset")
        code_out = gr.Textbox(label="Executed code")
    
    dataset_run.click(
        handle_generate,
        inputs=[business_problem, dataset_type, output_format, num_samples, model],
        outputs=[dataset_out]
    )
    
    code_run.click(
        execute_code_in_virtualenv,
        inputs=[dataset_out],
        outputs=[code_out]
    )
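
The rule that decides which output formats are offered for each dataset type can be checked without launching the UI. The sketch below restates the choices from `update_output_format` as a plain lookup. The helper name `allowed_formats` is hypothetical, and falling back to JSON for an unknown type is an assumption, since the original function returns nothing in that case.

```python
# Output formats offered for each dataset type, mirroring update_output_format.
FORMATS_BY_TYPE = {
    "Tabular": ["JSON", "csv"],
    "Time-series": ["JSON", "csv"],
    "Text": ["JSON"],
}


def allowed_formats(dataset_type):
    """Return the output formats for a dataset type (hypothetical helper)."""
    # Assumption: unknown types fall back to JSON only.
    return FORMATS_BY_TYPE.get(dataset_type, ["JSON"])


for kind in FORMATS_BY_TYPE:
    print(kind, "->", allowed_formats(kind))
```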

In [197]:
ui.launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7879

To create a public link, set `share=True` in `launch()`.




Here's the Python code to generate and save the synthetic dataset as requested:

```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate data
categories = ['Clothes', 'Cosmetics']
products = {
    'Clothes': ['T-shirt', 'Jeans', 'Dress', 'Jacket', 'Skirt'],
    'Cosmetics': ['Lipstick', 'Foundation', 'Mascara', 'Eyeshadow', 'Blush']
}

data = []
for _ in range(10):
    category = np.random.choice(categories)
    product = np.random.choice(products[category])
    price = np.round(np.random.uniform(10, 100), 2)
    data.append([category, product, price])

# Create DataFrame
df = pd.DataFrame(data, columns=['Category', 'Product', 'Price'])

# Save to CSV
df.to_csv('clothes_cosmetics_dataset.csv', index=False)

print("Dataset generated and saved as 'clothes_cosmetics_dataset.csv'")
print("\nGenerated DataFrame:")
print(df)
```

This code will generate a CSV file named 'clothes_cosmetics_dataset.csv' with 10 samples of clothes and cosmetic products, including their categories and prices.

Here's the Python code to generate a synthetic time-series dataset for clothes and cosmetic products