# Synthetic Data Generator Notebook
## About
This Colab notebook demonstrates how frontier and open-source LLMs can generate a synthetic dataset for a business scenario provided by the user. Through a Gradio UI, the user describes the business scenario in detail, selects the number of records and the output format, and sets the maximum number of output tokens for the chosen LLM.

It does not stop there. Once the LLM has produced the records, they can be extracted from the response and saved to a file in the format the user selected earlier. The file is stored in the Colab working directory (`/content`). All of the extraction is done with the help of the `re` library. This was my first time using it, and I thoroughly enjoyed learning it.
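
As a minimal sketch of the extraction idea (not the notebook's full logic, which appears later with extra fallbacks), the snippet below pulls the text between two `<<<>>>` markers with `re`. The function name `extract_between_tags` and the sample response are hypothetical.

```python
import re

TAG = "<<<>>>"

def extract_between_tags(response: str):
    """Return the text between the last matched pair of <<<>>> markers, or None."""
    # re.escape keeps the marker literal; DOTALL lets '.' span multiple lines.
    pattern = re.escape(TAG) + r"\s*(.*?)\s*" + re.escape(TAG)
    matches = re.findall(pattern, response, flags=re.DOTALL)
    return matches[-1] if matches else None

# Made-up LLM response with a schema note before the tagged dataset
sample = "Schema: id, name\n<<<>>>\nid,name\n1,Ada\n2,Linus\n<<<>>>\nDone."
print(extract_between_tags(sample))
```

The lazy `(.*?)` group stops at the first closing marker, so surrounding commentary from the model is left out.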

## Outlook
When using open-source models, such as Mixtral from Mistral, the response sometimes comes back loaded with the user prompt and a lot of tags. This is caused by the prompt format being used: the role-based message format (with roles such as 'assistant') does not suit these models. This is an optimization to look into, and it can be done fairly easily by using a custom prompt template for such models. These templates are usually described in the model's Hugging Face repo.
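
A minimal sketch of such a custom template, assuming the `[INST] ... [/INST]` instruction format that Mistral describes for its instruct models; check the model card for the exact format before relying on it. `build_inst_prompt` is a hypothetical helper, and folding the system prompt into the user turn is an assumption made here because, as far as I know, this format has no separate system slot.

```python
def build_inst_prompt(messages):
    """Flatten system + user messages into a single [INST] block (assumed format)."""
    system_parts = [m["content"].strip() for m in messages if m["role"] == "system"]
    user_parts = [m["content"].strip() for m in messages if m["role"] == "user"]
    body = "\n\n".join(system_parts + user_parts)
    # The BOS token is left to the tokenizer, which normally adds it itself.
    return f"[INST] {body} [/INST]"

messages = [
    {"role": "system", "content": "You are a synthetic dataset generator."},
    {"role": "user", "content": "Generate 5 telecom churn records in csv format."},
]
print(build_inst_prompt(messages))
```

The resulting string would be tokenized directly instead of going through `apply_chat_template`.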

## Install & Imports

In [0]:
!pip install -q gradio anthropic requests torch bitsandbytes transformers accelerate openai google-generativeai

In [0]:
# imports
import re
import os
import sys
import gc
import io
import json
import anthropic
import gradio as gr
import requests
import subprocess
import google.generativeai as ggai
import torch
import tempfile
import shutil
from io import StringIO
import pandas as pd
from google.colab import userdata
from huggingface_hub import login
from openai import OpenAI
from pathlib import Path
from datetime import datetime
from IPython.display import Markdown, display, update_display
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig

## HuggingFace Setup

In [0]:
# Sign in to HuggingFace Hub

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

## Frontier Models configuration

In [0]:
openai_client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
anthropic_client = anthropic.Anthropic(api_key=userdata.get('ANTHROPIC_API_KEY'))
ggai.configure(api_key=userdata.get('GOOGLE_API_KEY'))

## Defining Prompts

In [0]:
system_prompt = """
You are a synthetic dataset generator. Your role is to infer a structured data schema from the business scenario given by the user and to create a synthetic dataset that matches it.

Your task is to:
1. Understand the user's business problem(s) or use case(s).
2. Identify the key fields needed to support that scenario.
3. Define appropriate field names, data types, and formats.
4. Generate synthetic records that match the inferred schema.

Guidelines:
- Use realistic field names and values. Do not invent unrelated fields or values.
- Choose sensible data types: string, integer, float, date, boolean, enum, etc.
- Respect logical constraints (e.g., age range, date ranges, email formats).
- Output the dataset in the format the user requests (json, csv, txt, markdown table).
- If the scenario is vague or broad, make reasonable assumptions and explain them briefly before generating the dataset.
- Always generate a dataset that supports the business use case logically.

Before generating the data, display the inferred schema in a readable format.
"""

# trial_user_prompt = "I’m building a churn prediction model for a telecom company. Can you generate a synthetic dataset with 100 rows?"
def get_user_prompt(business_problem, no_of_samples, file_format):
  return f"""
  The business scenario for which I want you to generate a dataset is defined below:
  {business_problem}

  Generate a synthetic dataset of {no_of_samples} records in {file_format} format.
  When generating the dataset, wrap it between the '<<<>>>' tag. Make sure the tag is there in the output.
  Do not include any other special characters in between the tags, other than the ones required in producing the correct format of data.
  For example: When a 'csv' format is given, only the ',' character can be used in between the tags.
  """

### Quantization Config
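
As a rough back-of-the-envelope illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at different precisions. It ignores activations, the KV cache, and quantization metadata, so real usage will be higher. The 7-billion parameter count is only an example figure, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1024**3

n_params = 7e9  # illustrative model size, not a measured value
for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(n_params, bits):.1f} GiB")
```

By this estimate, 4-bit weights take about a quarter of the space of 16-bit weights, which is the saving the config in the next cell is after.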

In [0]:
# Load the model in 4-bit precision so that it takes up less GPU memory
def get_quantization_config():
  return BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
      bnb_4bit_quant_type="nf4"
  )

## HF Model inference
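
For decoder-only models, `model.generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` in full echoes the prompt back in the response. One common remedy is to slice off the prompt length before decoding. The sketch below shows the slicing with plain Python lists standing in for token tensors; `strip_prompt_tokens` is a hypothetical helper and the token IDs are made up.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

prompt_ids = [1, 345, 678, 90]          # made-up prompt token IDs
output_ids = prompt_ids + [11, 22, 33]  # generate() output: prompt + new tokens
new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

With real tensors the equivalent would be along the lines of `outputs[0][inputs.shape[-1]:]`, decoded with `skip_special_tokens=True` to drop the remaining special tokens.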

In [0]:
# All in one HuggingFace Model Response function
def run_hfmodel_and_get_response(prompt, model_name, output_tokens):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
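    # add_generation_prompt=True appends the assistant header so the model answers instead of continuing the user turn
    inputs = tokenizer.apply_chat_template(prompt, add_generation_prompt=True, return_tensors="pt")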
    if torch.cuda.is_available():
      inputs = inputs.to("cuda")
    streamer = TextStreamer(tokenizer)
    if "microsoft/bitnet-b1.58-2B-4T" in model_name:
      model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
    elif "tiiuae/Falcon-E-3B-Instruct" in model_name:
      model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16 )
    else:
      model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=get_quantization_config())
    outputs = model.generate(inputs, max_new_tokens=output_tokens, streamer=streamer)
    response = tokenizer.decode(outputs[0])
    del model, inputs, tokenizer, outputs
    gc.collect()
    torch.cuda.empty_cache()
    return response

## Frontier Models Inference
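
The three providers accept the system prompt in different places: the OpenAI call takes it as a message in the list, while the Claude and Gemini functions in the next cell pass it as a separate argument (`system` and `system_instruction`). Here is a small sketch of splitting an OpenAI-style message list into those two pieces. `split_system_and_user` is a hypothetical helper, not part of any SDK.

```python
def split_system_and_user(messages):
    """Split chat messages into (system_text, user_text)."""
    system_text = "\n".join(m["content"] for m in messages if m["role"] == "system")
    user_text = "\n".join(m["content"] for m in messages if m["role"] == "user")
    return system_text, user_text

messages = [
    {"role": "system", "content": "You are a synthetic dataset generator."},
    {"role": "user", "content": "Generate 10 loan application records in json format."},
]
system_text, user_text = split_system_and_user(messages)
print(system_text)
print(user_text)
```

This would let one message list feed all three providers, instead of passing `prompt` to one function and `user_prompt` to the others.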

In [0]:
# ChatGPT, Claude and Gemini response functions
def get_chatgpt_response(prompt, model_name, output_tokens):
  response = openai_client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=output_tokens,
    )
  return response.choices[0].message.content

def get_claude_response(prompt, model_name, output_tokens):
  response = anthropic_client.messages.create(
        model=model_name,
        max_tokens=output_tokens,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )
  return response.content[0].text

def get_gemini_response(prompt, model_name, output_tokens):
    model = ggai.GenerativeModel(
          model_name=model_name,
          system_instruction=system_prompt,
    )

    response = model.generate_content(prompt, generation_config={
        "max_output_tokens": output_tokens,
        "temperature": 0.7,
    })
    return response.text

## Gradio Implementation

### Dropdowns Selection Lists

In [0]:
# Dropdown List Values for the user
MODEL_TYPES=["GPT", "Claude", "Gemini", "HuggingFace"]
OPENAI_MODEL_NAMES=["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo"]
ANTHROPIC_MODELS=["claude-3-7-sonnet-latest", "claude-3-5-haiku-latest", "claude-3-opus-latest"]
GOOGLE_MODELS=["gemini-2.0-flash", "gemini-1.5-pro"]
HUGGINGFACE_MODELS=[
    "meta-llama/Llama-3.2-3B-Instruct",
    "microsoft/bitnet-b1.58-2B-4T",
    "ByteDance-Seed/Seed-Coder-8B-Instruct",
    "tiiuae/Falcon-E-3B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct"
]
MODEL_NAMES = {
    "GPT": OPENAI_MODEL_NAMES,
    "Claude": ANTHROPIC_MODELS,
    "Gemini": GOOGLE_MODELS,
    "HuggingFace": HUGGINGFACE_MODELS
}

### UI
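
The UI cell below also saves the extracted dataset to disk. Before writing a file, it can help to check that the extracted text really parses in the chosen format. The following is a minimal standard-library sketch of such a check. `is_valid_dataset` is a hypothetical helper, and the rule that every CSV row must have as many columns as the header is an assumption, not something the notebook enforces.

```python
import csv
import io
import json

def is_valid_dataset(text: str, file_format: str) -> bool:
    """Return True if `text` parses as the given format ('json', 'csv', 'txt')."""
    if file_format == "json":
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    if file_format == "csv":
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
        # Assumption: a valid CSV has a header plus data rows of the same width.
        return len(rows) >= 2 and all(len(r) == len(rows[0]) for r in rows)
    if file_format == "txt":
        return bool(text.strip())
    return False  # unknown format

print(is_valid_dataset('[{"id": 1}]', "json"))
print(is_valid_dataset("id,name\n1,Ada\n2", "csv"))
```

A check like this could run before saving, so that a malformed response produces a clear message instead of a failed write.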

In [0]:
with gr.Blocks() as generator_ui:
    gr.Markdown("# 🧠 Business Scenario → Synthetic Dataset Generator")

    with gr.Row():
      with gr.Column(scale=3):
        with gr.Row():
          dataset_size=gr.Number(value=10, label="Enter the number of data samples to generate.", show_label=True)
          format=gr.Dropdown(["json", "csv", "txt", "markdown"], label="Select the format for the dataset", show_label=True)
        with gr.Row():
          scenario=gr.Textbox(label="Business Scenario", lines=5, placeholder="Describe your business scenario here")
        with gr.Row():
          error = gr.Markdown(visible=False)
        with gr.Row():
          clear = gr.Button("Clear Everything")
          submit = gr.Button("Generate Dataset", variant="primary")

      with gr.Column(scale=1):
          model_type = gr.Dropdown(MODEL_TYPES, label="Model Type", show_label=True, info="Select the model type you want to use")
          model_name = gr.Dropdown(MODEL_NAMES[model_type.value], label="Model Name", show_label=True, allow_custom_value=True, info="Select the model name or enter one manually")
          output_tokens= gr.Number(value=1000, label="Enter the max number of output tokens to generate.", show_label=True, info="This will impact the length of the response containing the dataset")

    with gr.Row():
      # Chatbot Interface
        chatbot = gr.Chatbot(
            type='messages',
            label='Chatbot',
            show_label=True,
            height=300,
            resizable=True,
            elem_id="chatbot",
            avatar_images=("🧑", "🤖",)
        )
    with gr.Row(variant="compact"):
      extract_btn = gr.Button("Extract and Save Dataset", variant="huggingface", visible=False)
      file_name = gr.Textbox(label="Enter file name here (without file extension)", placeholder="e.g. cancer_synthetic, warehouse_synthetic (no digits)", visible=False)
    with gr.Row():
      markdown_preview = gr.Markdown(visible = False)
      dataset_preview = gr.Textbox(label="Dataset Preview",visible=False)
    with gr.Row():
      file_saved = gr.Textbox(visible=False)

    def run_inference(scenario, model_type, model_name, output_tokens, dataset_size, format):
      """Run the model and get the response"""
      model_type=model_type.lower()
      print(f"scenario: {scenario}")
      print(f"model_type: {model_type}")
      print(f"model_name: {model_name}")
      if not scenario.strip():
        return gr.update(value="❌ **Error:** Please define a scenario first!",visible=True), []

      user_prompt = get_user_prompt(scenario, dataset_size, format)
      prompt =  [
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": user_prompt},
      ]

      if model_type == "gpt":
        response = get_chatgpt_response(prompt=prompt, model_name=model_name, output_tokens=output_tokens)
      elif model_type == "claude":
        response = get_claude_response(prompt=user_prompt, model_name=model_name, output_tokens=output_tokens)
      elif model_type == "gemini":
        response = get_gemini_response(prompt=user_prompt, model_name=model_name, output_tokens=output_tokens)
      else:
        response = run_hfmodel_and_get_response(prompt=prompt, model_name=model_name, output_tokens=output_tokens)
        torch.cuda.empty_cache()
      history = [
          {"role": "user", "content": scenario},
          {"role": "assistant", "content": response}
      ]
      return gr.update(visible=False), history

    def extract_dataset_string(response):
      """Extract dataset content between defined tags using regex."""
      # Remove special tokens of the form <|...|> (common in HuggingFace model output)
      response = re.sub(r"<\|.*?\|>", "", response)

      # Remove system or prompt echo if repeated before dataset
      response = re.sub(r"(?is)^.*?<<<", "<<<", response.strip(), count=1)

      # 1. Match strict <<<>>>...<<<>>> tag blocks (use last match)
      matches = re.findall(r"<<<>>>[\s\r\n]*(.*?)[\s\r\n]*<<<>>>", response, re.DOTALL)
      if matches:
          return matches[-1].strip()

      # 2. Match loose <<< ... >>> format
      matches = re.findall(r"<<<[\s\r\n]*(.*?)[\s\r\n]*>>>", response, re.DOTALL)
      if matches:
          return matches[-1].strip()

      # 3. Match final fallback: take everything after last <<< as raw data
      last_open = response.rfind("<<<")
      if last_open != -1:
          raw = response[last_open + 3 :].strip()
          # Optionally cut off noisy trailing notes, explanations, etc.
          raw = re.split(r"\n\s*\n|Explanation:|Note:|---", raw)[0]
          return raw.strip()

      return "Could not extract dataset! Try again with a different model."

    def extract_dataset_from_response(chatbot_history, file_name, file_type):
      """Extract dataset and update in gradio UI components"""
      response = chatbot_history[-1]["content"]
      if not response:
        return gr.update(visible=True, value="Could not find LLM Response! Try again."), gr.update(visible=False)

      # match = re.search(r'<<<\s*(.*?)\s*>>>', response, re.DOTALL)
      # print(match)
      # if match and match.group(1).strip() == "":
      #   match = re.search(r'<<<>>>\s*(.*?)\s*<<<>>>', response, re.DOTALL)
      #   print(match)
      # if match is None:
      #   return gr.update(visible=True, value="Could not extract dataset! Try again with a different model."), gr.update(visible=False)
      # dataset = match.group(1).strip()
      dataset = extract_dataset_string(response)
      if dataset == "Could not extract dataset! Try again with a different model.":
        return gr.update(visible=True, value=dataset), gr.update(visible=False)
      text = save_dataset(dataset, file_type, file_name)
      return gr.update(visible=True, value=text), gr.update(visible=True, value=dataset)

    def save_dataset(dataset, file_format, file_name):
      """Save dataset to a file based on the selected format."""
      file_name=file_name+"."+file_format
      print(dataset)
      print(file_name)
      if file_format == "json":
        try:
          # Parse first so that invalid JSON is never written to disk
          data = json.loads(dataset)
          with open(file_name, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4)
          return "Dataset saved successfully!"
        except Exception:
          return "Could not save dataset! Try again in another format."
      elif file_format == "csv":
        try:
          # Round-trip through pandas to confirm the text is well-formed CSV
          df = pd.read_csv(StringIO(dataset))
          df.to_csv(file_name, index=False)
          return "Dataset saved successfully!"
        except Exception:
          return "Could not save dataset! Try again in another format."
      elif file_format == "txt":
        try:
          with open(file_name, "w", encoding="utf-8") as f:
            f.write(dataset)
          return "Dataset saved successfully!"
        except Exception:
          return "Could not save dataset! Try again in another format."
      # Unsupported format (e.g. markdown): report it instead of returning None
      return "Could not save dataset! Try again in another format."

    def clear_chat():
      """Clear the chat history."""
      return "", [], gr.update(visible=False), gr.update(visible=False)

    def show_extract_btn(chatbot_history, format):
      """Show the extract button if the response has been displayed in the chatbot and format is not set to markdown"""
      if chatbot_history == []:
        return gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)
      if format == "markdown":
        return gr.update(visible=True, value=chatbot_history[-1]["content"]), gr.update(visible=False), gr.update(visible=False)
      return gr.update(visible=False), gr.update(visible=True), gr.update(visible=True)

    extract_btn.click(
        fn=extract_dataset_from_response,
        inputs=[chatbot, file_name, format],
        outputs=[file_saved, dataset_preview]
    )

    chatbot.change(
        fn=show_extract_btn,
        inputs=[chatbot, format],
        outputs=[markdown_preview, extract_btn, file_name]
    )

    model_type.change(
        fn=lambda x: gr.update(choices=MODEL_NAMES[x], value=MODEL_NAMES[x][0]),
        inputs=[model_type],
        outputs=[model_name]
    )

    submit.click(
        fn=run_inference,
        inputs=[scenario, model_type, model_name, output_tokens, dataset_size, format],
        outputs=[error, chatbot],
        show_progress=True
    )

    clear.click(
        clear_chat,
        outputs=[scenario, chatbot, dataset_preview, file_saved]
    )

In [0]:
# Example Scenarios

# Generate a dataset for predicting customer churn in a subscription-based telecom company. Include features like monthly charges, contract type, tenure (in months), number of support calls, internet usage (in GB), payment method, and whether the customer has churned.
# Generate a dataset for training a model to approve/reject loan applications. Include features like loan amount, applicant income, co-applicant income, employment type, credit history (binary), loan term, number of dependents, education level, and loan approval status.
# Create a dataset of credit card transactions for detecting fraud. Include transaction ID, amount, timestamp, merchant category, customer location, card presence (yes/no), transaction device type, and fraud label (yes/no).
# Generate a dataset of investment customers with fields like portfolio value, age, income bracket, risk appetite (low/medium/high), number of transactions per month, preferred investment types, and risk score.
# Create a dataset of hospitalized patients to predict readmission within 30 days. Include patient ID, age, gender, number of prior admissions, diagnosis codes, length of stay, discharge type, medications prescribed, and readmission label.
# Generate a dataset for predicting medical appointment no-shows. Include appointment ID, scheduled date, appointment date, lead time (days between scheduling and appointment), SMS reminders sent, patient age, gender, health condition severity, and no-show status.

generator_ui.launch(share=True, debug=True, inbrowser=True)