**Synthetic Test Data Generator**

*Natural Language Processing*:

- Parses user descriptions to understand dataset requirements.
- Detects field types by matching keywords in the description.
- Supports flexible schema generation.
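
The parsing step can be illustrated with a much smaller version of the same idea. This is a minimal sketch rather than the notebook's parser: `FIELD_KEYWORDS` and `sketch_parse` are hypothetical names, and the keyword table is deliberately short.

```python
import re

# Hypothetical keyword table; the notebook's own table is larger.
FIELD_KEYWORDS = {
    "name": "name",
    "email": "email",
    "phone": "phone",
    "price": "price",
    "date": "date",
}

def sketch_parse(description: str, default_rows: int = 100) -> dict:
    """Return a record count and a {field_name: field_type} mapping."""
    text = description.lower()

    # A number followed, within a couple of words, by "records", "rows", etc.
    match = re.search(
        r"(\d+)\s+(?:\w+\s+){0,2}?(?:records?|rows?|entries|samples?)", text
    )
    num_records = int(match.group(1)) if match else default_rows

    fields = {}
    for raw in text.split():
        word = raw.strip(".,;:!?")
        for keyword, field_type in FIELD_KEYWORDS.items():
            # Prefix match so plurals such as "emails" still count.
            if word.startswith(keyword):
                fields[word] = field_type
    return {"num_records": num_records, "fields": fields}

schema = sketch_parse("Generate 500 customer records with names, emails, and prices")
print(schema)
```

Keyword matching of this kind is cheap and predictable, but it only recognizes vocabulary that appears in the table, so unusual field names fall through to a default schema.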

*AI-Powered Data Generation*:

- Uses a Hugging Face text-generation pipeline (DialoGPT-medium, with GPT-2 as a fallback) to produce free-text fields.
- Generates realistic, diverse synthetic data.
- Supports 14 data types, including names, emails, addresses, dates, and descriptions.
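
A text-generation pipeline returns the prompt together with its continuation, so the raw output needs some cleanup before it becomes a cell value. The sketch below shows one way to do that without loading a model. `clean_generated` is a hypothetical helper, and the length limits are assumptions that mirror the values used later in the notebook.

```python
import random

def clean_generated(prompt: str, generated: str,
                    min_len: int = 10, max_len: int = 100) -> str:
    """Drop the echoed prompt, then fall back to a placeholder if too little is left."""
    # Remove only the first occurrence, which is the echoed prompt.
    text = generated.replace(prompt, "", 1).strip()
    if len(text) < min_len:
        return f"Sample text content number {random.randint(1, 1000)}"
    return text[:max_len]

print(clean_generated("The future", "The future of testing is synthetic data."))
print(clean_generated("The future", "The future ok"))
```

The placeholder keeps the column fully populated even when the model returns an empty or very short continuation.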

*Professional Gradio Interface*:

- Clean, intuitive UI with helpful tips and examples.
- Real-time data generation and preview.
- Export formats: CSV and JSON (the Excel option currently returns CSV).
- Quick example buttons for common use cases.

*Technical Implementation*:

- Built with Hugging Face pipelines and tokenizers.
- GPU acceleration support (falls back to CPU).
- Pandas for data manipulation.
- Error handling with fallbacks (a smaller model, placeholder text) and a 10,000-record cap for performance.
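
The fallback behaviour described above follows a common pattern: try the preferred option first and move to the next one if it fails. Here is a minimal sketch of that pattern, assuming each option is wrapped in a zero-argument callable. `first_available`, `load_large`, and `load_small` are hypothetical stand-ins, and no model is downloaded.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def first_available(loaders: Sequence[Callable[[], T]]) -> T:
    """Call each loader in order and return the first result that loads."""
    errors = []
    for load in loaders:
        try:
            return load()
        except Exception as exc:  # keep the reason instead of hiding it
            errors.append(f"{getattr(load, '__name__', 'loader')}: {exc}")
    raise RuntimeError("no loader succeeded: " + "; ".join(errors))

# Stand-ins for real model loaders (hypothetical; nothing is downloaded here).
def load_large():
    raise MemoryError("not enough memory for the larger model")

def load_small():
    return "small-model"

model = first_available([load_large, load_small])
print(model)
```

Collecting the error messages makes it easier to see why the preferred option was skipped, which a bare `except:` would hide.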

🎯 *Supported Data Types*

- Personal Data: Names, emails, phone numbers, addresses.
- Temporal Data: Dates, timestamps.
- Numerical Data: Numbers, prices, ratings, quantities.
- Text Data: Descriptions, comments, AI-generated content.
- Categorical Data: Categories, departments, types.
- Technical Data: URLs, IDs, boolean flags.
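
Each supported type can be served by a small generator function that is looked up by name. The following sketch shows that dispatch-table idea for a few of the types, with a seeded random number generator so that repeated runs give the same data. Seeding is an addition for illustration and is not part of the notebook, and `make_generators` and `build_columns` are hypothetical names.

```python
import random
from typing import Callable, Dict, List

def make_generators(seed: int) -> Dict[str, Callable[[int], List]]:
    """Build a type-name -> generator table that shares one seeded RNG."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return {
        "boolean": lambda n: [rng.choice([True, False]) for _ in range(n)],
        "price": lambda n: [round(rng.uniform(9.99, 999.99), 2) for _ in range(n)],
        "rating": lambda n: [round(rng.uniform(1.0, 5.0), 1) for _ in range(n)],
        "id": lambda n: [f"ID_{i:06d}" for i in range(1, n + 1)],
    }

def build_columns(schema: Dict[str, str], rows: int, seed: int = 0) -> Dict[str, List]:
    """Generate one column per field by looking up its type in the table."""
    generators = make_generators(seed)
    return {name: generators[ftype](rows) for name, ftype in schema.items()}

columns = build_columns(
    {"order_id": "id", "unit_price": "price", "in_stock": "boolean"},
    rows=3,
    seed=42,
)
print(columns["order_id"])
```

Adding a new data type then only requires one more entry in the table.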

🚀 *Usage Examples*
The tool can generate datasets like:

- Customer databases with contact information.
- Product catalogs with pricing and descriptions.
- Employee records with HR data.
- Sales transactions with financial details.
- User profiles for testing applications.
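
Descriptions for these use cases share a simple shape: a record count, an entity, and a list of fields. As a rough sketch, a helper like the one below could compose such descriptions programmatically. `build_description` is a hypothetical function and is not part of the notebook.

```python
from typing import Sequence

def build_description(count: int, entity: str, fields: Sequence[str]) -> str:
    """Compose a prompt in the same shape as the notebook's example descriptions."""
    if count < 1:
        raise ValueError("count must be positive")
    if not fields:
        raise ValueError("at least one field is required")
    *head, last = fields
    if len(head) > 1:
        field_text = f"{', '.join(head)}, and {last}"  # "a, b, and c"
    else:
        field_text = " and ".join(fields)  # "a" or "a and b"
    return f"Generate {count} {entity} records with {field_text}"

print(build_description(150, "employee",
                        ["employee ID", "full name", "department", "salary"]))
```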

In [1]:
!pip install -q gradio transformers pandas torch numpy python-dotenv huggingface_hub

In [8]:
import gradio as gr
import pandas as pd
import json
import random
import re
from typing import Dict, List, Any, Tuple
from datetime import datetime, timedelta
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
import os
from dotenv import load_dotenv
from huggingface_hub import login

In [9]:
load_dotenv(override=True)
hf_token = os.getenv('HF_TOKEN')
# Both models used below are public, so only log in when a token is actually set
if hf_token:
    login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [3]:
class SyntheticDataGenerator:
    def __init__(self):
        """Initialize the Synthetic Data Generator with various pipelines and models."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        
        # Initialize text generation pipeline
        try:
            self.text_generator = pipeline(
                "text-generation",
                model="microsoft/DialoGPT-medium",
                device=0 if torch.cuda.is_available() else -1,
                max_length=100,
                do_sample=True,
                temperature=0.7
            )
        except Exception:
            # Fall back to the smaller GPT-2 model if DialoGPT cannot be loaded
            self.text_generator = pipeline(
                "text-generation",
                model="gpt2",
                device=0 if torch.cuda.is_available() else -1,
                max_length=50,
                do_sample=True,
                temperature=0.7
            )
        
        # Data type generators
        self.data_generators = {
            'name': self._generate_names,
            'email': self._generate_emails,
            'phone': self._generate_phones,
            'address': self._generate_addresses,
            'date': self._generate_dates,
            'number': self._generate_numbers,
            'text': self._generate_text,
            'category': self._generate_categories,
            'boolean': self._generate_booleans,
            'url': self._generate_urls,
            'id': self._generate_ids,
            'price': self._generate_prices,
            'rating': self._generate_ratings,
            'description': self._generate_descriptions
        }
        
        # Sample data pools
        self.sample_data = {
            'first_names': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
            'last_names': ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis', 'Rodriguez', 'Martinez'],
            'domains': ['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'company.com', 'email.com'],
            'streets': ['Main St', 'Oak Ave', 'Pine Rd', 'Cedar Ln', 'Elm Dr', 'Maple Way', 'Park Blvd'],
            'cities': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio'],
            'states': ['NY', 'CA', 'IL', 'TX', 'AZ', 'PA', 'FL', 'OH', 'GA', 'NC'],
            'categories': ['Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', 'Toys', 'Beauty', 'Automotive']
        }

    def parse_dataset_description(self, description: str) -> Dict[str, Any]:
        """Parse natural language description to extract dataset requirements."""
        description = description.lower()
        
        # Extract number of records
        num_records = 100  # default
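        # Allow up to two words between the number and the noun, e.g. "500 customer records"
        num_match = re.search(r'(\d+)\s*(?:\w+\s+){0,2}?(?:records?|rows?|entries|entry|samples?|items?|profiles?)', description)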
        if num_match:
            num_records = int(num_match.group(1))
        
        # Detect field types based on keywords
        fields = {}
        
        # Common field patterns
        field_patterns = {
            r'names?|customers?|users?|people': 'name',
            r'emails?|e-mails?': 'email',
            r'phones?|telephone|mobile': 'phone',
            r'address|location|street': 'address',
            r'dates?|time|birthday|created': 'date',
            r'age|quantity|count|amount': 'number',
            r'description|comment|review|text|content': 'text',
            r'category|type|genre|department': 'category',
            r'active|enabled|verified|status': 'boolean',
            r'website|url|link': 'url',
            r'id|identifier|key': 'id',
            r'price|cost|salary|revenue|budget': 'price',
            r'rating|score|stars': 'rating'
        }
        
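        # Extract specific field mentions; strip punctuation so "names," is read as "names"
        words = [w.strip('.,;:!?()"\'') for w in description.split()]
        for i, word in enumerate(words):
            for pattern, field_type in field_patterns.items():
                # Anchor at the start of the word so "id" does not match "provide"
                if re.match(pattern, word):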
                    # Try to get field name from context
                    field_name = word
                    if i > 0 and words[i-1] in ['user', 'customer', 'product', 'item']:
                        field_name = f"{words[i-1]}_{word}"
                    fields[field_name] = field_type
        
        # If no specific fields detected, create a default schema
        if not fields:
            if 'user' in description or 'customer' in description or 'person' in description:
                fields = {
                    'name': 'name',
                    'email': 'email',
                    'phone': 'phone',
                    'address': 'address',
                    'created_date': 'date'
                }
            elif 'product' in description or 'item' in description:
                fields = {
                    'product_name': 'text',
                    'category': 'category',
                    'price': 'price',
                    'rating': 'rating',
                    'description': 'description'
                }
            else:
                fields = {
                    'id': 'id',
                    'name': 'name',
                    'value': 'number',
                    'category': 'category',
                    'date': 'date'
                }
        
        return {
            'num_records': min(num_records, 10000),  # Limit for performance
            'fields': fields,
            'description': description
        }

    def _generate_names(self, count: int) -> List[str]:
        """Generate random names."""
        names = []
        for _ in range(count):
            first = random.choice(self.sample_data['first_names'])
            last = random.choice(self.sample_data['last_names'])
            names.append(f"{first} {last}")
        return names

    def _generate_emails(self, count: int) -> List[str]:
        """Generate random email addresses."""
        emails = []
        for _ in range(count):
            first = random.choice(self.sample_data['first_names']).lower()
            last = random.choice(self.sample_data['last_names']).lower()
            domain = random.choice(self.sample_data['domains'])
            separator = random.choice(['.', '_', ''])
            emails.append(f"{first}{separator}{last}@{domain}")
        return emails

    def _generate_phones(self, count: int) -> List[str]:
        """Generate random phone numbers."""
        phones = []
        for _ in range(count):
            area = random.randint(200, 999)
            exchange = random.randint(200, 999)
            number = random.randint(1000, 9999)
            phones.append(f"({area}) {exchange}-{number}")
        return phones

    def _generate_addresses(self, count: int) -> List[str]:
        """Generate random addresses."""
        addresses = []
        for _ in range(count):
            number = random.randint(1, 9999)
            street = random.choice(self.sample_data['streets'])
            city = random.choice(self.sample_data['cities'])
            state = random.choice(self.sample_data['states'])
            zip_code = random.randint(10000, 99999)
            addresses.append(f"{number} {street}, {city}, {state} {zip_code}")
        return addresses

    def _generate_dates(self, count: int) -> List[str]:
        """Generate random dates."""
        dates = []
        start_date = datetime.now() - timedelta(days=365*2)
        for _ in range(count):
            random_days = random.randint(0, 365*2)
            date = start_date + timedelta(days=random_days)
            dates.append(date.strftime("%Y-%m-%d"))
        return dates

    def _generate_numbers(self, count: int) -> List[int]:
        """Generate random numbers."""
        return [random.randint(1, 1000) for _ in range(count)]

    def _generate_text(self, count: int) -> List[str]:
        """Generate random text using the language model."""
        texts = []
        prompts = ["The quick", "In today's world", "Technology has", "People often", "The future"]
        
        for _ in range(count):
            try:
                prompt = random.choice(prompts)
                result = self.text_generator(prompt, max_length=30, num_return_sequences=1)
                text = result[0]['generated_text'].replace(prompt, "").strip()
                if len(text) < 10:
                    text = f"Sample text content number {random.randint(1, 1000)}"
                texts.append(text[:100])  # Limit length
            except Exception:
                texts.append(f"Sample text content number {random.randint(1, 1000)}")
        
        return texts

    def _generate_categories(self, count: int) -> List[str]:
        """Generate random categories."""
        return [random.choice(self.sample_data['categories']) for _ in range(count)]

    def _generate_booleans(self, count: int) -> List[bool]:
        """Generate random boolean values."""
        return [random.choice([True, False]) for _ in range(count)]

    def _generate_urls(self, count: int) -> List[str]:
        """Generate random URLs."""
        urls = []
        domains = ['example.com', 'test.org', 'sample.net', 'demo.io', 'site.co']
        paths = ['home', 'about', 'contact', 'products', 'services', 'blog']
        
        for _ in range(count):
            domain = random.choice(domains)
            path = random.choice(paths)
            urls.append(f"https://www.{domain}/{path}")
        return urls

    def _generate_ids(self, count: int) -> List[str]:
        """Generate random IDs."""
        return [f"ID_{i:06d}" for i in range(1, count + 1)]

    def _generate_prices(self, count: int) -> List[float]:
        """Generate random prices."""
        return [round(random.uniform(9.99, 999.99), 2) for _ in range(count)]

    def _generate_ratings(self, count: int) -> List[float]:
        """Generate random ratings (1-5 scale)."""
        return [round(random.uniform(1.0, 5.0), 1) for _ in range(count)]

    def _generate_descriptions(self, count: int) -> List[str]:
        """Generate random product descriptions."""
        adjectives = ['Amazing', 'High-quality', 'Durable', 'Innovative', 'Premium', 'Efficient']
        nouns = ['product', 'item', 'solution', 'device', 'tool', 'service']
        
        descriptions = []
        for _ in range(count):
            adj = random.choice(adjectives)
            noun = random.choice(nouns)
            descriptions.append(f"{adj} {noun} designed for optimal performance and user satisfaction.")
        
        return descriptions

    def generate_dataset(self, description: str, output_format: str = "CSV") -> Tuple[str, str]:
        """Generate a synthetic dataset based on description."""
        try:
            # Parse the description
            schema = self.parse_dataset_description(description)
            
            # Generate data
            data = {}
            for field_name, field_type in schema['fields'].items():
                if field_type in self.data_generators:
                    data[field_name] = self.data_generators[field_type](schema['num_records'])
                else:
                    # Default to text if type not recognized
                    data[field_name] = self.data_generators['text'](schema['num_records'])
            
            # Create DataFrame
            df = pd.DataFrame(data)
            
            # Generate output based on format
            if output_format == "CSV":
                output = df.to_csv(index=False)
                filename = "synthetic_data.csv"
            elif output_format == "JSON":
                output = df.to_json(orient='records', indent=2)
                filename = "synthetic_data.json"
            else:  # Excel
                # Excel export is not implemented in this demo, so return CSV text instead
                output = df.to_csv(index=False)
                filename = "synthetic_data.csv"
            
            # Create summary
            summary = f"""
Dataset Generated Successfully!

📊 **Dataset Summary:**
- Records: {len(df)}
- Fields: {len(df.columns)}
- Columns: {', '.join(df.columns.tolist())}

🔍 **Sample Data Preview:**
{df.head().to_string(index=False)}

💾 **Output Format:** {output_format}
📁 **Suggested Filename:** {filename}
            """
            
            return output, summary
            
        except Exception as e:
            error_msg = f"Error generating dataset: {str(e)}"
            return "", error_msg
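
The export branch in `generate_dataset` relies on pandas to turn column-oriented data into CSV text or a JSON array of records. To make that reshaping explicit, here is a dependency-free sketch of the same two conversions using only the standard library. `columns_to_csv` and `columns_to_json` are hypothetical helpers, not functions from the notebook.

```python
import csv
import io
import json
from typing import Dict, List

def columns_to_csv(columns: Dict[str, List]) -> str:
    """Serialize a column-oriented dict as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())
    writer.writerows(zip(*columns.values()))  # transpose columns into rows
    return buffer.getvalue()

def columns_to_json(columns: Dict[str, List]) -> str:
    """Serialize the same data as a JSON array of row objects."""
    names = list(columns)
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    return json.dumps(rows, indent=2)

data = {"name": ["Alice Smith", "Bob Jones"], "price": [19.99, 5.5]}
print(columns_to_csv(data))
print(columns_to_json(data))
```

Both functions assume that every column has the same length, which holds in the notebook because each generator is called with the same record count.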

In [10]:
# Initialize the generator
generator = SyntheticDataGenerator()


Device set to use cpu


In [11]:
def generate_data_interface(description, format_choice):
    """Interface function for Gradio."""
    if not description.strip():
        return "", "Please provide a description of the dataset you need."
    
    output, summary = generator.generate_dataset(description, format_choice)
    return output, summary

In [16]:
# Create Gradio interface
def create_interface():
    """Create the Gradio interface."""
    
    with gr.Blocks(title="🔬 Synthetic Test Data Generator", theme=gr.themes.Soft()) as demo:
        gr.Markdown("""
        # 🔬 Synthetic Test Data Generator
        
        **Generate realistic test data for your applications using AI-powered synthesis!**
        
        Simply describe the type of dataset you need in natural language, and this tool will:
        - 🧠 Understand your requirements using NLP
        - 🎯 Generate diverse, realistic data
        - 📊 Output in your preferred format
        - ⚡ Use Hugging Face models for intelligent text generation
        """)
        
        with gr.Row():
            with gr.Column(scale=2):
                description_input = gr.Textbox(
                    label="📝 Describe Your Dataset",
                    placeholder="Example: Generate 500 customer records with names, emails, phone numbers, addresses, and registration dates",
                    lines=4,
                    info="Describe what kind of data you need. Be specific about fields, data types, and quantity."
                )
                
                format_choice = gr.Radio(
                    choices=["CSV", "JSON", "Excel"],
                    value="CSV",
                    label="📁 Output Format",
                    info="Choose your preferred output format"
                )
                
                generate_btn = gr.Button("🚀 Generate Dataset", variant="primary", size="lg")
            
            with gr.Column(scale=1):
                gr.Markdown("""
                ### 💡 **Tips for Better Results:**
                
                **Be Specific:**
                - Mention exact field names you want
                - Specify the number of records
                - Include data types (dates, numbers, text, etc.)
                
                **Examples:**
                - "100 user profiles with name, email, age, and signup date"
                - "Product catalog with 200 items including name, category, price, and description"
                - "Employee records with ID, name, department, salary, and hire date"
                
                **Supported Data Types:**
                - 👤 Names, emails, phones
                - 📍 Addresses, locations
                - 📅 Dates and timestamps  
                - 💰 Prices, numbers, ratings
                - 📝 Text descriptions
                - 🏷️ Categories, boolean flags
                - 🔗 URLs and IDs
                """)
        
        with gr.Row():
            with gr.Column():
                summary_output = gr.Markdown(label="📋 Generation Summary")
            
        with gr.Row():
            with gr.Column():
                data_output = gr.Code(
                    label="📦 Generated Dataset",
                    lines=20,
                    interactive=True
                )
        
        # Example buttons
        gr.Markdown("### 🎯 Quick Examples:")
        with gr.Row():
            example_btns = [
                gr.Button("👥 Customer Data", size="sm"),
                gr.Button("🛍️ Product Catalog", size="sm"),
                gr.Button("👨‍💼 Employee Records", size="sm"),
                gr.Button("📊 Sales Data", size="sm")
            ]
        
        # Event handlers
        generate_btn.click(
            fn=generate_data_interface,
            inputs=[description_input, format_choice],
            outputs=[data_output, summary_output]
        )
        
        # Example button handlers
        example_btns[0].click(
            lambda: "Generate 100 customer records with full name, email address, phone number, shipping address, and account creation date",
            outputs=description_input
        )
        
        example_btns[1].click(
            lambda: "Create 200 product entries with product name, category, price, rating, description, and availability status",
            outputs=description_input
        )
        
        example_btns[2].click(
            lambda: "Generate 150 employee records with employee ID, full name, department, job title, salary, hire date, and active status",
            outputs=description_input
        )
        
        example_btns[3].click(
            lambda: "Create 300 sales transaction records with transaction ID, customer name, product name, quantity, unit price, total amount, and sale date",
            outputs=description_input
        )
        
        gr.Markdown("""
        ---
        ### 🔧 **Technical Features:**
        - **AI-Powered**: Uses Hugging Face transformers for intelligent text generation
        - **Flexible Schema**: Automatically detects data types from natural language
        - **Multiple Formats**: Export as CSV, JSON, or Excel
        - **Scalable**: Generate up to 10,000 records per dataset
        - **Realistic Data**: Produces believable, diverse synthetic data
        - **Open Source**: Built with open-source models and tools
        
        **Powered by:** Transformers • Pandas • Gradio • PyTorch
        """)
    
    return demo

In [None]:
interface = create_interface()
interface.launch(
    server_name="0.0.0.0",
    server_port=8081,
    share=True,
    debug=True
)

* Running on local URL:  http://0.0.0.0:8081
* Running on public URL: https://3cddbca8cedf62ae70.gradio.live

