Welcome to the Synthetic Data Generator! This application allows you to create artificial datasets for various machine learning tasks, useful for privacy-preserving development, data augmentation, and testing.
Synthetic data is artificially generated data that mimics the statistical properties and patterns of real data but does not contain any original, real-world information. It's useful for:
- Privacy: Developing models without exposing sensitive real data.
- Data Augmentation: Expanding limited datasets to improve model performance.
- Testing: Creating diverse test cases for robust model evaluation.
- Prototyping: Building and validating pipelines before acquiring real data.
The application is split into a sidebar for configuration and a main view for preview and download.
On the left-hand side, you'll find the controls to configure your synthetic data:
Choose the kind of data you want to generate:
- Numeric (Tabular): Standard numerical features (e.g., age, price).
- Categorical (Tabular): Features with discrete categories (e.g., gender, product type).
- Mixed (Tabular): A combination of numeric and categorical features.
- Time Series (Tabular): Tabular data with a time-based index, potentially with trends/seasonality.
- Image: Image datasets (e.g., for computer vision tasks).
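To make the Time Series option more concrete, here is a minimal sketch of a series built from a linear trend, a sinusoidal seasonal component, and Gaussian noise. The function name `make_time_series` and all of its parameters are assumptions made for illustration, not the app's actual API. It uses only the Python standard library.

```python
import math
import random
from datetime import date, timedelta


def make_time_series(n_points, trend=0.5, period=7, amplitude=3.0,
                     noise_std=1.0, seed=42, start=date(2024, 1, 1)):
    """Return (date, value) rows: linear trend + seasonality + noise.

    All names and defaults are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    rows = []
    for t in range(n_points):
        trend_part = trend * t
        seasonal_part = amplitude * math.sin(2 * math.pi * t / period)
        noise_part = rng.gauss(0.0, noise_std)
        rows.append((start + timedelta(days=t),
                     trend_part + seasonal_part + noise_part))
    return rows


series = make_time_series(30)
```

Setting `noise_std=0.0` leaves only the trend and the seasonal cycle, which is a handy way to check the shape before adding noise.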
Define the structure of your dataset:
- For Tabular Data (Numeric, Categorical, Mixed, Time Series):
- Number of Features: Specify how many columns your dataset should have.
- Edit Features: Expand this section to configure each feature:
- Name: Give your feature a descriptive name.
- Type: Choose numeric or categorical.
- Min/Max Value (Numeric): Define the range for numeric features.
- Categories (Categorical): List the possible categories, separated by commas (e.g., "Red, Green, Blue").
- For Image Data:
- Height/Width (px): Set the resolution of the generated images.
- Channels: Choose Grayscale (1) for black and white images or RGB (3) for color images.
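One way to picture the schema settings above is as a small data structure. The sketch below uses hypothetical dataclasses (`FeatureSpec`, `ImageSpec`) and a `parse_categories` helper. These names are assumptions for illustration and do not come from the app's code.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """One tabular column (hypothetical structure, not the app's own)."""
    name: str
    kind: str                      # "numeric" or "categorical"
    min_value: float = 0.0         # used only when kind == "numeric"
    max_value: float = 1.0
    categories: list = field(default_factory=list)  # used only when categorical

    def validate(self):
        if self.kind == "numeric":
            if self.min_value >= self.max_value:
                raise ValueError(f"{self.name}: min must be below max")
        elif self.kind == "categorical":
            if not self.categories:
                raise ValueError(f"{self.name}: needs at least one category")
        else:
            raise ValueError(f"{self.name}: unknown type {self.kind!r}")


@dataclass
class ImageSpec:
    """Image settings (hypothetical structure)."""
    height: int
    width: int
    channels: int  # 1 = grayscale, 3 = RGB

    def validate(self):
        if self.channels not in (1, 3):
            raise ValueError("channels must be 1 or 3")
        if self.height <= 0 or self.width <= 0:
            raise ValueError("height and width must be positive")


def parse_categories(text):
    """Split a comma-separated string such as "Red, Green, Blue"."""
    return [part.strip() for part in text.split(",") if part.strip()]


schema = [
    FeatureSpec("age", "numeric", min_value=18, max_value=90),
    FeatureSpec("color", "categorical",
                categories=parse_categories("Red, Green, Blue")),
]
for spec in schema:
    spec.validate()  # raises ValueError on an inconsistent feature
```

Validating the schema up front means a bad range or an empty category list is reported before any data is generated.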
Control the characteristics of the generated data:
- Sample Size: The total number of rows (for tabular) or images (for image) to generate.
- Target Data Distribution:
- For Numeric/Time Series: Select statistical distributions like Uniform, Normal (Gaussian), Exponential, or Multimodal (Normal) to influence the shape of your numeric features.
- For Categorical: Choose between Balanced (equal probability for all categories) or Unbalanced (some categories more frequent).
- Noise Level: (For Tabular Data) Introduces randomness or variability into the data. A higher value means more noise.
- Random Seed: An integer value that ensures reproducibility. If you use the same seed, you'll get the same synthetic data every time. Use 0 to get a new random seed on each generation.
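The parameters above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the app's implementation: the function names, the way each distribution is scaled into the min/max range, the 1, 1/2, 1/3 weighting for unbalanced categories, and the definition of noise as a fraction of the feature range are all choices made for this example.

```python
import random


def resolve_seed(seed):
    """Treat 0 as "pick a fresh seed"; any other integer is used as given."""
    return random.SystemRandom().randrange(1, 2**32) if seed == 0 else seed


def sample_numeric(rng, dist, lo, hi):
    """Draw one value; the scaling choices are illustrative assumptions."""
    mid, span = (lo + hi) / 2, hi - lo
    if dist == "uniform":
        return rng.uniform(lo, hi)
    if dist == "normal":
        value = rng.gauss(mid, span / 6)        # about 99.7% lands in [lo, hi]
    elif dist == "exponential":
        value = lo + rng.expovariate(5 / span)  # mean is lo + span / 5
    elif dist == "multimodal":
        center = rng.choice([lo + span / 4, hi - span / 4])  # two modes
        value = rng.gauss(center, span / 12)
    else:
        raise ValueError(f"unknown distribution {dist!r}")
    return min(max(value, lo), hi)              # clip to the configured range


def sample_categorical(rng, categories, balanced=True):
    """Balanced: equal weights. Unbalanced: weights 1, 1/2, 1/3, ... (assumed)."""
    weights = None if balanced else [1 / (i + 1) for i in range(len(categories))]
    return rng.choices(categories, weights=weights, k=1)[0]


def add_noise(rng, value, noise_level, lo, hi):
    """Add Gaussian noise whose std is noise_level times the feature range."""
    return value + rng.gauss(0.0, noise_level * (hi - lo))


rng = random.Random(resolve_seed(42))
ages = [sample_numeric(rng, "normal", 18, 90) for _ in range(1000)]
colors = [sample_categorical(rng, ["Red", "Green", "Blue"], balanced=False)
          for _ in range(1000)]
```

Because every draw goes through one seeded `random.Random` instance, rerunning with the same non-zero seed reproduces the same values in the same order.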
Click the "Generate Synthetic Data" button after configuring all your parameters to start the data generation process. A spinner indicates that generation is in progress.
The main area of the app displays the results of your generation:
For tabular data, the preview shows the first few rows of your generated dataset.
- Distributions: Histograms (for numeric) and bar charts (for categorical) to help you understand the spread and frequency of values for each feature.
- Correlation: A heatmap showing the correlation between numeric features, indicating how strongly they relate to each other.
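The numbers behind a correlation heatmap are pairwise Pearson coefficients between numeric columns. The sketch below computes them by hand with the standard library so the calculation is visible. The function names and the sample columns are made up for illustration, and the app may compute its heatmap differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical sample columns, not output from the app.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 88],
    "discount": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. A constant column has zero variance and would need special handling, which this sketch omits.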
For image data, the preview displays a grid of up to 9 generated images for quick visual inspection.
- Download Data as CSV: For tabular data, download the entire generated dataset as a CSV file.
- Download Images as ZIP: For image data, download all generated images compressed into a ZIP archive.
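Both downloads come down to building a file in memory and handing the bytes to the browser. The following sketch shows one way to do that with the standard library. The helper names are hypothetical, and the ZIP helper assumes each image has already been encoded as PNG bytes elsewhere; the encoding step is not shown.

```python
import csv
import io
import zipfile


def rows_to_csv_bytes(header, rows):
    """Serialize a header and rows to UTF-8 encoded CSV bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def images_to_zip_bytes(encoded_images):
    """Pack already-encoded image payloads (bytes) into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, payload in enumerate(encoded_images):
            # File naming scheme is an assumption made for this example.
            archive.writestr(f"image_{index:04d}.png", payload)
    return buffer.getvalue()


csv_bytes = rows_to_csv_bytes(["age", "color"], [[42, "Red"], [37, "Blue"]])
zip_bytes = images_to_zip_bytes([b"first-image-bytes", b"second-image-bytes"])
```

Working in memory avoids writing temporary files on the server, at the cost of holding the whole archive in RAM.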
This application includes placeholder classes for VAE (Variational Autoencoder) and GAN (Generative Adversarial Network) models.
- VAEs are typically used for tabular data generation, learning the underlying distribution of the data and generating new samples from that learned distribution.
- GANs are often used for image data generation, where a generator creates fake images and a discriminator tries to tell them apart from real images, leading to highly realistic synthetic images over time.
Note: The current VAEDataGenerator and GANDataGenerator classes in data_generation/ are basic TensorFlow/Keras outlines. A full implementation would involve:
- Training Data: Providing real datasets to train these models.
- Hyperparameter Tuning: Optimizing learning rates, network architectures, etc.
- More Complex Architectures: For higher quality or more diverse synthetic data.
For simplicity, app.py calls create_tabular_data and create_image_data_placeholder from data_generation/utils.py directly. To activate the VAE/GAN, you would modify app.py to:
- Load or train your VAEDataGenerator or GANDataGenerator instances.
- Pass the generated data through their train() and generate() methods.
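The two steps above could look roughly like the sketch below. Only the method names train() and generate() come from this guide; their signatures are assumed here. `GaussianStandIn` is a hypothetical placeholder that fits a mean and standard deviation so the flow can run without TensorFlow. It is not a VAE or a GAN.

```python
import random
import statistics
from typing import Protocol, Sequence


class DataGenerator(Protocol):
    """Assumed interface: the guide names train() and generate() only."""
    def train(self, data: Sequence[float]) -> None: ...
    def generate(self, n_samples: int) -> list: ...


class GaussianStandIn:
    """Hypothetical stand-in for VAEDataGenerator / GANDataGenerator."""

    def __init__(self, seed=42):
        self._rng = random.Random(seed)
        self._mean = 0.0
        self._std = 1.0
        self._trained = False

    def train(self, data):
        # "Training" here is just estimating two summary statistics.
        self._mean = statistics.fmean(data)
        self._std = statistics.pstdev(data)
        self._trained = True

    def generate(self, n_samples):
        if not self._trained:
            raise RuntimeError("call train() before generate()")
        return [self._rng.gauss(self._mean, self._std)
                for _ in range(n_samples)]


def generate_with_model(model: DataGenerator, training_data, n_samples):
    """Train the model on the given data, then sample from it."""
    model.train(training_data)
    return model.generate(n_samples)


samples = generate_with_model(GaussianStandIn(seed=1),
                              [10.0, 12.0, 11.0, 13.0], 500)
```

Coding against a small interface like this would let app.py swap the placeholder for a real model without changing the calling code, provided the real classes accept compatible arguments.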
- "No synthetic data generated yet": Ensure you've selected a data type, configured schema/parameters, and clicked the "Generate Synthetic Data" button.
- Performance: Very large sample sizes or high-resolution images can take a while to generate.
- Browser Caching: If updates don't appear, try refreshing your browser (Ctrl+F5 or Cmd+R).
- Model Training: Full VAE/GAN training is computationally intensive and in practice usually needs a GPU. The current implementation uses simple noise generation or basic placeholder models.
Enjoy generating your synthetic datasets!