Synthetic Data Generator for ML --- Users select desired data types and distributions, and GANs or VAEs generate original datasets for training models.

nulldevcodes/synthetic_data_generator

Synthetic Data Generator: Help & Instructions

Welcome to the Synthetic Data Generator! This application allows you to create artificial datasets for various machine learning tasks, useful for privacy-preserving development, data augmentation, and testing.

1. What is Synthetic Data?

Synthetic data is artificially generated data that mimics the statistical properties and patterns of real data but does not contain any original, real-world information. It's useful for:

  • Privacy: Developing models without exposing sensitive real data.
  • Data Augmentation: Expanding limited datasets to improve model performance.
  • Testing: Creating diverse test cases for robust model evaluation.
  • Prototyping: Building and validating pipelines before acquiring real data.

2. How to Use the App

The application is split into a sidebar for configuration and a main view for preview and download.

Sidebar (Configuration Panel)

On the left-hand side, you'll find the controls to configure your synthetic data:

A. Select Data Type

Choose the kind of data you want to generate:

  • Numeric (Tabular): Standard numerical features (e.g., age, price).
  • Categorical (Tabular): Features with discrete categories (e.g., gender, product type).
  • Mixed (Tabular): A combination of numeric and categorical features.
  • Time Series (Tabular): Tabular data with a time-based index, potentially with trends/seasonality.
  • Image: Image datasets (e.g., for computer vision tasks).

B. Dataset Schema

Define the structure of your dataset:

  • For Tabular Data (Numeric, Categorical, Mixed, Time Series):
    • Number of Features: Specify how many columns your dataset should have.
    • Edit Features: Expand this section to configure each feature:
      • Name: Give your feature a descriptive name.
      • Type: Choose numeric or categorical.
      • Min/Max Value (Numeric): Define the range for numeric features.
      • Categories (Categorical): List the possible categories, separated by commas (e.g., "Red, Green, Blue").
  • For Image Data:
    • Height/Width (px): Set the resolution of the generated images.
    • Channels: Choose Grayscale (1) for single-channel images or RGB (3) for color images.
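
The schema fields above map naturally onto a small data structure. The following is a minimal sketch, not the app's actual code: the Feature class, its field names, and the parse_categories helper are hypothetical, chosen only to mirror the sidebar controls.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """One tabular column (hypothetical structure mirroring the sidebar fields)."""
    name: str
    type: str  # "numeric" or "categorical"
    min_value: Optional[float] = None  # numeric features only
    max_value: Optional[float] = None  # numeric features only
    categories: List[str] = field(default_factory=list)  # categorical features only

    def __post_init__(self) -> None:
        # Reject schemas that the generator could not sample from.
        if self.type == "numeric":
            if self.min_value is None or self.max_value is None:
                raise ValueError(f"{self.name}: numeric features need min and max")
            if self.min_value > self.max_value:
                raise ValueError(f"{self.name}: min must not exceed max")
        elif self.type == "categorical":
            if not self.categories:
                raise ValueError(f"{self.name}: categorical features need categories")
        else:
            raise ValueError(f"{self.name}: unknown type {self.type!r}")


def parse_categories(text: str) -> List[str]:
    """Split a comma-separated string such as "Red, Green, Blue" into labels."""
    return [part.strip() for part in text.split(",") if part.strip()]


schema = [
    Feature("age", "numeric", min_value=18, max_value=90),
    Feature("color", "categorical", categories=parse_categories("Red, Green, Blue")),
]
```

Validating at construction time means a bad range or an empty category list is caught before any data is generated.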

C. Generation Parameters

Control the characteristics of the generated data:

  • Sample Size: The total number of rows (for tabular) or images (for image) to generate.
  • Target Data Distribution:
    • For Numeric/Time Series: Select statistical distributions like Uniform, Normal (Gaussian), Exponential, or Multimodal (Normal) to influence the shape of your numeric features.
    • For Categorical: Choose between Balanced (equal probability for all categories) or Unbalanced (some categories more frequent).
  • Noise Level (tabular data only): Adds random variation to the generated values. Higher values mean more noise.
  • Random Seed: An integer that makes generation reproducible. The same seed with the same settings produces the same synthetic data every time. Use 0 to have a fresh seed chosen on each generation.
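
To make the interaction between distribution, noise level, and seed concrete, here is a minimal sketch that uses only Python's standard library. It is not the app's implementation: the function names, the way min/max are turned into distribution parameters, the noise scaling, the clipping to the range, and the unbalanced weights are all assumptions made for illustration.

```python
import random
from typing import List


def sample_numeric(dist: str, n: int, low: float, high: float,
                   noise: float = 0.0, seed: int = 0) -> List[float]:
    """Draw n values from the named distribution, clipped to [low, high]."""
    # Seed 0 means "pick a fresh seed", following the convention described above.
    rng = random.Random(seed if seed != 0 else None)
    mid, span = (low + high) / 2, high - low
    values = []
    for _ in range(n):
        if dist == "uniform":
            x = rng.uniform(low, high)
        elif dist == "normal":
            x = rng.gauss(mid, span / 6)  # roughly 99.7% of the mass inside the range
        elif dist == "exponential":
            x = low + rng.expovariate(4.0 / span)  # mean sits span / 4 above low
        elif dist == "multimodal":
            centre = rng.choice([low + span / 4, high - span / 4])  # two equal modes
            x = rng.gauss(centre, span / 12)
        else:
            raise ValueError(f"unknown distribution: {dist!r}")
        x += rng.gauss(0.0, noise * span)  # additive Gaussian noise, scaled to the range
        values.append(min(max(x, low), high))
    return values


def sample_categorical(categories: List[str], n: int,
                       balanced: bool = True, seed: int = 0) -> List[str]:
    """Draw n labels with equal weights, or with weights 1, 1/2, 1/3, ..."""
    rng = random.Random(seed if seed != 0 else None)
    weights = None if balanced else [1 / (i + 1) for i in range(len(categories))]
    return rng.choices(categories, weights=weights, k=n)


ages = sample_numeric("normal", 5, low=18, high=90, noise=0.05, seed=42)
same_ages = sample_numeric("normal", 5, low=18, high=90, noise=0.05, seed=42)
print(ages == same_ages)  # identical seed and settings give identical output
```

Each call builds its own random.Random instance rather than seeding the global generator, so one feature's seed cannot silently affect another's.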

D. Generate Synthetic Data Button

Click this button after configuring all your parameters to start the data generation process. A spinner will indicate that the generation is in progress.

Main View (Data Preview & Download)

The main area of the app displays the results of your generation:

A. Tabular Data Sample (for Tabular Data Types)

Shows the first few rows of your generated tabular dataset.

B. Data Visualizations (for Tabular Data Types)

  • Distributions: Histograms (for numeric) and bar charts (for categorical) to help you understand the spread and frequency of values for each feature.
  • Correlation: A heatmap showing the correlation between numeric features, indicating how strongly they relate to each other.
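
The values behind such a heatmap can be computed as a matrix of pairwise correlation coefficients. The sketch below assumes the Pearson coefficient, which the README does not state explicitly, and uses plain Python lists with made-up column names; the app itself may well rely on a library routine instead.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations between all numeric columns, keyed by column name."""
    return {
        a: {b: pearson(col_a, col_b) for b, col_b in columns.items()}
        for a, col_a in columns.items()
    }


# Illustrative data only: weight rises with height, score falls with it.
columns = {
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.1, 6.2, 7.9],
    "score": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.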

C. Synthetic Image Samples (for Image Data Type)

Displays a grid of a few (up to 9) generated images for a quick visual inspection.

D. Download Options

  • Download Data as CSV: For tabular data, download the entire generated dataset as a CSV file.
  • Download Images as ZIP: For image data, download all generated images compressed into a ZIP archive.
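
Both downloads can be assembled entirely in memory before being handed to the browser. The following sketch uses only the standard library; the helper names and the image_0000.png naming pattern are hypothetical, and the images are assumed to be PNG-encoded bytes already (the sketch does not encode pixel arrays).

```python
import csv
import io
import zipfile
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize a non-empty list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def images_to_zip(encoded_images: List[bytes], prefix: str = "image") -> bytes:
    """Pack already-encoded image files into a ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, data in enumerate(encoded_images):
            archive.writestr(f"{prefix}_{index:04d}.png", data)
    # The archive is finalized when the with-block exits, so read the bytes after it.
    return buffer.getvalue()


csv_text = rows_to_csv([{"age": 31, "color": "Red"}, {"age": 47, "color": "Blue"}])
zip_bytes = images_to_zip([b"first image bytes", b"second image bytes"])
```

Working in memory avoids writing temporary files on the server for every generation.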

3. Advanced Concepts (Full VAE/GAN Implementation)

This application includes placeholder classes for VAE (Variational Autoencoder) and GAN (Generative Adversarial Network) models.

  • VAEs are typically used for tabular data generation, learning the underlying distribution of the data and generating new samples from that learned distribution.
  • GANs are often used for image data generation. A generator creates fake images while a discriminator tries to tell them apart from real ones; as the two are trained against each other, the generated images become increasingly realistic.

Note: The current VAEDataGenerator and GANDataGenerator classes in data_generation/ are basic TensorFlow/Keras outlines. A full implementation would involve:

  • Training Data: Providing real datasets to train these models.
  • Hyperparameter Tuning: Optimizing learning rates, network architectures, etc.
  • More Complex Architectures: For higher quality or more diverse synthetic data.

For simplicity, app.py calls create_tabular_data and create_image_data_placeholder from data_generation/utils.py directly. To activate the VAE/GAN, you would modify app.py to:

  1. Load or train your VAEDataGenerator or GANDataGenerator instances.
  2. Pass the generated data through their train() and generate() methods.
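
The two steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the README does not show the signatures of train() and generate(), so the DataGenerator interface below is a guess, and MeanGenerator is a trivial stand-in used only so the example runs without TensorFlow. It is not a VAE or a GAN.

```python
from typing import List, Protocol, Sequence


class DataGenerator(Protocol):
    """Assumed interface; the real VAEDataGenerator/GANDataGenerator may differ."""

    def train(self, data: Sequence[Sequence[float]], epochs: int) -> None: ...

    def generate(self, n_samples: int) -> List[List[float]]: ...


class MeanGenerator:
    """Stand-in model for illustration: 'learns' column means and repeats them."""

    def __init__(self) -> None:
        self.means: List[float] = []

    def train(self, data: Sequence[Sequence[float]], epochs: int = 1) -> None:
        columns = list(zip(*data))  # transpose rows into columns
        self.means = [sum(col) / len(col) for col in columns]

    def generate(self, n_samples: int) -> List[List[float]]:
        if not self.means:
            raise RuntimeError("call train() before generate()")
        return [list(self.means) for _ in range(n_samples)]


def generate_with_model(model: DataGenerator,
                        training_data: Sequence[Sequence[float]],
                        n_samples: int, epochs: int = 10) -> List[List[float]]:
    """Step 1: train the model. Step 2: sample synthetic rows from it."""
    model.train(training_data, epochs=epochs)
    return model.generate(n_samples)


rows = generate_with_model(MeanGenerator(), [[1.0, 2.0], [3.0, 4.0]], n_samples=3)
```

Coding against a small shared interface would let app.py switch between the placeholder utilities and a trained model without changing the preview and download code.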

4. Troubleshooting and Tips

  • "No synthetic data generated yet": Ensure you've selected a data type, configured schema/parameters, and clicked the "Generate Synthetic Data" button.
  • Performance: Large sample sizes and high-resolution images can take a while to generate.
  • Browser Caching: If updates don't appear, try a hard refresh (Ctrl+F5 on Windows/Linux, Cmd+Shift+R on macOS).
  • Model Training: Full VAE/GAN training can be computationally intensive and typically needs a GPU for practical use. The current implementation uses simple noise generation or basic placeholder models.

Enjoy generating your synthetic datasets!
