
Synthetic Web App

Synthetic Web App is an end-to-end project for generating realistic logistics and commerce data, landing it in Azure Data Lake Storage, processing it in Databricks, publishing it to Azure SQL, and serving it through an Azure Functions API and a TypeScript frontend.

(Screenshot: Synthetic Web App Overview)

Overview

The project models a synthetic commerce system with customers, products, warehouses, and orders. The pipeline uses NVIDIA NeMo Data Designer to generate realistic records, Azure Data Lake Storage for landing data, Databricks Auto Loader for incremental ingestion, Azure SQL for serving, and a lightweight web application for exploration.

At a glance, the platform covers:

  • Synthetic data generation guided by explicit schemas and shipping logic
  • Lake ingestion into Bronze Delta tables with Databricks Auto Loader
  • Incremental publish into Azure SQL using watermark-based processing
  • API access through Azure Functions
  • A React and TypeScript frontend for browsing operational data
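To make the first bullet concrete, here is a loose sketch of schema-guided generation. The real entity definitions live in src/schemas.py and the real generator uses NVIDIA NeMo Data Designer; the field names and bounds below are purely hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical order entity; the real fields are defined in src/schemas.py.
@dataclass
class Order:
    order_id: int
    customer_id: int
    product_sku: str
    quantity: int
    order_date: date

def generate_orders(n: int, seed: int = 0) -> list[Order]:
    """Generate n synthetic orders with plausible, bounded field values."""
    rng = random.Random(seed)  # seeded for reproducible runs
    start = date(2024, 1, 1)
    return [
        Order(
            order_id=i,
            customer_id=rng.randint(1, 500),
            product_sku=f"SKU-{rng.randint(100, 999)}",
            quantity=rng.randint(1, 10),
            order_date=start + timedelta(days=rng.randint(0, 364)),
        )
        for i in range(1, n + 1)
    ]

orders = generate_orders(3)
```

The point is the shape of the workflow, not the values: an explicit schema constrains every generated field, which is what keeps downstream tables and shipping logic consistent.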

Pipeline At A Glance

(Diagram: Pipeline Overview)

  1. Define the synthetic entities and generation rules in src/schemas.py and src/shipping_geo.py.
  2. Generate realistic datasets in src/generate_realistic_data.py.
  3. Write scheduled JSON outputs to Azure Data Lake Storage in src/daily_synthetic_pipeline.py.
  4. Ingest landed files into Bronze Delta tables with src/autoloader_bronze.py.
  5. Incrementally publish Bronze data into Azure SQL with src/sqlserver_publish.py.
  6. Orchestrate the downstream Databricks job with src/autoloader_to_sql_pipeline.py.
  7. Expose the data through the Azure Functions API in web/api/function_app.py.
  8. Visualize and interact with the data in the frontend under web/frontend.
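Step 3's "partition-friendly" landing layout can be sketched as a small path builder. The entity/ingest_date=YYYY-MM-DD layout, the container name, and the storage-account placeholder below are assumptions for illustration; the actual layout is defined in src/daily_synthetic_pipeline.py.

```python
from datetime import date

def landing_path(entity: str, run_date: date, container: str = "landing") -> str:
    """Build a date-partitioned ADLS Gen2 path for one entity's daily JSON drop.

    Date partitions let Auto Loader and downstream readers prune by ingest
    date instead of scanning the whole container.
    """
    return (
        f"abfss://{container}@<storage-account>.dfs.core.windows.net/"
        f"{entity}/ingest_date={run_date.isoformat()}/{entity}.json"
    )

path = landing_path("orders", date(2024, 1, 15))
```

A layout like this is also what makes step 4 incremental: Auto Loader tracks which files it has already seen, so each daily drop lands in a fresh partition and only new files are ingested.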

Repository Layout

  • src contains the Python scripts that drive synthetic data generation, ADLS landing, Databricks ingestion, and SQL publishing.
  • web contains the application layer: the Azure Functions API and the TypeScript frontend.
  • sql contains the database schema, table creation scripts, and schema evolution scripts for Azure SQL.
  • config contains environment and configuration helpers used across the Python pipeline.
  • init-scripts contains Databricks cluster setup scripts, including NeMo and ODBC installation.
  • data contains local sample CSVs that support development and testing flows.
  • docs is the natural home for screenshots and additional documentation as the project evolves.

Source Workflow

The src folder is the backbone of the platform. These are the main scripts in workflow order.

  1. src/client.py configures the NVIDIA NeMo client used during synthetic generation.
  2. src/schemas.py defines the core entities and fields that shape the generated datasets.
  3. src/shipping_geo.py adds warehouse, geography, and shipping-estimate realism.
  4. src/generate_realistic_data.py produces realistic records for the synthetic commerce domain.
  5. src/daily_synthetic_pipeline.py writes generated outputs to Azure Data Lake Storage in a scheduled, partition-friendly format.
  6. src/autoloader_bronze.py incrementally ingests landed JSON into Bronze Delta tables.
  7. src/sqlserver_publish.py publishes Bronze data into Azure SQL using watermark-based processing.
  8. src/autoloader_to_sql_pipeline.py runs the downstream ingestion and publish sequence together.

Supporting scripts include src/generate_data.py for earlier generation flows.
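The watermark pattern behind src/sqlserver_publish.py can be sketched in plain Python. The real job runs on Spark against Bronze Delta tables and writes to Azure SQL; here a list of dicts stands in for the table, and the _ingest_ts column name is an assumption.

```python
from datetime import datetime

def publish_increment(
    rows: list[dict], watermark: datetime
) -> tuple[list[dict], datetime]:
    """Select only rows newer than the stored watermark, then advance it.

    Each run publishes just the delta since the last run; re-running with
    the returned watermark yields an empty batch, making the job idempotent.
    """
    new_rows = [r for r in rows if r["_ingest_ts"] > watermark]
    if new_rows:
        watermark = max(r["_ingest_ts"] for r in new_rows)
    return new_rows, watermark

bronze = [
    {"order_id": 1, "_ingest_ts": datetime(2024, 1, 1)},
    {"order_id": 2, "_ingest_ts": datetime(2024, 1, 2)},
    {"order_id": 3, "_ingest_ts": datetime(2024, 1, 3)},
]
batch, wm = publish_increment(bronze, watermark=datetime(2024, 1, 1))
```

In the real pipeline the watermark would be persisted (for example in a control table) between runs, so each scheduled execution picks up exactly where the previous one stopped.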

Application Layer

The web folder holds the application layer: the Azure Functions API in web/api/function_app.py exposes the data published to Azure SQL, and the React and TypeScript frontend under web/frontend consumes that API.

(Diagram: Pipeline Overview)

Database Layer

The sql folder contains the scripts used to bootstrap and evolve the Azure SQL schema.

Next Steps

Read more about the web app in the web directory.
