GitHub - jmandel/synthea-to-bigquery: Tool to generate synthetic EHR data via Synthea and load into BigQuery

Quick and Dirty Data Loader

Download the Synthea data release and prepare for BigQuery:

Translate patient bundles into resource-specific .ndjson files such as Patient.ndjson.gz, each containing all resources of a given type
Generate BigQuery schema files for each resource, based on the complete hierarchy of fields used within the resource-specific .ndjson files
Load all resource-specific .ndjson files into BigQuery
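
The first step above can be sketched roughly as follows. This is a minimal illustration, not the code in prepare.py: the function name `split_bundles` and its arguments are hypothetical, and it assumes each input file is a single FHIR Bundle in JSON whose `entry` items carry a `resource` with a `resourceType`.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def split_bundles(bundle_paths, out_dir):
    """Append every resource in the given bundles to <out_dir>/<ResourceType>.ndjson.gz.

    Hypothetical helper for illustration; returns a {resourceType: count} dict.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}  # resourceType -> open gzip text handle
    counts = defaultdict(int)
    try:
        for path in bundle_paths:
            with open(path, encoding="utf-8") as f:
                bundle = json.load(f)
            for entry in bundle.get("entry", []):
                resource = entry.get("resource")
                if not resource or "resourceType" not in resource:
                    continue  # skip entries that carry no typed resource
                rtype = resource["resourceType"]
                if rtype not in writers:
                    target = out_dir / f"{rtype}.ndjson.gz"
                    writers[rtype] = gzip.open(target, "wt", encoding="utf-8")
                # One JSON document per line is what makes the file NDJSON.
                writers[rtype].write(json.dumps(resource) + "\n")
                counts[rtype] += 1
    finally:
        for handle in writers.values():
            handle.close()
    return dict(counts)
```

Keeping one open writer per resource type means each bundle is read only once, at the cost of holding one file handle per type.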

./01-prepare.sh
./02-load.sh
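
For orientation, the load step can be pictured as one `bq load` invocation per resource type. The sketch below only builds such a command; it is not taken from 02-load.sh. The dataset name `synthea`, the helper name `bq_load_command`, and the schema file path are all assumptions made for illustration.

```python
from pathlib import Path


def bq_load_command(ndjson_path, schema_path, dataset="synthea"):
    """Build a `bq load` argument list for one resource-specific file.

    Hypothetical helper: the table name is taken from the file name,
    e.g. Patient.ndjson.gz -> <dataset>.Patient.
    """
    table = Path(ndjson_path).name.split(".")[0]
    return [
        "bq", "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        f"{dataset}.{table}",
        str(ndjson_path),
        str(schema_path),
    ]
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where the `bq` CLI is installed and authenticated.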

Note: As currently written, this takes on the order of 3 hours.

TODO:

Clean up the logic for schema generation (currently it's unreadable / unmaintainable)
Make it easy to parallelize (e.g. multiple files per resource type, so generation can run in multiple processes at once)
Move schema generation to a post-processing step to avoid the runtime overhead of walking each JSON hierarchy when generating resource-specific .ndjson files
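
One way the post-processing idea could look is sketched below: a separate pass reads a finished .ndjson.gz file, records every field it sees, and emits a BigQuery-style schema (a list of `name` / `type` / `mode` / `fields` entries). This is an assumption-laden illustration, not the logic in generate_schema.py; the function names are hypothetical, arrays map to `REPEATED`, objects to `RECORD`, and conflicting scalar types fall back to `STRING`.

```python
import gzip
import json


def _scalar_type(value):
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def observe(tree, record):
    """Merge the fields of one JSON object into `tree` (field name -> info)."""
    for name, value in record.items():
        info = tree.setdefault(name, {"type": None, "repeated": False, "fields": {}})
        if isinstance(value, list):
            info["repeated"] = True
            items = value
        else:
            items = [value]
        for item in items:
            if item is None:
                continue
            if isinstance(item, dict):
                info["type"] = "RECORD"
                observe(info["fields"], item)
            elif info["type"] != "RECORD":
                seen = _scalar_type(item)
                if info["type"] in (None, seen):
                    info["type"] = seen
                elif {info["type"], seen} == {"INTEGER", "FLOAT"}:
                    info["type"] = "FLOAT"  # widen mixed numbers
                else:
                    info["type"] = "STRING"  # fallback for conflicting types
    return tree


def to_bigquery_schema(tree):
    """Convert the accumulated field tree into a BigQuery schema list."""
    schema = []
    for name in sorted(tree):
        info = tree[name]
        field = {
            "name": name,
            "type": info["type"] or "STRING",
            "mode": "REPEATED" if info["repeated"] else "NULLABLE",
        }
        if field["type"] == "RECORD":
            field["fields"] = to_bigquery_schema(info["fields"])
        schema.append(field)
    return schema


def schema_for_file(ndjson_gz_path):
    """Walk one resource-specific .ndjson.gz file and return its schema."""
    tree = {}
    with gzip.open(ndjson_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                observe(tree, json.loads(line))
    return to_bigquery_schema(tree)
```

Because this pass only reads the finished files, it could in principle run once per resource type in separate processes, independent of the step that writes them.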
