This repository contains the code, data, and documentation for my Master's thesis on generating and evaluating realistic synthetic healthcare datasets. It demonstrates a reproducible pipeline for:
Data Cleaning — Preparing clinical (EHR) and proteomic (TCGA) datasets.
Synthetic Data Generation — Creating synthetic patient records using multiple methods (synthpop, vine copula, ctGAN).
Evaluation — Comparing univariate distributions and other metrics between real and synthetic data.
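As a rough illustration of the evaluation step, the sketch below compares one numeric column from a real dataset and a synthetic dataset using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). This is an assumption made for illustration only: the function name `ks_statistic` and the sample values are hypothetical, and the metrics actually used in the thesis may differ.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Return the largest gap between the two empirical CDFs.

    0.0 means the samples have identical empirical distributions;
    1.0 means their value ranges do not overlap at all.
    Hypothetical helper, not taken from the thesis code.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Made-up values standing in for one variable (e.g. patient age).
real_ages = [34, 45, 52, 61, 70]
synthetic_ages = [33, 47, 50, 63, 72]
print(ks_statistic(real_ages, synthetic_ages))
```

A value close to 0 suggests the synthetic variable follows the real one closely. Running the same comparison per column gives a simple univariate overview.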
The study was approved by the ethics committee of the Faculty of Social and Behavioural Sciences of Utrecht University:
FETC: 24-2032
This archive will remain accessible on GitHub indefinitely. I am responsible for the research archive. If you have any questions, feel free to contact me at: l.jochim@students.uu.nl