This repository implements a Gaussian Mixture Model (GMM) workflow for clustering and imputing incomplete data. It combines:
- a Python implementation of the alternating GMM-imputation algorithm;
- reproducible notebooks for step-by-step explanation, reproduction, and application;
- a modular LaTeX report with figures, experiments, and case studies.
The project is centered on the paper Gaussian Mixture Model Clustering with Incomplete Data and extends it with local implementation details, reproduction experiments, and application-driven artifacts.
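The imputation half of an alternating scheme can be illustrated with a minimal NumPy sketch. It assumes the mixture parameters are held fixed and fills each missing entry with the responsibility-weighted conditional mean given the observed entries. The function name `impute_with_gmm` and its signature are hypothetical and are not taken from `src/gmm_missing.py`; the repository's estimator may differ in details.

```python
import numpy as np

def impute_with_gmm(X, weights, means, covs):
    """Fill NaNs in X using a fixed GMM (hypothetical helper, not the repo's API).

    Each missing block is replaced by the conditional mean of every component
    given the observed block, averaged with the component responsibilities.
    """
    X_filled = X.copy()
    n_components = len(weights)
    for i, x in enumerate(X):
        miss = np.isnan(x)
        if not miss.any():
            continue  # nothing to impute in this row
        obs = ~miss
        log_resp = np.empty(n_components)
        cond_means = np.empty((n_components, miss.sum()))
        for k in range(n_components):
            mu, S = means[k], covs[k]
            if obs.any():
                S_oo = S[np.ix_(obs, obs)]
                diff = x[obs] - mu[obs]
                sol = np.linalg.solve(S_oo, diff)
                _, logdet = np.linalg.slogdet(S_oo)
                # Log of weight times observed-block Gaussian density; the
                # (2*pi) constant is shared by all components and is dropped.
                log_resp[k] = np.log(weights[k]) - 0.5 * (diff @ sol + logdet)
                # Conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
                cond_means[k] = mu[miss] + S[np.ix_(miss, obs)] @ sol
            else:
                # Fully missing row: fall back to the prior weights and means.
                log_resp[k] = np.log(weights[k])
                cond_means[k] = mu[miss]
        resp = np.exp(log_resp - log_resp.max())  # stabilised softmax
        resp /= resp.sum()
        X_filled[i, miss] = resp @ cond_means
    return X_filled
```

In an alternating loop, a step like this would be followed by refitting the mixture on the filled matrix, repeating until the imputed values stop changing.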
GMM_4_missing_data/
├── data/
│ └── CC GENERAL.csv
├── docs/
│ └── Gaussian Mixture Model Clustering with Incomplete Data.pdf
├── latex_composition/
│ ├── main.tex
│ ├── main.pdf
│ ├── figures/
│ └── sections/
├── notebooks/
│ ├── step_by_step_experiment.ipynb
│ ├── GMM_Nhat.ipynb
│ ├── airplane_crashes_gmm_imputation.ipynb
│ └── artifacts/
├── src/
│ ├── gmm_missing.py
│ ├── utils.py
│ └── airplane_crashes_pipeline.py
├── pyproject.toml
├── requirements.txt
└── README.md
- src/gmm_missing.py: main GMMMissing estimator.
- src/utils.py: synthetic data generation and basic imputers.
- src/airplane_crashes_pipeline.py: end-to-end application pipeline for the airplane crashes case study.
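As an illustration of what synthetic data generation and a basic imputer can look like, here is a small NumPy sketch. It assumes two Gaussian blobs in two dimensions, entries removed completely at random, and a column-mean baseline. The names `make_incomplete_blobs` and `mean_impute` are hypothetical and do not describe the actual contents of `src/utils.py`.

```python
import numpy as np

def make_incomplete_blobs(n_samples=200, missing_rate=0.2, seed=0):
    """Draw two Gaussian blobs and hide entries at random (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    labels = rng.integers(0, len(centers), size=n_samples)
    X_full = centers[labels] + rng.normal(size=(n_samples, 2))
    # True where an entry is hidden; each entry is masked independently.
    mask = rng.random(X_full.shape) < missing_rate
    X_missing = X_full.copy()
    X_missing[mask] = np.nan
    return X_full, X_missing, mask, labels

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)
```

Keeping the complete matrix and the mask alongside the incomplete matrix makes it possible to score any imputer against the hidden ground truth later.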
- notebooks/step_by_step_experiment.ipynb: pedagogical walkthrough of the EM-style update cycle on synthetic data.
- notebooks/GMM_Nhat.ipynb: reproduction experiment on the CC GENERAL dataset.
- notebooks/airplane_crashes_gmm_imputation.ipynb: real-data application with generated figures and RMSE-based imputation evaluation.
- latex_composition/main.tex: source of the report.
- latex_composition/main.pdf: compiled paper.
- latex_composition/figures/: figures used in the report, including notebook-exported artifacts.
The project targets Python >=3.12.
Using uv:
uv sync
Using pip:
pip install -r requirements.txt
Run the notebooks:
uv run jupyter lab
Recommended order:
1. notebooks/step_by_step_experiment.ipynb
2. notebooks/GMM_Nhat.ipynb
3. notebooks/airplane_crashes_gmm_imputation.ipynb
Compile the LaTeX report:
cd latex_composition
latexmk -pdf -interaction=nonstopmode main.tex
This repository currently includes three complementary layers of evidence:
- algorithm-level inspection through the step-by-step notebook;
- reproduction on CC GENERAL;
- application-level evaluation on the airplane crashes dataset, with exported figures and RMSE on masked entries.
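RMSE on masked entries can be computed as in the sketch below, which assumes a ground-truth matrix, an imputed matrix of the same shape, and a boolean mask marking the entries that were hidden before imputation. The function name `masked_rmse` is hypothetical; the notebooks may compute the metric differently.

```python
import numpy as np

def masked_rmse(X_true, X_imputed, mask):
    """Root-mean-square error restricted to the entries where mask is True."""
    err = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Tiny usage example: only the single hidden entry contributes to the score.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imputed = np.array([[1.0, 0.0], [3.0, 4.0]])
mask = np.array([[False, True], [False, False]])
score = masked_rmse(X_true, X_imputed, mask)  # error of 2.0 on one entry
```

Restricting the error to masked entries keeps the observed values, which every imputer leaves unchanged, from diluting the score.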
Artifacts generated in notebooks are intended to be promoted into latex_composition/figures/ so that the report remains tied to actual outputs rather than illustrative placeholders.
Important outputs already tracked in the project include:
- latex_composition/main.pdf
- figures under latex_composition/figures/applications/
- figures under latex_composition/figures/notebook_exports/
- The LaTeX report uses a custom style file at latex_composition/setting/khang_paper.sty.
- The current font setup keeps the original lmodern and microtype configuration, which may produce non-blocking T5-related warnings during compilation.