This repository provides 10 synthetic orbital action datasets of stars (sibling stars), generated by our numerical simulation that mimics the formation process of the Milky Way. Each dataset ("/dataset/seed**.csv") contains pre-processed three-dimensional orbital action features of 275 stars. Each star is associated with its own empirical uncertainty set, which consists of 101 feature instances. Overall, each dataset contains 275 [stars] x 101 [feature instances in the uncertainty set of each star] = 27775 [feature instances], which are associated with pre-computed penalties. This repository also provides R source code to reproduce the experimental results in our paper [1].
While [1] provides a greedy and optimistic approach to clustering (GOC) and applies GOC to the synthetic datasets (whose true clusters are known), [2] further applies GOC to real-world orbital action datasets of stars. Please see [1] for further details of these synthetic datasets and experimental results.
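The dataset size stated above can be sanity-checked with a few lines of R. This is a minimal sketch, not part of the repository: the relative path "dataset/seed1.csv", the presence of a header row, and the one-row-per-feature-instance layout are all assumptions, so adjust them to the actual files.

```r
# Sizes stated in this README.
n_stars <- 275
n_instances <- 101
expected_rows <- n_stars * n_instances  # 27775
print(expected_rows)

# Assumed relative path; adjust to where the repository is checked out.
path <- file.path("dataset", "seed1.csv")
if (file.exists(path)) {
  # Assumes a header row and one row per feature instance;
  # use header = FALSE if the file has no header.
  d <- read.csv(path)
  cat("rows:", nrow(d), "columns:", ncol(d), "\n")
  cat("matches expected row count:", nrow(d) == expected_rows, "\n")
} else {
  message("File not found: ", path)
}
```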
[1] A. Okuno and K. Hattori. (2022) "A Greedy and Optimistic Approach to Clustering with a Specified Uncertainty of Covariates", arXiv:2204.08205
[2] K. Hattori, A. Okuno, and I. U. Roederer. (2023) "Finding r-II sibling stars in the Milky Way with the Greedy Optimistic Clustering algorithm", The Astrophysical Journal, 946(1), 48.
If you use these datasets, please cite our papers [1] and [2] with the following BibTeX entries:
@article{Okuno2022,
year = {2022},
publisher = {CoRR},
volume = {},
number = {},
pages = {},
author = {Okuno, Akifumi and Hattori, Kohei},
title = {A Greedy and Optimistic Approach to Clustering with a Specified Uncertainty of Covariates},
journal = {arXiv preprint arXiv:2204.08205},
note = {submitted.}
}
@article{Hattori2023,
year = {2023},
publisher = {The American Astronomical Society},
volume = {946},
number = {1},
pages = {48},
author = {Hattori, Kohei and Okuno, Akifumi and Roederer, Ian U.},
title = {Finding r-II Sibling Stars in the Milky Way with the Greedy Optimistic Clustering Algorithm},
journal = {The Astrophysical Journal}
}
A. Okuno (ISM and RIKEN AIP, okuno@ism.ac.jp; https://okuno.net/)
K. Hattori (NAOJ, ISM, and U. Michigan, khattori@ism.ac.jp; https://koheihattori.github.io/)
Figure: (left) true classes; (right) clusters detected by GOC.
The "/dataset" directory contains 10 datasets, "seed1.csv" to "seed10.csv". See [1] (Example 1 in Section 2 and Appendix A) for a more detailed description of these datasets. The instances in these files are standardized (with a few outliers removed); they were processed by "/verbose/preprocessing.R". In "/dataset/verbose", you can find intermediate products of the numerical simulation used to generate the synthetic datasets.
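For readers unfamiliar with standardization, the following minimal sketch shows the usual column-wise version (zero mean, unit standard deviation) using base R's scale(). It runs on toy random data, not on the repository files, and it is only an assumption that "/verbose/preprocessing.R" standardizes in this way; the outlier removal step is not shown.

```r
set.seed(1)

# Toy stand-in for three-dimensional features (NOT the repository data).
x <- matrix(rnorm(30, mean = 5, sd = 2), ncol = 3)

# Center each column to mean 0 and rescale it to standard deviation 1.
z <- scale(x)

print(round(colMeans(z), 10))  # column means after standardization
print(apply(z, 2, sd))         # column standard deviations after standardization
```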
The script "visualization.R" visualizes the datasets and the clustering results. The resulting figures are saved to "/output/visualization".
The script "experiments.R" applies GOC to our synthetic datasets. The clustering results are saved to "/output".
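To give a rough feeling for what a greedy and optimistic clustering loop can look like, here is a toy sketch in base R. It is not the repository's GOC function, and it should not be read as the algorithm of [1]. The helper name goc_sketch, its arguments, the use of kmeans() as the clustering oracle, and the rule "squared distance to the nearest center plus penalty" are all assumptions made for illustration. The sketch alternates between clustering one selected instance per item and re-selecting, for each item, the instance in its uncertainty set that looks best under the current clusters.

```r
# Hypothetical helper illustrating a greedy/optimistic loop; NOT the
# repository's GOC function.
# sets:    list of matrices; rows of sets[[i]] are candidate instances of item i
# penalty: list of numeric vectors; penalty[[i]][j] penalizes row j of sets[[i]]
goc_sketch <- function(sets, penalty, k, n_iter = 20) {
  n <- length(sets)
  chosen <- rep(1L, n)  # start from the first instance of every set
  pick <- function(idx) {
    # Build the n x d matrix of currently selected instances.
    t(sapply(seq_len(n), function(i) sets[[i]][idx[i], ]))
  }
  for (iter in seq_len(n_iter)) {
    fit <- kmeans(pick(chosen), centers = k, nstart = 5)  # clustering oracle
    centers_t <- t(fit$centers)  # columns are cluster centers
    # Optimistic step: per item, select the instance minimizing
    # (squared distance to nearest center) + penalty.
    new_chosen <- sapply(seq_len(n), function(i) {
      d2 <- apply(sets[[i]], 1, function(row) min(colSums((centers_t - row)^2)))
      which.min(d2 + penalty[[i]])
    })
    if (all(new_chosen == chosen)) break  # selection is stable
    chosen <- new_chosen
  }
  fit <- kmeans(pick(chosen), centers = k, nstart = 5)  # final fit
  list(cluster = fit$cluster, chosen = chosen)
}

# Toy data: 10 items in two well-separated groups, 3 candidate instances each.
set.seed(1)
truth <- rep(1:2, each = 5)
centers <- rbind(c(0, 0, 0), c(10, 10, 10))
sets <- lapply(truth, function(g) {
  base <- centers[g, ] + rnorm(3, sd = 0.3)
  rbind(base, base + rnorm(3, sd = 0.5), base + rnorm(3, sd = 0.5))
})
penalty <- lapply(sets, function(s) c(0, 0.1, 0.1))

res <- goc_sketch(sets, penalty, k = 2)
print(table(truth, res$cluster))
```

Passing another clustering routine in place of kmeans() would only change the two lines marked as the oracle; see [1] for the actual formulation and its guarantees.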
- "/verbose/dependencies.R" lists the required packages (used in experiments.R and visualization.R)
- "/verbose/preprocessing.R" preprocesses (standardizes) the dataset instances
- "/verbose/evaluation_metrics.R" provides clustering scores (NMI and F-measure)
- "/verbose/GOC.R" provides the GOC function together with the clustering oracle "clm" used therein. In particular, we employed k-means, k-medoids (in ClusterR), and Gaussian mixture models (in ClusterR and Mclust)
- "/verbose/summary_in_LaTeX_form.R" outputs the summary of experimental results (in the form of LaTeX tables)
- "/verbose/convergence.R" outputs the summary of experimental results on the convergence of GOC
- "/verbose/affinity_propagation.R" conducts experiments on affinity propagation
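As an illustration of the clustering scores mentioned in the list above, the following base R sketch computes a normalized mutual information (NMI) between true classes and detected clusters. The function name nmi is hypothetical, and the normalization used here (mutual information divided by the arithmetic mean of the two entropies) is an assumption; "/verbose/evaluation_metrics.R" may use a different normalization. The F-measure is not shown.

```r
# Hypothetical helper: NMI with arithmetic-mean normalization,
# NMI = 2 * I(T; P) / (H(T) + H(P)).
nmi <- function(truth, pred) {
  joint <- table(truth, pred) / length(truth)  # joint distribution
  p_t <- rowSums(joint)                        # marginal of true classes
  p_p <- colSums(joint)                        # marginal of predicted clusters
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }
  nz <- joint > 0                              # skip empty cells (0 * log 0 = 0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(p_t, p_p)[nz]))
  h_sum <- entropy(p_t) + entropy(p_p)
  if (h_sum == 0) return(1)                    # both partitions are trivial
  2 * mi / h_sum
}

# Identical partitions up to relabeling, and two unrelated partitions.
score_same <- nmi(c(1, 1, 2, 2), c("a", "a", "b", "b"))
score_indep <- nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(score_same, score_indep))
```

The score does not depend on how clusters are labeled, which is why the first toy call compares numeric class labels against character cluster labels.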