DC-DPM is a distributed clustering implementation for Apache Spark, based on the Dirichlet Process Mixture (DPM) model. The approach is described in the following paper:
Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC: Symposium on Applied Computing, Apr 2019, Limassol, Cyprus.
Please cite our paper if the code helps you. Thank you.
@inproceedings{meguelati:hal-01999453,
TITLE = {Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution},
AUTHOR = {Meguelati, Khadidja and Fontez, B{\'e}n{\'e}dicte and Hilgert, Nadine and Masseglia, Florent},
URL = {https://hal.archives-ouvertes.fr/hal-01999453},
BOOKTITLE = {{SAC: Symposium on Applied Computing}},
ADDRESS = {Limassol, Cyprus},
YEAR = {2019},
MONTH = Apr,
DOI = {10.1145/3297280.3297327},
KEYWORDS = {Dirichlet Process Mixture Model ; Clustering ; Parallelism},
PDF = {\url{https://hal.archives-ouvertes.fr/hal-01999453/file/ACM_SigConf_SAC2019.pdf}},
HAL_ID = {hal-01999453},
HAL_VERSION = {v1}
}
DC-DPM works with Apache Spark. To run it, download and install Spark release 2.0.0. The code is written in Scala, so you also need to install Scala 2.11.6.
We use Maven to build it. Use the provided pom.xml file to build an executable jar containing all the dependencies.
To execute DC-DPM, use the following command:
$SPARK_HOME/bin/spark-submit --class "com.mycompany.dcdpm.App" DCDPM-jar-with-dependencies.jar <variance in clusters> <variance between centers> <dimensions> <number of workers> <number of distributions> <target to data file> <number of clusters for Kmeans> <number of real clusters> <real clusters are known>
- variance in clusters: the data are assumed to be generated from a normal distribution whose covariance matrix is the n-dimensional identity matrix scaled by σ² (σ² on the diagonal, 0 elsewhere). Give the value of σ².
- variance between centers: the cluster centers are assumed to be generated from a normal distribution whose covariance matrix is the n-dimensional identity matrix scaled by σ² (σ² on the diagonal, 0 elsewhere). Give the value of σ².
- dimensions: the number of dimensions n of the data
- number of workers:
- number of distributions: in each distribution, several iterations of Gibbs sampling are performed on each worker, followed by a synchronization at the master level
- target to data file: the path to the data file, which should be formatted as follows:
  - one data point per line
  - values are separated by a space " "
  - if the ground truth is known, the last column of the data file should contain the label of the real cluster for each data point. See data with known ground truth.txt for an example with 3 real clusters, and data with unknown ground truth.txt for an example without labels.
- number of clusters for Kmeans: DPM is initialized with a K-means step; give the number of clusters to use for this K-means initialization
- number of real clusters: if the ground truth is known, give the number of real clusters; otherwise enter 0
- real clusters are known: enter 1 if the ground truth is known, otherwise enter 0
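
As an illustration of how the parameters above fit together, here is a minimal shell sketch. It writes a tiny labelled data file in the described format, checks that every line has the expected number of columns, and prints the resulting spark-submit command without running it. The file name example_data.txt and all numeric values (variances, number of workers, number of distributions, number of K-means clusters) are hypothetical and chosen only for the example; they are not recommended settings.

```shell
#!/bin/sh
set -eu

# Hypothetical values, for illustration only.
SIGMA2_IN_CLUSTERS=1.0       # variance in clusters
SIGMA2_BETWEEN_CENTERS=10.0  # variance between centers
DIMENSIONS=2                 # dimensions
WORKERS=4                    # number of workers
DISTRIBUTIONS=5              # number of distributions
DATA_FILE="example_data.txt" # target to data file (hypothetical name)
KMEANS_CLUSTERS=3            # number of clusters for Kmeans
GROUND_TRUTH_KNOWN=1         # real clusters are known

# One data point per line, values separated by a space,
# real cluster label in the last column (ground truth known).
cat > "$DATA_FILE" <<'EOF'
0.10 0.20 1
0.15 0.25 1
5.00 5.10 2
5.05 4.90 2
9.80 0.10 3
EOF

# Each line must hold DIMENSIONS values plus one label column.
EXPECTED_COLUMNS=$((DIMENSIONS + 1))
BAD_LINES=$(awk -v n="$EXPECTED_COLUMNS" 'NF != n { c++ } END { print c + 0 }' "$DATA_FILE")

# Number of real clusters = number of distinct labels in the last column.
REAL_CLUSTERS=$(awk '{ print $NF }' "$DATA_FILE" | sort -u | wc -l | tr -d ' ')

# Assemble the command in the argument order documented above.
CMD="\$SPARK_HOME/bin/spark-submit --class \"com.mycompany.dcdpm.App\" DCDPM-jar-with-dependencies.jar $SIGMA2_IN_CLUSTERS $SIGMA2_BETWEEN_CENTERS $DIMENSIONS $WORKERS $DISTRIBUTIONS $DATA_FILE $KMEANS_CLUSTERS $REAL_CLUSTERS $GROUND_TRUTH_KNOWN"

echo "lines with a wrong column count: $BAD_LINES"
echo "distinct labels found: $REAL_CLUSTERS"
echo "$CMD"
```

When the ground truth is unknown, the data file would omit the label column, the expected column count would equal the number of dimensions, and the last two arguments would both be 0.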