khadidjaM/DC-DPM

DC-DPM

DC-DPM is a distributed clustering approach implemented with Spark and based on the Dirichlet Process Mixture. The approach is described in the following paper:

Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC: Symposium on Applied Computing, Apr 2019, Limassol, Cyprus.

Please cite our paper if the code helps you. Thank you.

@inproceedings{meguelati:hal-01999453,
  TITLE = {Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution},
  AUTHOR = {Meguelati, Khadidja and Fontez, B{\'e}n{\'e}dicte and Hilgert, Nadine and Masseglia, Florent},
  URL = {https://hal.archives-ouvertes.fr/hal-01999453},
  BOOKTITLE = {{SAC: Symposium on Applied Computing}},
  ADDRESS = {Limassol, Cyprus},
  YEAR = {2019},
  MONTH = Apr,
  DOI = {10.1145/3297280.3297327},
  KEYWORDS = {Dirichlet Process Mixture Model ; Clustering ; Parallelism},
  PDF = {\url{https://hal.archives-ouvertes.fr/hal-01999453/file/ACM_SigConf_SAC2019.pdf}},
  HAL_ID = {hal-01999453},
  HAL_VERSION = {v1}
}

Requirements

DC-DPM runs on Apache Spark. To run it, download and install Spark release 2.0.0. The code is written in Scala, so you also need to install Scala 2.11.6.

Building

We use Maven for the build. Use the provided pom.xml file to build an executable jar containing all the dependencies.
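As a minimal sketch of the build step, the helper below wraps a standard Maven invocation. It assumes that Maven is on the PATH, that the command is run from the directory containing pom.xml, and that the pom binds the jar-with-dependencies assembly to the `package` phase; if it does not, a different goal may be needed. The function name `build_dcdpm` and the `MVN` override are hypothetical and are not part of the project.

```shell
# Hypothetical helper: build the jar with the provided pom.xml.
# Assumption: the assembly is bound to the "package" phase in pom.xml.
build_dcdpm() {
  # MVN lets you substitute another Maven executable; it defaults to "mvn".
  "${MVN:-mvn}" -f pom.xml clean package
}
```

Maven writes its output under `target/` by default, so that is where the jar would normally be found after calling `build_dcdpm`.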

Use

To execute DC-DPM, use the following command:

$SPARK_HOME/bin/spark-submit --class "com.mycompany.dcdpm.App" DCDPM-jar-with-dependencies.jar <variance in clusters> <variance between centers> <dimensions> <number of workers> <number of distributions> <target to data file> <number of clusters for Kmeans> <number of real clusters> <real clusters are known>

Necessary parameters

  1. variance in clusters: We assume that the data in each cluster are generated from a normal distribution. Its covariance matrix is an n-dimensional identity matrix scaled by σ², that is, with the value σ² on the diagonal. Give the value of σ² for the clusters.
  2. variance between centers: We assume that the cluster centers are generated from a normal distribution. Its covariance matrix is an n-dimensional identity matrix scaled by σ², that is, with the value σ² on the diagonal. Give the value of σ² for the centers.
  3. dimensions: the number of dimensions of the data
  4. number of workers:
  5. number of distributions: each distribution consists of several iterations of Gibbs sampling on each worker, followed by a synchronization at the master level
  6. target to data file: The data file should be as follows:
  7. number of clusters for Kmeans: DPM is initialized by a K-means step. Give the number of clusters to use for this K-means initialization.
  8. number of real clusters: if the ground truth is known, give the number of real clusters; otherwise enter 0
  9. real clusters are known: enter 1 if the ground truth is known, otherwise enter 0
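Because the command takes nine positional arguments, it is easy to drop or misorder one. The wrapper below is a sketch that checks the argument count before calling spark-submit. The function name `run_dcdpm` and the `DRY_RUN` switch are hypothetical additions, not part of DC-DPM; the class name and jar name are the ones from the command above, and `SPARK_HOME` is assumed to point at the Spark installation.

```shell
# Hypothetical wrapper around the spark-submit command shown above.
# Argument order: <variance in clusters> <variance between centers> <dimensions>
#   <number of workers> <number of distributions> <target to data file>
#   <number of clusters for Kmeans> <number of real clusters> <real clusters are known>
run_dcdpm() {
  if [ "$#" -ne 9 ]; then
    echo "run_dcdpm: expected 9 arguments, got $#" >&2
    return 2
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command for review instead of launching Spark.
    echo "$SPARK_HOME/bin/spark-submit" --class "com.mycompany.dcdpm.App" \
      DCDPM-jar-with-dependencies.jar "$@"
  else
    "$SPARK_HOME/bin/spark-submit" --class "com.mycompany.dcdpm.App" \
      DCDPM-jar-with-dependencies.jar "$@"
  fi
}
```

For example, `DRY_RUN=1 run_dcdpm 1.0 100.0 2 4 10 data/points.txt 20 0 0` prints the full command without running it. All nine values here are placeholders chosen for illustration, including the data file path; the last two are 0 because the ground truth is taken to be unknown.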
