DC-DPM

DC-DPM is a distributed clustering approach for Apache Spark, based on the Dirichlet Process Mixture model. The approach is described in the following paper:

Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC: Symposium on Applied Computing, Apr 2019, Limassol, Cyprus.

Please kindly cite our paper if the code helps you. Thank you.

@inproceedings{meguelati:hal-01999453,
  TITLE = {Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution},
  AUTHOR = {Meguelati, Khadidja and Fontez, B{\'e}n{\'e}dicte and Hilgert, Nadine and Masseglia, Florent},
  URL = {https://hal.archives-ouvertes.fr/hal-01999453},
  BOOKTITLE = {{SAC: Symposium on Applied Computing}},
  ADDRESS = {Limassol, Cyprus},
  YEAR = {2019},
  MONTH = Apr,
  DOI = {10.1145/3297280.3297327},
  KEYWORDS = {Dirichlet Process Mixture Model ; Clustering ; Parallelism},
  PDF = {\url{https://hal.archives-ouvertes.fr/hal-01999453/file/ACM_SigConf_SAC2019.pdf}},
  HAL_ID = {hal-01999453},
  HAL_VERSION = {v1}
}

Requirements

DC-DPM runs on Apache Spark. To run it, download and install Spark release 2.0.0. The code is written in Scala, so you also need to install Scala 2.11.6.

Building

We use Maven for the build. Use the provided pom.xml file to build an executable jar containing all the dependencies.
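
As a minimal sketch of the build step, the helper below runs the standard Maven `clean package` phases from the directory that contains pom.xml. The function name `build_dcdpm` is hypothetical and not part of the project. It also assumes that the provided pom.xml produces the jar with dependencies during the `package` phase; if it does not, the assembly goal may need to be invoked explicitly.

```shell
#!/bin/sh
# Hypothetical helper (not part of DC-DPM): build the jar with Maven.
build_dcdpm() {
  # Refuse to run outside the directory that holds pom.xml.
  if [ ! -f pom.xml ]; then
    echo "pom.xml not found in $(pwd)" >&2
    return 1
  fi
  # "clean" and "package" are standard Maven lifecycle phases;
  # Maven writes its build output to the target/ directory.
  mvn clean package
}

# Run from the repository root:
# build_dcdpm
```

The call is left commented out so that the snippet can be sourced without starting a build.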

Use

To execute DC-DPM, use the following command:

$SPARK_HOME/bin/spark-submit --class "com.mycompany.dcdpm.App" DCDPM-jar-with-dependencies.jar <variance in clusters> <variance between centers> <dimensions> <number of workers> <number of distributions> <target to data file> <number of clusters for Kmeans> <number of real clusters> <real clusters are known>
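
The command above takes nine positional arguments, described under Necessary parameters. The sketch below wraps it in a small helper that refuses to submit when the argument count is wrong. The function name `dcdpm_submit` is hypothetical, and every value in the example call is illustrative only, not a recommended setting.

```shell
#!/bin/sh
# Hypothetical helper (not part of DC-DPM): checks that all nine positional
# arguments are present before calling spark-submit.
dcdpm_submit() {
  if [ "$#" -ne 9 ]; then
    echo "dcdpm_submit: expected 9 arguments, got $#" >&2
    return 2
  fi
  # Quoting "$@" keeps a data file path that contains spaces as one argument.
  "$SPARK_HOME/bin/spark-submit" --class "com.mycompany.dcdpm.App" \
    DCDPM-jar-with-dependencies.jar "$@"
}

# Example call; all values are illustrative and must match your own data:
# dcdpm_submit 1 100 2 4 10 "data with known ground truth.txt" 3 3 1
```

Quoting the data file path matters here because the example file names in this repository contain spaces.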

Necessary parameters

variance in clusters: We assume that the data are generated from a normal distribution whose covariance matrix is the n-dimensional identity matrix with the value σ² on the diagonal. Give the value of σ².
variance between centers: We assume that the cluster centers are generated from a normal distribution whose covariance matrix is the n-dimensional identity matrix with the value σ² on the diagonal. Give the value of σ² for the centers.
dimensions: the number of dimensions
number of workers:
number of distributions: in each distribution we perform several iterations of Gibbs sampling on each worker, followed by a synchronization at the master level
target to data file: the path to the data file. The data file should be formatted as follows:

one data point per line
values are separated by a space " "
if the ground truth is known, the last column of each line should contain the label of the real cluster for that data point. See data with known ground truth.txt for an example with 3 real clusters, and data with unknown ground truth.txt for an example of the other case.

number of clusters for Kmeans: DPM is initialized by a K-means step; give the number of clusters to use for this K-means initialization
number of real clusters: if the ground truth is known, give the number of real clusters; otherwise enter 0
real clusters are known: enter 1 if the ground truth is known, otherwise enter 0
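
The data file format described above can be checked before a job is submitted. The sketch below writes a tiny 2-dimensional sample with a label in the last column and verifies that every line has the expected number of fields. The file name `sample_known.txt`, the helper `check_fields`, and the coordinate and label values are all made up for illustration; they are not taken from the example files in this repository.

```shell
#!/bin/sh
# Illustrative sample: 2 dimensions plus a ground-truth label in the last column.
cat > sample_known.txt <<'EOF'
0.12 1.05 1
0.08 0.97 1
5.20 4.88 2
5.31 5.02 2
EOF

# check_fields FILE N: succeeds only if every line of FILE has exactly N
# whitespace-separated fields (hypothetical helper, not part of DC-DPM).
check_fields() {
  awk -v n="$2" 'NF != n { bad = 1 } END { exit bad + 0 }' "$1"
}

# With known ground truth, expect <dimensions> + 1 fields per line.
check_fields sample_known.txt 3 && echo "format ok"
```

For a file without ground truth, the expected field count is simply the number of dimensions.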
