PAMAE: PArallel k-Medoids clustering with high Accuracy and Efficiency

1. Overview

The k-medoids algorithm is one of the best-known clustering algorithms. Nevertheless, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all of them have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors, "global search" and "entire data", that are essential to achieving high accuracy but are very time-consuming if considered simultaneously. Our key idea is therefore to apply them individually through two phases, parallel seeding and parallel refinement, neither of which is costly. The first phase performs a global search over sampled data, and the second phase performs a local search over the entire data. Our theoretical analysis proves that this serial execution of the two phases leads to a solution close to the one that would be achieved by a global search over the entire data. To validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most recent parallel algorithms and, at the same time, produces clustering quality comparable to that of the previous most accurate algorithm.
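To make the two-phase structure concrete, below is a minimal single-machine sketch of the idea in Java. It is not the distributed implementation: in PAMAE, Phase I clusters each sample on a separate worker and Phase II runs as a parallel assign/update job over partitions. All class and method names here are illustrative, and Euclidean distance is assumed.

    import java.util.Random;

    // Illustrative, sequential sketch of PAMAE's two phases; names are hypothetical.
    public class TwoPhaseSketch {

        // Euclidean distance between two objects.
        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Clustering cost: sum of distances from each object to its nearest medoid.
        static double cost(double[][] data, double[][] medoids) {
            double total = 0;
            for (double[] p : data) {
                double best = Double.MAX_VALUE;
                for (double[] m : medoids) best = Math.min(best, dist(p, m));
                total += best;
            }
            return total;
        }

        // Global search on ONE sample: PAM-style swaps over all sample objects.
        static double[][] globalSearch(double[][] sample, int k, Random rnd) {
            double[][] medoids = new double[k][];
            for (int i = 0; i < k; i++) medoids[i] = sample[rnd.nextInt(sample.length)];
            double current = cost(sample, medoids);
            boolean improved = true;
            while (improved) {
                improved = false;
                for (int i = 0; i < k; i++) {
                    for (double[] candidate : sample) {
                        double[] old = medoids[i];
                        medoids[i] = candidate;
                        double c = cost(sample, medoids);
                        if (c < current) { current = c; improved = true; }
                        else medoids[i] = old;
                    }
                }
            }
            return medoids;
        }

        // Phase I (parallel seeding): cluster m random samples independently
        // (one per worker in PAMAE) and keep the best seed set.
        static double[][] phaseOne(double[][] data, int k, int m, int sampleSize, Random rnd) {
            double[][] bestSeeds = null;
            double bestCost = Double.MAX_VALUE;
            for (int s = 0; s < m; s++) {
                double[][] sample = new double[sampleSize][];
                for (int i = 0; i < sampleSize; i++) sample[i] = data[rnd.nextInt(data.length)];
                double[][] seeds = globalSearch(sample, k, rnd);
                double c = cost(sample, seeds);
                if (c < bestCost) { bestCost = c; bestSeeds = seeds; }
            }
            return bestSeeds;
        }

        // Phase II (parallel refinement): local search over the ENTIRE data set.
        // Assign each object to its nearest medoid, then recompute each cluster's
        // medoid; in PAMAE both steps are distributed across partitions.
        static double[][] phaseTwo(double[][] data, double[][] medoids, int iterations) {
            for (int it = 0; it < iterations; it++) {
                int[] assign = new int[data.length];
                for (int p = 0; p < data.length; p++) {
                    double best = Double.MAX_VALUE;
                    for (int i = 0; i < medoids.length; i++) {
                        double d = dist(data[p], medoids[i]);
                        if (d < best) { best = d; assign[p] = i; }
                    }
                }
                for (int i = 0; i < medoids.length; i++) {
                    double bestSum = Double.MAX_VALUE;
                    for (int c = 0; c < data.length; c++) {
                        if (assign[c] != i) continue;
                        double sum = 0;
                        for (int p = 0; p < data.length; p++)
                            if (assign[p] == i) sum += dist(data[c], data[p]);
                        if (sum < bestSum) { bestSum = sum; medoids[i] = data[c]; }
                    }
                }
            }
            return medoids;
        }
    }

For instance, phaseTwo(data, phaseOne(data, k, 5, 40 * k, new Random(0)), 1) corresponds to the (40k, 5) seeding configuration used for Phase I of PAMAE in Section 7.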

2. Algorithms

  • PAM-MR [1]
  • FAMES-MR [2]
  • CLARA-MR' and CLARA-MR [3]
  • GREEDI [4]
  • MR-KMEDIAN [5]
  • PAMAE-Hadoop : Hadoop implementation of our algorithm
  • PAMAE-Spark : Spark implementation of our algorithm

References
[1] Xianfeng Yang and Liming Lian. 2014. A New Data Mining Algorithm Based on MapReduce and Hadoop. Int’l J. of Signal Processing, Image Processing, and Pattern Recognition 7, 2 (2014), 131–142.
[2] Ying-ting Zhu, Fu-zhang Wang, Xing-hua Shan, and Xiao-yan Lv. 2014. K-Medoids Clustering Based on MapReduce and Optimal Search of Medoids. In Proc. 9th Int’l Conf. on Computer Science and Education. 573–577.
[3] Pelle Jakovits and Satish Narayana Srirama. 2013. Clustering on the Cloud: Reducing CLARA to MapReduce. In Proc. 2nd Nordic Symp. on Cloud Computing and Internet Technologies. 64–71.
[4] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. 2013. Distributed Submodular Maximization: Identifying Representative Elements in Massive Data. In Proc. 27th Annual Conf. on Neural Information Processing Systems. 2049–2057.
[5] Alina Ene, Sungjin Im, and Benjamin Moseley. 2011. Fast Clustering using MapReduce. In Proc. 17th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining. 681–689.

3. Data Sets

Name          # Objects       # Dims   Size      Type    Link
Covertype     581,102         55       71 MB     int     link
Census1990    2,458,285       68       324 MB    int     link
Cosmo50       315,086,245     3        13.6 GB   float   link
TeraClickLog  4,373,472,329   13       300 GB    float   link 1, 2, 3

4. Configuration

  • We conducted experiments on 12 Microsoft Azure D12v2 instances located in Japan.
  • Each instance has four cores, 28GB of main memory, and 200GB of disk (SSD).
  • All instances run on Ubuntu 14.04. We used Hadoop 2.7.1 and Spark 1.6.1 for distributed parallel processing.
  • Ten out of 12 instances were used as worker nodes, and the remaining two instances were used as master nodes.
  • All the algorithms were written in the Java programming language and run on JDK 1.7.0_101.
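  • For reference, a spark-submit resource specification consistent with this cluster (10 worker nodes, each with 4 cores and 28 GB) might look as follows; the memory value is illustrative, since the usable executor memory depends on YARN overhead settings:
    spark-submit --num-executors 10 --executor-cores 4 --executor-memory 20g ...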

5. How to Run

  • Compile.

    • Download the spark-assembly library from Apache Spark.
    • Make a jar file using IDE tools. For example, in Eclipse, right-click the project and choose Export -> JAR file. Alternatively, you can simply download the prebuilt jar files from the jar folder. A command-line alternative is sketched below.
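    • If you prefer to build from the command line, the following is an illustrative sketch; the source layout (src/, bin/), the output jar name, and the assembly jar name are assumptions to adjust to your checkout:
      mkdir -p bin
      javac -cp spark-assembly-1.6.1-hadoop2.6.0.jar -d bin $(find src -name '*.java')
      jar cvf PAMAE-Spark.jar -C bin .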
  • Create Azure HDInsight instances.

  • Move the data sets into the HDFS folder.

    • Download all data sets from the above links and move them to the Azure master node.
    • Transfer your data sets from the Azure master node into HDFS.
      hadoop fs -put [local path on the master node] [hdfs path]
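      For example, with hypothetical local and HDFS paths:
      hadoop fs -put /home/azureuser/Covertype.csv /data/Covertype.csv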
  • Run the algorithms using the following commands:

    • PAM-MR
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids]
    • FAMES-MR
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids]
    • CLARA-MR' (type : 0), CLARA-MR (type : 1)
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of samples] [# of medoids] [type]
    • GREEDI
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of partitions] [# of medoids]
    • MR-KMEDIAN
      Iterative Sampling
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids] [upsilon] [# of cores] [# of objects]
      Weighted K-Median
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs sample path] [hdfs output path] [# of medoids] [# of cores]
    • PAMAE-Hadoop
      hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of sampled objects] [# of samples] [# of medoids] [# of partitions]
    • PAMAE-Spark
      spark-submit --class [main class] [jar file] wasb://[hdfs address]/[hdfs input path] [# of medoids] [# of sampled objects] [# of samples] [# of partitions] [# of iterations for Phase II]
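    • For example, a PAMAE-Spark run on Covertype with k = 10, a sample size of 400 (40k), 5 samples, 40 partitions (one per core), and one Phase II iteration could be invoked as below; the main class name, jar name, and storage address are hypothetical placeholders to replace with your own:
      spark-submit --class pamae.PAMAE PAMAE-Spark.jar wasb://container@account.blob.core.windows.net/data/Covertype.csv 10 400 5 40 1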

6. Example

  • The example_dataset.csv file is used to visualize the differences among the six algorithms. The data set consists of 10,000 two-dimensional objects forming 10 clusters.
  • The clustering results are shown in the figures below (see the figures folder).

    [Figures: clustering results of the six algorithms on example_dataset.csv]

    Here, we can easily observe that the other algorithms suffer from local optima or insufficient samples. Only our algorithm succeeds in finding the exact medoids of the 10 clusters.
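    If you want to synthesize a data set of the same shape yourself, a sketch like the following generates 10,000 two-dimensional points around 10 random centers; the CSV layout (one comma-separated "x,y" pair per line) is an assumption about the original file's format:

      import java.io.PrintWriter;
      import java.util.Random;

      // Illustrative generator for a data set shaped like example_dataset.csv:
      // 10,000 2D points drawn around 10 Gaussian cluster centers.
      public class ExampleDataGenerator {
          public static void main(String[] args) throws Exception {
              Random rnd = new Random(42);
              int clusters = 10, pointsPerCluster = 1000;
              try (PrintWriter out = new PrintWriter("example_dataset.csv")) {
                  for (int c = 0; c < clusters; c++) {
                      // Random cluster center in [0, 100) x [0, 100).
                      double cx = rnd.nextDouble() * 100, cy = rnd.nextDouble() * 100;
                      for (int p = 0; p < pointsPerCluster; p++) {
                          out.printf("%.4f,%.4f%n", cx + rnd.nextGaussian() * 2,
                                                    cy + rnd.nextGaussian() * 2);
                      }
                  }
              }
          }
      }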

7. Experiment

The script (.sh) files for our experiments are included in the scripts folder.

  • Figures 5 and 8: Accuracy and efficiency experiment
    • This experiment compares the accuracy and efficiency of the eight parallel algorithms.
    • Run all algorithms on the four real-world data sets with k set to 10, 25, and 50.
    • The results are Figure 5 and Figure 8 in our paper.
    • sh Hadoop-AccEffExp.sh and sh Spark-AccEffExp.sh
  • Figure 7: Convergence experiment
    • This experiment verifies the convergence of Phase II.
    • Three different seeding strategies are used for this experiment:
      (40+2k, 5) by CLARA-MR', (100+5k) by CLARA-MR, and (40k, 5) by Phase I of PAMAE.
    • The number of iterations for Phase II is increased from 1 to 10 with k fixed at 50.
    • The result is Figure 7 in our paper.
    • sh ConvergenceExp.sh
  • Figure 9: Scalability experiment
    • This experiment is a scalability test using the TeraClickLog data set on Spark.
    • Two storage-level options are used: MEMORY_ONLY and MEMORY_AND_DISK.
    • The result is Figure 9 in our paper.
    • sh ScalabilityExp.sh