Earth Mover's Distance based Similarity Join on Hadoop
Java
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.md

Melody-Join

The Hadoop based implementation of Melody-Join algorithm presented in the paper [Jin Huang, Rui Zhang, Rajkumar Buyya, and Jian Chen, "Melody-Join: Efficient Earth Mover's Distance Similarity Join Using MapReduce", ICDE2014.] (http://people.eng.unimelb.edu.au/huangj1/resources/icde2014_melody.pdf )

What It Does

It uses MapReduce/BSP parallel computation paradigm to retrieve all pairs of similar records from large datasets, i.e., to perform similarity join on datasets. The similarity is measured by the Earth Mover's Distance(EMD). The implementation here uses Euclidean distance as the ground distance of EMD.

Two types of joins are supported:

  • Distance threshold: the distance between the records from the retrieved pairs is below a given value, MapReduce only
  • Top-k: the retrieved pairs have the k smallest distances among all pairs from a cartesian product, MapReduce and BSP

The [MRSimJoin] (http://www.public.asu.edu/~ynsilva/SimCloud/publications.html) implementation released by the original authors is modified to generically process arbitrary dimensional datasets. That code is under the package mrsim.

Additionally, it provides a data generator which is capable to extract 16 types of content-based features from image files. The generated features then can be fed to the join algorithm to process. The feature extraction implementation is built upon [Lire] (http://www.semanticmetadata.net/lire/). The supported features are listed in [Supported Features] (https://github.com/jinhuang/melody-join#supported-features).

Input Datset Format

Data records are represented by histograms that have multiple bins. Each bin has a non-negetive weight and a multi-dimensional location vector. As we assume all datasets have the same histogram definition, the locations of bins are shared by all data records. All data are encoded in text file, and followings are their formats:

  • Histograms: each line represents one record, <id> <weight of bin 0> <weight of bin 1> ...
  • Bin Locations: all locations in one line, <bin 0 dimension 0> <bin 0 dimension 1> ... <bin n dimension 0> <bin n dimension 1> ...

As Melody-Join exploits the projection and normal lower bounds of EMD, it requires the projection vectors in the following format:

  • Projection vector: all vectors in one line, <vector 0 dimension 0> <vector 0 dimension 1> ... <vector n dimension 0> <vector n dimension 1> ...

All files should be accessible via HDFS paths.

How To Run EMD Join

The implementation uses Apache Maven for dependency management. Additionally, make sure your environment meets the following requirement

  • Apache Hadoop 2.0+
  • Apache Hama 0.7.0-SNAPSHOT

Then you can run mvn clean package assembly:single to generate the jar file.

The following dependencies should be in the /lib directory of the Hadoop and Hama:

  • commons-math3-3.1.1
  • commons-math-2.1
  • commons-collections-3.2.1
  • commons-configuration1.6

Before running the generated jar using Hadoop or Hama, fill the conf.properties file according to the provided melody-conf.properties.

The algorithms can be run using the command

hadoop jar <jar file path> com.iojin.melody.Join <conf.properties>
hama jar <jar file path> com.iojin.melody.Join <conf.properties>

The Hama 0.7.0-SNAPSHOT binary is included under the directory hama for convenience.

How To Run Data Generator

The implementation includes a data generator which extract typical content-based features from images files and convert their format to fit the requirement of the join program. The configurations are also included in the melody-conf.properties file. The Generator currently supports the following execution modes

  • local: where the generator extracts features and generates histograms locally on a single machine
  • mr: where the generator extracts features and generates histograms on a Hadoop cluster in a parallel manner

The running command for the local mode is

java -jar <jar file path> com.iojin.melody.Generate <conf.properties>

And the command for the mr mode is

hadoop jar <jar file path> com.iojin.melody.Generate <conf.properties>

When the mr mode is selected, the input image source can reside on

  • local: the images will be put into a big image bundle and uploaded to the HDFS
  • hdfs: the images will be put into a big image bundle and saved as temporary file on HDFS
  • url: the images will be fetched on the fly from Internet according to a http(https) url list, the images are not stored on HDFS

Supported Features

Authors

[Jin Huang] (http://people.eng.unimelb.edu.au/huangj1/) PhD Candidate Department of Computing and Information Systems The University of Melbourne jin.huang@unimelb.edu.au