Earth Mover's Distance Python2 Module
This module provides a function for computing the earth mover's distance (EMD) in Python2. It wraps Yossi Rubner's C implementation of the EMD into a Python module using a custom SWIG interface wrapper.
The EMD is a distance measure between two probability distributions. It can be used, for example, to retrieve similar images from a database. However, this implementation is not limited to images or histograms; it can represent distributions over any user-defined features, using any user-defined ground distance. Please see the original homepage for more explanations and references.
Building the module requires the following:
- Python2 C headers (
- C compiler and linker, as well as Make (
- SWIG Simple Wrapper and Interface Generator (
- EMD source code and interface definition (
emd.iin this repository)
The name in parentheses after the first three entries is the name of the Debian/Ubuntu Linux package that contains the required files.
If all requirements are met, the module can be built using Make.
make in the directory that contains the EMD source
code and interface definition. The output should look similar to the
>>> Building object file 'emd.o'. cc -o emd.o -c emd.c -fPIC -I/usr/include/python2.7 -I/usr/include/python2.7 >>> Generating C interface swig -python emd.i >>> Building object file 'emd_wrap.o'. cc -o emd_wrap.o -c emd_wrap.c -fPIC -I/usr/include/python2.7 -I/usr/include/python2.7 >>> Linking wrapper library '_emd.so'. ld -shared -o _emd.so emd.o emd_wrap.o rm emd_wrap.o emd.o emd_wrap.c
You can ignore warnings about redefinitions of
INFINITY and extra
tokens at the end of include statements. If everything went well, the
now directory contains the C library
_emd.so and its Python wrapper
example1.py is a port of
example1.c. It demonstrates
the basic use of the EMD module. Simply run the script with the
$ python2 example1.py 160.542770386
emd in the
emd module computes the EMD. (I bet you
were surprised by this fact.) It accepts two pairs of feature and
weight lists, and a callable for computing the ground distance.
from emd import emd print emd( (features1, weights1), (features2, weights2), distance)
In above code,
features2 are lists of arbitrary
Python objects. The implementation invokes the callable
pairs of feature objects to compute the distance matrix. Each feature
object is assigned a weight to characterize the distribution.
The implementation treats objects as opaquely as possible. The only limitations are:
weights1are lists of the same length. The entries of
weights1must be floating point numbers;
features1can contain arbitrary objects, but the
distancefunction must be able to process them. Similarly for
distanceargument is a callable object (for example a function) that accepts two arguments (from the
featureslists) and returns a floating point number.
example1.py for a filled-in example.
- The C implementation uses
floatas its fundamental data type for distances and weights. The (typically
double) floating point objects of Python will be down-cast and lose precision.
emd.hfor the hard-coded definitions of maximum signature size, maximum iteration count, and computational constants.
- Currently, the wrapper ignores the flow output argument that is available in the C implementation.
- The wrapper is not suited for concurrent execution. It uses a
global variable for the distance callback function, so calling
emdfrom concurrent threads will result in undefined behavior.
Only the wrapper code, the Makefile, the Python version of
example1.c, and this document were written by me and consequently
are (c) copyrighted by Peter Dinges. These files are available under
the open-source MIT License.
Yossi Rubner is the author of the C code that performs the actual computation. I have included his source code in this repository because I am unsure about the future availability of his homepage. (I could not reach him via email.)