GitHub - msparks/qjam: a distributed computing framework

forked from ali01/qjam


qjam is a framework for distributed computing.

It uses passwordless (public-key) ssh access to bootstrap and start worker
nodes automatically, without manual intervention.


DEPENDENCIES
--------------------------------

The following packages are required to use qjam:

  * Python nose (sudo aptitude install python-nose)
  * Python NumPy (sudo aptitude install python-numpy)


EXAMPLES
--------------------------------

You need to have passwordless public-key ssh access to localhost as your
current username. If `ssh localhost' at a terminal gets you to a new prompt,
you're golden. If not, set up ssh keys and/or an ssh agent as appropriate.
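
The same check can be scripted. The sketch below is not part of qjam and the
helper names are made up for illustration. It relies on ssh's BatchMode
option, which disables password prompts, so the command fails instead of
hanging when key-based login is not set up.

```python
import subprocess


def batch_ssh_command(host):
    """Build an ssh command that fails rather than prompting for a password."""
    # BatchMode=yes disables password/passphrase prompts;
    # 'true' is a no-op command executed on the remote side.
    return ['ssh', '-o', 'BatchMode=yes', host, 'true']


def has_passwordless_ssh(host='localhost'):
    """Return True if ssh to host succeeds without any interactive prompt."""
    try:
        return subprocess.call(batch_ssh_command(host)) == 0
    except OSError:
        # The ssh client is not installed or not on $PATH.
        return False
```

Calling has_passwordless_ssh() before starting a job gives an early, readable
failure instead of a stalled bootstrap.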

The username used to access remote machines via ssh is the username used on the
master machine. You can specify an alternate username to be used on the remote
machines in two ways:

  * Set the QJAM_USER environment variable.
  * Use a 'User' directive in your ~/.ssh/config.
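
As a rough sketch of how the two options could interact: the helper below is
hypothetical, not qjam's actual code, and the precedence it gives QJAM_USER is
an assumption. When no override is set it returns the bare hostname, so ssh
itself applies any 'User' directive from ~/.ssh/config and otherwise falls
back to the local username.

```python
import os


def ssh_target(host, environ=None):
    """Return the 'user@host' or bare 'host' string to hand to ssh."""
    environ = os.environ if environ is None else environ
    user = environ.get('QJAM_USER')
    if user:
        # Explicit override from the environment (assumed to take priority).
        return '%s@%s' % (user, host)
    # No override: leave the username to ssh and ~/.ssh/config.
    return host
```

For example, ssh_target('machine2', {'QJAM_USER': 'alice'}) gives
'alice@machine2', while an empty environment gives plain 'machine2'.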

With the proper ssh setup, these example programs can be run directly from the
command line:

  bin/sum-matrix-example.py
    This is a simple example that explains how to use the qjam API to execute
    code across many machines.

    Usage: bin/sum-matrix-example.py localhost machine2 machine3

  examples/newton_sqrt.py
    Approximates the square root of a number using Newton's method.

    Usage: python examples/newton_sqrt.py 123456789


ARCHITECTURE
--------------------------------

There are three main pieces to the qjam framework: the Master, the
RemoteWorker, and the Worker.

The Worker is a program that is copied to all of the remote machines during the
bootstrapping process. It is responsible for waiting for instructions from the
Master, and upon receiving work, processing that work and returning the result.

The RemoteWorker is a special Python class that communicates with the remote
machines. One RemoteWorker has a single target machine that can be reached via
ssh. There can be many RemoteWorkers with the same target (say, in the case
where there are many cores on a machine), but only one target per
RemoteWorker. At creation, the RemoteWorker bootstraps the remote machine by
copying the requisite files to run the Worker program, via ssh. After the
bootstrapping process completes, the RemoteWorker starts a Worker process on
the remote machine and attaches to it. The RemoteWorker is the proxy between
the Master and the Worker.
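
The two bootstrap steps can be pictured as a pair of commands. This is only a
sketch: the use of scp, the file name worker.py, the remote directory, and the
function name are all assumptions made for illustration, and the real
RemoteWorker may copy and launch the Worker differently.

```python
def bootstrap_commands(host, local_files, remote_dir, remote_python='python'):
    """Return (copy_cmd, start_cmd) argv lists for one target machine."""
    # Step 1: copy the files the Worker needs onto the remote machine.
    copy_cmd = ['scp'] + list(local_files) + ['%s:%s' % (host, remote_dir)]
    # Step 2: start the Worker program remotely. A caller could keep the
    # resulting process's pipes open to exchange work and results
    # (an assumption about how "attaching" might work).
    start_cmd = ['ssh', host, remote_python, '%s/worker.py' % remote_dir]
    return copy_cmd, start_cmd


# Hypothetical target and paths, for illustration only.
copy_cmd, start_cmd = bootstrap_commands(
    'machine2', ['worker.py'], '/tmp/qjam_sketch')
```

Creating several RemoteWorkers for the same host would simply repeat the
second command once per desired Worker process.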

The Master is a Python class that divides up work and assigns the work units
among its pool of RemoteWorker instances. These RemoteWorker instances relay
the work to the Worker programs running on the remote machines and wait for the
results.
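
To make "divides up work and assigns the work units" concrete, here is a
minimal sketch. This README does not say how the Master sizes or distributes
its work units, so the near-equal contiguous slices and the round-robin
assignment below are assumptions, and both function names are hypothetical.

```python
def split_evenly(items, num_workers):
    """Split items into num_workers contiguous slices of near-equal size."""
    base, extra = divmod(len(items), num_workers)
    slices, start = [], 0
    for i in range(num_workers):
        # The first 'extra' slices each take one additional item.
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def assign_round_robin(slices, workers):
    """Pair each slice with a worker, cycling through the pool."""
    return [(workers[i % len(workers)], s) for i, s in enumerate(slices)]
```

Seven items split across three workers come out as slices of sizes 3, 2,
and 2.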

Usage of the qjam framework is simple. There is one primary point of entry on a
Master instance: master.run(module, params, dataset). The 'module' argument is
a Python module object that contains a function 'mapfunc(params, dataset)';
each worker calls this function with 'params' and its slice of the whole
dataset. The 'params' argument is an arbitrary Python object that is passed
directly to all workers. The 'dataset' argument is an instance of a DataSet
type (defined in qjam.dataset) that helps qjam determine how to slice the input
data.
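
The mapfunc contract can be illustrated without any remote machines. In the
sketch below, run_locally is a hypothetical stand-in for master.run() that
takes pre-cut slices instead of a DataSet instance and returns one result per
slice. How the real master.run() combines per-slice results is not described
in this README, so that part is an assumption.

```python
import types


def mapfunc(params, dataset):
    """Sum of squared differences between each value and params."""
    return sum((x - params) ** 2 for x in dataset)


def run_locally(module, params, slices):
    """Local stand-in for master.run(): call module.mapfunc on each slice."""
    return [module.mapfunc(params, s) for s in slices]


# Build a module object on the fly; normally this would be an imported .py
# file that defines mapfunc at its top level.
example_module = types.ModuleType('example_module')
example_module.mapfunc = mapfunc

results = run_locally(example_module, 2, [[1, 2], [3, 4]])
```

Here 'params' is the single number 2 and each inner list plays the role of one
worker's slice of the dataset.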

There is an important difference between 'params' and 'dataset'. The slices of
'dataset' are cached locally on the worker nodes, so when several calls to
master.run() use the same 'dataset', its pieces are transferred to the remote
machines only on the first call. 'params', however, is transferred in full on
every call to master.run(). Therefore, 'params' is best used for data that
changes between calls to master.run(), and 'dataset' should be used for data
that does not change.
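
The caching behavior can be modeled with a small class. This is a toy model
under stated assumptions, not qjam's implementation: the class name, the
slice key, and the fetch_slice callback are all hypothetical. It counts
transfers so the difference between 'params' and 'dataset' is visible.

```python
class WorkerCacheSketch(object):
    """Toy worker: caches dataset slices, receives params on every call."""

    def __init__(self):
        self._slices = {}
        self.dataset_transfers = 0
        self.params_transfers = 0

    def run(self, mapfunc, params, slice_key, fetch_slice):
        # params travels with every request.
        self.params_transfers += 1
        # The slice is fetched only the first time its key is seen.
        if slice_key not in self._slices:
            self._slices[slice_key] = fetch_slice()
            self.dataset_transfers += 1
        return mapfunc(params, self._slices[slice_key])


def scaled_sum(params, dataset):
    return params * sum(dataset)


worker = WorkerCacheSketch()
first = worker.run(scaled_sum, 2, 'slice-0', lambda: [1, 2, 3])
second = worker.run(scaled_sum, 3, 'slice-0', lambda: [1, 2, 3])
```

After the two calls, params has been sent twice but the slice only once, which
is why iterative algorithms benefit from keeping their changing state in
'params' and their fixed input in 'dataset'.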


TESTS
--------------------------------

The tests for qjam are located in the 'tests' directory. To run the test suite,
type:

  nosetests

in the root directory. The tests are a good way to learn more about the
architecture of qjam.


RUNNING ON STANFORD AI LAB MACHINES
----------------------------------------

The yggdrasil machines are missing Python 2.6 and NumPy, so you must first
install both locally. As of 2010/11/29, yggdrasil[1-4] all have them installed
at the prefix /tmp/py26. Adapt the /tmp/update_yggdrasil_py26.sh script to
install them on other yggdrasils. Set the environment variable
QJAM_REMOTE_PYTHON=/tmp/py26/bin/python in your calls to qjam to use this
Python.

Make sure /tmp/py26/bin is in your $PATH, and set $PYTHONPATH to the
top-level 'qjam' directory.

Examples:

yggdrasil1% QJAM_REMOTE_PYTHON=/tmp/py26/bin/python2.6 \
            bin/sum-matrix-example.py \
            yggdrasil1 yggdrasil2 yggdrasil3 yggdrasil4

yggdrasil1% QJAM_REMOTE_PYTHON=/tmp/py26/bin/python2.6 \
            python examples/newton_sqrt.py 123456789 yggdrasil1

yggdrasil1% QJAM_REMOTE_PYTHON=/tmp/py26/bin/python2.6 \
            nosetests tests/