
Tensorflow Prism

TFPrism is a library that transforms your TensorFlow graph to perform data parallelism for training automatically. To run your single-CPU TensorFlow code on a cluster, all you need to do is pass your training op and feed_dict through the library.
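Conceptually, data parallelism means each worker computes gradients on its own shard of the batch, and the averaged gradient is applied as a single update. A minimal pure-Python sketch of that idea (illustrative only, not the tfprism API):

```python
def shard(batch, num_workers):
    """Split a batch into num_workers roughly equal shards."""
    return [batch[i::num_workers] for i in range(num_workers)]

def parallel_step(weights, batch, grad_fn, lr=0.9, num_workers=2):
    """One synchronous data-parallel step: per-shard gradients,
    averaged, then applied as a single SGD update."""
    grads = [grad_fn(weights, s) for s in shard(batch, num_workers)]
    return weights - lr * sum(grads) / len(grads)
```

tfprism automates the graph-rewriting equivalent of this: replicating the forward/backward pass per worker and routing the shards for you.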

Example code

import tensorflow as tf
import tfprism

# loss, init_op and batches are defined as in ordinary single-machine code.
train_step = tf.train.GradientDescentOptimizer(0.9).minimize(loss)

with tf.Session('grpc://mycluster.example.com:5600') as sess:
    # Replicate the training graph across all worker tasks; returns the
    # distributed train op plus a node copier for rewriting feed dicts.
    train_step, node_copier = tfprism.distribute_graph_on_all_tasks(train_step, sess)
    sess.run(init_op)

    for batch in batches:
        sess.run(
            train_step,
            feed_dict=node_copier.mangle_feed_dict(batch))
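The README doesn't show what mangle_feed_dict does internally; a plausible illustration (hypothetical names and key scheme, not tfprism's actual implementation) is that it shards each fed value across the graph replicas and rekeys the entries per replica:

```python
def mangle_feed_dict(feed_dict, num_replicas):
    """Hypothetical sketch: split each fed value evenly across replicas
    and rekey it per replica so each graph copy receives its own shard."""
    mangled = {}
    for name, value in feed_dict.items():
        per_replica = len(value) // num_replicas
        for i in range(num_replicas):
            mangled["replica_%d/%s" % (i, name)] = \
                value[i * per_replica:(i + 1) * per_replica]
    return mangled
```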

Installation

pip install .

Training server / cluster management

The example code above assumes a TensorFlow cluster is already running a set of worker tasks and parameter server tasks, named "/job:worker" and "/job:ps" respectively. Setting this up can be a bit tiresome; if all you want is to quickly get a cluster up and running and parallelize your code, you can use the cluster management tool shipped with tfprism.
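If you do set the cluster up by hand, the expected job layout looks like the following cluster spec (addresses are placeholders; with stock TensorFlow 1.x you would pass this dict to tf.train.ClusterSpec and start a tf.train.Server with the matching job_name and task_index on each node):

```python
# Placeholder addresses; the job names match what tfprism expects.
cluster_spec = {
    "worker": ["server1:5600", "server2:5600"],  # "/job:worker" tasks
    "ps": ["server3:5600"],                      # "/job:ps" tasks
}
```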

To install the cluster management tools, run

apt install parallel
pip install .[server]

on each node in your cluster. Once you have done so, you can run

tfprism cluster start server1,server2,...serverN

to start your cluster. You must be able to SSH without a password (using public-key authentication) to every server listed. After that, you can connect TensorFlow to the cluster at grpc://server1:5600.
