Adding support for distributed TensorFlow. #14
Conversation
…isn't informative anymore.
… on line 149 of average_precision_calculator.py from list type to zip-type again as per Python 3 changes (#17)
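For context, the Python 3 behavior that commit works around (a minimal illustration; the names below are made up, not taken from average_precision_calculator.py):

# Python 2: zip() returns a list. Python 3: zip() returns a lazy iterator,
# so code that indexes it or iterates it twice must materialize it first.
predictions = [0.9, 0.1, 0.8]  # illustrative values
actuals = [1, 0, 1]
pairs = list(zip(predictions, actuals))  # behaves the same on 2 and 3
print(pairs[0])  # (0.9, 1)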
train.py
Outdated
data_pattern + "'.") | ||
logging.info("Number of training files: %s.", str(len(files))) | ||
filename_queue = tf.train.string_input_producer( | ||
files, num_epochs=num_epochs) |
Add shuffle here...
Added the parameter "shuflle=True" to string_input_producer.
It is the default but it seems that being explicit can improve readability in this case.
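A minimal sketch of the resulting call, assuming the TF 1.x queue-based input pipeline from the diff (files and num_epochs come from the surrounding code):

import tensorflow as tf  # TF 1.x API

# shuffle=True is the default; spelled out here for readability.
filename_queue = tf.train.string_input_producer(
    files, num_epochs=num_epochs, shuffle=True)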
task_as_string(self.task), self.cluster.as_dict())
server = start_server(self.cluster, self.task)
target = server.target
device_fn = tf.train.replica_device_setter(
You need to say what the ps_device is.
I don't know what merge_devices does; I haven't used it.
Added the parameter '"ps_device="/job:ps"'. It is the default value but it seems that being explicit can improve readability.
Removing the parameter "merge_devices=True". It is also the default but this parameter is on the path to deprecation and specifying "merge_devices=False" triggers a warning.
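A sketch of the resulting device function under the TF 1.x distributed runtime; the worker_device string is an assumption for illustration and is not taken from the diff:

# ps_device="/job:ps" is the default, spelled out for readability.
# merge_devices is left at its default (True), since passing it explicitly
# is on the path to deprecation.
device_fn = tf.train.replica_device_setter(
    ps_device="/job:ps",
    worker_device="/job:%s/task:%d" % (self.task.type, self.task.index),  # assumed naming
    cluster=self.cluster)

with tf.device(device_fn):
    # Variables land on the ps job; other ops stay on the local worker.
    ...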
…multiple parameter servers. This CR also contains a couple of changes that explicitly specify some default parameters to improve readability.
train.py
Outdated
flags.DEFINE_string("optimizer", "AdamOptimizer", | ||
"What optimizer class to use.") | ||
flags.DEFINE_bool("log_device_placement", False, | ||
"Whether device placement should be logged.") |
How about "Whether to write the device every op will run on into the logs on startup".
Changed to "Whether to write the device on which every op will run into the logs on startup."
train.py
Outdated
training_data = [
    reader.prepare_reader(filename_queue) for _ in xrange(num_readers)]
    reader.prepare_reader(filename_queue) for _ in xrange(num_readers)
Please sync to head and test this in Python 3. We are trying to maintain compatibility with both Python versions now.
Synced to head.
Tested with Python 3:
python3 youtube-8m-vicaire/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model
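For reference, one common shim that keeps the xrange call from the diff working under both interpreters (a sketch; the repo may resolve this differently):

try:
    xrange  # Python 2 built-in
except NameError:
    xrange = range  # Python 3: range is already lazy

training_data = [
    reader.prepare_reader(filename_queue) for _ in xrange(num_readers)
]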
So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that they're okay with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the …
Adding support for distributed TensorFlow.
Tests:
Local execution, non-distributed:
gcloud --verbosity=debug beta ml local train --package-path=youtube-8m-private --module-name=youtube-8m-private.train -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=/tmp/yt8m_train --start_new_model
Local execution, distributed:
gcloud beta ml local train --package-path=youtube-8m-private --module-name=youtube-8m-private.train --distributed --parameter-server-count=1 --worker-count=4 -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=/tmp/yt8m_train --start_new_model
Running on your own machine, Python 2.7:
python youtube-8m-private/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model --start_new_model
Running on your own machine, Python 3:
python3 youtube-8m-vicaire/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model
Distributed execution on cloud:
BUCKET_NAME=...; JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug beta ml jobs submit training $JOB_NAME --package-path=youtube-8m-private --module-name=youtube-8m-private.train --staging-bucket=$BUCKET_NAME --region=us-central1 --config=youtube-8m-private/cloudml-gpu-distributed.yaml -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=$BUCKET_NAME/yt8m_train_video_level_logistic_model --start_new_model
Non-distributed execution on cloud:
BUCKET_NAME=...; JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug beta ml jobs submit training $JOB_NAME --package-path=youtube-8m-private --module-name=youtube-8m-private.train --staging-bucket=$BUCKET_NAME --region=us-central1 --config=youtube-8m-private/cloudml-gpu.yaml -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=$BUCKET_NAME/yt8m_train_video_level_logistic_model --start_new_model