
Adding support for distributed TensorFlow. #14

Merged
merged 17 commits into google:distributed on Feb 22, 2017

Conversation

Contributor

@vicaire commented Feb 18, 2017

Adding support for distributed TensorFlow.

Tests:

Local execution, non distributed:

gcloud --verbosity=debug beta ml local train --package-path=youtube-8m-private --module-name=youtube-8m-private.train -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=/tmp/yt8m_train --start_new_model

Local execution, distributed:

gcloud beta ml local train --package-path=youtube-8m-private --module-name=youtube-8m-private.train --distributed --parameter-server-count=1 --worker-count=4 -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=/tmp/yt8m_train --start_new_model
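
For reference: with --distributed, the local runner starts one process per role (master, workers, parameter servers) and describes the cluster to each process through the TF_CONFIG environment variable, which is also how Cloud ML describes the cluster for the cloud runs below. A minimal sketch of how a trainer can read it (an assumption about the approach, not code quoted from this PR):

import json
import os

import tensorflow as tf

# TF_CONFIG looks roughly like:
# {"cluster": {"ps": [...], "worker": [...], "master": [...]},
#  "task": {"type": "worker", "index": 0}}
env = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = env.get("cluster")
task = env.get("task") or {"type": "master", "index": 0}

if cluster_spec:
    cluster = tf.train.ClusterSpec(cluster_spec)
    server = tf.train.Server(
        cluster, job_name=task["type"], task_index=task["index"])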

Running on your own machine, Python 2.7:

python youtube-8m-private/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model --start_new_model

Running on your own machine, Python 3:

python3 youtube-8m-vicaire/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model

Distributed execution on cloud:

BUCKET_NAME=...; JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug beta ml jobs submit training $JOB_NAME --package-path=youtube-8m-private --module-name=youtube-8m-private.train --staging-bucket=$BUCKET_NAME --region=us-central1 --config=youtube-8m-private/cloudml-gpu-distributed.yaml -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=$BUCKET_NAME/yt8m_train_video_level_logistic_model --start_new_model

Non-distributed execution on cloud:

BUCKET_NAME=...; JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug beta ml jobs submit training $JOB_NAME --package-path=youtube-8m-private --module-name=youtube-8m-private.train --staging-bucket=$BUCKET_NAME --region=us-central1 --config=youtube-8m-private/cloudml-gpu.yaml -- --train_data_pattern='gs://youtube8m-ml/1/video_level/train/train*.tfrecord' --train_dir=$BUCKET_NAME/yt8m_train_video_level_logistic_model --start_new_model

@LeegleechN self-assigned this Feb 21, 2017
train.py Outdated
data_pattern + "'.")
logging.info("Number of training files: %s.", str(len(files)))
filename_queue = tf.train.string_input_producer(
    files, num_epochs=num_epochs)

Add shuffle here...

Contributor Author

Added the parameter "shuffle=True" to string_input_producer.

It is the default but it seems that being explicit can improve readability in this case.
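
With that change the queue construction reads roughly as follows (a sketch of the updated snippet above):

# shuffle=True is already the default; spelling it out makes it obvious
# that the file order is randomized.
filename_queue = tf.train.string_input_producer(
    files, num_epochs=num_epochs, shuffle=True)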

task_as_string(self.task), self.cluster.as_dict())
server = start_server(self.cluster, self.task)
target = server.target
device_fn = tf.train.replica_device_setter(

You need to say what the ps_device is.
I don't know what merge_devices does; I haven't used it.

Contributor Author

Added the parameter ps_device="/job:ps". It is the default value, but it seems that being explicit can improve readability.

Removed the parameter "merge_devices=True". It is also the default, but this parameter is on the path to deprecation, and specifying "merge_devices=False" triggers a warning.
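
Putting both replies together, the device function ends up roughly like this (a sketch; the worker_device pattern and the task/cluster objects are assumptions based on the snippet above, not an exact quote of the PR):

import tensorflow as tf

def build_device_fn(cluster, task):
    # cluster: a tf.train.ClusterSpec; task: an object with .type and .index
    # (an assumption about how this PR represents the task).
    return tf.train.replica_device_setter(
        ps_device="/job:ps",  # the default, spelled out for readability
        worker_device="/job:%s/task:%d" % (task.type, task.index),
        cluster=cluster)      # merge_devices is left at its default (True)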

…ultiple parameter servers.

This CR also contains a couple of changes that explicitly specify some default parameters to increase readability.
train.py Outdated
flags.DEFINE_string("optimizer", "AdamOptimizer",
                    "What optimizer class to use.")
flags.DEFINE_bool("log_device_placement", False,
                  "Whether device placement should be logged.")
Collaborator

How about "Whether to write the device every op will run on into the logs on startup".

Contributor Author

Changed to "Whether to write the device on which every op will run into the logs on startup."
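
So the flag definition after this change reads roughly:

flags.DEFINE_bool(
    "log_device_placement", False,
    "Whether to write the device on which every op will run into the "
    "logs on startup.")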

train.py Outdated
  training_data = [
-     reader.prepare_reader(filename_queue) for _ in xrange(num_readers)]
+     reader.prepare_reader(filename_queue) for _ in xrange(num_readers)
Collaborator

Please sync to head and test this in Python 3. We are trying to maintain compatibility with both Python versions now.

Contributor Author

Synced to head.

Tested with Python3:

python3 youtube-8m-vicaire/train.py --train_data_pattern='/tmp/features/train*.tfrecord' --train_dir=/tmp/features/video_level_logistic_model
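
One note on the Python 3 run above: xrange does not exist in Python 3, so the comprehension in the snippet needs some form of compatibility shim. A minimal sketch (an assumption; the PR may instead rely on a project-wide compatibility import such as six.moves):

try:
    xrange  # Python 2: keep the lazy built-in
except NameError:
    xrange = range  # Python 3: range is already lazy

# reader, filename_queue and num_readers come from the snippet above.
training_data = [
    reader.prepare_reader(filename_queue) for _ in xrange(num_readers)
]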

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that they're okay with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@LeegleechN merged commit 070d57d into google:distributed Feb 22, 2017