This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Provide TF_CONFIG environment variable for distributed TensorFlow #15

@damienpontifex

Description


The TensorFlow ClusterConfig can parse worker and parameter server settings from a TF_CONFIG environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156).
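For reference, a well-formed TF_CONFIG value is a JSON string along these lines (host names, ports and the task assignment below are illustrative placeholders, not values from any real cluster):

```python
import json
import os

# Illustrative TF_CONFIG value; host names and ports are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# ClusterConfig-style consumers parse this JSON back out of the environment.
config = json.loads(os.environ["TF_CONFIG"])
```

Note that the host lists are JSON arrays and task.index is a number, which is what makes the naive string substitution below problematic.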

I was trying to pass it via an environment variable in the job configuration file like so:

"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"

Which is mostly workable, but falls down in a few cases:

  1. When there are no parameter servers (i.e. a single node), the ps hosts should be an empty array, but in this case the variable is just an empty string.
  2. The host and worker variables are comma separated, while the TF code parses TF_CONFIG as JSON, so they would ideally be JSON arrays inside this string.
  3. The 'task.type' property can be 'master', 'worker' or 'ps', but there doesn't seem to be a corresponding environment variable, so I had to pass that option via command line args.
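The first two points can be worked around with a small fix-up step that turns the comma-separated host variables into proper JSON. A minimal sketch (the `build_tf_config` helper name and its arguments are hypothetical, standing in for the `$AZ_BATCHAI_*` values):

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_index, task_type):
    """Build a TF_CONFIG JSON string from comma-separated host variables.

    An empty ps_hosts string becomes an empty JSON array (point 1), and
    comma-separated host lists become JSON arrays (point 2). task_type
    still has to come from somewhere else, e.g. command line args (point 3).
    """
    cluster = {
        "ps": ps_hosts.split(",") if ps_hosts else [],
        "worker": worker_hosts.split(",") if worker_hosts else [],
    }
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": int(task_index)},
    })


# Single-node case: no parameter servers, one worker (placeholder host).
tf_config = build_tf_config("", "worker0:2222", "0", "worker")
```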

More generally though, providing this configuration via a TF_CONFIG environment variable would significantly lower the bar to getting distributed training working with TensorFlow on Azure Batch. It would also simplify the command line arguments: only the appropriate data directories would need to be passed, the same arguments could be used across master, worker and ps, and the tensorflowSettings property could potentially be simplified further.
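To illustrate why the same arguments could then be shared across roles: once TF_CONFIG is set by the platform, a training script can recover its own role entirely from the environment. A sketch of the idea (the `role_from_env` helper and the example TF_CONFIG value are hypothetical):

```python
import json
import os


def role_from_env():
    """Read task type and index from TF_CONFIG so the same command line
    works unchanged for master, worker and ps processes."""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return "{}:{}".format(task.get("type", "master"), task.get("index", 0))


# Placeholder value standing in for what the platform would set.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["worker0:2222"]},
    "task": {"type": "ps", "index": 1},
})
print(role_from_env())  # ps:1
```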
