This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Provide TF_CONFIG environment variable for distributed TensorFlow #15

@damienpontifex

Description


The TensorFlow ClusterConfig can parse worker and parameter server settings from a TF_CONFIG environment variable (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py#L64-L156).
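For reference, a well-formed TF_CONFIG value is a JSON string along these lines (host names, ports and the task assignment below are illustrative placeholders, not values from any real cluster):

```python
import json
import os

# Illustrative TF_CONFIG value; host names and ports are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# ClusterConfig-style consumers parse this JSON back out of the environment.
config = json.loads(os.environ["TF_CONFIG"])
```

Note that the host lists are JSON arrays and task.index is a number, which is what makes the naive string substitution below problematic.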

I was trying to pass it via an environment variable in the job configuration file like so:

"{ 'cluster': { 'ps': $AZ_BATCHAI_PS_HOSTS, 'worker': $AZ_BATCHAI_WORKER_HOSTS }, 'task': { 'index': $AZ_BATCHAI_TASK_INDEX, 'type': '' } }"

Which is mostly workable, but falls down in a few cases:

  1. When there are no parameter servers (i.e. a single node), the ps hosts should be an empty array, but in this case the variable is just an empty string.
  2. The host and worker variables are comma separated, while the TF code parses TF_CONFIG as JSON, so they would ideally be JSON arrays inside this string.
  3. The 'task.type' property can be 'master', 'worker' or 'ps', but there doesn't seem to be a corresponding environment variable, so I had to pass that option via command line args.
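The first two points can be worked around with a small fix-up step that turns the comma-separated host variables into proper JSON. A minimal sketch (the `build_tf_config` helper name and its arguments are hypothetical, standing in for the `$AZ_BATCHAI_*` values):

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_index, task_type):
    """Build a TF_CONFIG JSON string from comma-separated host variables.

    An empty ps_hosts string becomes an empty JSON array (point 1), and
    comma-separated host lists become JSON arrays (point 2). task_type
    still has to come from somewhere else, e.g. command line args (point 3).
    """
    cluster = {
        "ps": ps_hosts.split(",") if ps_hosts else [],
        "worker": worker_hosts.split(",") if worker_hosts else [],
    }
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": int(task_index)},
    })


# Single-node case: no parameter servers, one worker (placeholder host).
tf_config = build_tf_config("", "worker0:2222", "0", "worker")
```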

More generally though, providing this configuration via a TF_CONFIG environment variable would significantly lower the bar to getting distributed training working with TensorFlow on Azure Batch. It would also simplify the command line arguments: only the appropriate data directories would need to be passed, the same arguments could be used across master, worker and ps, and the tensorflowSettings property could potentially be simplified further.
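To illustrate why the same arguments could then be shared across roles: once TF_CONFIG is set by the platform, a training script can recover its own role entirely from the environment. A sketch of the idea (the `role_from_env` helper and the example TF_CONFIG value are hypothetical):

```python
import json
import os


def role_from_env():
    """Read task type and index from TF_CONFIG so the same command line
    works unchanged for master, worker and ps processes."""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return "{}:{}".format(task.get("type", "master"), task.get("index", 0))


# Placeholder value standing in for what the platform would set.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["worker0:2222"]},
    "task": {"type": "ps", "index": 1},
})
print(role_from_env())  # ps:1
```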
