Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TF Compute Server #3525

Merged
merged 47 commits into from Jun 19, 2022
Merged

Conversation

EnricoMi
Copy link
Collaborator

@EnricoMi EnricoMi commented Apr 24, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Adds helper code to spin up a Horovod job that serves a distributed TensorFlow Data Service:

horovodrun -np 4 python -m horovod.tensorflow.data.compute_worker /tmp/compute.json

as well as code to connect Tensorflow training tasks to it:

compute_config = TfDataServiceConfig.read('/tmp/compute.json', wait_for_file_creation=True)

dataset = dataset.repeat() \
    .send_to_data_service(compute_config, rank, size) \
    .prefetch(tf.data.experimental.AUTOTUNE)

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@EnricoMi EnricoMi changed the title Branch tf compute server Add TF compute server Apr 24, 2022
@github-actions
Copy link

github-actions bot commented Apr 24, 2022

Unit Test Results

     893 files  +  57       893 suites  +57   9h 48m 47s ⏱️ -35s
     781 tests +  11       737 ✔️ +  10       44 💤 +  1  0 ±0 
19 061 runs  +285  13 666 ✔️ +235  5 395 💤 +50  0 ±0 

Results for commit 5becdd0. ± Comparison against base commit aeb960c.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Apr 24, 2022

Unit Test Results (with flaky tests)

  1 037 files  +  69    1 037 suites  +69   10h 23m 51s ⏱️ + 7m 1s
     781 tests +  11       737 ✔️ +  10       44 💤 +  1  0 ±0 
22 381 runs  +339  15 598 ✔️ +263  6 783 💤 +76  0 ±0 

Results for commit 5becdd0. ± Comparison against base commit aeb960c.

♻️ This comment has been updated with latest results.

@EnricoMi EnricoMi force-pushed the branch-tf-compute-server branch 2 times, most recently from 8bed5ac to bd94d87 Compare April 25, 2022 14:47
@EnricoMi EnricoMi added this to the v0.25.0 milestone Apr 27, 2022
@EnricoMi EnricoMi marked this pull request as ready for review April 27, 2022 11:35
@EnricoMi EnricoMi changed the title Add TF compute server Add TF Compute Server Apr 27, 2022
@EnricoMi EnricoMi force-pushed the branch-tf-compute-server branch 4 times, most recently from bb37d23 to 768414b Compare April 30, 2022 21:00
@EnricoMi EnricoMi force-pushed the branch-tf-compute-server branch 6 times, most recently from f338a29 to 7974688 Compare May 11, 2022 20:21
Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome PR, I did not know this TF service existed. Couple of comments, but nothing major. LGTM.

docs/tensorflow.rst Outdated Show resolved Hide resolved
@staticmethod
def read(filename: str, wait_for_file_creation: bool = False) -> 'TfDataServiceConfig':
while wait_for_file_creation:
if os.path.exists(filename):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially dangerous. Could it not be the case that the file has been created, but it has not been fully written to yet?

Usually in situations like this I'll either use a mutex (if the reader/writer are running in the same process) or a FileLock:

https://py-filelock.readthedocs.io/en/latest/

Copy link
Collaborator Author

@EnricoMi EnricoMi Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Is the FileLock implementation guaranteed to work with any distributed file system? If the config file appears before the lock file, the lock does not work.

I have added a simple staged writing to the def write method that moves the finished file into place via os.rename inside the same directory.

horovod/tensorflow/data/compute_service.py Outdated Show resolved Hide resolved
horovod/tensorflow/data/compute_service.py Outdated Show resolved Hide resolved
@EnricoMi EnricoMi force-pushed the branch-tf-compute-server branch 3 times, most recently from 8b4a1cb to 2bf6e82 Compare June 9, 2022 20:10
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Terence Hernandez <t.na.m.hernandez@gmail.com>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
…r script

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
This reverts commit 5ab15a1.

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi merged commit 32b5d3e into horovod:master Jun 19, 2022
@EnricoMi EnricoMi deleted the branch-tf-compute-server branch June 19, 2022 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants