Add TF Compute Server #3525
Unit Test Results (with flaky tests): 1 037 files (+69), 1 037 suites (+69), 10h 23m 51s (+7m 1s). Results for commit 5becdd0. Comparison against base commit aeb960c.
This is an awesome PR, I did not know this TF service existed. Couple of comments, but nothing major. LGTM.
    @staticmethod
    def read(filename: str, wait_for_file_creation: bool = False) -> 'TfDataServiceConfig':
        while wait_for_file_creation:
            if os.path.exists(filename):
This seems potentially dangerous. Could it not be the case that the file has been created, but it has not been fully written to yet?
Usually in situations like this I'll use either a mutex (if the reader and writer are running in the same process) or a FileLock.
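To illustrate the same-process case mentioned above, here is a minimal sketch of guarding the config file with a mutex. It is not code from this PR: `_config_lock`, `write_config`, and `read_config` are hypothetical names, and the config is assumed to be JSON. It only helps when reader and writer are threads of one process; separate processes or hosts would need a file-based lock or an atomic rename instead.

```python
import json
import os
import threading

# Hypothetical module-level lock shared by the reader and writer threads.
_config_lock = threading.Lock()


def write_config(filename: str, config: dict) -> None:
    """Write the config while holding the lock, so a reader in the
    same process never observes a partially written file."""
    with _config_lock:
        with open(filename, "w") as f:
            json.dump(config, f)


def read_config(filename: str):
    """Return the parsed config, or None if the file does not exist yet."""
    with _config_lock:
        if not os.path.exists(filename):
            return None
        with open(filename) as f:
            return json.load(f)
```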
Good point! Is the FileLock implementation guaranteed to work with any distributed file system? If the config file appears before the lock file, the lock does not work.
I have added simple staged writing to the write method: it writes to a temporary file and then moves the finished file into place via os.rename inside the same directory.
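The staged-write idea can be sketched roughly as follows. This is an illustration under assumptions, not the code added in this PR: `write_atomically` and `read_when_ready` are hypothetical helpers, the payload is assumed to be JSON, and the timeout is an addition for the example. On a local POSIX file system a rename within one directory is atomic; whether a given distributed file system gives the same guarantee has to be checked separately.

```python
import json
import os
import tempfile
import time


def write_atomically(filename: str, payload: dict) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing
        # Readers see either no file or the complete file, never a partial one.
        os.rename(tmp_path, filename)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_when_ready(filename: str, timeout: float = 10.0,
                    poll_interval: float = 0.1) -> dict:
    """Poll until the file exists, then parse it; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(filename):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{filename} did not appear within {timeout}s")
        time.sleep(poll_interval)
    with open(filename) as f:
        return json.load(f)
```

Because the temporary file lives in the same directory as the target, the rename never crosses a file system boundary. On Windows, os.rename fails if the destination already exists, so os.replace would be the portable choice there.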
Signed-off-by: Enrico Minack <github@enrico.minack.dev> Co-authored-by: Terence Hernandez <t.na.m.hernandez@gmail.com>
…r script Signed-off-by: Enrico Minack <github@enrico.minack.dev>
This reverts commit 5ab15a1. Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Description
Adds helper code to spin up a Horovod job that serves a distributed TensorFlow Data Service:
as well as code to connect TensorFlow training tasks to it: