Add TF Compute Server #3525

Merged 47 commits on Jun 19, 2022

Commits on Jun 17, 2022

  1. Add support for Tensorflow Data Service

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    Co-authored-by: Terence Hernandez <t.na.m.hernandez@gmail.com>
    EnricoMi and TerenceHernandez committed Jun 17, 2022
    3c12358
  2. Move compute_*.py into horovod.tensorflow.data, fix examples

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    6b335ea
  3. Make output_filename configurable in compute_worker.py

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    6444108
  4. Add tf data service to docs

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    546fa49
  5. Make worker and example work with horovodrun, move docs into tensorflow.rst

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    67b8618
  6. Make spark worker work with spark-submit

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    c75c984
  7. Add horovodrun example to CI

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    83d3c50
  8. Remove tensorflow_data_service.rst from summary.rst

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    d8c1476
  9. Download to CWD directly, not mnist sub-directory

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    38119b8
  10. Add spark-submit example to docs and CI

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    d9491ea
  11. Reduce run time for examples in CI

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    bd7c6a9
  12. Use default path to fetch mnist dataset, which is pre-fetched in test images

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    53b2ede
  13. Use --mpi instead of --gloo for MPI tests

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    ff9c5b0
  14. Run two workers to save ram, remove -H option

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    22f9116
  15. Escape $ differently in test command, but only for Buildkite

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    c6eac52
  16. Revert "Escape $ differently in test command, but only for Buildkite"

    This reverts commit bd94d87.
    
    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    f423a2f
  17. Reference the worker file directly

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    ba22df1
  18. Initialize Horovod for Tensorflow in tf worker

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    9e3998b
  19. Pin horovod task to GPU

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    8596ead
  20. Move rank and size around

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    9cf4813
  21. Update CHANGELOG.md

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    847c01b
  22. Introducing TimeoutException

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    c0cabf5
  23. Add tests for compute service

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    2d23441
  24. Fix shutdown test

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    5c1c819
  25. Add timeout parameter to TfDataServiceConfig

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    5664728
  26. Add tf unit tests

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    02597e9
  27. Synchronize tests, assert batches

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    d98faf3
  28. Add processing_mode to send_to_data_service, improve logging

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    cc4f773
  29. Relax assertions, add logging, add timeout parameter to compute worker script

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    7fef190
  30. Add training-side tests

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    80171b2
  31. Add DEBUG level to pytest tests

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    d1ed69a
  32. Skip round-robin test

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    6b1c85d
  33. Test processing modes

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    7c55576
  34. Remove expected batches, skip pre tf2

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    df42dc8
  35. Minor restructure of tests

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    f6d829f
  36. Remove port detection and address spec for worker

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    e18c2f8
  37. Bind to single GPU

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    35c7cf4
  38. Revert "Add DEBUG level to pytest tests"

    This reverts commit 5ab15a1.
    
    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    d26780d
  39. Minor comment fix

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    5eba267
  40. Have horovod.tensorflow.data.compute_worker.py script broadcast config

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    61f711e
  41. Add some words about TF data service to docs

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    91afa2f
  42. Shutdown dispatcher in finally clause

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    cbae912
  43. Move the finished config file into place

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    9e6946e
  44. Fix config broadcast for MPI in GPU environment

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    a24c9f5
  45. Fixing typos in docs

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    5b6e765
  46. Add tensorflow issue to skipped test

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    8319e44
  47. Remove extra timeout from compute_worker_fn

    Signed-off-by: Enrico Minack <github@enrico.minack.dev>
    EnricoMi committed Jun 17, 2022
    5becdd0
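
Note on commit 43 ("Move the finished config file into place"): the commit list does not show the change itself, so the following is only a sketch of one common way to do this in Python, not the code from this PR. It assumes the config is a JSON-serializable dict. The function name `write_config_atomically` and the example keys are hypothetical. The idea is to write the file under a temporary name and rename it once it is complete, so a process polling for the config file never reads a half-written one.

```python
import json
import os
import tempfile


def write_config_atomically(config: dict, path: str) -> None:
    """Write `config` as JSON to a temp file, then move it into place.

    Hypothetical helper for illustration; not part of Horovod.
    """
    # Create the temp file in the target directory so the final rename
    # stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
        # os.replace overwrites an existing target; the rename is atomic
        # on POSIX when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that waits for `path` to appear can then load it as soon as it exists, for example with `json.load(open(path))`.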