[Slurm] Alternatives for job completion monitoring #25

Open
sylvlecl opened this issue Dec 2, 2019 · 2 comments

Comments

@sylvlecl
Contributor

sylvlecl commented Dec 2, 2019

  • Do you want to request a feature or report a bug?

Feature

  • What is the current behavior?

To monitor the completion of jobs submitted to Slurm, we use flag files and filesystem polling.
Depending on the polling frequency, this adds latency (the delay between the end of a task and the moment the computation manager identifies it as completed) and puts load on the underlying filesystem, in particular when several processes using a computation manager are running.

  • What is the expected behavior?

It should be possible to configure how completion monitoring is performed.
Polling would be one implementation of this functionality.

Other interesting implementations would be:

  1. A very simple in-house networking protocol, for example implemented with Netty.
  2. Using a message broker (Kafka, RabbitMQ, ...): this should probably be left for client projects to implement.

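
To make the idea of a configurable monitoring strategy more concrete, here is a minimal sketch in Java. The names `CompletionMonitor` and `PollingCompletionMonitor` and the `<jobId>.done` flag file convention are assumptions made for illustration; they are not taken from the powsybl-hpc code base. A Netty-based or broker-based implementation would plug in behind the same interface.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction, not part of the powsybl-hpc API.
interface CompletionMonitor extends AutoCloseable {
    /** Returns a future that completes once the given job is reported as finished. */
    CompletableFuture<Void> whenCompleted(String jobId);

    @Override
    void close();
}

// Polling strategy: periodically checks for a flag file in a directory.
final class PollingCompletionMonitor implements CompletionMonitor {
    private final Path flagDir;
    private final long periodMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PollingCompletionMonitor(Path flagDir, long periodMillis) {
        this.flagDir = flagDir;
        this.periodMillis = periodMillis;
    }

    @Override
    public CompletableFuture<Void> whenCompleted(String jobId) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        // Assumed naming convention: one "<jobId>.done" file per finished job.
        Path flag = flagDir.resolve(jobId + ".done");
        ScheduledFuture<?> task = scheduler.scheduleWithFixedDelay(() -> {
            if (Files.exists(flag)) {
                done.complete(null);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        // Stop polling as soon as the future completes (or is cancelled by the caller).
        done.whenComplete((result, error) -> task.cancel(false));
        return done;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape, the trade-off described above becomes a single parameter (`periodMillis`): a shorter period reduces the detection delay but increases the load on the filesystem.
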
  • What is the motivation / use case for changing the behavior?

Improving perceived performance while relieving the filesystem.

  • Please tell us about your environment:
    • powsybl-hpc version: 2.7.0

@yichen88
Contributor

If Slurm is in local mode, we can simply register a WatchService on flagDir.
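
A minimal sketch of that suggestion, using the standard `java.nio.file.WatchService`. The class name `FlagDirWatcher`, the method `awaitFlag` and the flag file name passed to it are hypothetical; only the `WatchService` calls come from the JDK. It assumes the flag directory is on a local filesystem where creation events are actually delivered.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of powsybl-hpc.
final class FlagDirWatcher {

    /**
     * Waits until a file named flagName appears in flagDir.
     * Returns true if the flag was seen, false if the timeout elapsed first.
     */
    static boolean awaitFlag(Path flagDir, String flagName, long timeoutMillis)
            throws IOException, InterruptedException {
        Path flag = flagDir.resolve(flagName);
        try (WatchService watcher = flagDir.getFileSystem().newWatchService()) {
            flagDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            // Check once after registering, so a flag created earlier is not missed.
            if (Files.exists(flag)) {
                return true;
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                WatchKey key = watcher.poll(remaining, TimeUnit.NANOSECONDS);
                if (key == null) {
                    return false; // timed out without any event
                }
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        // Events may have been lost: fall back to a direct check.
                        if (Files.exists(flag)) {
                            return true;
                        }
                        continue;
                    }
                    // For ENTRY_CREATE, the context is the path relative to flagDir.
                    Path created = (Path) event.context();
                    if (created.getFileName().toString().equals(flagName)) {
                        return true;
                    }
                }
                if (!key.reset()) {
                    return false; // directory is no longer accessible
                }
            }
        }
    }
}
```

For example, `FlagDirWatcher.awaitFlag(flagDir, "job1.done", 60_000)` would block for at most one minute. Note that on some platforms the JDK itself implements `WatchService` by polling, so the latency gain is not guaranteed.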

@sylvlecl
Contributor Author

Yes, but the problem is that even in "local" mode, there is a good chance that the flag dir is actually on a shared filesystem, for instance an NFS mount, so that Slurm nodes can access it. In that case, the watch service will probably not work (or will be implemented with polling).
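
One possible way to act on this, sketched below, is to look at the file store type of the flag directory and only use a watch service when it does not look like a network mount. `FlagDirFileSystems` and the list of type names are assumptions for illustration; the string returned by `FileStore.type()` is implementation specific, so this is a heuristic at best and the list would need adjusting for a given cluster.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of powsybl-hpc.
final class FlagDirFileSystems {

    // Illustrative, non-exhaustive list of type names commonly seen for network mounts.
    private static final Set<String> NETWORK_TYPES =
            Set.of("nfs", "nfs4", "cifs", "lustre", "gpfs");

    /** Heuristic: true if the given file store type name looks like a network filesystem. */
    static boolean isLikelyNetworkType(String type) {
        return NETWORK_TYPES.contains(type.toLowerCase(Locale.ROOT));
    }

    /** Heuristic: true if flagDir appears to live on a shared (network) filesystem. */
    static boolean isLikelyShared(Path flagDir) throws IOException {
        // FileStore.type() returns an implementation-specific string.
        return isLikelyNetworkType(Files.getFileStore(flagDir).type());
    }
}
```

When `isLikelyShared(flagDir)` returns true, falling back to explicit polling (or to a network-based notification) seems safer than relying on the watch service.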
