Job Control and Monitoring APIs #15

Merged
merged 1 commit into from
Mar 30, 2017

rhc54
Member

@rhc54 rhc54 commented Jan 31, 2017

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

@jjhursey
Member

jjhursey commented Feb 9, 2017

The RFC seems to be just the template. I think you might have forgotten to push.

@adammoody
Contributor

Thanks @rhc54. This looks good to me. The heartbeat keys you have defined provide even more capability than my initial proposal.

Can we define default settings in the standard for the case where a user doesn't specify all keys? For example, can "missed beats" default to 1 unless the user sets it to something else?
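To make the request concrete, here is a minimal sketch of how unset heartbeat settings might be filled in with defaults. It does not use the PMIx API: the struct, the `HB_UNSET` sentinel, and the function name are all hypothetical, and the only default shown is the "missed beats = 1" value suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical settings record, not a PMIx type. */
typedef struct {
    int period_secs;   /* heartbeat check interval, in seconds */
    int missed_beats;  /* missed beats tolerated before acting */
} hb_settings_t;

#define HB_UNSET                (-1)  /* marks a field the user did not set */
#define HB_DEFAULT_MISSED_BEATS 1     /* default proposed in this comment */

/* Return a copy of the user's settings with unset fields defaulted.
 * Only missed_beats has a default here; period_secs is passed through,
 * because no default for it has been proposed in this thread. */
static hb_settings_t hb_apply_defaults(const hb_settings_t *user)
{
    assert(user != NULL);
    hb_settings_t out = *user;
    if (out.missed_beats == HB_UNSET) {
        out.missed_beats = HB_DEFAULT_MISSED_BEATS;
    }
    return out;
}
```

A value the user sets explicitly is never overwritten; only the sentinel is replaced.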

@rhc54
Member Author

rhc54 commented Mar 10, 2017

Sure - I'll add that to the RFC. Please pass along any other suggestions!

@rhc54 rhc54 changed the title from "Job Control API" to "Job Control and Monitoring APIs" Mar 22, 2017
@rhc54
Member Author

rhc54 commented Mar 22, 2017

@bosilca @abouteiller Please see what you think, especially with respect to the monitoring capability.

@rhc54
Member Author

rhc54 commented Mar 27, 2017

@adammoody Can you please review and approve/comment?

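```c
pmix_info_t directives[2];
int ckpt_signal = SIGUSR2;  /* non-string values are loaded through a pointer */
/* Request a checkpoint labeled "mycheckpoint.1" */
PMIX_INFO_LOAD(&directives[0], PMIX_JOB_CTRL_CHECKPOINT, "mycheckpoint.1", PMIX_STRING);
/* Trigger the checkpoint by delivering SIGUSR2 to the target processes */
PMIX_INFO_LOAD(&directives[1], PMIX_JOB_CTRL_CHECKPOINT_SIGNAL, &ckpt_signal, PMIX_INT);
```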

This example makes me think about the SIGNAL/EVENT.

In this example, the requestor specifies which signal should be delivered to the target clients. However, if the requestor is a tool and the clients are application processes, the tool may manage a variety of client types, each using a checkpointing system of its own choosing. In such cases, it seems to me that the target client should publish to its local server the signal or event that triggers its checkpoint, which frees the requestor from having to know which signal each client needs.
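A rough sketch of that idea, in plain C rather than the PMIx API: each client records its preferred checkpoint signal with a stand-in for the local server, and the server resolves the signal when a request arrives. Every name here is hypothetical, and the precedence rule (a client's published signal wins over one supplied by the requestor) is an assumption, not something the RFC states.

```c
#include <assert.h>
#include <signal.h>   /* SIGUSR1, SIGUSR2 for callers */
#include <stddef.h>

#define MAX_CLIENTS 8
#define NO_SIGNAL   0   /* 0 is never a valid signal number */

/* Hypothetical stand-in for the local server's per-client store. */
typedef struct {
    int ckpt_signal[MAX_CLIENTS];   /* indexed by client rank */
} ckpt_registry_t;

static void registry_init(ckpt_registry_t *reg)
{
    assert(reg != NULL);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        reg->ckpt_signal[i] = NO_SIGNAL;
    }
}

/* Client side: publish the signal that triggers this client's checkpoint.
 * Returns 0 on success, -1 on a bad rank or signal number. */
static int client_publish_signal(ckpt_registry_t *reg, int rank, int sig)
{
    assert(reg != NULL);
    if (rank < 0 || rank >= MAX_CLIENTS || sig <= 0) {
        return -1;
    }
    reg->ckpt_signal[rank] = sig;
    return 0;
}

/* Server side: choose the signal for a checkpoint request. The client's
 * published signal is used if present (assumed precedence); otherwise the
 * requestor's signal, which may itself be NO_SIGNAL. */
static int resolve_ckpt_signal(const ckpt_registry_t *reg, int rank,
                               int requestor_sig)
{
    assert(reg != NULL);
    if (rank < 0 || rank >= MAX_CLIENTS) {
        return NO_SIGNAL;
    }
    int published = reg->ckpt_signal[rank];
    return (published != NO_SIGNAL) ? published : requestor_sig;
}
```

With this arrangement a tool could request a checkpoint with no signal at all and still reach clients that have published one.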

The monitoring part looks fine.

Member Author

I like that idea - will update to include it. Thanks!

@rhc54 rhc54 added the APPROVED label Mar 30, 2017
@rhc54
Member Author

rhc54 commented Mar 30, 2017

Shall be committed upon addition of example code

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54 rhc54 removed the SUBMITTED label Mar 30, 2017
@rhc54 rhc54 merged commit a9d8787 into pmix:master Mar 30, 2017
@rhc54 rhc54 deleted the rfc/jobctrl branch March 30, 2017 21:14
@adammoody
Contributor

@rhc54 Thanks for your time and effort to add the monitoring support. This will be a really useful feature for people.

As a next step, I'll work with you to write up some examples that could be used in cut-and-paste form for app developers, e.g., show the client code to register a heartbeat and then email the user and have the job killed if it gets stuck. The heartbeat macro you show is a great start along these lines.
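As a starting point for such examples, the stall check at the heart of a heartbeat monitor might look like the sketch below. It is plain C with hypothetical names and does not call PMIx. It assumes that a process counts as stalled once the number of whole periods without a beat reaches the configured limit. The notification and job-kill steps are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical monitor state, not a PMIx type. Times are in seconds. */
typedef struct {
    long last_beat;   /* time of the most recent heartbeat */
    long period;      /* expected interval between heartbeats */
    int  max_missed;  /* missed beats that trigger the stall action */
} hb_monitor_t;

/* Record a heartbeat received at time `now`. */
static void hb_record_beat(hb_monitor_t *m, long now)
{
    assert(m != NULL);
    m->last_beat = now;
}

/* Number of whole periods that have elapsed since the last heartbeat. */
static int hb_missed_beats(const hb_monitor_t *m, long now)
{
    assert(m != NULL && m->period > 0);
    if (now <= m->last_beat) {
        return 0;
    }
    return (int)((now - m->last_beat) / m->period);
}

/* True once the missed-beat count reaches the limit (assumed semantics). */
static bool hb_is_stalled(const hb_monitor_t *m, long now)
{
    return hb_missed_beats(m, now) >= m->max_missed;
}
```

A caller would poll `hb_is_stalled` on a timer and, when it returns true, take whatever action the user asked for, such as notifying the user and terminating the job.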
