Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shut down an idle pilot #27

Merged
merged 2 commits into from Sep 15, 2021
Merged

Conversation

dsschult
Copy link
Contributor

Currently STARTD_NOCLAIM_SHUTDOWN is fixed at 30 minutes, and supervisord restarts condor repeatedly, never exiting the container.

This PR allows the idle time to be configured, and shuts down the container if condor_master exits normally.

@brianhlin
Copy link
Member

@dsschult could you rebase this on master? We've got some basic CI tests now so I'd like to run this change through them.

@dsschult
Copy link
Contributor Author

Rebased. And now I see tests running.

@brianhlin
Copy link
Member

@matyasselmeci @rynge what do you think about this? This would change the behavior of the containers quite a bit for our other users so we'd have to communicate it to them

@dsschult
Copy link
Contributor Author

The main behavior change is autorestart=unexpected. Not sure if that could be made an option, so default behavior is unchanged. Might be able to do that in the entrypoint script, before supervisord starts up.

@brianhlin
Copy link
Member

At this point users just start up the container when they want work from the access points and whether or not they get it is our problem. With this change, sites will have to monitor whether or not their container is running to account for transient lulls in usage.

@matyasselmeci
Copy link
Contributor

This seems similar to https://opensciencegrid.atlassian.net/browse/SOFTWARE-4608.

@brianhlin
Copy link
Member

That's in a similar vein but I think it means we only want autorestart on successful exit: the container should exit if the startd isn't working so that the admin can just check if their containers are running or not to determine if they're offering their resources to the OSPool.

@matyasselmeci
Copy link
Contributor

Wouldn't that delete the log files of that pilot, making it harder to troubleshoot?

@dsschult
Copy link
Contributor Author

dsschult commented Sep 15, 2021

I added a change to make it configurable. Then you can decide which policy you like better at runtime.

It now takes SUPERVISORD_RESTART_POLICY=unexpected as an env variable (and for some reason that didn't make it into the commit message).

Could potentially name it better if you can think of something.

Copy link
Member

@brianhlin brianhlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a change to make it configurable. Then you can decide which policy you like better at runtime.

All disagreements can be satisfied by more configuration :). This works for me!

Wouldn't that delete the log files of that pilot, making it harder to troubleshoot?

We do tell folks to mount something to /pilot so the logs should still be around. I think it's probably better than the alternative where you need to be an expert in condor and this container to realize that you're running the container but it's not advertising back to the OSPool.

@brianhlin brianhlin merged commit 6ac962c into opensciencegrid:master Sep 15, 2021
@dsschult dsschult deleted the idle_shutdown branch September 15, 2021 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants