windows health checks run from a single powershell instance #6947
Conversation
Hello mwrock! Thanks for the pull request! Here is what will happen next:
Thank you for contributing!
I haven't made it all the way through yet, but I thought some initial feedback might be useful. This is a good opportunity for me to learn more about the Windows-specific code.
// Always check to see if the powershell named pipe server is running.
// This should only be false the first time this function is called
// or if someone explicitly terminates the powershell process
if self.pipe_exists().is_err() {
It would be nice to distinguish between when we expect the pipe to exist versus when we're expecting that we'll have to create it. I have concerns around getting in loops where we create the pipe over and over again.
Well, we expect it to exist on the 2nd call to `exec_hook` and onwards. However, it's very possible someone can inadvertently terminate the hosting `pwsh` process and obliterate the pipe. If there is something fundamentally wrong and we can't create the pipe, the call to `init_server` should err, resulting in killing the supervisor. If for some reason the client was not properly disconnected from the pipe on the last call to `exec_hook`, the call to `init_server` would fail trying to create a new pipe that is busy. However, we have 2 protections against this scenario (see the sketch after this list):
- the client's pipe handle is dropped and thus closed at the end of `exec_hook`
- the powershell server disconnects the client after returning its exit code.
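To make those two protections concrete, here is a simplified, hypothetical sketch of the flow being described; the type and method names (`PipeHookClient`, `start_server`, `connect`, `run_hook`) are illustrative stand-ins, not the actual implementation:

```rust
// Hypothetical, simplified sketch of the recovery flow described above.
struct PipeHookClient;

struct Connection;

impl Connection {
    fn run_hook(&self) -> std::io::Result<u32> {
        // Placeholder: the real code sends the hook request over the pipe
        // and waits for the powershell server to return an exit code.
        Ok(0)
    }
}

impl PipeHookClient {
    fn exec_hook(&self) -> std::io::Result<u32> {
        // Recreate the named pipe server only when it is missing, e.g. on the
        // first call or after someone terminated the hosting pwsh process.
        if self.pipe_exists().is_err() {
            self.start_server()?;
        }
        // The client's pipe handle lives only inside this scope...
        let exit_code = {
            let connection = self.connect()?;
            connection.run_hook()?
        };
        // ...and is dropped (closed) here, while the powershell server
        // disconnects its own end after returning the exit code. Together
        // these keep the next start_server call from hitting a busy pipe.
        Ok(exit_code)
    }

    fn pipe_exists(&self) -> std::io::Result<()> {
        // Placeholder: the real check would try to open the named pipe.
        Ok(())
    }

    fn start_server(&self) -> std::io::Result<()> {
        // Placeholder: the real code starts the hosting powershell process.
        Ok(())
    }

    fn connect(&self) -> std::io::Result<Connection> {
        Ok(Connection)
    }
}
```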
    Ok((pipe, poll))
}

fn init_server<T>(&self,
Is this function intended to be called many times? The logic of `exec_hook` seems to indicate that it could be. If needing to reestablish the named pipe more than once during the lifetime of a `HealthCheckHook` is expected, I think the name `init` is unexpected. Could that instead be rolled into `connect`? If the connection should be durable, we could make it stateful in `PipeHookClient`, either through an `Option` or the typestate pattern.
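For illustration only, a minimal sketch of what the typestate variant mentioned here could look like; the states and methods below are entirely hypothetical and the real `PipeHookClient` has a different shape:

```rust
// Encode "pipe server running" in the type, so the hook can only be run
// once the server has been started.
struct Disconnected;
struct Connected;

struct PipeHookClient<State> {
    state: State,
}

impl PipeHookClient<Disconnected> {
    fn start_server(self) -> std::io::Result<PipeHookClient<Connected>> {
        // Creating the named pipe server would happen here.
        Ok(PipeHookClient { state: Connected })
    }
}

impl PipeHookClient<Connected> {
    fn exec_hook(&self) -> std::io::Result<u32> {
        // Running the hook over the already-established pipe would happen here.
        Ok(0)
    }
}
```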
Along the lines of the above comment, we don't "expect" to call it multiple times, but we need to be prepared to call it multiple times in the event the hosting `pwsh` process is terminated. I like keeping `connect` focused on connecting to an existing pipe. It also avoids needing to pass in the args that need to be forwarded to `init_service`. For instance, the `quit` function connects if there is an existing pipe, but there is no need to create a new pipe service just to quit.
I had originally implemented this using a durable connection where the struct held an `Option<Pipe>` in a mutable property. The problem that introduced is that the hook's `run` function is not set up to be mutable, which it needs to be since its pipe client can mutate. However, changing the signature of `run` to use `&mut self` proved to be pretty invasive.
I can definitely change the name of the `init_server` function though. Perhaps to `start_server`.
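A minimal sketch, with hypothetical types, of the mutability problem described above: once the client holds an `Option<Pipe>` that it can replace, `run` has to take `&mut self`, and that requirement ripples out to every caller:

```rust
struct Pipe;

struct PipeHookClient {
    // A durable connection means the client must be able to store a new
    // pipe whenever the old one disappears.
    pipe: Option<Pipe>,
}

struct HealthCheckHook {
    client: PipeHookClient,
}

impl HealthCheckHook {
    // Because the client can mutate, this cannot be `&self`; every caller
    // up the chain now needs mutable access as well.
    fn run(&mut self) -> std::io::Result<u32> {
        if self.client.pipe.is_none() {
            self.client.pipe = Some(Pipe);
        }
        Ok(0)
    }
}
```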
fn pipe_ready(&self, poll: &Poll, readiness: Ready) -> Result<()> {
    let mut events = Events::with_capacity(1024);
    loop {
        poll.poll(&mut events, None)?;
I think we should add `liveliness_checker::mark_thread_alive()` here so we can track potential deadlocks. Here's one way to do it (note the change of return type):

fn pipe_ready(&self, poll: &Poll, readiness: Ready) -> std::io::Result<bool> {
    let mut events = Events::with_capacity(1024);
    let loop_value = loop {
        let checked_thread = habitat_common::liveliness_checker::mark_thread_alive();
        let result = poll.poll(&mut events, None)
                         .map(|_| events.iter().any(|e| e.readiness().contains(readiness)));
        if let Ok(false) = result {
            continue;
        } else {
            break checked_thread.unregister(result);
        }
    };
    loop_value.into_result()
}
Passing `None` to `poll` will block forever. It's hard to say before running this in the field whether there are instances where we'd rather time out, but adding an `env_config_duration` would allow tuning that without an additional release. I wonder if it ever makes sense to block longer than our intended delay between health check runs.
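As a rough illustration of the tuning idea (not habitat's actual env-config machinery), a timeout read from a hypothetical environment variable could be fed to `poll` in place of `None`:

```rust
use std::time::Duration;

// HAB_HEALTH_PIPE_POLL_TIMEOUT_SECS is a made-up variable name used purely
// for illustration; the real setting would presumably go through habitat's
// env-config helpers rather than raw std::env.
fn poll_timeout() -> Option<Duration> {
    std::env::var("HAB_HEALTH_PIPE_POLL_TIMEOUT_SECS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_secs)
}

// With a timeout configured, poll returns even if the pipe never becomes
// ready; with the variable unset, the behavior stays as it is today:
// poll.poll(&mut events, poll_timeout())?;
```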
Yeah, my concern with adding a timeout is that it effectively becomes a timeout on how long we allow the hook to run, which we do not currently enforce in any of our hooks, and I did not want to introduce a new concept of a hook timeout, especially not only for Windows. I think the worst that would happen here is that the health check would stall, but since health checks run in their own thread, that would not interfere with the supervisor or other services in general.
Signed-off-by: mwrock <matt@mattwrock.com>
Looks good!
This resolves #5584 and implements #6873
Signed-off-by: mwrock <matt@mattwrock.com>