Reduce the readiness checks for functions #249
This could be optimized a little further if new image for doing
Some performance numbers. Before (there was a timeout error):
cold start: 10.240251064300537
After:
cold start: 1.8590199947357178
Motivation and Context
How Has This Been Tested?
Types of changes
This only breaks very old functions that use a version of the watchdog without a /healthz endpoint.
Hi Berndt, thanks for your patch. Just a few tweaks are needed before merge. If making the probe type is more work than you have time for, maybe it can be done in a follow-up PR? Thanks, Alex
Hi @berndtj, that is very thorough work. Thanks for taking the time to think through the configuration options and for signing off the PR.
Here is what I was thinking:
Since we use the same endpoint for liveness and readiness, they should always be enabled, but the question is which mode: compatibility mode or http mode?
Both are needed to keep compatibility with existing functions, that's why the option is needed.
Given a value of
faas-netes and the gateway expose health via
If you are not using the watchdog and don't want to expose your health endpoint via
Given a value of
I think we could de-duplicate the options in the PR and use the same values for timeout / period checking and initial check for both liveness and readiness. At this stage they point at the same endpoint and react in the same way.
What are your thoughts on the above?
That's kind of embarrassing (/healthz vs /_/health). I'm surprised it still passes readiness/liveness. I'm not sure the value needs to be configurable necessarily.
Anyway, on to your other points. Yeah, I forgot about "compatibility" mode; that's easy enough.
I explicitly did not dedupe the probes as I figured you actually want different values for liveness and readiness even if the endpoint is the same. For instance, I can live with a much longer period with liveness, but I want as short as possible for readiness.
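To make that trade-off concrete, here is a rough sketch of how the probe period bounds the wait before a pod is seen as ready. It assumes an idealized schedule (first probe at `initialDelaySeconds`, then one every `periodSeconds`; a real kubelet adds jitter). The function name and the sample numbers are illustrative and not taken from this PR.

```python
import math


def first_probe_at_or_after(ready_at, initial_delay, period):
    """Return the time (seconds since container start) of the first probe
    that can observe the function as ready, under the idealized schedule
    initial_delay, initial_delay + period, initial_delay + 2*period, ...
    """
    if ready_at <= initial_delay:
        # The very first probe already sees a ready function.
        return initial_delay
    # Number of whole periods needed to reach or pass ready_at.
    k = math.ceil((ready_at - initial_delay) / period)
    return initial_delay + k * period


# Hypothetical function that becomes healthy 1.5s after container start.
fast = first_probe_at_or_after(1.5, initial_delay=0, period=1)
slow = first_probe_at_or_after(1.5, initial_delay=0, period=10)
print(fast, slow)  # 2 10
```

With the same endpoint, a 1s readiness period notices the function at about 2s, while a 10s period waits until about 10s. A long period is far less costly for liveness, which only has to catch a function that has stopped responding.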
Ok, updated based on comments. Probe is always on and defaults to http, but can be configured for lock. Also did a bit of deduping of code where applicable.
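As a minimal sketch of that behaviour (not the code in this PR), the helper below builds a Kubernetes-style probe as a plain dictionary. The field names (`httpGet`, `exec`, `initialDelaySeconds`, `timeoutSeconds`, `periodSeconds`) are standard Kubernetes probe fields, and `/_/health` comes from the discussion above. The function name, the port `8080`, the lock-file command, and the default timings are assumptions made for illustration.

```python
def build_probe(mode="http", initial_delay=3, timeout=1, period=10):
    """Build a probe spec; the probe is always present, only its handler varies."""
    if mode == "http":
        # Assumed port; /_/health is the health path mentioned in the thread.
        handler = {"httpGet": {"path": "/_/health", "port": 8080}}
    elif mode == "lock":
        # Assumed lock-file check for older functions (compatibility mode).
        handler = {"exec": {"command": ["cat", "/tmp/.lock"]}}
    else:
        raise ValueError(f"unknown probe mode: {mode!r}")

    probe = dict(handler)
    probe.update(
        initialDelaySeconds=initial_delay,
        timeoutSeconds=timeout,
        periodSeconds=period,
    )
    return probe


# Same handler for both probes, but separate timings for each.
readiness = build_probe(period=1)
liveness = build_probe(period=10)
legacy = build_probe(mode="lock")
```

Keeping the timings as parameters lets readiness and liveness share one handler while still using different periods.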
Lastly... I see errors occasionally with regards to cold start/first call:
I don't believe this has anything to do with this change (we've seen the same error within Dispatch when using OpenFaaS). It's something that should probably be addressed separately.
Yes, I only see it when scaling from 0->1 (but to be honest, I'm not testing subsequent requests). We've also seen it with Dispatch and OpenFaaS when we are waiting on the function to become ready. It's likely the same issue.
I actually have a change to the actual readiness check that doesn't rely on the lock file at all. I'll give that a test. I think it actually does fix the issue.
Hi Berndt, I know you have time away coming up. All I could do at this point is take your commit, reset it, fix it and add it back again, but that would lose your authorship. I could perhaps set the "git author", but it won't look like it does now in the history.