Reduce the readiness checks for functions #249

Merged
alexellis merged 1 commit into openfaas:master from berndtj:reduce-readiness on Jul 21, 2018

Contributor

berndtj commented Jul 16, 2018

  • Significantly improves scale-up time for functions (when going
    from 0 -> 1)
  • Health check is hit more frequently, but should not noticeably
    impact performance
  • Use the httpget probe type and leverage the watchdog /healthz
    endpoint

This could be optimized a little further if a new image for doing
the HTTP probes were created, one which would block on connection errors
and return immediately when the response comes back, but the best
case is < 1s improvement.

Some performance numbers. Before (there was a timeout error):

cold start: 10.240251064300537
error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1.
cold start: 4.621361255645752
cold start: 5.6364970207214355
cold start: 11.648431777954102
cold start: 8.450724840164185
cold start: 9.854270935058594
cold start: 12.048357009887695
cold start: 12.24026870727539

After:

cold start: 1.8590199947357178
cold start: 1.8544681072235107
cold start: 2.065181016921997
cold start: 1.8414137363433838
cold start: 1.6598482131958008
cold start: 2.4577977657318115
cold start: 2.4510068893432617
cold start: 2.244048833847046
cold start: 2.6444039344787598
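
For a rough summary of the samples above, the following Go sketch averages the successful runs; the failed invocation in the "before" set is left out. The `mean` helper is illustrative only and is not part of this PR.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
// It is an illustrative helper, not code from this PR.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Cold-start samples in seconds, copied from the runs above.
// The failed "before" invocation is excluded.
var before = []float64{
	10.240251064300537, 4.621361255645752, 5.6364970207214355,
	11.648431777954102, 8.450724840164185, 9.854270935058594,
	12.048357009887695, 12.24026870727539,
}

var after = []float64{
	1.8590199947357178, 1.8544681072235107, 2.065181016921997,
	1.8414137363433838, 1.6598482131958008, 2.4577977657318115,
	2.4510068893432617, 2.244048833847046, 2.6444039344787598,
}

func main() {
	b, a := mean(before), mean(after)
	fmt.Printf("before: %.2fs, after: %.2fs, ratio: %.1fx\n", b, a, b/a)
}
```

On these samples the mean drops from roughly 9.3 s to roughly 2.1 s. With fewer than ten runs per side, this is an indication rather than a benchmark.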

Description

Motivation and Context

Fix #218

  • I have raised an issue to propose this change (required)

How Has This Been Tested?

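import requests
import subprocess
import time

# Measure ten cold starts: invoke the function, then scale it back to zero.
for i in range(10):
    now = time.time()
    try:
        # Invoke through the gateway; when the function is at zero replicas this forces a 0 -> 1 scale-up.
        subprocess.check_output('echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python', shell=True)
        print("cold start: %s" % (time.time() - now))
    except Exception as e:
        print("error calling function: %s" % e)
    # Scale the function back to zero replicas so the next iteration is a cold start again.
    resp = requests.post("http://192.168.64.78:31113/system/scale-function/hello-python", json={"serviceName": "hello-python", "replicas": 0})
    # Give the pod time to terminate before the next invocation.
    time.sleep(10)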

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

This only breaks very old functions that use a version of the watchdog without a /healthz endpoint.

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have signed-off my commits with git commit -s
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@derek derek bot added the new-contributor label Jul 16, 2018

derek bot commented Jul 16, 2018

Thank you for your contribution. I've just checked and your commit doesn't appear to be signed-off.
That's something we need before your Pull Request can be merged. Please see our contributing guide.

@derek derek bot added the no-dco label Jul 16, 2018

@berndtj berndtj force-pushed the berndtj:reduce-readiness branch from 02fc6ee to c19eefa Jul 16, 2018

@derek derek bot removed the no-dco label Jul 16, 2018

Member

alexellis left a comment

Hi Berndt thanks for your patch, just a few tweaks needed before merge. If making the probe type is more work than you have time for maybe it can be done in a follow-up PR? Thanks, Alex

 probe := &apiv1.Probe{
     Handler: apiv1.Handler{
-        Exec: &apiv1.ExecAction{
-            Command: []string{"cat", path},
+        HTTPGet: &apiv1.HTTPGetAction{

alexellis Jul 16, 2018

Member

This can't be turned on by default, it needs to be optional.

berndtj Jul 16, 2018

Contributor

Oh, I had thought based on our off-line conversation this wasn't the case. Easy to make optional

         },
     },
-    InitialDelaySeconds: 3,
+    InitialDelaySeconds: 0,

alexellis Jul 16, 2018

Member

Please introduce a configuration item for this. You can largely copy and paste from the existing variables.

Should also be available via helm as an option.

berndtj Jul 16, 2018

Contributor

ok

     TimeoutSeconds: 1,
-    PeriodSeconds: 10,
+    PeriodSeconds: 1,

alexellis Jul 16, 2018

Member

This should also be a configuration item with a default of the previous value for compatibility. When used in dispatch you'd just set your values via the helm chart

berndtj Jul 16, 2018

Contributor

Are you sure you want the default to be the previous value(s)? These values will have far more positive effect than negative, and shouldn't break anything existing

Contributor

berndtj commented Jul 16, 2018

Yes @alexellis, I assumed any change to the actual probe would be a separate PR.

@berndtj berndtj force-pushed the berndtj:reduce-readiness branch 2 times, most recently from a5214a5 to e33512a Jul 16, 2018

Contributor

berndtj commented Jul 16, 2018

Pretty much exposed everything. Let me know if you think this is going a bit far. Also, I left the default for the liveness probe the same as before, but the readiness probe has new values which make 0->1 scaling faster.

Member

alexellis commented Jul 17, 2018

Hi @berndtj that is very thorough work, thanks for taking time to think through the configuration options and for signing-off the PR.

Here is what I was thinking:

Since we use the same endpoint for liveness and readiness, they should always be enabled, but the question is which mode: compatibility mode or HTTP mode?

   probe_type: http 
   probe_type: lock 

Both are needed to keep compatibility with existing functions, that's why the option is needed.

Given a value of http then the /_/health endpoint should be queried (as defined in the watchdog). The OpenFaaS watchdog uses a prefix to avoid any clashing of function endpoints:

/_/health

https://github.com/openfaas/faas/blob/master/watchdog/main.go#L53

faas-netes and the gateway expose health via /healthz because they are not functions, but services.

If you are not using the watchdog and don't want to expose your health endpoint via /_/health then perhaps this should be configurable in the helm chart, for your use only?

Given a value of lock then the existing code should still run and the http probe should not be added.

I think we could de-duplicate the options in the PR and use the same values for timeout / period checking and initial check for both liveness and readiness. At this stage they point at the same endpoint and react in the same way.

What are your thoughts on above?
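
The two modes described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the types are local stand-ins for the Kubernetes probe handler types (the real ones live in `k8s.io/api/core/v1`), and `handlerFor` is a hypothetical helper. The `/_/health` path and the `/tmp/.lock` file come from this thread; port 8080 is taken from the function URL in the logs.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the Kubernetes probe handler types, reduced to
// the fields discussed in this thread.
type ExecAction struct{ Command []string }

type HTTPGetAction struct {
	Path string
	Port int
}

type Handler struct {
	Exec    *ExecAction
	HTTPGet *HTTPGetAction
}

// handlerFor is a hypothetical helper that maps the proposed probe_type
// value to a probe handler.
func handlerFor(probeType string, port int) (Handler, error) {
	switch probeType {
	case "http":
		// HTTP mode: query the watchdog health endpoint.
		return Handler{HTTPGet: &HTTPGetAction{Path: "/_/health", Port: port}}, nil
	case "lock":
		// Compatibility mode: check that the lock file exists.
		return Handler{Exec: &ExecAction{Command: []string{"cat", "/tmp/.lock"}}}, nil
	default:
		return Handler{}, errors.New("unknown probe_type: " + probeType)
	}
}

func main() {
	for _, mode := range []string{"http", "lock"} {
		h, err := handlerFor(mode, 8080)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: http=%v exec=%v\n", mode, h.HTTPGet != nil, h.Exec != nil)
	}
}
```

Rejecting unknown values is one possible design choice; falling back to the lock-file mode would favour compatibility instead.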

Contributor

berndtj commented Jul 17, 2018

That's kind of embarrassing (/healthz vs /_/health). I'm surprised it still passes readiness/liveness. I'm not sure the value needs to be configurable necessarily.

Anyway, on to your other points. Yeah, I forgot about "compatibility" mode, that's easy enough.

I explicitly did not dedupe the probes as I figured you actually want different values for liveness and readiness even if the endpoint is the same. For instance, I can live with a much longer period with liveness, but I want as short as possible for readiness.
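
To illustrate why a short readiness period matters for 0->1 scaling, here is a simplified timing model in Go. It assumes probes fire at `initialDelay`, `initialDelay+period`, and so on after the container starts, and it ignores kubelet jitter and probe latency. The 3.5 s readiness time is a hypothetical value; the probe settings are the old and new values from the diff in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// firstSuccess returns the time, in seconds after container start, of the
// first probe that fires at or after readyAt. Probes are assumed to run at
// initialDelay, initialDelay+period, initialDelay+2*period, ...
func firstSuccess(readyAt, initialDelay, period float64) float64 {
	if readyAt <= initialDelay {
		return initialDelay
	}
	ticks := math.Ceil((readyAt - initialDelay) / period)
	return initialDelay + ticks*period
}

func main() {
	const readyAt = 3.5 // hypothetical: the function can serve traffic after 3.5s

	// Previous settings: InitialDelaySeconds 3, PeriodSeconds 10.
	fmt.Println("old settings:", firstSuccess(readyAt, 3, 10))
	// New readiness settings: InitialDelaySeconds 0, PeriodSeconds 1.
	fmt.Println("new settings:", firstSuccess(readyAt, 0, 1))
}
```

In this model a function that just misses the first probe waits a full period for the next one, which is why the period dominates readiness latency, while a longer period is usually tolerable for liveness.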

@berndtj berndtj force-pushed the berndtj:reduce-readiness branch from e33512a to 8e9ceb8 Jul 18, 2018

Contributor

berndtj commented Jul 18, 2018

Ok, updated based on comments. Probe is always on and defaults to http, but can be configured for lock. Also did a bit of deduping of code where applicable.

Lastly... I see errors occasionally with regards to cold start/first call:

2018/07/18 00:16:00 error with upstream request to: /function/hello-python, Post http://hello-python.openfaas-fn.svc.cluster.local.:8080/function/hello-python: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I don't believe this has anything to do with this change (we've seen the same error within Dispatch when using OpenFaaS). It's something that should probably be addressed separately.

Member

alexellis commented Jul 19, 2018

@berndtj on the last comment I have a question.

When do you see that issue? Is it specifically when scaling 0 to 1 or at other times?

Contributor

berndtj commented Jul 19, 2018

Yes, I only see it scaling from 0->1 (but to be honest I'm not testing subsequent requests). We've also seen it with Dispatch and openfaas when we are waiting on the function to become ready. It's likely the same issue.

I actually have a change to the actual readiness check that doesn't rely on the lock file at all. I'll give that a test. I think it actually does fix the issue.

@@ -113,6 +113,20 @@ spec:
value: "{{ .Values.faasnetesd.writeTimeout }}"
- name: image_pull_policy
value: {{ .Values.faasnetesd.imagePullPolicy | quote }}
- name: http_probe

alexellis Jul 19, 2018

Member

Hi, these will need to be added to the README.md in the chart to show what values are valid and what they mean.

I think we should have defaults over there.

We also need to deploy via plain YAML via the ./yaml/ folder, so I imagine this needs updating too? That or sane (existing) defaults have to be added to the code.

alexellis Jul 19, 2018

Member

(Just seen the defaults in the code, if the defaults work well then we could update the YAML later.) Best way to test is to kubectl delete the two OpenFaaS namespaces, then apply the YAML folder again.

WriteTimeout time.Duration
ImagePullPolicy string
Port int
HTTPProbe bool

alexellis Jul 19, 2018

Member

Think this might be useful to comment on:

// HTTPProbe when set to true switches readiness and liveness probe to access /_/health over HTTP instead of accessing /tmp/.lock.
HTTPProbe bool

ReadinessProbeInitialDelaySeconds int
ReadinessProbeTimeoutSeconds int
ReadinessProbePeriodSeconds int

alexellis Jul 19, 2018

Member

Curious if this is worth making a Golang duration in this PR or a follow-up?

The other configs support Golang durations now, could call durationVal.Seconds() in the code to convert if that makes sense.
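
A minimal sketch of that conversion, assuming the value arrives as a Go duration string such as "10s". The `probeSeconds` helper and its fallback behaviour are hypothetical; only `time.ParseDuration` and `Duration.Seconds()` are standard library calls.

```go
package main

import (
	"fmt"
	"time"
)

// probeSeconds parses a Go duration string and returns it as whole
// seconds, which is the unit the probe fields expect. It returns
// fallback when the value is missing, malformed, or negative.
// Hypothetical helper, not part of this PR.
func probeSeconds(raw string, fallback int32) int32 {
	d, err := time.ParseDuration(raw)
	if err != nil || d < 0 {
		return fallback
	}
	// Seconds() returns a float64; the conversion truncates any
	// sub-second remainder.
	return int32(d.Seconds())
}

func main() {
	for _, raw := range []string{"10s", "1m30s", "not-a-duration"} {
		fmt.Printf("%q -> %d\n", raw, probeSeconds(raw, 3))
	}
}
```

Because of the truncation, a value such as "500ms" becomes 0, so a real implementation would probably want to validate or round up sub-second inputs.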

alexellis Jul 19, 2018

Member

I don't consider this as compulsory - just want your take on it.

Contributor

berndtj commented Jul 19, 2018

I hadn't even considered the yaml ;). I'll make sure and test first

-    Handler: apiv1.Handler{
+    var handler apiv1.Handler
+
+    if config.HTTPProbe {

stefanprodan Jul 19, 2018

Member

This looks good for now, but in the future we should have a way to switch this flag from the function definition so that it is backwards compatible with functions that are built with the old watchdog.

alexellis Jul 20, 2018

Member

For instance, we could use the new annotations field being worked on by @ewilde

ReadinessProbePeriodSeconds int
LivenessProbeInitialDelaySeconds int
LivenessProbeTimeoutSeconds int
LivenessProbePeriodSeconds int

stefanprodan Jul 19, 2018

Member

I would make these of type time.Duration but we can address this at a later time.

@berndtj berndtj force-pushed the berndtj:reduce-readiness branch from 8e9ceb8 to ed8b6d3 Jul 19, 2018

Member

alexellis commented Jul 20, 2018

Not going to be popular for saying this, but we've had some Chart changes merged since the PR.

This generally means resetting the commit, rebasing the chart then running make charts again before doing a commit with a force.

Other than that LGTM.

Alex

Member

alexellis commented Jul 21, 2018

Hi Berndt I know you have time away coming up, all I could do at this point is to take your commit, reset it, fix it and add it back again but it would lose your authorship. I could perhaps set the "git author" but it won't look like it does now in the history.

Alex

Contributor

berndtj commented Jul 21, 2018

I can fix it up right now.

Reduce the readiness checks for functions
* Significantly improves scale-up time for functions (when going
  from 0 -> 1)
* Health check is hit more frequently, but should not noticeably
  impact performance
* Use the httpget probe type and leverage the watchdog /healthz
  endpoint
* Make all probe attributes configurable in charts

This could be optimized a little further if a new image for doing
the HTTP probes were created, one which would block on connection errors
and return immediately when the response comes back, but the best
case is < 1s improvement.

Some performance numbers.  Before (there was a timeout error):

cold start: 10.240251064300537
error calling function: Command 'echo -n "Test" | faas-cli -g http://192.168.64.78:31112 invoke hello-python' returned non-zero exit status 1.
cold start: 4.621361255645752
cold start: 5.6364970207214355
cold start: 11.648431777954102
cold start: 8.450724840164185
cold start: 9.854270935058594
cold start: 12.048357009887695
cold start: 12.24026870727539

After:

cold start: 1.8590199947357178
cold start: 1.8544681072235107
cold start: 2.065181016921997
cold start: 1.8414137363433838
cold start: 1.6598482131958008
cold start: 2.4577977657318115
cold start: 2.4510068893432617
cold start: 2.244048833847046
cold start: 2.6444039344787598

Signed-off-by: Berndt Jung <bjung@vmware.com>

@berndtj berndtj force-pushed the berndtj:reduce-readiness branch from ed8b6d3 to d451c1e Jul 21, 2018

@alexellis alexellis merged commit aa04e3e into openfaas:master Jul 21, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

alexellis added a commit that referenced this pull request Aug 3, 2018

Fix tests broken by #249
The following commit did not update tests and it seems the
Dockerfile / CI was not running them either, found by Lucas.
Error in: aa04e3e

Tested with:

- go test ./test
- make

Signed-off-by: Alex Ellis (VMware) <alexellis2@gmail.com>

Member

alexellis commented Aug 3, 2018

We just discovered the tests were broken in this commit, fixed in #249.

dkozlov commented Aug 5, 2018

Hi @berndtj, do you have plans to add a /_/health endpoint to https://github.com/openfaas-incubator/of-watchdog functions?

Member

alexellis commented Aug 5, 2018

This is the wrong repo for the question. There's already a disk-based health check, with an HTTP one in progress - openfaas-incubator/of-watchdog#13
