Excessive latency when serving function list endpoint #64

Closed
alexellis opened this Issue Dec 2, 2018 · 8 comments

alexellis commented Dec 2, 2018

I've noticed excessive latency when serving the function list endpoint, compared to faas-netes. For instance, we have 100 function deployments on the community cluster right now running with faas-netes, and calling the 127.0.0.1:8080/system/functions endpoint takes real 0m 0.05s.

The operator takes a similar amount of time on some requests, but subsequent requests can take 3-5 seconds or longer, or even time out.

Reproduce the issue

  • Deploy the operator
  • Deploy 2-5 functions from the store
  • Find the pod name for the operator
  • Exec into the pod, targeting the gateway container:
kubectl exec -ti -n openfaas gateway-784469b897-k75t5 -c gateway sh

Run:

time wget -O - 127.0.0.1:8081/system/functions

real	0m 0.03s
real	0m 4.34s
real	0m 2.79s
real	0m 2.60s
real	0m 1.99s
...

Timing the equivalent query with kubectl:

time kubectl get function -n openfaas-fn

Could this be caused by locking or the way we are looking up services?

alexellis changed the title from "Excessive latency when querying" to "Excessive latency when serving function list endpoint" Dec 2, 2018

alexellis commented Dec 2, 2018

The functions I deployed were from the certifier, so some have embedded annotations/labels. Even with plain store functions (figlet, certinfo and nodeinfo), latency seems to increase when the endpoint is called several times in succession.

For instance:

faas-cli store deploy figlet -g 127.0.0.1:31112 --annotation abc=abcdef --name figlet1
faas-cli store deploy figlet -g 127.0.0.1:31112 --annotation abc=abcdef --name figlet2
faas-cli store deploy figlet -g 127.0.0.1:31112 --annotation abc=abcdef --name figlet3
faas-cli store deploy figlet -g 127.0.0.1:31112 --annotation abc=abcdef --name figlet4
faas-cli store deploy figlet -g 127.0.0.1:31112 --annotation abc=abcdef --name figlet5

alexellis commented Dec 2, 2018

It might be worth us adding the following for belt and braces, but it made no difference to performance:

		// Belt and braces: ensure the request body is always closed
		if r.Body != nil {
			defer r.Body.Close()
		}
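
For context, a minimal self-contained sketch (assumed handler and route names, not the operator's actual code) of where such a guard sits. Go's net/http server closes the request body itself once a handler returns, which is why this is only belt and braces:

package main

import (
	"log"
	"net/http"
)

// listHandler is a hypothetical stand-in for the operator's list endpoint.
func listHandler(w http.ResponseWriter, r *http.Request) {
	// Defensive only: the net/http server also closes the request body
	// after the handler returns.
	if r.Body != nil {
		defer r.Body.Close()
	}

	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`[]`)) // placeholder body
}

func main() {
	http.HandleFunc("/system/functions", listHandler)
	log.Fatal(http.ListenAndServe(":8081", nil))
}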

alexellis commented Dec 2, 2018

This is the kind of latency I'm seeing after adding tracing points with glog and time.Now()/time.Since():

W1202 23:12:58.966410       1 list.go:32] Functions.List took - 0.006309s
W1202 23:12:58.971994       1 list.go:52] certinfo took - 0.004477s
W1202 23:12:58.976054       1 list.go:52] figlet took - 0.004006s
W1202 23:12:58.980199       1 list.go:52] figlet1 took - 0.004097s
W1202 23:12:58.984183       1 list.go:52] figlet2 took - 0.003930s
W1202 23:12:58.988290       1 list.go:52] figlet3 took - 0.004043s
W1202 23:12:58.992148       1 list.go:52] figlet4 took - 0.003807s
W1202 23:12:58.996088       1 list.go:52] figlet5 took - 0.003889s
W1202 23:12:58.999794       1 list.go:52] nodeinfo took - 0.003657s
W1202 23:12:58.999920       1 list.go:61] Marshal/Write took - 0.000083s
W1202 23:13:00.670198       1 list.go:32] Functions.List took - 0.010643s
W1202 23:13:00.673821       1 list.go:52] certinfo took - 0.003575s
W1202 23:13:00.677265       1 list.go:52] figlet took - 0.003400s
W1202 23:13:00.680955       1 list.go:52] figlet1 took - 0.003645s
W1202 23:13:00.684608       1 list.go:52] figlet2 took - 0.003607s
W1202 23:13:00.688301       1 list.go:52] figlet3 took - 0.003647s
W1202 23:13:00.691868       1 list.go:52] figlet4 took - 0.003522s
W1202 23:13:00.695411       1 list.go:52] figlet5 took - 0.003498s
W1202 23:13:00.698807       1 list.go:52] nodeinfo took - 0.003350s
W1202 23:13:00.698965       1 list.go:61] Marshal/Write took - 0.000117s
W1202 23:13:01.073426       1 list.go:32] Functions.List took - 0.005144s
W1202 23:13:01.076768       1 list.go:52] certinfo took - 0.003297s
W1202 23:13:01.080407       1 list.go:52] figlet took - 0.003594s
W1202 23:13:01.083922       1 list.go:52] figlet1 took - 0.003469s
W1202 23:13:01.087426       1 list.go:52] figlet2 took - 0.003460s
W1202 23:13:01.275161       1 list.go:52] figlet3 took - 0.187682s
W1202 23:13:01.474320       1 list.go:52] figlet4 took - 0.199105s
W1202 23:13:01.674123       1 list.go:52] figlet5 took - 0.199742s
W1202 23:13:01.874214       1 list.go:52] nodeinfo took - 0.200040s
W1202 23:13:01.874372       1 list.go:61] Marshal/Write took - 0.000110s
W1202 23:13:01.906746       1 list.go:32] Functions.List took - 0.004899s
W1202 23:13:02.074201       1 list.go:52] certinfo took - 0.167396s
W1202 23:13:02.274338       1 list.go:52] figlet took - 0.200088s
W1202 23:13:02.339328       1 list.go:32] Functions.List took - 0.004992s
W1202 23:13:02.474905       1 list.go:52] figlet1 took - 0.200516s
W1202 23:13:02.674724       1 list.go:52] certinfo took - 0.335339s
W1202 23:13:02.874153       1 list.go:52] figlet2 took - 0.399197s
W1202 23:13:03.072271       1 list.go:52] figlet took - 0.397504s
W1202 23:13:03.272005       1 list.go:52] figlet3 took - 0.397813s
W1202 23:13:03.474272       1 list.go:52] figlet1 took - 0.401969s
W1202 23:13:03.672093       1 list.go:52] figlet4 took - 0.400069s
W1202 23:13:03.872032       1 list.go:52] figlet2 took - 0.397711s
W1202 23:13:03.961350       1 list.go:32] Functions.List took - 0.001673s
W1202 23:13:04.074394       1 list.go:52] figlet5 took - 0.402270s
W1202 23:13:04.274593       1 list.go:52] figlet3 took - 0.402529s
W1202 23:13:04.474242       1 list.go:52] certinfo took - 0.512861s
W1202 23:13:04.674375       1 list.go:52] nodeinfo took - 0.599928s
W1202 23:13:04.674500       1 list.go:61] Marshal/Write took - 0.000081s
W1202 23:13:04.766454       1 list.go:32] Functions.List took - 0.005187s
W1202 23:13:04.874378       1 list.go:52] figlet4 took - 0.599734s
W1202 23:13:05.074505       1 list.go:52] figlet took - 0.600202s
W1202 23:13:05.274491       1 list.go:52] certinfo took - 0.507988s
W1202 23:13:05.471950       1 list.go:52] figlet5 took - 0.597535s
W1202 23:13:05.674611       1 list.go:52] figlet1 took - 0.600054s
W1202 23:13:05.874562       1 list.go:52] figlet took - 0.600020s
W1202 23:13:06.074191       1 list.go:52] nodeinfo took - 0.602213s
W1202 23:13:06.074319       1 list.go:61] Marshal/Write took - 0.000087s
W1202 23:13:06.274449       1 list.go:52] figlet2 took - 0.599780s
W1202 23:13:06.474518       1 list.go:52] figlet1 took - 0.599905s
W1202 23:13:06.674474       1 list.go:52] figlet3 took - 0.399967s
W1202 23:13:06.874438       1 list.go:52] figlet2 took - 0.399872s
W1202 23:13:07.072574       1 list.go:52] figlet4 took - 0.398059s
W1202 23:13:07.274146       1 list.go:52] figlet3 took - 0.399659s
W1202 23:13:07.474255       1 list.go:52] figlet5 took - 0.401645s
W1202 23:13:07.674566       1 list.go:52] figlet4 took - 0.400369s
W1202 23:13:07.874167       1 list.go:52] nodeinfo took - 0.399864s
W1202 23:13:07.874323       1 list.go:61] Marshal/Write took - 0.000115s

It seems that the getReplicas function is a source of latency, adding about 0.5s per function in the cluster: https://github.com/openfaas-incubator/openfaas-operator/blob/master/pkg/server/replicas.go#L52
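
For reference, the instrumentation followed this pattern (a sketch only; listFunctions and getReplicas below are hypothetical stand-ins for the operator's Function CRD list call and per-function Deployment lookup, not its real signatures):

package main

import (
	"flag"
	"time"

	"github.com/golang/glog"
)

// listFunctions and getReplicas are hypothetical stand-ins for the
// operator's Function CRD list and per-function Deployment lookup.
func listFunctions() []string                { return []string{"figlet", "nodeinfo"} }
func getReplicas(name string) (int32, error) { return 1, nil }

func main() {
	flag.Parse() // glog registers its flags (e.g. -logtostderr)

	start := time.Now()
	functions := listFunctions()
	glog.Warningf("Functions.List took - %fs", time.Since(start).Seconds())

	for _, name := range functions {
		fnStart := time.Now()
		if _, err := getReplicas(name); err != nil {
			glog.Warningf("%s failed: %s", name, err)
			continue
		}
		glog.Warningf("%s took - %fs", name, time.Since(fnStart).Seconds())
	}
	glog.Flush()
}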


alexellis commented Dec 2, 2018

Listing the functions and then listing the deployments once seems to be faster and more consistent than listing the functions and then querying each deployment individually. With that change I can't reproduce the issue described above.
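
A sketch of that list-then-index approach using client-go's apps/v1 API (illustrative, not the exact change made to the operator; newer client-go versions take a context.Context as the first argument to List):

package server

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listReplicas fetches all Deployments in the namespace with a single
// API call and indexes them by name, instead of one Get per function.
func listReplicas(clientset kubernetes.Interface, namespace string) (map[string]int32, error) {
	deployments, err := clientset.AppsV1().Deployments(namespace).List(metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	replicas := make(map[string]int32, len(deployments.Items))
	for _, dep := range deployments.Items {
		replicas[dep.Name] = dep.Status.AvailableReplicas
	}
	return replicas, nil
}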


alexellis commented Dec 3, 2018

@stefanprodan suggested this behaviour may be related to rate limiting enforced by the Kubernetes API Server.

Given that my approach reduces the number of API calls, this would make sense. faas-netes also does a List followed by iteration rather than a Get for each of the N deployments. Stefan's suggestion is to use a shared informer, which caches the results it receives from the Kubernetes API server rather than calling out to the server on every request.

I'll test this first with Get; if it resolves the issue, I'd probably still recommend using List for consistency.
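
A rough sketch of wiring up a shared informer for Deployments with client-go (illustrative only; the function name and setup here are assumptions, and the operator's actual wiring may differ):

package server

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
	"k8s.io/client-go/tools/cache"
)

// startDeploymentInformer starts a shared informer that maintains a local,
// continuously-updated cache of Deployments, so the list endpoint can read
// from memory instead of calling the API server on every request.
func startDeploymentInformer(clientset kubernetes.Interface, namespace string, stopCh <-chan struct{}) (appslisters.DeploymentLister, error) {
	factory := informers.NewSharedInformerFactoryWithOptions(clientset, 30*time.Second,
		informers.WithNamespace(namespace))

	deployments := factory.Apps().V1().Deployments()
	factory.Start(stopCh)

	if !cache.WaitForCacheSync(stopCh, deployments.Informer().HasSynced) {
		return nil, fmt.Errorf("timed out waiting for the Deployment cache to sync")
	}
	return deployments.Lister(), nil
}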


alexellis commented Dec 3, 2018

The image openfaas/openfaas-operator:0.8.12-rc6 uses a shared informer together with Get - this resolves the latency issue.
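
With the informer's lister, the per-function lookup becomes an in-memory read (a sketch, not the exact code from the linked PR):

package server

import (
	appslisters "k8s.io/client-go/listers/apps/v1"
)

// replicasFor reads a Deployment from the informer's local cache, avoiding
// a round-trip to the API server on every request.
func replicasFor(lister appslisters.DeploymentLister, namespace, functionName string) (int32, error) {
	dep, err := lister.Deployments(namespace).Get(functionName)
	if err != nil {
		return 0, err
	}
	return dep.Status.AvailableReplicas, nil
}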

alexellis referenced this issue Dec 3, 2018

Merged

Fix rate limiting latency for list endpoint #65


alexellis commented Dec 4, 2018

Derek close: via #65

derek bot closed this Dec 4, 2018
