
z2jh 0.10.6 scale test failure with kube-scheduler v1.19.2 #2025

mriedem opened this issue Feb 5, 2021 · 24 comments
mriedem commented Feb 5, 2021

Bug description

Yesterday I upgraded our testing environment cluster from z2jh 0.9.0 to 0.10.6. This is on IBM Cloud's Kubernetes service, running Kubernetes 1.18.5. There is nothing very magical about this deployment; we have a custom authenticator but that's about it. All user notebook pod storage is backed by an s3fs object store rather than PVCs, and we use placeholder pods to scale up user nodes in preparation for scheduled large user events.

Before promoting the change to our pre-production cluster I wanted to run a scale test of 2000 active user notebook pods using https://github.com/IBM/jupyter-tools. I started by scaling the user-placeholder replicas to 2000, which created ~71 additional user nodes (the profile we use on the pods in the testing cluster has a much smaller resource footprint, so I probably would have been fine with 1K placeholders to scale up the user nodes, but alas).
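For reference, the scaling itself is just a matter of bumping the placeholder statefulset's replica count, either through the chart value scheduling.userPlaceholder.replicas or imperatively against the statefulset (the namespace and release below are specific to our deployment):

$ kubectl -n jhub scale statefulset/user-placeholder --replicas=2000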

Then I ran the hub-stress-test script with -c 2000, which creates 2000 users and then starts the notebook servers (pods) in batches of 10. When the scale test got to around 500-600 active user pods I was seeing a lot of timeout failures and consecutive spawn failures in the hub logs, which was causing the hub to crash, e.g.:

Feb 4 17:59:26 hub-58c84bf48b-hdqfg hub CRITICAL CRITICAL 2021-02-04T23:59:26.202Z [JupyterHub base:976] Aborting due to 5 consecutive spawn failures
Feb 4 18:04:45 hub-58c84bf48b-hdqfg hub CRITICAL CRITICAL 2021-02-05T00:04:45.924Z [JupyterHub base:976] Aborting due to 5 consecutive spawn failures
Feb 4 18:04:47 hub-58c84bf48b-hdqfg hub CRITICAL CRITICAL 2021-02-05T00:04:47.932Z [JupyterHub base:976] Aborting due to 6 consecutive spawn failures

Later I also noticed that the placeholder pods did not appear to be pre-empted, so I started digging into the user-scheduler pod logs and saw this:

Feb 4 19:58:30 user-scheduler-6c4d89654-hgrqd user-scheduler fatal fatal error: concurrent map writes

And there were a lot of restarts of the 2 user-scheduler pod replicas:

$ kubectl get all -n jhub | grep user-scheduler
pod/user-scheduler-6c4d89654-hgrqd 0/1 CrashLoopBackOff 28 5h49m
pod/user-scheduler-6c4d89654-l8dcm 0/1 CrashLoopBackOff 28 5h49m
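
To get at the crash output directly, the logs of the previous (crashed) container instance can be pulled, e.g. for the first pod above:

$ kubectl -n jhub logs pod/user-scheduler-6c4d89654-hgrqd --previous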


Resources were not an issue for the user-scheduler pods.

Eventually I found the Kubernetes issue kubernetes/kubernetes#96832, which says the bug was introduced in 1.19. I didn't think that applied to us since we're running Kubernetes 1.18.5, but then I noticed that z2jh 0.10.6 is using kube-scheduler 1.19.2 per 39e0cd5.

So the next thing to try was the backport for that Kubernetes issue: kubernetes/kubernetes#96809

That's in kube-scheduler 1.19.5: https://github.com/kubernetes/kubernetes/blob/v1.19.5/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go

So I upgraded the kube-scheduler version to be v1.19.5 in our helm chart:

scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    image:
      # Pick up the fix for https://github.com/kubernetes/kubernetes/pull/96809;
      # we can remove this when we upgrade to z2jh >= 0.11.0.
      tag: v1.19.5
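
That override was then rolled out with a normal chart upgrade, roughly like the following (assuming the chart repo is registered as jupyterhub and our values live in config.yaml; the release name and namespace are specific to our deployment):

$ helm upgrade jhub jupyterhub/jupyterhub --namespace jhub --version 0.10.6 --values config.yaml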

I then ran our scale test back up to 1K users, and then another 1K after that. So far so good, and the user-placeholder pods were being pre-empted during the scale test as expected.

Furthermore there are no errors in the user-scheduler logs.

Expected behaviour

We should be able to scale our environment up to 2000 or more active user pods with z2jh 0.10.6.

Actual behaviour

See above.

How to reproduce

See above. Note that another user (Matthew Brett) mentioned a similar issue in Gitter on Jan 28.

Your personal set up

See above. If you need anything more specific I'm happy to share, but this seems pretty straightforward.

@mriedem mriedem added the bug label Feb 5, 2021
mriedem commented Feb 5, 2021

I realize that the latest z2jh is using kube-scheduler 1.19.7, so upgrading to the latest z2jh would fix the issue for us, but if possible it'd be good to fix this on the 0.10.x series for those that aren't jumping straight to the 0.11.x series (like us). We have a large user event happening in a couple of weeks, so I'm trying to be conservative with the changes we roll out to our production cluster before that event. For now we've got our fix in our helm chart, but it'd probably be good to patch the z2jh 0.10.x series with kube-scheduler 1.19.5 (at least) for anyone else who upgrades to the 0.10.x series and hits the same problem (like Matthew Brett above).

mriedem commented Feb 5, 2021

If a 0.10.7 release isn't an option, then maybe at least calling out the known issue in the docs is an alternative.

manics commented Feb 5, 2021

Thanks for the writeup!

Is the only required change for 0.10.6 scheduling.userScheduler.image.tag: v1.19.5? If so, I don't think it's worth the overhead of a 0.10.7 release; there are lots of minor bug fixes that are continuously added to Z2JH. If a Z2JH admin goes from 0.10.6 to 0.10.7, that's a deliberate decision on their part, which means reading the changelog, testing, etc., so I don't think a new release makes things much easier anyway.

What do you think about writing up a post on https://discourse.jupyter.org/ with this fix instead? It's more likely to be seen there, and I also think your testing process would be of interest to others.

mriedem commented Feb 5, 2021

Is the only required change for 0.10.6 scheduling.userScheduler.image.tag: v1.19.5

Yes, in this case. I also ran into jupyterhub/kubespawner#466 but that's a different issue which is already fixed in kubespawner 0.15.0 and that's picked up in z2jh 0.11.0.

If so I don't think it's worth the overhead of a 0.10.7 release, there are lots of minor bug fixes that are continuously added to Z2JH. If an Z2JH admin goes from 0.10.6 to 0.10.7 that's a deliberate decision on their part which means reading the changelog, testing, etc, so I don't think a new release makes things much easier anyway.

Yup, good point.

What do you think about writing up a post on https://discourse.jupyter.org/ with this fix instead? It's more likely to be seen there, and I also think your testing process would be of interest to others.

Since I've already got a fix and am mostly trying to raise awareness of this, yeah, I agree that mentioning it in the forum is a good alternative. I'll do that and then close this issue.

@consideRatio

This is an amazing writeup @mriedem!!!! Thank you soooooooooooooooo much for doing this investigation!!!! ❤️ 🎉 I love that the issues @matthew-brett reported can be pinned to a reason, because I was really clueless about them! Nice work tracking this down all the way to kube-scheduler!

Concrete action points that come to my mind:

  • Update the changelog retrospectively to mark 0.11.0 as fixing this bug through dep: bump kube-scheduler from 1.19.2 to 1.19.7 #1981.
  • Document our ambition to backport bugfixes and/or security fixes.
    At this point I think we need more maintenance capacity overall in the project to manage multiple patch releases by backporting to older versions, but this also relates to the experience of doing so and how much time and effort it takes given that experience. I personally lack that experience, but I know that @minrk is more efficient, for example, so perhaps it can become more and more viable over time.

mriedem commented Feb 5, 2021

FWIW, my second run of the scale test, to go from 1K to 2K active user notebook pods, started hitting failures around 1830 total active user pods. The hub restarted twice during that second run. There were no restarts of the user-scheduler pods. Unfortunately, while this run was going on there was an issue with IBM's LogDNA service, so I don't easily have an idea of what the hub errors were, but I'm assuming they were crashes due to reaching the consecutive spawn failure limit.

Another thing I noticed is that the user-placeholders didn't dip as much as I'd expect when adding 1000 user pods; this is the current state:

$ kubectl -n jhub get statefulset/user-placeholder
NAME               READY       AGE
user-placeholder   1895/2000   157d

And a handful more user nodes were scaled up, so it seems the issue with placeholder pod pre-emption is not totally gone with that kube-scheduler fix. I was looking at #1837 and wondering if that might be affecting this, but it's a bit over my head how it could change the behavior from what we had working before (with z2jh 0.9.0-n212.hdd4fd8bd).

@consideRatio

@mriedem does kubectl logs deploy/hub --previous work or not?

@mriedem I concluded that 0.9.0-n246.h3d6e7579 introduced a new system in #1778. Before that, kube-scheduler v1.16.11 was used. 0.9.0-n212 was released 04 September 2020.

Backtracking in time, 0.9.0-n212 was using the old kube-scheduler version, v1.16.11, and the old configuration of it as well.

How we started kube-scheduler

            - /usr/local/bin/kube-scheduler
            - --scheduler-name={{ .Release.Name }}-user-scheduler
            - --policy-configmap=user-scheduler
            - --policy-configmap-namespace={{ .Release.Namespace }}
            - --lock-object-name=user-scheduler
            - --lock-object-namespace={{ .Release.Namespace }}
            - --leader-elect-resource-lock=configmaps
            - --v={{ .Values.scheduling.userScheduler.logLevel | default 4 }}

How we configured kube-scheduler

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "HostName" },
    { "name": "PodFitsHostPorts" },
    { "name": "MatchNodeSelector" },
    { "name": "NoDiskConflict" },
    { "name": "PodToleratesNodeTaints" },
    { "name": "MaxEBSVolumeCount" },
    { "name": "MaxGCEPDVolumeCount" },
    { "name": "MaxAzureDiskVolumeCount" },
    { "name": "MaxCSIVolumeCountPred" },
    { "name": "CheckVolumeBinding" },
    { "name": "NoVolumeZoneConflict" },
    { "name": "MatchInterPodAffinity" }
  ],
  "priorities": [
    { "name": "NodePreferAvoidPodsPriority",  "weight": 161051 },
    { "name": "NodeAffinityPriority",         "weight": 14641 },
    { "name": "InterPodAffinityPriority",     "weight": 1331 },
    { "name": "MostRequestedPriority",        "weight": 121 },
    { "name": "ImageLocalityPriority",        "weight": 11}
  ],
  "hardPodAffinitySymmetricWeight" : 100,
  "alwaysCheckAllPredicates" : false
}

Currently, for anyone in k8s 1.16+, we use kube-scheduler 1.19.7 and...

            - /usr/local/bin/kube-scheduler
            # NOTE: --leader-elect-... (new) and --lock-object-... (deprecated)
            #       flags are silently ignored in favor of whats defined in the
            #       passed KubeSchedulerConfiguration whenever --config is
            #       passed.
            #
            # ref: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
            #
            # NOTE: --authentication-skip-lookup=true is used to avoid a
            #       seemingly harmless error, if we need to not skip
            #       "authentication lookup" in the future, see the linked issue.
            #
            # ref: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1894
            {{- if .Capabilities.APIVersions.Has "storage.k8s.io/v1/CSINode" }}
            - --config=/etc/user-scheduler/config.yaml
            - --authentication-skip-lookup=true
    # plugins ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins-1
    plugins:
      score:
        disabled:
          - name: SelectorSpread
          - name: TaintToleration
          - name: PodTopologySpread
          - name: NodeResourcesBalancedAllocation
          - name: NodeResourcesLeastAllocated
          # Disable plugins to be allowed to enable them again with a different
          # weight and avoid an error.
          - name: NodePreferAvoidPods
          - name: NodeAffinity
          - name: InterPodAffinity
          - name: ImageLocality
        enabled:
          - name: NodePreferAvoidPods
            weight: 161051
          - name: NodeAffinity
            weight: 14631
          - name: InterPodAffinity
            weight: 1331
          - name: NodeResourcesMostAllocated
            weight: 121
          - name: ImageLocality
            weight: 11

@consideRatio

Another thing I noticed is that the user-placeholders didn't dip as much as I'd expect when adding 1000 user pods, this is current:

$ kubectl -n jhub get statefulset/user-placeholder
NAME               READY       AGE
user-placeholder   1895/2000   157d

And a handful more of user nodes were scaled up, so it seems the issues with the placeholder pod pre-emption is not totally gone with that kube-scheduler fix.

@mriedem Hmmm... that reads to me as if 105 user-placeholder pods are pending or similar, and that there were massive amounts of available seats for real users. Were there real user pods in a pending state at this point?

I wonder if there is a limit on how much eviction can happen in a short amount of time, or something like that. I think that when the scheduler concludes it cannot schedule a pod, it will ask something else whether a lower-priority pod could be evicted to make room for the real user pod, rather than doing that itself. I'm not sure about this though, but to better diagnose this issue it would be very relevant to understand how the scheduler's decisions lead to evictions.

Hmmm... @mriedem have you also inspected the resource requests on the scheduler, btw? If it fails to keep up with attempts to schedule, eviction won't happen fast enough either. Perhaps having more user-scheduler pods or more resources allocated to them would help?

# here is mybinder.org-deploy's GKE clusters user-scheduler pods right now
$ kubectl top pod -l component=user-scheduler
NAME                              CPU(cores)   MEMORY(bytes)   
user-scheduler-769b46d576-7dh8q   3m           53Mi            
user-scheduler-769b46d576-lppjz   3m           43Mi

mriedem commented Feb 5, 2021

@mriedem does kubectl logs deploy/hub --previous work or not?

Got the logs back; it was what I suspected, the consecutive spawn failure limit being reached:

Feb 5 11:02:21 hub-85499595cb-46b7s hub CRITICAL CRITICAL 2021-02-05T17:02:21.575Z [JupyterHub base:976] Aborting due to 6 consecutive spawn failures

Also seeing some of these; not sure where they might be coming from:

Feb 5 10:43:32 hub-85499595cb-46b7s hub ERROR ERROR 2021-02-05T16:43:32.747Z [JupyterHub reflector:270] Error when watching resources, retrying in 0.2s
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 210, in _watch_and_update
        resource_version = self._list_and_update()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 155, in _list_and_update
        initial_resources = getattr(self.api, self.list_method_name)(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 15302, in list_namespaced_pod
        return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 15413, in list_namespaced_pod_with_http_info
        return self.api_client.call_api(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
        return self.__call_api(resource_path, method,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
        response_data = self.request(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 373, in request
        return self.rest_client.GET(url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 239, in GET
        return self.request("GET", url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 233, in request
        raise ApiException(http_resp=r)
    kubernetes.client.exceptions.ApiException: (500)
    Reason: Internal Server Error
    HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 05 Feb 2021 16:43:32 GMT', 'Content-Length': '150'})
    HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rpc error: code = Unavailable desc = transport is closing","code":500}\n'

@mriedem Hmmm... that reads to me that 105 user-placeholder pods are pending or similar and that there were massive amounts of available seats for real users. Were there real user pods in a pending state at this point?

After stepping away for a bit, here is where the placeholder stateful set stands:

$ kubectl -n jhub get statefulset/user-placeholder
NAME               READY       AGE
user-placeholder   1901/2000   157d

I don't see any pending real user pods:

osboxes@osboxes:~$ kubectl -n jhub get pod -l component=singleuser-server | grep Pending
osboxes@osboxes:~$ 
$ kubectl -n jhub get pod -l component=singleuser-server | grep Running | wc -l
1825

Hmmm... @mriedem have you inspected the resource requests on the scheduler also btw? If it fail to keep up with attempts to schedule, eviction won't happen fast enough either. Perhaps having more user-scheduler pods or more resources allocated to them will help?

Maybe. I can try scaling up the user-scheduler replicas. Here is what we have currently (though I don't have a scale test running at the moment):

$ kubectl top pod -l component=user-scheduler -n jhub
NAME                              CPU(cores)   MEMORY(bytes)   
user-scheduler-55dff7c5d7-j6zpl   68m          238Mi           
user-scheduler-55dff7c5d7-tc2s8   8m           157Mi 

The resource requirements are the same as the z2jh defaults:

$ kubectl describe -n jhub deploy/user-scheduler
Name:                   user-scheduler
Namespace:              jhub
CreationTimestamp:      Tue, 01 Sep 2020 07:49:11 -0400
Labels:                 app=jupyterhub
                        app.kubernetes.io/managed-by=Helm
                        chart=jupyterhub-0.10.6
                        component=user-scheduler
                        heritage=Helm
                        release=jhub
Annotations:            deployment.kubernetes.io/revision: 4
                        meta.helm.sh/release-name: jhub
                        meta.helm.sh/release-namespace: jhub
Selector:               app=jupyterhub,component=user-scheduler,release=jhub
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=jupyterhub
                    component=user-scheduler
                    release=jhub
  Annotations:      checksum/config-map: 5ca07792017036f4ad0509d5fcabdbe2e3c972a9e94ca111a0f16a13363682f1
  Service Account:  user-scheduler
  Containers:
   user-scheduler:
    Image:      k8s.gcr.io/kube-scheduler:v1.19.5
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/kube-scheduler
      --config=/etc/user-scheduler/config.yaml
      --authentication-skip-lookup=true
      --v=4
    Requests:
      cpu:        50m
      memory:     256Mi
    Liveness:     http-get http://:10251/healthz delay=15s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:10251/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/user-scheduler from config (rw)
  Volumes:
   config:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               user-scheduler
    Optional:           false
  Priority Class Name:  jhub-default-priority
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   user-scheduler-55dff7c5d7 (2/2 replicas created)
Events:          <none>

I did see these events when describing the statefulset/user-placeholder:

  Warning  FailedCreate      27m                     statefulset-controller  create Pod user-placeholder-1930 in StatefulSet user-placeholder failed error: The POST operation against Pod could not be completed at this time, please try again.
  Warning  FailedCreate      27m                     statefulset-controller  create Pod user-placeholder-1810 in StatefulSet user-placeholder failed error: The POST operation against Pod could not be completed at this time, please try again.

Checking some logs on that first pod I see this:

Feb 5 13:21:45 user-scheduler-55dff7c5d7-j6zpl user-scheduler info I0205 19:21:45.521366       1 scheduling_queue.go:812] About to try and schedule pod jhub/user-placeholder-1930
Feb 5 13:21:45 user-scheduler-55dff7c5d7-j6zpl user-scheduler info I0205 19:21:45.521385       1 scheduler.go:452] Attempting to schedule pod: jhub/user-placeholder-1930
Feb 5 13:21:45 user-scheduler-55dff7c5d7-j6zpl user-scheduler info I0205 19:21:45.548070       1 factory.go:449] "Unable to schedule pod; no fit; waiting" pod="jhub/user-placeholder-1930" err="0/82 nodes are available: 13 Too many pods, 2 node(s) didn't match node selector, 67 Insufficient memory."
Feb 5 13:21:45 user-scheduler-55dff7c5d7-j6zpl user-scheduler info I0205 19:21:45.548319       1 scheduler.go:347] Updating pod condition for jhub/user-placeholder-1930 to (PodScheduled==False, Reason=Unschedulable) 

So maybe the run up to 1000 real users was OK, but the next step up to 2000 users tipped something over in the scheduler as it tries to account for where all of the other pods are located across what is now 80 user nodes? IOW, as we add more users the scheduler is working harder and making mistakes.

mriedem commented Feb 5, 2021

Also, regarding user-scheduler deployment resources: during the last scale-up runs the user-scheduler pods didn't seem to break a sweat.

mriedem commented Feb 5, 2021

I should maybe note that our deployment only uses a single user notebook image, and the singleuser pod resource specs in our pre-prod and production clusters are:

  cpu:
    limit: 8
    guarantee: 0.2
  memory:
    limit: 8G
    guarantee: 1G

However, in our testing cluster (where I'm doing the scale testing) the singleuser pods have this profile:

  {
    "display_name": "micro",
    "slug": "micro",
    "description": "Useful for scale testing a lot of pods",
    "default": true,
    "kubespawner_override": {
      "cpu_guarantee": 0.015,
      "cpu_limit": 1,
      "mem_guarantee": "64M",
      "mem_limit": "1G"
    }
  }

But the placeholder pod template in all environments is this:

Pod Template:
  Labels:  app=jupyterhub
           component=user-placeholder
           release=jhub
  Containers:
   pause:
    Image:      k8s.gcr.io/pause:3.2
    Port:       <none>
    Host Port:  <none>
    Requests:
      memory:           1G
    Environment:        <none>
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  jhub-user-placeholder-priority

So in the scale tests we end up with a lot of those micro pods, and I've really inflated the number of user nodes since scaling the placeholder replicas up to 2000. So I'm not surprised to see "Insufficient memory" above when trying to schedule more placeholders.

I'm going to scale the placeholder replicas down to 1000, deploy the hub with kubespawner 0.15.0, purge the existing singleuser notebook pods and then redo the scale test.

@consideRatio

@mriedem ooooh but hey!

These are the defaults for the user-scheduler's resource requests, I think!! It could very well be throttled.

scheduling:
  userScheduler:
    resources:
      requests:
        cpu: 50m
        memory: 256Mi

Can you try increasing the CPU request or limit? I don't think adding replicas will matter, as it seems only one of them is taking most of the load.
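
If it helps, a sketch of bumping those requests via the chart values (the numbers here are illustrative, not a tested recommendation):

scheduling:
  userScheduler:
    resources:
      requests:
        cpu: 500m        # default is 50m
        memory: 512Mi    # default is 256Mi
      limits:
        cpu: "1"
        memory: 1Gi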


It can also be that the k8s-apiserver is failing to respond quickly enough and that is throttling things.

There are plenty of things that could bottleneck scheduling:

  1. k8s api-server is overburdened
  2. the k8s component that evicts user-placeholder pods (so that higher-priority user pods can be scheduled) is throttled for some reason
  3. the user-scheduler is not scheduling pods fast enough
# get pending real-user pods
kubectl get pods -l component=singleuser-server -o json | jq -r '
  .items[]
  | select(.status.phase == "Pending") | .metadata.name
' | wc -l

@meeseeksmachine

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/scheduling-errors-with-z2jh-0-10-x/7826/1

consideRatio commented Feb 5, 2021

Investigation and misc...

kube-scheduler will not consider all nodes

If you don't specify a threshold, Kubernetes calculates a figure using a linear formula that yields 50% for a 100-node cluster and yields 10% for a 5000-node cluster. The lower bound for the automatic value is 5%.

https://kubernetes.io/docs/concepts/scheduling-eviction/scheduler-perf-tuning/
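
For reference, that threshold can be pinned explicitly in the KubeSchedulerConfiguration the chart passes via --config; a sketch for the v1beta1 config API that kube-scheduler 1.19 reads (untested with the chart):

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
# score all filtered nodes instead of the auto-computed percentage (~50% at ~100 nodes)
percentageOfNodesToScore: 100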

kube-scheduler's post-filter plugin does the pre-emption / eviction

It seems that kube-scheduler is doing the pre-emption / eviction itself as part of a post-filter plugin. Filters rule out nodes for scheduling, so the default-preemption post-filter plugin will probably trigger if there was no node to schedule on at all. At that point, it will pre-empt pods with lower priority on nodes to accomplish its goal of making room for the pod it wants to schedule.

https://github.com/kubernetes/kubernetes/blob/036cab71a6faefa84b10a199a61bcdc38e3572c3/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L91

Overview of kube-scheduler configuration / plugins etc

https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/

Off topic idea to solve another issue

There is an issue about user pods being scheduled on freshly booted nodes without images pulled when there were nodes with warm seats, but those nodes were full because user-placeholders hogged the room. In those situations, it would be preferable to evict a user-placeholder pod to allow room for a user pod on the older node with images already pre-pulled.

If we could make a plugin that deletes a scheduled user-placeholder pod for each real user pod to be scheduled, we could solve this, I think. This assumes the scheduler will sort the pods based on priority before trying to schedule them, which it does.

mriedem commented Feb 6, 2021

An update before I drop off for the weekend. I made some changes (the resource-related helm values are sketched after this list):

  • Bumped kubespawner to 0.15.0 to get around that TypeError on hub restart
  • Doubled the guaranteed resources for the user-scheduler pods (cpu=100m, memory=512Mi)
  • Made our default singleuser mem guarantee match our "micro" profile of just 64M rather than the default 1G from z2jh (this was to make scaling up placeholders match resource consumption for our actual singleuser pods in our testing cluster)
  • Purged all of the servers from the previous scale test run
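
Roughly, the values changes above correspond to the following (our deployment's numbers; the kubespawner bump and server purge aren't shown):

scheduling:
  userScheduler:
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
singleuser:
  memory:
    guarantee: 64M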

This time I scaled up the placeholder pods to 2400 to get 24 user nodes to start. I also watched the user-scheduler top usage.

Again, placeholder pods were being evicted, but apparently not fast enough, because more user nodes were added by the auto-scaler. Placeholder pods got down to ~1700 but then started going back up as the new user nodes were added. It almost seems like the scheduler is torn between adding user nodes for the placeholder pods to satisfy the stateful set and the actual user pods, which have higher priority.

One of the user-scheduler pods hit a probe failure and was restarted.


Average user-scheduler pod core usage was around 0.3.

The hub crashed a couple of times due to the consecutive pod spawn failure limit being reached.

I think that's just due to contention between the user pods and the placeholder pods and the pre-emption not working like we'd expect.

To try and cut off the placeholders I scaled those down to 1000 so the user pods could take over. At that point I was up to 33 user nodes.

The hub API response times are all less than 1s on average as well.

Now that the run is done here are the user pod stats:

osboxes@osboxes:~$ kubectl -n jhub get pods -l component=singleuser-server | grep -c Running
1710
osboxes@osboxes:~$ kubectl -n jhub get pods -l component=singleuser-server | grep -c OutOfpods
13

So over 250 failed to spawn.

The hub was restarted 4 times:

$ kubectl -n jhub get pods -l component=hub
NAME                   READY   STATUS    RESTARTS   AGE
hub-6cdf5d66c8-jq55f   1/1     Running   4          152m

The user-scheduler pods restarted once:

$ kubectl -n jhub get pods -l component=user-scheduler
NAME                             READY   STATUS    RESTARTS   AGE
user-scheduler-7d69c8cb9-f6ftt   1/1     Running   1          3h13m
user-scheduler-7d69c8cb9-hc4tj   1/1     Running   0          3h12m

Looks like this is the error from the user-scheduler pod that crashed:

Feb 5 16:24:15 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 22:24:15.322755       1 request.go:1001] Unexpected error when reading response body: http2: server sent GOAWAY and closed the connection; LastStreamID=153, ErrCode=NO_ERROR, debug=""
Feb 5 16:24:15 user-scheduler-7d69c8cb9-f6ftt user-scheduler Error E0205 22:24:15.322856       1 scheduler.go:342] Error updating pod jhub/user-placeholder-396: unexpected error when reading response body. Please retry. Original error: http2: server sent GOAWAY and closed the connection; LastStreamID=153, ErrCode=NO_ERROR, debug=""
Feb 5 17:12:05 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 23:12:05.705182       1 factory.go:477] pod: jhub/user-placeholder-2335 is already present in the active queue
Feb 5 17:38:02 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 23:38:02.229964       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
Feb 5 17:38:02 user-scheduler-7d69c8cb9-f6ftt user-scheduler error panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0
Feb 5 17:51:10 user-scheduler-7d69c8cb9-hc4tj user-scheduler error E0205 23:51:10.508533       1 default_preemption.go:596] Error preempting pod jhub/user-placeholder-1347: pods "user-placeholder-1347" not found
Feb 5 17:51:10 user-scheduler-7d69c8cb9-hc4tj user-scheduler error E0205 23:51:10.508616       1 scheduler.go:474] Status after running PostFilter plugins for pod jhub/jupyter-hub-2dstress-2dtest-2d1021: &{1 [pods "user-placeholder-1347" not found]} 

Next steps for me are to go through the notes/links in the comment above.

mriedem commented Feb 10, 2021

One thought that came to mind: we are currently using the same resources for the user-placeholder pods as for our singleuser notebook pods, but do people generally create fewer placeholder replicas and make them larger? We had a large user event last year where the scheduler and auto-scaler didn't seem to be packing singleuser pods like we expected, so we changed the placeholder pods to be much larger, which then evicted them and freed up space on the user nodes for the singleuser pods.

mriedem commented Feb 10, 2021

Feb 5 17:38:02 user-scheduler-7d69c8cb9-f6ftt user-scheduler error panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0

That might be kubernetes/kubeadm#2284 but it looks to be fixed in 1.19.5:

kubernetes/kubernetes@922b918

So I'm not sure why we'd be hitting that since we upgraded kube-scheduler to 1.19.5. Unless I misunderstand where that fix gets applied and we just haven't done it yet.
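
One way to double-check which image tag the user-scheduler pods are actually running (assuming the same labels and namespace as earlier):

$ kubectl -n jhub get pods -l component=user-scheduler \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'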

@consideRatio

It would be interesting to see some kube-scheduler metrics. Perhaps that could give some insights as well.

Googling I found some grafana dashboard definitions: https://grafana.com/grafana/dashboards/12130/reviews

But the user-scheduler needs to have a service targeting it, with annotations so that Prometheus scrapes from it, and it needs to have ports opened for scraping.
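
A rough sketch of such a Service, assuming a Prometheus set up to honor the common prometheus.io/* scrape annotations, and using kube-scheduler's HTTP port 10251 (the same port the chart's probes hit):

apiVersion: v1
kind: Service
metadata:
  name: user-scheduler-metrics
  namespace: jhub
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10251"
spec:
  selector:
    app: jupyterhub
    component: user-scheduler
    release: jhub
  ports:
    - name: metrics
      port: 10251
      targetPort: 10251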

Also, I'm not confident how this will be influenced by the fact that we have two scheduler pods in a leader-election setup; will one sometimes reply saying it's calm and easy, while at other times the other replies saying it is very busy?


https://help.sumologic.com/Metrics/Kubernetes_Metrics#scheduler-kube-scheduler-metrics

Feb 5 16:24:15 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 22:24:15.322755       1 request.go:1001] Unexpected error when reading response body: http2: server sent GOAWAY and closed the connection; LastStreamID=153, ErrCode=NO_ERROR, debug=""
Feb 5 16:24:15 user-scheduler-7d69c8cb9-f6ftt user-scheduler Error E0205 22:24:15.322856       1 scheduler.go:342] Error updating pod jhub/user-placeholder-396: unexpected error when reading response body. Please retry. Original error: http2: server sent GOAWAY and closed the connection; LastStreamID=153, ErrCode=NO_ERROR, debug=""
Feb 5 17:12:05 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 23:12:05.705182       1 factory.go:477] pod: jhub/user-placeholder-2335 is already present in the active queue
Feb 5 17:38:02 user-scheduler-7d69c8cb9-f6ftt user-scheduler error E0205 23:38:02.229964       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
Feb 5 17:38:02 user-scheduler-7d69c8cb9-f6ftt user-scheduler error panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0
Feb 5 17:51:10 user-scheduler-7d69c8cb9-hc4tj user-scheduler error E0205 23:51:10.508533       1 default_preemption.go:596] Error preempting pod jhub/user-placeholder-1347: pods "user-placeholder-1347" not found
Feb 5 17:51:10 user-scheduler-7d69c8cb9-hc4tj user-scheduler error E0205 23:51:10.508616       1 scheduler.go:474] Status after running PostFilter plugins for pod jhub/jupyter-hub-2dstress-2dtest-2d1021: &{1 [pods "user-placeholder-1347" not found]}

I note that there are significant time delays between various errors here, and only the last two lines could have been the cause of the restart I think.

mriedem commented Feb 13, 2021

To sort of close the loop on our placeholder eviction problem, @rmoe did some testing and found that with a much larger placeholder pod (8GB memory) and far fewer placeholder replicas, we were able to scale up to our previously expected numbers (3K singleuser pods) without hitting spawn failures, the user node auto-scaler kicking in, or the hub crashing. It must be something about the K8S scheduler in newer releases (1.18+): it was packing more placeholders onto a user node but not evicting them to make room for singleuser pods when those pods were the same size, or maybe it was not evicting fast enough, while trying to scale the placeholders back up, before the auto-scaler kicked in.
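
For anyone wanting to make that kind of in-place change, a sketch of bumping the placeholder memory request on the statefulset (the container index assumes the single pause container shown earlier; z2jh did not expose this via values at the time):

$ kubectl -n jhub patch statefulset user-placeholder --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "8G"}]'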

consideRatio commented Feb 13, 2021

@mriedem thank you so much for reporting back this hard gained experience!!!

Configuring user-placeholder resources
I recently discovered that the user-placeholder resources may not be customizable. How did you, in practice, customize the user-placeholder's resource requests to be larger than the normal users'? kubectl edit on the statefulset, or did you try to do it through the helm chart?

Here is the code where the resources are set, by referencing a complicated helper function that does not consider the scheduling.userPlaceholder.resources config:

containers:
  - name: pause
    image: {{ .Values.prePuller.pause.image.name }}:{{ .Values.prePuller.pause.image.tag }}
    resources:
      {{- include "jupyterhub.resources" . | nindent 12 }}
    {{- with .Values.scheduling.userPlaceholder.containerSecurityContext }}

Another workaround idea

By using larger placeholder pods you reduce the performance required from the user-scheduler and the k8s api-server it interacts with. I wonder what is bottlenecking; it could be that the user-scheduler in its new form is making more requests than before and is being throttled because of that. Consider, for example, that "server sent GOAWAY and closed the connection" was logged at some point.

With this in mind, I wonder how many k8s api-server requests and such are made given various configurations of the kube-scheduler binary. What if we configured it to be a bit more crude in how it schedules pods in order to reduce the number of requests made?

More concretely, what would happen if the changes introduced with 0.10.0 were reverted to some degree? What if, instead of enabling a larger set of node scoring plugins, we only used a single one?

scheduling:
  userScheduler:
    enabled: true
    replicas: 2
    logLevel: 4
    # plugins ref: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins-1
    plugins:
      score:
        disabled:
          - name: SelectorSpread
          - name: TaintToleration
          - name: PodTopologySpread
          - name: NodeResourcesBalancedAllocation
          - name: NodeResourcesLeastAllocated
          # Disable plugins to be allowed to enable them again with a different
          # weight and avoid an error.
          - name: NodePreferAvoidPods
          - name: NodeAffinity
          - name: InterPodAffinity
          - name: ImageLocality
        # What if we never enable the NodeAffinity, InterPodAffinity, ImageLocality again?
        # I suspect InterPodAffinity and/or ImageLocality could cause a significant performance hit in practice.
        enabled:
          - name: NodePreferAvoidPods
            weight: 161051
          - name: NodeResourcesMostAllocated
            weight: 121
        # For reference, these are the plugins the chart enables by default:
        #   enabled:
        #     - name: NodePreferAvoidPods
        #       weight: 161051
        #     - name: NodeAffinity
        #       weight: 14631
        #     - name: InterPodAffinity
        #       weight: 1331
        #     - name: NodeResourcesMostAllocated
        #       weight: 121
        #     - name: ImageLocality
        #       weight: 11

mriedem commented Feb 15, 2021

I recently discovered that the user-placeholder resources may be uncustomizable. How did you in practice customize the user-placeholder's resource requests to be larger than the normal users in practice? kubectl edit on the statefulset or did you try do it through the helm chart?

@rmoe has been editing the user-placeholder stateful set manually. We think we might have a workaround, though: use a default profile via c.KubeSpawner.profile_list to override the singleuser resources from the helm chart, and then set singleuser.memory.guarantee: 8G in the helm chart values so it is used by the placeholder pod spec exclusively, at least until specifying resources for placeholders is supported by z2jh.
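
A sketch of what that workaround could look like in the chart values (untested; the profile mirrors the shape of the "micro" profile above, and the chart-level guarantee then only ends up on the placeholder pods):

singleuser:
  memory:
    guarantee: 8G  # inherited by the user-placeholder pod spec
hub:
  extraConfig:
    defaultProfile: |
      c.KubeSpawner.profile_list = [
          {
              "display_name": "default",
              "slug": "default",
              "default": True,
              "kubespawner_override": {
                  # what real user pods should actually be guaranteed
                  "mem_guarantee": "1G",
              },
          },
      ]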

@consideRatio

Hey @mriedem, looking into this issue again: specifying resources for the user placeholder specifically is supported at this point in time, at least via scheduling.userPlaceholder.resources. It was implemented in #2055 and is part of 1.0.0+.

@consideRatio

@yuvipanda do you have references to your work on using user placeholders for entire nodes?

I recall that you had managed to improve performance by using fewer and larger user placeholders that fit entire nodes instead of having many smaller ones, and that we concluded it was good for other reasons as well. For example, one would never end up with a node that is partially full of user placeholder pods, but would always manage to pack nodes full of real users.

I figure I'll write an issue about this in this repo, try to cover the discussion regarding the choice of user-placeholders vs node-placeholders, and consider how we can make it easy for users to work with node placeholders.

I wonder if we could get node placeholders with strict anti-affinity or similar, and nodeSelector labels, rather than specific resource requests? I'd like to deliberate a bit on that in a dedicated issue anyhow; having a reference to your past work on it would be great!

@consideRatio

I spoke with @Ryan-McPartlan about scaling issues and wanted to tag Ryan here for visibility. Also, @yuvipanda has discussed using "node placeholder" pods instead of "user placeholder" pods, as well as relying on for example GKE's pod scheduler instead of deploying one of our own.
