
/stats/summary endpoints failing with "Context Deadline Exceeded" on Windows nodes with very high CPU usage #95735

Closed
marosset opened this issue Oct 20, 2020 · 21 comments · Fixed by #95950 or #96051
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

@marosset
Contributor

What happened:
Windows nodes running with high (>90%) CPU usage are having calls to /stats/summary time out with "Context Deadline Exceeded".

When this happens, metrics-server cannot gather accurate CPU/memory usage for the node or the pods running on it, which results in:

  • HPA not functioning
  • kubectl top node/pod returning <unknown> instead of values

What you expected to happen:
Windows nodes should be able to run with high CPU utilization without negatively impacting Kubernetes functionality

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster with Windows Server 2019 nodes
  2. Schedule CPU-intensive workloads into the cluster
  3. Scale the workload to put the node into a high-CPU state (a minimal sketch of steps 2-4 follows this list)
  4. Query node metrics with kubectl top node
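
A minimal sketch of those steps, assuming a generic CPU-burning Windows image (the image name and replica count here are placeholders; the comments below use deployments named iis-2019-burn-cpu and validate-windows-cpu-consumption):

kubectl create deployment cpu-burn --image=<cpu-burn-image>
kubectl scale deployment cpu-burn --replicas=9    # enough replicas to exceed allocatable CPU
kubectl top node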

Anything else we need to know?:
This issue appears to be easier to reproduce on 1.18.x nodes than on 1.16.x or 1.17.x nodes.


metrics-server logs show
E1020 10:58:24.539304 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:akswin000000: unable to fetch metrics from Kubelet akswin000000 (10.240.0.115): Get https://10.240.0.115:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded

Kubelet logs show

E1002 00:26:59.894818    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E1002 00:43:44.801208    4680 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: 
failed to list all container stats: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

and

E1002 00:46:31.844828    4680 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} 
from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Environment:

  • Kubernetes version (use kubectl version): 1.18.8
  • Cloud provider or hardware configuration: AKS / Azure
  • OS (e.g: cat /etc/os-release): Windows Server 2019
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others: Docker EE version 19.3.11
@marosset added the kind/bug label on Oct 20, 2020
@k8s-ci-robot added the needs-sig label on Oct 20, 2020
@k8s-ci-robot
Contributor

@marosset: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Oct 20, 2020
@marosset
Contributor Author

/sig windows
/area kubelet

@k8s-ci-robot added the sig/windows and area/kubelet labels and removed the needs-sig label on Oct 20, 2020
@jsturtevant
Contributor

I was able to reproduce the context deadline exceeded error on 1.18.8 during a scale operation (kubectl scale --replicas 9 -n default deployment iis-2019-burn-cpu) that fully subscribes all CPUs in Allocatable:

describe node 3654k8s00000000
Capacity:                                                                                       
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         8388148Ki                                                                                                                  
  pods:                           30                                          
Allocatable:                                                                                             
  cpu:                            2                                           
  ephemeral-storage:              30945276Ki                                                                                                                 
  memory:                         6290996Ki
<clip>
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests    Limits
  --------                       --------    ------
  cpu                            2 (100%)    2 (100%)
  memory                         2400m (0%)  6400m (0%)
  ephemeral-storage              0 (0%)      0 (0%)
  attachable-volumes-azure-disk  0           0
k top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
k8s-agentlinux-36542735-vmss000000   62m          3%     920Mi           12%         
k8s-master-36542735-0                170m         8%     2255Mi          31%         
3654k8s00000000                      <unknown>                           <unknown>               <unknown>               <unknown>

Metrics-server logs:

 23:24:22.986388       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: [unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-2ndl4 on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-skqhv on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-446lb on node "3654k8s00000000", discarding data: missing cpu usage metric, unable to get CPU for container "iis" in pod default/iis-2019-burn-cpu-56b7bcf8b5-zt2x4 on node "3654k8s00000000", discarding data: missing cpu usage metric]
E1020 23:25:27.264682       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3654k8s00000000: unable to fetch metrics from Kubelet 3654k8s00000000 (3654k8s00000000): Get https://3654k8s00000000:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded

I do not see "context deadline" in the kubelet logs, but I did find some associated errors:

> I1020 23:24:54.460804    5052 kubelet.go:1953] SyncLoop (PLEG): "iis-2019-burn-cpu-56b7bcf8b5-zt2x4_default(10d79830-998d-4664-a95f-3d417d5d4a38)", event: &pleg.PodLifecycleEvent{ID:"10d79830-998d-4664-a95f-3d417d5d4a38", 
Type:"ContainerStarted", Data:"357eb5a243edf86b91f41b7e84765b07e0e3e402c7f58e2f4f4d8f30eb9a6131"}                                                                                                                                 
  E1020 23:25:28.857884    5052 handler.go:309] Error writing response: http2: stream closed                                                                                                                                      
  time="2020-10-20T23:25:40Z" level=error msg="failed to waitForNotification: callbackNumber 12470 does not exist in callbackMap"     

Once the pods are scheduled, the node remains at 100% CPU and I no longer see errors, but kubectl top nodes is not accurate:

 kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
3654k8s00000000                      1200m        60%    1704Mi          27%      

on the node (run after the pod deployment stabilized)

GET-COUNTER -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 2 -MaxSamples 3

Timestamp                 CounterSamples
---------                 --------------
10/21/2020 3:43:10 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100


10/21/2020 3:43:46 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          99.9556389707573


10/21/2020 3:43:54 PM     \\3654k8s00000000\processor(_total)\% processor time :
                          100

@jsturtevant
Contributor

After adding cpu=500m to the kubelet's --system-reserved= flag, I no longer see the errors. This reserves some CPU for kubelet and other host processes. We call this out in the documentation for memory but do not do the same for CPU.
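
For reference, a minimal sketch of the flag form (the 500m value is just what I tested with and should be scaled to the node's CPU count; the same reservation can also be set via systemReserved in a KubeletConfiguration file):

kubelet --system-reserved=cpu=500m <other existing kubelet flags unchanged>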

@marosset
Contributor Author

Also, without setting system-reserved CPU, it looks like kubectl top node does not accurately reflect CPU usage on the nodes when CPU utilization is very high. When I got some nodes into this state I did not see any error messages in the metrics-server logs, and I also noticed that querying the metrics API directly from http://localhost:8001/apis/metrics.k8s.io/v1beta1/nodes showed recent timestamps.
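
For anyone checking the same thing, the metrics API can be queried directly, assuming kubectl proxy is running on its default port 8001 (which is where the URL above comes from), or via kubectl get --raw:

kubectl proxy &
curl http://localhost:8001/apis/metrics.k8s.io/v1beta1/nodes

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes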

@jsturtevant
Contributor

I ran into this again during scaling; collecting logs for analysis here:

k scale --replicas=16 deployment validate-windows-cpu-consumption
k get pods
NAME                                                READY   STATUS              RESTARTS   AGE
validate-windows-cpu-consumption-5dd7bb8c89-44nxf   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-5p9zv   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-6h59l   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-6xzbq   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-7m8vq   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-7prvj   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-7vgh9   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-bkqxc   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-c5sj8   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-dqr2k   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-ftgns   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-jcqwm   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-k7djt   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-mxxlk   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-n5fdw   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-nnscp   1/1     Running             0          3m40s
validate-windows-cpu-consumption-5dd7bb8c89-prx2f   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-qk7q5   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-r76rg   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-tcv6x   0/1     Pending             0          20s
validate-windows-cpu-consumption-5dd7bb8c89-wvnzd   0/1     Terminating         0          85s
validate-windows-cpu-consumption-5dd7bb8c89-zx8q5   0/1     ContainerCreating   0          20s
k top nodes  
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
k8s-agentlinux-30512767-vmss000000   71m          3%     854Mi           11%         
k8s-master-30512767-0                195m         9%     1724Mi          23%         
3051k8s00000000                      <unknown>                           <unknown>               <unknown>               <unknown>    

Metrics-server logs:

k logs -n kube-system metrics-server-b6dbc49fc-tc9wr 
E1022 20:55:01.280170       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:3051k8s00000000: unable to fetch metrics from Kubelet 3051k8s00000000 (10.240.0.65): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 22e29768e0f0df2496a61ac0f59f462bafe61e0148a30b111d3090d930f3c3e8 encountered an error during Properties: failure in a Windows system call: The requested virtual machine or container operation is not valid in the current state. (0xc0370105)"
E1022 20:55:43.002766       1 reststorage.go:135] unable to fetch node metrics for node "3051k8s00000000": no metrics known for node
E1022 20:56:17.838434       1 reststorage.go:135] unable to fetch node metrics for node "3051k8s00000000": no metrics known for node

Kubelet logs:

W1022 20:55:01.271376    5032 empty_dir.go:453] Warning: Failed to clear quota on c:\var\lib\kubelet\pods\a3dffcba-7458-4c11-88f0-dd208413198e\volumes\kubernetes.io~secret\default-token-f89l9: not
implemented
E1022 20:55:01.272353    5032 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/secret/a3dffcba-7458-4c11-88f0-dd208413198e-default-token-f89l9  
podName:a3dffcba-7458-4c11-88f0-dd208413198e nodeName:}" failed. No retries permitted until 2020-10-22 20:55:17.2723536 +0000 GMT m=+8511.317416101 (durationBeforeRetry 16s). Error:
"UnmountVolume.TearDown failed for volume \"default-token-f89l9\" (UniqueName: \"kubernetes.io/secret/a3dffcba-7458-4c11-88f0-dd208413198e-default-token-f89l9\") pod
\"a3dffcba-7458-4c11-88f0-dd208413198e\" (UID: \"a3dffcba-7458-4c11-88f0-dd208413198e\") : remove
c:\\var\\lib\\kubelet\\pods\\a3dffcba-7458-4c11-88f0-dd208413198e\\volumes\\kubernetes.io~secret\\default-token-f89l9: The process cannot access the file because it is being used by another process."      
time="2020-10-22T20:55:01Z" level=error msg="hcsshim::ComputeSystem::Properties - End Operation - Error" cid=22e29768e0f0df2496a61ac0f59f462bafe61e0148a30b111d3090d930f3c3e8 error="Properties
22e29768e0f0df2496a61ac0f59f462bafe61e0148a30b111d3090d930f3c3e8: The requested virtual machine or container operation is not valid in the current state."
E1022 20:55:01.273353    5032 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error:      
code = Unknown desc = container 22e29768e0f0df2496a61ac0f59f462bafe61e0148a30b111d3090d930f3c3e8 encountered an error during Properties: failure in a Windows system call: The requested virtual machine or  
container operation is not valid in the current state. (0xc0370105)
E1022 20:55:01.274356    5032 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc  
= container 22e29768e0f0df2496a61ac0f59f462bafe61e0148a30b111d3090d930f3c3e8 encountered an error during Properties: failure in a Windows system call: The requested virtual machine or container operation  
is not valid in the current state. (0xc0370105)
I1022 20:55:01.344609    5032 kuberuntime_manager.go:422] No sandbox for pod "validate-windows-cpu-consumption-5dd7bb8c89-zx8q5_default(1405d68f-c97e-4232-aafe-b279f30e87d1)" can be found. Need to start   
a new one
W1022 20:55:02.034274    5032 cri_stats_provider_windows.go:89] Failed to get HNS endpoint "" with error 'json: cannot unmarshal array into Go value of type hns.HNSEndpoint', continue to get stats for     
other endpoints
time="2020-10-22T20:55:03Z" level=error msg="hcsshim::ComputeSystem::Properties - End Operation - Error" cid=c1f33c99795ccde9566fcd1f53fbcd9f0f90e98c843db08e855481827b2bf678 error="Properties
c1f33c99795ccde9566fcd1f53fbcd9f0f90e98c843db08e855481827b2bf678: The requested virtual machine or container operation is not valid in the current state."

@jsturtevant
Contributor

@jeremyje @pjh have you experienced this at all?

@jeremyje
Contributor

I believe @jingxu97 has seen high-CPU issues before, though not this one in particular. I've heard there were kernel issues in the past that triggered high-CPU cases.

@marosset
Contributor Author

Some investigation on our end is showing that the DiagTrack service running in the infra/pause container can consume non-trivial amounts of CPU.
I'll create an issue to address that problem, as it might help some people here.
That certainly will not eliminate the problem we are seeing, though.

@marosset
Contributor Author

#95840 is an issue for DiagTrack

@marosset
Contributor Author

marosset commented Oct 26, 2020

The difference in behavior between 1.17.* and 1.18.* clusters looks like it is caused by #86101 (which was NOT backported to any prior releases).

K8s version <= 1.17 behavior
Prior to these changes, CPU weights for containers were inferred from the limit values set in the pod specs.

The observed behavior on Windows systems when CPU weights were used was that, under high CPU load, pods/containers were not guaranteed their scheduling time.

The good news here is that system critical services like kubelet/docker/etc were still able to get enough CPU time for the nodes to function properly.

K8s version >= 1.18 behavior
After these changes, CPU limits were enforced for pods.
The observed behavior on 1.18 clusters is that when a Windows node enters a high-CPU state, containers are guaranteed their scheduling time but host resources start to get starved. Usually this manifests as the node entering the NotReady state, metrics not working correctly (because calls to /stats/summary are timing out), or both.
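
To make the 1.18 behavior concrete: Windows job objects express a hard CPU cap as a portion of total host CPU on a 0-10000 scale (10000 = 100% of all logical processors), so a pod's milli-CPU limit has to be converted to that scale. The snippet below is purely illustrative arithmetic along those lines, not the exact kubelet code; it shows why a 2-CPU node whose pod limits sum to 2 full CPUs leaves no explicit headroom for kubelet/docker/HCS unless CPU is reserved.

package main

import "fmt"

// cpuMaximum is an illustrative conversion (not the kubelet's exact code) from a
// pod-spec milli-CPU limit to the Windows job-object cap scale, where 10000
// means 100% of all logical processors on the host.
func cpuMaximum(milliCPULimit, numCPU int64) int64 {
    m := 10000 * milliCPULimit / (numCPU * 1000)
    // Clamp into the 1..10000 range accepted by the job object.
    if m < 1 {
        m = 1
    }
    if m > 10000 {
        m = 10000
    }
    return m
}

func main() {
    // A 2-CPU node with pod limits summing to 2000m: pods may consume the full
    // 10000 (100% of both cores), leaving nothing set aside for host services.
    fmt.Println(cpuMaximum(2000, 2)) // 10000
    // The same node with 500m reserved for the system: pod limits of 1500m map
    // to 7500, i.e. 75% of total CPU.
    fmt.Println(cpuMaximum(1500, 2)) // 7500
}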

Suggestion / possible fixes

  • Set the Windows process priority of the kubelet to AboveNormal / High to help ensure it gets CPU time
    • Kubelet can do this itself through golang.org/x/sys/windows calls (GetCurrentProcess() and Get/SetPriorityClass()); a minimal sketch follows this list
  • Reserve CPU with --system-reserved=cpu=500m
    • Based on my experimentation, this needs to be scaled with the number of CPUs on the host
    • Current guidance on k8s.io for Windows only mentions reserving system memory
    • It is still possible to over-commit CPU and starve kubelet/docker/HCS if users schedule pods with mismatched requests and limits
  • Update the k8s.io documentation.
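
A minimal sketch of the first suggestion, assuming the golang.org/x/sys/windows package (whether ABOVE_NORMAL_PRIORITY_CLASS or HIGH_PRIORITY_CLASS is the right choice is still open):

//go:build windows

package main

import (
    "log"

    "golang.org/x/sys/windows"
)

func main() {
    // GetCurrentProcess returns a pseudo-handle for the calling process.
    self, err := windows.GetCurrentProcess()
    if err != nil {
        log.Fatalf("GetCurrentProcess: %v", err)
    }
    // Raise the priority class so the process keeps being scheduled when the
    // node is CPU-saturated; HIGH_PRIORITY_CLASS would be the more aggressive choice.
    if err := windows.SetPriorityClass(self, windows.ABOVE_NORMAL_PRIORITY_CLASS); err != nil {
        log.Fatalf("SetPriorityClass: %v", err)
    }
    log.Printf("process priority class raised to ABOVE_NORMAL")
}

The call would only need to happen once at kubelet startup on Windows builds, which is why the sketch is guarded by a windows build tag.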

@ravisantoshgudimetla
Contributor

Set the Windows process priority of the kubelet to AboveNormal / High to help ensure it gets CPU time

Kubelet can do this itself through golang.org/x/sys/windows calls (GetCurrentProcess() and Get/SetPriorityClass())

I can work on this. I am wondering if we should do the same for the CRI runtime (e.g. Docker) and CNI processes.

@marosset
Contributor Author

@ravisantoshgudimetla Sure, if you want to work on this go ahead.
I think we should probably make this toggleable via a kubelet flag.
I believe there are other Windows-specific flags already implemented in the kubelet (like the one to run as a Windows service, IIRC).

@jsturtevant
Contributor

docker/containerd should probably have a flag that allows them to run at high priority as well, but those would likely be separate issues in the respective projects.

@marosset
Contributor Author

marosset commented Sep 2, 2021

/reopen
I don't think this was ever fully addressed

@k8s-ci-robot
Contributor

@marosset: Reopened this issue.

In response to this:

/reopen
I don't think this was ever fully addressed

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Sep 2, 2021
SIG-Windows automation moved this from Done (v1.19) to Backlog Sep 2, 2021
@jsturtevant
Contributor

also related to #104283

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 1, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 31, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG-Windows automation moved this from Backlog to Done (v1.23) Jan 31, 2022