KubeAPIErrorBudgetBurn warnings fire after upgrade to 1.18.6 #9698

Closed
demisx opened this issue Aug 6, 2020 · 18 comments

demisx commented Aug 6, 2020

1. What kops version are you running? The command kops version will display
this information.

Version 1.18.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Cluster upgrade from 1.16 to 1.18

5. What happened after the commands executed?
The following warning alert started to fire (see below). I am not sure where to look for the problem and how to fix it.

alert: KubeAPIErrorBudgetBurn
expr: sum(apiserver_request:burnrate3d)
  > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01)
for: 3h
labels:
  severity: warning
annotations:
  message: The API server is burning too much error budget
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn

6. What did you expect to happen?
No warning alerts

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2019-10-22T17:23:28Z"
  generation: 10
  name: prod.dl.org
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-dl-org-prod/prod.dl.org
  dnsZone: Z1KK30M1ET94CJ
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2d
      name: d
    memoryRequest: 100Mi
    name: main
    version: 3.2.24
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2d
      name: d
    memoryRequest: 100Mi
    name: events
    version: 3.2.24
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      TTLAfterFinished: "true"
  kubeControllerManager:
    featureGates:
      TTLAfterFinished: "true"
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    featureGates:
      TTLAfterFinished: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.18.6
  masterInternalName: api.internal.prod.dl.org
  masterPublicName: api.prod.dl.org
  networkCIDR: 172.20.0.0/16
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 172.20.64.0/19
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 172.20.96.0/19
    name: us-west-2d
    type: Private
    zone: us-west-2d
  - cidr: 172.20.0.0/22
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 172.20.4.0/22
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 172.20.8.0/22
    name: utility-us-west-2d
    type: Utility
    zone: us-west-2d
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-03-15T21:02:54Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: bastions
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t2.micro
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-us-west-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:23:28Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: master-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:23:28Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: master-us-west-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2b
  role: Master
  subnets:
  - us-west-2b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:23:28Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: master-us-west-2d
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2d
  role: Master
  subnets:
  - us-west-2d

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:26:50Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: nodes-us-west-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.large
  maxPrice: "0.05"
  maxSize: 5
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    - r5a.large
    - m5.large
    onDemandAboveBase: 0
    spotInstancePools: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:27:42Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: nodes-us-west-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.large
  maxPrice: "0.05"
  maxSize: 5
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    - r5a.large
    - m5.large
    onDemandAboveBase: 0
    spotInstancePools: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2b
  role: Node
  subnets:
  - us-west-2b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2019-10-22T17:28:34Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: prod.dl.org
  name: nodes-us-west-2d
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: t3a.large
  maxPrice: "0.05"
  maxSize: 5
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    - r5a.large
    - m5.large
    onDemandAboveBase: 0
    spotInstancePools: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2d
  role: Node
  subnets:
  - us-west-2d

9. Anything else we need to know?
I am not seeing anything alarming in the kube-apiserver log, just the entries below. Should I check something else?

$ kubectl logs -n kube-system -lk8s-app=kube-apiserver -c kube-apiserver
I0806 10:39:27.118112       1 log.go:172] http: TLS handshake error from 172.20.0.190:39560: tls: first record does not look like a TLS handshake
I0806 10:39:27.125020       1 log.go:172] http: TLS handshake error from 172.20.0.190:39562: tls: first record does not look like a TLS handshake
I0806 10:39:27.131772       1 log.go:172] http: TLS handshake error from 172.20.0.190:39564: tls: first record does not look like a TLS handshake
I0806 13:49:34.220916       1 log.go:172] http: TLS handshake error from 172.20.0.190:48782: tls: first record does not look like a TLS handshake
I0806 15:46:47.807678       1 trace.go:116] Trace[1935067621]: "Get" url:/api/v1/namespaces/kube-system/pods/kube-apiserver-ip-172-20-39-122.us-west-2.compute.internal/log,user-agent:kubectl/v1.18.6 (darwin/amd64) kubernetes/dff82dc,client:172.20.0.190 (started: 2020-08-06 15:46:27.987322174 +0000 UTC m=+78736.098618310) (total time: 19.820232218s):
Trace[1935067621]: [19.820230198s] [19.80677796s] Transformed response object
I0806 15:46:47.807717       1 trace.go:116] Trace[439512417]: "Get" url:/api/v1/namespaces/kube-system/pods/kube-apiserver-ip-172-20-97-123.us-west-2.compute.internal/log,user-agent:kubectl/v1.18.6 (darwin/amd64) kubernetes/dff82dc,client:172.20.0.190 (started: 2020-08-06 15:46:27.992094089 +0000 UTC m=+78736.103390235) (total time: 19.815537094s):
Trace[439512417]: [19.815536244s] [19.804237986s] Transformed response object
I0806 15:46:47.807847       1 trace.go:116] Trace[2088981802]: "Get" url:/api/v1/namespaces/kube-system/pods/kube-apiserver-ip-172-20-94-219.us-west-2.compute.internal/log,user-agent:kubectl/v1.18.6 (darwin/amd64) kubernetes/dff82dc,client:172.20.0.190 (started: 2020-08-06 15:46:27.988739067 +0000 UTC m=+78736.100035203) (total time: 19.819060288s):
Trace[2088981802]: [19.819059348s] [19.804137435s] Transformed response object
I0806 15:38:30.929435       1 trace.go:116] Trace[788389340]: "List" url:/api/v1/secrets,user-agent:fluxd/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-06 15:38:30.348145008 +0000 UTC m=+77776.473287302) (total time: 581.201989ms):
Trace[788389340]: [581.200379ms] [528.735779ms] Writing http response done count:134
I0806 15:41:34.990938       1 trace.go:116] Trace[1399877601]: "List" url:/api/v1/secrets,user-agent:fluxd/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-06 15:41:34.478087614 +0000 UTC m=+77960.603229898) (total time: 512.803702ms):
Trace[1399877601]: [512.802272ms] [473.454048ms] Writing http response done count:134
I0806 15:41:50.210129       1 trace.go:116] Trace[1495977473]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:helm-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-06 15:41:49.696962617 +0000 UTC m=+77975.822104911) (total time: 513.125004ms):
Trace[1495977473]: [513.122854ms] [468.787573ms] Writing http response done count:5
I0806 15:41:50.542065       1 trace.go:116] Trace[345410093]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:helm-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-06 15:41:49.995633773 +0000 UTC m=+77976.120776067) (total time: 546.356231ms):
Trace[345410093]: [546.354241ms] [475.799666ms] Writing http response done count:5
I0806 15:44:39.187617       1 trace.go:116] Trace[590956055]: "List" url:/api/v1/secrets,user-agent:fluxd/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-06 15:44:38.591954763 +0000 UTC m=+78144.717097057) (total time: 595.608228ms):
Trace[590956055]: [595.606578ms] [546.292841ms] Writing http response done count:134
I0806 09:51:51.369290       1 log.go:172] http: TLS handshake error from 172.20.8.25:36724: tls: first record does not look like a TLS handshake
I0806 09:51:51.375823       1 log.go:172] http: TLS handshake error from 172.20.8.25:36726: tls: first record does not look like a TLS handshake
I0806 09:51:51.414232       1 log.go:172] http: TLS handshake error from 172.20.8.25:36738: tls: first record does not look like a TLS handshake
I0806 09:51:51.420446       1 log.go:172] http: TLS handshake error from 172.20.8.25:36740: tls: first record does not look like a TLS handshake
I0806 09:51:51.426999       1 log.go:172] http: TLS handshake error from 172.20.8.25:36742: tls: first record does not look like a TLS handshake
I0806 09:51:51.433691       1 log.go:172] http: TLS handshake error from 172.20.8.25:36744: tls: first record does not look like a TLS handshake
I0806 09:51:51.440140       1 log.go:172] http: TLS handshake error from 172.20.8.25:36746: tls: first record does not look like a TLS handshake
I0806 10:40:51.908353       1 log.go:172] http: TLS handshake error from 172.20.8.25:39314: tls: no cipher suite supported by both client and server
I0806 11:13:18.864421       1 log.go:172] http: TLS handshake error from 172.20.8.25:40998: tls: first record does not look like a TLS handshake
I0806 13:46:37.659591       1 log.go:172] http: TLS handshake error from 172.20.8.25:48924: local error: tls: bad record MAC
@johngmyers (Member)

Looks like something's trying to talk to the apiserver with non-TLS HTTP.
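
One way to narrow that down is to map the client IPs from the log excerpt above (172.20.0.190 and 172.20.8.25) back to a pod or node; a rough sketch:

# Does a pod own one of the client IPs reporting TLS handshake errors?
kubectl get pods --all-namespaces -o wide | grep -E '172\.20\.0\.190|172\.20\.8\.25'
# If not, is it a node address (e.g. a host-network component)?
kubectl get nodes -o wide | grep -E '172\.20\.0\.190|172\.20\.8\.25'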


justinaslelys commented Aug 7, 2020

@demisx
I'm facing exactly the same issue after upgrading from 1.17 (what's different from your cluster: I'm using Cilium).

If you look at what apiserver_request:burnrate6h actually is, it's a combination of slow requests and 5xx errors on LIST|GET requests:

        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h]))
            -
            (
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="resource",le="0.1"}[6h])) +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h])) +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[6h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h]))
      labels:
        verb: read
      record: apiserver_request:burnrate6h

Using those formulas, I found that in my case it's slow requests that are causing the problem.
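
As an illustration, the latency term and the error term of that rule can be queried separately to see which one dominates. A minimal sketch, assuming Prometheus is port-forwarded locally (the metric names come from the rule above; the namespace and service in the port-forward are assumptions for a kube-prometheus-style install):

# kubectl -n monitoring port-forward svc/prometheus-k8s 9090   # adjust to your setup

# error term: rate of 5xx responses on LIST/GET over the last 6h
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[6h]))'

# latency term: LIST/GET requests slower than the per-scope latency targets
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="resource",le="0.1"}[6h]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h]))
    - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))'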

I've installed the API Server dashboard (for convenience) and found that some LIST requests are taking longer than they should:
[Screenshot 2020-08-07: API server dashboard showing elevated LIST request latencies]

API server logs from around the same time (combined from the 3 masters):

Trace[1779858798]: [3.148645316s] [3.148591286s] Object stored in database
I0807 06:30:01.109419       1 trace.go:116] Trace[2060077304]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-internal,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:29:54.06573089 +0000 UTC m=+72866.991413790) (total time: 7.043661571s):
Trace[2060077304]: [7.04358989s] [7.04358317s] About to write a response
I0807 06:30:01.107672       1 trace.go:116] Trace[2090894920]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-internal,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:29:54.066965029 +0000 UTC m=+72866.992647919) (total time: 7.040680386s):
Trace[2090894920]: [7.040637156s] [7.040630436s] About to write a response
I0807 06:30:01.106044       1 trace.go:116] Trace[1639175304]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.201.143 (started: 2020-08-07 06:29:54.853707821 +0000 UTC m=+72867.779390711) (total time: 6.25231116s):
Trace[1639175304]: [6.252270029s] [6.252260849s] About to write a response
I0807 06:30:01.106478       1 trace.go:116] Trace[1483561289]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-public,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:29:54.230118394 +0000 UTC m=+72867.155801284) (total time: 6.876333434s):
Trace[1483561289]: [6.876292123s] [6.876283183s] About to write a response
I0807 06:30:01.106697       1 trace.go:116] Trace[491217306]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-acceptance-cms,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:29:54.065550567 +0000 UTC m=+72866.991233467) (total time: 7.041118764s):
Trace[491217306]: [7.041077133s] [7.041070722s] About to write a response
I0807 06:30:01.106882       1 trace.go:116] Trace[551629569]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:29:54.224949497 +0000 UTC m=+72867.150632407) (total time: 6.881908637s):
Trace[551629569]: [6.881841536s] [6.881829806s] About to write a response
I0807 06:30:01.105047       1 trace.go:116] Trace[410011724]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-public,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:29:59.254851022 +0000 UTC m=+72872.180533912) (total time: 1.850171504s):
Trace[410011724]: [1.850127263s] [1.850117713s] About to write a response
I0807 06:30:01.105502       1 trace.go:116] Trace[1333423977]: "Get" url:/api/v1/namespaces/kube-system/configmaps/coredns-autoscaler,user-agent:cluster-proportional-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:29:57.539666801 +0000 UTC m=+72870.465349691) (total time: 3.565809602s):
Trace[1333423977]: [3.565774231s] [3.565763121s] About to write a response
I0807 06:30:01.104580       1 trace.go:116] Trace[1237990545]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-acceptance-cms,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:29:54.0650784 +0000 UTC m=+72866.990761290) (total time: 7.039469799s):
Trace[1237990545]: [7.039403388s] [7.039392667s] About to write a response
I0807 06:30:01.102643       1 trace.go:116] Trace[724288147]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:::1 (started: 2020-08-07 06:29:57.10929586 +0000 UTC m=+72870.034978750) (total time: 3.99332166s):
Trace[724288147]: [3.993290589s] [3.993279849s] About to write a response
I0807 06:30:01.102411       1 trace.go:116] Trace[1184972991]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:54.065299814 +0000 UTC m=+72866.990982714) (total time: 7.037081932s):
Trace[1184972991]: [7.037023501s] [7.037005781s] About to write a response
I0807 06:30:01.102496       1 trace.go:116] Trace[1914128814]: "Get" url:/api/v1/persistentvolumes/cms-efs-pv,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:10.0.147.123 (started: 2020-08-07 06:29:54.085547597 +0000 UTC m=+72867.011230487) (total time: 7.016920151s):
Trace[1914128814]: [7.0168755s] [7.01686696s] About to write a response
I0807 06:30:01.101728       1 trace.go:116] Trace[1291545872]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:54.09175972 +0000 UTC m=+72867.017442610) (total time: 7.009912566s):
Trace[1291545872]: [7.009850435s] [7.009824984s] About to write a response
I0807 06:30:00.944339       1 trace.go:116] Trace[1342146895]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler,user-agent:cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:29:54.06772783 +0000 UTC m=+72866.993410720) (total time: 6.876582547s):
Trace[1342146895]: [6.876582547s] [6.876558217s] END
E0807 06:30:00.942843       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:30:00.943405       1 trace.go:116] Trace[1137492827]: "Get" url:/api/v1/namespaces/kube-system/configmaps/cluster-autoscaler-status,user-agent:cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:29:54.570202502 +0000 UTC m=+72867.495885392) (total time: 6.373164801s):
Trace[1137492827]: [6.373164801s] [6.373153501s] END
E0807 06:30:00.943847       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:30:00.458438       1 healthz.go:234] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed
I0807 06:29:59.506958       1 trace.go:116] Trace[1443714326]: "List" url:/apis/batch/v1/jobs,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:cronjob-controller,client:127.0.0.1 (started: 2020-08-07 06:29:55.039163837 +0000 UTC m=+66196.225622958) (total time: 4.467766019s):
Trace[1443714326]: [4.466443983s] [4.466424913s] Listing from storage done
I0807 06:29:59.506073       1 trace.go:116] Trace[1011047555]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.146.71 (started: 2020-08-07 06:29:56.328473579 +0000 UTC m=+66197.514932690) (total time: 3.177572806s):
Trace[1011047555]: [3.177524426s] [3.177515316s] About to write a response
I0807 06:29:59.505442       1 trace.go:116] Trace[965879463]: "List etcd3" key:/jobs,resourceVersion:,limit:500,continue: (started: 2020-08-07 06:29:55.039188757 +0000 UTC m=+66196.225647868) (total time: 4.466226791s):
Trace[965879463]: [4.466226791s] [4.466226791s] END
I0807 06:29:59.503467       1 trace.go:116] Trace[1073531398]: "Get" url:/api/v1/nodes/ip-10-0-162-134.eu-west-1.compute.internal,user-agent:protokube/v0.0.0 (linux/amd64) kubernetes/$Format,client:127.0.0.1 (started: 2020-08-07 06:29:58.117318155 +0000 UTC m=+66199.303777276) (total time: 1.386118519s):
Trace[1073531398]: [1.385255018s] [1.385246168s] About to write a response
I0807 06:29:59.502273       1 trace.go:116] Trace[902744542]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:::1 (started: 2020-08-07 06:29:56.648017785 +0000 UTC m=+66197.834476906) (total time: 2.854213634s):
Trace[902744542]: [2.854159534s] [2.854148124s] About to write a response
I0807 06:29:59.502722       1 trace.go:116] Trace[1296014947]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:54.499288654 +0000 UTC m=+66195.685747765) (total time: 5.003411821s):
Trace[1296014947]: [5.00338469s] [5.003363679s] About to write a response
E0807 06:29:59.496664       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:29:59.497002       1 trace.go:116] Trace[1393232789]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:29:52.333973384 +0000 UTC m=+66193.520432495) (total time: 7.162986621s):
Trace[1393232789]: [7.162986621s] [7.162976631s] END
E0807 06:29:59.497423       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:29:59.497987       1 trace.go:116] Trace[760182879]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:51.85116812 +0000 UTC m=+66193.037627241) (total time: 7.646787357s):
Trace[760182879]: [7.646787357s] [7.646764776s] END
I0807 06:29:58.617058       1 trace.go:116] Trace[1315296369]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-08-07 06:29:51.614274113 +0000 UTC m=+66192.800733244) (total time: 7.002733197s):
Trace[1315296369]: [7.002733197s] [7.00217732s] END
E0807 06:29:58.617196       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out, possibly due to previous leader failure"}
I0807 06:29:58.617560       1 trace.go:116] Trace[1907253726]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-162-134.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:127.0.0.1 (started: 2020-08-07 06:29:51.614133131 +0000 UTC m=+66192.800592292) (total time: 7.003402225s):
Trace[1907253726]: [7.003402225s] [7.003305003s] END
I0807 06:29:54.081853       1 trace.go:116] Trace[656995652]: "Get" url:/api/v1/namespaces/acceptance/persistentvolumeclaims/cms-efs,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:10.0.147.123 (started: 2020-08-07 06:29:52.840403673 +0000 UTC m=+72865.766086563) (total time: 1.241399278s):
Trace[656995652]: [1.241238086s] [1.241229686s] About to write a response
I0807 06:29:54.082774       1 trace.go:116] Trace[1352207043]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:52.070811947 +0000 UTC m=+72864.996494837) (total time: 2.011934148s):
Trace[1352207043]: [2.011896028s] [2.011876838s] About to write a response
I0807 06:29:54.088020       1 trace.go:116] Trace[1548824378]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-201-143.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:127.0.0.1 (started: 2020-08-07 06:29:50.883341732 +0000 UTC m=+65659.906369537) (total time: 3.204643639s):
Trace[1548824378]: [3.204096422s] [3.203985161s] Object stored in database
I0807 06:29:54.087390       1 trace.go:116] Trace[1918609084]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-08-07 06:29:50.883498894 +0000 UTC m=+65659.906526709) (total time: 3.203852569s):
Trace[1918609084]: [3.203813418s] [3.20322758s] Transaction committed
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
I0807 06:29:53.673092       1 healthz.go:234] [+]ping ok
I0807 06:29:52.498004       1 trace.go:116] Trace[163737924]: "List" url:/api/v1/pods,user-agent:cilium-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.175.74 (started: 2020-08-07 06:29:51.041842333 +0000 UTC m=+65660.064870158) (total time: 1.456138657s):
Trace[163737924]: [1.456138657s] [1.456065366s] END
I0807 06:29:52.497536       1 trace.go:116] Trace[361053280]: "List etcd3" key:/pods,resourceVersion:,limit:0,continue: (started: 2020-08-07 06:29:51.041922804 +0000 UTC m=+65660.064950609) (total time: 1.45557445s):
Trace[361053280]: [1.45557445s] [1.45557445s] END
E0807 06:29:52.497573       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
E0807 06:29:52.496426       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
I0807 06:29:52.496833       1 trace.go:116] Trace[1481734084]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:29:51.717795216 +0000 UTC m=+66192.904254357) (total time: 778.999938ms):
Trace[1481734084]: [778.999938ms] [778.977508ms] END
I0807 06:19:12.584734       1 trace.go:116] Trace[1233625436]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-gw1s-1596781140-6pcnn,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.952256671 +0000 UTC m=+65553.138715792) (total time: 632.450929ms):
Trace[1233625436]: [632.409079ms] [632.401929ms] About to write a response
I0807 06:19:12.581544       1 trace.go:116] Trace[1174886116]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-gw1g-1596781140-4stjt,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.951993878 +0000 UTC m=+65553.138452999) (total time: 629.478642ms):
Trace[1174886116]: [629.382671ms] [629.375481ms] About to write a response
I0807 06:19:12.582095       1 trace.go:116] Trace[479205559]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-app-1596781140-l9f9s,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.95139352 +0000 UTC m=+65553.137852631) (total time: 630.645807ms):
Trace[479205559]: [630.602037ms] [630.591346ms] About to write a response
I0807 06:19:12.582576       1 trace.go:116] Trace[649348523]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-gwac-1596781140-75sdt,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.952885719 +0000 UTC m=+65553.139344840) (total time: 629.637374ms):
Trace[649348523]: [629.592004ms] [629.584564ms] About to write a response
I0807 06:19:12.583040       1 trace.go:116] Trace[241412352]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-gw2g-1596781140-q4qhs,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.952543555 +0000 UTC m=+65553.139002666) (total time: 630.446144ms):
Trace[241412352]: [630.404723ms] [630.397043ms] About to write a response
I0807 06:19:12.584087       1 trace.go:116] Trace[2071769037]: "Get" url:/api/v1/namespaces/acceptance/pods/zeus-cron-gw1a-1596781140-h4nfq,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:generic-garbage-collector,client:127.0.0.1 (started: 2020-08-07 06:19:11.951707724 +0000 UTC m=+65553.138166835) (total time: 632.343758ms):
Trace[2071769037]: [632.260537ms] [632.253527ms] About to write a response
I0807 06:19:12.576826       1 trace.go:116] Trace[1716152648]: "Create" url:/api/v1/namespaces/acceptance/events,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:cronjob-controller,client:127.0.0.1 (started: 2020-08-07 06:19:11.845850729 +0000 UTC m=+65553.032309840) (total time: 730.932733ms):
Trace[1716152648]: [730.879013ms] [730.803792ms] Object stored in database
I0807 06:13:21.511467       1 trace.go:116] Trace[182093860]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.176.47 (started: 2020-08-07 06:13:20.832414292 +0000 UTC m=+71873.758097172) (total time: 678.977349ms):
Trace[182093860]: [678.974229ms] [632.95188ms] Writing http response done count:46
I0807 06:13:18.265880       1 trace.go:116] Trace[904391237]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.176.47 (started: 2020-08-07 06:13:17.090494217 +0000 UTC m=+71870.016177117) (total time: 1.175335154s):
Trace[904391237]: [1.175334004s] [1.096289682s] Writing http response done count:46
I0807 06:13:17.067468       1 trace.go:116] Trace[327754048]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-internal,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:13:06.428847134 +0000 UTC m=+71859.354530024) (total time: 10.638584087s):
Trace[327754048]: [10.638584087s] [10.638573467s] END
I0807 06:13:17.057091       1 trace.go:116] Trace[152040739]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-acceptance-cms,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:13:06.836436342 +0000 UTC m=+71859.762119222) (total time: 10.220601274s):
Trace[152040739]: [10.220601274s] [10.220595864s] END
I0807 06:13:17.050671       1 trace.go:116] Trace[735678549]: "Get" url:/api/v1/namespaces/kube-system/configmaps/coredns-autoscaler,user-agent:cluster-proportional-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:07.539992527 +0000 UTC m=+71860.465675417) (total time: 9.510638823s):
Trace[735678549]: [9.510638823s] [9.510628623s] END
I0807 06:13:17.037834       1 trace.go:116] Trace[1032149158]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:07.722480747 +0000 UTC m=+71860.648163647) (total time: 9.315313601s):
Trace[1032149158]: [9.315313601s] [9.315303171s] END
I0807 06:13:17.018368       1 trace.go:116] Trace[1936293305]: "Delete" url:/apis/cilium.io/v2/ciliumidentities/45459,user-agent:cilium-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.175.74 (started: 2020-08-07 06:13:09.9807415 +0000 UTC m=+64659.003769305) (total time: 7.037558341s):
Trace[1936293305]: [7.03748035s] [7.037381739s] Object deleted from database
I0807 06:13:17.015581       1 trace.go:116] Trace[1546788469]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-public,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:07.797198705 +0000 UTC m=+71860.722881595) (total time: 9.21834301s):
Trace[1546788469]: [9.21834301s] [9.21833172s] END
I0807 06:13:17.010438       1 trace.go:116] Trace[1769020839]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:08.42384498 +0000 UTC m=+71861.349527860) (total time: 8.586553258s):
Trace[1769020839]: [8.586553258s] [8.586531558s] END
I0807 06:13:16.995984       1 trace.go:116] Trace[681486628]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-internal,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:14.963539294 +0000 UTC m=+71867.889222194) (total time: 2.032409298s):
Trace[681486628]: [2.032330407s] [2.032320567s] About to write a response
E0807 06:13:16.996230       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0807 06:13:16.996282       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0807 06:13:16.996315       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0807 06:13:16.996335       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0807 06:13:16.996353       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.996444       1 trace.go:116] Trace[1141472074]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-public,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.193.31 (started: 2020-08-07 06:13:11.211903795 +0000 UTC m=+71864.137586685) (total time: 5.784519084s):
Trace[1141472074]: [5.784481724s] [5.784473884s] About to write a response
I0807 06:13:16.996698       1 trace.go:116] Trace[244452943]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx-acceptance-cms,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:11.014513402 +0000 UTC m=+71863.940196332) (total time: 5.982158561s):
Trace[244452943]: [5.98211899s] [5.98211112s] About to write a response
I0807 06:13:16.997255       1 trace.go:116] Trace[1996820160]: "Get" url:/api/v1/namespaces/acceptance/pods/cms-68d9f76dc5-kxr89,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:10.0.147.123 (started: 2020-08-07 06:13:07.858187447 +0000 UTC m=+71860.783870437) (total time: 9.139043154s):
Trace[1996820160]: [9.139043154s] [9.139032044s] END
E0807 06:13:16.986169       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.983392       1 trace.go:116] Trace[783227481]: "Get" url:/api/v1/namespaces/acceptance/persistentvolumeclaims/cms-efs,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:10.0.147.123 (started: 2020-08-07 06:13:07.874238297 +0000 UTC m=+71860.799921177) (total time: 9.109013705s):
Trace[783227481]: [9.109013705s] [9.109002825s] END
E0807 06:13:16.981072       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.981330       1 trace.go:116] Trace[5262116]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:::1 (started: 2020-08-07 06:13:06.06229836 +0000 UTC m=+71858.987981260) (total time: 10.918992273s):
Trace[5262116]: [10.918992273s] [10.918983913s] END
E0807 06:13:16.981944       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.982101       1 trace.go:116] Trace[1117900533]: "Get" url:/api/v1/namespaces/kube-system/pods/kube-scheduler-ip-10-0-146-71.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:127.0.0.1 (started: 2020-08-07 06:13:16.229272702 +0000 UTC m=+71869.154955592) (total time: 752.799553ms):
Trace[1117900533]: [752.474338ms] [752.464598ms] About to write a response
I0807 06:13:16.982532       1 trace.go:116] Trace[813457766]: "Get" url:/api/v1/namespaces/kube-system/pods/kube-controller-manager-ip-10-0-201-143.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:127.0.0.1 (started: 2020-08-07 06:13:16.026190106 +0000 UTC m=+64665.049217911) (total time: 956.280553ms):
Trace[813457766]: [949.780165ms] [949.770435ms] About to write a response
I0807 06:13:16.981381       1 trace.go:116] Trace[1657074742]: "List" url:/api/v1/pods,user-agent:cilium-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.175.74 (started: 2020-08-07 06:13:07.900334945 +0000 UTC m=+64656.923362770) (total time: 9.081001058s):
Trace[1657074742]: [9.081001058s] [9.080962558s] END
E0807 06:13:16.974264       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.973425       1 trace.go:116] Trace[726184448]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:::1 (started: 2020-08-07 06:13:12.460236007 +0000 UTC m=+64661.483263812) (total time: 4.513140119s):
Trace[726184448]: [4.513077448s] [4.513049747s] About to write a response
I0807 06:13:16.970299       1 trace.go:116] Trace[161780535]: "Get" url:/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx,user-agent:nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.175.74 (started: 2020-08-07 06:13:07.428445543 +0000 UTC m=+64656.451473348) (total time: 9.541819171s):
Trace[161780535]: [9.541819171s] [9.541807501s] END
I0807 06:13:16.968968       1 trace.go:116] Trace[475562247]: "Get" url:/api/v1/namespaces/acceptance/services/cms,user-agent:helm/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.247.139 (started: 2020-08-07 06:13:05.590485902 +0000 UTC m=+64654.613513717) (total time: 11.378438884s):
Trace[475562247]: [11.378438884s] [11.378426934s] END
E0807 06:13:16.969489       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I0807 06:13:16.969876       1 trace.go:116] Trace[51329129]: "List etcd3" key:/pods,resourceVersion:,limit:0,continue: (started: 2020-08-07 06:13:07.900380315 +0000 UTC m=+64656.923408540) (total time: 9.069464234s):
Trace[51329129]: [9.069464234s] [9.069464234s] END
E0807 06:13:16.969910       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
E0807 06:13:16.968602       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
Trace[1254346185]: [10.000486815s] [10.000462865s] END
E0807 06:13:16.653972       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:16.654850       1 trace.go:116] Trace[1254346185]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:06.654186651 +0000 UTC m=+64655.677214456) (total time: 10.000486815s):
I0807 06:13:16.529083       1 trace.go:116] Trace[795141371]: "Get" url:/api/v1/namespaces/kube-system/configmaps/cluster-autoscaler-status,user-agent:cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:13:08.295100189 +0000 UTC m=+64657.318127984) (total time: 8.233948396s):
Trace[795141371]: [8.233948396s] [8.233938036s] END
I0807 06:13:16.527981       1 trace.go:116] Trace[567248037]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler,user-agent:cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:13:06.508873464 +0000 UTC m=+64655.531901269) (total time: 10.019072076s):
Trace[567248037]: [10.019072076s] [10.019062736s] END
E0807 06:13:16.528721       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E0807 06:13:16.527581       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:16.056040       1 trace.go:116] Trace[971316149]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:11.967211506 +0000 UTC m=+71864.892894386) (total time: 4.088788043s):
Trace[971316149]: [4.088788043s] [4.088764152s] END
E0807 06:13:16.055513       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:15.850956       1 trace.go:116] Trace[526413577]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.201.143 (started: 2020-08-07 06:13:05.851330883 +0000 UTC m=+64654.874358688) (total time: 9.999578604s):
Trace[526413577]: [9.999578604s] [9.999568574s] END
E0807 06:13:15.850603       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:15.723729       1 trace.go:116] Trace[1481438935]: "List etcd3" key:/jobs,resourceVersion:,limit:500,continue: (started: 2020-08-07 06:13:10.769517388 +0000 UTC m=+64659.792545193) (total time: 4.954163027s):
Trace[1481438935]: [4.954163027s] [4.954163027s] END
E0807 06:13:15.723764       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E0807 06:13:15.723916       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E0807 06:13:15.723977       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:15.724025       1 trace.go:116] Trace[1361963526]: "List" url:/apis/batch/v1/jobs,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:cronjob-controller,client:127.0.0.1 (started: 2020-08-07 06:13:10.769486398 +0000 UTC m=+64659.792514203) (total time: 4.954508531s):
Trace[1361963526]: [4.954508531s] [4.954483431s] END
I0807 06:13:15.725151       1 trace.go:116] Trace[304666810]: "Get" url:/apis/batch/v1/namespaces/acceptance/jobs/zeus-cron-gw2g-1596780780,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:job-controller,client:127.0.0.1 (started: 2020-08-07 06:13:09.977885772 +0000 UTC m=+64659.000913577) (total time: 5.747228012s):
Trace[304666810]: [5.747228012s] [5.747219762s] END
I0807 06:13:15.726188       1 trace.go:116] Trace[545611142]: "Get" url:/apis/batch/v1/namespaces/acceptance/jobs/zeus-cron-app-1596780780,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:job-controller,client:127.0.0.1 (started: 2020-08-07 06:13:09.979131678 +0000 UTC m=+64659.002159483) (total time: 5.7470216s):
Trace[545611142]: [5.7470216s] [5.74701414s] END
E0807 06:13:15.599670       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0807 06:13:15.599992       1 trace.go:116] Trace[886094614]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:05.599827317 +0000 UTC m=+64654.622855132) (total time: 10.000120861s):
Trace[886094614]: [10.000120861s] [10.00009684s] END
I0807 06:13:10.537937       1 healthz.go:234] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed
I0807 06:13:10.458289       1 healthz.go:234] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed
I0807 06:13:09.996217       1 trace.go:116] Trace[403515704]: "Delete" url:/apis/cilium.io/v2/namespaces/acceptance/ciliumendpoints/zeus-cron-gwtf-1596780780-5kmzk,user-agent:cilium-agent/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.143.159 (started: 2020-08-07 06:13:06.112703276 +0000 UTC m=+65187.299162387) (total time: 3.883470884s):
Trace[403515704]: [3.883391923s] [3.883297552s] Object deleted from database
I0807 06:13:09.977802       1 trace.go:116] Trace[1410702708]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (started: 2020-08-07 06:13:08.661051795 +0000 UTC m=+64657.684079620) (total time: 1.316728635s):
Trace[1410702708]: [1.316233358s] [1.31414073s] Transaction committed
I0807 06:13:09.983594       1 trace.go:116] Trace[1749693897]: "Patch" url:/apis/cilium.io/v2/namespaces/acceptance/ciliumendpoints/cms-68d9f76dc5-kxr89/status,user-agent:cilium-agent/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.147.123 (started: 2020-08-07 06:13:08.660867023 +0000 UTC m=+64657.683894828) (total time: 1.322687755s):
Trace[1749693897]: [1.32233201s] [1.32080904s] Object stored in database
I0807 06:13:09.981203       1 trace.go:116] Trace[130153913]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-201-143.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:127.0.0.1 (started: 2020-08-07 06:13:04.645504561 +0000 UTC m=+64653.668532366) (total time: 5.335659155s):
Trace[130153913]: [5.335590744s] [5.335474182s] Object stored in database
Trace[1956585047]: [1.937387283s] [1.937376643s] About to write a response
I0807 06:13:09.978666       1 trace.go:116] Trace[302583028]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:08.156852449 +0000 UTC m=+65189.343311570) (total time: 1.821781192s):
Trace[302583028]: [1.821745162s] [1.821727182s] About to write a response
I0807 06:13:09.979356       1 trace.go:116] Trace[1956585047]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.162.134 (started: 2020-08-07 06:13:08.041895866 +0000 UTC m=+65189.228354997) (total time: 1.937434723s):
I0807 06:13:09.976964       1 trace.go:116] Trace[676313116]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-08-07 06:13:04.645737865 +0000 UTC m=+64653.668765720) (total time: 5.331192724s):
Trace[676313116]: [5.331164203s] [5.330477654s] Transaction committed
I0807 06:13:09.976842       1 trace.go:116] Trace[192067124]: "Get" url:/api/v1/namespaces/kube-system/configmaps/kops-controller-leader,user-agent:kops-controller/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.146.71 (started: 2020-08-07 06:13:07.055196506 +0000 UTC m=+65188.241655627) (total time: 2.921623313s):
Trace[192067124]: [2.921577003s] [2.921566913s] About to write a response
I0807 06:13:09.975062       1 trace.go:116] Trace[661551568]: "Get" url:/api/v1/namespaces/default,user-agent:kube-apiserver/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:::1 (started: 2020-08-07 06:13:06.629073509 +0000 UTC m=+65187.815532630) (total time: 3.345941179s):
Trace[661551568]: [3.345889458s] [3.345877818s] About to write a response
I0807 06:13:09.975897       1 trace.go:116] Trace[2101659364]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:08.508879743 +0000 UTC m=+65189.695338854) (total time: 1.466990185s):
Trace[2101659364]: [1.466957604s] [1.466939953s] About to write a response
I0807 06:13:09.970468       1 trace.go:116] Trace[2060751332]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-08-07 06:13:06.202295244 +0000 UTC m=+71859.127978134) (total time: 3.768139027s):
Trace[2060751332]: [3.768119017s] [3.767599139s] Transaction committed
I0807 06:13:09.970631       1 trace.go:116] Trace[1058289082]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-0-147-123.eu-west-1.compute.internal,user-agent:kubelet/v1.18.6 (linux/amd64) kubernetes/dff82dc,client:10.0.147.123 (started: 2020-08-07 06:13:06.202160022 +0000 UTC m=+71859.127842912) (total time: 3.768428861s):
Trace[1058289082]: [3.76836041s] [3.768269028s] Object stored in database
E0807 06:13:09.965759       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
I0807 06:13:09.966006       1 trace.go:116] Trace[2104184172]: "Get" url:/apis/batch/v1/namespaces/acceptance/jobs/zeus-cron-gwtf-1596780780,user-agent:kube-controller-manager/v1.18.6 (linux/amd64) kubernetes/dff82dc/system:serviceaccount:kube-system:job-controller,client:127.0.0.1 (started: 2020-08-07 06:13:04.739204483 +0000 UTC m=+64653.762232288) (total time: 5.226765548s):
Trace[2104184172]: [5.226765548s] [5.226756688s] END
E0807 06:13:09.964955       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
I0807 06:13:09.965250       1 trace.go:116] Trace[1438200376]: "Get" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.18.6 (linux/amd64) kubernetes/dff82dc/leader-election,client:127.0.0.1 (started: 2020-08-07 06:13:06.064100277 +0000 UTC m=+71858.989783177) (total time: 3.901113666s):
Trace[1438200376]: [3.901113666s] [3.901096376s] END
I0807 06:13:09.966167       1 trace.go:116] Trace[223385911]: "Patch" url:/apis/cilium.io/v2/ciliumidentities/39169/status,user-agent:cilium-agent/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.143.159 (started: 2020-08-07 06:13:06.081608525 +0000 UTC m=+65187.268067656) (total time: 3.884537427s):
Trace[223385911]: [3.884537427s] [3.884129462s] END
I0807 06:13:09.965645       1 trace.go:116] Trace[1205949895]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (started: 2020-08-07 06:13:06.081849118 +0000 UTC m=+65187.268308239) (total time: 3.883720607s):
Trace[1205949895]: [3.883720607s] [3.883554395s] END
E0807 06:13:09.965793       1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: leader changed"}
I0807 06:13:08.590529       1 healthz.go:234] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed
I0807 06:13:06.077871       1 trace.go:116] Trace[773813581]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (started: 2020-08-07 06:13:04.793682452 +0000 UTC m=+65185.980141603) (total time: 1.284154658s):
Trace[773813581]: [1.283704723s] [1.28262773s] Transaction committed
I0807 06:13:06.078256       1 trace.go:116] Trace[81245877]: "Update" url:/apis/cilium.io/v2/ciliumidentities/45459/status,user-agent:cilium-agent/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.143.159 (started: 2020-08-07 06:13:04.793090645 +0000 UTC m=+65185.979549766) (total time: 1.2851312s):
Trace[81245877]: [1.284832966s] [1.28434106s] Object stored in database
I0807 06:08:21.306999       1 trace.go:116] Trace[1291006070]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.176.47 (started: 2020-08-07 06:08:20.621147423 +0000 UTC m=+71573.546830313) (total time: 685.811853ms):
Trace[1291006070]: [685.810623ms] [629.556012ms] Writing http response done count:46
I0807 06:08:20.087495       1 trace.go:116] Trace[213798601]: "List" url:/api/v1/namespaces/monitoring/secrets,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.0.176.47 (started: 2020-08-07 06:08:19.321672335 +0000 UTC m=+71572.247355215) (total time: 765.779678ms):
Trace[213798601]: [765.777368ms] [688.677486ms] Writing http response done count:46
I0807 06:08:19.420525       1 controller.go:606] quota admission added evaluator for: ingresses.extensions
I0807 06:08:19.378044       1 controller.go:606] quota admission added evaluator for: deployments.apps

To investigate further I've enabled the apiserver audit log, but I cannot pinpoint the problem yet.

I would be happy to get tips on how to proceed with debugging or how to interpret these logs.
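
As a debugging sketch (assuming the standard kube-prometheus metric names, which are the same ones the alert's recording rules use), two Prometheus queries can help separate the error side from the latency side of the budget burn:

# 5xx responses from the apiserver (the "errors" part of the burn rate)
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[30m])) by (verb, resource, code)

# p99 read latency by scope (the "too slow" part of the burn rate)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET"}[30m])) by (le, scope))

If the 5xx rate is essentially zero, it is the slow LIST/GET requests that are eating the budget.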

@demisx

demisx commented Aug 8, 2020

I am not sure whether apiserver_request:burnrate3d or apiserver_request:burnrate6h is firing in my case, since they are both present in the alert expression:

sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01)

I don't see any errors written to the API server log, just these "LIST" requests from the flux daemon. Pretty lost at this point.

I0808 15:22:43.816594       1 trace.go:116] Trace[1771355195]: "List" url:/api/v1/secrets,user-agent:fluxd/v0.0.0 (linux/amd64) kubernetes/$Format,client:172.20.106.218 (started: 2020-08-08 15:22:43.145114257 +0000 UTC m=+249629.270256551) (total time: 671.436227ms):
Trace[1771355195]: [671.434607ms] [627.688687ms] Writing http response done count:134
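
As a side note, one way to see how each side of the and behaves (just a sketch built from the recording rules the alert already references) is to graph each term separately against the 0.01 budget:

sum(apiserver_request:burnrate3d)
sum(apiserver_request:burnrate6h)

Both have to exceed 0.01 for the alert to fire, but graphing them separately shows how far over the threshold each window is and when it started climbing.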

@demisx

demisx commented Aug 19, 2020

Has anyone been able to figure out the source of the problem? We are constantly getting hit with warning alerts from all 1.18.6 clusters.

@demisx

demisx commented Aug 22, 2020

After disabling Flux and Helm Operator, my LIST operation latency is around 50 ms now and consistent with the others.

[screenshot: LIST latency after disabling Flux and Helm Operator]

However, the warning is still firing. Do you guys know why the Prometheus query UI returns different values when queried by the recording rule name vs. the expression it corresponds to? For example, let's take apiserver_request:burnrate6h:

    - expr: |
        (
          (
            # too slow
            sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h]))
            -
            (
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[6h])) +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h])) +
              sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))
            )
          )
          +
          # errors
          sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[6h]))
        )
        /
        sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h]))
      labels:
        verb: read
      record: apiserver_request:burnrate6h

When queried by the recording rule name I get 0.024...:

[screenshot: query by recording rule name returning 0.024...]

But when queried by the expression itself I get 0.000002816...:

[screenshot: query by the raw expression returning 0.000002816...]

Aren't these numbers supposed to be the same?
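
One way to sanity-check this (a sketch, assuming the operator's default prometheus-operated service and that jq is installed) is to ask Prometheus which expression it actually has loaded for the recording rule, instead of copying the rule from elsewhere:

kubectl -n monitoring port-forward svc/prometheus-operated 9090 &
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].rules[] | select(.name == "apiserver_request:burnrate6h") | .query'

If the query returned here differs from the expression pasted into the UI (for example in the le buckets or the scope matchers), the recorded series and the hand-typed expression will legitimately disagree.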

@hakman

hakman commented Aug 23, 2020

One observation about the config: the etcd version is incorrect. It should be 3.4.3.

@demisx

demisx commented Aug 23, 2020

One observation about the config: the etcd version is incorrect. It should be 3.4.3.

@hakman Thank you for noting that. Will this be something that kops corrects in future releases, or do I need to edit the cluster manually?

@hakman

hakman commented Aug 24, 2020

This is something you manually added to your cluster config and kops won't override it. You should change it manually to 3.4.3.
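
For reference, the manual change would look roughly like this (a sketch of the usual kops edit/update/rolling-update cycle, with $NAME standing in for the cluster name; see the later comments before jumping versions on a production cluster):

kops edit cluster $NAME            # set version: 3.4.3 under both etcd clusters (main and events)
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes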

@justinaslelys

I do not have the etcd version specified but have the same problem, so most probably that's not the root cause.
I'm still having the same issue, and I do not have Flux or the Helm Operator.

@demisx

demisx commented Aug 24, 2020

@hakman Hmm.. 🤔 I've never touched the etcd version. I've always gone through kops upgrade cluster && kops rolling-update cluster --yes to upgrade the cluster. Anyway, I've manually changed the etcd version to 3.4.3 and then tried to do the rolling update, but ran into a problem with the calico node pods erroring out, unable to connect to the store. I couldn't roll back to 3.2.24 either because of this error, so I am recreating my test cluster from scratch now.

@hakman

hakman commented Aug 24, 2020

There was a bug where kops create cluster --networking calico always set etcd.version: 3.2.24. I just fixed it today. Sorry for the confusion.

@demisx

demisx commented Aug 24, 2020

@hakman What would be the proper way to upgrade an existing cluster to v3.4.3? Wait for the next kops 1.18.x release?

@hakman

hakman commented Aug 24, 2020

Not sure this change will make it into the next release. The effect is the same as setting the version to 3.4.3 in the cluster config, nothing fancy. The upgrade should probably be done by going through 3.3.x as well, to ensure compatibility (at least that is what I think).
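
Purely as an illustration of that stepped approach (the intermediate release number is an example, not something prescribed in this thread, and only the changed field is shown):

# first pass: edit the cluster spec, apply, and roll the masters
etcdClusters:
- name: main
  version: 3.3.17   # example intermediate 3.3.x release
- name: events
  version: 3.3.17
# second pass: repeat the edit with version: 3.4.3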

@demisx

demisx commented Aug 24, 2020

@xoxbet Do you also get different values when querying by the recording rule name (e.g. apiserver_request:burnrate6h) vs. the expression it corresponds to, just as I've shown in my comment above?

@justinaslelys

@demisx I get the same result when using the raw expression and the pre-generated record. Where did you take your expression from? Yours is a bit different from mine (scope=~"resource|" vs scope="resource"). If I use your expression, I get a different result than the record.

Check if the expression you use to query Prometheus matches the one here:
kubectl describe prometheusrule prometheus-operator-kube-apiserver.rules -n monitoring

Interestingly, our clusters don't have much in common; we are even using different networking components.
Here is my cluster too, maybe someone can spot something:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-02-25T21:50:18Z"
  generation: 38
  name: hostname.tld
spec:
  additionalPolicies:
    node: <some stuff>
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    env: dev
  cloudProvider: aws
  configBase: s3://bucket/hostname.tld
  dnsZone: hostname.tld
  docker:
    logDriver: json-file
    logOpt:
      - max-size=20m
      - max-file=1
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - instanceGroup: master-eu-west-1a
          name: a
        - instanceGroup: master-eu-west-1b
          name: b
        - instanceGroup: master-eu-west-1c
          name: c
      memoryRequest: 100Mi
      name: main
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: master-eu-west-1a
          name: a
        - instanceGroup: master-eu-west-1b
          name: b
        - instanceGroup: master-eu-west-1c
          name: c
      memoryRequest: 100Mi
      name: events
  fileAssets:
    - content: |
        # Log all requests at the Metadata level.
        apiVersion: audit.k8s.io/v1
        kind: Policy
        rules:
        - level: Metadata
      name: apiserver-audit
      path: /srv/kubernetes/audit.yaml
      roles:
        - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/audit.yaml
    disableBasicAuth: true
  kubeControllerManager:
    horizontalPodAutoscalerDownscaleDelay: 15m0s
    horizontalPodAutoscalerSyncPeriod: 8s
    horizontalPodAutoscalerUpscaleDelay: 1m0s
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    evictionHard: memory.available<900Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
  kubernetesApiAccess:
    - 0.0.0.0/0
  kubernetesVersion: 1.18.6
  masterInternalName: api.internal.hostname.tld
  masterPublicName: api.hostname.tld
  networkCIDR: 10.0.0.0/16
  networkID: vpc-110011
  networking:
    cilium:
      IPTablesRulesNoinstall: false
      autoDirectNodeRoutes: false
      bpfCTGlobalAnyMax: 0
      bpfCTGlobalTCPMax: 0
      clusterName: ""
      cniBinPath: ""
      enableNodePort: false
      enablePrometheusMetrics: true
      enableRemoteNodeIdentity: false
      enableipv4: false
      enableipv6: false
      monitorAggregation: ""
      nodeInitBootstrapFile: ""
      preallocateBPFMaps: true
      reconfigureKubelet: false
      removeCbrBridge: false
      restartPods: false
      sidecarIstioProxyImage: ""
      toFqdnsEnablePoller: false
      version: v1.7.5
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
    - 0.0.0.0/0
  subnets:
    <skipping, long and boring>
  topology:
    bastion:
      bastionPublicName: bastion.hostname.tld
    dns:
      type: Public
    masters: private
    nodes: private

@justinaslelys

The fact that we have different expressions was suspicious to me, so I've updated the Prometheus Operator from 8.13.8 to 9.3.1 and updated the PrometheusRules too (I have the Prometheus rules versioned in my git repo because I have some of the alerts customised).

For now (1 hour after the upgrade) I don't see KubeAPIErrorBudgetBurn pending in alerts. I will keep you posted.

@justinaslelys

OK, I'm 99 percent sure the issue is the old record expression. Compare the results of the old (left) and updated (right) rules: they are very different.

[screenshot: prometheus-rule-old-vs-new]

@demisx try updating the Prometheus rules and let's see if we can close this issue.
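
For anyone landing here later, the chart upgrade confirmed below boils down to something like this (release name, repo, and namespace are guesses based on the prometheus-operator-* rule names above; adjust to your installation):

helm repo update
helm upgrade prometheus-operator stable/prometheus-operator --namespace monitoring --version 9.3.1

Updating the chart regenerates the kube-apiserver.rules PrometheusRule, which is where the burnrate recording rules come from.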

@demisx

demisx commented Aug 25, 2020

@xoxbet Oh, great catch 👍. I must've grabbed those rules from the wrong place. Today I copied them into the query UI from the cluster using your kubectl command, and the values did match. I've also upgraded to the latest v9.3.1 of the prometheus-operator helm chart, and my alerts are green so far! It feels SO GOOD not to be constantly messaged about the KubeAPIErrorBudgetBurn alert. Thank you! 🙏

@demisx demisx closed this as completed Aug 25, 2020