After upgrading to 1.7.0, Kubelet no longer reports cAdvisor stats #48483

Closed
unixwitch opened this Issue Jul 5, 2017 · 43 comments


unixwitch commented Jul 5, 2017

Is this a BUG REPORT or FEATURE REQUEST?: Bug report.

/kind bug

What happened:

I upgraded a cluster from 1.6.6 to 1.7.0. Kubelet no longer reports cAdvisor metrics such as container_cpu_usage_seconds_total on its metrics endpoint (https://node:10250/metrics/). Kubelet's own metrics are still there. cAdvisor itself (http://node:4194/) does show container metrics.

What you expected to happen:

Nothing in the release notes suggests this interface has changed, so I expected the metrics would still be there.

How to reproduce it (as minimally and precisely as possible):

I don't know, but I can reproduce it reliably on this cluster; rebooting or reinstalling nodes doesn't make a difference.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0+coreos.0", GitCommit:"8c1bf133b4129042ef8f7d1ffac1be14ee83ed10", GitTreeState:"clean", BuildDate:"2017-06-30T17:46:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): CoreOS 1409.5.0
  • Kernel (e.g. uname -a): Linux staging-worker-710d.c.torchkube.internal 4.11.6-coreos-r1 #1 SMP Thu Jun 22 22:04:38 UTC 2017 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
  • Install tools: Custom scripts.
  • Others:

k8s-merge-robot commented Jul 5, 2017

@unixwitch There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
e.g., @kubernetes/sig-api-machinery-* for API Machinery
(2) specifying the label manually: /sig <label>
e.g., /sig scalability for sig/scalability

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here


unixwitch commented Jul 5, 2017

@kubernetes/sig-node-misc


k8s-ci-robot commented Jul 5, 2017

@unixwitch: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-misc.

In response to this:

@kubernetes/sig-node-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


dixudx commented Jul 5, 2017

@unixwitch This seems to be related to cAdvisor. See whether PR #48485 could fix this.


unixwitch commented Jul 5, 2017

Using latest release-1.7 plus 71160031 doesn't seem to make a difference. It logs this at startup now:

Jul 05 10:42:45 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: I0705 10:42:45.483241   21596 cadvisor_linux.go:124] starting cadvisor manager ...
Jul 05 10:42:46 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: I0705 10:42:46.218169   21596 cadvisor_linux.go:124] starting cadvisor manager ...

But the metrics are still missing:

# curl -isSk --cert /var/lib/prometheus/k8s/torchbox-staging-crt.pem --key /var/lib/prometheus/k8s/torchbox-staging-key.pem https://172.31.208.9:10250/metrics | grep container_cpu
#

unixwitch commented Jul 5, 2017

I'm not sure if this is related, but Kubelet is also logging this every 10 seconds:

Jul 05 10:53:11 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: W0705 10:53:11.192776   21596 helpers.go:771] eviction manager: no observation found for eviction signal allocatableNodeFs.available

unixwitch commented Jul 5, 2017

This looks the same as #47744, but the fix for that was merged before 1.7.0 release, so I'm not sure why it's still broken.

@Random-Liu


FarhadF commented Jul 5, 2017

I have the same issue on a newly installed cluster.

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

All container_* metrics are missing from http://10.100.1.3:10254/metrics (checked with grep container_).
Other metrics are present without the grep filter.

Container metrics are available via curl http://localhost:10255/stats/summary:

{
  "node": {
   "nodeName": "k2",
   "systemContainers": [
    {
     "name": "kubelet",
     "startTime": "2017-07-05T16:13:55Z",
     "cpu": {
      "time": "2017-07-05T16:19:30Z",
      "usageNanoCores": 29075162,
      "usageCoreNanoSeconds": 12165039327
     },
     "memory": {
      "time": "2017-07-05T16:19:30Z",
      "usageBytes": 37052416,
      "workingSetBytes": 36323328,
      "rssBytes": 34512896,
      "pageFaults": 123283,
      "majorPageFaults": 10
     },
     "userDefinedMetrics": null
    },
    {
     "name": "runtime",
     "startTime": "2017-07-03T09:30:05Z",
     "cpu": {
      "time": "2017-07-05T16:19:37Z",
      "usageNanoCores": 5825907,
      "usageCoreNanoSeconds": 1184794434270
     },
     "memory": {
      "time": "2017-07-05T16:19:37Z",
      "usageBytes": 646012928,
      "workingSetBytes": 235999232,
      "rssBytes": 60485632,
      "pageFaults": 617224,
      "majorPageFaults": 325
     },
     "userDefinedMetrics": null
    }
   ],
   "startTime": "2017-07-03T09:30:05Z",
   "cpu": {
    "time": "2017-07-05T16:19:37Z",
    "usageNanoCores": 98265931,
    "usageCoreNanoSeconds": 9257390739986
   },
   "memory": {
    "time": "2017-07-05T16:19:37Z",
    "availableBytes": 1477287936,
    "usageBytes": 1241866240,
    "workingSetBytes": 624513024,
    "rssBytes": 647168,
    "pageFaults": 41456,
    "majorPageFaults": 95
   },
   "fs": {
    "time": "2017-07-05T16:19:37Z",
    "availableBytes": 3012079616,
    "capacityBytes": 6166740992,
    "usedBytes": 2821214208,
    "inodesFree": 320027,
    "inodes": 387072,
    "inodesUsed": 67045
   },
   "runtime": {
    "imageFs": {
     "time": "2017-07-05T16:19:37Z",
     "availableBytes": 3012079616,
     "capacityBytes": 6166740992,
     "usedBytes": 801880663,
     "inodesFree": 320027,
     "inodes": 387072,
     "inodesUsed": 67045
    }
   }
  },
  "pods": [
   {
    "podRef": {
     "name": "kubernetes-dashboard-103235509-q4m9d",
     "namespace": "kube-system",
     "uid": "267b4bf8-5fe6-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T12:01:30Z",
    "containers": [
     {
      "name": "kubernetes-dashboard",
      "startTime": "2017-07-03T12:01:31Z",
      "cpu": {
       "time": "2017-07-05T16:19:35Z",
       "usageNanoCores": 1180606,
       "usageCoreNanoSeconds": 61823932041
      },
      "memory": {
       "time": "2017-07-05T16:19:35Z",
       "usageBytes": 23384064,
       "workingSetBytes": 23384064,
       "rssBytes": 22962176,
       "pageFaults": 10201,
       "majorPageFaults": 35
      },
      "rootfs": {
       "time": "2017-07-05T16:19:35Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 135471104,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 12
      },
      "logs": {
       "time": "2017-07-05T16:19:35Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 36864,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:43Z",
     "rxBytes": 9887383,
     "rxErrors": 0,
     "txBytes": 23368295,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-60j8w"
     }
    ]
   },
   {
    "podRef": {
     "name": "node-exporter-60p8r",
     "namespace": "monitoring",
     "uid": "2e6af934-6005-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T15:35:02Z",
    "containers": [
     {
      "name": "node-exporter",
      "startTime": "2017-07-03T15:35:02Z",
      "cpu": {
       "time": "2017-07-05T16:19:30Z",
       "usageNanoCores": 1185574,
       "usageCoreNanoSeconds": 144826707561
      },
      "memory": {
       "time": "2017-07-05T16:19:30Z",
       "usageBytes": 8609792,
       "workingSetBytes": 8609792,
       "rssBytes": 8179712,
       "pageFaults": 4938,
       "majorPageFaults": 9
      },
      "rootfs": {
       "time": "2017-07-05T16:19:30Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 21422080,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 12
      },
      "logs": {
       "time": "2017-07-05T16:19:30Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 28672,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   },
   {
    "podRef": {
     "name": "nginx-ingress-controller-d6h56",
     "namespace": "kube-system",
     "uid": "ce0ecea5-5ff6-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T13:52:08Z",
    "containers": [
     {
      "name": "nginx-ingress-controller",
      "startTime": "2017-07-03T13:52:08Z",
      "cpu": {
       "time": "2017-07-05T16:19:41Z",
       "usageNanoCores": 3253897,
       "usageCoreNanoSeconds": 423721194278
      },
      "memory": {
       "time": "2017-07-05T16:19:41Z",
       "usageBytes": 79507456,
       "workingSetBytes": 79491072,
       "rssBytes": 75460608,
       "pageFaults": 616490,
       "majorPageFaults": 33
      },
      "rootfs": {
       "time": "2017-07-05T16:19:41Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 130162688,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 29
      },
      "logs": {
       "time": "2017-07-05T16:19:41Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 49152,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-60j8w"
     }
    ]
   },
   {
    "podRef": {
     "name": "kube-state-metrics-deployment-1863931462-7ckb2",
     "namespace": "monitoring",
     "uid": "169b8f97-6185-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-05T13:23:09Z",
    "containers": [
     {
      "name": "kube-state-metrics",
      "startTime": "2017-07-05T13:23:09Z",
      "cpu": {
       "time": "2017-07-05T16:19:29Z",
       "usageNanoCores": 593473,
       "usageCoreNanoSeconds": 7616961025
      },
      "memory": {
       "time": "2017-07-05T16:19:29Z",
       "usageBytes": 11620352,
       "workingSetBytes": 11620352,
       "rssBytes": 11276288,
       "pageFaults": 5246,
       "majorPageFaults": 0
      },
      "rootfs": {
       "time": "2017-07-05T16:19:29Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 45719552,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 13
      },
      "logs": {
       "time": "2017-07-05T16:19:29Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 24576,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:38Z",
     "rxBytes": 7551698,
     "rxErrors": 0,
     "txBytes": 3007631,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   },
   {
    "podRef": {
     "name": "grafana-3205277920-3rv9g",
     "namespace": "monitoring",
     "uid": "5ed85a62-6009-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T16:05:01Z",
    "containers": [
     {
      "name": "grafana",
      "startTime": "2017-07-03T16:05:02Z",
      "cpu": {
       "time": "2017-07-05T16:19:32Z",
       "usageNanoCores": 1523897,
       "usageCoreNanoSeconds": 302809923832
      },
      "memory": {
       "time": "2017-07-05T16:19:32Z",
       "usageBytes": 71905280,
       "workingSetBytes": 35860480,
       "rssBytes": 12009472,
       "pageFaults": 3290696,
       "majorPageFaults": 14
      },
      "rootfs": {
       "time": "2017-07-05T16:19:32Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 316682240,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 13
      },
      "logs": {
       "time": "2017-07-05T16:19:32Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 196608,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:36Z",
     "rxBytes": 14553836,
     "rxErrors": 0,
     "txBytes": 174770316,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   }
  ]
 }

dixudx commented Jul 6, 2017

@FarhadF But it works well on my newly created cluster.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ curl http://localhost:4194/metrics | grep container_*
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
# TYPE container_cpu_system_seconds_total counter
container_cpu_system_seconds_total{id="/"} 302.97
container_cpu_system_seconds_total{id="/docker"} 22.5
container_cpu_system_seconds_total{id="/init.scope"} 0.72
container_cpu_system_seconds_total{id="/kubepods"} 37.44
container_cpu_system_seconds_total{id="/kubepods/besteffort"} 37.47
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod541daf716354cf26f8397227012897da"} 13.89
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod82b0a0bc89364213d292b9240a42d1ab"} 2.46
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod82b0a0bc89364213d292b9240a42d1ab/41ecc652971c6f77055b843a22f8eb09d93a354745e1e175e1b1e7d0f823c152/kube-proxy"} 2.4
container_cpu_system_seconds_total{id="/kubepods/besteffort/podcc6968656fd8366efd6c451ff7e122f4"} 14.61
container_cpu_system_seconds_total{id="/kubepods/besteffort/podf70b33a895a6f7d2a84d34fc5af97783"} 6.11
container_cpu_system_seconds_total{id="/kubepods/burstable"} 0
container_cpu_system_seconds_total{id="/system.slice"} 42.52
container_cpu_system_seconds_total{id="/system.slice/audit-rules.service"} 0
container_cpu_system_seconds_total{id="/system.slice/containerd.service"} 1.1
container_cpu_system_seconds_total{id="/system.slice/coreos-setup-environment.service"} 0
....
....

dchen1107 commented Jul 6, 2017

This looks like a dup of #47744. @dashpole can you please verify this? Thanks!


dashpole commented Jul 6, 2017

On a newly created cluster from head, this particular issue appears to be resolved, and is most likely a dup of #47744.

curl localhost:4194/metrics | grep container_cpu_usage_seconds_total
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed per cpu in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{cpu="cpu00",id="/"} 142.443179425
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods"} 87.230398293
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/besteffort"} 0.141097833
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/besteffort/podb9ffb65e628276cfe6b3ab57640baa55"} 0.141097833
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/burstable"} 86.840747259
...


unixwitch commented Jul 6, 2017

I'm not sure this is #47744 because it was still broken for me with 1.7.1-beta.0.3 Kubelet (with 1.7.0 master). That build does have e90c477 in it, which I thought was the fix for #47744.

I can bring up a test cluster to see if this is related to upgrading, but I imagine that's unlikely. Maybe it's affected by command-line options or system configuration? (Running in rkt vs. on the host made no difference for me.)


unixwitch commented Jul 6, 2017

New cluster with 1.7.1-beta.0.3 Kubelet:

test48483-master-mgtc ~ # kubectl get --all-namespaces pod -owide | grep test48483-worker-ng6f
2017-07-06 19:52:59.358716 I | proto: duplicate proto type registered: google.protobuf.Any
2017-07-06 19:52:59.358891 I | proto: duplicate proto type registered: google.protobuf.Duration
2017-07-06 19:52:59.358960 I | proto: duplicate proto type registered: google.protobuf.Timestamp
kube-lego       kube-lego-4240885720-wqfv7                      1/1       Running   0          1m        172.29.1.3      test48483-worker-ng6f
kube-system     calico-node-vb5l6                               1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
kube-system     kube-proxy-test48483-worker-ng6f                1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
kube-system     kube-state-metrics-1811189913-0fvmh             1/1       Running   0          1m        172.29.1.4      test48483-worker-ng6f
kube-system     kube-state-metrics-1811189913-5fm84             1/1       Running   0          1m        172.29.1.2      test48483-worker-ng6f
kube-system     node-exporter-jj4jd                             1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
test48483-master-mgtc ~ # curl -sS http://test48483-worker-ng6f:10255/metrics|grep container_
# HELP kubelet_running_container_count Number of containers currently running
# TYPE kubelet_running_container_count gauge
kubelet_running_container_count 6
kubelet_runtime_operations{operation_type="container_status"} 6
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.5"} 2337
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.9"} 3912
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.99"} 3912
kubelet_runtime_operations_latency_microseconds_sum{operation_type="container_status"} 19886
kubelet_runtime_operations_latency_microseconds_count{operation_type="container_status"} 6
test48483-master-mgtc ~ # 

dashpole commented Jul 6, 2017

@unixwitch I finally realized you are using the wrong port. 10255 is the kubelet's port for prometheus metrics. As you can see, it gives a metric for runtime operation latency. Port 4194 is the cadvisor port, which has container metrics. See if that works.


unixwitch commented Jul 7, 2017

@dashpole The problem is that in 1.6 and earlier, port 10255 returned cAdvisor container metrics. The fact that it no longer does is an incompatible change which has broken Prometheus, which scrapes this port: https://github.com/prometheus/prometheus/blob/release-1.7/discovery/kubernetes/node.go#L156

If this was intentionally changed, shouldn't there have been an entry in the release notes?

Does this also mean it's now impossible to scrape container metrics over TLS (which worked before using port 10250)? That seems like a significant regression in functionality.


smarterclayton commented Jul 7, 2017

This does seem like a regression in behavior.


dchen1107 commented Jul 7, 2017

@luxas is this caused by your change on cAdvisor availability: kubernetes/release#356?


luxas commented Jul 7, 2017

@dchen1107 No, definitely not. That was disabling the public cAdvisor port for kubeadm setups only.

It's reported that custom scripts were used and this happened even though cAdvisor was accessible publicly.


luxas commented Jul 7, 2017

This seems very kubelet-internal. Also note the error log message attached above.


unixwitch commented Jul 7, 2017

I wasn't aware of kubernetes/release#356, but if I understand it right, this means a cluster installed by kubeadm has no way to access cAdvisor metrics from Prometheus at all (without manual configuration by the administrator): they are no longer exposed by Kubelet, and they can't be retrieved from cAdvisor directly because its HTTP server is disabled.

It seems to me that disabling cAdvisor by default is a good idea (metrics should not be exposed to the world without authentication) and the new behaviour in Kubelet should be reverted so that metrics are once again available behind authentication. Although it's still not clear to me if the Kubelet change was intentional or not, and if so, what the rationale was for it.


unixwitch commented Jul 7, 2017

(As an aside, I was planning to disable cAdvisor with --cadvisor-port=0 in our clusters to avoid exposing unauthenticated metrics, but I had to revert that for 1.7.0 because of this change; so even though we don't use kubeadm, this is still a functionality regression for us, even if we can work around it.)


luxas commented Jul 7, 2017

I'm still pretty sure cAdvisor is running just fine, and pretty much everything still works even if you disable the cAdvisor public port. cAdvisor runs inside the kubelet and is still accessible at <node-ip>:10250/stats/ IIRC. That endpoint shows everything cAdvisor would have shown, in an unauthenticated manner.

However, to stay focused, I think that is unrelated to the issue at hand. Even though cAdvisor is externally accessible, the kubelet won't show these container metrics in its API, right?

Which is indeed a regression from v1.6.


unixwitch commented Jul 7, 2017

cAdvisor is run inside of the kubelet and still accessible at :10250/stats/

But this outputs JSON, which Prometheus doesn't understand. There is no way to collect the metrics in Prometheus format any more, at least in kubeadm's default configuration. (Edit: unless there's a way to make /stats/ output the metrics in Prometheus format. But I couldn't find any documentation suggesting that is the case.)

I think that that is unrelated to the issue being present here

Well, the two changes are unrelated, yes. But the combination of both together is quite unfortunate for Prometheus users as both existing sources of Prometheus-format cAdvisor metrics have been disabled at the same time.

Even though cAdvisor is externally accessible kubelet won't show these container metrics in its API, right?

Right. The only way to collect the metrics in Prometheus format is via the cAdvisor HTTP server.


luxas commented Jul 8, 2017

So the right thing to do here now is to investigate what made kubelet stop reporting cAdvisor container metrics in its own /metrics endpoint in all cases.

Hopefully we can patch this and restore the v1.6 behavior.


dashpole commented Jul 10, 2017

cc @grobie
Ok, so I have tracked the issue down to google/cadvisor#1460.
Specifically, changing prometheus.MustRegister(...) to r := prometheus.NewRegistry(); r.MustRegister(...) caused the metrics to no longer be displayed on the kubelet's port 10250/metrics, but only on port 4194/metrics.
Based on the original issue, I don't think this behavior was intended, although I could be wrong.
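The registry change described above can be illustrated with a stdlib-only sketch. Registry below is a hypothetical miniature of what prometheus.Registry provides (a set of collectors that exactly one metrics handler serves), not the real client_golang type, and the metric names are just examples:

```go
package main

import "fmt"

// Registry is a hypothetical stand-in for prometheus.Registry: a set of
// collectors that exactly one metrics handler serves.
type Registry struct{ metrics []string }

func (r *Registry) MustRegister(name string) { r.metrics = append(r.metrics, name) }

// defaultRegistry mimics the package-global registry targeted by a bare
// prometheus.MustRegister(...) call.
var defaultRegistry = &Registry{}

// register models the before/after: the first result is what a handler built
// from the default registry (the kubelet's /metrics) serves; the second is
// what a handler built from the new private registry serves.
func register() (kubelet, cadvisor []string) {
	// Old behavior: registering on the shared default registry, so the
	// kubelet's /metrics endpoint served the container_* series.
	defaultRegistry.MustRegister("container_cpu_usage_seconds_total")

	// After google/cadvisor#1460: a private registry. Only a handler wired
	// explicitly to it serves these series, so they disappeared from the
	// kubelet's /metrics and remained only on cAdvisor's own port.
	r := &Registry{}
	r.MustRegister("container_memory_usage_bytes")

	return defaultRegistry.metrics, r.metrics
}

func main() {
	kubelet, cadvisor := register()
	fmt.Println("default-registry handler serves:", kubelet)
	fmt.Println("private-registry handler serves:", cadvisor)
}
```

With the real client, the fix merged later in this thread wires the private registry to its own handler path rather than dropping the series entirely.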


luxas commented Jul 11, 2017

@brian-brazil Happy to have that discussion in sig-instrumentation, but IMO, it's more important to fix this issue, get things back to normal, and then plan for a possible deprecation and removal (after ~6 months) of the feature when we have a viable alternative.


grobie commented Jul 11, 2017

I will be working on a fix and will hopefully send a PR tomorrow.


alindeman commented Jul 12, 2017

@grobie Do you expect to change it back so that :10255/metrics includes cAdvisor metrics? Or will the fix be something different? I ask because this broke prometheus-operator's ability to scrape cAdvisor metrics, and I'm wondering if I should propose a change to prometheus-operator to look for metrics on the cAdvisor port, or just hold out for cAdvisor metrics to come back on port 10255.


grobie commented Jul 12, 2017

@alindeman I understood the request to bring back cAdvisor metrics on :10255/metrics for now to restore the 1.6 behavior.

I'm still trying to find the best way to restore the old behavior and test the fix, and given the recent events at SoundCloud I'm also quite busy at the moment, but should have a PR ready by tomorrow.


alindeman commented Jul 12, 2017

@grobie Thanks for working on it ❤️


smarterclayton commented Jul 18, 2017

We could potentially reintroduce this at a new, cAdvisor-specific host endpoint such as :10250/metrics/cadvisor and also correct some of the issues related to consistency mentioned in #45053. Agree with the cost profile of the metrics - it's likely you'd want to scrape the kubelet and this endpoint at different intervals.

I have a quick patch that mostly cleanly puts cadvisor registration at the new path. While keeping exact compatibility is desirable, I don't think moving scrapes to a new path violates the looser API guarantees on the metrics endpoints if we can improve the scalability of the collectors at the same time. Unsecured metrics are a bigger problem, especially where we are regressing from securing them with the kubelet security profile to a lower (even if local) level.


smarterclayton commented Jul 18, 2017

@DirectXMan12 i'm inclined to do the separation but on the main port - opinions?


fgrzadkowski commented Jul 18, 2017

k8s-merge-robot added a commit that referenced this issue Jul 19, 2017

Merge pull request #49079 from smarterclayton/restore_metrics
Automatic merge from submit-queue

Restore cAdvisor prometheus metrics to the main port

But under a new path - `/metrics/cadvisor`. This ensures a secure port still exists for metrics while getting the benefit of separating out container metrics from the kubelet's metrics as recommended in the linked issue.

Fixes #48483

```release-note-action-required
Restored cAdvisor prometheus metrics to the main port -- a regression that existed in v1.7.0-v1.7.2
cAdvisor metrics can now be scraped from `/metrics/cadvisor` on the kubelet ports.
Note that you have to update your scraping jobs to get kubelet-only metrics from `/metrics` and `container_*` metrics from `/metrics/cadvisor`
```
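For Prometheus users following along, the release note above translates into two scrape jobs once a release with the fix is out. This is only an illustrative sketch: the job names and the service-account credential paths are assumptions, not taken from this thread.

```yaml
scrape_configs:
  # Kubelet's own metrics stay on /metrics (the default metrics_path).
  - job_name: kubernetes-nodes
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node

  # container_* metrics moved to /metrics/cadvisor on the same kubelet port.
  - job_name: kubernetes-cadvisor
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
```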

grobie commented Jul 19, 2017

Thanks a lot for picking this up @smarterclayton. I got a bit stuck writing an acceptance test for the expected metrics under /metrics. While it's breaking compatibility with 1.6, I think splitting metrics in general makes sense.


luxas commented Jul 19, 2017

We should definitely have a conformance test for this now -- feel free to write one @grobie :)

unixwitch added a commit to unixwitch/prometheus that referenced this issue Jul 19, 2017

documentation: update Kubernetes example for 1.7
Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet
metrics endpoint.  Update the example configuration to scrape cAdvisor
in addition to Kubelet.  The provided configuration works for 1.7.3+
and commented notes are given for 1.7.2 and earlier versions.

Also remove the comment about node (Kubelet) CA not matching the master
CA.  Since the example no longer connects directly to the nodes, it
doesn't matter what CA they're using.

References:

- kubernetes/kubernetes#48483
- kubernetes/kubernetes#49079

squat referenced this issue Jul 20, 2017: V1.7.1 patchset #152 (closed)

juliusv added a commit to prometheus/prometheus that referenced this issue Jul 21, 2017

documentation: update Kubernetes example for 1.7 (#2918)
Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet
metrics endpoint.  Update the example configuration to scrape cAdvisor
in addition to Kubelet.  The provided configuration works for 1.7.3+
and commented notes are given for 1.7.2 and earlier versions.

Also remove the comment about node (Kubelet) CA not matching the master
CA.  Since the example no longer connects directly to the nodes, it
doesn't matter what CA they're using.

References:

- kubernetes/kubernetes#48483
- kubernetes/kubernetes#49079
@hanikesn


hanikesn commented Jul 24, 2017

Sorry to hijack this issue, but there's clearly a problem with the cAdvisor endpoint in 1.7.1: it randomly reports either systemd cgroups or Docker containers, e.g. for container_memory_usage_bytes.

@matthiasr


matthiasr commented Jul 24, 2017

Please don't hijack issues, it just creates confusion. Once this change is released (presumably with 1.7.3), or if you build from the release branch before that, please confirm whether your issue persists. If it does, it's a new issue; please file it separately. If it doesn't, it was probably related, but is already dealt with.

ntfrnzn added a commit to StackPointCloud/trusted-charts that referenced this issue Oct 27, 2017

Restore pod metrics to prometheus
At some point the cadvisor metrics were moved from the standard kubelet metrics
see kubernetes/kubernetes#48483 .
Change the servicemonitor-kubelet to scrape the subpath as well.

rimusz added a commit to StackPointCloud/trusted-charts that referenced this issue Nov 3, 2017

Update/bump istio v0.2.10 (#87)
* apply updates to prometheus for CRD and rbac

* Restore pod metrics to prometheus

At some point the cadvisor metrics were moved from the standard kubelet metrics
see kubernetes/kubernetes#48483 .
Change the servicemonitor-kubelet to scrape the subpath as well.

* increment versions

* Update/multiple charts (#84)

* update of many charts

* add releases to non prod branches

* toggle prometheus-operator rbac ->false

* toggle prometheus rbac ->false

* bump istio to v0.2.10

rimusz added a commit to StackPointCloud/trusted-charts that referenced this issue Nov 3, 2017

Update/bump istio v0.2.10 (#88)
* apply updates to prometheus for CRD and rbac

* Restore pod metrics to prometheus

At some point the cadvisor metrics were moved from the standard kubelet metrics
see kubernetes/kubernetes#48483 .
Change the servicemonitor-kubelet to scrape the subpath as well.

* increment versions

* Update/multiple charts (#84)

* update of many charts

* add releases to non prod branches

* toggle prometheus-operator rbac ->false

* toggle prometheus rbac ->false

* bump istio to v0.2.10
@zz


zz commented Dec 8, 2017

If you install Prometheus with Helm, add a kubernetes-cadvisors job to the Prometheus config to fix the missing container_* metrics:

      - job_name: 'kubernetes-cadvisors'

        # Default to scraping over https. If required, change this to `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA from
          # the master CA, disable certificate verification as below. Note that
          # certificate verification is an integral part of a secure infrastructure,
          # so this should only be disabled in a controlled environment.
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}:4194/proxy/metrics
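The effect of the relabel rules above can be traced by hand. A minimal sketch in Python of what they do to one discovered node target; the node name `my-node` is an assumption for illustration:

```python
# Sketch of how the relabel_configs above transform a discovered node target.
# The node name "my-node" is an assumption for illustration.
node_name = "my-node"

# Discovered target before relabelling (address would be the node's kubelet).
target = {"__address__": f"{node_name}:10250", "__metrics_path__": "/metrics"}

# Second rule: replace __address__ with the in-cluster apiserver service address.
target["__address__"] = "kubernetes.default.svc:443"

# Third rule: rewrite __metrics_path__ so the apiserver proxies to the
# node's cAdvisor port (4194).
target["__metrics_path__"] = f"/api/v1/nodes/{node_name}:4194/proxy/metrics"

scrape_url = f"https://{target['__address__']}{target['__metrics_path__']}"
print(scrape_url)
# → https://kubernetes.default.svc:443/api/v1/nodes/my-node:4194/proxy/metrics
```

In other words, Prometheus never talks to the nodes directly; every scrape goes through the apiserver proxy, which is why the node CA no longer matters.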