Missing network metrics from cadvisor after some time #23492

Closed
Reamer opened this issue Jul 26, 2019 · 5 comments · Fixed by google/cadvisor#2284 or #23585

@Reamer

Reamer commented Jul 26, 2019

Hi,

I'm using OKD 3.11 with CRI-O. After origin-node has been up for some time, cadvisor stops reporting some network metrics (e.g. container_network_transmit_packets_total).

Version

I built my own RPMs for Origin from the release-3.11 branch, without any changes, and installed them yesterday (25.07.2019), hoping the branch might already contain a fix.

11:01 $ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-openshift.mycompany.com:443
openshift v3.11.0+3b2d3b6-227
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
  1. RSH to Prometheus instance
  2. curl -s -k -H "Authorization: Bearer [My-Token]" https://10.20.15.42:10250/metrics/cadvisor | grep container_network_transmit_packets_total | grep namespace=\"thanos\"
container_network_transmit_packets_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6588346a_aefa_11e9_ab2c_005056945e02.slice/crio-45848b690fe11481394b28c108b1fb25f31c5aaa41ff93ad4f56c71501dc90b6.scope",image="",interface="eth0",name="k8s_POD_thanos-querier-6d98c99694-dj8hg_thanos_6588346a-aefa-11e9-ab2c-005056945e02_0",namespace="thanos",pod_name="thanos-querier-6d98c99694-dj8hg"} 334942
container_network_transmit_packets_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod69a3dfe5_aefa_11e9_ab2c_005056945e02.slice/crio-d07f32d58bd0b4240685147aa1af32e36f7558684c600a397c99af68a16c40c8.scope",image="",interface="eth0",name="k8s_POD_minio-2_thanos_69a3dfe5-aefa-11e9-ab2c-005056945e02_0",namespace="thanos",pod_name="minio-2"} 2.115004e+06
  3. Wait some time, e.g. 5 hours.
  4. Run the same command again; this time it returns no output: curl -s -k -H "Authorization: Bearer [My-Token]" https://10.20.15.42:10250/metrics/cadvisor | grep container_network_transmit_packets_total | grep namespace=\"thanos\"
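The check in the steps above could be scripted so that the disappearance is easier to spot over time. This is only a minimal sketch: `count_series` is a hypothetical helper name, and it simply counts the sample lines of one metric for one namespace in Prometheus text format read from stdin. It assumes the `namespace` label is present, as it is in the kubelet output shown above.

```shell
# count_series METRIC NAMESPACE
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# how many sample lines of METRIC carry the label namespace="NAMESPACE".
count_series() {
    metric=$1
    ns=$2
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is ignored and only the printed count is used.
    grep "^${metric}{" | grep -c "namespace=\"${ns}\"" || true
}

# Example usage with the endpoint and placeholder token from the steps above:
# curl -s -k -H "Authorization: Bearer [My-Token]" \
#     https://10.20.15.42:10250/metrics/cadvisor \
#     | count_series container_network_transmit_packets_total thanos
```

A count that drops to 0 while the pods are still running would correspond to the empty output in step 4.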
Current Result

After some hours, the network metrics for interface eth0 disappear.
[Screenshot: Network_eth0]

Expected Result

Stable network metrics from cadvisor.

@Reamer
Author

Reamer commented Aug 7, 2019

When I noticed that the metrics had disappeared, I started an additional standalone cadvisor process with root privileges, running in parallel to the one in origin-node.

Steps to Reproduce
  1. Start cadvisor version 0.30.2
  2. curl metrics
curl -s localhost:8080/metrics  | grep container_network_transmit_packets_total | grep namespace=\"thanos\"

container_network_transmit_packets_total{container_label_app="minio",container_label_component="",container_label_controller_revision_hash="minio-8c6c9fd47",container_label_io_kubernetes_container_name="POD",container_label_io_kubernetes_pod_name="minio-2",container_label_io_kubernetes_pod_namespace="thanos",container_label_io_kubernetes_pod_uid="d3d9436e-b8d3-11e9-afaf-00505694ab34",container_label_logging_infra="",container_label_openshift_io_component="",container_label_pod_template_generation="",container_label_pod_template_hash="",container_label_provider="",container_label_statefulset_kubernetes_io_pod_name="minio-2",container_label_thanos_metrics="",container_label_type="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd3d9436e_b8d3_11e9_afaf_00505694ab34.slice/crio-7cb8d3377607ed3b4c20006c755459a9b48e978f8f7f5683192e2489ef651ba0.scope",image="",interface="eth0",name="k8s_POD_minio-2_thanos_d3d9436e-b8d3-11e9-afaf-00505694ab34_0"} 555179
container_network_transmit_packets_total{container_label_app="thanos-querier",container_label_component="",container_label_controller_revision_hash="",container_label_io_kubernetes_container_name="POD",container_label_io_kubernetes_pod_name="thanos-querier-6d98c99694-dj8hg",container_label_io_kubernetes_pod_namespace="thanos",container_label_io_kubernetes_pod_uid="6588346a-aefa-11e9-ab2c-005056945e02",container_label_logging_infra="",container_label_openshift_io_component="",container_label_pod_template_generation="",container_label_pod_template_hash="2854755250",container_label_provider="",container_label_statefulset_kubernetes_io_pod_name="",container_label_thanos_metrics="true",container_label_type="",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6588346a_aefa_11e9_ab2c_005056945e02.slice/crio-45848b690fe11481394b28c108b1fb25f31c5aaa41ff93ad4f56c71501dc90b6.scope",image="",interface="eth0",name="k8s_POD_thanos-querier-6d98c99694-dj8hg_thanos_6588346a-aefa-11e9-ab2c-005056945e02_0"} 6.133791e+06
  3. Run the same curl against origin-cadvisor from the Prometheus pod (RSH to the Prometheus pod). The minio-2 series is missing here:
curl -s -k -H "Authorization: Bearer [My-Token]" https://10.20.15.42:10250/metrics/cadvisor | grep container_network_transmit_packets_total | grep namespace=\"thanos\"

container_network_transmit_packets_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6588346a_aefa_11e9_ab2c_005056945e02.slice/crio-45848b690fe11481394b28c108b1fb25f31c5aaa41ff93ad4f56c71501dc90b6.scope",image="",interface="eth0",name="k8s_POD_thanos-querier-6d98c99694-dj8hg_thanos_6588346a-aefa-11e9-ab2c-005056945e02_0",namespace="thanos",pod_name="thanos-querier-6d98c99694-dj8hg"} 6.131489e+06
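The comparison above (standalone cadvisor versus the cadvisor embedded in origin-node) can also be done mechanically. The following is a rough sketch, not a definitive tool: `series_names` and `missing_series` are hypothetical helper names, and it assumes both metric dumps were first saved to files. It relies on the `name` label, which appears in both outputs above, because the other labels differ between the two endpoints.

```shell
# series_names METRIC
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# the sorted, unique values of the `name` label for samples of METRIC.
series_names() {
    grep "^$1{" \
        | sed -n 's/.*[{,]name="\([^"]*\)".*/\1/p' \
        | sort -u
}

# missing_series METRIC REFERENCE_FILE CANDIDATE_FILE
# Prints the `name` values that are present in REFERENCE_FILE but absent from
# CANDIDATE_FILE, i.e. the containers the candidate endpoint no longer reports.
missing_series() {
    ref=$(mktemp) && cand=$(mktemp) || return 1
    series_names "$1" < "$2" > "$ref"
    series_names "$1" < "$3" > "$cand"
    # comm -23 keeps only the lines unique to the first (reference) file.
    comm -23 "$ref" "$cand"
    rm -f "$ref" "$cand"
}

# Example usage, assuming standalone.txt and kubelet.txt hold the two dumps:
# missing_series container_network_transmit_packets_total standalone.txt kubelet.txt
```

With the two outputs shown above, this would be expected to list the minio-2 pod sandbox as missing from the origin-node endpoint.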
Temporary Workaround
  1. Restart origin-node: systemctl restart origin-node
  2. Run the curl against origin-cadvisor from the Prometheus pod again (RSH to the Prometheus pod)
curl -s -k -H "Authorization: Bearer [My-Token]" https://10.20.15.42:10250/metrics/cadvisor | grep container_network_transmit_packets_total | grep namespace=\"thanos\"

container_network_transmit_packets_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6588346a_aefa_11e9_ab2c_005056945e02.slice/crio-45848b690fe11481394b28c108b1fb25f31c5aaa41ff93ad4f56c71501dc90b6.scope",image="",interface="eth0",name="k8s_POD_thanos-querier-6d98c99694-dj8hg_thanos_6588346a-aefa-11e9-ab2c-005056945e02_0",namespace="thanos",pod_name="thanos-querier-6d98c99694-dj8hg"} 6.135159e+06
container_network_transmit_packets_total{container_name="POD",id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd3d9436e_b8d3_11e9_afaf_00505694ab34.slice/crio-7cb8d3377607ed3b4c20006c755459a9b48e978f8f7f5683192e2489ef651ba0.scope",image="",interface="eth0",name="k8s_POD_minio-2_thanos_d3d9436e-b8d3-11e9-afaf-00505694ab34_0",namespace="thanos",pod_name="minio-2"} 560093

Is there some process that causes cadvisor to lose information about running containers?

@Reamer
Author

Reamer commented Aug 8, 2019

Related Ticket in Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1646886

@Reamer
Author

Reamer commented Aug 8, 2019

@sjenning
I can confirm that the steps you described in google/cadvisor#2284 reproduce this issue.
I can also confirm that your PR fixes it; I just updated my test environment.
Hopefully the fix is merged into cadvisor and origin soon.
I can open a PR for origin if you're interested.

@sjenning
Contributor

sjenning commented Aug 8, 2019

@Reamer the process is somewhat complicated, unfortunately. Things have to be done in different repos, in different versions, and in a particular order. I'll move it along as quickly as the process allows 🤞 Thanks for the offer though!

@uselessidbr

uselessidbr commented May 18, 2020

Hello! How can I apply this fix on OKD 3.11? As I understand it, cadvisor is integrated into the kubelet. I searched for a package update but found none. I'm not sure, but it seems to me that this is handled by the origin-node container. Should I just update the origin-node container's image?

My cluster was deployed with the Ansible playbook.

# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://xyz:8443
openshift v3.11.0+7876dd5-361
kubernetes v1.11.0+d4cacc0

@sjenning Could you clarify this, please?
