Skip to content

Issue fix: get wrong machine info on Arm64 guest#2456

Merged
dashpole merged 1 commit intogoogle:masterfrom
lubinszARM:pr_cacheinfo
Apr 9, 2020
Merged

Issue fix: get wrong machine info on Arm64 guest#2456
dashpole merged 1 commit intogoogle:masterfrom
lubinszARM:pr_cacheinfo

Conversation

@lubinszARM
Copy link
Copy Markdown
Contributor

@lubinszARM lubinszARM commented Apr 1, 2020

The latest kubernetes deployment on Arm64 VM-s(official ubuntu18.04) always fails.
Because k8s always get num_cores=0 from cAdvisor on Arm64 VM-s.

The reason is that, there is no cache info on Arm64 VM-s.
So the process of getMachineInfo was blocked by the step of getCacheInfo
on Arm64 guest.

And the good news is, we can get cache info on Arm64 hosts.

When this patch was merged, I will deliver a patch to update the version
of cAdvisor in kubernetes as soon as possible.

Signed-off-by: bblu bin.lu@arm.com

@k8s-ci-robot
Copy link
Copy Markdown
Collaborator

Hi @lubinszARM. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lubinszARM
Copy link
Copy Markdown
Contributor Author

/assign @dashpole

@lubinszARM
Copy link
Copy Markdown
Contributor Author

Hi @dashpole
Can you help to review it?
Thanks.

@k8s-ci-robot
Copy link
Copy Markdown
Collaborator

@lubinszARM: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Copy link
Copy Markdown
Collaborator

dims commented Apr 1, 2020

@lubinszARM i'd rather see the full patch ... not sure why we need to merge this change quickly

@dashpole
Copy link
Copy Markdown
Collaborator

dashpole commented Apr 1, 2020

/ok-to-test

@lubinszARM
Copy link
Copy Markdown
Contributor Author

lubinszARM commented Apr 2, 2020

i'd rather see the full patch ... not sure why we need to merge this change quickly

Hi,
Currently, our conformance tests were blocked due to this issue.
https://k8s-testgrid.appspot.com/conformance-arm#Periodic%20Arm64%20conformance%20test

Because our conformance tests were deployed in Arm64 VM-s, and the guest kernel (official ubuntu18.04) doesn't show the cache information.

@lubinszARM lubinszARM force-pushed the pr_cacheinfo branch 2 times, most recently from 0256340 to 4fcdf9a Compare April 2, 2020 05:01
@iwankgb
Copy link
Copy Markdown
Collaborator

iwankgb commented Apr 2, 2020

Why have you decided to ignore errors on all the platforms rather then on ARM only?

Comment thread utils/sysinfo/sysinfo.go Outdated

// On some Linux platforms(such as Arm64 guest kernel), cache info may not exist.
// So, we should ignore error here.
_ = addCacheInfo(sysFs, &node)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we klog.Warningf here at least? It will log frequently on Arm, but this is just meant to fix the conformance test temporarily.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lubinszARM
Copy link
Copy Markdown
Contributor Author

lubinszARM commented Apr 3, 2020

Why have you decided to ignore errors on all the platforms rather then on ARM only?

Not all ARM platforms, just only some ARM VM-s, such as official ubuntu-18.04 virtual machine.
cAdvisor can work fine on Arm host and some other ARM VM-s, such as Amazon EC2 A1.
Our conformance test on Amazon EC2 A1 was not blocked, please see:
https://k8s-testgrid.appspot.com/conformance-arm#Periodic%20Arm64%20conformance%20test%20on%20AWS

@lubinszARM
Copy link
Copy Markdown
Contributor Author

Hi @dashpole
Any idea?

@dashpole
Copy link
Copy Markdown
Collaborator

dashpole commented Apr 8, 2020

I believe this was fixed by #2471

@lubinszARM lubinszARM force-pushed the pr_cacheinfo branch 2 times, most recently from 3eea895 to bdf54ad Compare April 9, 2020 04:42
@lubinszARM
Copy link
Copy Markdown
Contributor Author

lubinszARM commented Apr 9, 2020

I believe this was fixed by #2471

Hi @dashpole ,
I have compiled the code with your patch.
Please see the log:
root@ubuntu-03:/go/src/github.com/google/cadvisor# git log | grep kk-fix-cpus-without-nodes
Merge pull request #2471 from katarzyna-z/kk-fix-cpus-without-nodes

The issue is not gone.

In #2471, if no "/sys/devices/system/node/" was found in Linux, it will call getCPUTopology().
But in getCPUTopology(), it also needs to get cache info. And it will have the same issue.
So I also skip the cache-missing-issue in getCPUTopology.

Please see the machine info as reference:

{"num_cores":0,"num_physical_cores":2,"num_sockets":1,"cpu_frequency_khz":0,"memory_capacity":8329531392,"memory_by_type":{},"nvm":{"memory_mode_capacity":0,"app direct_mode_capacity":0,"avg_power_budget":0},"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"machine_id":"af15b5a662c84009a24a3a245edf00a0","system_uuid":"62f5f880-9aa0-4f6a-b468-7287cc9fd704","boot_id":"174f24ab-23a6-4445-8b88-b1cd812c27bd","filesystems":[{"device":"/run/user/1000","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"tmpfs","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run","capacity":832954368,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/dev/mapper/ubuntu--01--vg-root","capacity":45747568640,"type":"vfs","inodes":2853424,"has_inodes":true},{"device":"/dev/shm","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/lock","capacity":5242880,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/sys/fs/cgroup","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true}],"disk_map":{"253:0":{"name":"dm-0","major":253,"minor":0,"size":46749712384,"scheduler":"none"},"253:1":{"name":"dm-1","major":253,"minor":1,"size":1027604480,"scheduler":"none"},"8:0":{"name":"sda","major":8,"minor":0,"size":48318382080,"scheduler":"mq-deadline"}},"network_devices":[{"name":"enp1s0","mac_address":"52:54:00:e1:49:c6","speed":-1,"mtu":1500}],"topology":null,"cloud_provider":"Unknown","instance_type":"Unknown","instance_id":"None"}

After applying with my patch, the machinInfo will be normal:

{"num_cores":2,"num_physical_cores":2,"num_sockets":1,"cpu_frequency_khz":0,"memory_capacity":8329531392,"memory_by_type":{},"nvm":{"memory_mode_capacity":0,"app direct_mode_capacity":0,"avg_power_budget":0},"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"machine_id":"af15b5a662c84009a24a3a245edf00a0","system_uuid":"62f5f880-9aa0-4f6a-b468-7287cc9fd704","boot_id":"174f24ab-23a6-4445-8b88-b1cd812c27bd","filesystems":[{"device":"/dev/shm","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/lock","capacity":5242880,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/sys/fs/cgroup","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/user/1000","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"tmpfs","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run","capacity":832954368,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/dev/mapper/ubuntu--01--vg-root","capacity":45747568640,"type":"vfs","inodes":2853424,"has_inodes":true}],"disk_map":{"253:0":{"name":"dm-0","major":253,"minor":0,"size":46749712384,"scheduler":"none"},"253:1":{"name":"dm-1","major":253,"minor":1,"size":1027604480,"scheduler":"none"},"8:0":{"name":"sda","major":8,"minor":0,"size":48318382080,"scheduler":"mq-deadline"}},"network_devices":[{"name":"enp1s0","mac_address":"52:54:00:e1:49:c6","speed":-1,"mtu":1500}],"topology":[{"node_id":0,"memory":8329531392,"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"cores":[{"core_id":0,"thread_ids":[0],"caches":null},{"core_id":1,"thread_ids":[1],"caches":null}],"caches":null}],"cloud_provider":"Unknown","instance_type":"Unknown","instance_id":"None"}

@lubinszARM
Copy link
Copy Markdown
Contributor Author

I also added a test case of 'TestGetNodesInfoWithoutCacheInfo' for this issue.

@lubinszARM lubinszARM force-pushed the pr_cacheinfo branch 2 times, most recently from d81b408 to 63b5186 Compare April 9, 2020 06:00
The latest kubernetes deployment on Arm64 VM-s always fails.
Because k8s always get num_cores=0 from cAdvisor on Arm64 VM-s.

The reason is that, there is no cache info on Arm64 VM-s.
And the good news is that, we can get cache info on Arm64 hosts.

When this patch was merged, I will deliver a patch to update the version
of cAdvisor in kubernetes as soon as possible.

Signed-off-by: bblu <bin.lu@arm.com>
Copy link
Copy Markdown
Collaborator

@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks for the unit test!

@dashpole
Copy link
Copy Markdown
Collaborator

dashpole commented Apr 9, 2020

cc @katarzyna-z
I'm going to go ahead with this mitigation for now to unblock the kubernetes conformance tests. This includes a unit test to verify we don't break them again.

@dashpole dashpole merged commit e35a0d5 into google:master Apr 9, 2020
@lubinszARM
Copy link
Copy Markdown
Contributor Author

Hi @dims
Can we update the version of cAdvisor in k8s side?
Thanks.

@dims
Copy link
Copy Markdown
Collaborator

dims commented Apr 10, 2020

@lubinszARM go for it

@lubinszARM
Copy link
Copy Markdown
Contributor Author

Hi @dashpole
Can you add a new tag for cadvisor, such as v0.36.1?
So that, I can use this new tag in kubernetes.

Thanks.

@dims
Copy link
Copy Markdown
Collaborator

dims commented Apr 13, 2020

@lubinszARM there's a few more things that need to get in. we typically do this just before code freeze of kubernetes to do a final sync.

Comment thread utils/sysinfo/sysinfo.go
err = addCacheInfo(sysFs, &node)
if err != nil {
return nil, 0, err
klog.Warningf("Found node without cache information, nodeDir: %s", nodeDir)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this appear in the syslog / journal by default? If so, that may be a concern and I would encourage using a lower log level, or a flag to suppress it.

I came here via an issue on the k3s repo after having a log fill up with this message. Turns out, this code gets hit a fair bit.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexellis
Unfortunately, this is still in the syslog.
I will resubmit a PR to solve your troubles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants