Issue fix: get wrong machine info on Arm64 guest #2456
Conversation
Hi @lubinszARM. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @dashpole
Hi @dashpole
@lubinszARM: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@lubinszARM I'd rather see the full patch ... not sure why we need to merge this change quickly.
/ok-to-test |
Hi, because our conformance tests are deployed in Arm64 VMs, and the guest kernel (official Ubuntu 18.04) doesn't show the cache information.
Force-pushed from 0256340 to 4fcdf9a (compare)
Why have you decided to ignore errors on all platforms rather than on ARM only?
// On some Linux platforms (such as the Arm64 guest kernel), cache info may not exist.
// So, we should ignore the error here.
_ = addCacheInfo(sysFs, &node)
Can we klog.Warningf here at least? It will log frequently on Arm, but this is just meant to fix the conformance test temporarily.
Not all ARM platforms, just some ARM VMs, such as the official Ubuntu 18.04 virtual machine.
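The change under discussion can be sketched as follows. This is a minimal, self-contained sketch and not the actual cAdvisor code: the node struct, the addCacheInfo body, and the error text are stand-ins, and the stdlib log.Printf stands in for klog.Warningf.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// node is a trimmed stand-in for cAdvisor's node type; only the field
// relevant to this sketch is included.
type node struct {
	Caches []string
}

// addCacheInfo is a hypothetical stand-in: on an Arm64 guest kernel the
// sysfs cache entries are missing, so it always fails here.
func addCacheInfo(n *node) error {
	return errors.New("open /sys/devices/system/cpu/cpu0/cache: no such file or directory")
}

// collectNode keeps going when cache info is unavailable, logging a warning
// instead of aborting (the real patch uses klog.Warningf).
func collectNode() node {
	var n node
	if err := addCacheInfo(&n); err != nil {
		log.Printf("Unable to add cache information: %v", err)
	}
	return n
}

func main() {
	n := collectNode()
	fmt.Println(len(n.Caches)) // prints 0
}
```

The point is that a node without cache data is still returned, so machine-info collection keeps going instead of failing outright.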
Hi @dashpole
I believe this was fixed by #2471 |
Force-pushed from 3eea895 to bdf54ad (compare)
Hi @dashpole, the issue is not gone. In #2471, if no "/sys/devices/system/node/" is found on Linux, it calls getCPUTopology(). Please see this machine info as a reference (note num_cores is 0 and topology is null):

{"num_cores":0,"num_physical_cores":2,"num_sockets":1,"cpu_frequency_khz":0,"memory_capacity":8329531392,"memory_by_type":{},"nvm":{"memory_mode_capacity":0,"app direct_mode_capacity":0,"avg_power_budget":0},"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"machine_id":"af15b5a662c84009a24a3a245edf00a0","system_uuid":"62f5f880-9aa0-4f6a-b468-7287cc9fd704","boot_id":"174f24ab-23a6-4445-8b88-b1cd812c27bd","filesystems":[{"device":"/run/user/1000","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"tmpfs","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run","capacity":832954368,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/dev/mapper/ubuntu--01--vg-root","capacity":45747568640,"type":"vfs","inodes":2853424,"has_inodes":true},{"device":"/dev/shm","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/lock","capacity":5242880,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/sys/fs/cgroup","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true}],"disk_map":{"253:0":{"name":"dm-0","major":253,"minor":0,"size":46749712384,"scheduler":"none"},"253:1":{"name":"dm-1","major":253,"minor":1,"size":1027604480,"scheduler":"none"},"8:0":{"name":"sda","major":8,"minor":0,"size":48318382080,"scheduler":"mq-deadline"}},"network_devices":[{"name":"enp1s0","mac_address":"52:54:00:e1:49:c6","speed":-1,"mtu":1500}],"topology":null,"cloud_provider":"Unknown","instance_type":"Unknown","instance_id":"None"}

After applying my patch, the machineInfo is normal:

{"num_cores":2,"num_physical_cores":2,"num_sockets":1,"cpu_frequency_khz":0,"memory_capacity":8329531392,"memory_by_type":{},"nvm":{"memory_mode_capacity":0,"app direct_mode_capacity":0,"avg_power_budget":0},"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"machine_id":"af15b5a662c84009a24a3a245edf00a0","system_uuid":"62f5f880-9aa0-4f6a-b468-7287cc9fd704","boot_id":"174f24ab-23a6-4445-8b88-b1cd812c27bd","filesystems":[{"device":"/dev/shm","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/lock","capacity":5242880,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/sys/fs/cgroup","capacity":4164763648,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run/user/1000","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"tmpfs","capacity":832950272,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/run","capacity":832954368,"type":"vfs","inodes":1016788,"has_inodes":true},{"device":"/dev/mapper/ubuntu--01--vg-root","capacity":45747568640,"type":"vfs","inodes":2853424,"has_inodes":true}],"disk_map":{"253:0":{"name":"dm-0","major":253,"minor":0,"size":46749712384,"scheduler":"none"},"253:1":{"name":"dm-1","major":253,"minor":1,"size":1027604480,"scheduler":"none"},"8:0":{"name":"sda","major":8,"minor":0,"size":48318382080,"scheduler":"mq-deadline"}},"network_devices":[{"name":"enp1s0","mac_address":"52:54:00:e1:49:c6","speed":-1,"mtu":1500}],"topology":[{"node_id":0,"memory":8329531392,"hugepages":[{"page_size":1048576,"num_pages":0},{"page_size":2048,"num_pages":0},{"page_size":32768,"num_pages":0},{"page_size":64,"num_pages":0}],"cores":[{"core_id":0,"thread_ids":[0],"caches":null},{"core_id":1,"thread_ids":[1],"caches":null}],"caches":null}],"cloud_provider":"Unknown","instance_type":"Unknown","instance_id":"None"}
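The field that matters in the two machine-info dumps above is num_cores (0 before the patch, 2 after). A minimal sketch of reading it back out of such a dump; machineInfo here is a trimmed, hypothetical struct, not cAdvisor's full MachineInfo type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machineInfo keeps only the fields from the dumps above that are
// relevant to this bug.
type machineInfo struct {
	NumCores         int `json:"num_cores"`
	NumPhysicalCores int `json:"num_physical_cores"`
}

// coreCount extracts num_cores from a machine-info JSON dump,
// returning -1 on malformed input.
func coreCount(dump []byte) int {
	var mi machineInfo
	if err := json.Unmarshal(dump, &mi); err != nil {
		return -1
	}
	return mi.NumCores
}

func main() {
	before := []byte(`{"num_cores":0,"num_physical_cores":2}`) // as in the first dump
	after := []byte(`{"num_cores":2,"num_physical_cores":2}`)  // as in the second dump
	fmt.Println(coreCount(before), coreCount(after))           // prints "0 2"
}
```

A num_cores of 0 in the first dump is exactly the value that made the Kubernetes deployment fail.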
I also added a test case, TestGetNodesInfoWithoutCacheInfo, for this issue.
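A test along those lines could be sketched as below. This is a hypothetical illustration, not the actual TestGetNodesInfoWithoutCacheInfo: fakeSysFs and getNodesInfo are simplified stand-ins for cAdvisor's sysfs fakes and node collection.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSysFs is a hypothetical stand-in for a sysfs fake used in tests:
// it reports CPUs but can simulate missing cache information.
type fakeSysFs struct {
	cpus       []int
	cacheError error
}

// getNodesInfo mimics the patched behavior: count cores even when cache
// information is unavailable, instead of returning the error.
func getNodesInfo(fs fakeSysFs) (numCores int, err error) {
	for range fs.cpus {
		numCores++
	}
	if fs.cacheError != nil {
		// patched path: warn and continue rather than fail
		fmt.Println("warning: no cache information")
	}
	return numCores, nil
}

func main() {
	// Simulate an Arm64 guest: two CPUs, no cache entries in sysfs.
	fs := fakeSysFs{cpus: []int{0, 1}, cacheError: errors.New("no cache info")}
	cores, err := getNodesInfo(fs)
	fmt.Println(cores, err == nil) // prints "2 true"
}
```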
Force-pushed from d81b408 to 63b5186 (compare)
The latest Kubernetes deployment on Arm64 VMs always fails, because k8s always gets num_cores=0 from cAdvisor on Arm64 VMs. The reason is that there is no cache info on Arm64 VMs. The good news is that we can get cache info on Arm64 hosts. Once this patch is merged, I will deliver a patch to update the cAdvisor version in Kubernetes as soon as possible. Signed-off-by: bblu <bin.lu@arm.com>
dashpole left a comment:

lgtm. Thanks for the unit test!
cc @katarzyna-z
Hi @dims
@lubinszARM go for it
Hi @dashpole, thanks.
@lubinszARM there are a few more things that need to get in. We typically do this just before Kubernetes code freeze, to do a final sync.
  err = addCacheInfo(sysFs, &node)
  if err != nil {
-         return nil, 0, err
+         klog.Warningf("Found node without cache information, nodeDir: %s", nodeDir)
Will this appear in the syslog / journal by default? If so, that may be a concern and I would encourage using a lower log level, or a flag to suppress it.
I came here via an issue on the k3s repo after having a log fill up with this message. Turns out, this code gets hit a fair bit.
@alexellis
Unfortunately, this is still in the syslog.
I will submit a follow-up PR to solve your troubles.
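One way to address the log-noise concern raised above is to gate the message behind a verbosity level, as klog's V-levels do, so it stays out of the syslog at the default setting. A sketch with a hypothetical verbosity variable standing in for klog's -v flag (the boolean return exists only so the gating is observable):

```go
package main

import (
	"fmt"
	"log"
)

// verbosity is a hypothetical stand-in for klog's -v flag; 0 is the default.
var verbosity = 0

// vWarnf logs only when the configured verbosity is at least level,
// mirroring klog.V(level) gating. It reports whether it logged.
func vWarnf(level int, format string, args ...interface{}) bool {
	if verbosity < level {
		return false
	}
	log.Printf(format, args...)
	return true
}

func main() {
	// At the default verbosity this frequent message is suppressed.
	logged := vWarnf(2, "Found node without cache information, nodeDir: %s",
		"/sys/devices/system/node/node0")
	fmt.Println(logged) // prints "false"
}
```

Raising the verbosity (e.g. running with -v=2 in klog terms) would let the message through for debugging without flooding default logs.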
The latest Kubernetes deployment on Arm64 VMs (official Ubuntu 18.04) always fails,
because k8s always gets num_cores=0 from cAdvisor on Arm64 VMs.
The reason is that there is no cache info on Arm64 VMs,
so the getMachineInfo process was blocked at the getCacheInfo step
on the Arm64 guest.
The good news is that we can get cache info on Arm64 hosts.
Once this patch is merged, I will deliver a patch to update the
cAdvisor version in Kubernetes as soon as possible.
Signed-off-by: bblu <bin.lu@arm.com>