Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect CPU topology on single NUMA and multi socket platform. #2799

Conversation

Creatone
Copy link
Collaborator

@Creatone Creatone commented Feb 5, 2021

Fixes #2798

Signed-off-by: Paweł Szulik pawel.szulik@intel.com

@k8s-ci-robot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@google-cla google-cla bot added the cla: yes label Feb 5, 2021
@Creatone
Copy link
Collaborator Author

Creatone commented Feb 5, 2021

/ok-to-test

@kashifest
Copy link

It seems physical package id is missing,
[sysinfo.go:406] Cannot read physical package id for /fakeSysfs/devices/system/node/node0/cpu1, physical_package_id file does not exist, err: file does not exist
are we calling GetCPUPhysicalPackageID at the correct location ?

Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
@Creatone Creatone force-pushed the creatone/single-numa-multi-socket-topology-fix branch from 2976b22 to 76514b8 Compare February 9, 2021 10:58
@Creatone
Copy link
Collaborator Author

Creatone commented Feb 9, 2021

It seems physical package id is missing,
[sysinfo.go:406] Cannot read physical package id for /fakeSysfs/devices/system/node/node0/cpu1, physical_package_id file does not exist, err: file does not exist
are we calling GetCPUPhysicalPackageID at the correct location ?

This PR was draft at this time. Now it's ready for review.

@Creatone Creatone marked this pull request as ready for review February 9, 2021 11:02
Copy link
Collaborator

@iwankgb iwankgb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@iwankgb iwankgb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to un-LGTM it: it may cause yet another set of issues with kubelet (see comments in the linked issue).

@Creatone
Copy link
Collaborator Author

I have to un-LGTM it: it may cause yet another set of issues with kubelet (see comments in the linked issue).

Unfortunately, issues are already there. This PR fixes only one issue, the wrong threads to CPU matching.

@iwankgb
Copy link
Collaborator

iwankgb commented Feb 11, 2021

I know but CPU manager must not be affected - keep in mind that it is enabled for all guaranteed pods by default.

@Insullone
Copy link

@Creatone We tested the fix and it can detect cpu topologies correctly. By the way, could you backport the fix to 0.38.x as well?

@Insullone
Copy link

Just clarify: when I said "We tested the fix and it can detect cpu topologies correctly", I meant it fixed the specific issue "wrong threads to CPU matching".

Copy link
Collaborator

@bobbypage bobbypage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Creatone for the fix. Can you please explain what the issue was / how this fixes it, I'm not super clear? Is the issue due to missing check of core.SocketID == physicalPackageID?

@Creatone
Copy link
Collaborator Author

@bobbypage Imagine you have a platform with one NUMA node and 4 sockets with one CPU in. In this case, every CPU has core_id = 0, so cAdvisor was putting them in the same core in topology.
Ex.

"topology": [
   {
     "node_id": 0,
     "memory": 50628362240,
     "hugepages": [
       {
         "page_size": 1048576,
         "num_pages": 8
       }
     ],
     "cores": [
       {
         "core_id": 0,
         "thread_ids": [
           0,
           1,
           2,
           3,
         ],
         "caches": null,
         "socket_id": 3
       }
     ],
     "caches": null
   }
 ]

And this is wrong. We need to differentiate these CPUs. It's needed to consist of psychical_package_id in our topology discovery. This parameter describes on which socket is CPU.
This is the output that we need:

"topology": [
  {
    "node_id": 0,
    "memory": 50628362240,
    "hugepages": [
      {
        "page_size": 1048576,
        "num_pages": 8
      }
    ],
    "cores": [
      {
        "core_id": 0,
        "thread_ids": [
          0
        ],
        "caches": null,
        "socket_id": 0
      },
      {
        "core_id": 1,
        "thread_ids": [
          1
        ],
        "caches": null,
        "socket_id": 1
      },
      {
        "core_id": 2,
        "thread_ids": [
          2
        ],
        "caches": null,
        "socket_id": 2
      },
      {
        "core_id": 3,
        "thread_ids": [
          3
        ],
        "caches": null,
        "socket_id": 3
      },
    ],
    "caches": null
  }
]

@hanamantagoudvk
Copy link

@Creatone @bobbypage : Is the review ok ? If review is done, it needs to be merged.

@bobbypage
Copy link
Collaborator

got it, thanks @Creatone for the fix. LGTM.

@bobbypage bobbypage merged commit 45a67e1 into google:master Feb 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect CPU topology on Single NUMA and Multi socket system leads to performance degradation for POD
7 participants