
Fix CPUInfo logging #15661

Closed
wants to merge 4 commits

Conversation

tianleiwu
Contributor

@tianleiwu commented Apr 24, 2023

Description

When CPUIDInfo is initializing, the default logger may or may not exist yet. Add a check for the default logger before logging.
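A minimal sketch of the kind of guard described above, assuming a HasDefaultLogger()-style check on LoggingManager; the helper name and message are illustrative, not the actual patch:

// Only use the default logger if one has been registered; otherwise skip logging
// instead of throwing from LoggingManager::DefaultLogger().
#include "core/common/logging/logging.h"

namespace onnxruntime {
void LogCpuInfoInitFailureIfPossible() {
  if (logging::LoggingManager::HasDefaultLogger()) {
    LOGS_DEFAULT(WARNING) << "Failed to initialize the cpuinfo library; "
                             "CPU feature detection may be incomplete.";
  }
  // No default logger yet (e.g. during static initialization): do nothing.
}
}  // namespace onnxruntime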

Motivation and Context

#15650 and #10038

@tianleiwu marked this pull request as draft April 24, 2023 22:19
@@ -132,7 +133,13 @@ void CPUIDInfo::ArmLinuxInit() {
#ifdef CPUINFO_SUPPORTED
pytorch_cpuinfo_init_ = cpuinfo_initialize();
Member


You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.
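For illustration, a rough sketch of what that proposal could look like; the Initialize() signature and call site below are hypothetical, not existing onnxruntime API:

// CPUIDInfo exposes an explicit Initialize() that reports failure through a Status
// instead of logging, and Environment::Initialize() calls it after logging is set up.
class CPUIDInfo {
 public:
  static CPUIDInfo& GetCPUIDInfo();
  onnxruntime::common::Status Initialize();  // full error text travels in the Status
};

// Inside Environment::Initialize(), once the logging manager exists:
//   ORT_RETURN_IF_ERROR(CPUIDInfo::GetCPUIDInfo().Initialize());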

Contributor


You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.

That change may cause trouble. MLAS::Platform depends on CPUIDInfo. Currently CPUIDInfo must be initialized before the Platform static initializer is called.

The proposed change, if done properly, would be a very big endeavor. First we need to add an MLAS init interface to be called by Environment::Initialize(). Second we need to change all other MLAS APIs to return some error status if called before MLAS init. And third, MLAS is not the only component in ORT that uses static initialization; we need to find all of them, figure out the dependency relationships among them, and break possible circular dependencies.

I am not saying don't do it. I am suggesting that such action needs planning and coordination between multiple teams.

Member

@snnn Apr 25, 2023


Since MLAS::Platform is initialized after CPUIDInfo, can we start to change that part? I don't see any blocker there. ORT usually does not use static initialization, and we have a tool to find out all dynamic initializers. If there is no dependency between these dynamic initializers, it would be fine. For example, if the constructor of CPUIDInfo doesn't use global logger, and the CPUINFO library also doesn't have any dynamic initializer, this code would work fine.

Contributor


How? MLAS::Platform needs CPUIDInfo to know which CPU it is running on, which in turn lets MLAS pick the best assembly kernels to achieve high performance.

@chenfucn previously approved these changes Apr 25, 2023
@pranavsharma
Contributor

CPUIDInfo is not a global object. Its construction is controlled by GetCPUIDInfo(). I'm curious why it would get constructed before the environment (and hence the logger) is constructed.

@tianleiwu
Contributor Author

tianleiwu commented Apr 25, 2023

CPUIDInfo is not a global object. Its construction is controlled by GetCPUIDInfo(). I'm curious why it would get constructed before the environment (and hence the logger) is constructed.

The root cause is unknown.

static CPUIDInfo cpuid_info;

Another possible fix is to change GetCPUIDInfo() to dynamically allocate a CPUIDInfo object and use a smart pointer to hold it.
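A sketch of that alternative, for illustration only (not the actual patch), next to the function-local static shown above:

#include <memory>

const CPUIDInfo& CPUIDInfo::GetCPUIDInfo() {
  // The unique_ptr itself is still a function-local static, but the CPUIDInfo
  // object now lives on the heap; `new` is used here because the constructor is
  // typically private to the class.
  static std::unique_ptr<CPUIDInfo> cpuid_info(new CPUIDInfo());
  return *cpuid_info;
}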

@snnn
Member

snnn commented Apr 25, 2023

Another possible fix is to change GetCPUIDInfo() to dynamically allocate a CPUIDInfo object and use a smart pointer to hold it.

Or you could pass a logger* pointer to the GetCPUIDInfo() function. Then the dependency relationship would be very clear and nobody would hit the same issue again. Such errors would be found at compile time instead of at runtime.
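A sketch of what that explicit dependency could look like; the signature below is hypothetical, not existing onnxruntime API:

// Callers that do not have a logger yet pass nullptr, and CPUIDInfo only logs
// when one was provided, so the dependency is visible at every call site.
const CPUIDInfo& GetCPUIDInfo(const onnxruntime::logging::Logger* logger);

// Inside the initialization path:
//   if (logger != nullptr) {
//     LOGS(*logger, WARNING) << "Failed to initialize the cpuinfo library.";
//   }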

@tianleiwu marked this pull request as ready for review April 26, 2023 00:24
@tianleiwu
Contributor Author

tianleiwu commented Apr 26, 2023

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.

The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

@snnn
Member

snnn commented Apr 26, 2023

But our ARM64 build disables CPUInfo?

@chenfucn
Contributor

chenfucn commented May 3, 2023

MLAS::Platform() depends on CPUIDInfo(). MLAS::Platform uses a global object.

During MLAS::Platform initialization, it asks CPUIDInfo for a set of CPU features that the current hardware platform provides. These CPU features are critical for picking the correct assembly kernels to achieve best performance.
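For illustration, the dependency being described looks roughly like the sketch below; the feature-query method name is an assumption, not a quote of the actual MLAS code:

// The MLAS platform initializer queries CPUIDInfo for hardware features and uses
// them to select the assembly kernels; GetCPUIDInfo() constructs CPUIDInfo on first use.
MLAS_PLATFORM::MLAS_PLATFORM() {
  const auto& cpuid = onnxruntime::CPUIDInfo::GetCPUIDInfo();
  if (cpuid.HasArmNeonDot()) {
    // point the dispatch tables at the dot-product (UDOT/SDOT) kernels
  } else {
    // fall back to the baseline NEON kernels
  }
}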

@chenfucn
Contributor

chenfucn commented May 3, 2023

But our ARM64 build disables CPUInfo?

No.

@chenfucn
Contributor

chenfucn commented May 3, 2023

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.

The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

Last year I tested this on Android phones and Raspberry Pi; on all of them cpuinfo initialized correctly, so no exception was thrown.

@jcampbell05

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.
The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

Last year I tested this on Android phones and Raspberry Pi; on all of them cpuinfo initialized correctly, so no exception was thrown.

It's specifically ARM64 AWS Lambda machines

@MengLinMaker

MengLinMaker commented Sep 17, 2023

It's specifically ARM64 AWS Lambda machines

@tianleiwu, this is how the failure happens:

  1. ARM64 AWS Lambda does not populate the "/sys/devices" folder, which pytorch/cpuinfo reads, so no CPU info is returned.
  2. Previously, onnxruntime would throw an exception. After PR Tolerate cpuinfo init failure #10199, the cpuinfo init failure is only logged as a warning.
  3. But the logger is not initialised yet, leading to an exception in logging::LoggingManager.

Your proposal solves the 3rd step.

If AWS provided the "/sys/devices" info, that would solve the issue.
Perhaps this problem is caused by the proprietary Graviton chip AWS uses.
It is unclear whether solving this logging issue would allow ARM64 AWS Lambdas to use onnxruntime.
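For reference, step 3 above corresponds to the check inside LoggingManager::DefaultLogger(); the sketch below paraphrases it from the error message quoted later in this thread, with the member name s_default_logger_ assumed for illustration:

const Logger& LoggingManager::DefaultLogger() {
  // Throws an OnnxRuntimeException when no logger has been registered yet, which
  // is the abort seen on ARM64 Lambda when CPUIDInfo logs during static initialization.
  ORT_ENFORCE(s_default_logger_ != nullptr,
              "Attempt to use DefaultLogger but none has been registered.");
  return *s_default_logger_;
}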

@mLupine

mLupine commented Dec 27, 2023

I'm experiencing the issue described above on an ARM64 AWS Lambda instance when trying to init onnxruntime. @tianleiwu / @chenfucn can we re-open this PR and get it merged?

Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what():  /onnxruntime_src/include/onnxruntime/core/common/logging/logging.h:294 static const onnxruntime::logging::Logger& onnxruntime::logging::LoggingManager::DefaultLogger() Attempt to use DefaultLogger but none has been registered.
INIT_REPORT Init Duration: 31541.97 ms	Phase: invoke	Status: error	Error Type: Runtime.ExitError
START RequestId: 85d9436c-f665-4658-8949-82ce60051ddd Version: $LATEST
RequestId: 85d9436c-f665-4658-8949-82ce60051ddd Error: Runtime exited with error: signal: aborted
Runtime.ExitError

@MengLinMaker

MengLinMaker commented Dec 28, 2023

@mLupine Try running Onnxruntime on x86_64 Lambda instead - worked for me.

Unfortunately removing the error thrown by the logger will not solve this critical issue:

  • ARM64 Lambda does not provide this standard Linux file for CPU info: /sys/devices/system/cpu/possible,
  • Onnxruntime requires this CPU info to optimise calculations.

I've previously deployed PyTorch to ARM64 Lambda for inference, but that requires Docker and deployment from an ARM server/computer.

@mLupine

mLupine commented Dec 28, 2023

@MengLinMaker we're currently using x86 Lambdas with ONNX, and I'm working on migrating them all to ARM to simplify our architecture and reduce execution costs.

PyTorch 2.1.2 implemented a patch that allows it to run on ARM Lambdas despite /sys/devices/system/cpu/possible being unavailable, so I'm pretty sure it should also be possible to make onnxruntime work with just some minor adjustments. This PR is the first and possibly the most important of them, as the issue it fixes completely prevents the project from running, with no workaround available.

I do acknowledge that some optimizations might not be available in such a runtime environment, but that's something I'm totally okay with.

@mLupine

mLupine commented Jan 2, 2024

Small update: I've confirmed that fixing the logging issue makes onnxruntime work on ARM Lambdas with no further issues. This PR seems to be the only thing needed for the upstream version to work.

@jcampbell05

Small update: I've confirmed that fixing the logging issue makes onnxruntime work on ARM Lambdas with no further issues. This PR seems to be the only thing needed for the upstream version to work.

Great work! Happy to do any testing needed on our side.

@jcampbell05

jcampbell05 commented Jan 8, 2024

Perhaps this problem is caused by the proprietary Graviton chip AWS uses.

To add context to this: no, it's not to do with Graviton. It's just how the Lambda runtime works; it also doesn't mount /sys for Intel chips, but cpuinfo is able to use other methods to detect features there.
