-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CPUInfo logging #15661
Fix CPUInfo logging #15661
Conversation
@@ -132,7 +133,13 @@ void CPUIDInfo::ArmLinuxInit() { | |||
#ifdef CPUINFO_SUPPORTED | |||
pytorch_cpuinfo_init_ = cpuinfo_initialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.
That change may cause trouble. MLAS::Platform depends on CPUIDInfo. Currently CPUIDInfo must be initialized before the Platform static initializer is called.
The proposed change, if done properly, would be a very big endeavor. First we need to add an MLAS init interface to be called by Environment::Initialize(). Second we need to change all other MLAS APIs to return some error status if called before MLAS init. And third MLAS is not the only one using static initialization in ORT, we need to find all of them and figure out dependency relationships among them, break possible circular dependencies.
I am not saying don't do it. I am suggesting that such action needs planning and coordination between multiple teams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since MLAS::Platform is initialized after CPUIDInfo, can we start to change that part? I don't see any blocker there. ORT usually does not use static initialization, and we have a tool to find out all dynamic initializers. If there is no dependency between these dynamic initializers, it would be fine. For example, if the constructor of CPUIDInfo doesn't use global logger, and the CPUINFO library also doesn't have any dynamic initializer, this code would work fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How? platform needs CPUIDInfo to know which cpu it is running on, which in turn let MLAS pick the best assembly kernels to achieve high performance.
CPUIDInfo is not a global object. Its construction is controlled by GetCPUIDInfo(). I'm curious why will it get constructed before the environment (and hence the logger) is constructed? |
The root cause is unknown.
Another possible fix is to change GetCPUIDInfo() to dynamically allocate a CPUIDInfo object and use a smart pointer to hold it. |
Or you pass a logger* pointer to the GetCPUIDInfo() function. Then the dependency relation should be very clear and nobody would hit the same issue again. Such errors will be found at compile time instead of runtime. |
6b41a8a
to
d0af332
Compare
I would like to clarify that I did not reproduce the issue in x86/x64 machine, where logger is available when CPUInfo is initialized so this is not a design issue. The logging issue seems to exist in ARM64 machine only. I do not have access to an ARM64 machine to reproduce this so the root cause is unknown. But the fix shall avoid the ORT exception when logger is not available. |
But our ARM64 disables CPUInfo? |
MLAS::Platform() depends on CPUIDInfo(). MLAS::Platform uses a global object. During MLAS::Platform initialization, it asks CPUIDInfo for a set of CPU features that the current hardware platform provides. These CPU features are critical for picking the correct assembly kernels to achieve best performance. |
No. |
Last year I have tested this on andriod phones raspberri-pi, in all of them cpuinfo initialize correct so there is no exception thrown. |
It's specifically ARM64 AWS Lambda machines |
@tianleiwu, this is how the failure happens:
Your proposal solves the 3rd step. If AWS provided "/sys/devices" info, that would solve the issue. |
I'm experiencing the issue described above on an ARM64 AWS Lambda instance when trying to init onnxruntime. @tianleiwu / @chenfucn can we re-open this PR and get it merged?
|
@mLupine Try running Onnxruntime on x86_64 Lambda instead - worked for me. Unfortunately removing the error thrown by the logger will not solve this critical issue:
I've previously deployed Pytorch to ARM64 Lambda for inferencing. But this requires Docker and deployment from an ARM server/computer. |
@MengLinMaker we're currently using x86 Lambdas with ONNX, and I'm working on migrating them all to ARM to simplify our architecture and reduce execution costs. Pytorch 2.1.2 implemented a patch that allows it to run on ARM Lambdas despite I do acknowledge that some optimizations might not be available on such runtime environment, but that's something I'm totally okay with. |
Small update: I've confirmed that fixing the logging issue makes onnxruntime work on ARM Lambdas with no further issues. This PR seems to be the only thing needed for the upstream version to work. |
Great work! Happy to do any testing needed on our side. |
To add context to this, no it's not to do with Gravitron. It's just how the lambda runtime works, it also doesn't mount /sys for intel chips but cpuinfo is able to use other methods to detect features. |
Description
When CPU Info is initializing, the default logger might or might not exist yet. Add a check of default logger before logging.
Motivation and Context
#15650 and #10038