
Fix CPUInfo logging #15661

Closed
wants to merge 4 commits

Conversation

tianleiwu
Contributor

@tianleiwu commented Apr 24, 2023

Description

When CPUIDInfo is initializing, the default logger may or may not exist yet. Add a check for the default logger before logging.
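A minimal sketch of the kind of guard described above, assuming a HasDefaultLogger()-style check on LoggingManager; the helper name and message are illustrative, not the actual patch:

// Only use the default logger if one has been registered; otherwise skip logging
// instead of throwing from LoggingManager::DefaultLogger().
#include "core/common/logging/logging.h"

namespace onnxruntime {
void LogCpuInfoInitFailureIfPossible() {
  if (logging::LoggingManager::HasDefaultLogger()) {
    LOGS_DEFAULT(WARNING) << "Failed to initialize the cpuinfo library; "
                             "CPU feature detection may be incomplete.";
  }
  // No default logger yet (e.g. during static initialization): do nothing.
}
}  // namespace onnxruntime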

Motivation and Context

#15650 and #10038

@tianleiwu marked this pull request as draft April 24, 2023 22:19
@@ -132,7 +133,13 @@ void CPUIDInfo::ArmLinuxInit() {
#ifdef CPUINFO_SUPPORTED
pytorch_cpuinfo_init_ = cpuinfo_initialize();
Member


You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.
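For illustration, a rough sketch of what that proposal could look like; the Initialize() signature and call site below are hypothetical, not existing onnxruntime API:

// CPUIDInfo exposes an explicit Initialize() that reports failure through a Status
// instead of logging, and Environment::Initialize() calls it after logging is set up.
class CPUIDInfo {
 public:
  static CPUIDInfo& GetCPUIDInfo();
  onnxruntime::common::Status Initialize();  // full error text travels in the Status
};

// Inside Environment::Initialize(), once the logging manager exists:
//   ORT_RETURN_IF_ERROR(CPUIDInfo::GetCPUIDInfo().Initialize());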

Contributor


You may also let Environment::Initialize() initialize CPUIDInfo. Then we can have a more user-friendly exception handling, and also this class would not need to use logging since we can return the full error message by using an onnxruntime::Status. And even if it needs to, it would be fine because we can be sure logging is already initialized when Environment::Initialize() is called.

That change may cause trouble. MLAS::Platform depends on CPUIDInfo. Currently CPUIDInfo must be initialized before the Platform static initializer is called.

The proposed change, if done properly, would be a very big endeavor. First we need to add an MLAS init interface to be called by Environment::Initialize(). Second we need to change all other MLAS APIs to return some error status if called before MLAS init. And third, MLAS is not the only component in ORT that uses static initialization; we need to find all of them, figure out the dependency relationships among them, and break possible circular dependencies.

I am not saying don't do it. I am suggesting that such action needs planning and coordination between multiple teams.

Member

@snnn Apr 25, 2023


Since MLAS::Platform is initialized after CPUIDInfo, can we start to change that part? I don't see any blocker there. ORT usually does not use static initialization, and we have a tool to find out all dynamic initializers. If there is no dependency between these dynamic initializers, it would be fine. For example, if the constructor of CPUIDInfo doesn't use global logger, and the CPUINFO library also doesn't have any dynamic initializer, this code would work fine.

Contributor


How? MLAS::Platform needs CPUIDInfo to know which CPU it is running on, which in turn lets MLAS pick the best assembly kernels to achieve high performance.

@chenfucn previously approved these changes Apr 25, 2023
@pranavsharma
Contributor

CPUIDInfo is not a global object. Its construction is controlled by GetCPUIDInfo(). I'm curious why it would get constructed before the environment (and hence the logger) is constructed.

@tianleiwu
Contributor Author

tianleiwu commented Apr 25, 2023

CPUIDInfo is not a global object. Its construction is controlled by GetCPUIDInfo(). I'm curious why it would get constructed before the environment (and hence the logger) is constructed.

The root cause is unknown.

static CPUIDInfo cpuid_info;

Another possible fix is to change GetCPUIDInfo() to dynamically allocate a CPUIDInfo object and use a smart pointer to hold it.
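A sketch of that alternative, for illustration only (not the actual patch), next to the function-local static shown above:

#include <memory>

const CPUIDInfo& CPUIDInfo::GetCPUIDInfo() {
  // The unique_ptr itself is still a function-local static, but the CPUIDInfo
  // object now lives on the heap; `new` is used here because the constructor is
  // typically private to the class.
  static std::unique_ptr<CPUIDInfo> cpuid_info(new CPUIDInfo());
  return *cpuid_info;
}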

@snnn
Member

snnn commented Apr 25, 2023

Another possible fix is to change GetCPUIDInfo() to dynamically allocate a CPUIDInfo object and use a smart pointer to hold it.

Or you could pass a logger* pointer to the GetCPUIDInfo() function. Then the dependency relationship would be very clear and nobody would hit the same issue again. Such errors would be found at compile time instead of at runtime.
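A sketch of what that explicit dependency could look like; the signature below is hypothetical, not existing onnxruntime API:

// Callers that do not have a logger yet pass nullptr, and CPUIDInfo only logs
// when one was provided, so the dependency is visible at every call site.
const CPUIDInfo& GetCPUIDInfo(const onnxruntime::logging::Logger* logger);

// Inside the initialization path:
//   if (logger != nullptr) {
//     LOGS(*logger, WARNING) << "Failed to initialize the cpuinfo library.";
//   }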

@tianleiwu marked this pull request as ready for review April 26, 2023 00:24
@tianleiwu
Contributor Author

tianleiwu commented Apr 26, 2023

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.

The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

@snnn
Member

snnn commented Apr 26, 2023

But our ARM64 build disables CPUInfo?

@chenfucn
Contributor

chenfucn commented May 3, 2023

MLAS::Platform() depends on CPUIDInfo(). MLAS::Platform uses a global object.

During MLAS::Platform initialization, it asks CPUIDInfo for a set of CPU features that the current hardware platform provides. These CPU features are critical for picking the correct assembly kernels to achieve best performance.
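For illustration, the dependency being described looks roughly like the sketch below; the feature-query method name is an assumption, not a quote of the actual MLAS code:

// The MLAS platform initializer queries CPUIDInfo for hardware features and uses
// them to select the assembly kernels; GetCPUIDInfo() constructs CPUIDInfo on first use.
MLAS_PLATFORM::MLAS_PLATFORM() {
  const auto& cpuid = onnxruntime::CPUIDInfo::GetCPUIDInfo();
  if (cpuid.HasArmNeonDot()) {
    // point the dispatch tables at the dot-product (UDOT/SDOT) kernels
  } else {
    // fall back to the baseline NEON kernels
  }
}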

@chenfucn
Contributor

chenfucn commented May 3, 2023

But our ARM64 build disables CPUInfo?

No.

@chenfucn
Contributor

chenfucn commented May 3, 2023

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.

The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

Last year I tested this on Android phones and Raspberry Pi; on all of them cpuinfo initialized correctly, so no exception was thrown.

@jcampbell05

I would like to clarify that I did not reproduce the issue on an x86/x64 machine, where the logger is available when CPUInfo is initialized, so this is not a design issue.
The logging issue seems to exist on ARM64 machines only. I do not have access to an ARM64 machine to reproduce this, so the root cause is unknown. But the fix should avoid the ORT exception when the logger is not available.

Last year I tested this on Android phones and Raspberry Pi; on all of them cpuinfo initialized correctly, so no exception was thrown.

It's specifically ARM64 AWS Lambda machines

@MengLinMaker

MengLinMaker commented Sep 17, 2023

It's specifically ARM64 AWS Lambda machines

@tianleiwu, this is how the failure happens:

  1. ARM64 AWS Lambda does not populate the "/sys/devices" folder, which pytorch/cpuinfo reads, so no CPU info is returned.
  2. Previously, onnxruntime would throw an exception. After PR Tolerate cpuinfo init failure #10199, the cpuinfo init failure is only logged as a warning.
  3. But the logger is not initialised yet, leading to an exception in logging::LoggingManager.

Your proposal solves the 3rd step.

If AWS provided the "/sys/devices" info, that would solve the issue.
Perhaps this problem is caused by the proprietary Graviton chip AWS uses.
It is unclear whether solving this logging issue would allow ARM64 AWS Lambdas to use onnxruntime.
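For reference, step 3 above corresponds to the check inside LoggingManager::DefaultLogger(); the sketch below paraphrases it from the error message quoted later in this thread, with the member name s_default_logger_ assumed for illustration:

const Logger& LoggingManager::DefaultLogger() {
  // Throws an OnnxRuntimeException when no logger has been registered yet, which
  // is the abort seen on ARM64 Lambda when CPUIDInfo logs during static initialization.
  ORT_ENFORCE(s_default_logger_ != nullptr,
              "Attempt to use DefaultLogger but none has been registered.");
  return *s_default_logger_;
}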

@mLupine

mLupine commented Dec 27, 2023

I'm experiencing the issue described above on an ARM64 AWS Lambda instance when trying to init onnxruntime. @tianleiwu / @chenfucn can we re-open this PR and get it merged?

Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what():  /onnxruntime_src/include/onnxruntime/core/common/logging/logging.h:294 static const onnxruntime::logging::Logger& onnxruntime::logging::LoggingManager::DefaultLogger() Attempt to use DefaultLogger but none has been registered.
INIT_REPORT Init Duration: 31541.97 ms	Phase: invoke	Status: error	Error Type: Runtime.ExitError
START RequestId: 85d9436c-f665-4658-8949-82ce60051ddd Version: $LATEST
RequestId: 85d9436c-f665-4658-8949-82ce60051ddd Error: Runtime exited with error: signal: aborted
Runtime.ExitError

@MengLinMaker

MengLinMaker commented Dec 28, 2023

@mLupine Try running Onnxruntime on x86_64 Lambda instead - worked for me.

Unfortunately removing the error thrown by the logger will not solve this critical issue:

  • ARM64 Lambda does not provide this standard Linux file for CPU info: /sys/devices/system/cpu/possible,
  • Onnxruntime requires this CPU info to optimise calculations.

I've previously deployed PyTorch to ARM64 Lambda for inference, but that requires Docker and deployment from an ARM server/computer.

@mLupine

mLupine commented Dec 28, 2023

@MengLinMaker we're currently using x86 Lambdas with ONNX, and I'm working on migrating them all to ARM to simplify our architecture and reduce execution costs.

PyTorch 2.1.2 implemented a patch that allows it to run on ARM Lambdas despite /sys/devices/system/cpu/possible being unavailable, so I'm pretty sure it should also be possible to make onnxruntime work with just some minor adjustments. This PR is the first and possibly the most important of them, as the issue it fixes completely prevents the project from running, with no workaround available.

I do acknowledge that some optimizations might not be available in such a runtime environment, but that's something I'm totally okay with.

@mLupine

mLupine commented Jan 2, 2024

Small update: I've confirmed that fixing the logging issue makes onnxruntime work on ARM Lambdas with no further issues. This PR seems to be the only thing needed for the upstream version to work.

@jcampbell05

Small update: I've confirmed that fixing the logging issue makes onnxruntime work on ARM Lambdas with no further issues. This PR seems to be the only thing needed for the upstream version to work.

Great work! Happy to do any testing needed on our side.

@jcampbell05

jcampbell05 commented Jan 8, 2024

Perhaps this problem is caused by the proprietary Graviton chip AWS uses.

To add context to this: no, it's not to do with Graviton. It's just how the Lambda runtime works; it also doesn't mount /sys for Intel chips, but cpuinfo is able to use other methods to detect features there.
