ledmon[2374]: Unsupported AMD interface #65

Closed
minorsatellite opened this issue Apr 25, 2020 · 31 comments

Comments

@minorsatellite

minorsatellite commented Apr 25, 2020

New install of Ubuntu 20.04 LTS, clean upgrade from Ubuntu 18.04 LTS where ledmon was previously working.

Hardware: Dell R7425 (AMD Epyc Architecture)

In syslog I am seeing the following entries:
Apr 25 00:51:36 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:51:36 host ledmon[2374]: Unsupported AMD interface
Apr 25 00:51:46 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:51:46 host ledmon[2374]: Unsupported AMD interface
Apr 25 00:51:56 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:51:56 host ledmon[2374]: Unsupported AMD interface

admin@host:~$ sudo service ledmon status
● ledmon.service - Enclosure LED Utilities
     Loaded: loaded (/lib/systemd/system/ledmon.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2020-04-25 00:42:23 UTC; 17min ago
   Main PID: 2374 (ledmon)
      Tasks: 1 (limit: 115335)
     Memory: 2.0M
     CGroup: /system.slice/ledmon.service
             └─2374 /usr/sbin/ledmon --foreground

Apr 25 00:59:37 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:59:37 host ledmon[2374]: Unsupported AMD interface
Apr 25 00:59:47 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:59:47 host ledmon[2374]: Unsupported AMD interface
Apr 25 00:59:57 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 00:59:57 host ledmon[2374]: Unsupported AMD interface
Apr 25 01:00:07 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 01:00:07 host ledmon[2374]: Unsupported AMD interface
Apr 25 01:00:17 host ledmon[2374]: ledmon[2374]: Unsupported AMD interface
Apr 25 01:00:17 host ledmon[2374]: Unsupported AMD interface

@minorsatellite
Author

Attached is an strace capture:
ledmon_strace.log

@mdabrows
Contributor

Hello,

It looks like "PowerEdge R7425" is not recognized by the AMD code that checks the platform name (_get_amd_led_interface). @nfont, could you please take a look at this issue? I am afraid that with the current implementation this problem will return with every new OEM platform.

Regards,
Mariusz

@nfont
Contributor

nfont commented Apr 28, 2020

It appears this occurred when I added support for IPMI led control.

Before the update to support IPMI the check for EM enablement on AMD was the _amd_sgpio_em_enabled() routine. This routine validates enclosure management capabilities on AMD systems.

With the addition of IPMI support, a new amd_em_enabled() routine was added. It looks up the AMD platform, uses that to determine whether SGPIO or IPMI is to be used, and then calls the appropriate routine to verify that SGPIO or IPMI is enabled. It seems this approach was not entirely correct, as it breaks systems that do support SGPIO but do not have their platforms specifically checked for.

I think the fix for this is to check the platform name to see if it can use IPMI and then check to see if IPMI EM is enabled. This check is still needed since we need to know the channel and slave address for the platform to use IPMI. If the platform name is not listed as supporting IPMI, we should just call _amd_sgpio_em_enabled() to see if it's possible to use SGPIO.
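
The fallback described above can be sketched in C roughly as follows. This is only an illustration of the control flow: `choose_interface`, `platform_uses_ipmi`, the enum, and the placeholder platform string are hypothetical names, not ledmon's actual code, and a real fix would go on to verify that IPMI or SGPIO enclosure management is actually enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not ledmon's actual types or tables. */
enum led_interface { LED_INTF_SGPIO, LED_INTF_IPMI };

/* Placeholder list of platforms assumed to need IPMI. */
static const char *const ipmi_platforms[] = { "EXAMPLE-IPMI-PLATFORM" };

/* Return 1 if the platform name is on the IPMI list, 0 otherwise. */
int platform_uses_ipmi(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(ipmi_platforms) / sizeof(ipmi_platforms[0]); i++) {
        if (strcmp(name, ipmi_platforms[i]) == 0)
            return 1;
    }
    return 0;
}

/* Unknown platforms fall back to SGPIO instead of being rejected. */
enum led_interface choose_interface(const char *name)
{
    return platform_uses_ipmi(name) ? LED_INTF_IPMI : LED_INTF_SGPIO;
}
```

With this shape, a platform name that is not in the list, such as "PowerEdge R7425", would be routed to the SGPIO check rather than reported as unsupported.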

I can put together a patch for this and get it submitted.

@minorsatellite
Author

> I can put together a patch for this and get it submitted.

Thank you. When is this likely to get merged and then become available via Ubuntu update repos?

@nfont
Contributor

nfont commented May 1, 2020

I have opened a pull request with the fix for this issue.

#66

@minorsatellite
Author

> I have opened a pull request with the fix for this issue.
>
> #66

Has the pull request been approved? Has any progress been made?

@mtkaczyk
Contributor

mtkaczyk commented May 8, 2020

Hi,
The fix is currently under review.
Could you test it and confirm that it fixes the issue?

Thanks,
Mariusz

@minorsatellite
Author

I am experiencing some hardware-related issues, so as soon as I get my system back online I will give it a try, thanks.

Does it need to be compiled, or are pre-compiled binaries available?

@mtkaczyk
Contributor

It needs to be compiled.
Please follow the README.

@minorsatellite
Author

I have my system back online now. I am trying to compile but can't get past ./configure

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports the include directive... yes (GNU style)
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking dependency style of gcc... gcc3
checking for gcc option to accept ISO C99... none needed
checking whether C compiler accepts -Wformat -Werror=format-security... yes
checking whether C compiler accepts -Werror=format-overflow=2... yes
checking whether C compiler accepts -Werror=format-truncation=1... yes
checking whether C compiler accepts -Werror=shift-negative-value... yes
checking whether C compiler accepts -Werror=alloca... yes
checking whether C compiler accepts -Werror=missing-field-initializers... yes
checking whether C compiler accepts -Werror=format-signedness... yes
checking whether make supports nested variables... (cached) yes
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for shm_unlink in -lrt... yes
checking for sg_ll_send_diag in -lsgutils2... no
configure: error: libsgutils not found

Suggestions?

@bkucman
Contributor

bkucman commented May 20, 2020

Hi,

The following packages are required for building and compiling:

RHEL                     SLES                Debian/Ubuntu
pkg-config
automake                 automake            automake
autoconf                 autoconf            autoconf
sg3_utils-devel          libsgutils-devel    libsgutils2-dev (missing libsgutils)
systemd-devel            libudev-devel       libudev-dev (missing libudev)
pciutils-devel.x86_64    pciutils-devel      libpci-dev (missing libpci)

So, I think you will need to install the last 3.

Regards,
Blazej

@minorsatellite
Author

Thanks, I was looking for that list. Where can I find it?

@bkucman
Contributor

bkucman commented May 20, 2020

This table will be added to the README soon.

@minorsatellite
Author

Thanks, that is exactly the information I was looking for. Where was it hidden previously?

@minorsatellite
Author

Ran make install ... still broken

sudo ledctl
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

From /var/log/syslog
May 20 14:45:55 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:45:55 hostname ledmon[11536]: Unsupported AMD interface
May 20 14:46:00 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:46:00 hostname ledmon[11536]: Unsupported AMD interface
May 20 14:46:05 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:46:05 hostname ledmon[11536]: Unsupported AMD interface
May 20 14:46:10 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:46:10 hostname ledmon[11536]: Unsupported AMD interface
May 20 14:46:15 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:46:15 hostname ledmon[11536]: Unsupported AMD interface
May 20 14:46:20 hostname ledmon[11536]: ledmon[11536]: Unsupported AMD interface
May 20 14:46:20 hostname ledmon[11536]: Unsupported AMD interface

External storage not currently connected.

@mtkaczyk
Contributor

Please ensure that you don't have two copies of ledmon and ledctl on the system. The simplest way is to uninstall the ledmon package that came from the repository. Sometimes "make install" puts binaries in a different location than the package does.

You can also run ledctl directly without installing it. Look in the src folder after running "make"; your binaries will be there. Just run ./src/ledmon or ./src/ledctl

Mariusz

@minorsatellite
Author

minorsatellite commented May 21, 2020 via email

@mtkaczyk
Contributor

Looks like the issue is still there. @nfont, could you look into it again?

@nfont
Contributor

nfont commented May 21, 2020

@mtkaczyk, I think there is a bug in the option parsing for ledctl. The recent update to allow non-root users (ff49cce) adds a second call to getopt_long() without resetting the getopt internal variables. This results in not parsing any options in _cmdline_parse() in ledctl.c. The following update corrects this.

diff --git a/src/ledctl.c b/src/ledctl.c
index 774360e6812b..3d241f416bdb 100644
--- a/src/ledctl.c
+++ b/src/ledctl.c
@@ -572,6 +572,7 @@ static status_t _cmdline_parse(int argc, char *argv[])
        int opt, opt_index = -1;
        status_t status = STATUS_SUCCESS;
 
+       optind = 1;
        do {
                opt = getopt_long(argc, argv, shortopt, longopt, &opt_index);
                if (opt == -1)
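
To illustrate why the reset matters, here is a minimal standalone sketch, assuming glibc's getopt behavior. `count_options` and the option table are hypothetical and only mimic two parsing passes over the same argv; they are not ledctl's code.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option table for illustration; not ledctl's real options. */
static const struct option demo_longopt[] = {
    { "all",  no_argument, NULL, 'a' },
    { "list", no_argument, NULL, 'L' },
    { NULL, 0, NULL, 0 }
};

/* Count the recognized options seen in one getopt_long() pass over argv. */
int count_options(int argc, char *argv[], int reset)
{
    int opt, seen = 0;

    assert(argv != NULL);
    if (reset)
        optind = 1; /* restart scanning at argv[1] */
    while ((opt = getopt_long(argc, argv, "aL", demo_longopt, NULL)) != -1) {
        if (opt != '?')
            seen++;
    }
    return seen;
}
```

After one complete pass, optind is left pointing past the last argument, so a second pass that does not reset it returns -1 immediately and sees no options. That matches the symptom described above, where ledctl reported missing operands.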

@minorsatellite, A couple of questions since I cannot re-create the failure you're seeing.

Can you confirm that this same system configuration was working with the previous version of ledmon?

This is reporting that SGPIO EM is not enabled. On my systems I have to enable enclosure management in the BIOS. Can you verify that you have enclosure management enabled?

Can you run the following command and provide the output? This will help me track down what may be happening. You will likely need the patch above to correct option parsing in ledctl.

#> sudo ./ledctl --all -L

@minorsatellite
Author

This is a new install on new hardware, so I cannot confirm. This is my first sortie with Dell AMD architecture, so it's all quite new to me. In the Dell BIOS itself I see no option to enable EM; however, there is such an option under Device Settings for the PERC card. The Dell-branded, certified HBA card (which I care most about, as it will manage my external storage, which is not currently connected) unfortunately has no such option.

Requested output from command below:
sudo ./ledctl --all -L
[sudo] password for storage-admin:
/dev/shm/ledmon.conf: does not exist, using global config file
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

@nfont
Contributor

nfont commented May 21, 2020

> Requested output from command below:
> sudo ./ledctl --all -L
> [sudo] password for storage-admin:
> /dev/shm/ledmon.conf: does not exist, using global config file
> ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
> ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.

This output indicates that enclosure management is not supported for this device. This could be because it is not enabled in the BIOS on your system. The enablement check reads the /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2/em_message_supported file and expects it to specify 'sgpio'.
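
A minimal sketch of this kind of check, assuming the attribute is a short text file and that a substring match on "sgpio" is sufficient. `em_supports_sgpio` is a hypothetical helper for illustration, not ledmon's actual routine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the file at path mentions "sgpio", 0 otherwise or on error. */
int em_supports_sgpio(const char *path)
{
    char buf[256];
    size_t n;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return 0; /* missing or unreadable attribute: treat as unsupported */
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0'; /* terminate so strstr() can scan the contents */
    return strstr(buf, "sgpio") != NULL;
}
```

Calling it with the em_message_supported path from the output above would return 0 on a system where the file is absent or does not list sgpio.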

@minorsatellite
Author

Any idea where in the Dell BIOS I would enable it? Is there a global setting outside of the adapter cards themselves?

@nfont
Contributor

nfont commented May 21, 2020

No, I don't know where it is in the Dell BIOS. I would hope that Dell has published a manual or spec that includes how to enable SGPIO enclosure management for blinking LEDs.

@mtkaczyk
Contributor

Hi @nfont,
you're right, there is a bug.
Could you create a pull request with the change for the optind reset?

Thanks,
Mariusz

@nfont
Contributor

nfont commented May 22, 2020

Hi @mtkaczyk,

Pull request sent: #69

@minorsatellite
Author

I am checking with Dell on whether or not the SFF-8485 standard is supported on their Dell/LSI branded card. The card appears to be a rebranded 9300-8e flashed with Dell firmware. According to the Broadcom Product Brief page linked below, SFF-8485 is not supported in the 9300-8e. Does that sound right? How is that even possible?

https://docs.broadcom.com/doc/12352000

@minorsatellite
Author

minorsatellite commented May 23, 2020

I am trying to confirm whether or not this one particular HBA is the culprit, or if it is some other PCI device. I am a little confused because my original suspicion was the Dell/LSI controller, but looking at the output from the ledctl command, and searching for the device using standard Linux commands, that theory does not hold up:

sudo ./ledctl --all -L
/dev/shm/ledmon.conf: does not exist, using global config file
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

Below are the locations reported by the lshw and lspci commands, respectively, for the qualified Dell 12Gbps SAS HBA:

pci@0000:41:00.0  scsi1      storage        SAS3008 PCI-Express Fusion-MPT SAS-3
0000:41:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

Grepping through the output of lspci and lshw, I find this instead:

> 0000:05:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
> pci@0000:05:00.2 storage FCH SATA Controller [AHCI mode]

The question is, how likely is it that an onboard SATA controller would cause ledctl to crap out?

@minorsatellite
Author

Any further input on this issue?

@mtkaczyk
Contributor

Hi,
I have merged the fix for longopt parsing. It should eliminate "ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR)."

I'm not able to help you with the AMD hardware issues due to lack of hardware. I believe that @nfont is the AMD interface maintainer.
Pull #66 is still waiting for your verification.

@minorsatellite
Author

I'll post the response above to the PR you referenced. Thanks.

@mtkaczyk
Contributor

mtkaczyk commented Jul 2, 2020

#66 has been verified.
Closing.


5 participants