Rasdaemon wrong mapping label #67

Open
garadar opened this issue Jun 22, 2022 · 0 comments

Hi all,

I have an issue with the DIMM label mapping:

First, here are my DIMMs without labels:

(rubis)-[root@rubis247 ~] $ ras-mc-ctl --error-count
Label                         	CE	UE
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#0_DIMM#0	5539	0
CPU_SrcID#1_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#2_DIMM#0	0	0

According to the report without labels, CPU SrcID#1, channel 0, slot 0 has 5539 correctable errors.

Then I labeled my DIMMs according to the Intel documentation for the S2600KPR mainboard:

https://www.intel.com/content/dam/support/us/en/documents/server-products/server-boards/S2600KP_HNS2600KP.pdf
Page 54

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Intel Corporation model S2600KPR
(rubis)-[root@rubis247 ~]$ cat /etc/ras/dimm_labels.d/intel
vendor: Intel Corporation
  model: S2600KPR
#  <label>: <mc>.<channel>.<slot>
    #CPU1
    DIMM_A1: 0.0.0
    DIMM_B1: 0.1.0
    DIMM_C1: 0.2.0
    DIMM_D1: 0.3.0

    #CPU2
    DIMM_E1: 1.0.0
    DIMM_F1: 1.1.0
    DIMM_G1: 1.2.0
    DIMM_H1: 1.3.0
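
To make the intended mapping explicit, here is a minimal Python sketch that reads the file above into a lookup table keyed by (mc, channel, slot). It is a simplified reader written for this example, not rasdaemon's own parser; the names `LABELS_TEXT` and `parse_labels` are hypothetical.

```python
import re

# Contents of /etc/ras/dimm_labels.d/intel as shown above.
LABELS_TEXT = """\
vendor: Intel Corporation
  model: S2600KPR
#  <label>: <mc>.<channel>.<slot>
    #CPU1
    DIMM_A1: 0.0.0
    DIMM_B1: 0.1.0
    DIMM_C1: 0.2.0
    DIMM_D1: 0.3.0

    #CPU2
    DIMM_E1: 1.0.0
    DIMM_F1: 1.1.0
    DIMM_G1: 1.2.0
    DIMM_H1: 1.3.0
"""


def parse_labels(text):
    """Return {(mc, channel, slot): label} from a simplified labels file."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or line.lower().startswith(("vendor:", "model:")):
            continue
        label, _, location = line.partition(":")
        match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", location.strip())
        if match:  # only simple <mc>.<channel>.<slot> entries are handled
            mapping[tuple(int(g) for g in match.groups())] = label.strip()
    return mapping


labels = parse_labels(LABELS_TEXT)
print(labels[(1, 0, 0)])  # DIMM_E1
```

With this file, location (1, 0, 0), i.e. mc1 channel 0 slot 0, is the one configured as DIMM_E1.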

Then I registered my labels and printed them:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 channel 0 slot 0                DIMM_A1              DIMM_A1             
mc0 channel 1 slot 0                DIMM_B1              DIMM_B1             
mc0 channel 2 slot 0                DIMM_C1              DIMM_C1             
mc0 channel 3 slot 0                DIMM_D1              DIMM_D1             
mc1 channel 0 slot 0                DIMM_E1              DIMM_E1             
mc1 channel 1 slot 0                DIMM_F1              DIMM_F1             
mc1 channel 2 slot 0                DIMM_G1              DIMM_G1             
mc1 channel 3 slot 0                DIMM_H1              DIMM_H1

mc1 channel 0 slot 0 corresponds to DIMM_E1, which appears to be the correct mapping according to the documentation. So the 5539 errors should be reported on DIMM_E1, but instead I get:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --error-count
Label  	CE	UE
DIMM_E1	0	0
DIMM_D1	0	0
DIMM_H1	0	0
DIMM_F1	0	0
DIMM_G1	0	0
DIMM_A1	5539	0
DIMM_B1	0	0
DIMM_C1	0	0
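
One way to see where each configured label ended up is to pair the two CE/UE reports row by row. The Python sketch below does that under one loud assumption that the output above does not confirm: both reports list the DIMMs in the same order. The variable names are hypothetical, and the (mc, channel) values come from the labels file.

```python
import re

# Rows copied from the two CE/UE reports in this issue, in the order listed.
DEFAULT_ROWS = [
    ("CPU_SrcID#0_Ha#0_Chan#0_DIMM#0", 0),
    ("CPU_SrcID#1_Ha#0_Chan#3_DIMM#0", 0),
    ("CPU_SrcID#0_Ha#0_Chan#3_DIMM#0", 0),
    ("CPU_SrcID#0_Ha#0_Chan#1_DIMM#0", 0),
    ("CPU_SrcID#0_Ha#0_Chan#2_DIMM#0", 0),
    ("CPU_SrcID#1_Ha#0_Chan#0_DIMM#0", 5539),
    ("CPU_SrcID#1_Ha#0_Chan#1_DIMM#0", 0),
    ("CPU_SrcID#1_Ha#0_Chan#2_DIMM#0", 0),
]
LABELED_ROWS = [
    ("DIMM_E1", 0), ("DIMM_D1", 0), ("DIMM_H1", 0), ("DIMM_F1", 0),
    ("DIMM_G1", 0), ("DIMM_A1", 5539), ("DIMM_B1", 0), ("DIMM_C1", 0),
]
# (mc, channel) for each label, taken from the labels file.
LABEL_TO_MC = {
    "DIMM_A1": (0, 0), "DIMM_B1": (0, 1), "DIMM_C1": (0, 2), "DIMM_D1": (0, 3),
    "DIMM_E1": (1, 0), "DIMM_F1": (1, 1), "DIMM_G1": (1, 2), "DIMM_H1": (1, 3),
}

src_to_mc = {}
# ASSUMPTION: row i of the first report is the same DIMM as row i of the second.
for (name, ce), (label, _labeled_ce) in zip(DEFAULT_ROWS, LABELED_ROWS):
    match = re.search(r"SrcID#(\d+)_Ha#\d+_Chan#(\d+)", name)
    src_id = int(match.group(1))
    mc, channel = LABEL_TO_MC[label]
    print(f"{name} -> {label} (mc{mc} channel {channel}), CE={ce}")
    src_to_mc.setdefault(src_id, set()).add(mc)

print(src_to_mc)
```

If the row-order assumption holds, every SrcID#0 row pairs with an mc1 label and every SrcID#1 row pairs with an mc0 label, with matching channel numbers. That would suggest the mc index does not match the SrcID on this machine, but it is only an inference from the output shown here.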

I also checked the IPMI SEL, and it confirms that the correctable errors are on DIMM_E1 and not DIMM_A1.

Maybe I am doing something wrong, or maybe this is a bug. Can someone confirm? :)
