OCC sensor can not be activated #58

Closed
Kenthliu opened this issue Mar 17, 2016 · 18 comments

@Kenthliu
Contributor

We found that some boards cannot get OCC sensors.
The hwmon driver did not appear in /sys/class/hwmon.
Attached are the hconsole log and the journalctl log.
OCCfail.txt

Linux version 4.3.6-openbmc-20160222-1 (openpower@openpower-VirtualBox) (gcc version 4.9.3 (GCC) ) #1 Tue Mar 8 14:03:29 CST 2016

[941]: Installing OCC device
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1072 at /home/openpower/openbmc/build/tmp/work-shared/barreleye/kernel-source/fs/sysfs/dir.c:31 sysfs_warn_dup+0x50/0x70()
sysfs: cannot create duplicate filename '/devices/platform/ahb/ahb:apb/1e78a000.i2c/i2c-3/i2c-3/3-0050/hwmon/hwmon3/freq1_input'
Modules linked in:
CPU: 0 PID: 1072 Comm: sh Not tainted 4.3.6-openbmc-20160222-1 #1
Hardware name: ASpeed SoC
[<c000f28c>] (unwind_backtrace) from [<c000cf0c>] (show_stack+0x10/0x14)
[<c000cf0c>] (show_stack) from [<c0016734>] (warn_slowpath_common+0x84/0xac)
[<c0016734>] (warn_slowpath_common) from [<c0016788>] (warn_slowpath_fmt+0x2c/0x3c)
[<c0016788>] (warn_slowpath_fmt) from [<c00dd914>] (sysfs_warn_dup+0x50/0x70)
[<c00dd914>] (sysfs_warn_dup) from [<c00dd674>] (sysfs_add_file_mode_ns+0xfc/0x180)
[<c00dd674>] (sysfs_add_file_mode_ns) from [<c00ddf80>] (internal_create_group+0x18c/0x24c)
[<c00ddf80>] (internal_create_group) from [<c0295650>] (set_occ_online+0x19c/0x35c)
[<c0295650>] (set_occ_online) from [<c00dcb28>] (kernfs_fop_write+0x128/0x188)
[<c00dcb28>] (kernfs_fop_write) from [<c008caf4>] (__vfs_write+0x20/0xd0)
[<c008caf4>] (__vfs_write) from [<c008d1a8>] (vfs_write+0xa8/0x130)
[<c008d1a8>] (vfs_write) from [<c008d8a0>] (SyS_write+0x40/0x80)
[<c008d8a0>] (SyS_write) from [<c000a280>] (ret_fast_syscall+0x0/0x38)
---[ end trace 31651e1056550963 ]---
occ-i2c 3-0050: error create freq sysfs entry
@nkskjames
Contributor

It sounds like the hardware is definitely bad, so I'm calling this an error handling enhancement as opposed to a bug. The hwmon driver should have a better error message.

@adamliyi
Member

I'd propose a workaround just for testing (it is still not clear what the root cause is).
See the attachment.

power8_occ_i2c.c.patch.txt

diff --git a/drivers/hwmon/power8_occ_i2c.c b/drivers/hwmon/power8_occ_i2c.c
index c9c70d1..c5f843a 100644
--- a/drivers/hwmon/power8_occ_i2c.c
+++ b/drivers/hwmon/power8_occ_i2c.c
@@ -1317,6 +1317,12 @@ static ssize_t set_occ_online(struct device *dev,
            return PTR_ERR(data->hwmon_dev);

        err = occ_create_hwmon_attribute(dev);
+       /* Workaround: For some reason, a sysfs entry already existed.
+        * occ_create_hwmon_attribute() will do clean up.
+        * Try to create again. */
+       if (err == -EEXIST) {
+           err = occ_create_hwmon_attribute(dev);
+       }
        if (err) {
            hwmon_device_unregister(data->hwmon_dev);
            return err;
@@ -1325,7 +1331,7 @@ static ssize_t set_occ_online(struct device *dev,
        dev_dbg(dev, "%s: sensor '%s'\n",
            dev_name(data->hwmon_dev), data->client->name);
    } else if (val == 0) {
-       if (data->occ_online == 0)
+       if (data->occ_online == 0 && !data->hwmon_dev)
            return count;

        occ_remove_sysfs_files(data->hwmon_dev);

@shenki
Member

shenki commented Mar 18, 2016

Have you attempted to work out why the sysfs entry already exists?

@shenki
Member

shenki commented Mar 18, 2016

This is a bug in the occ driver that needs to be resolved.

@adamliyi
Member

@shenki, thanks. I now know how to paste code in comments correctly :-).
The problem is that I cannot reproduce this issue on Barreleye, which makes it hard to debug. According to @Kenthliu, the issue happens when powering on the host. I tried "obmcutil poweron" several times, but no such warning appeared. I proposed this workaround only so that @Kenthliu can test on his Barreleye whether the OCC driver can still be used despite the warning. As a next step, we need to find the root cause of the duplicated OCC sysfs attributes.

@nkskjames nkskjames added bug and removed enhancement labels Mar 21, 2016
@Kenthliu
Contributor Author

Hi @adamliyi and @shenki
I saw a different symptom on a different system.
I ran the version that includes @adamliyi's patch, but the OCC sensor still did not show up.
I assumed it was the same failure as the previous one, but it seems not.

59.83819|ISTEP 21. 1
68.20420|================================================
68.20423|Error reported by occc (0x2A00)
68.20423|
68.20423| ModuleId 0x02 unknown
68.20424| ReasonCode 0x2a00 unknown
68.20424| UserData1 unknown : 0x000c4654df8ec7c3
68.20424| UserData2 unknown : 0x0000044b00000065
68.20425|User Data Section 0, type UD
68.20425| Subsection type 0x15
68.20425| ComponentId hb-trace (0x3100)
68.20425|User Data Section 1, type UD
68.20426| Subsection type 0x04
68.20426| ComponentId errl (0x0100)
68.20426|User Data Section 2, type UD
68.20426| Subsection type 0x06
68.20427| ComponentId errl (0x0100)
68.20427| CALLOUT
68.20427| HW CALLOUT
68.20427| Reporting CPU ID: 111
68.20427| Called out entity:
68.20428|User Data Section 3, type UD
68.20428| Subsection type 0x06
68.20428| ComponentId errl (0x0100)
68.20428| CALLOUT
68.20429| PROCEDURE ERROR
68.20429| Procedure: 85
68.20429|User Data Section 4, type UD
68.20429| Subsection type 0x00
68.20429| ComponentId occc (0x2a00)
68.20430|User Data Section 5, type UD
68.20430| Subsection type 0x03
68.20430| ComponentId errl (0x0100)
68.20430|User Data Section 6, type UD
68.20431| Subsection type 0x01
68.20431| ComponentId errl (0x0100)
68.20431| STRING
68.20431| Hostboot Build ID:
68.20432|User Data Section 7, type UD
68.20432| Subsection type 0x04
68.20432| ComponentId errl (0x0100)
68.20432|================================================
68.56955|================================================
68.56955|Error reported by occc (0x2A00)
68.56956|
68.56956| ModuleId 0x03 unknown
68.56956| ReasonCode 0x2a00 unknown
68.56957| UserData1 unknown : 0x000c4654df8ec7c3
68.56957| UserData2 unknown : 0x0000044b00000065
68.56957|User Data Section 0, type UD
68.56957| Subsection type 0x15
68.56958| ComponentId hb-trace (0x3100)
68.56958|User Data Section 1, type UD
68.56958| Subsection type 0x04
68.56958| ComponentId errl (0x0100)
68.56959|User Data Section 2, type UD
68.56959| Subsection type 0x06
68.56959| ComponentId errl (0x0100)
68.56959| CALLOUT
68.56960| HW CALLOUT
68.56960| Reporting CPU ID: 1140
68.56960| Called out entity:
68.56960|User Data Section 3, type UD
68.56961| Subsection type 0x06
68.56961| ComponentId errl (0x0100)
68.56961| CALLOUT
68.56961| PROCEDURE ERROR
68.56962| Procedure: 85
68.56962|User Data Section 4, type UD
68.56962| Subsection type 0x00
68.56962| ComponentId occc (0x2a00)
68.56963|User Data Section 5, type UD
68.56963| Subsection type 0x03
68.56963| ComponentId errl (0x0100)
68.56963|User Data Section 6, type UD
68.56964| Subsection type 0x01
68.56964| ComponentId errl (0x0100)
68.56964| STRING
68.56964| Hostboot Build ID:
68.56965|User Data Section 7, type UD
68.56965| Subsection type 0x04
68.56965| ComponentId errl (0x0100)
68.56965|================================================
87.55895|htmgt|OCCs are not active (rc=0x2605). Attempting OCC Reset
93.62257|================================================
93.62258|Error reported by occc (0x2A00)
93.62258|
93.62258| ModuleId 0x02 unknown
93.62258| ReasonCode 0x2a00 unknown
93.62259| UserData1 unknown : 0x000c4654df8ec7c3
93.62259| UserData2 unknown : 0x0000044b00000065
93.62259|User Data Section 0, type UD
93.62259| Subsection type 0x15
93.62260| ComponentId hb-trace (0x3100)
93.62260|User Data Section 1, type UD
93.62260| Subsection type 0x04
93.62260| ComponentId errl (0x0100)
93.62261|User Data Section 2, type UD
93.62261| Subsection type 0x06
93.62261| ComponentId errl (0x0100)
93.62261| CALLOUT
93.62262| HW CALLOUT
93.62262| Reporting CPU ID: 107
93.62262| Called out entity:
93.62262|User Data Section 3, type UD
93.62263| Subsection type 0x06
93.62263| ComponentId errl (0x0100)
93.62263| CALLOUT
93.62263| PROCEDURE ERROR
93.62263| Procedure: 85
93.62264|User Data Section 4, type UD
93.62264| Subsection type 0x00
93.62264| ComponentId occc (0x2a00)
93.62264|User Data Section 5, type UD
93.62265| Subsection type 0x03
93.62265| ComponentId errl (0x0100)
93.62265|User Data Section 6, type UD
93.62265| Subsection type 0x01
93.62266| ComponentId errl (0x0100)
93.62266| STRING
93.62266| Hostboot Build ID:
93.62266|User Data Section 7, type UD
93.62266| Subsection type 0x04
93.62267| ComponentId errl (0x0100)
93.62267|================================================
94.02325|================================================
94.02325|Error reported by occc (0x2A00)
94.02325|
94.02326| ModuleId 0x03 unknown
94.02326| ReasonCode 0x2a00 unknown
94.02326| UserData1 unknown : 0x000c4654df8ec7c3
94.02327| UserData2 unknown : 0x0000044b00000065
94.02327|User Data Section 0, type UD
94.02327| Subsection type 0x15
94.02327| ComponentId hb-trace (0x3100)
94.02328|User Data Section 1, type UD
94.02328| Subsection type 0x04
94.02328| ComponentId errl (0x0100)
94.02328|User Data Section 2, type UD
94.02329| Subsection type 0x06
94.02329| ComponentId errl (0x0100)
94.02329| CALLOUT
94.02329| HW CALLOUT
94.02329| Reporting CPU ID: 107
94.02330| Called out entity:
94.02330|User Data Section 3, type UD
94.02330| Subsection type 0x06
94.02330| ComponentId errl (0x0100)
94.02331| CALLOUT
94.02331| PROCEDURE ERROR
94.02331| Procedure: 85
94.02331|User Data Section 4, type UD
94.02332| Subsection type 0x00
94.02332| ComponentId occc (0x2a00)
94.02332|User Data Section 5, type UD
94.02332| Subsection type 0x03
94.02332| ComponentId errl (0x0100)
94.02333|User Data Section 6, type UD
94.02333| Subsection type 0x01
94.02333| ComponentId errl (0x0100)
94.02333| STRING
94.02334| Hostboot Build ID:
94.02334|User Data Section 7, type UD
94.02334| Subsection type 0x04
94.02334| ComponentId errl (0x0100)
94.02335|================================================
112.40932|htmgt|OCCs are not active (rc=0x2605). Attempting OCC Reset
125.93643|================================================
125.93643|Error reported by occc (0x2A00)
125.93643|
125.93644| ModuleId 0x02 unknown
125.93644| ReasonCode 0x2a00 unknown
125.93644| UserData1 unknown : 0x000c4654df8ec7c3
125.93644| UserData2 unknown : 0x0000044b00000065
125.93645|User Data Section 0, type UD
125.93645| Subsection type 0x15
125.93645| ComponentId hb-trace (0x3100)
125.93646|User Data Section 1, type UD
125.93646| Subsection type 0x04
125.93646| ComponentId errl (0x0100)
125.93646|User Data Section 2, type UD
125.93647| Subsection type 0x06
125.93647| ComponentId errl (0x0100)
125.93647| CALLOUT
125.93647| HW CALLOUT
125.93648| Reporting CPU ID: 111
125.93648| Called out entity:
125.93648|User Data Section 3, type UD
125.93648| Subsection type 0x06
125.93649| ComponentId errl (0x0100)
125.93649| CALLOUT
125.93649| PROCEDURE ERROR
125.93649| Procedure: 85
125.93650|User Data Section 4, type UD
125.93650| Subsection type 0x00
125.93650| ComponentId occc (0x2a00)
125.93650|User Data Section 5, type UD
125.93651| Subsection type 0x03
125.93651| ComponentId errl (0x0100)
125.93651|User Data Section 6, type UD
125.93651| Subsection type 0x01
125.93652| ComponentId errl (0x0100)
125.93652| STRING
125.93652| Hostboot Build ID:
125.93652|User Data Section 7, type UD
125.93653| Subsection type 0x04
125.93653| ComponentId errl (0x0100)
125.93653|================================================
118.78015|================================================
118.78016|Error reported by occc (0x2A00)
118.78016|
118.78016| ModuleId 0x03 unknown
118.78017| ReasonCode 0x2a00 unknown
118.78017| UserData1 unknown : 0x000c4654df8ec7c3
118.78017| UserData2 unknown : 0x0000044b00000065
118.78018|User Data Section 0, type UD
118.78018| Subsection type 0x15
118.78018| ComponentId hb-trace (0x3100)
118.78018|User Data Section 1, type UD
118.78019| Subsection type 0x04
118.78019| ComponentId errl (0x0100)
118.78019|User Data Section 2, type UD
118.78019| Subsection type 0x06
118.78020| ComponentId errl (0x0100)
118.78020| CALLOUT
118.78020| HW CALLOUT
118.78020| Reporting CPU ID: 111
118.78020| Called out entity:
118.78021|User Data Section 3, type UD
118.78021| Subsection type 0x06
118.78021| ComponentId errl (0x0100)
118.78021| CALLOUT
118.78022| PROCEDURE ERROR
118.78022| Procedure: 85
118.78022|User Data Section 4, type UD
118.78022| Subsection type 0x00
118.78022| ComponentId occc (0x2a00)
118.78023|User Data Section 5, type UD
118.78023| Subsection type 0x03
118.78023| ComponentId errl (0x0100)
118.78023|User Data Section 6, type UD
118.78024| Subsection type 0x01
118.78024| ComponentId errl (0x0100)
118.78024| STRING
118.78024| Hostboot Build ID:
118.78025|User Data Section 7, type UD
118.78025| Subsection type 0x04
118.78025| ComponentId errl (0x0100)
118.78025|================================================
136.76839|htmgt|OCCs are not active (rc=0x2605). Attempting OCC Reset
142.78960|================================================
142.78961|Error reported by occc (0x2A00)
142.78961|
142.78962| ModuleId 0x02 unknown
142.78962| ReasonCode 0x2a00 unknown
142.78963| UserData1 unknown : 0x000c4654df8ec7c3
142.78963| UserData2 unknown : 0x0000044b00000065
142.78964|User Data Section 0, type UD
142.78964| Subsection type 0x15
142.78965| ComponentId hb-trace (0x3100)
142.78965|User Data Section 1, type UD
142.78965| Subsection type 0x04
142.78966| ComponentId errl (0x0100)
142.78966|User Data Section 2, type UD
142.78966| Subsection type 0x06
142.78967| ComponentId errl (0x0100)
142.78967| CALLOUT
142.78967| HW CALLOUT
142.78967| Reporting CPU ID: 90
142.78968| Called out entity:
142.78968|User Data Section 3, type UD
142.78968| Subsection type 0x06
142.78968| ComponentId errl (0x0100)
142.78968| CALLOUT
142.78969| PROCEDURE ERROR
142.78969| Procedure: 85
142.78969|User Data Section 4, type UD
142.78969| Subsection type 0x00
142.78970| ComponentId occc (0x2a00)
142.78970|User Data Section 5, type UD
142.78970| Subsection type 0x03
142.78970| ComponentId errl (0x0100)
142.78971|User Data Section 6, type UD
142.78971| Subsection type 0x01
142.78971| ComponentId errl (0x0100)
142.78971| STRING
142.78971| Hostboot Build ID:
142.78972|User Data Section 7, type UD
142.78972| Subsection type 0x04
142.78972| ComponentId errl (0x0100)
142.78972|================================================
155.16504|================================================
155.16504|Error reported by occc (0x2A00)
155.16504|
155.16505| ModuleId 0x03 unknown
155.16505| ReasonCode 0x2a00 unknown
155.16505| UserData1 unknown : 0x000c4654df8ec7c3
155.16506| UserData2 unknown : 0x0000044b00000065
155.16506|User Data Section 0, type UD
155.16506| Subsection type 0x15
155.16506| ComponentId hb-trace (0x3100)
155.16507|User Data Section 1, type UD
155.16507| Subsection type 0x04
155.16507| ComponentId errl (0x0100)
155.16507|User Data Section 2, type UD
155.16508| Subsection type 0x06
155.16508| ComponentId errl (0x0100)
155.16508| CALLOUT
155.16508| HW CALLOUT
155.16509| Reporting CPU ID: 90
155.16509| Called out entity:
155.16509|User Data Section 3, type UD
155.16509| Subsection type 0x06
155.16510| ComponentId errl (0x0100)
155.16510| CALLOUT
155.16510| PROCEDURE ERROR
155.16510| Procedure: 85
155.16510|User Data Section 4, type UD
155.16511| Subsection type 0x00
155.16511| ComponentId occc (0x2a00)
155.16511|User Data Section 5, type UD
155.16512| Subsection type 0x03
155.16512| ComponentId errl (0x0100)
155.16512|User Data Section 6, type UD
155.16512| Subsection type 0x01
155.16513| ComponentId errl (0x0100)
155.16513| STRING
155.16513| Hostboot Build ID:
155.16513|User Data Section 7, type UD
155.16514| Subsection type 0x04
155.16514| ComponentId errl (0x0100)
155.16514|================================================
161.06516|htmgt|OCCs are not active (rc=0x2605). Attempting OCC Reset
176.05064|htmgt|setOccActiveSensors failed. (OCC0 state:0)
176.05066|================================================
176.05066|Error reported by ipmi (0x2500)
161.07500| Set sensor reading command failed.
161.07500| ModuleId 0x03 IPMI::MOD_IPMISENSOR
161.07500| ReasonCode 0x2508 IPMI::RC_SET_SENSOR_FAILURE
161.07501| UserData1 BMC IPMI Completion code. : 0x00000000000000c3
161.07502| UserData2 bytes [0-3]sensor number bytes [4-7]HUID of target. : 0x0000000800130000
161.07502|User Data Section 0, type UD
161.07502| Subsection type 0x06
161.07502| ComponentId errl (0x0100)
161.07503| CALLOUT
161.07503| PROCEDURE ERROR
161.07503| Procedure: 85
161.07503|User Data Section 1, type UD
161.07504| Subsection type 0x15
161.07504| ComponentId hb-trace (0x3100)
161.07504|User Data Section 2, type UD
161.07504| Subsection type 0x03
161.07505| ComponentId errl (0x0100)
161.07505|User Data Section 3, type UD
161.07505| Subsection type 0x01
161.07505| ComponentId errl (0x0100)
161.07506| STRING
161.07506| Hostboot Build ID:
161.07506|================================================
176.07556|htmgt|setOccActiveSensors failed. (OCC1 state:0)
161.49222|================================================
161.49222|Error reported by ipmi (0x2500)
161.49222| Set sensor reading command failed.
161.49222| ModuleId 0x03 IPMI::MOD_IPMISENSOR
161.49223| ReasonCode 0x2508 IPMI::RC_SET_SENSOR_FAILURE
161.49223| UserData1 BMC IPMI Completion code. : 0x00000000000000c3
161.49224| UserData2 bytes [0-3]sensor number bytes [4-7]HUID of target. : 0x0000000a00130001
161.49224|User Data Section 0, type UD
161.49225| Subsection type 0x06
161.49225| ComponentId errl (0x0100)
161.49225| CALLOUT
161.49225| PROCEDURE ERROR
161.49225| Procedure: 85
161.49226|User Data Section 1, type UD
161.49226| Subsection type 0x15
161.49226| ComponentId hb-trace (0x3100)
161.49226|User Data Section 2, type UD
161.49227| Subsection type 0x03
161.49227| ComponentId errl (0x0100)
161.49227|User Data Section 3, type UD
161.49228| Subsection type 0x01
161.49228| ComponentId errl (0x0100)
161.49228| STRING
161.49228| Hostboot Build ID:
161.49229|================================================
177.09829|htmgt|OCCs are not active. The system will remain in safe mode (RC: 0x2616 for OCC0)
177.09638|================================================
177.09638|Error reported by htmgt (0x2600)
177.12072| OCC not ready for target state
177.12072| ModuleId 0x02 HTMGT_MOD_WAIT_FOR_OCC_READY
177.12072| ReasonCode 0x2605 HTMGT_RC_OCC_NOT_READY
177.12073| UserData1 OCC instance : 0x0000000100000028
177.12073| UserData2 poll attempts : 0x0000000100000000
177.12073|User Data Section 0, type UD
177.12074| Subsection type 0x15
177.12074| ComponentId hb-trace (0x3100)
177.12074|User Data Section 1, type UD
177.12074| Subsection type 0x06
177.12075| ComponentId errl (0x0100)
177.12075| CALLOUT
177.12075| PROCEDURE ERROR
177.12075| Procedure: 85
177.12076|User Data Section 2, type UD
177.12076| Subsection type 0x03
177.12076| ComponentId errl (0x0100)
177.12077|User Data Section 3, type UD
177.12077| Subsection type 0x01
177.12077| ComponentId errl (0x0100)
177.12077| STRING
177.12078| Hostboot Build ID:
177.12078|================================================
162.34073|================================================
162.34074|Error reported by htmgt (0x2600)
162.34074|
162.34074| ModuleId 0x07 unknown
162.34074| ReasonCode 0x2616 unknown
162.34075| UserData1 unknown : 0x0000000000000000
162.34075| UserData2 unknown : 0x0000000000000000
162.34075|User Data Section 0, type UD
162.34076| Subsection type 0x15
162.34076| ComponentId hb-trace (0x6|User Data Section 1, type UD
162.34076| Subsection type 0x06
162.34077| ComponentId errl (0x0100)
162.34077| CALLOUT
162.34077| PROCEDURE ERROR
162.34077| Procedure: 16
162.34078|User Data Section 2, type UD
162.34078| Subsection type 0x06
162.34078| ComponentId errl (0x0100)
162.34078| CALLOUT
162.34079| PROCEDURE ERROR
162.34079| Procedure: 85
162.34079|User Data Section 3, type UD
162.34079| Subsection type 0x03
162.34079| ComponentId errl (0x0100)
162.34080|User Data Section 4, type UD
162.34080| Subsection type 0x01
162.34080| ComponentId errl (0x0100)
162.34080| STRING
162.34081| Hostboot Build ID:
162.34081|================================================
186.10328|ISTEP 21. 2
169.72752|ISTEP 21. 3
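
The two IPMI errors in the log above can be decoded by hand. The sketch below is only an illustration: the struct and function names are hypothetical, and it assumes that "bytes [0-3]" of UserData2 are the most significant four bytes. Under that assumption the first error refers to sensor number 0x08 with HUID 0x00130000, and the second to sensor number 0x0a with HUID 0x00130001. Completion code 0xC3 is the generic IPMI "timeout while processing command" code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Generic IPMI completion code: timeout while processing command. */
#define IPMI_CC_TIMEOUT 0xC3u

/* Hypothetical container for the fields of an RC_SET_SENSOR_FAILURE entry. */
struct set_sensor_failure {
    uint8_t  completion_code; /* low byte of UserData1 */
    uint32_t sensor_number;   /* UserData2 bytes [0-3], assumed most significant */
    uint32_t huid;            /* UserData2 bytes [4-7], assumed least significant */
};

/* Split the two 64-bit user data words into their documented fields. */
static struct set_sensor_failure
decode_set_sensor_failure(uint64_t userdata1, uint64_t userdata2)
{
    struct set_sensor_failure f;

    f.completion_code = (uint8_t)(userdata1 & 0xffu);
    f.sensor_number = (uint32_t)(userdata2 >> 32);
    f.huid = (uint32_t)(userdata2 & 0xffffffffu);
    return f;
}

/* True when the BMC reported a timeout for the set-sensor command. */
static bool is_ipmi_timeout(const struct set_sensor_failure *f)
{
    return f->completion_code == IPMI_CC_TIMEOUT;
}
```

Feeding in the values from the first entry (UserData1 0x00000000000000c3, UserData2 0x0000000800130000) yields a timeout on sensor 8, which suggests the BMC did not answer the set-sensor request in time rather than rejecting it.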

@adamliyi
Member

@Kenthliu, the error message in your last comment looks like an issue with the OCC itself, not the BMC.

@Kenthliu
Contributor Author

I tried @adamliyi's solution, and it fixes this issue. Here is the log. I found another error message on that system that seems to be a separate issue; please help check it. @nkskjames
occputty2.txt

117.60312|================================================
117.60313|Error reported by unknown (0xE500)
117.60313|
117.60313| ModuleId 0x0b unknown
117.60314| ReasonCode 0xe504 unknown
117.60314| UserData1 unknown : 0x000d000a00000404
117.60315| UserData2 unknown : 0xffff001000000000
117.60315|User Data Section 0, type UD
117.60315| Subsection type 0x06
117.60316| ComponentId errl (0x0100)
117.60316| CALLOUT
117.60316| HW CALLOUT
117.60316| Reporting CPU ID: 8
117.60317| Called out entity:
117.60317|User Data Section 1, type UD
117.60317| Subsection type 0x33
117.60318| ComponentId unknown (0xe500)
117.60318|User Data Section 2, type UD
117.60318| Subsection type 0x01
117.60319| ComponentId unknown (0xe500)
117.60319| STRING
117.60319|
117.60319|User Data Section 3, type UD
117.60320| Subsection type 0x15
117.60320| ComponentId hb-trace (0x3100)
117.60320|User Data Section 4, type UD
117.60321| Subsection type 0x03
117.60321| ComponentId errl (0x0100)
117.60321|User Data Section 5, type UD
117.60322| Subsection type 0x01
117.60322| ComponentId errl (0x0100)
117.60322| STRING
117.60322| Hostboot Build ID:
117.60323|User Data Section 6, type UD
117.60323| Subsection type 0x04
117.60323| ComponentId errl (0x0100)
117.60324|================================================

@shenki
Member

shenki commented Mar 22, 2016

@Kenthliu you need to open a new bug with this error, it's not related to the original bug.

@williamspatrick how do you want these to be tracked? Should we get people to report the issue against open-power/hostboot?

@adamliyi
Member

Submitted a pull request to fix this issue: openbmc/linux#65
@Kenthliu has tested it on his 12-core Barreleye, and the "duplicated filename" warning disappears.
This fix is not ideal: it still hard-codes the number of sensors for a 12-core CPU.
I will try to create the OCC hwmon sysfs entries dynamically, based on the number of sensors polled from the OCC, but I am not sure I can make the next release window.

adamliyi added a commit to adamliyi/openbmc_linux that referenced this issue Mar 22, 2016
…upport more sensors

This patch fixes issue: openbmc/skeleton#58

The hwmon sys attributes are created using statically defined arrays.
Some POWER CPU has 10-core, while some POWER CPU has 12-core.
The more cores, the more OCC sensors. E.g, for 12-core CPU,
there will be 28 temperature sensors. The statically defined
array will overflow in this case.

This is a temporary fix. Will need to generate the hwmon sysfs attributes
dynamically.

Signed-off-by: Yi Li <adamliyi@msn.com>
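
The overflow described in this commit message can be pictured with a small bounds check. This is a minimal sketch, not the driver's code: the table capacity `STATIC_TEMP_ATTRS` and both function names are hypothetical, since the real array sizes are not shown in this thread. Only the figure of 28 temperature sensors on a 12-core CPU comes from the message above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity of a statically defined attribute table.
 * The real driver's table size is not given in this thread. */
#define STATIC_TEMP_ATTRS 20

/* True when every sensor reported by the OCC has a slot in the table. */
static bool temp_sensors_fit(size_t sensors_reported)
{
    return sensors_reported <= STATIC_TEMP_ATTRS;
}

/* Number of sensors that would fall outside the static table. */
static size_t temp_sensors_overflow(size_t sensors_reported)
{
    if (temp_sensors_fit(sensors_reported))
        return 0;
    return sensors_reported - STATIC_TEMP_ATTRS;
}
```

With the assumed capacity of 20, a CPU reporting 28 temperature sensors leaves 8 entries without a slot. Enlarging the table, as the temporary fix does, only moves the limit, which is why the message calls for generating the attributes dynamically.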
@williamspatrick
Member

Errors reported by "0xE500" correspond to the diagnostics component, so
they are not typically a software issue. This one in particular is a
memory ECC UE:

Mba.prf.err.C: PRDR_ERROR_SIGNATURE ( 0xffff0010, "", "Maintenance UE")
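
The signature quoted here can be matched against the log by splitting UserData2 into two 32-bit halves. This is a sketch under one assumption: that the signature is carried in the upper half of UserData2. That assumption fits the value in this thread (0xffff001000000000 gives 0xffff0010), but the layout is not confirmed anywhere in the discussion, and the function names are hypothetical.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit user data word.
 * Assumed here to hold the PRD error signature. */
static uint32_t userdata_hi32(uint64_t value)
{
    return (uint32_t)(value >> 32);
}

/* Lower 32 bits of a 64-bit user data word. */
static uint32_t userdata_lo32(uint64_t value)
{
    return (uint32_t)(value & 0xffffffffu);
}
```

Applied to UserData2 from the 0xE500 entry, `userdata_hi32` returns 0xffff0010, the "Maintenance UE" signature above, and `userdata_lo32` returns 0.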


Patrick Williams

@Kenthliu
Contributor Author

@williamspatrick So you think this is another HW issue, with something wrong in the memory? I think it is related to the SEL log we saw; see the attachment.
Dimm16 log.txt

@Kenthliu
Contributor Author

@adamliyi image-bmc_occ_0322_final works fine. This issue can be closed now. For the DIMM issue, I will open another issue if needed. Thanks.

@nkskjames
Contributor

@Kenthliu I would like to close this, but it looks like you need to open another issue for the DIMM failure. I will wait to close this one until the other issue is opened so we don't forget.

@Kenthliu
Contributor Author

Kenthliu commented May 3, 2016

@nkskjames I opened the issue at #71, but the symptom does not reproduce 100% of the time on every system.

@Kenthliu
Contributor Author

This issue happened again in 0.7. Please take a look.
occfail2.txt

@adamliyi
Member

obmc_v0.7 does not include my temporary fix from openbmc/linux#65, so this bug appears again.

adamliyi added a commit to adamliyi/openbmc_linux that referenced this issue May 16, 2016
This patch is a linux-4.4 port of previous patch:
openbmc@f8087df

It is created as the backup workaround in order to catch 5/19 release,
for the blocking occ issue: openbmc/skeleton#58

I am working on a new fix to dynamically create those sysfs attributes.
If the new fix cannot catch 5/19 release, we can use this patch temporarily.
When the new fix come out, this patch can be replaced.

Signed-off-by: Yi Li <adamliyi@msn.com>
adamliyi added a commit to adamliyi/openbmc_linux that referenced this issue May 20, 2016
This patch fixes issue: openbmc/skeleton#58.

OCC sensor number varies for different platforms.
The patch creates hwmon sysfs attributes dynamically, using sensor information
get from OCC. Previously the sysfs attributes are created using statically
defined data structures.

Signed-off-by: Yi Li <adamliyi@msn.com>
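
The dynamic approach described in this commit message can be sketched in userspace C. This is an illustration only and not the driver's implementation: `build_attr_names` and `free_attr_names` are hypothetical helpers, and a kernel driver would use kernel allocators and attribute structures instead of `malloc` and plain strings. The `<type>N_input` naming follows the hwmon sysfs convention seen in the warning earlier in this thread (`freq1_input`).

```c
#include <stdio.h>
#include <stdlib.h>

/* Release the first `count` names and the array itself. */
static void free_attr_names(char **names, size_t count)
{
    if (!names)
        return;
    for (size_t i = 0; i < count; i++)
        free(names[i]);
    free(names);
}

/* Build one attribute name per sensor, e.g. "temp1_input" .. "tempN_input".
 * `count` is whatever the OCC reports at run time, so no fixed-size
 * table is involved. Returns NULL on allocation failure. */
static char **build_attr_names(const char *prefix, size_t count)
{
    char **names = calloc(count ? count : 1, sizeof(*names));

    if (!names)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        /* First pass measures the name, second pass writes it. */
        int len = snprintf(NULL, 0, "%s%zu_input", prefix, i + 1);

        if (len < 0) {
            free_attr_names(names, i);
            return NULL;
        }
        names[i] = malloc((size_t)len + 1);
        if (!names[i]) {
            free_attr_names(names, i);
            return NULL;
        }
        snprintf(names[i], (size_t)len + 1, "%s%zu_input", prefix, i + 1);
    }
    return names;
}
```

Calling `build_attr_names("temp", 28)` covers the 12-core case from the earlier commit message, and the same call with a smaller count covers a 10-core part without any source change.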
shenki pushed a commit to openbmc/linux that referenced this issue May 20, 2016
This patch fixes issue: openbmc/skeleton#58.

OCC sensor number varies for different platforms.  The patch creates
hwmon sysfs attributes dynamically, using sensor information get from
OCC. Previously the sysfs attributes are created using statically
defined data structures.

Signed-off-by: Yi Li <adamliyi@msn.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
@williamspatrick
Member

I believe this is resolved. Please re-open if observed in latest tag.

Kenthliu pushed a commit to Kenthliu/linux that referenced this issue Jun 3, 2016
This patch fixes issue: openbmc/skeleton#58.

OCC sensor number varies for different platforms.  The patch creates
hwmon sysfs attributes dynamically, using sensor information get from
OCC. Previously the sysfs attributes are created using statically
defined data structures.

Signed-off-by: Yi Li <adamliyi@msn.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>