phosphor-hwmon: remove sensors that fail with EAGAIN from bus #2327

Closed
spinler opened this issue Sep 18, 2017 · 4 comments

@spinler

spinler commented Sep 18, 2017

In the design call today, it was decided to treat sensors that fail with EAGAIN differently than other failures.

If a sensor read fails with EAGAIN past the retry threshold, it should be removed from D-Bus. Any clients that are monitoring that sensor can register for the InterfacesRemoved signal on it to know when it goes away and act accordingly. No hardware callout will be made.
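A minimal sketch of the decision described above, assuming a hypothetical `readWithRetry` helper and a read callback that returns 0 on success or a positive errno value on failure. None of these names come from phosphor-hwmon. The issue does not say how other errors are handled, so this sketch simply reports them immediately as a hard failure.

```cpp
#include <cerrno>
#include <functional>
#include <optional>

// Hypothetical outcome of a sensor read; names are illustrative only.
enum class ReadOutcome { Ok, RemoveFromDBus, HardFailure };

struct ReadResult
{
    ReadOutcome outcome;
    std::optional<long> value; // set only when outcome == Ok
};

// readFn returns 0 and fills its argument on success, or a positive errno.
// Only EAGAIN is retried here (an assumption); once the retries are used up
// the caller is told to drop the sensor from D-Bus rather than log a callout.
ReadResult readWithRetry(const std::function<int(long&)>& readFn,
                         unsigned maxRetries)
{
    int rc = 0;
    for (unsigned attempt = 0; attempt <= maxRetries; ++attempt)
    {
        long v = 0;
        rc = readFn(v);
        if (rc == 0)
        {
            return {ReadOutcome::Ok, v};
        }
        if (rc != EAGAIN)
        {
            break; // not retried in this sketch
        }
    }
    return {rc == EAGAIN ? ReadOutcome::RemoveFromDBus
                         : ReadOutcome::HardFailure,
            std::nullopt};
}
```

A client watching the sensor would then see the object disappear via the InterfacesRemoved signal instead of receiving a stale or bogus value.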

@zahrens zahrens added this to the openBMC v2.0 Backlog milestone Sep 20, 2017
@rfrandse rfrandse added defer and removed Phase 6 labels Sep 27, 2017
@rfrandse rfrandse modified the milestones: openBMC v2.0 Backlog, openBMC v3.0 Backlog Sep 27, 2017
@msbarth

msbarth commented Mar 20, 2018

A couple of ideas to consider in supporting this:

  1. Add a command line parameter to phosphor-hwmon-readd called "Remove on RC" that takes an optional return code and an optional list of sensors. When that return code is received on any of the listed sensors, the sensor's dbus object is removed. (Because both the rc and the sensor list are optional, this parameter could also replace the current configure flag used to remove any/all sensors from dbus, moving that choice to a per hwmon instance basis.)
     phosphor-hwmon-readd --rm_on_rc [RC [SENSORS ...]]
  2. For each sensor in the hwmon device conf file, a "Remove on RC" entry could be set to a particular return code. The sensor would be removed from dbus whenever a read returns that code.
     LABEL_temp198 = "gpu0_core_temp"
     REMOVERC_temp198 = "EAGAIN"
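To make option 2 concrete, here is a rough sketch of how `REMOVERC_<sensor>` entries could be pulled out of conf text in the `KEY = "VALUE"` shape shown above. The helper names (`errnoFromName`, `parseRemoveRcs`) and the small errno table are assumptions for illustration; how phosphor-hwmon actually loads its conf file is not shown in this issue.

```cpp
#include <cerrno>
#include <map>
#include <optional>
#include <sstream>
#include <string>

// Translate a symbolic errno name to its value. The table is illustrative
// and deliberately tiny; a real implementation would cover more codes.
std::optional<int> errnoFromName(const std::string& name)
{
    static const std::map<std::string, int> table{
        {"EAGAIN", EAGAIN}, {"EIO", EIO}, {"ETIMEDOUT", ETIMEDOUT}};
    auto it = table.find(name);
    if (it == table.end())
    {
        return std::nullopt;
    }
    return it->second;
}

// Strip surrounding whitespace and double quotes.
static std::string trim(const std::string& s)
{
    const char* strip = " \t\r\n\"";
    auto b = s.find_first_not_of(strip);
    if (b == std::string::npos)
    {
        return "";
    }
    auto e = s.find_last_not_of(strip);
    return s.substr(b, e - b + 1);
}

// Return a map of sensor id (e.g. "temp198") to the errno that should
// trigger removal. Lines without the REMOVERC_ prefix, or with an
// unrecognised code name, are ignored.
std::map<std::string, int> parseRemoveRcs(const std::string& confText)
{
    const std::string prefix = "REMOVERC_";
    std::map<std::string, int> result;
    std::istringstream in(confText);
    std::string line;
    while (std::getline(in, line))
    {
        auto eq = line.find('=');
        if (eq == std::string::npos)
        {
            continue;
        }
        std::string key = trim(line.substr(0, eq));
        std::string value = trim(line.substr(eq + 1));
        if (key.compare(0, prefix.size(), prefix) != 0)
        {
            continue;
        }
        if (auto rc = errnoFromName(value))
        {
            result[key.substr(prefix.size())] = *rc;
        }
    }
    return result;
}
```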

@spinler spinler assigned msbarth and unassigned spinler Mar 21, 2018
@msbarth

msbarth commented Mar 21, 2018

It seems from the service file setup that adding an additional command line parameter is not feasible with how each hwmon instance is started. Going to propose option 2 as the best solution to the mailing list.

geissonator pushed a commit to openbmc/phosphor-hwmon that referenced this issue Mar 23, 2018
This is a temporary fix until the following issues are completed:
    openbmc/openbmc#2327
    openbmc/openbmc#2329

When an EAGAIN or an EREMOTEIO return code is received by hwmon
from the OCC driver in the 4.13 kernel, they should be translated to
an unavailable sensor(0x00) and failed sensor(0xFF) scaled values
respectively. This will keep the OCC hwmon instance running and allow
applications to continue using these sensors as they were reported under
the mainline openbmc/linux 4.10 kernel.

Tested:
    Verified return codes are caught and sensor value modified

Change-Id: Ie61859863e7d88878caa942e5f5b062acabe67aa
Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
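The translation this commit message describes might look roughly like the sketch below. The function name `substituteForError` is hypothetical, and the commit's actual code is not reproduced in this thread. The message calls these "scaled values" but does not show how scaling is applied, so that step is left out.

```cpp
#include <cerrno>
#include <optional>

#ifndef EREMOTEIO
#define EREMOTEIO 121 // Linux value; defined only so the sketch builds elsewhere
#endif

// Map the two return codes named in the commit message to the substitute
// readings it lists. Any other code yields no substitute, leaving the
// caller to handle the error as before.
std::optional<long> substituteForError(int rc)
{
    switch (rc)
    {
        case EAGAIN:
            return 0x00; // "unavailable sensor"
        case EREMOTEIO:
            return 0xFF; // "failed sensor"
        default:
            return std::nullopt;
    }
}
```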
@msbarth

msbarth commented Mar 23, 2018

  • Define sensor removal return codes within hwmon device conf file
  • Store defined sensor return codes for dbus removal
  • Remove sensors when given return codes are received
  • Re-add sensors when return code is no longer present (keep attempting to read sensor)
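The four steps above could fit together roughly as follows. This is only a sketch: `SensorMonitor` is a hypothetical class, an in-memory set stands in for the real D-Bus object tree, and treating a successful read as the signal to re-add is an assumption about what "return code is no longer present" means.

```cpp
#include <cerrno>
#include <map>
#include <set>
#include <string>
#include <utility>

// Tracks which sensors are currently exposed, given per-sensor removal
// return codes (for example, as loaded from the hwmon device conf file).
class SensorMonitor
{
  public:
    explicit SensorMonitor(std::map<std::string, int> removeRcs) :
        removeRcs_(std::move(removeRcs))
    {
    }

    // Process the result of one read attempt (rc == 0 means success).
    // Returns true if the sensor is on the bus after this update.
    bool update(const std::string& sensor, int rc)
    {
        auto it = removeRcs_.find(sensor);
        const bool removalCode = it != removeRcs_.end() && rc == it->second;
        if (removalCode)
        {
            onBus_.erase(sensor); // real code would remove the D-Bus object
        }
        else if (rc == 0)
        {
            onBus_.insert(sensor); // real code would (re)create the object
        }
        // Any other error leaves the current state untouched in this sketch.
        return onBus_.count(sensor) > 0;
    }

  private:
    std::map<std::string, int> removeRcs_;
    std::set<std::string> onBus_;
};
```

The monitor keeps calling `update` for removed sensors on every poll, which is what lets them reappear once reads start succeeding again.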

@rfrandse

https://gerrit.openbmc-project.xyz/9825 Re-add removed sensors during monitoring
Resolves #2327 phosphor-hwmon: remove sensors that fail with EAGAIN from bus

foxconn-bmc-ks pushed a commit to foxconn-bmc-ks/phosphor-hwmon that referenced this issue Apr 13, 2018
This is a temporary fix until the following issues are completed:
    openbmc/openbmc#2327
    openbmc/openbmc#2329

When an EAGAIN or an EREMOTEIO return code is received by hwmon
from the OCC driver in the 4.13 kernel, they should be translated to
an unavailable sensor(0x00) and failed sensor(0xFF) scaled values
respectively. This will keep the OCC hwmon instance running and allow
applications to continue using these sensors as they were reported under
the mainline openbmc/linux 4.10 kernel.

Tested:
    Verified return codes are caught and sensor value modified

Change-Id: Ie61859863e7d88878caa942e5f5b062acabe67aa
Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
Signed-off-by: Doyle Huang <doyle.sy.huang@mail.foxconn.com>