1030.10.ips: "Input power was lost" appeared probabilistically after AC cycle #278

Closed
lxwinspur opened this issue Mar 23, 2023 · 4 comments

Comments

@lxwinspur
Contributor

Pre-condition:

  1. The server power cable is connected to a network power controller so that AC can be switched on and off over the network.
  2. Create an LPAR and install an OS.
  3. Enable the option “Automatically start when the managed system is powered on” in the LPAR profile.

AC Cycle steps:

  1. A script runs on a client to monitor and control the server power status.
  2. Power on the server, wait 6 minutes, then power it off with the command “obmcutil poweroff” in the BMC console.
  3. When the script detects that the host is powered off, it sends a command to the network power controller to switch AC off.
  4. After 30 seconds, it sends a command to switch AC on.
  5. Wait 3 minutes for the BMC to be ready, then run “obmcutil poweron” in the BMC console to power on the host.
  6. The server boots to runtime, the LPAR boots to the OS, and the server is then powered off again.
  7. Repeat steps 2-6.
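
For reference, the loop above can be sketched roughly as follows. This is not the actual test script, which is not included in this issue. The callables (`host_power_on`, `host_power_off`, `host_is_off`, `ac_off`, `ac_on`) are hypothetical hooks standing in for the BMC console commands and the network power controller. The `poll_s` interval and the optional `settle_s` delay are assumptions; only the 6 minute, 30 second, and 3 minute waits come from the steps.

```python
import time
from typing import Callable


def run_ac_cycles(
    cycles: int,
    host_power_on: Callable[[], None],   # hypothetical: runs "obmcutil poweron"
    host_power_off: Callable[[], None],  # hypothetical: runs "obmcutil poweroff"
    host_is_off: Callable[[], bool],     # hypothetical: True once the host reports off
    ac_off: Callable[[], None],          # hypothetical: power controller AC off
    ac_on: Callable[[], None],           # hypothetical: power controller AC on
    sleep: Callable[[float], None] = time.sleep,
    runtime_s: float = 360,    # step 2: wait 6 minutes before powering off
    ac_off_s: float = 30,      # step 4: AC stays off for 30 seconds
    bmc_ready_s: float = 180,  # step 5: wait 3 minutes for the BMC
    poll_s: float = 5,         # assumption: how often to poll the host state
    settle_s: float = 0,       # assumption: optional extra delay before AC off
) -> None:
    """Run the AC cycle sequence `cycles` times."""
    for _ in range(cycles):
        host_power_on()
        sleep(runtime_s)
        host_power_off()
        # Step 3: only cut AC after the host is reported as powered off.
        while not host_is_off():
            sleep(poll_s)
        if settle_s > 0:
            sleep(settle_s)
        ac_off()
        sleep(ac_off_s)
        ac_on()
        sleep(bmc_ready_s)
```

Passing the hardware actions in as callables keeps the sequencing testable without a real server; `settle_s` exists only so a delay between "host off" and "AC off" can be experimented with.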

Event log:

{
"Private Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2700",
    "Created at":               "03/23/2023 02:23:11",
    "Committed at":             "03/23/2023 02:23:12",
    "Creator Subsystem":        "BMC",
    "CSSVER":                   "",
    "Platform Log Id":          "0x50001AB7",
    "Entry Id":                 "0x50001AB7",
    "BMC Event Log Id":         "2619"
},
"User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Log Committed by":         "0x2000",
    "Subsystem":                "Power/Cooling",
    "Event Scope":              "Entire Platform",
    "Event Severity":           "Critical Error, Scope of Failure unknown",
    "Event Type":               "Not Applicable",
    "Action Flags": [
                                "Service Action Required",
                                "Report Externally"
    ],
    "Host Transmission":        "Acked",
    "HMC Transmission":         "Acked"
},
"Primary SRC": {
    "Section Version":          "1",
    "Sub-section type":         "1",
    "Created by":               "0x2700",
    "SRC Version":              "0x02",
    "SRC Format":               "0x55",
    "Virtual Progress SRC":     "False",
    "I5/OS Service Event Bit":  "False",
    "Hypervisor Dump Initiated":"False",
    "Backplane CCIN":           "2E2F",
    "Terminate FW Error":       "False",
    "Deconfigured":             "False",
    "Guarded":                  "False",
    "Error Details": {
        "Message":              "Input power was lost while the system was powered on."
    },
    "Valid Word Count":         "0x09",
    "Reference Code":           "110000AC",
    "Hex Word 2":               "00080055",
    "Hex Word 3":               "2E2F0010",
    "Hex Word 4":               "00000000",
    "Hex Word 5":               "00000000",
    "Hex Word 6":               "00000000",
    "Hex Word 7":               "00000000",
    "Hex Word 8":               "00000000",
    "Hex Word 9":               "00000000",
    "Callout Section": {
        "Callout Count":        "1",
        "Callouts": [{
            "FRU Type":         "Symbolic FRU",
            "Priority":         "Mandatory, replace all with this type as a unit",
            "Part Number":      "ACMODUL"
        }]
    }
},
"Extended User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Reporting Machine Type":   "9105-42A",
    "Reporting Serial Number":  "783C4E1",
    "FW Released Ver":          "PL1030_045",
    "FW SubSys Version":        "fw1030.10-17.4",
    "Common Ref Time":          "00/00/0000 00:00:00",
    "Symptom Id Len":           "20",
    "Symptom Id":               "110000AC_2E2F0010"
},
"Failing MTMS": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Machine Type Model":       "9105-42A",
    "Serial Number":            "783C4E1"
},
"User Data 0": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "BMCLoad": "3.41 0.94 0.32",
    "BMCState": "NotReady",
    "BMCUptime": "0y 0d 0h 1m 10s",
    "BootState": "",
    "ChassisState": "",
    "FW Version ID": "fw1030.10-17.4-ips-1030.2307.20230307i-prod (PL1030_045)",
    "HostState": "",
    "Process Name": "/usr/bin/phosphor-chassis-state-manager",
    "System IM": "50001000"
},
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "_PID": "1805"
}
}
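
When many of these logs are collected across cycles, it can help to pull out just the fields that matter. The sketch below assumes the log is available as a JSON string with the same section and key names as above; `summarize_pel` is a hypothetical helper name, and the sample is a trimmed copy of this log. In this log the `BMCUptime` of 1m 10s and `BMCState` of `NotReady` suggest the entry was created shortly after the BMC came back up, rather than at the moment AC was removed.

```python
import json


def summarize_pel(pel_json: str) -> dict:
    """Extract a few fields from an event log shaped like the one above."""
    pel = json.loads(pel_json)
    src = pel["Primary SRC"]
    user = pel.get("User Data 0", {})  # may be absent in other logs
    return {
        "reference_code": src["Reference Code"],
        "message": src["Error Details"]["Message"],
        "created_at": pel["Private Header"]["Created at"],
        "bmc_state": user.get("BMCState"),
        "bmc_uptime": user.get("BMCUptime"),
        "process": user.get("Process Name"),
    }


# Trimmed copy of the log in this issue, for illustration only.
SAMPLE = """
{
  "Private Header": {"Created at": "03/23/2023 02:23:11"},
  "Primary SRC": {
    "Error Details": {
      "Message": "Input power was lost while the system was powered on."
    },
    "Reference Code": "110000AC"
  },
  "User Data 0": {
    "BMCState": "NotReady",
    "BMCUptime": "0y 0d 0h 1m 10s",
    "Process Name": "/usr/bin/phosphor-chassis-state-manager"
  }
}
"""

if __name__ == "__main__":
    for key, value in summarize_pel(SAMPLE).items():
        print(f"{key}: {value}")
```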
@lxwinspur
Contributor Author

lxwinspur commented Mar 23, 2023

I suspect this is caused by the monitoring of the PSU's AC fault or PGood fault signal after power on.
https://github.com/ibm-openbmc/phosphor-power/blob/1050/phosphor-power-supply/psu_manager.cpp#L812-L835

@spinler @mzipse
FYI

@spinler
Contributor

spinler commented Mar 23, 2023

When script detects the host is powered off, send command to the network power controller to do AC off.

If I had to guess, your script is cutting AC before the chassis is actually off, or at least before chassis state manager gets a chance to persist the new chassis power state. That code is all in phosphor-state-manager/chassis_state_manager.cpp.
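
One way a test script could guard against this is to wait for the chassis power state, not just the host state, before cutting AC. The sketch below is illustrative only. It assumes the script can read the `CurrentPowerState` property with `busctl get-property`, which prints string properties as `s "<value>"`. The `read_state` callable is a hypothetical hook that returns that raw output; the service name and object path in the comment are assumptions that may differ between firmware levels.

```python
import re
import time
from typing import Callable

# Value of CurrentPowerState when the chassis is off.
CHASSIS_OFF = "xyz.openbmc_project.State.Chassis.PowerState.Off"

# Assumed command behind read_state (verify service and path on the target BMC):
#   busctl get-property xyz.openbmc_project.State.Chassis \
#       /xyz/openbmc_project/state/chassis0 \
#       xyz.openbmc_project.State.Chassis CurrentPowerState


def parse_busctl_string(output: str) -> str:
    """Parse busctl output for a string property, e.g. 's "value"'."""
    match = re.fullmatch(r's\s+"([^"]*)"', output.strip())
    if match is None:
        raise ValueError(f"unexpected busctl output: {output!r}")
    return match.group(1)


def wait_for_chassis_off(
    read_state: Callable[[], str],  # hypothetical: returns raw busctl output
    timeout_s: float = 120,         # assumption: give up after two minutes
    poll_s: float = 2,              # assumption: polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the chassis reports Off, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if parse_busctl_string(read_state()) == CHASSIS_OFF:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Even after the property reads Off, a short extra delay before AC off may still be prudent, since the point above is about the state manager having time to persist the new state.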

@JerryInspur

When script detects the host is powered off, send command to the network power controller to do AC off.

If I had to guess, your script is cutting AC before the chassis is actually off, or at least before chassis state manager gets a chance to persist the new chassis power state. That code is all in phosphor-state-manager/chassis_state_manager.cpp.

Hi
We tried adding a 20-second delay before AC off, and the error is no longer seen.
Why did this only happen on the S1024 with 4 PSUs, but not on the S1022 with 2 PSUs?

@spinler
Contributor

spinler commented Mar 28, 2023

@JerryInspur If you wanted, you could watch that LastStateChangeTime property to see what was going on.
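
A minimal sketch of how that property could be read from a script, under some assumptions: the value is fetched with `busctl get-property`, which prints unsigned 64-bit properties as `t <number>`, and the number is taken to be milliseconds since the Unix epoch (worth verifying against the interface definition for the firmware level in use). The function name and the sample value are illustrative, not taken from this system.

```python
import re
from datetime import datetime, timezone


def parse_last_state_change(output: str) -> datetime:
    """Convert busctl output such as 't 1679529600000' to a UTC datetime.

    Assumes the property holds milliseconds since the Unix epoch.
    """
    match = re.fullmatch(r"t\s+(\d+)", output.strip())
    if match is None:
        raise ValueError(f"unexpected busctl output: {output!r}")
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Made-up sample value for illustration.
    print(parse_last_state_change("t 1679529600000").isoformat())
```

Comparing this timestamp with the time at which the script cuts AC would show whether the chassis state change landed before or after the power was removed.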

@spinler spinler closed this as completed Mar 28, 2023
lxwinspur pushed a commit to lxwinspur/openbmc that referenced this issue Jun 12, 2023
George Liu (3):
  Implement LocationIndicatorActive for CPU resource
  Implement SubProcessors for processor core
  Implement LocationIndicatorActive for Memory resource

Ramesh Iyyar (19):
  redfish-core: Processor: Workaround to handle DCM
  redfish-core: LogServices: Added HardwareIsolation service
  LogServices: HardwareIsolation: Get LogEntryCollection
  LogServices: HardwareIsolation: Get LogEntry
  LogServices: HardwareIsolation: Delete LogEntry
  LogServices: HardwareIsolation: Post ClearLog
  redfish-core: Processor: Fixed the processor object search (ibm-openbmc#168)
  Enabled deconfiguration reason support to the DIMM and Core and Few fixes (ibm-openbmc#171)
  HW-Isolation: Fix, Use GetAncestors to get the parents id (ibm-openbmc#235)
  HW-Isolation: Fix, Don't throw internal error if failed to get error log (ibm-openbmc#245)
  HW-Isolation: Fix, Update State if the Core and DIMM are recovered (ibm-openbmc#288)
  registry: Add PropertyValueResourceConflict registry
  HW-Isolation: Return an appropriate error if the request is failed.
  HW-Isolation: Fill OriginOfCondition for the TPM and Motherboard (ibm-openbmc#278)
  HW-Isolation: Return ResourceCannotBeDeleted error (ibm-openbmc#297)
  HW-Isolation: Fix, Don't handle the Unavailable D-Bus error
  registry: Add PropertyValueExternalConflict registry
  LogService: HW-Isolation: Return an appropriate error
  LogEntry: HW-Isolation: Removed the Resolved property (#341)

PriyangaRamasamy (1):
  1050:Pull lamptest related commits to 1050 (#572)

Asmitha Karunanithi (1):
  Support ipv6 on hypervisor ethernet interface (#569)

Reed Frandsen (1):
  Merge pull request #542 from deepakala-k/sync_1030_commits

deepakala-k (6):
  redfish-core: Core: Enabled the isolation (aka guard) feature
  redfish-core: Memory: Enabled the isolation (aka guard) feature
  Core: Fix, Patch a core into the respective parent processor (ibm-openbmc#261)
  Add missing odata.id field under OriginOfCondition for assemblies (#494)
  clang-format ran
  Fix Errors found during CI

Willy Tu (1):
  util: Add pretty name for resources

Change-Id: I4ff543ef70c2ef32e86fa892d979a6dbf2a57efd