proposal - store last reset reason / error code / debugging info on reset, send to cloud on reconnect? #403

Closed
dmiddlecamp opened this issue Mar 9, 2015 · 13 comments

Comments

@dmiddlecamp
Member

commented Mar 9, 2015

It'd be really interesting to know why a device reset, or about where in the firmware things crashed or went wrong. How hard would it be to catch this, and make it available to users?

@andyw-lala

Contributor

commented Mar 9, 2015

This should be 100% doable. This feature is long overdue, please do not let the perfect get in the way of the possible. There are (at least) two broad categories of syndrome you need to track:

1 - hardware (was it a reset, a brownout, a watchdog, a power-on, etc.). That info should be readable upon startup by the bootloader. Note: we may need to hook into the .init section to get invoked early, before registers are clobbered, but it should be doable.

2 - software - was it an OOM, a panic/assert, etc., and if so, what info is available to pinpoint it? How is that info passed across the reset/bootloader/payload barriers?

My advice: build a flexible and reliable framework, populate it with the low-hanging fruit, and then work to expand the coverage.

+1 for contemplating this.

@dmiddlecamp

Member Author

commented Mar 9, 2015

👍 thanks!

@m-mcgowan

Contributor

commented Mar 9, 2015

The framework for storing and reporting a syndrome for reset is fairly straightforward.

  • store a value in persistent store which is the reset syndrome code
  • post that value to the cloud as an event after handshake
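A minimal host-side sketch of those two steps, under stated assumptions: the persistent store is modeled as an ordinary static struct (on a device it would have to live in memory that survives a reset), and the names `syndrome_store`, `syndrome_take`, `syndrome_format_event`, the magic value, and the payload text are all hypothetical, not existing firmware APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical marker showing that the slot holds a valid code. */
#define SYNDROME_MAGIC 0x5EEDC0DEu

/* Stand-in for the persistent store. Here it is a plain static so the
 * sketch runs on a host; on a device it would need to survive a reset. */
typedef struct {
    uint32_t magic;
    uint32_t code;
} syndrome_slot_t;

static syndrome_slot_t g_slot;

/* Step 1: record the reset syndrome code before the reset happens. */
void syndrome_store(uint32_t code)
{
    g_slot.code = code;
    g_slot.magic = SYNDROME_MAGIC;  /* mark the slot as valid */
}

/* Step 2a: after boot, fetch the code once and invalidate the slot.
 * Returns 1 and fills *code if a syndrome was pending, 0 otherwise. */
int syndrome_take(uint32_t *code)
{
    assert(code != NULL);
    if (g_slot.magic != SYNDROME_MAGIC) {
        return 0;
    }
    *code = g_slot.code;
    g_slot.magic = 0;  /* report each syndrome only once */
    return 1;
}

/* Step 2b: build the text that would be posted as an event after the
 * handshake. Returns the number of characters written, or -1 if the
 * buffer is too small. */
int syndrome_format_event(uint32_t code, char *buf, size_t len)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "reset_syndrome=%lu", (unsigned long)code);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Taking the code and invalidating the slot in one step keeps a stale syndrome from being reported again after the next, unrelated reset.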

The trickier part is the detection:

  • for software-based issues like OOM, panic, we have code that is run before the reset so it's simply a matter of writing an indicator value to persistent store (assuming we can do that without triggering another OOM error...)
  • hardware issues are trickier and will require some form of detection - how to distinguish regular reset from a brownout and the other types? @andyw-lala / @satishgn do you have any thoughts on how reset causes can be distinguished? It's no problem to inject code that executes immediately on startup so we can examine startup registers, should they contain info we can use to determine the cause of reset.

The biggest blocker to this is that it requires changes to the bootloader. While upgrading the bootloader is possible, it does carry a non-zero risk of bricking the device. For the Photon, we are producing regular release candidates for manufacturing now. The MVP solution would involve saving all system state at startup to memory (in the bootloader) so that it can later be analyzed in system code, which we can more easily and reliably update with new forms of reset analysis. The code to save state in the bootloader should be fairly trivial, and so quick to implement in time for the next Photon release candidate.

@andyw-lala

Contributor

commented Mar 9, 2015

So, let's make sure the bootloader for photon and electron support this, and the firmware supports this in a backwards compatible manner so that new core firmware with old core bootloader functions cleanly (like reporting non-capable bootloader status.) If we provide a working bootloader for the core, but do not make the deployment mandatory, I feel that covers all the bases, and provides value moving forward.
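One way the backwards-compatible handshake between bootloader and firmware could look, as a rough sketch only. The `boot_info_t` layout, the magic value, the version number, and all function names are assumptions made for illustration; nothing here reflects an existing bootloader interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_INFO_MAGIC   0xB007CAFEu  /* hypothetical marker */
#define BOOT_INFO_VERSION 1u           /* hypothetical layout version */

/* Block a capable bootloader would fill in before starting the firmware. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t reset_flags;  /* raw hardware syndrome captured at startup */
} boot_info_t;

typedef enum {
    BOOT_INFO_OK,
    BOOT_INFO_NOT_CAPABLE,    /* old bootloader: block was never written */
    BOOT_INFO_UNKNOWN_LAYOUT  /* bootloader newer than this firmware */
} boot_info_status_t;

/* Bootloader side: save the raw state, then stamp magic and version. */
void bootloader_save_info(boot_info_t *info, uint32_t reset_flags)
{
    assert(info != NULL);
    info->reset_flags = reset_flags;
    info->version = BOOT_INFO_VERSION;
    info->magic = BOOT_INFO_MAGIC;
}

/* Firmware side: only trust the block if the marker and version match,
 * otherwise report a status instead of guessing. */
boot_info_status_t firmware_read_info(const boot_info_t *info,
                                      uint32_t *reset_flags)
{
    assert(info != NULL && reset_flags != NULL);
    if (info->magic != BOOT_INFO_MAGIC) {
        return BOOT_INFO_NOT_CAPABLE;
    }
    if (info->version != BOOT_INFO_VERSION) {
        return BOOT_INFO_UNKNOWN_LAYOUT;
    }
    *reset_flags = info->reset_flags;
    return BOOT_INFO_OK;
}
```

A magic value lowers the chance of mistaking leftover memory contents for a valid block but does not rule it out; a checksum over the block would be a stronger check.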

I will dig up the hardware syndrome details, and provide references here.

@satishgn

Contributor

commented Mar 9, 2015

It's easier to catch hardware resets via the Reset and Clock Control (RCC) registers of the STM32.
We already do that in the bootloader to detect watchdog resets if IWDG is enabled in system firmware. Apart from the watchdog, we can also detect other types of reset, such as BOR, RST pin, POR/PDR, software reset (i.e. NVIC_SystemReset), and low-power reset, via the following std periph API: RCC_GetFlagStatus().

/**
  * @brief  Checks whether the specified RCC flag is set or not.
  * @param  RCC_FLAG: specifies the flag to check.
  *          This parameter can be one of the following values:
  *            @arg RCC_FLAG_HSIRDY: HSI oscillator clock ready
  *            @arg RCC_FLAG_HSERDY: HSE oscillator clock ready
  *            @arg RCC_FLAG_PLLRDY: main PLL clock ready
  *            @arg RCC_FLAG_PLLI2SRDY: PLLI2S clock ready
  *            @arg RCC_FLAG_LSERDY: LSE oscillator clock ready
  *            @arg RCC_FLAG_LSIRDY: LSI oscillator clock ready
  *            @arg RCC_FLAG_BORRST: POR/PDR or BOR reset
  *            @arg RCC_FLAG_PINRST: Pin reset
  *            @arg RCC_FLAG_PORRST: POR/PDR reset
  *            @arg RCC_FLAG_SFTRST: Software reset
  *            @arg RCC_FLAG_IWDGRST: Independent Watchdog reset
  *            @arg RCC_FLAG_WWDGRST: Window Watchdog reset
  *            @arg RCC_FLAG_LPWRRST: Low Power reset
  * @retval The new state of RCC_FLAG (SET or RESET).
  */
FlagStatus RCC_GetFlagStatus(uint8_t RCC_FLAG)

I would suggest not coding the hardware reset detection into the bootloader, but into the setup/config code of the main firmware (system module), to keep it flexible so it can be changed as required via OTA update.
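As an illustration of what that firmware-side detection might look like, here is a host-runnable sketch that decodes a raw RCC_CSR value into a single cause. The bit positions are assumed from the STM32F2xx reference manual and should be checked against the manual for the actual part; the enum and function names are hypothetical. Since more than one flag can be set at once (the list above describes RCC_FLAG_BORRST as "POR/PDR or BOR reset"), the more specific causes are tested first.

```c
#include <assert.h>
#include <stdint.h>

/* Reset flag bits of RCC_CSR. Assumed from the STM32F2xx reference
 * manual; verify against the manual for the actual part. */
#define CSR_BORRSTF  ((uint32_t)1 << 25)
#define CSR_PINRSTF  ((uint32_t)1 << 26)
#define CSR_PORRSTF  ((uint32_t)1 << 27)
#define CSR_SFTRSTF  ((uint32_t)1 << 28)
#define CSR_IWDGRSTF ((uint32_t)1 << 29)
#define CSR_WWDGRSTF ((uint32_t)1 << 30)
#define CSR_LPWRRSTF ((uint32_t)1 << 31)

typedef enum {
    RESET_CAUSE_UNKNOWN = 0,
    RESET_CAUSE_PIN,
    RESET_CAUSE_BROWNOUT,
    RESET_CAUSE_POWER_ON,
    RESET_CAUSE_SOFTWARE,
    RESET_CAUSE_IWDG,
    RESET_CAUSE_WWDG,
    RESET_CAUSE_LOW_POWER
} reset_cause_t;

/* Pick one cause from the flags, most specific first. The pin flag is
 * tested last so it only wins when no other flag is set. */
reset_cause_t reset_cause_from_csr(uint32_t csr)
{
    if (csr & CSR_LPWRRSTF) return RESET_CAUSE_LOW_POWER;
    if (csr & CSR_WWDGRSTF) return RESET_CAUSE_WWDG;
    if (csr & CSR_IWDGRSTF) return RESET_CAUSE_IWDG;
    if (csr & CSR_SFTRSTF)  return RESET_CAUSE_SOFTWARE;
    if (csr & CSR_PORRSTF)  return RESET_CAUSE_POWER_ON;
    if (csr & CSR_BORRSTF)  return RESET_CAUSE_BROWNOUT;
    if (csr & CSR_PINRSTF)  return RESET_CAUSE_PIN;
    return RESET_CAUSE_UNKNOWN;
}

/* Short label suitable for a log line or event payload. */
const char *reset_cause_name(reset_cause_t cause)
{
    assert((int)cause >= 0 && cause <= RESET_CAUSE_LOW_POWER);
    switch (cause) {
    case RESET_CAUSE_PIN:       return "pin";
    case RESET_CAUSE_BROWNOUT:  return "brownout";
    case RESET_CAUSE_POWER_ON:  return "power_on";
    case RESET_CAUSE_SOFTWARE:  return "software";
    case RESET_CAUSE_IWDG:      return "iwdg";
    case RESET_CAUSE_WWDG:      return "wwdg";
    case RESET_CAUSE_LOW_POWER: return "low_power";
    default:                    return "unknown";
    }
}
```

On the device, the input would come from the RCC_CSR register (or from individual RCC_GetFlagStatus() calls) rather than from a plain integer.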

@andyw-lala

Contributor

commented Mar 9, 2015

Experience has taught me that the best thing to do with the hardware syndrome is to capture it as early as possible. We should strive to do it in the bootloader. If that is not done, then the bootloader specs/comments need to be updated to declare that these hardware bits are expected to be left untouched. We do not want some subsequent clock init code to have the side effect of clearing them.

Out of interest, what is the disposition of the watchdog reset bit? If the bootloader is inspecting it, does it also clear it? If so, how would the application gather this info?

I'll not bother researching the hw reference manual section, since Satish has already identified the relevant library call(s).

@satishgn

Contributor

commented Mar 9, 2015

Currently, the bootloader clears the RCC flag only when a watchdog reset occurs. It does so based on a SystemHealth backup RAM variable, which the main firmware sets to a defined system status before resetting. On reset, the bootloader then takes the necessary action, such as reverting to the backup firmware, performing a factory reset, or, in the worst case, entering DFU mode (blinking yellow LED).

So for this issue, the best thing to do now in the Photon's bootloader is to remove the clearing of flags (RCC_ClearFlag) when an IWDG reset occurs, and to clear them in the main firmware after sending an error message to the cloud. We can also achieve this without touching the bootloader code that is already deployed in the field: if the main firmware does not set the backup RAM variable, the flag clearing is deferred to the main system firmware.

One thing to note for now on the Photon is that the watchdog is not enabled by default, in order to support true sleep/standby mode operation. It is now the responsibility of the user app to turn on IWDG to recover from crashed firmware.
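The "report first, clear afterwards" ordering could be sketched like this on a host. The register is simulated with a plain variable, `sim_clear_flags` stands in for RCC_ClearFlag(), the flag mask assumes the reset flags occupy bits 25..31 of RCC_CSR (as on the STM32F2xx, to be verified), and the publish callbacks are hypothetical stand-ins for sending the message to the cloud.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the reset flags in RCC_CSR (bits 25..31). */
#define CSR_RESET_FLAGS_MASK 0xFE000000u

/* Hypothetical publish hook: returns 1 if the cloud accepted the report. */
typedef int (*publish_fn)(uint32_t flags);

/* Last value handed to publish_ok, kept for inspection in this sketch. */
uint32_t g_last_published;

/* Stand-in for a successful publish. */
int publish_ok(uint32_t flags)
{
    g_last_published = flags;
    return 1;
}

/* Stand-in for a failed publish (e.g. no connection yet). */
int publish_fail(uint32_t flags)
{
    (void)flags;
    return 0;
}

/* Host stand-in for RCC_ClearFlag(): drops only the reset flags. */
void sim_clear_flags(uint32_t *csr)
{
    *csr &= ~CSR_RESET_FLAGS_MASK;
}

/* Report the pending reset flags and clear them only once the report
 * went out. Returns 1 if reported and cleared, 0 if nothing was
 * pending, -1 if the publish failed and the flags were kept. */
int report_then_clear(uint32_t *csr, publish_fn publish)
{
    assert(csr != NULL && publish != NULL);
    uint32_t flags = *csr & CSR_RESET_FLAGS_MASK;
    if (flags == 0) {
        return 0;
    }
    if (!publish(flags)) {
        return -1;  /* keep the flags so a later attempt can retry */
    }
    sim_clear_flags(csr);
    return 1;
}
```

One consequence of this ordering: while the flags stay uncleared, a further reset adds its own flag on top, so the firmware may end up reporting several causes at once.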

@andyw-lala

Contributor

commented Mar 9, 2015

Hmm, well I'd still recommend trying to get the photon/electron bootloaders to read and cache the hardware syndrome as early as possible, while leaving the core bootloader as-is and designing a system around that behaviour.

@randomite

commented Apr 7, 2015

Will this proposal be included in the recently announced Spark Dashboard?

@m-mcgowan

Contributor

commented Apr 8, 2015

Yes certainly, error logging is a planned feature of the dashboard.

@pomplesiegel

commented Jun 30, 2015

This is huge! Currently we have no idea why a core restarted w/o watching the thing from above

@m-mcgowan added this to the 0.6.x milestone Mar 14, 2016

@sergeuz

Member

commented Apr 1, 2016

I see that the bootloader already preserves the RCC_CSR register, so its value can safely be checked by other modules. What needs to be added is a reason code for software resets (e.g. by introducing HAL_Core_System_Reset_Ex()) that is saved to some persistent storage (DCT?).

What is the best way to publish reset reason data to the cloud? Ideally, we need to publish a timestamp (UNIX time or milliseconds since startup) along with the reason code.
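A rough sketch of the software side, assuming a record of reason code plus uptime. Because the signature of the proposed HAL_Core_System_Reset_Ex() is not settled here, the sketch uses a hypothetical `reset_with_reason()` instead; the reason codes, the record layout, and the JSON-style payload are likewise assumptions, and the static record stands in for persistent storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reason codes for software-initiated resets. */
enum {
    RESET_REASON_NONE  = 0,
    RESET_REASON_PANIC = 1,
    RESET_REASON_OOM   = 2,
    RESET_REASON_USER  = 3
};

/* Record that would be kept in persistent storage across the reset. */
typedef struct {
    uint32_t reason;
    uint32_t uptime_ms;  /* milliseconds since startup when the reset was requested */
} reset_record_t;

/* Plain static as a host stand-in for persistent storage. */
static reset_record_t g_record;

/* Save the reason and timestamp, then reset. The reset itself is left
 * out so the sketch runs on a host; on a device, the actual reset call
 * (e.g. NVIC_SystemReset()) would follow the two stores. */
void reset_with_reason(uint32_t reason, uint32_t uptime_ms)
{
    g_record.reason = reason;
    g_record.uptime_ms = uptime_ms;
}

/* Access the saved record after the reboot. */
const reset_record_t *reset_record_get(void)
{
    return &g_record;
}

/* Format the record as a small payload for publishing. Returns the
 * number of characters written, or -1 if the buffer is too small. */
int reset_event_format(const reset_record_t *rec, char *buf, size_t len)
{
    assert(rec != NULL && buf != NULL);
    int n = snprintf(buf, len, "{\"reason\":%lu,\"uptime_ms\":%lu}",
                     (unsigned long)rec->reason,
                     (unsigned long)rec->uptime_ms);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Uptime is used here because it is available even when the device has never synced its clock; a UNIX timestamp could be added as a further field when one is known.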

@m-mcgowan

Contributor

commented Apr 28, 2016

Does anyone know of any standards for describing device reset syndromes? We have the codes defined, but it would be nice to align these with any existing standards.

8 participants