Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System crash when button interrupt occurs during i2c transmission #709

Closed
pomplesiegel opened this issue Oct 22, 2015 · 15 comments

Comments

@pomplesiegel
Copy link

commented Oct 22, 2015

Using system FW 0.4.7 (which now includes changes preventing task switching during i2c transmissions) if a Photon's "reset" or "setup" button is pressed during an i2c transmission, the system hangs indefinitely (totally unresponsive), as shown by the the static RGB LEDs.

We quickly ran across this issue using our full-featured application, as our product uses an i2c port expander to track real-time state of a front panel. This should be easy to reproduce on any program actively using the i2c bus.

image

img_0208

As i mentioned in #698

My only question is what would prevent the occasional system instability I saw on Photon when adding os_thread_scheduling(false, NULL); to the Wire methods. Is there a less restrictive call which won't cause crashes/delays to the system FW when pressing the reset button on Photon? I'd imagine that issue would show up elsewhere, for other interrupts?

Because locally instantiating a CriticalSection object (done before each i2c call) simply sets and unsets os_thread_scheduling(true, NULL) upon its creation and destruction, it makes sense this issue has not gone away.

I'd imagine this will happen for other Photon pins registered to interrupts as well.

Easiest way to reproduce

  1. Attach a Photon to a MCP23017 port expander (at address 0x20) over i2c w/ pull-ups
  2. Load the code below on a Photon running system FW 0.4.7
  3. While the system is running, quickly tap the "setup" button. The goal here is to trigger a button interrupt while task switching is currently disabled by an active CriticalSection object on the stack. This should only take a few seconds.

Code to reproduce

#include "application.h"
#include "Adafruit_MCP23017.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

Adafruit_MCP23017 myPE; 

void setup()
{
  Serial.begin(9600);

  WiFi.on();
  WiFi.connect();
  Particle.connect();

  Serial.println("Example program w/ real i2c traffic");

  myPE.begin();
  myPE.pinMode(0,OUTPUT);
  myPE.pinMode(1,INPUT);
}

void loop()
{
  Particle.process(); 
  delayMicroseconds(10);

  //i2c transactions w/ our PE
  myPE.digitalWrite(0, myPE.digitalRead(1) );
}

For context

  • The task switching protections inserted by @m-mcgowan in 0.4.7 have proved incredibly effective at eliminating previously commonplace i2c transmission errors. Our error rate has dropped to 0%, which is great.
  • However, the last task is to make sure that even with a program which actively (and realistically) uses the i2c bus, no potential instability like this occurs when task switching is disabled.

Our proposal to Particle SQA

@jonlogan, For early highlighting of critical but easily reproducible issues like this, which occur only in a realistic customer application (aka. not Tinker), would you please consider building programs like the example above into your QA regression test regimen before releasing? Historically, a few real-time calls to Wire, SPI, Serial and Particle.process within Loop() are often all that's necessary to reproduce a serious bug within a new System FW release. I would be happy to provide boiler plate programs to exercise the Photon in the spirit of a realistic customer application following a new System FW release.

@towynlin

This comment has been minimized.

Copy link
Member

commented Oct 22, 2015

👍 Thanks so much @pomplesiegel for the great report! We will absolutely include this code as a test as we continue building out our automated end-to-end testing infrastructure.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Oct 22, 2015

@towynlin, Great, thanks!

@karlhenricksen

This comment has been minimized.

Copy link

commented Oct 26, 2015

@m-mcgowan , I've also seen this same behavior after a firmware update, at startup in general (after OTA reset, it hung on steady blue LED, assume that's the same), as well as during static state (leaving Photon running) - so this looks to be possible if any sort of interrupt occurs in that critical i2c transmission, where os_thread_scheduling(true, NULL) has been set.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Nov 1, 2015

FYI, this is proven to be an extremely frequent crash for us within our program, which uses i2c extensively. Especially at startup, the number of interrupts occurring while i2c is gearing up results in a high probability of crashing with a static blue LED. This can only be fixed by a reset or power cycle, so this is quite dangerous for an OTA update.
img_0221

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Nov 14, 2015

This system crash is now significantly affecting us, as the SPST push button of our rotary encoder is registered to an interrupt - very clean signal w/ a slow RC filter followed by a schmitt trigger.

Total system crash (unresponsive photon w/ blank LED) occurring randomly following a push button. To help debug, I reduced the body of the ISR to solely

void RotaryEncoderWithPush::pushButtonInterruptHandler()
{
  numTotalInterruptsOccurred++; 
}

setup code is:

  attachInterrupt(pushButtonInputPin, &RotaryEncoderWithPush::pushButtonInterruptHandler, this, CHANGE);

whether using multi-threading or not, the crash still occurs after a few clicks. Issue only occurs if an interrupt is registered to the pin (D4), irrespective of whether it's RISING/FALLING/CHANGE.

Any thoughts on how this issue can be resolved?

@m-mcgowan

This comment has been minimized.

Copy link
Contributor

commented Dec 2, 2015

I2C fixes have been made in develop. Please try that when you have a moment and let me know if it improves the situation.

@m-mcgowan m-mcgowan added this to the 0.4.8 milestone Dec 2, 2015

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Dec 3, 2015

Just compiled and ran the example code below against develop, flashing both the system and user fw.
Issue still reproducible after only a few seconds of "setup" button clicking when running the code below with a port expander attached.

This video shows the current behavior.

The issue seems to have gotten much worse following the i2c changes in develop. The crash is occurring more easily/quickly, with a few button presses, and the i2c data is being corrupted, causing flashing, even though the intended value remains static.

As shown in the video, the onboard LED has stopped breathing, but strangely i2c transactions are still taking place. This does not occur with 0.4.7. Note that before the crash occurs (and if i never hit the onboard "setup" button) that the onboard LED does not light up.
NOTE: a delay within loop of 10uS causes similar behavior.

@m-mcgowan, Could you please try this on actual hardware (just a photon + MCP23017 PE w/ pull-ups) and attempt to reproduce the issue? Otherwise, this issue will likely be quite difficult to debug.

#include "application.h"
#include "Adafruit_MCP23017.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

Adafruit_MCP23017 myPE; 

void setup()
{
  Serial.begin(9600);

  WiFi.on();
  WiFi.connect();
  Particle.connect();

  Serial.println("Example program w/ real i2c traffic");

  myPE.begin();
  myPE.pinMode(0,OUTPUT);
  myPE.pinMode(1,INPUT);
}

void loop()
{
  Particle.process(); 
  delayMicroseconds(1);

  //i2c transactions w/ our PE
  myPE.digitalWrite(0, myPE.digitalRead(1) );
}

@m-mcgowan m-mcgowan modified the milestones: 0.4.9, 0.4.8 Dec 14, 2015

@avtolstoy

This comment has been minimized.

Copy link
Member

commented Dec 24, 2015

@pomplesiegel, do you still have the same issue with the "develop" branch? I just checked today almost the same code with PCF8574 and PCA9554 (using multiple byte reads/writes) and couldn't reproduce it.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jan 4, 2016

@avtolstoy, thanks for the heads up! It appears you may be right.

@m-mcgowan, yesterday I compiled the example code above (plus the latest version of our app) again against develop. Thankfully, now I cannot reproduce this issue using either program! Did changes occur to i2c / multi-threading since you last instructed me to check out develop (Dec 2)?

@m-mcgowan

This comment has been minimized.

Copy link
Contributor

commented Jan 4, 2016

Nothing has changed afaik. But very happy to hear it works for you now, since that's what I was expecting! ;-)

@m-mcgowan

This comment has been minimized.

Copy link
Contributor

commented Jan 6, 2016

Hey @pomplesiegel I have changed the type of critical section so that task switching is disabled, but not interrupts, since interrupts are critical for proper I2C function. Would you mind retesting against develop to verify it works for your application. Thanks!

@mohitbhoite

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2016

I recreated the issue with a hardware setup as described. The system got hung on myPE.begin(0x02); (note the mcp chip address is different on my setup) when built with v0.4.7. I could not even get it to run the user loop. The Photon went completely dead with the RGB led OFF. The only way to recover was to DFU and reflash it.

With the latest develop branch, the system looks very stable.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jan 8, 2016

Hey @m-mcgowan, with the latest develop version, I am unable to reproduce this issue using both the MCP23017 test app above nor our app. Our QA team is giving this a more thorough test early next week as well. Thanks!

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jan 8, 2016

and @mohitbhoite, those are the exact symptoms I saw with 0.4.7, so that's what we're watching for here. Nice job. In the past, it was also reproducible at startup by pressing the "reset" button.

@m-mcgowan

This comment has been minimized.

Copy link
Contributor

commented Jan 18, 2016

Fix verified by end-to-end automated test suite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.