Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More i2c transaction errors following 0.5.1 release #1042

Closed
pomplesiegel opened this issue Jun 22, 2016 · 22 comments

Comments

@pomplesiegel
Copy link

commented Jun 22, 2016

We noticed an uptick in the amount of i2c transactional errors following the 0.5.1 release. After speaking with @technobly, it became clear that SINGLE_THREADED_BLOCK or CriticalSection are no longer protecting the actual i2c transaction calls, Wire.endTransmission() and Wire.requestFrom(). We believe these protections (and possibly even ATOMIC_BLOCK() to disable interrupts) are necessary for a full-featured multi-threaded program to behave normally over time.

Here is a test program which can be refined and added as a regression test, as i2c changes really should be thoroughly tested against real hardware before packaged and released.

Test program

This program keeps track of the number of i2c transactional errors which occur when constantly communicating with a MCP23017 port expander, while publishing irrelevant data. To check the number of errors, get the variable i2cErrors.

#include "application.h"
#include "Adafruit_MCP23017.h"

SYSTEM_MODE(SEMI_AUTOMATIC);
SYSTEM_THREAD(ENABLED);

Adafruit_MCP23017 myPE; 

static int peState = false; 

int numI2cErrorsSoFar = 0;

void PointlessPublish()
{
  static unsigned long millisOfLastPublish = 0;

  if( Particle.connected() )
  {
    if( ( millis() - millisOfLastPublish ) > random(260, 500) )
    {
      millisOfLastPublish = millis();
      Serial.println("Publishing bull....");
      Particle.publish("bull", " fjksdas " + String(millisOfLastPublish) + String(peState) );
    }
  }
}

Timer pointlessThread(2, PointlessPublish);

int Uptime( String unused )
{
  return millis() / 1000;
}

void setup()
{
  Serial.begin(9600);

  WiFi.on();
  WiFi.connect(); 
  Particle.connect();

  //i2c and PE stuff
  Wire.begin();

  myPE.begin(0);
  myPE.pinMode(0,OUTPUT);
  myPE.pinMode(1,INPUT);

  Particle.variable("i2cErrors", numI2cErrorsSoFar);

  Particle.function("uptime", Uptime);

  Serial.println("Example program with publishes + i2c");

  pointlessThread.start();
}


void loop()
{
  Particle.process();

  PointlessPublish();

  peState = myPE.digitalRead(1);
  myPE.digitalWrite(0,peState); 
}
Partially modified MCP23017 driver implementation for this test
/*************************************************** 
 This is a library for the MCP23017 i2c port expander

 These displays use I2C to communicate, 2 pins are required to
 interface
 Adafruit invests time and resources providing this open source code,
 please support Adafruit and open-source hardware by purchasing
 products from Adafruit!

 Written by Limor Fried/Ladyada for Adafruit Industries.
 BSD license, all text above must be included in any redistribution
 ****************************************************/


#include "Adafruit_MCP23017.h"

...

extern int numI2cErrorsSoFar; 

void ReportI2cError()
{
    numI2cErrorsSoFar++;
    Serial.println("numI2cErrorsSoFar = " + String(numI2cErrorsSoFar) );
}

...

/**
 * Reads a given register
 */
uint8_t Adafruit_MCP23017::readRegister(uint8_t addr){

    static int result = 0;
    // read the current GPINTEN
    Wire.beginTransmission(MCP23017_ADDRESS | i2caddr);
    wiresend(addr);

    SINGLE_THREADED_BLOCK()
    { 
        result = Wire.endTransmission();
    }

    if(result != 0)
    {
        Serial.println("readRegister endTransmission(): " + String(result) );
        ReportI2cError();
        Wire.reset(); 
    }

    SINGLE_THREADED_BLOCK()
    {
        result = Wire.requestFrom(MCP23017_ADDRESS | i2caddr, 1);
    }

    if(result == 0)
    {
        Serial.println("readRegister requestFrom() data: " + String(result) );
        ReportI2cError();
        Wire.reset(); 
    }

    return wirerecv();
}


/**
 * Writes a given register
 */
void Adafruit_MCP23017::writeRegister(uint8_t regAddr, uint8_t regValue){

    static int result = 0;

    // Write the register
    Wire.beginTransmission(MCP23017_ADDRESS | i2caddr);
    wiresend(regAddr);
    wiresend(regValue);

    SINGLE_THREADED_BLOCK()
    {
        result = Wire.endTransmission();
    }

    if(result != 0)
    {
        Serial.println("writeRegister endTransmission(): " + String(result) );
        ReportI2cError();
        Wire.reset(); 
    }
}

To try different levels of protection, comment out SINGLE_THREADED_BLOCK or replace it with ATOMIC_BLOCK.

Results for different levels of protection

  1. No protection (commenting out SINGLE_THREADED_BLOCK or ATOMIC_BLOCK): 7 errors after 30 minutes
  2. SINGLE_THREADED_BLOCK: 0 errors after 932 minutes
  3. ATOMIC_BLOCK: 0 errors after 960 minutes

Within a real program

We observed that while SINGLE_THREADED_BLOCK() may be sufficient for most applications in preventing i2c transactional errors, within our company's program (which uses i2c, SPI, Serial, WiFi, swTimers...) all together, it still resulted in occasional errors. However, when we changed to ATOMIC_BLOCK() these errors disappeared in all overnight tests on multiple units. The devices appear to be running happily, even with interrupts disabled during i2c transactions.


Completeness:

  • Minimum test case added
  • Device, system and user firmware versions stated
  • Particle confirmed
@technobly

This comment has been minimized.

Copy link
Member

commented Jun 22, 2016

Wow! Thanks for the great detail @pomplesiegel :)

We have confirmed that if a thread blocks context switching such that the 100ms timers in i2c_hal.c are exceeded before the events become true, the transaction will fail.

We had a bit of trouble with your example, I think possibly because of the excessive publishing rate (260-500ms). As a test, would you try replacing your PointlessPublish() routine with this one?

void i2cThreadBlocker() {
    uint32_t block = random(0, 300);
    {
        SINGLE_THREADED_SECTION();
        uint32_t startMillis = millis();
        while ((millis() - startMillis) < block);
    }
}

a simple fix would be altering the i2c_hal.c timer checks as follows:

if (EVENT_TIMEOUT < (HAL_Timer_Get_Micro_Seconds() - _micros) &&
  I2C_GetFlagStatus(i2cMap[i2c]->I2C_Peripheral, I2C_FLAG_BTF) == RESET) { // i2c sw reset }

so that as another thread is blocking the ability to check the flag, when it returns to the i2c_hal thread the timeout qualifies the flag again as well. The flag should be SET and the i2c peripheral will not be reset due to the timer being expired already.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 22, 2016

Hi @technobly, what was the trouble with the publishing example? It may exceed what is actually able to be published by the photon, but it exercises the user -> system FW. From this example, how would a thread be blocking for more than 100ms? Have you run this locally to see what's causing the error?

@technobly

This comment has been minimized.

Copy link
Member

commented Jun 22, 2016

Another team member was seeing cloud disconnections but it might have been unrelated. I have the MCP23017 setup with a photon now, but it's constantly erroring numI2cErrorsSoFar = 2053
readRegister requestFrom() data: 0
so it must be configured wrong. Would you share a schematic for your MCP23017 so I can make sure it's configured the same way?

Other places in system firmware that may disable interrupts could have a similar effect. These are the steps I see us taking: 1. reliably duplicate the error in a very measureable way. 2. fix the vulnerability in the i2c library. 3. ensure the test passes. 4. hunt down all of the other places in system firmware that may have similar issues.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 22, 2016

Hi @technobly - yes that sounds like it's not configured correctly.
The main thing is to include pull-up resistors (between 2.2k to 10k) up to 5V or 3.3 for SDA and SCLK. Secondarily, you should be sure that the port expander is addressed at address 0x20, which you can do by making sure all the address pins are grounded.

Isn't there a MCP23017 setup already in use for regression testing? Actual hardware was used to test and resolve #709

@technobly

This comment has been minimized.

Copy link
Member

commented Jun 22, 2016

Yep, I'm using 1k and 3.3V. I had the addr skewed though from a previous project. There is an MCP23017 setup and tests, but I'm using my own MCP23017 PDIP in a breadboard currently as it's easier to iterate with. My main issue was the RESET pin was floating... the datasheet shows this as an OUTPUT pin, yet the text describes it as an input. Despite my better judgement I had left it floating xD. It's working now.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 22, 2016

Great!

Sent from my iPhone

On Jun 22, 2016, at 6:13 PM, Brett Walach notifications@github.com wrote:

Yep, I'm using 1k and 3.3V. I had the addr skewed though from a previous project. There is an MCP23017 setup and tests, but I'm using my own MCP23017 PDIP in a breadboard currently as it's easier to iterate with. My main issue was the RESET pin was floating... the datasheet shows this as an OUTPUT pin, yet the text describes it as an input. Despite my better judgement I had left it floating xD. It's working now.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 23, 2016

For another piece of data, our actual application ( which uses i2c, SPI, Serial, WiFi, swTimers, hardware interrupts... all together into a full application) has been running on three units for 46 hours w/ ATOMIC_BLOCK() protecting all Wire.endTransmission() and Wire.requestFrom() calls. With this protection we have experienced 0 i2c transaction errors. This binary intentionally makes more i2c calls than we normally would, to trigger any possible issue due to the interrupt disabling.

@m-mcgowan, is it definitely not OK to use ATOMIC_BLOCK() on i2c transaction calls, given it disables interrupts? So far we have seen zero adverse behaviors AND a significant improvement in i2c robustness.

@technobly

This comment has been minimized.

Copy link
Member

commented Jun 23, 2016

Just an update about the disconnects, they are real and happen every 60 publishes if your rate is 1 second or under. Keep your publishes just above 1 per second and you won't see disconnects. I'll be writing up an issue about this. Also, there has been progress on this issue behind the scenes over the last couple days... nothing to share just yet, but I'll update you when we have more info. Master mode does use interrupts for ACK failure detection. It's hard to argue with testing though, for your particular application. Perhaps removing single_threaded_section() from the i2c library is good in that it allow each user to augment the i2c calls however they need to. That said, we'd still like any timing dependent calls to be automatically handled correctly in system firmware.

@avtolstoy avtolstoy referenced this issue Jun 28, 2016
7 of 7 tasks complete
@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 29, 2016

Totally - i see the potential value of leaving this choice up to the user. However, most people will not know whether this is necessary for their program, and very few people will know to run long-term error-rate tests. My suspicion is that any more complicated, multi-feature program will benefit from the ATOMIC_BLOCK() protection.

FYI, we are at 5.55 days of 0 i2c transaction errors running on 7 boards, following the ATOMIC_BLOCK() protection. That is a huge improvement for a couple lines of code. The code is still fully aware of i2c disruptions, such as if a cable is pulled. I did a bunch of testing yesterday to confirm that we're not making the system ignorant of actual live i2c issues.

@avtolstoy

This comment has been minimized.

Copy link
Member

commented Jun 29, 2016

@pomplesiegel
Could you please test if you experience the same issues with PR #1047? As I explained in the PR, I2C in 0.5.1 is affected by non-monotonic micros()/millis() (see #943). Despite some minor errors in flags handling, develop and 0.6.x should not be affected much by this issue as well.

I've been running a heavy test for >24 hours already with zero errors with the following i2c devices:

  • ISL29020
  • HDC1000
  • DS1338
  • M24C01
  • 3x PCA9554
  • PCA9632
  • 2x MCP23017

I strongly discourage against using ATOMIC_BLOCK() for I2C transactions for two reasons:

  1. Worst case scenario you could be blocking interrupts for 200ms in case of an error
  2. Address NACK detection would not be working, causing 100ms delays for instance when scanning the bus. This will also cause interrupts to be blocked for at least 100ms.

@avtolstoy avtolstoy self-assigned this Jun 29, 2016

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 30, 2016

@avtolstoy, great, thank you for the feedback. From an overnight test on two boards it appears the i2c error rate has improved drastically w/ PR #1047. I removed all ATOMIC_BLOCK() and SINGLE_THREADED_BLOCK() protections and am deploying on close to 10 units to observe behavior over the next few days. I'll report back once i have more data. Thanks!

@technobly

This comment has been minimized.

Copy link
Member

commented Jun 30, 2016

Also using the i2cThreadBlocker modification from above, I haven't see a single error in 60 minutes of running the app in the first post :)

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jun 30, 2016

Indeed! Forgot to mention I ran the same i2c stress test on a board for 22 hours w/ this PR and saw 0 errors.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 5, 2016

@avtolstoy, with this PR running within our actual application on multiple units for 4.5 days (plus an elevated amount of i2c transactions as a test) we have seen 0 transactional errors, even with all protections like ATOMIC_BLOCK() and SINGLE_THREADED_BLOCK() removed. This seems to be good to go. Nice job!

@avtolstoy

This comment has been minimized.

Copy link
Member

commented Jul 5, 2016

@pomplesiegel Well, that was the plan, to remove all those "protections" and fix the underlying issue :) Glad to hear that it works well for you! And thanks for testing this! My test rig is also still running with 0 errors.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 5, 2016

Based on our testing, this would greatly improve the stability of 0.5.2. Are you planning to merge it for that release?

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 5, 2016

@avtolstoy, other option: if this won't be released w/ 0.5.2, would it be possible to cook up a branch rebased against 0.5.2, so we can have the i2c fixes, but not the experimental changes currently breaking some behaviors in develop? I tried making these changes to 0.5.2, but was unsure if I missed something crucial after fixing compiler errors

@avtolstoy

This comment has been minimized.

Copy link
Member

commented Jul 6, 2016

@pomplesiegel As far as I understand, it should go into 0.5.2 (it's not in 0.5.2rc1 yet). @technobly, could you comment on this please?

If not, I can easily create a separate branch with this PR rebased on 0.5.2 (+ #943 merged in).

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 6, 2016

@avtolstoy, last I heard this PR will not make it to 0.5.2. Yes, it would very helpful to have a branch which is 0.5.2-rc1 + this PR + #943. Would you mind creating that please? Thank you

@avtolstoy

This comment has been minimized.

Copy link
Member

commented Jul 6, 2016

@pomplesiegel https://github.com/spark/firmware/tree/feature/i2c-fixes-1-0.5.x is based on release/v0.5.2-rc.1 and should include #943 and #1047.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 6, 2016

@avtolstoy thanks so much! we'll be testing with this until 0.5.3 is released.

@pomplesiegel

This comment has been minimized.

Copy link
Author

commented Jul 7, 2016

@avtolstoy, Overnight seeing great stability and an i2c error rate of 0% w/ this 0.5.2 + PR combo running on 5 units + the dedicated i2c test program (using a MCP23017).

@technobly technobly added this to the 0.6.x milestone Aug 9, 2016

@technobly technobly closed this Aug 9, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.