Photon frequently SOS's immediately following cloud re-connect #663

Closed
pomplesiegel opened this issue Oct 4, 2015 · 18 comments

Comments

@pomplesiegel

commented Oct 4, 2015

Using system FW 0.4.6 and the code below, the Photon often crashes (red flashing SOS) immediately upon a cloud re-connect, following a WiFi disruption.

I noticed this because our application, which ran stably on 0.4.5, often crashes under the same circumstances (re-connecting to the cloud, possibly while publishing?). The code below is a stripped-down stand-in for a normal application that induces the same failure.

Steps to reproduce:

  1. Load the code below on a Photon (running system FW 0.4.6) with WiFi credentials already stored. The Photon will connect to a network (breathing cyan) and begin publishing
  2. Remove the WiFi network (pull the plug on router or disable this network)
  3. Wait a few seconds
  4. Re-enable the network
  5. Repeat steps 2-4 until the Photon SOS's. In my experience this takes only 1-3 attempts.

Things to note

  1. This is SYSTEM_MODE(MANUAL) using the new SYSTEM_THREAD(ENABLED)
  2. There is a delay in the program (mimicking a normal program's behavior)
  3. We're constantly checking Particle.connected(), then publishing once that returns true.

Given the multi-threading changes, might this be a concurrency issue involving the state passed from system firmware to the user app?

```cpp
#include "application.h"

SYSTEM_MODE(MANUAL);
SYSTEM_THREAD(ENABLED);

void setup()
{
  Serial.begin(9600);
  delay(2000);
  Serial.println("Beginning program!");

  WiFi.on();
  WiFi.connect();
  Particle.connect();
}

int timeOfPublish = 0;

void loop()
{
  //If we're connected, once per second
  if( Particle.connected() && ( timeOfPublish != Time.now() ) )
  {
    //Publish
    Serial.println("publishing...");
    Particle.publish("test","testData");
    timeOfPublish = Time.now();
  }

  //Manage WiFi/Cloud overhead
  Particle.process();

  //Delay of a normal program
  delay(5);
}
```

@HardWater


commented Oct 5, 2015

I can confirm that I experienced the same SOS problem when trying to use 0.4.6. It occurred on the second HTTP request in a code loop (awake cycle) communicating with my server; the first request succeeds. This seems to be a bug that was fixed in 0.4.4 but has come back. I switched to 0.4.5 and was not able to replicate the SOS problem.

See https://community.particle.io/t/httpclient-and-the-photon/12661/22

@nlambuca


commented Oct 5, 2015

Exactly. Will try the same...

@m-mcgowan

Contributor

commented Oct 9, 2015

This is fixed in develop. Much appreciated if someone is able to test and confirm.

@m-mcgowan m-mcgowan added this to the 0.4.7 milestone Oct 9, 2015

@indraastra


commented Oct 9, 2015

Ok, so I'm not running exactly the same code as @pomplesiegel, but rather, a version of the code user mhazley posted here:

https://community.particle.io/t/failures-on-handling-wifi-reconnects-v0-4-5-v0-4-6/16384

I can confirm that killing the AP no longer causes an immediate SOS from the Particle.connect() that follows once the Photon realizes it's no longer connected.

That said, I followed @pomplesiegel's steps 3 times without power cycling. The Photon reconnected to the cloud successfully the first two times and then SOS'd the third time.

Edit: To keep this thread clean, I'm deleting my earlier posts about build issues. For posterity, I had TARGET_NAME set in my environment from some other build process. Unsetting it allowed me to build the libraries properly.

@indraastra


commented Oct 9, 2015

I repeated the experiment with the exact code posted above and can confirm that it's still happening on the third reconnection. Not sure what it is about the third time, but I noticed the LED flash red momentarily on the second reconnect.

On that note, why does the Photon even attempt to reconnect to the cloud after a disconnect in this code? Is it the call to Particle.process()?

@m-mcgowan

Contributor

commented Oct 10, 2015

Flashing RED (or orange as it is in the latest code) momentarily is fine - that can be due to a failed connection attempt. It's not a permanent error and is expected in some situations. (I changed it from red to orange to avoid confusion with SOS.)

@m-mcgowan

Contributor

commented Oct 10, 2015

Particle.connect() sets a flag that says "connect to the cloud". In single-threaded mode, the system keeps attempting to connect to the cloud whenever loop() exits or Particle.process() is called. With multithreading enabled, the system continually attempts to connect on the background thread. This continues until the connection succeeds or Particle.disconnect() is called.

@mhazley


commented Oct 10, 2015

Thanks - should hopefully get a test run on this today.

@mhazley


commented Oct 10, 2015

@m-mcgowan I tested that earlier and got no SOS - looks like you got it.

Have you got a PR or commit for that fix anywhere? I'm not sure I want to run develop in the field this week, so I might run a local 0.4.6 with that fix until 0.4.7 comes out. Or will that fix be going onto the 0.4.6 release branch for a 0.4.6.2?

@indraastra


commented Oct 12, 2015

I can also happily report that the SOSes I was seeing on Friday are no longer happening in the latest develop branch. Thanks!

@indraastra


commented Oct 12, 2015

@m-mcgowan Ah, so my takeaway from your explanation of Particle.connect() is that once called, you never need to call it again for the cloud connection to be re-established after a disconnect or any kind of connection upset?

@m-mcgowan

Contributor

commented Oct 12, 2015

Yes, that's correct, although in manual mode you are also responsible for pumping messages with Particle.process().

@indraastra


commented Oct 12, 2015

Except if threading is enabled, in which case Particle.process() is called on the system thread?

@m-mcgowan

Contributor

commented Oct 12, 2015

Correct. With threading enabled, the only distinction between the modes is if the system starts with the cloud connected or not. The system implicitly does a Particle.connect() on startup when the mode is automatic.

@indraastra


commented Oct 12, 2015

Great, thanks for the clarifications! I'm trying to update my WiFi bring-up and monitoring code to be more in line with all these details, and eagerly (but patiently) await the fix you have in store for the hasCredentials() reentrancy issue.

@pomplesiegel

Author

commented Oct 13, 2015

@m-mcgowan, interesting! Given that insight, when threading is enabled, is there any difference between SEMI_AUTOMATIC and MANUAL? In #677 we're seeing different behavior between the two: functions are not callable in MANUAL without adding delay(1) to loop().

@m-mcgowan

Contributor

commented Oct 13, 2015

In principle, no difference. The issue you're seeing may be the result of thread priorities preventing the system thread from running unless loop() calls delay(), which allows the system to perform a thread yield. I will be investigating that in the coming days.

@pomplesiegel

Author

commented Oct 13, 2015

Great! I figured it could be something like that - the OS thinking that this thread is not significant enough to run at our anticipated frequency.

Thanks for looking into this!

6 participants