Photon: PANIC when TCPClient fails to connect #516

Closed
maghis opened this issue Aug 2, 2015 · 22 comments · Fixed by #539

Comments

@maghis commented Aug 2, 2015

When trying to connect to a closed port, or when the connection attempt fails for any other reason, the Photon restarts with an SOS panic instead of simply returning false.

Update:

  • crashes and restarts with a PANIC if you try to connect right after the unit has connected to the internet (if you wait 2 seconds it works; any less and it fails)
  • crashes and restarts with a PANIC if the address doesn't exist, nobody is listening on the port, or the TCP connection fails after a successful connect
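A minimal sketch that reproduces the failure (the address and port below are placeholders; any closed or unreachable endpoint should do):

#include "application.h"

// Hypothetical repro: connect to a port with no listener.
// Expected: connect() returns 0. Observed: SOS panic and restart.
TCPClient client;

void setup() {
    // No delay here also reproduces the "right after coming online" crash;
    // a delay(2000) first leaves only the failed-connect crash.
    if (client.connect(IPAddress(192, 168, 1, 42), 9999)) {
        client.stop();
    }
}

void loop() {
}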
@monkbroc (Member) commented Aug 5, 2015

I have seen a similar PANIC when the cloud connection fails. The wiced_tcp_delete_socket call in Spark_Disconnect causes the crash.

@monkbroc (Member) commented Aug 5, 2015

This is probably the same underlying issue as #490

@m-mcgowan (Contributor) commented Aug 10, 2015

Are you building locally? If so, please check that you have flashed your system firmware with this commit, which addresses an SOS issue with sockets: ae42b71

@andyw-lala (Contributor) commented Aug 10, 2015

Is that already in develop? I am still seeing problems with the latest develop.

@maghis (Author) commented Aug 10, 2015

@m-mcgowan I imported that change into my local 0.4.3 (0.4.4 doesn't compile; see #503).

I'll see if it works.

@maghis (Author) commented Aug 10, 2015

@m-mcgowan the fix works, but I believe the sockets get into a strange state after multiple connections, probably because I'm missing all the changes you made to the socket_hal earlier.

@monkbroc (Member) commented Aug 12, 2015

Comment by @andyw-lala originally posted on #519:

As of 4b09814, client.connect() causes an SOS under 2 of the 3 common scenarios (see the results below).
This includes Mat's rework of your PR #519.

From the comment in the test case:

Modify the following to create the 3 common scenarios:

  1. Server and listener exist (e.g. run "netcat -l -p 8888" on the server)
  2. Server exists, but no listener on the port (don't run the netcat)
  3. Server doesn't exist (point the IPAddress somewhere in hyperspace)

As of 4b09814 on the develop branch:

  • #1 works fine (duh)
  • #2 fails instantly
  • #3 fails after ~5 seconds (the timeout in socket_connect())

Full test case can be found here:
https://www.dropbox.com/s/ql7nq1a58xjydht/490-unit-test.ino?dl=0

Going to spend some (but sadly, not a lot of) quality time digging into this right now.

EDIT: I can confirm that commenting out wiced_tcp_delete_socket(this) in the tcp_socket_t destructor stops the panics. I'm dubious about the way "this" is used here: is it safe to pass to a wiced_xxx() call that expects a wiced_tcp_socket_t pointer? [I just can't remember the rules for derived classes]
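On the derived-class question: assuming tcp_socket_t publicly derives from wiced_tcp_socket_t (which the destructor call suggests), passing `this` is well-defined. A minimal sketch with stand-in types (the real WICED struct and function signature differ):

struct wiced_tcp_socket_t {     // stand-in for the real WICED struct
    void* conn;
};

extern "C" void wiced_tcp_delete_socket(wiced_tcp_socket_t* socket);  // declaration simplified

struct tcp_socket_t : public wiced_tcp_socket_t {  // single, public inheritance
    ~tcp_socket_t() {
        // `this` converts implicitly to wiced_tcp_socket_t*; with single
        // non-virtual inheritance the base subobject sits at offset 0, so
        // the pointer WICED receives is valid. The hard fault would then
        // come from the socket's state, not from this conversion.
        wiced_tcp_delete_socket(this);
    }
};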

@m-mcgowan (Contributor) commented Aug 12, 2015

Thanks Andy. Let me know when you're done looking and I'll carry the baton from where you leave off.

@monkbroc (Member) commented Aug 12, 2015

FYI, I dug deep into the WICED stack this afternoon using gdb and the programmer shield, but I couldn't come up with anything beyond confirming that a Hard Fault happens when the connection is being deleted. The actual crash happens inside memcpy.

Here's the stack trace just before the hard fault.

#0  tcpip_apimsg (apimsg=apimsg@entry=0x200025c8) at WICED/network/LwIP/ver1.4.0.rc1/src/api/tcpip.c:389
#1  0x08066360 in netconn_delete (conn=0x2001bd60 <memp_memory_NETCONN_base+816>) at WICED/network/LwIP/ver1.4.0.rc1/src/api/api_lib.c:129
#2  0x0806e9ca in wiced_tcp_delete_socket (socket=0x20005cac) at WICED/network/LwIP/WICED/tcpip.c:458
#3  0x0807928c in ~tcp_socket_t (this=0x20005cac, __in_chrg=<optimized out>) at src/photon/socket_hal.cpp:115
#4  socket_t::dispose (this=0x20005ca8) at src/photon/socket_hal.cpp:371
#5  0x08079384 in socket_dispose (handle=<optimized out>) at src/photon/socket_hal.cpp:556
#6  0x080796f2 in socket_close (sock=536894632) at src/photon/socket_hal.cpp:843
#7  0x0807413e in Spark_Disconnect () at src/system_cloud.cpp:670
#8  0x08074268 in Spark_Connect () at src/system_cloud.cpp:653
#9  0x08075126 in establish_cloud_connection () at src/system_task.cpp:228
#10 0x0807531a in manage_cloud_connection (force_events=<optimized out>) at src/system_task.cpp:308
#11 0x08074f88 in Spark_Idle () at ./inc/system_task.h:46
#12 app_setup_and_loop () at src/main.cpp:208
#13 0x080780ea in application_start () at src/stm32f2xx/core_hal_stm32f2xx.c:436

When I set a breakpoint at HardFault_Handler and dump the local variables set in prvGetRegistersFromStack, I get:

r0 = 0            = 0x0
r1 = 135266304    = 0x8100000
r2 = 268455914    = 0x10004fea
r3 = 1039814      = 0xfddc6
r12 = 0           = 0x0
lr = 134632229    = 0x8065325
pc = 134730818    = 0x807d442
psr = 2164260864  = 0x81000000

The PC where the fault occurs is in memcpy

(gdb) info symbol 0x807d442
memcpy + 10 in section .text

The LR is in xQueueGenericReceive

(gdb) info symbol 0x8065325
xQueueGenericReceive + 37 in section .text
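For reference, prvGetRegistersFromStack is the conventional Cortex-M fault decoder from the FreeRTOS debugging guide; a minimal sketch of it (the exact wiring in this firmware is an assumption):

#include <stdint.h>

// Minimal Cortex-M hard fault decoder, per the FreeRTOS debugging guide.
// The fault handler passes the stacked exception frame here; breaking on
// this function in gdb exposes the registers dumped above.
void prvGetRegistersFromStack(uint32_t* pulFaultStackAddress)
{
    volatile uint32_t r0  = pulFaultStackAddress[0];
    volatile uint32_t r1  = pulFaultStackAddress[1];
    volatile uint32_t r2  = pulFaultStackAddress[2];
    volatile uint32_t r3  = pulFaultStackAddress[3];
    volatile uint32_t r12 = pulFaultStackAddress[4];
    volatile uint32_t lr  = pulFaultStackAddress[5];  // caller's return address
    volatile uint32_t pc  = pulFaultStackAddress[6];  // faulting instruction ("info symbol" this)
    volatile uint32_t psr = pulFaultStackAddress[7];

    (void)r0; (void)r1; (void)r2; (void)r3;
    (void)r12; (void)lr; (void)pc; (void)psr;

    for (;;) { }  // park so the debugger can inspect the values
}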

@m-mcgowan (Contributor) commented Aug 13, 2015

I pushed fixes to develop - for anyone who can and wants to, please pull and test!

@m-mcgowan added this to the 0.4.4 milestone on Aug 13, 2015

@andyw-lala (Contributor) commented Aug 13, 2015

Build fails for me:

...
Building file: src/hal_peripherals.c
Invoking: ARM GCC C Compiler
mkdir -p ../build/target/hal-dynalib/platform-6-m/./src/
arm-none-eabi-gcc -DSTM32_DEVICE -DSTM32F2XX -DPLATFORM_THREADING=1 -DPLATFORM_ID=6 -DPLATFORM_NAME=photon -DUSBD_VID_SPARK=0x2B04 -DUSBD_PID_DFU=0xD006 -DUSBD_PID_CDC=0xC006 -DDYNALIB_IMPORT -DUSER_FIRMWARE_IMAGE_SIZE=0x20000 -DUSER_FIRMWARE_IMAGE_LOCATION=0x80A0000 -DMODULAR_FIRMWARE=1 -DMODULE_VERSION=4 -DMODULE_FUNCTION=4 -DMODULE_INDEX=1 -DMODULE_DEPENDENCY=0,0,0 -DRELEASE_BUILD -Werror -I../hal/inc -I../hal/shared -I../hal/src/photon/api -I../dynalib/inc -I. -MD -MP -MF ../build/target/hal-dynalib/platform-6-m/./src/hal_peripherals.o.d -ffunction-sections -fdata-sections -Wall -Wno-switch -Wno-error=deprecated-declarations -fmessage-length=0 -fno-strict-aliasing -DSPARK=1 -DSTART_DFU_FLASHER_SERIAL_SPEED=14400 -DSTART_YMODEM_FLASHER_SERIAL_SPEED=28800 -g3 -gdwarf-2 -Os -mcpu=cortex-m3 -mthumb -Wno-pointer-sign -std=gnu99 -c -o ../build/target/hal-dynalib/platform-6-m/./src/hal_peripherals.o src/hal_peripherals.c
In file included from src/hal_peripherals.c:1:0:
../hal/inc/hal_dynalib_peripherals.h:32:56: fatal error: platform_config.h: No such file or directory
#include "platform_config.h" // for HAS_SERIAL_FLASH
^
compilation terminated.
make[2]: *** [../build/target/hal-dynalib/platform-6-m/./src/hal_peripherals.o] Error 1
make[2]: Leaving directory `/home/andyw/spark/firmware/hal-dynalib'
make[1]: *** [hal-dynalib] Error 2
make[1]: Leaving directory `/home/andyw/spark/firmware/modules/photon/system-part1'
make: *** [/home/andyw/spark/firmware/modules/photon/system-part1/makefile] Error 2

@andyw-lala (Contributor) commented Aug 13, 2015

As of 6ae147b, the develop branch passes all cases of my unit test for client.connect().

connect() returns 1 iff the connection can be established.

Note that connect() will block for approximately 5 seconds if the host is non-existent, before returning 0. This is the desired/expected behaviour, as far as I'm concerned.
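A minimal sketch of how to observe those return values and the ~5 second block (the endpoint is a placeholder):

#include "application.h"

TCPClient client;
IPAddress server(192, 168, 1, 42);  // point into "hyperspace" for scenario 3

void setup() {
    Serial.begin(9600);
}

void loop() {
    uint32_t start = millis();
    int result = client.connect(server, 8888);  // 1 iff connected
    uint32_t elapsed = millis() - start;

    Serial.print("connect() = ");
    Serial.print(result);
    Serial.print(" after ");
    Serial.print(elapsed);
    Serial.println(" ms");                      // ~5000 for a non-existent host

    if (result) client.stop();                  // release the socket
    delay(2000);
}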

I will create and run a long-term test to check for resource leakage & report back.

NOTE: I have not tested or addressed the client.connected() failure yet. One thing at a time...

@monkbroc (Member) commented Aug 13, 2015

Commit 6ae147b fixes the crash when the Photon is connected to a captive portal, so I'm happy with the fix too.

@m-mcgowan (Contributor) commented Aug 13, 2015

We are waiting for WICED to give us a permissive license for SDK 3.3.1, which may resolve the WICED issues: since Broadcom is updating the LwIP stack, this problem may simply disappear soon. It would be much appreciated if everyone could keep their test setups so we can retest later. I've added some cases to our test suite.

@monkbroc (Member) commented Aug 13, 2015

We are waiting for WICED to give us a permissive license for SDK 3.3.1

👍

Is the idea that Particle will host an open repository of WICED?

Otherwise, it would be helpful to include instructions on building WICED from source (register with Broadcom, download WICED SDK, download platform files for BCM9WCDUSI09 and BCM9WCDUSI14, run a specified build command).

@m-mcgowan (Contributor) commented Aug 13, 2015

We have had to make changes to their SDK, so building from source isn't possible at present without access to our private repo. So yes, we very much hope for a permissive license so we can host the repo publicly.

@monkbroc (Member) commented Aug 13, 2015

That would be awesome. Thanks for the great work!

@andyw-lala (Contributor) commented Aug 13, 2015

A single invocation of the long-term test suggests there is a resource leak or other problem that effectively cuts a Photon off from the outside world (both the Particle cloud and any local TCP connections initiated by the Photon). I'll see if I can devise a more complex test that provides more data about the failure mode.

TL;DR - we're not out of the woods yet, sadly.
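A minimal sketch of the sort of soak test meant here (the endpoint is a placeholder; Spark.connected() is the cloud check of this firmware era):

#include "application.h"

TCPClient client;
unsigned attempts = 0;

void setup() {
    Serial.begin(9600);
}

void loop() {
    attempts++;
    int result = client.connect(IPAddress(192, 168, 1, 42), 9999);  // no listener
    if (result) client.stop();

    Serial.print("attempt ");
    Serial.print(attempts);
    Serial.print(": connect() = ");
    Serial.println(result);

    // A follow-up comment reports the Photon going deaf after ~350 attempts.
    if (!Spark.connected()) {
        Serial.println("cloud connection lost - possible resource leak");
    }
    delay(100);
}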

@monkbroc (Member) commented Aug 14, 2015

My investigation from yesterday shows that the crash was a result of netconn_delete, which must be called in order to avoid leaking the connection handler allocated by WICED.
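For context, a minimal sketch of the LwIP netconn lifecycle that wiced_tcp_delete_socket drives (standard LwIP 1.4 netconn API; error handling trimmed):

#include "lwip/api.h"

// netconns come from a fixed pool (memp_memory_NETCONN_base in the trace
// above): skipping netconn_delete() leaks a pool slot, while calling it on
// a connection in a bad state is where the hard fault was striking.
void netconn_lifecycle_sketch(void)
{
    struct netconn* conn = netconn_new(NETCONN_TCP);
    if (conn == NULL) {
        return;                        // pool exhausted: no slots left
    }

    ip_addr_t addr;
    IP4_ADDR(&addr, 192, 168, 1, 42);  // placeholder peer
    if (netconn_connect(conn, &addr, 8888) == ERR_OK) {
        netconn_close(conn);
    }

    netconn_delete(conn);              // must run on every path to free the slot
}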

@andyw-lala (Contributor) commented Aug 14, 2015

There is definitely a resource leak of sorts: after ~350 connection attempts, I lose the cloud connection and the ability to open new connections. Hence, I believe the fix in commit 6ae147b is very beneficial, but incomplete.

@monkbroc (Member) commented Aug 14, 2015

I'm hopeful I've got a real fix for the crash with a non-existent server. See #539.

@m-mcgowan (Contributor) commented Aug 17, 2015

Lots of activity lately around sockets, and reliability has improved significantly over the past week - thanks to @monkbroc and @andyw-lala for contributions and testing.

I made some further fixes today, and I have a test case that executes tens of thousands of socket connects and failed connects to a nonexistent resource.

I'll close this issue. If we discover further problems, let's create separate issues for them, since I believe the underlying cause of this one has been resolved.

@m-mcgowan closed this on Aug 17, 2015
