New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UIPEthernet stops working after some time #30
Comments
My best guess as you mention Serial output not being connected: writing to an unconnected Serial might causes stability-issue when used in sketches that use a good amount of memory as UIPEthernet does (stock Ethernet need nearly no memory since everything is done on the WIZ5100 chip). It may help to just connect to Serial-monitor to prevent an overflow of the Serial-device buffers. There is an open issue with Arduino about this: https://github.com/arduino/Arduino/issues/1724. I allways use some '#ifdef debug' statements to remove all debug-logging from the sketch for production use. |
Well, I can try to remove those printouts and see what happens. But I have been running sketches on the same hardware with NanodeUIP instead and serial printouts without problems before. But that library requires you to write code that cannot be reused easily with Wiznet based hardware. So that's why i'm hoping this library would work :) |
I had this same issue, however since running the very latest master, with the connect returning 0 instead of -1 fix in, I have not encountered this issue. I had my code running for around 50 hours before I turned it off. This was working as a server. Code I'm using: https://gist.github.com/gibsop1/ba4cd682913144fecd18 |
Issue #29 was reverted in the latest master since the stock Ethernet library also return negative values. They should update their API documentation because it says true/false but in case of connect that requires DNS lookup they will return negative values. And in my code i'm checking the return value and only accept > 0. I have now removed all my serial output and started up the sketch again so we will see if this run lasts any longer. |
Ah about thanks. I'll update my code with >0. Thanks On Mon, Jan 20, 2014 at 8:27 PM, nilrog notifications@github.com wrote:
|
Serial.print was not the problem. I removed all those printouts yesterday and tonight the sketch has stopped posting again. There is something in here that is not 100% ok but I don't yet know what it is. |
ok, can you provide the complete code that is running on the arduino? Either here or as link to your repo? |
I have done some more tests and tracings and there seems to be a problem in the connect() call when you use a hostname. After a while the connect() call will still return 1 (or > 0 which I am testing for) but nothing is sent to the remote host. Snooping with Wireshark reveals that the DNS-request for the remote host is sent, and a reply to that is returned, when the sketch calls connect(). But the actual HTTP-request is never sent from the sketch. So the sketch is then waiting for a response that will never come. I have put part of the network trace in this gist, https://gist.github.com/nilrog/27745bbf923388c94ec4 The problem begins with the request in frame 481. The response for this request seems to be duplicated (in frame 483 and 486) and from that point on UIPEthernet stops working properly. |
I assume 192.168.2.11 is the Arduino? Frame 486 is just a retransmission of Frame 483. This is caused by the duplicate-ack-packet in Frame 485 which tells 176.227.193.50 that it didn't receive or process frame 481. Frame 486 (dup of 483) is acknowledged in 487, after that 176.227.193.50 agains send the FIN,ACK to close the connection... That looks all good - retransmissions are not an error, since packets can (and may) allways be dropped. And in this case the retransmitted http-response in frame 486 seems to have been processed by 192.168.2.11 as it acknowledge the close of connection afterwards. The strange thing is that from this point on every connection-attempt from 192.168.2.11 is going to fail because the ACKs of 192.168.2.11 (e.g. Frame 497) to the SYN,ACK-packets from 176.227.193.50 (Frame 496) seem not to reach its target as 176.227.193.50 retransmits the SYN,ACK-packet (Frame 498), ignores the retransmitted ACK of 192.168.2.11 (Frame 499) and then closes the connection (Frame 500) which is acknoledged by 192.168.2.11 in Frame 501. This frame obviously reached 176.227.193.50, as it's acked in Frame 502 and the connection is closed. From what I can see 192.168.2.11 works fine, as it's 176.227.193.50 not accepting the new connections. This could also be a firewall/router issue (possible triggered by the retransmitted ACK in the previous connection) when the ACKs from 192.168.2.11 would be dropped on the way not reaching 176.227.193.50. connect()-method returns with return-value of 1 as soon the connection is established which is when the ACK (see Frame 497) to 176.227.193.50's SYN,ACK-packet is sent. Bad thing is that the connection is closed by 176.227.193.50 just after that so your sketch never gets to send out the request.
|
Yes, 192.168.2.11 is my Arduino. The thing is that this is only an issue with UIPEthernet. I have an identical Arduino that has been running with EtherCard posting to the same server for a month now without problems. No hiccups, it just keeps on working. I have also used NanodeUIP, which also uses UIP, on the same hardware and it "just works". But both of these libraries have their drawbacks and they are also not so user-friendly as the stock Ethernet API. If I reset the Arduino when it is in this state everything starts working again. No other measures taken. So I do not believe this is an issue with a router. It's just that I have not been able to put my finger on what is wrong yet :( But I will continue digging :) |
@nilrog: The wireshark log gives evidence that the connections are closed by 176.227.193.50 aprox. 20 seconds after the ACK to 176.227.193.50 SYN,ACK-packet was sent. This can also be just because of configured timeout on 176.227.193.50, or could be because of not having received the ACK - we cannot know without having access to the webservers log. What we know is your sketch is not sending further packets within these 20 sec after the connection is established. Can you please post the complete code your sketch runs (Either here or as a gist)? |
By the time that 20 secs has gone the Arduino has already closed the connection. If no response, or no complete response has been received within 10 secs i'm calling client.stop(). The 20 sec seem to be the default timeout on Linux. |
Can you please post the complete code your sketch runs (Either here or as a gist)? |
Sure, but it just connects and prints the request to the client. |
I created a pull request (#33) that moves all debug printouts from RAM to FLASH. That was needed otherwise the sketch would soon crash due to lack of free RAM. |
Here is serial output where I have turened on UIPETHERNET_DEBUG_CLIENT and UIPETHERNET_DEBUG. Unfortunately I couldn't get the full serial printout since the Arduino IDE seems to have a limit to how much it can store. But I have extracted what should be one attempt to post.
(Updated serial printout to show the correct flow of events from the start of one post.) |
UIPClient allocation failed <<-- that means that all available slots to manage connection-data are in use. Maybe this should better be checked in connect()-method even before the connection is initiated. Currently it's checked when the first attempt to use the established connection is made so connect returns 1, which is misleading. You could detect this situation by checking the return-value of Client::write() (which would be 0 in this case). To narrow it down:
|
Ok, I will try #2 tonight when I get home. |
I have tested removing the Timer and resorted to counting millis instead. But the result is the same. The sketch stops working and UIPEthernet is out of resources. As I expected the use of Timer shouldn't have interfered. Tonight i'm going to test adding a call to client.stop() before I dispatch a new request. But I can't see in my code how this isn't already done, unless UIPEthernet is returning some really weird statuses for connected() and available(). In the state where i'm waiting for response I have two 10 second timers that call client.stop(). One in the loop where i'm reading the response and one that will always trigger, as can be seen in the serial logs. But I will try it anyway just to try and narrow down on this problem. Btw, what version of uIP did you use as base for UIPEthernet? |
Thank you for testing this. I for myself wouln't have expected Timer1 to interfere with UIPEthernet, it's just that I have this report #26 without further info. There is another way UIPEthernet can run out of available connections - if the remote side fails to acknowledge the outgoing packets (because it just might be disconnected from network) this outstanding data will be cleaned up not before the tcp-timeouts are expired and the connection is closed by uip due to timeout. Calling Client.stop() only removes unread incoming data - outstanding data-to-send is disassociated from the client-object but still kept for (re-)transmission. So the uip-connection slot stays allocated longer than the client-object itself is valid. uip is from contiki-OS: https://github.com/contiki-os/contiki/tree/master/core/net |
No problem...I want to crack this damned bug :P I can take the easy route and go back to EtherCard, which is enough for this project. But I like the idea of a unified Ethernet API regardless of what hardware you are using. This API is also so much easier to use that what EtherCard etc. are using. So unless I hit a dead end I will continue to try and hunt this down :) I checked the serial log a bit further and to me it looks like UIPClient suddenly stops "closing"... This is how it looks from the beginning (grep "UIPClient "):
But after a while it starts to look like this:
And the response from the server never reaches the sketch. The sketch will continue to post thinking all is ok, since UIPEthernet has not flagged anything, and later on UIPEthernet runs out of resources. What is weird is that I cannot see that the sketch prints "Disconnected" before it prints "Posting update". But I cannot see how the sketch can start posting again before it has called client.stop(). After a while I start seeing "Timeout #2" from the sketch which indicates that reading the response has taken more than 10 seconds, and then I also see "Disconnecting", which is expected since i'm stopping the client. The serial log is unfortunately too big to put in a gist...at least it gets cut-off when I try to create it. But if you want to have it I can share it somewhere else. This is basically how the state machine in the sketch looks like. And this is what I have been using many times before with EtherCard, NanodeUIP etc. The remote server is also up and running all the time because I have another, EtherCard based, sketch that is posting 24/7 without hiccups. state_need_to_post_update -> state_posting_update (waiting for response for up to 10 seconds) -> state_noconnection (after client.stop() has been called). The only way the sketch can get back to state "state_need_to_post_update" is if the timer/millis has expired AND the sketch is in state "state_noconnection". And it cannot get to "state_noconnection" without calling client.stop(). I'm going to test tonight to either add some checks for client.connected() and call client.stop() before posting, or i'm going to add some more debug printouts to try and see in what state the client is. |
do you have a Wireshark-log that catches the packets at 'But after a while it starts to look like this:...'? That would show wether there is a response that is not processed by UIPEthernet or not (And it would proove whether there is unacknowledged outgoing data or not). |
Not from the latest run when I "know" that this happened. The problem is also that it's hard to find out when in the Wireshark log it would be. I need to add some timestamps to the serial logs to be able to correlate the two logs. I will see what I can do tonight. |
Here is one occasion where it seems like the sketch starts to post again before it has disconnected. But it hasn't. Because I added a printout before posting if the client is still connected. The corresponding wireshark log is in this gist https://gist.github.com/nilrog/19327348563aefb18cae
At this moment the sketch is still working. But i'm a bit confused as to what i'm seeing. The library handles multiple connections right? And afaik I cannot see which connection it is that is used. Could it be that with time UIPEthernet starts to loose track of which connection is handling which request? |
Yes, the library handles multiple connections, but a single instance of Client can only be associated with a single connection. Calling 'connect()' will associate the client with a new connection. If you happen to not have called stop() before connecting again the previous connection will be orphaned, but not discarded, if there is any unread inbound data left. That seems to have happened here, your log shows that it didn't read the 'HTTP 1.1 200 OK'-response that is being found in the wireshark-trace. btw, I have an idea why your stat-machine fails (and it propably does fail with ethercard or wiz5100 as well - it's just these libraries seem to be more robust when being used improperly): the enum used for the states is 16-bit internally if the gcc compiler switch -fshort-enums is not being used. This results in assignments that eventually fail when interrupt is happening in between the two bytes being assigned. As UIPEthernet internally uses some 16-bit integers as well (because the signatures of stock Ethernet-api require it) there propably are some places that require masking of interupts too. Try replacing the enum with values defined as 'const byte' |
It's the Arduino serial terminal that is borked, not the state machine, which I was suspecting but had ignored. I tried tonight again and captured the serial data with a better terminal and now it all looks much better...from a sketch point-of-view...up until UIPEthernet stops working just as before. I did change the state variable to a plain uint8_t with defined values instead of using the enum to avoid what you were saying could happen. But that didn't make any difference. I also added printouts that shows everytime the sketch changes state. In the serial output I can see that every time the sketch wants to send a request the sketch is in state STATE_NOCONNECTION, which it gets to after calling client.stop(). So what we were seeing up until now was not the whole truth...thanks to the Arduino serial terminal. After a long run the sketch suddenly stops receiving data from UIPEthernet and goes to "Timeout 2". This means that the client is not connected and it has been >10 secs since the request was sent down to UIPEthernet. This happens 2-3 times and then UIPEthernet starts spitting out "UIPClient allocation failed", right after "UIPClient uip_connected". So UIPEthernet has told my sketch that it was successfully connected but it cannot allocate any resources to send out what the sketch is writing to it. So I guess UIPEthernet disconnects, otherwise "Timeout 1" would have kicked in, but it doesn't. So when the sketch is in state STATE_POSTING_UPDATE waiting for the response the client is not connected. Here is an example of a timeout:
|
short update: I'm setting up a testcase using your code and try to reproduce on my side. Sorry I'm not faster doing these things, but this is a freetime-only-project and time to work on issues is a volatile resource. P.S. I've added a call to stop() to UIPClient::connect() 36a0479, included in branch https://github.com/ntruchsess/arduino_uip/tree/dev and https://github.com/ntruchsess/arduino_uip/tree/Arduino_1.5.5_dev P.P.S: Test is running now. I'm not posting to emoncms.org but to local apache installed on Raspberry pi, serial is attached to Pi running picomon. Repeat-interval is down to 5 seconds.... P.P.P.S: |
No worries...we will get this bug eventually :) I don't expect you to work 24/7 on this. You have done a great job so far with this library. Finally we have a hardware agnostic Ethernet API that allows us to write sketches that can use either Wiznet chips or the ENC28J60. Both EtherCard and NanodeUIP are great, tailored, libraries for the ENC28J60. But they have their own set of limitations and quirks making them hard to work with. The Arduino Ethernet API in contrast is dead simple. And now we can use that with this chip too :) Glad to hear you got it reproduced locally. I did a quick test with a small bottle.py web server a few days ago but wasn't able to reproduce it then. Perhaps I didn't wait long enough... I don't think there is an issue that the client isn't stopped before a new request is made. I think it is more likely that something in the interwork between uIP and UIPEthernet goes wrong sometimes. Or that there is some subtle bug in UIPEthernet that sometimes pops up. I noticed there are a few compiler warnings when compiling UIPEthernet that I have neglected before. I have written issues for them (#34, #35, #36, #37 and #38). But I have not yet had time to look into and try and fix all of them. And GitHub is messing with me tonight so I cannot push the changes I have done :( There are some other warnings too but they are less severe so I haven't written any issues for those. |
I could prove that calling stop before connect doesn't solve this issue (did include stop() as very first call in connect-method). So there is definitly something going on that is not obvious as the place that disassociates the clients data-member from the uip-connection (https://github.com/ntruchsess/arduino_uip/blob/master/UIPClient.cpp#L398) on remote-close doesn't disassociate the data-member from the client-object but only sets the data's state. The data-object is disassociated during read (https://github.com/ntruchsess/arduino_uip/blob/master/UIPClient.cpp#L258) if remote disconnect occured before or in stop (https://github.com/ntruchsess/arduino_uip/blob/master/UIPClient.cpp#L110). But in both these cases data remains associated with the uip_connection to be cleaned up after successful close of connection (in https://github.com/ntruchsess/arduino_uip/blob/master/UIPClient.cpp#L398). So I think it's either not interrupt-safe or other race-condition or memory-corruption. All three very hard to find though. I'm currently running a test with no Serial output at all and no Timer (to minimize interrupts), watching the webservers-log to see whether it's still active. (aprox. 1000 requests so far...) Completely disabling interrupts within UIPEthernet for testing-purpose by just wrapping all calls to UIPEthernet in pairs of cli() and sei() didn't work, Client.connect() would never return (which is something I'm still a bit surprised, but it could be that millis() fails within interrupt-disabled sections so connect() wouldn't send out further packets as internal timeouts never expire...) |
good news. Guess I found the issue! |
fix #30, caused by leaked client.data-members when remote closed connection with no unread inbound data left. Add explicit check for state!=0 to detect this in client.close();
Great job hunting this one down! :) I tried running my sketch over night and it was still working this morning. A few timeouts had happened during night but they didn't cause UIPEthernet to run out of resources. So i'm a happy camper now :) |
@nilrog: great! |
The best fixes are the ones that requires few changes :) I will try to run the sketch for a few days and see how it works. |
48+ hours and counting now. I think it is safe to say that this library is quite stable now :) Haven't seen anymore getHostByName == -1 so that might have been a temporary network glitch. |
Thank ntruchsess and nilrog for finding and fixing the bug. After update the code and run sketch for over 18 hours, my arduino still hang out. I use the same hardware as nilrog used: minipro + enc28j60 |
I have a sketch, written for the stock Ethernet library, that posts information to http://emoncms.org with 30 secs interval. When I run this sketch on an Uno + Ethernet shield it justs keeps posting every 30 secs.
But when I recompile this sketch for UIPEthernet and run in to a Nano + ENC28J60 shield it has so far been running for at most 20 hours, but the track record is down as low as < 1 hour. I have had to add additional error handling and timeout handling to the sketch to improve stability with UIPEthernet. With the stock Ethernet library it "just worked".
What happens is that the sketch just stops posting. I not yet had the possibility to see this lockup when the Nano was connected to my Mac. Because where I have it wired up for longterm testing I have no possibility to log serial out.
I am currently running the lastest "master".
I know this is a bit vague, but I will try to gather more info. I just wanted to toss this out here since you have way more knowledge of the internals in UIPEthernet and might have a clue about what is happening :)
/Roger
The text was updated successfully, but these errors were encountered: