frequent ECONNRESET error #137

Closed
iamcgn opened this Issue Nov 15, 2013 · 24 comments

Projects

None yet

9 participants

@iamcgn

I'm logging this error pretty frequently. It seems to happen if I haven't sent any notifications for an extended period of time. When I get this error the push notification that it was attempting to send never gets sent regardless of the notification retry limit value.

@argon

I am looking into this.

@argon

I see 'ECONNRESET' after 4 minutes of inactivity in a production environment. I send a notification, which is received by the device, then 4 minutes after sending no notifications I see 'ECONNRESET'. I then send another notification and it re-establishes the connection and sends it correctly, the device then receives it.

I'm a little confused by your explanation that if happens if you haven't sent any notifications for an extended period of time, then you say that the notification it was sending at the time never gets sent. How could it be sending a notification if you haven't sent one for some time?

@iamcgn
@andreyvit

+1, seeing the same problem here. When trying to send a notification after an idle period:

Processing job 170: {"to":"48c1f6f2 0f9e8d80 08971431 99bf19a4 2163ee55 7cb2e543 cc9974e7 1c807268","payload":{"aps":{"alert":"Test 1","sound":"chime","badge":11}}}
Fri, 21 Feb 2014 19:26:30 GMT apn Transmitting notification from buffer
Fri, 21 Feb 2014 19:26:30 GMT apn Socket drained 0
Fri, 21 Feb 2014 19:26:30 GMT apn 0 left to send
Fri, 21 Feb 2014 19:26:30 GMT apn 0 left to send
Fri, 21 Feb 2014 19:26:31 GMT apn Socket error occurred 0 { [Error: read ECONNRESET] code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' }
Fri, 21 Feb 2014 19:26:31 GMT apn Raising error: { [Error: read ECONNRESET] code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' } null undefined
Fri, 21 Feb 2014 19:26:31 GMT apn Error occurred with trace: Error: read ECONNRESET
    at errnoException (net.js:901:11)
    at TCP.onread (net.js:556:19)
Fri, 21 Feb 2014 19:26:31 GMT apn Destroying connection 0
Fri, 21 Feb 2014 19:26:31 GMT apn Socket closed 0
⤫ Disconnected from the Apple Push Notification Service (open = 0)
Fri, 21 Feb 2014 19:26:31 GMT apn 0 left to send

The next attempt re-established the connection and sends the notification:

Fri, 21 Feb 2014 19:27:09 GMT apn Initialising connection
Fri, 21 Feb 2014 19:27:09 GMT apn 1 left to send
Fri, 21 Feb 2014 19:27:20 GMT apn Connection established
✔ Connected to Apple Push Notification Service (gateway.sandbox.push.apple.com, open = 1)
Fri, 21 Feb 2014 19:27:20 GMT apn Transmitting notification from buffer
Fri, 21 Feb 2014 19:27:20 GMT apn Socket drained 1
Fri, 21 Feb 2014 19:27:20 GMT apn 0 left to send
Fri, 21 Feb 2014 19:27:20 GMT apn 0 left to send

Unfortunately, this means that the first notification gets lost. What I'd like is automatic retrying logic to kick in here.

@andreyvit

Here's what I ended up doing for now. It's not pretty, but I need to have an interim working version today:

@connection.on 'transmitted', (notification, device) =>
  console.log "✓ Transmitted #{JSON.stringify notification} to #{device}"
  @lastNotification = { notification, device, time: Date.now() }
@connection.on 'socketError', (err) =>
  if err?.code is 'ECONNRESET'
    if @lastNotification?.time >= Date.now() - 5000
      console.log "[!] Re-sending notification after ECONNRESET."
      setTimeout =>
        @connection.pushNotification(@lastNotification.notification, @lastNotification.device)
      , 100
    else
      console.log "[!] Not re-sending the last notification after ECONNRESET because it's too old."
@zhudan

I have encountered this problem in production. Today happened twice. I can provide a more detailed environmental. Expect you to solve this problem. (Guangzhou,China. V:1.4.4, Forgive my bad English).

{ [Error: write ECONNRESET] code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'write'

@argon

I have done some investigation regarding this problem. I can reproduce a similar problem by sending a notification then disconnecting my network connection and sending another notification. The notification will not cause any errors immediately but after a delay (up to 90seconds, probably dependent on environmental factors) it will throw the error. I believe the problem here is that the socket is not sending any keep-alive's so the socket is perhaps being dropped somewhere if it is idle for a very long period of time without the client receiving any information. I have modified the code to setKeepAlive(true) on the socket so hopefully it will stay alive over long durations. At present I do not set an interval which leaves it at the OS default.

I would like to add that I have not actually been able to replicate the problem when the socket is left dormant for a long period. It has always just worked for me, but this could be OS dependent. Hopefully someone who is seeing this error on a regular(?) basis could try the latest master branch and see if it helps. I am not prepared to release this as another version yet as I have some more pending changes to make.

@andreyvit unfortunately the problem with your solution is that if multiple notifications are sent concurrently then the ECONNRESET error will still occur but your code will only be sending the most recent even though several have failed.

@mdelete

I have installed the lastest master branch on my sandbox server now. The ECONNRESET was always triggered there after about an hour of inactivity, so I should be able to provide preliminary results later.

@mdelete

No luck with setKeepAlive(true). Still connection reset after about an hour inactivity.

Notification transmitted to:d3...
{ [Error: read ECONNRESET] code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' }
Disconnected from APNS
@mdelete

Found this interesting bit in the node.js changelogs:

Version 0.9.10
net: don't suppress ECONNRESET (Ben Noordhuis)
@argon
@andreyvit

@argon APNS will be closing idle connections occasionally, whatever you do. So TCP keepalives aren't a solution either.

The core issue here is that a transmitted notification is marked successful by the time the connection is found to be disconnected. (Of course, the real underlying issue is that the protocol does not allow idempotent retransmissions, but we can't fix that.)

@argon

Closing idle connections would be fine if it were ending them correctly so we can handle the socket closure. The key point is that they're being reset which is something that isn't detected until sending occurs.

An idea I have been thinking about is delaying the 'transmitted' event until either an error occurs - in which case we know what has successfully been transmitted based on an ID from Apple or we can take a best guess if it's a socket error - or a timeout occurs (perhaps 3-5 seconds) after which we can in most cases be confident it has been transmitted. I don't like the underlying assumptions made here though and it increases the chance of duplicate notifications being sent I think.

The new protocol framing would allow for universal identifiers for idempotent transmission, so maybe we can hope for future improvements.

@mdelete

A delay wouldn't help. If the reason is a half-closed TCP connection on the apple side, you would always need a second write() to detect the error.

@nuaimat

this issue is happening to me too.
is there any way of sending ACK signal to APNS and see if we've got a reply? if we can do that we can detect connection reset.

@argon
@andreyvit

@argon Are you sure that keep alive can even have that effect? To the best of my knowledge, the minimal TCP keep-alive period may be close to 2 hours on many implementations of TCP stack, and its goal is to be a last-resort defense against leaking memory (and ports) for broken TCP connections, not to provide an application-level service.

Besides, APN may be closing connections in some circumstances, no matter what, like any other networking service. For examples, their servers may die or get rebooted. (For all we know, given the recurrence of the problem, they may be rebooting their servers daily.) I'm not really a network expert, so I can't say how half-closed connections fit into that, but I'd say the app must be prepared to handle anything that network throws at it.

@andreyvit

OK, so I've looked at the code and the docs, and I see two issues.

First, you're simply doing setKeepAlive(true), which uses the default period of 2 hours.

Second, you can't really have a period that's less than 11-12 minutes. Even if you do setKeepAlive(true, 60) (60 seconds), it will make a few retries, so that [1]

An idle connection is terminated after approximately an additional 11 minutes (9 probes an interval of 75 seconds apart) when keep-alive is enabled.

Node.js does not seem to have an API to configure the retry timer [2], and 10 minutes may (or may not) be enough for the connection to die due to inactivity.

[1] nodejs/node-v0.x-archive#6194 (comment)
[2] nodejs/node-v0.x-archive#4109

@mdelete

I've also tried setKeepAlive(true, variousTimeouts), but with no success. APN seems to disregard keepAlive. Even so Apples own push server implementation uses a keepAlive interval of 4:45. The keepAlive defaults for linux are, as pointed out by @andreyvit:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9

Other operating systems might use different values. Unfortunately, my service runs in a virtual linux machine and I have no control over these parameters. So an OS-agnostic solution would be preferable, but a list of broken OS'es would suffice, too.

@occasl

I found this error too recently, but then I upgrade to 1.6.2 and it went away. Has this been fixed?

@argon
@chenpurong

I had this problem too.

@argon

node-apn will now reset after 1 hour of inactivity which should fix this problem.

@argon argon closed this Feb 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment