Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETIMEDOUT on IDLE #11

Closed
rmccue opened this issue Mar 25, 2020 · 16 comments
Closed

ETIMEDOUT on IDLE #11

rmccue opened this issue Mar 25, 2020 · 16 comments

Comments

@rmccue
Copy link

rmccue commented Mar 25, 2020

When the client is IDLEing, I get ETIMEDOUT errors:

{
  err: Error: read ETIMEDOUT
      at TLSWrap.onStreamRead (internal/stream_base_commons.js:201:27) {
    errno: 'ETIMEDOUT',
    code: 'ETIMEDOUT',
    syscall: 'read'
  }
}

The log shows a bit more detail (plus timings):

  imap:server debug 4C IDLE +15s
  imap:server debug + idling +87ms
  imap:client debug initiated IDLE +0ms
  imap:client debug breaking IDLE +5m
  imap:other error {
  err: Error: read ETIMEDOUT
      at TLSWrap.onStreamRead (internal/stream_base_commons.js:201:27) {
    errno: 'ETIMEDOUT',
    code: 'ETIMEDOUT',
    syscall: 'read'
  }
} +58s
  imap:other error {
  err: Error: read ETIMEDOUT
      at TLSWrap.onStreamRead (internal/stream_base_commons.js:201:27) {
    errno: 'ETIMEDOUT',
    code: 'ETIMEDOUT',
    syscall: 'read'
  }
} +3ms

(Using a custom logger with debug for the timings; imap:server = S:, imap:client = C:)

This appears to be linked to the socket timeout in some non-obvious way; decreasing to 2 minutes instead of 5 makes the IDLE -> NOOP -> IDLE loop work properly. It seems like when it writes the DONE, it's timing out waiting for the tagged response. This could be a Gmail-specific thing perhaps.

Still digging into what's happening here.

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

I guess bringing it down to 2 min (or make it configurable) would be an option. I’ll look into it.

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

FWIW, I don't actually think it's the timeout being too long that is the problem here; I think actually there's something wrong with the way it's writing DONE and/or waiting for the tagged response. It's possible the ETIMEDOUT is actually fine, but just isn't being caught properly.

Continuing to dig into it.

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Confirmed experimentally that Gmail will allow IDLE for at least 25 mins when directly connected, so that's not the issue.

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

In trySend it's hitting this block:
https://github.com/andris9/imapflow/blob/b8702ed3b0cca358e307b6705562d4e43a91250c/lib/imap-flow.js#L300-L302

this.currentRequest is still set to the IDLE command, so this causes trySend to return early when calling connection.exec('NOOP'). reader isn't receiving the tagged response data, which makes me think there's some buffer involved here that hasn't filled properly. Gmail's raw response is 4 OK IDLE terminated (Success) which seems long enough, so not obvious whether that's actually what's happening.

Still investigating.

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

Maybe it's compression related? You can disable compression by returning false here

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Yeah, the windowBits: 15 in compress() seems suspicious to me, and the default for the flush option is also Z_NO_FLUSH.

It's unclear to me why changing the socket timeout fixes this, but it does cause reader to receive the response line and hence clear the currentRequest.

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

windowBits size is defined by RFC4978:

The windowBits value must be in the range
-8 to -15

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Just to be clear, when I say "suspicious", I mean "this sounds like it could be causing an unfilled buffer" :)

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

Yeah, the compression thing was more or less hacked together into a state that “works”, so everything about it is kind of suspicious.

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Well, I can confirm it's not the compression.

Even stranger, it looks like this might be a protocol error of some sort. Here's what Wireshark shows:
Screen Shot 2020-03-25 at 19 18 13

It looks to me like Gmail never actually responds. Very strange.

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

Could this be your local connection issue? Eg. home router killing long TCP connections etc? Or does this happen also on a server?

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Even weirder: it looks like I fairly consistently get TCP retransmissions when sending that DONE:

Screen Shot 2020-03-25 at 19 23 15

This seems to indicate network congestion of some sort, except that it consistently happens every time, and only for the DONE request from imapflow. It doesn't occur with other clients, nor when writing data directly to the socket elsewhere. I'm also not seeing any other significant TCP retransmissions.

I... am genuinely stumped as to what's happening here.

@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Through further testing, it does indeed appear to be ISP-related rather than a generic Gmail problem, although it appears it's caused by the use of setKeepAlive(true):

https://github.com/andris9/imapflow/blob/b8702ed3b0cca358e307b6705562d4e43a91250c/lib/imap-flow.js#L994

It seems this may be the same issue as nodejs/node#24689. Increasing the timeout to 5s (the default Thunderbird value for TCP keep-alive) appears to fix this. Phew.

So, changing the above to this.socket.setKeepAlive(true, 5 * 1000) fixes this issue. From testing, anything less than a second causes this behaviour pretty consistently. That'd have a minimal impact (increasing the latency to 1s for actual disconnects, basically).

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

Oh wow, great work, thanks a lot! I didn’t even know about the time argument for setKeepAlive or that there would be a difference.

@andris9
Copy link
Collaborator

andris9 commented Mar 25, 2020

I added the time argument to setKeepAlive and did a release (v1.0.35), which should fix this then

@andris9 andris9 closed this as completed Mar 25, 2020
@rmccue
Copy link
Author

rmccue commented Mar 25, 2020

Thanks! More than anything, just glad I finally found the cause of the bug :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants