re-resolve DNS record if the connection broke #195

Closed
wilhelmy opened this Issue Jan 1, 2015 · 13 comments

wilhelmy commented Jan 1, 2015

[23:48] [Freenode] !morgan.freenode.net Server Terminating. Received SIGTERM
[23:48] *** Irssi: warning SSL read error: server closed connection unexpectedly
[23:48] [Freenode] *** Irssi: Connection lost to 2610:150:4b0f::
[23:48] *** Irssi: Removed reconnection to server 2610:150:4b0f:: port 7070
[23:48] [Freenode] *** Irssi: Looking up 2610:150:4b0f::
[23:48] [Freenode] *** Irssi: Reconnecting to 2610:150:4b0f:: [2610:150:4b0f::] port 7070 
          - use /RMRECONNS to abort
[23:48] *** Irssi: warning SSL handshake failed: Broken pipe
[23:48] [Freenode] *** Irssi: Connection lost to 2610:150:4b0f::
[23:53] *** Irssi: Removed reconnection to server 2610:150:4b0f:: port 7070
[23:53] [Freenode] *** Irssi: Looking up 2610:150:4b0f::
[23:53] [Freenode] *** Irssi: Reconnecting to 2610:150:4b0f:: [2610:150:4b0f::] port 7070 
          - use /RMRECONNS to abort
[23:53] *** Irssi: warning SSL handshake failed: Broken pipe
[23:53] [Freenode] *** Irssi: Connection lost to 2610:150:4b0f::
[23:58] *** Irssi: Removed reconnection to server 2610:150:4b0f:: port 7070
[23:58] [Freenode] *** Irssi: Looking up 2610:150:4b0f::
[23:58] [Freenode] *** Irssi: Reconnecting to 2610:150:4b0f:: [2610:150:4b0f::] port 7070 
          - use /RMRECONNS to abort
[23:58] *** Irssi: warning SSL handshake failed: Broken pipe
[23:58] [Freenode] *** Irssi: Connection lost to 2610:150:4b0f::

It keeps going on like this until I manually issue /rmreconns and /connect Freenode.
Ideally, it would re-resolve the rrDNS on chat.freenode.net and connect to a different IP in the round-robin in case the connection fails.
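
For illustration, here is a minimal sketch of what "re-resolve on every reconnect" could look like with plain getaddrinfo(3). The helper name resolve_fresh and the ADDR_TEXT_LEN constant are hypothetical and not taken from the irssi source; the idea is only that the hostname, not a previously cached address, is what gets handed to the resolver on each attempt.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define ADDR_TEXT_LEN 64  /* large enough for any numeric IPv4/IPv6 string */

/*
 * Hypothetical helper (not part of irssi): resolve `host` from scratch and
 * write up to `max` numeric addresses into `out`. Returns the number of
 * addresses written, or -1 if the lookup failed.
 */
int resolve_fresh(const char *host, const char *port,
                  char out[][ADDR_TEXT_LEN], size_t max)
{
    struct addrinfo hints, *res, *ai;
    size_t n = 0;

    assert(host != NULL && out != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* accept both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP only, avoids duplicate entries */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL && n < max; ai = ai->ai_next) {
        /* NI_NUMERICHOST: just format the address, no reverse lookup */
        if (getnameinfo(ai->ai_addr, ai->ai_addrlen, out[n], ADDR_TEXT_LEN,
                        NULL, 0, NI_NUMERICHOST) == 0)
            n++;
    }
    freeaddrinfo(res);
    return (int)n;
}
```

Calling this with the original hostname (for example chat.freenode.net and port 7070) before each reconnect attempt would yield the current round-robin set rather than the single address that just failed.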

khepler commented Jul 23, 2015

Also a problem when a server moves to a different IP. Had a client constantly trying to connect to an old IP after DNS pointed to the new one. Only picked up the new IP after /rmreconns and /connect cycle.

Contributor

ailin-nemui commented Nov 1, 2015

👍

Member

LemonBoy commented Jan 27, 2016

Quick and dirty analysis of what (I suppose) happens here.
Let's start by looking at this line from the log: *** Irssi: Looking up 2610:150:4b0f::. It suggests that sig_server_looking got an IP(v6) address instead of a hostname, which is what brings up this problem.

Now let's walk up the signal chain. The signal server_looking is emitted from core/servers.c:437, and there we can see irssi is actually trying to resolve an IP address (as a protip and future FIXME: what about adding a check to avoid a roundtrip to and from the resolver when all we have is an IP address?). That address comes from server->connrec->address, which in turn is only set in server_connect_callback_readpipe, with the result of the reverse lookup, or at the point where the SERVER_CONNECT_REC is created (e.g. server_create_conn, or some other places in the reconnection code which are listed in a later note).

So it's either a case of PEBKAC, where the user connected by specifying a fixed IP, meaning we can't rely on getting a fresh and working IP when reconnecting, or irssi is at fault here.
If the latter is the case, I'd suggest investigating the place where the reverse lookup happens, since (at first glance) it seems to be the only place that might end up overwriting the address, along with sserver_connect and sig_reconnect, both in core/servers-reconnect.c.
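
The check mentioned above as a future FIXME (skip the resolver when the address is already numeric) could be sketched roughly like this. The function name is_ip_literal is hypothetical and does not exist in irssi; it relies only on inet_pton(3).

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of irssi): returns 1 if `address` is already
 * a numeric IPv4 or IPv6 literal, 0 if it looks like a hostname.
 */
int is_ip_literal(const char *address)
{
    struct in_addr v4;
    struct in6_addr v6;

    assert(address != NULL);

    /* inet_pton() returns 1 only when the text parses as that family */
    if (inet_pton(AF_INET, address, &v4) == 1)
        return 1;
    if (inet_pton(AF_INET6, address, &v6) == 1)
        return 1;
    return 0;
}
```

With a check like this, the string from the log (2610:150:4b0f::) would be recognised as a literal, which is also a cheap way to notice that the hostname has been lost somewhere along the way.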

Member

LemonBoy commented Jan 27, 2016

Quick follow up to the previous investigation work.
After a short dive into the manpage for getnameinfo(3) I've found out that there's an interesting flag named NI_NAMEREQD, which we don't set, that changes the default behaviour of the call so that it returns an error instead of the numeric IP when the lookup fails [1].

[1] rust-lang/rust#22608

khepler commented Jan 27, 2016

In my case the server was certainly specified by DNS, and the server does not support IPv6.

Member

LemonBoy commented Jan 28, 2016

Do you mean that irssi wasn't trying to resolve an ip address in your case ?

khepler commented Jan 28, 2016

I specified the server by DNS hostname when I started the program. I did not specify an IP address.

Unfortunately I do not know if irssi was making any name lookups after the server IP changed. The messages in the transcript referenced the server by hostname, but it repeatedly tried reconnecting to the old IP address.

The client system was able to resolve the new IP with other programs (ping, host) while irssi was reconnecting, and other services reconnected to the server while irssi continued to attempt reconnect. irssi reconnected the instant I /QUIT and restarted it with the command line selected from shell history and not modified.

Member

LemonBoy commented Jan 28, 2016

If the messages were like this

*** Irssi: Reconnecting to <hostname> [<ip>] port <port> - use /RMRECONNS to abort

Where hostname is the one to resolve and ip is the same one every time then I think it's a different problem.
Maybe the getaddrinfo() call returned a stale entry, or maybe there were so few responses to the dns query that random() ended up picking the same one over and over.

I'm connecting to irc.freenode.net, so I doubt this was a case of PEBCAK with a fixed IP I'm connecting to; at least resolving irc.freenode.net with host(1) returns the entire round robin and there is no entry in /etc/hosts.

For the record, I'm using FreeBSD, in case their libc resolver functions work differently. I haven't tried reproducing the issue on linux.

@khepler have you had the issue on IPv4 by chance? Maybe it's an IPv6 issue related to the fact that IPv6 addresses look different and could potentially break some internal parsing code?

Edit: I don't remember whether or not I had the issue on IPv4. Sorry about that.

In my case the server was certainly specified by DNS, and the server does not support IPv6.

hm, guess that's not it, then. Sorry, nevermind.

@LemonBoy my guess is that the resolving code either caches DNS lookups or the lookups are returned in the same order every time. My suggested fix is shuffling them around in case of a connection error, I guess, and making sure you're connecting to a different server the next time, if one is present and no address family is specified.
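
A rough sketch of the suggestion above (pick a different record than the one that just failed, if another one exists). The helper name pick_next_address is hypothetical, the addresses in the usage note are documentation placeholders, and this says nothing about how irssi actually selects a record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of irssi): pick a random entry from `addrs`,
 * skipping `failed` (the address that just broke) whenever another candidate
 * exists. Returns NULL only for an empty list.
 */
const char *pick_next_address(const char *const *addrs, size_t count,
                              const char *failed)
{
    size_t i, candidates = 0, choice;

    assert(addrs != NULL || count == 0);

    if (count == 0)
        return NULL;

    /* Count the entries that differ from the one that just failed */
    for (i = 0; i < count; i++)
        if (failed == NULL || strcmp(addrs[i], failed) != 0)
            candidates++;

    if (candidates == 0)
        return addrs[0];  /* only the failed address is known: retry it */

    /* Pick the choice-th non-failed entry (modulo bias is acceptable here) */
    choice = (size_t)rand() % candidates;
    for (i = 0; i < count; i++) {
        if (failed != NULL && strcmp(addrs[i], failed) == 0)
            continue;
        if (choice-- == 0)
            return addrs[i];
    }
    return addrs[0];  /* not reached */
}
```

Given the two records 192.0.2.1 and 192.0.2.2 with 192.0.2.1 marked as failed, this always returns 192.0.2.2; with a single record it falls back to retrying that record.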

Member

LemonBoy commented Jan 28, 2016

@wilhelmy

The issue you've reported should be fixed by the pending PR. It sounds like the reverse lookup failed and the hostname got overwritten by the IPv6 address.

The DNS resolver is already shuffling the record that gets selected (and this is a separate problem). So if we managed to pick the same one over and over, that means the random number generator is heavily skewed (and broken)... or perhaps only a single result is returned?
Can't say for sure: I'm not able to reproduce the problem and neither are you, so we're all pulling ideas out of thin air.

Sounds like it. Thanks for tracking down and fixing the issue regarding the missing PTR record!

Contributor

ailin-nemui commented Jan 29, 2016

as this change is integrated in git, I will close this for now. please reopen if you encounter this issue again with latest git
