IRRd connection failure handling #85

bluikko · 2021-09-19T05:30:55Z

arouteserver could fail over to a secondary IRRd when the default IRRd, rr.ntt.net, is unreachable or unresponsive.

In the past, rr.ntt.net spent half a day in a state where connections were accepted and queries could be sent, but no response was ever received (a problem with the IRRd at NTT).
The problem seems to be exacerbated by bgpq4 lacking proper failure handling or timeouts in such a state: it took an excessive number of minutes (dozens?) for bgpq4 to time out (I am not 100% sure the queries did time out; maybe @job has some insight into this).

While rr.ntt.net was unresponsive, it also came to light that there is a secondary public IRRd, rr1.ntt.net (which also supports IPv6!), so it should be possible to fail over from the primary IRRd to the secondary one when the former has a problem.

pierky · 2021-09-24T17:59:42Z

Hello @bluikko,

in 1a7fdfa I've introduced a mechanism that monitors the execution time of bgpq3/bgpq4 and kills the sub-process when it seems stuck. The timeout can be set in the program's configuration file (arouteserver.yml, bgpq3_timeout setting). Also, the setting where the IRRD host is configured (bgpq3_host) now accepts a list of hosts; when a query fails (either because of timeout or other issues), the next host in that list is used. If all the hosts in the list time out, the process is aborted.
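
A minimal sketch of the timeout-plus-failover idea described above, for illustration only. It is not the code from 1a7fdfa: the names `run_with_failover`, `AllHostsFailed`, `build_cmd` and `runner` are hypothetical, and the 120-second default simply mirrors the 2-minute value discussed in this thread. It relies on `subprocess.run` killing the child process when its `timeout` expires.

```python
import subprocess
from typing import Callable, Sequence


class AllHostsFailed(RuntimeError):
    """Raised when the query failed against every host in the list."""


def run_with_failover(
    hosts: Sequence[str],
    build_cmd: Callable[[str], Sequence[str]],
    timeout_s: float = 120.0,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Try the command against each host in order; return the first stdout.

    A host is skipped when the sub-process exceeds timeout_s, exits with a
    non-zero status, or cannot be started at all.
    """
    errors = []
    for host in hosts:
        cmd = list(build_cmd(host))
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses, so a stuck query cannot hang forever.
            result = runner(
                cmd,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                check=True,
            )
            return result.stdout
        except subprocess.TimeoutExpired:
            errors.append(f"{host}: timed out after {timeout_s}s")
        except (subprocess.CalledProcessError, OSError) as exc:
            errors.append(f"{host}: {exc}")
    # Every host failed: abort instead of continuing with missing data.
    raise AllHostsFailed("; ".join(errors))
```

With the two hosts named in this issue, a call could look like `run_with_failover(["rr.ntt.net", "rr1.ntt.net"], lambda h: ["bgpq4", "-h", h, "-S", "RADB", "-3", "-j", "-4", "-A", "-l", "prefix_list", "AS-HURRICANE"])`. The `runner` parameter exists only so the behaviour can be exercised without a real IRRd.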

The combination of these two mechanisms should provide a solution to the issues that you've mentioned. I'd like to hear your feedback on it.

Also, I'm not too sure about the default timeout I'm proposing. At the moment I've set it to 2 minutes, which I think should be enough to complete queries against big data sets. What are your thoughts on this value?

I ran time bgpq4 -h rr.ntt.net -S RADB -3 -j -4 -A -l prefix_list AS-HURRICANE (so, a query against HE's rset) to get an idea of how long a big query could take; the results ranged from 7 to 100 seconds.

bluikko · 2021-09-27T01:30:50Z

Sounds good to me! If there is a release candidate I can try to test it, but I don't have a good idea of how to properly replicate the IRRd failure that happened earlier.

I tested the query and consistently get results under 20 seconds. That is the largest AS-set I am aware of, so 2 minutes sounds reasonable; better to have a timeout that is too large than one that is too small.

pierky · 2021-09-29T08:46:17Z

Thanks for the feedback @bluikko, and for volunteering to test the candidate release.

I've pushed https://test.pypi.org/project/arouteserver/1.11.0a1/ and the 1.11.0-alpha1 tag on DockerHub. Instructions on how to install alpha pre-releases can be found at https://arouteserver.readthedocs.io/en/latest/INSTALLATION.html#development-and-pre-release-versions.

bluikko · 2021-09-30T09:34:44Z

Tested the failover mechanism and it looks good to me.

pierky · 2021-10-07T17:43:25Z

Thanks @bluikko, I've just pushed the latest changes to master. The CI/CD pipeline should complete in 1 hour and, if everything goes well, 1.11.0 should be out with this new feature.

pierky added this to the v1.11.0 milestone Sep 21, 2021

pierky closed this as completed in 1a7fdfa Oct 7, 2021
