IRRd connection failure handling #85

Closed
bluikko opened this issue Sep 19, 2021 · 5 comments

Contributor

bluikko commented Sep 19, 2021

arouteserver could fail over to a secondary IRRd when the default IRRd, rr.ntt.net, is inaccessible or unresponsive.

In the past, rr.ntt.net spent half a day in a state where connections were accepted and queries could be sent, but a response was never received (a problem with the IRRd at NTT).
The problem seems to be exacerbated by bgpq4 lacking proper failure handling and timeouts in such a state: it took an excessive number of minutes (dozens?) for bgpq4 to time out (I am not 100% sure the queries did time out; maybe @job has some insight into this).

While rr.ntt.net was unresponsive, it also came to light that there is a secondary public IRRd, rr1.ntt.net (which also supports IPv6!), so it should be possible to fail over from the primary IRRd to the secondary one when the former has a problem.

pierky added this to the v1.11.0 milestone on Sep 21, 2021
Owner

pierky commented Sep 24, 2021

Hello @bluikko,

in 1a7fdfa I've introduced a mechanism that monitors the execution time of bgpq3/bgpq4 and kills the sub-process when it seems stuck. The timeout can be set in the program's configuration file (arouteserver.yml, bgpq3_timeout setting). Also, the setting where the IRRd host is configured (bgpq3_host) now accepts a list of hosts; when a query fails (either because of a timeout or because of other issues), the next host in that list is used. If all the hosts in the list time out, the process is aborted.
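
The timeout-plus-host-list behaviour described in this comment can be illustrated with a minimal Python sketch. This is not the code from 1a7fdfa: `run_with_failover`, `AllHostsFailed` and `fake_query` are hypothetical names made up for the illustration, and `fake_query` stands in for a real bgpq3/bgpq4 invocation so that the example runs without network access.

```python
import subprocess
import sys


class AllHostsFailed(Exception):
    """Raised when every host in the list timed out or returned an error."""


def run_with_failover(build_cmd, hosts, timeout):
    """Run a command once per host, in order, until one succeeds.

    build_cmd(host) returns the argv list for that host; timeout is in seconds.
    Returns (host, stdout) for the first host that answers.
    """
    errors = []
    for host in hosts:
        try:
            # subprocess.run kills the child process when the timeout expires.
            result = subprocess.run(
                build_cmd(host),
                capture_output=True,
                text=True,
                timeout=timeout,
                check=True,
            )
            return host, result.stdout
        except subprocess.TimeoutExpired:
            errors.append((host, "timeout"))
        except (subprocess.CalledProcessError, OSError) as exc:
            errors.append((host, str(exc)))
    # No host answered: give up and report what went wrong for each one.
    raise AllHostsFailed(errors)


def fake_query(host):
    # Hypothetical stand-in for bgpq4: the "stuck" host sleeps, the others answer.
    delay = 30 if host == "stuck.example" else 0
    code = (
        "import sys, time; "
        "time.sleep({}); "
        "print('reply from ' + sys.argv[1])".format(delay)
    )
    return [sys.executable, "-c", code, host]
```

Calling `run_with_failover(fake_query, ["stuck.example", "ok.example"], timeout=2)` gives up on the first host after two seconds and returns the reply from the second one; with only the stuck host in the list it raises `AllHostsFailed`.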

Together, these two mechanisms should address the issues that you've mentioned. I'd like to hear your feedback on them.

Also, I'm not too sure about the default timeout I'm proposing. At the moment I've set it to 2 minutes, which I think should be enough to complete queries against big data sets. What are your thoughts on this value?

I've used time bgpq4 -h rr.ntt.net -S RADB -3 -j -4 -A -l prefix_list AS-HURRICANE (so, a query against HE's rset) to get an idea of how long a big query could take; it gave me results in the range of 7-100 seconds.
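
For repeating this kind of measurement, a small Python helper along the following lines could be used. It is only a sketch: `time_command` is a hypothetical name, and it reports wall-clock time per run, which roughly corresponds to the "real" figure printed by the shell's time.

```python
import subprocess
import time


def time_command(argv, runs=3):
    """Run argv several times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is captured and discarded; only the elapsed time matters here.
        subprocess.run(argv, capture_output=True, check=False)
        durations.append(time.perf_counter() - start)
    return durations
```

With bgpq4 installed, the query above would be passed as `["bgpq4", "-h", "rr.ntt.net", "-S", "RADB", "-3", "-j", "-4", "-A", "-l", "prefix_list", "AS-HURRICANE"]`, and `min()` and `max()` of the returned list give the observed range.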

Contributor Author

bluikko commented Sep 27, 2021

Sounds good to me! If there is a release candidate, I can try to test it, but I don't have a good idea of how to properly replicate the IRRd failure that happened earlier.
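
One possible way to approximate that failure mode (connections accepted, queries readable, no response ever sent) is a local TCP server that stays silent. The sketch below is an assumption about how such a test could be set up, not something taken from arouteserver or bgpq4; `start_silent_server` is a hypothetical name, and how to point bgpq3/bgpq4 at a non-default port depends on the tool's own options.

```python
import socket
import threading


def start_silent_server(host="127.0.0.1", port=0):
    """Start a TCP server that accepts connections and reads data but never replies.

    Returns (server_socket, port). With port=0 the OS picks a free port.
    The worker threads are daemons, so they end together with the process.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()

    def drain(conn):
        # Read and discard whatever the client sends; never write anything back.
        with conn:
            try:
                while conn.recv(4096):
                    pass
            except OSError:
                pass  # client went away abruptly

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # server socket was closed
            threading.Thread(target=drain, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]
```

A client that connects to the returned port can send its query successfully but will block on reading until its own timeout fires, which is the situation a per-query timeout is meant to catch.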

I tested the query and consistently get results in under 20 seconds. That is the largest AS-set I am aware of, so 2 minutes sounds reasonable; better to have a timeout that is too large than one that is too small.

Owner

pierky commented Sep 29, 2021

Thanks for the feedback @bluikko, and for volunteering to test the candidate release.

I've pushed https://test.pypi.org/project/arouteserver/1.11.0a1/ and the 1.11.0-alpha1 tag on DockerHub. Instructions on how to install alpha pre-releases can be found at https://arouteserver.readthedocs.io/en/latest/INSTALLATION.html#development-and-pre-release-versions.

Contributor Author

bluikko commented Sep 30, 2021

Tested the failover mechanism and it looks good to me.

pierky closed this as completed in 1a7fdfa on Oct 7, 2021
Owner

pierky commented Oct 7, 2021

Thanks @bluikko, I've just pushed the latest changes to master. The CI/CD pipeline should complete in 1 hour and, if everything goes well, 1.11.0 should be out with this new feature.
