IRRd connection failure handling #85
Comments
Hello @bluikko, in 1a7fdfa I've introduced a mechanism that monitors the execution time of The sum of these 2 mechanisms should provide a solution to the issues that you've mentioned. I'd like to hear your feedback on it. Also, I'm not too sure about the default timeout I'm proposing. At the moment I've set it to 2 minutes, which I think should be enough to complete queries against big data sets. What are your thoughts on this value? I've used
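For readers following the thread, the idea of bounding the execution time of an external command can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from commit 1a7fdfa: the helper name `run_with_timeout` and the constant `DEFAULT_TIMEOUT` are hypothetical, and the only real API used is Python's standard `subprocess.run` with its `timeout` argument.

```python
import subprocess
import sys

# Assumption: 120 seconds mirrors the 2-minute default discussed in the thread.
DEFAULT_TIMEOUT = 120


def run_with_timeout(cmd, timeout=DEFAULT_TIMEOUT):
    """Run cmd and return its stdout, or None if it fails or exceeds timeout.

    Hypothetical helper for illustration; not part of arouteserver.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output as str
            timeout=timeout,      # the child is killed once this many seconds pass
            check=True,           # a non-zero exit status raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        # The command hung, e.g. the server accepted the query but never replied.
        return None
    except (subprocess.CalledProcessError, OSError):
        # The command exited with an error or could not be started at all.
        return None
    return result.stdout


if __name__ == "__main__":
    # A deliberately slow child process stands in for a stuck query.
    slow = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(slow, timeout=0.5))
```

Returning `None` keeps the sketch short; a caller that needs to tell a timeout apart from another failure would probably let the exceptions propagate instead.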
Sounds good to me! If there will be a release candidate I can try to test it, but I don't have good ideas on how to properly replicate the IRRd failure that happened earlier. I tested the query and it completed quite consistently in under 20 seconds. That is the largest AS-set I am aware of, so 2 minutes sounds reasonable; better to have a timeout that is too large than one that is too small.
Thanks for the feedback @bluikko, and for volunteering to test the candidate release. I've pushed https://test.pypi.org/project/arouteserver/1.11.0a1/ and 1.11.0-alpha1 tag on DockerHub. Instructions on how to install alpha pre-releases can be found at https://arouteserver.readthedocs.io/en/latest/INSTALLATION.html#development-and-pre-release-versions.
Tested the failover mechanism and it looks good to me.
Thanks @bluikko, I've just pushed the latest changes to master. The CI/CD pipeline should complete in 1 hour, and if everything goes well 1.11.0 should be out with this new feature.
arouteserver could fail over to a secondary IRRd when the default IRRd, rr.ntt.net, is not accessible or not responsive.
In the past, rr.ntt.net spent half a day in a state where connections were accepted and queries could be sent, but a response was never received (a problem with the IRRd at NTT).
The problem seems to be exacerbated by bgpq4 not having proper failure handling or timeouts in such a state: it took an excessive number of minutes (dozens?) for bgpq4 to time out (I am not 100% sure the queries did time out; maybe @job has some insight into this).
While rr.ntt.net was unresponsive, it also came to light that there is a secondary public IRRd, rr1.ntt.net (which also includes IPv6 support!), so it could be possible to fail over from the primary IRRd to the secondary IRRd in case of a problem with the former.
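
The failover requested here could look roughly like the following sketch, which tries each IRRd host in order and moves on when a query times out or fails. It is an illustration of the idea, not arouteserver's actual implementation: `query_with_failover` and `bgpq4_query` are hypothetical names, the bgpq4 options used (`-h` for the IRR server, `-j` for JSON output, `-l` for the list name) are assumed from bgpq4's documented usage, and `AS-EXAMPLE` is a placeholder AS-set.

```python
import subprocess

# Primary and secondary IRRd hosts, taken from the issue text.
IRRD_HOSTS = ("rr.ntt.net", "rr1.ntt.net")

# Errors that should trigger a switch to the next host.
FAILOVER_ERRORS = (
    subprocess.TimeoutExpired,
    subprocess.CalledProcessError,
    OSError,
)


def bgpq4_query(host, as_set, timeout):
    """Hypothetical helper: expand as_set with bgpq4 against one IRRd host."""
    # Assumed bgpq4 options: -h <server>, -j (JSON output), -l <list name>.
    cmd = ["bgpq4", "-h", host, "-j", "-l", "prefixes", as_set]
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


def query_with_failover(as_set, hosts=IRRD_HOSTS, timeout=120, query=bgpq4_query):
    """Return (host, output) from the first host that answers within timeout."""
    errors = []
    for host in hosts:
        try:
            return host, query(host, as_set, timeout)
        except FAILOVER_ERRORS as exc:
            # An unresponsive or failing server: remember why, then try the next.
            errors.append((host, exc))
    raise RuntimeError(f"all IRRd hosts failed: {errors}")


if __name__ == "__main__":
    # Simulate the incident from the issue: the primary accepts the query
    # but never answers, so the call falls through to the secondary.
    def fake_query(host, as_set, timeout):
        if host == "rr.ntt.net":
            raise subprocess.TimeoutExpired(cmd="bgpq4", timeout=timeout)
        return "simulated response"

    print(query_with_failover("AS-EXAMPLE", query=fake_query))
```

Passing the query function as a parameter keeps the failover loop testable without network access or a bgpq4 binary; with the default `query`, the per-host timeout also bounds how long a hung bgpq4 process can block the run.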