Bouncer giving out down test-helpers for ~16 hours #377

Closed
1 task
hellais opened this issue Oct 24, 2019 · 5 comments

hellais commented Oct 24, 2019

As part of #356, the echo and HTTP test helpers were redeployed to new hosts (mia-echoth.ooni.nu and mia-httpth.ooni.nu); however, the bouncer was not updated to reflect these changes.

The old hosts, which had different IPs (37.218.247.110 for echo and 37.218.247.95 for http), were shut down at around 19:00 UTC+2.

Detection: alert from user

Timeline:
~17:00 UTC 24-10-2019 the old test helper hosts are shut down
03:00 UTC 25-10-2019 a user informs us via Twitter of this issue
09:00 UTC 25-10-2019 the bouncer is updated with the IPs of the new services

What went well:

  • Users detected this early

What went wrong:

  • We did not have alerting that takes into account the values returned by the bouncer

What we should do to prevent it happening in the future:

  • Configure monitoring to use the values returned by the bouncer to check uptime of services
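
A minimal sketch of what such a check could look like, assuming the monitoring job has already obtained a mapping of test-helper names to the addresses the bouncer advertises. How that mapping is fetched is not shown; the names `parse_host_port`, `is_reachable` and `find_down_helpers`, and the default port of 80, are hypothetical and chosen only for illustration.

```python
import socket
from urllib.parse import urlsplit


def parse_host_port(address, default_port=80):
    """Accept 'host', 'host:port' or 'scheme://host:port' (default port is an assumption)."""
    if "://" not in address:
        address = "//" + address  # lets urlsplit treat the value as a network location
    parts = urlsplit(address)
    return parts.hostname, parts.port or default_port


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_down_helpers(advertised, probe=is_reachable):
    """Return (name, address) pairs the bouncer advertises but that fail the probe."""
    down = []
    for name, address in advertised.items():
        host, port = parse_host_port(address)
        if not probe(host, port):
            down.append((name, address))
    return down
```

The point of probing the advertised addresses, rather than a static host list, is that the check fails as soon as the bouncer hands out an address nobody is listening on. Any non-empty result from `find_down_helpers` would be a candidate for an alert.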

hellais changed the title from "Bouncer giving out down test-helpers for ~10 hours" to "Bouncer giving out down test-helpers for ~16 hours" on Oct 24, 2019

darkk commented Oct 24, 2019

Another option is to take the RUM (real user monitoring) route: count the volume or percentage of erratic measurements arriving at the collector.
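
A rough sketch of that idea, assuming each measurement is available as a dict and that "erratic" means a non-null `failure` under `test_keys`. That definition, the function names, and the `threshold` and `min_samples` defaults are all illustrative assumptions, not values used by the collector.

```python
def erratic_share(measurements):
    """Fraction of measurements whose test_keys.failure is set (assumed definition of 'erratic')."""
    total = 0
    erratic = 0
    for measurement in measurements:
        total += 1
        if (measurement.get("test_keys") or {}).get("failure") is not None:
            erratic += 1
    return erratic / total if total else 0.0


def should_alert(measurements, threshold=0.5, min_samples=20):
    """Alert only when there are enough samples and the erratic share reaches the threshold."""
    measurements = list(measurements)
    if len(measurements) < min_samples:
        return False  # too few samples in this window to draw a conclusion
    return erratic_share(measurements) >= threshold
```

The `min_samples` guard is there so that a quiet window with a handful of failed measurements does not page anyone.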

hellais commented Oct 28, 2019

It turned out that the httpth was also wrongly configured on the bouncer, so we were giving out the address of a down service for it until ~21:00 UTC on 2019-10-26.

hellais commented Oct 28, 2019

For people accessing this incident issue from the internet, the incident has been fully resolved since Saturday evening.

The incident affects measurements from the following time period: 2019-10-24 16:00 UTC - 2019-10-26 23:00 UTC.

We will keep this issue open until we have taken all the steps we consider necessary to prevent it from recurring.

hellais commented Oct 28, 2019

To check whether the measurement you are looking at is a false positive:

  1. Look at the test_keys->failure key and see whether it says connection_refused (it should not say this)
  2. Check whether test_helpers->backend says something other than http://37.218.247.94:80 or 37.218.247.93
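
The two checks can be sketched as a small helper, assuming the measurement has been loaded as a dict with top-level `test_keys` and `test_helpers` keys, and that `backend` is either a plain string or an object with an `address` field. The function names and that shape are assumptions for illustration. Each check is reported separately instead of being combined into a single verdict.

```python
# The two backend values named in the checks above.
EXPECTED_BACKENDS = {"http://37.218.247.94:80", "37.218.247.93"}


def backend_address(measurement):
    """Return the backend address, whether stored as a string or as {'address': ...}."""
    backend = (measurement.get("test_helpers") or {}).get("backend")
    if isinstance(backend, dict):
        return backend.get("address")
    return backend


def triggered_checks(measurement):
    """List which of the two checks flag this measurement as a possible false positive."""
    reasons = []
    failure = (measurement.get("test_keys") or {}).get("failure")
    if failure == "connection_refused":
        reasons.append("failure is connection_refused")
    address = backend_address(measurement)
    if address is not None and address not in EXPECTED_BACKENDS:
        reasons.append(f"unexpected backend: {address}")
    return reasons
```

An empty list means neither check fired. A non-empty list is a hint that the measurement may have been taken against a down test helper and deserves a closer look.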

hellais commented Feb 18, 2020

ooni/backend#343

hellais closed this as completed on Feb 18, 2020