b.collector down for 8.5 hours #157

Closed
2 of 9 tasks
darkk opened this issue Sep 10, 2017 · 6 comments

Comments


darkk commented Sep 10, 2017

Impact: TBD, it's the primary collector for the mobile app

Detection: email & IRC alert

Timeline UTC, Sep 10:
00:00:01: b systemd[1]: Starting Certbot...
00:00:03: certbot Should renew, less than 30 days before certificate expiry 2017-10-09 23:01:00 UTC. Running pre-hook command: docker stop ooni-backend-b.collector.ooni.io
00:00:15: certbot: Running post-hook command: ... && docker start ooni-backend-b.collector.ooni.io
00:00:32: checkForStaleReports() ValueError: time data '2017-50-20 18:7:3' does not match format '%Y-%m-%d %H:%M:%S'
00:05:55: AlertManager FIRING InstanceDown
07:32:41: @channel can somebody look at this?
07:41:41: good morning
08:30:55: AlertManager RESOLVED InstanceDown
10:14:95: incident published

What went well:

  • alerting works
  • it was possible to move broken files away from /data/b.collector.ooni.io/raw_reports to /home/darkk/20170910/ to hotfix the issue
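The manual hotfix above could be scripted roughly as follows. This is a minimal sketch, not what was actually run: it assumes each report has a JSON metadata file with a `test_start_time` field, uses the format string from the oonib traceback, and the `*.json` glob pattern and both function names are hypothetical. It is written for Python 3, while oonib itself runs on Python 2.7.

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

# Format string taken from the oonib traceback in this thread.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_well_formed(metadata_path):
    """Return True if the file is JSON with a test_start_time that parses."""
    try:
        details = json.loads(Path(metadata_path).read_text())
        datetime.strptime(details["test_start_time"], TIME_FORMAT)
    except (OSError, ValueError, KeyError, TypeError):
        return False
    return True


def quarantine_malformed(report_dir, quarantine_dir, pattern="*.json"):
    """Move metadata files that fail is_well_formed() into quarantine_dir.

    The "*.json" pattern is an assumption about how metadata files are named.
    Returns the names of the files that were moved.
    """
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(Path(report_dir).glob(pattern)):
        if not is_well_formed(path):
            shutil.move(str(path), str(quarantine_dir / path.name))
            moved.append(path.name)
    return moved
```

The sketch only moves the metadata files it inspects. Whether the matching report bodies have to be moved as well depends on the on-disk layout, which is not described in this thread.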

What went wrong:

What is still unclear:

  • there were 7000 files in main.report_dir, with 7000 corresponding metadata.json files dated from 2017-07-20 to 2017-09-09. The list of these files is at /home/darkk/20170910/archive.ls-ltr. The files were moved to main.archive_dir after the daemon restarted successfully. Is this a case that should be monitored? It seems the spice was flowing from b.collector.ooni.io according to ooni-pipeline-cron.log at chameleon.infra.ooni.io; rsync was taking tens of minutes.
  • there is a significant number of log lines in /data/b.collector.ooni.io/var/log/ooni like 404 POST /report/20170714T164513Z_AS47589_Lw5RfUUfj5kHbr1MGn7WmnmxKQX3WmqZM3gmrykqRuSTpZUt10, do these lines mean we're dropping data on the floor? It seems the client thinks so and retries.
  • what does "primary collector for the mobile app" mean? Does the mobile app retry failed uploads? Does it retry them with a different collector? Is it possible that one part of a report is uploaded to one collector and another part goes to another one?

What could be done to prevent relapse and decrease impact:

  • something caused those 7000 reports to get stuck; the cause should be identified and fixed
  • something caused 12 malformed raw reports to kill oonib on restart; the cause should be identified and fixed
  • the source of these 12 malformed raw reports is TransportCanary/0.0.10-beta, what is it?
  • get some insistent notifications for urgent actionable alerts, as discussed in IRC (see also #128, "b.web-connectivity.th down for 7.6 hours")
  • move letsencrypt updates to team "business hours", since people sleep at UTC midnight (see also #128)
  • avoid stale pid files somehow: clean up the twisted PID file on container start || grab flock on pid || randomize daemon pid so it's not pid=1 (see also #128)
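The "grab flock on pid" option from the last item could look roughly like this. It is a sketch under assumptions, not how twistd handles its PID file: the function name `acquire_pid_lock` is hypothetical, and the code relies on POSIX `flock` semantics via Python's `fcntl` module.

```python
import fcntl
import os


def acquire_pid_lock(pid_path):
    """Take an exclusive, non-blocking flock on pid_path and record our PID.

    Returns the open file object (keep it open for the daemon's lifetime),
    or None if another process already holds the lock.
    """
    fh = open(pid_path, "a+")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Somebody else holds the lock, so a daemon really is running.
        fh.close()
        return None
    # We own the lock: whatever PID was in the file is stale, overwrite it.
    fh.seek(0)
    fh.truncate()
    fh.write("%d\n" % os.getpid())
    fh.flush()
    return fh
```

The point of the design is that the kernel drops the lock when the process exits, so a PID file left over from a stopped container can never block the next start, even if the recorded PID (for example pid=1) happens to be alive again.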

darkk commented Nov 9, 2017

Once again.

Timeline UTC:
09 Nov 00:05: [FIRING:1] InstanceDown (https://b.collector.ooni.io/invalidpath
09 Nov 08:42: @darkk | looking at it
09 Nov 09:15: [RESOLVED] InstanceDown

The cause is the same: malformed raw reports from TransportCanary/0.0.10-beta. The trigger is the same: the letsencrypt cert update. Another batch is stored in /home/darkk/20171109/


darkk commented Jan 8, 2018

FTR, once again. Timeline UTC:
08 Jan 00:05 [FIRING:1] BlackboxDown
08 Jan 00:20 [RESOLVED] BlackboxDown


darkk commented Sep 20, 2018

FTR, once again. Timeline UTC:
20 Sep 00:05 [FIRING] https://c.collector.ooni.io/invalidpath endpoint down
20 Sep 09:21 @hellais I am looking into it just now
20 Sep 09:40 [RESOLVED] https://c.collector.ooni.io/invalidpath endpoint down

The stacktrace was

  File "/usr/local/lib/python2.7/dist-packages/oonib/report/handlers.py", line 188, in checkForStaleReports
    closeReport(report_id)
  File "/usr/local/lib/python2.7/dist-packages/oonib/report/handlers.py", line 166, in closeReport
    report_id)
  File "/usr/local/lib/python2.7/dist-packages/oonib/report/handlers.py", line 25, in report_file_path
    timestamp = datetime.strptime(report_details['test_start_time'], "%Y-%m-%d %H:%M:%S")
  File "/usr/lib/python2.7/_strptime.py", line 328, in _strptime
    data_string[found.end():])
ValueError: unconverted data remains: .774990

That was probably caused by libnettest2 (testing? version 0.0.0 does not sound like a release version):

{"software_name": "libnettest2", "software_version": "0.0.0", "format": "json", "test_start_time": "0000-00-00 426827:38:04"
{"software_name": "libnettest2", "software_version": "0.0.0", "format": "json", "test_start_time": "2018-09-10 11:25:45.774990"
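The two metadata lines above fail in different ways: the second only carries extra fractional seconds, while the first is not a valid date at all. A tolerant parser could accept the former and reject the latter without raising. The following is a hedged sketch, not the fix that went into oonib; `parse_test_start_time` and `ACCEPTED_FORMATS` are hypothetical names.

```python
from datetime import datetime

# The first format is the one from the traceback; the second adds the
# fractional seconds seen in the libnettest2 metadata.
ACCEPTED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M:%S.%f")


def parse_test_start_time(value):
    """Return a datetime, or None if value matches none of ACCEPTED_FORMATS."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue
    return None
```

A caller such as a stale-report sweep could then skip or set aside any report for which this returns None, instead of letting the ValueError propagate and take the daemon down on startup.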

xref: ooni/backend#115


darkk commented Nov 4, 2018

FTR, once again. Timeline UTC:
04 Nov 00:00 DOWN https://b.collector.ooni.io/invalidpath
04 Nov 11:56 UP

The stacktrace was ValueError: unconverted data remains: .000000

It was libnettest2 once again:

{"software_name": "libnettest2", "software_version": "0.0.0", "format": "json", "test_start_time": "2018-09-10 11:39:31.000000", "test_name": "web_connectivity", "data_format_version": "0.2.0", "test_version": "0.0.1", "input_hashes": [], "probe_asn": "AS0", "probe_cc": "ZZ"}


darkk commented Apr 29, 2019

Re "get some insistent notifications for urgent actionable alerts":

That's decided as WONTFIX for the moment: #158

hellais added this to Icebox in OONI-Verse on Oct 8, 2019

hellais commented Feb 18, 2020

ooni/backend#343

hellais closed this as completed on Feb 18, 2020