pipeline: chameleon disk full #116

Closed
2 of 3 tasks
darkk opened this issue Jul 3, 2017 · 1 comment

darkk commented Jul 3, 2017

Impact: possible loss of reports (?)

Detection: luck and curiosity

Timeline UTC:
13 Jun 11:34: @anadahz brings news to #ooni-internal that chameleon.infra.ooni.io has 86 GB free at /
15 Jun 14:30: @hellais tells @darkk (during the call) that chameleon was cleaned up; according to the munin chart only ~200 GB were actually freed
02 Jul 02:27: ooni-pipeline/ooni-pipeline.log records "2017-07-02 04:27:26...No space left on device"
03 Jul 14:30: @darkk logs into chameleon and notices that there are only 155 MB free at /
03 Jul 16:00: @darkk cleans up reports-raw at chameleon for 2017-{04,05}-* and 2017-06-{01..20}, checking the data against the set of canned files at datacollector; this frees ~1010 GB. Cleanup of sanitised files for 2017-05-* and 2017-06-{01..20} (checked against S3, roughly as sketched below) frees ~650 GB more.
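
A minimal sketch of the kind of check-before-delete used for the sanitised files, assuming they are mirrored to S3 under a per-bucket prefix; the S3 path and local directory below are illustrative, not the actual pipeline layout:

$ aws s3 ls s3://ooni-data/sanitised/2017-05-01/ | awk '{print $4}' | sort > s3.list
$ ls /data/ooni/sanitised/2017-05-01/ | sort > local.list
$ comm -23 local.list s3.list   # files present locally but missing from S3; must be empty before rm -r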

What could be done to prevent relapse and decrease impact:

  • alerting on disk space across all nodes (a minimal cron sketch follows this list)
  • alerting on pipeline failures (failures had been happening since 2017-07-02)
  • automatic cleanup of chameleon
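
A minimal sketch of the disk-space alerting item, assuming a cron-driven shell check; the threshold, mount point and alert address are placeholders, not existing OONI infrastructure config:

#!/bin/sh
# hypothetical /etc/cron.hourly/disk-space-alert
THRESHOLD=90                                            # alert when / is >= 90% full
USED=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')  # used-space percentage as a bare number
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "$(hostname): / is ${USED}% full" | mail -s "disk space alert: $(hostname)" ooni-ops@example.org
fi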

What else could be done?

It's unclear whether any data was actually lost. It seems to me that it was not: rsync's --remove-source-files should remove a source file only after it has been transferred successfully, and since removal happens during the transfer (per file), it also does not produce duplicate report files across different buckets.
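
For reference, the transfer in question is along these lines; the flag is real rsync behaviour, the host and paths are illustrative:

$ rsync -a --remove-source-files collector:/data/reports-raw/ /data/reports-raw/2017-07-03/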
@hellais do you have any ideas how to double-check that no files were lost?
ooni-pipeline-cron.log is a bad clue here: the $TEMP_DIR/fail file can still be created (there are free inodes), but any attempt to write to it should fail (no disk space):

$ find tmp.MJMZZMKod6 tmp.FCkAAllTHY tmp.8MLYjpci8m -ls
19144742    4 drwx------   2 soli     soli         4096 Jul  1 04:00 tmp.MJMZZMKod6
19144743    4 drwx------   2 soli     soli         4096 Jul  2 06:30 tmp.FCkAAllTHY
19139487    0 -rw-rw-r--   1 soli     soli            0 Jul  2 06:32 tmp.FCkAAllTHY/fail
18882602    4 drwx------   2 soli     soli         4096 Jul  3 04:08 tmp.8MLYjpci8m
18878873    0 -rw-rw-r--   1 soli     soli            0 Jul  3 06:22 tmp.8MLYjpci8m/fail
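
To illustrate (the mount point is hypothetical, not taken from chameleon): on a filesystem that has run out of data blocks but still has free inodes, creating the marker file succeeds while writing to it fails:

$ touch /mnt/full/fail              # succeeds: only an inode and a directory entry are allocated
$ echo failed >> /mnt/full/fail     # fails: write error: No space left on device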

darkk commented Jul 6, 2017

We have not found any definite proof of data loss. The incident affected bucket 2017-07-03, which is smaller than usual, but the next bucket is larger than usual. I assume the data was buffered at the collectors, but I did not check that during the incident.

12943M	2017-06-21
10761M	2017-06-22
15036M	2017-06-23
11400M	2017-06-24
11673M	2017-06-25
11917M	2017-06-26
11832M	2017-06-27
13104M	2017-06-28
11432M	2017-06-29
11184M	2017-06-30
12538M	2017-07-01
11581M	2017-07-02
 9046M	2017-07-03
14517M	2017-07-04
12980M	2017-07-05
13002M	2017-07-06

darkk closed this as completed on Feb 22, 2019