Impact: possible loss of reports (?)
Detection: luck and curiosity

Timeline UTC:
13 Jun 11:34: @anadahz reports in #ooni-internal that chameleon.infra.ooni.io has 86 GB free at /
15 Jun 14:30: @hellais tells @darkk (during the call) that chameleon was cleaned up; according to the munin chart, only ~200 GB were actually freed
02 Jul 02:27: `2017-07-02 04:27:26...No space left on device` appears in ooni-pipeline/ooni-pipeline.log
03 Jul 14:30: @darkk logs into chameleon and notices that there are only 155 MB free at /
03 Jul 16:00: @darkk cleans up reports-raw at chameleon for 2017-{04,05}-* and 2017-06-{01..20}, checking the data against the set of canned files at datacollector; this frees ~1010 GB. A cleanup of sanitised files for 2017-05-* and 2017-06-{01..20} (checked against S3) frees ~650 GB more (a sketch of the verify-then-delete idea follows below).
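For illustration, a minimal sketch of the verify-then-delete idea behind that cleanup step. It assumes, hypothetically, that per-report reference copies are reachable over ssh; the actual cleanup compared against the canned bucket archives and S3, so the real commands differed. All hosts and paths below are placeholders.

```bash
#!/bin/bash
# Sketch only: delete a local report file only after proving that an identical
# reference copy exists elsewhere. Hosts and paths are placeholders.
set -euo pipefail

LOCAL_DIR=/data/reports-raw              # hypothetical local directory on chameleon
REMOTE=datacollector.example.org         # hypothetical host holding reference copies
REMOTE_DIR=/data/reports-reference       # hypothetical reference directory

for f in "$LOCAL_DIR"/2017-0[45]-* "$LOCAL_DIR"/2017-06-{01..20}*; do
    [ -e "$f" ] || continue
    name=$(basename "$f")
    local_sum=$(sha256sum "$f" | awk '{print $1}')
    remote_sum=$(ssh "$REMOTE" "sha256sum '$REMOTE_DIR/$name'" | awk '{print $1}') || continue
    if [ "$local_sum" = "$remote_sum" ]; then
        rm -- "$f"                       # remove only after the copies match
    else
        echo "mismatch, keeping $name" >&2
    fi
done
```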
What could be done to prevent a recurrence and decrease impact:
alerting on disk space across all nodes (see the sketch after this list)
alerting on pipeline failures (the failure happened on 2017-07-02)
automatic cleanup of chameleon
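As a rough illustration of the first item, a minimal low-disk-space check that could run from cron on every node. The mount point, threshold and notification command are placeholders; in practice this would more likely be wired into the existing monitoring (munin or a dedicated alerting system).

```bash
#!/bin/bash
# Sketch only: warn when free space on a filesystem drops below a threshold.
set -euo pipefail

MOUNT=/                 # filesystem to watch
THRESHOLD_GB=50         # alert when free space drops below this (placeholder value)

# GNU df: print only the available column, scaled to GB, and strip non-digits.
free_gb=$(df -BG --output=avail "$MOUNT" | tail -n1 | tr -dc '0-9')

if [ "$free_gb" -lt "$THRESHOLD_GB" ]; then
    # Replace with the team's real notification channel (mail, chat webhook, ...).
    echo "$(hostname): only ${free_gb}GB free on $MOUNT" | mail -s "disk space alert" root
fi
```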
What else could be done?
It's unclear if any data was actually lost; it seems that's not the case, since rsync's `--remove-source-files` removes a source file only after it has been transferred successfully. The removal happens file by file during the transfer, so it should not produce duplicate report files across different buckets either. @hellais, do you have any idea how to double-check that no files were lost? ooni-pipeline-cron.log is a bad clue, as the `$TEMP_DIR/fail` file can be created (there are free inodes), but any attempt to write to the file should fail (no disk space).
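For illustration only (this is unrelated to the actual pipeline paths or logs), that behaviour can be reproduced on a throwaway tmpfs: creating the file succeeds because inodes are still available, but writing data to it fails with "No space left on device".

```bash
#!/bin/bash
# Sketch only: demonstrate "file created, writes fail" under ENOSPC on a tiny tmpfs.
set -euo pipefail

mnt=$(mktemp -d)
sudo mount -t tmpfs -o size=1M,mode=1777 tmpfs "$mnt"

# Fill the filesystem completely.
dd if=/dev/zero of="$mnt/filler" bs=1M 2>/dev/null || true

# Creating the empty "fail" marker still succeeds: inodes remain available...
touch "$mnt/fail" && echo "create: ok"

# ...but writing any content to it fails with ENOSPC.
echo "pipeline error text" > "$mnt/fail" || echo "write failed: No space left on device"

sudo umount "$mnt" && rmdir "$mnt"
```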
We've not found any definite proof of data loss. The incident affected the 2017-07-03 bucket, which is smaller than usual, but the next bucket is larger than usual. I assume the data was buffered at the collectors, but I've not checked that during the incident.
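One way such a double-check might look, assuming (hypothetically) that the collectors keep an archive of report filenames and that the processed data lives under an S3 prefix; the host, paths and bucket below are made up, and in reality the sanitised filenames may not match the raw ones exactly:

```bash
#!/bin/bash
# Sketch only: list reports known to a collector but missing downstream.
set -euo pipefail

COLLECTOR=collector.example.org                  # hypothetical collector host
S3_PREFIX=s3://example-ooni-data/reports/        # hypothetical S3 prefix

# Filenames the collector has seen (e.g. from its own archive).
ssh "$COLLECTOR" 'ls /data/report-archive' | sort > collector-files.txt

# Filenames that made it into the buckets on S3.
aws s3 ls "$S3_PREFIX" --recursive | awk '{print $4}' | xargs -rn1 basename | sort > s3-files.txt

# Anything present at the collector but absent from S3 is a candidate for a lost report.
comm -23 collector-files.txt s3-files.txt > possibly-lost.txt
wc -l possibly-lost.txt
```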