pipeline: chameleon disk full #116

Closed
2 of 3 tasks
darkk opened this issue Jul 3, 2017 · 1 comment

darkk commented Jul 3, 2017

Impact: possible loss of reports (?)

Detection: luck and curiosity

Timeline UTC:
13 Jun 11:34: @anadahz brings news to #ooni-internal that chameleon.infra.ooni.io has 86 GB free at /
15 Jun 14:30: @hellais tells @darkk (during the call) that chameleon was cleaned up; according to the munin chart only ~200 GB were actually freed
02 Jul 02:27: ooni-pipeline/ooni-pipeline.log records "2017-07-02 04:27:26...No space left on device"
03 Jul 14:30: @darkk logs into chameleon and notices that there are only 155 MB free at /
03 Jul 16:00: @darkk cleans up reports-raw at chameleon for 2017-{04,05}-* and 2017-06-{01..20}, checking the data against the set of canned files at datacollector; this frees ~1010 GB. Cleanup of sanitised files for 2017-05-* and 2017-06-{01..20} (checked against S3, roughly as sketched below) frees ~650 GB more.
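
A minimal sketch of the kind of check-before-delete used for the sanitised files, assuming they are mirrored to S3 under a per-bucket prefix; the S3 path and local directory below are illustrative, not the actual pipeline layout:

$ aws s3 ls s3://ooni-data/sanitised/2017-05-01/ | awk '{print $4}' | sort > s3.list
$ ls /data/ooni/sanitised/2017-05-01/ | sort > local.list
$ comm -23 local.list s3.list   # files present locally but missing from S3; must be empty before rm -r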

What could be done to prevent relapse and decrease impact:

  • alerting on disk space across all nodes (a minimal cron sketch follows this list)
  • alerting on pipeline failures (failures had been happening since 2017-07-02)
  • automatic cleanup of chameleon
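
A minimal sketch of the disk-space alerting item, assuming a cron-driven shell check; the threshold, mount point and alert address are placeholders, not existing OONI infrastructure config:

#!/bin/sh
# hypothetical /etc/cron.hourly/disk-space-alert
THRESHOLD=90                                            # alert when / is >= 90% full
USED=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')  # used-space percentage as a bare number
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "$(hostname): / is ${USED}% full" | mail -s "disk space alert: $(hostname)" ooni-ops@example.org
fi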

What else could be done?

It's unclear whether any data was actually lost. It seems to me that it was not: rsync's --remove-source-files should remove a source file only after it has been transferred successfully, and since removal happens during the transfer (per file), it also does not produce duplicate report files across different buckets.
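
For reference, the transfer in question is along these lines; the flag is real rsync behaviour, the host and paths are illustrative:

$ rsync -a --remove-source-files collector:/data/reports-raw/ /data/reports-raw/2017-07-03/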
@hellais do you have any ideas how to double-check that no files were lost?
ooni-pipeline-cron.log is a bad clue here: the $TEMP_DIR/fail file can still be created (there are free inodes), but any attempt to write to it should fail (no disk space):

$ find tmp.MJMZZMKod6 tmp.FCkAAllTHY tmp.8MLYjpci8m -ls
19144742    4 drwx------   2 soli     soli         4096 Jul  1 04:00 tmp.MJMZZMKod6
19144743    4 drwx------   2 soli     soli         4096 Jul  2 06:30 tmp.FCkAAllTHY
19139487    0 -rw-rw-r--   1 soli     soli            0 Jul  2 06:32 tmp.FCkAAllTHY/fail
18882602    4 drwx------   2 soli     soli         4096 Jul  3 04:08 tmp.8MLYjpci8m
18878873    0 -rw-rw-r--   1 soli     soli            0 Jul  3 06:22 tmp.8MLYjpci8m/fail
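
To illustrate (the mount point is hypothetical, not taken from chameleon): on a filesystem that has run out of data blocks but still has free inodes, creating the marker file succeeds while writing to it fails:

$ touch /mnt/full/fail              # succeeds: only an inode and a directory entry are allocated
$ echo failed >> /mnt/full/fail     # fails: write error: No space left on device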

darkk commented Jul 6, 2017

We have not found any definite proof of data loss. The incident affected bucket 2017-07-03, which is smaller than usual, but the next bucket is larger than usual. I assume the data was buffered at the collectors, but I did not check that during the incident.

12943M	2017-06-21
10761M	2017-06-22
15036M	2017-06-23
11400M	2017-06-24
11673M	2017-06-25
11917M	2017-06-26
11832M	2017-06-27
13104M	2017-06-28
11432M	2017-06-29
11184M	2017-06-30
12538M	2017-07-01
11581M	2017-07-02
 9046M	2017-07-03
14517M	2017-07-04
12980M	2017-07-05
13002M	2017-07-06

darkk closed this as completed on Feb 22, 2019