
wish: implement a retry policy #12

Open
aivanise opened this issue Dec 6, 2022 · 10 comments

aivanise (Contributor) commented Dec 6, 2022

Hi,

Thanks for a wonderful tool, it saved my life a couple of times already :)

I have a large-ish cluster (6 nodes, 150+ containers) and there is always something going on: a backup, devs playing around and overloading individual nodes, upgrades, maintenance, etc. So more often than not lxc times out and the whole service "fails", like this:

Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> deleting snapshot: auto-20221206-040026
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: error: lxc returned a non-zero status code and said:
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> [ FAILED ]
...
Dec 06 09:03:37 lxd10.2e-systems.com lxd-snapper[30497]: Error: Some instances couldn't be pruned

In some cases this is a real problem, because a snapshot that is not deleted on time keeps using disk space, which is sometimes scarce. Would it be possible to implement some kind of retry policy, preferably configurable, like:

retry: 5
retry-interval: 30s
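
For reference, here is a rough sketch of what such a retry loop around an lxc invocation could look like, written in Rust since that is what lxd-snapper is built in. The run_lxc_with_retries name and its parameters are hypothetical, just mapping the proposed retry / retry-interval keys onto code; this is not part of the current code base.

use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Hypothetical helper: re-run an `lxc` invocation up to `retries` times,
// sleeping `retry_interval` between failed attempts. The two values would
// come from the proposed `retry` / `retry-interval` config keys.
fn run_lxc_with_retries(
    args: &[&str],
    retries: u32,
    retry_interval: Duration,
) -> Result<(), String> {
    let mut last_err = String::new();

    for attempt in 1..=retries.max(1) {
        let output = Command::new("lxc")
            .args(args)
            .output()
            .map_err(|err| err.to_string())?;

        if output.status.success() {
            return Ok(());
        }

        last_err = String::from_utf8_lossy(&output.stderr).into_owned();

        if attempt < retries {
            eprintln!("lxc failed (attempt {attempt}/{retries}), retrying in {retry_interval:?}");
            sleep(retry_interval);
        }
    }

    Err(format!("lxc kept failing after {retries} attempts: {last_err}"))
}

With the config proposed above, pruning a snapshot would then boil down to something like run_lxc_with_retries(&["delete", "instance/snapshot"], 5, Duration::from_secs(30)).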

Patryk27 (Owner) commented Dec 6, 2022

Oh, that's nice - yeah, should be doable 🙂

Would you see such retry & retry-interval as a global configuration for all containers, or do you have a use case for specifying different retry options for different policies / containers / remotes?

aivanise (Contributor, Author) commented Dec 6, 2022

Whatever is easier to implement. I don't actually have a use case for having it separate per policy, as I don't see how exec(lxc) could fail differently depending on the policy - maybe if snapshot removal at the ZFS level is slower depending on the snapshots around it, but that is a stretch.

although... openzfs/zfs#11933, but still a stretch ;)

aivanise (Contributor, Author)

One more thing here, somewhat related: lxc can also get stuck and never exit, so it would be nice to have a timeout on exec calls to lxc. This happened to me just now, and since it was running from a systemd unit that was missing TimeoutStartSec, it was happily hanging in there "as a service" for two weeks until I realized there were no more snapshots ;)

Patryk27 (Owner) commented Mar 16, 2023

it would be nice to have a timeout on exec calls to lxc.

Oh, this I can implement pretty quickly! 😄

Check out the current master - I've just added lxc-timeout (with a default of 10 minutes), which lets you specify the maximum waiting time for each invocation of lxc.
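
In case anyone wonders how a cap like this can be enforced in general: below is a minimal, illustrative sketch using only the Rust standard library (spawn the child, poll try_wait, kill it once the deadline passes). It is not the actual lxd-snapper implementation, just the general idea behind an lxc-timeout-style option; run_lxc_with_timeout is a made-up name.

use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Illustrative only: run `lxc` with the given arguments and kill it if it
// has not exited within `timeout` (e.g. the configured `lxc-timeout`).
// Returns Ok(true) on success, Ok(false) on failure or timeout.
fn run_lxc_with_timeout(args: &[&str], timeout: Duration) -> std::io::Result<bool> {
    let mut child = Command::new("lxc").args(args).spawn()?;
    let started = Instant::now();

    loop {
        // `try_wait` returns immediately: `Some(status)` once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(status.success());
        }

        if started.elapsed() >= timeout {
            child.kill()?; // give up on this invocation
            child.wait()?; // reap the process
            return Ok(false);
        }

        sleep(Duration::from_millis(100));
    }
}

The important property is that the loop returns as soon as lxc exits; the timeout is only an upper bound, not a fixed waiting time.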

aivanise (Contributor, Author) commented Mar 20, 2023

I've tried it out and it actually makes every call to lxc take the full lxc-timeout instead of timing it out ;)

# stdbuf -i0 -o0 -e0 time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 12:00:52] Backing-up
[2023-03-20 12:00:52] ----------
[2023-03-20 12:00:52]
[2023-03-20 12:00:53] AEE/aee-qc
[2023-03-20 12:00:53]   - creating snapshot: auto-20230320-110053 [ OK ]
[2023-03-20 12:00:53]
[2023-03-20 12:03:53]
[2023-03-20 12:03:53] Pruning
[2023-03-20 12:03:53] -------
[2023-03-20 12:03:53]
[2023-03-20 12:03:54] AEE/aee-qc
[2023-03-20 12:03:54]
^CCommand terminated by signal 2
0.29user 0.88system 4:26.23elapsed 0%CPU (0avgtext+0avgdata 28168maxresident)k
0inputs+0outputs (0major+20946minor)pagefaults 0swaps

# head -3 /tmp/lxd-snapper.conf
# this is yaml
lxc-timeout: 3 min
policies:

Patryk27 (Owner)

Huh, that's pretty random - I've just re-checked on my machine and everything seems to be working as intended, i.e. the commands complete without any extra delay:

pwy@ubu:~/Projects/lxd-snapper$ stdbuf -i0 -o0 -e0 time ./target/release/lxd-snapper backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 14:25:38] Backing-up
[2023-03-20 14:25:38] ----------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - creating snapshot: auto-20230320-132538 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Backing-up summary
[2023-03-20 14:25:38] ------------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   created snapshots: 1
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning
[2023-03-20 14:25:38] -------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - keeping snapshot: auto-20230320-132538
[2023-03-20 14:25:38]   - deleting snapshot: auto-20230320-132510 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning summary
[2023-03-20 14:25:38] ---------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   deleted snapshots: 1
[2023-03-20 14:25:38]   kept snapshots: 1
0.14user 0.20system 0:00.50elapsed 68%CPU (0avgtext+0avgdata 27472maxresident)k
0inputs+0outputs (0major+50077minor)pagefaults 0swaps
pwy@ubu:~/Projects/lxd-snapper$ cat config.yaml 
lxc-timeout: 3 min

policies:
  every-instance:
    keep-last: 1
pwy@ubu:~/Projects/lxd-snapper$ 

Which OS and kernel are you using? 👀

aivanise (Contributor, Author)

I'm on CentOS Stream 8, kernel 4.18.0-408.el8.x86_64.

Maybe you need to add at least one more machine to be able to see it, as in my case the delays are between the machines, i.e. [ OK ] appears immediately, but it then waits the full lxc-timeout before moving on to the next one.

Patryk27 (Owner)

Yeah, I did check on multiple machines - even with a few different kernel versions (4.14, 4.9 & 5.4) 🤔

Would you mind checking this binary?

(it's lxd-snapper built via Nix, through nix build .#packages.x86_64-linux.default && cp ./result/bin/lxd-snapper . - that's to make sure the compiler or dynamic linking isn't playing any tricks here 😄)

aivanise (Contributor, Author)

Same result. I have noticed, however, that according to ps -e f it spawns lxc list and hangs there for the duration of the timeout. An identical lxc list command issued on the command line returns within seconds. So it might be something else, not the timeout per se. The version that works for me is the latest release (v1.3.0), so it might be something added to master after that.

1512271 pts/0    S+     0:00  |           \_ time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512282 pts/0    S+     0:00  |           |   \_ /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512582 pts/0    Sl+    0:00  |           |       \_ lxc list local: --project=default --format=json
1512272 pts/0    S+     0:00  |           \_ awk { print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }

Patryk27 (Owner)

Okie, I've just prepared a different implementation - feel free to check out the current master branch if you find a minute 🙂
