
wish: implement a retry policy #12

Open
aivanise opened this issue Dec 6, 2022 · 10 comments

aivanise (Contributor) commented Dec 6, 2022

Hi,

Thanks for a wonderful tool, it saved my life a couple of times already :)

I have a large-ish cluster (6 nodes, 150+ containers) and there is always something going on: a backup, devs playing around and overloading individual nodes, upgrades, maintenance, etc. So more often than not lxc times out and the whole service "fails", like this:

Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> deleting snapshot: auto-20221206-040026
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: error: lxc returned a non-zero status code and said:
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> [ FAILED ]
...
Dec 06 09:03:37 lxd10.2e-systems.com lxd-snapper[30497]: Error: Some instances couldn't be pruned

In some cases this is a real problem, because a snapshot that is not deleted on time keeps using disk space, which is sometimes scarce. Would it be possible to implement some kind of retry policy, preferably configurable, like:

retry: 5
retry-interval: 30s
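
For reference, here is a rough sketch of what such a retry loop around an lxc invocation could look like, written in Rust since that is what lxd-snapper is built in. The run_lxc_with_retries name and its parameters are hypothetical, just mapping the proposed retry / retry-interval keys onto code; this is not part of the current code base.

use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Hypothetical helper: re-run an `lxc` invocation up to `retries` times,
// sleeping `retry_interval` between failed attempts. The two values would
// come from the proposed `retry` / `retry-interval` config keys.
fn run_lxc_with_retries(
    args: &[&str],
    retries: u32,
    retry_interval: Duration,
) -> Result<(), String> {
    let mut last_err = String::new();

    for attempt in 1..=retries.max(1) {
        let output = Command::new("lxc")
            .args(args)
            .output()
            .map_err(|err| err.to_string())?;

        if output.status.success() {
            return Ok(());
        }

        last_err = String::from_utf8_lossy(&output.stderr).into_owned();

        if attempt < retries {
            eprintln!("lxc failed (attempt {attempt}/{retries}), retrying in {retry_interval:?}");
            sleep(retry_interval);
        }
    }

    Err(format!("lxc kept failing after {retries} attempts: {last_err}"))
}

With the config proposed above, pruning a snapshot would then boil down to something like run_lxc_with_retries(&["delete", "instance/snapshot"], 5, Duration::from_secs(30)).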

Patryk27 (Owner) commented Dec 6, 2022

Oh, that's nice - yeah, should be doable 🙂

Would you see such retry & retry-interval as a global configuration for all containers, or do you have a use case for specifying different retry options for different policies / containers / remotes?

aivanise (Contributor, Author) commented Dec 6, 2022

Whatever is easier to implement. I don't actually have a use case for having it separate per policy, as I don't see how exec(lxc) could fail differently depending on the policy - maybe if snapshot removal at the ZFS level is slower depending on the snapshots around it, but that is a stretch.

although... openzfs/zfs#11933, but still a stretch ;)

aivanise (Contributor, Author)

One more thing here, somewhat related: lxc can also get stuck and never exit, so it would be nice to have a timeout on exec calls to lxc. This happened to me just now, and since it was running from a systemd unit that was missing TimeoutStartSec, it was happily hanging in there "as a service" for two weeks until I realized there were no more snapshots ;)

Patryk27 (Owner) commented Mar 16, 2023

it would be nice to have a timeout on exec calls to lxc.

Oh, this I can implement pretty quickly! 😄

Check out the current master - I've just added lxc-timeout (with a default of 10 minutes), which lets you specify the maximum waiting time for each invocation of lxc.
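
In case anyone wonders how a cap like this can be enforced in general: below is a minimal, illustrative sketch using only the Rust standard library (spawn the child, poll try_wait, kill it once the deadline passes). It is not the actual lxd-snapper implementation, just the general idea behind an lxc-timeout-style option; run_lxc_with_timeout is a made-up name.

use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Illustrative only: run `lxc` with the given arguments and kill it if it
// has not exited within `timeout` (e.g. the configured `lxc-timeout`).
// Returns Ok(true) on success, Ok(false) on failure or timeout.
fn run_lxc_with_timeout(args: &[&str], timeout: Duration) -> std::io::Result<bool> {
    let mut child = Command::new("lxc").args(args).spawn()?;
    let started = Instant::now();

    loop {
        // `try_wait` returns immediately: `Some(status)` once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(status.success());
        }

        if started.elapsed() >= timeout {
            child.kill()?; // give up on this invocation
            child.wait()?; // reap the process
            return Ok(false);
        }

        sleep(Duration::from_millis(100));
    }
}

The important property is that the loop returns as soon as lxc exits; the timeout is only an upper bound, not a fixed waiting time.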

aivanise (Contributor, Author) commented Mar 20, 2023

I've tried it out and it actually makes every call to lxc take the full lxc-timeout instead of timing it out ;)

# stdbuf -i0 -o0 -e0 time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 12:00:52] Backing-up
[2023-03-20 12:00:52] ----------
[2023-03-20 12:00:52]
[2023-03-20 12:00:53] AEE/aee-qc
[2023-03-20 12:00:53]   - creating snapshot: auto-20230320-110053 [ OK ]
[2023-03-20 12:00:53]
[2023-03-20 12:03:53]
[2023-03-20 12:03:53] Pruning
[2023-03-20 12:03:53] -------
[2023-03-20 12:03:53]
[2023-03-20 12:03:54] AEE/aee-qc
[2023-03-20 12:03:54]
^CCommand terminated by signal 2
0.29user 0.88system 4:26.23elapsed 0%CPU (0avgtext+0avgdata 28168maxresident)k
0inputs+0outputs (0major+20946minor)pagefaults 0swaps

# head -3 /tmp/lxd-snapper.conf
# this is yaml
lxc-timeout: 3 min
policies:

Patryk27 (Owner)

Huh, that's pretty random - I've just re-checked on my machine and everything seems to be working as intended, i.e. the commands complete without any extra delay:

pwy@ubu:~/Projects/lxd-snapper$ stdbuf -i0 -o0 -e0 time ./target/release/lxd-snapper backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 14:25:38] Backing-up
[2023-03-20 14:25:38] ----------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - creating snapshot: auto-20230320-132538 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Backing-up summary
[2023-03-20 14:25:38] ------------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   created snapshots: 1
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning
[2023-03-20 14:25:38] -------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - keeping snapshot: auto-20230320-132538
[2023-03-20 14:25:38]   - deleting snapshot: auto-20230320-132510 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning summary
[2023-03-20 14:25:38] ---------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   deleted snapshots: 1
[2023-03-20 14:25:38]   kept snapshots: 1
0.14user 0.20system 0:00.50elapsed 68%CPU (0avgtext+0avgdata 27472maxresident)k
0inputs+0outputs (0major+50077minor)pagefaults 0swaps
pwy@ubu:~/Projects/lxd-snapper$ cat config.yaml 
lxc-timeout: 3 min

policies:
  every-instance:
    keep-last: 1
pwy@ubu:~/Projects/lxd-snapper$ 

Which OS and kernel are you using? 👀

aivanise (Contributor, Author)

I'm on CentOS Stream 8, kernel 4.18.0-408.el8.x86_64.

Maybe you need to add at least one more machine to be able to see it, as in my case the delays are between the machines, i.e. [ OK ] appears immediately, but it then waits the full lxc-timeout before moving on to the next one.

Patryk27 (Owner)

Yeah, I did check on multiple machines - even with a few different kernel versions (4.14, 4.9 & 5.4) 🤔

Would you mind checking this binary?

(it's lxd-snapper built via Nix, through nix build .#packages.x86_64-linux.default && cp ./result/bin/lxd-snapper . - that's to make sure the compiler or dynamic linking isn't playing any tricks here 😄)

aivanise (Contributor, Author)

Same result. I have noticed, however, that according to ps -e f it spawns lxc list and hangs there for the duration of the timeout. An identical lxc list command issued on the command line returns within seconds. So it might be something else, not the timeout per se. The version that works for me is the latest release (v1.3.0), so it might be something added to master after that.

1512271 pts/0    S+     0:00  |           \_ time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512282 pts/0    S+     0:00  |           |   \_ /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512582 pts/0    Sl+    0:00  |           |       \_ lxc list local: --project=default --format=json
1512272 pts/0    S+     0:00  |           \_ awk { print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }

Patryk27 (Owner)

Okie, I've just prepared a different implementation - feel free to check out the current master branch if you find a minute 🙂
