
etcd-snapshot save times out in 10 seconds the first try #9985

Closed
1 of 2 tasks
aganesh-suse opened this issue Apr 19, 2024 · 1 comment
Labels: kind/bug (Something isn't working), status/blocker

Issue found on master branch with version v1.29.4-rc1+k3s1

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 server/ 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.29.4-rc1+k3s1' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Perform etcd-snapshot save with S3 details provided:
$ sudo /usr/local/bin/k3s etcd-snapshot save --s3 --s3-bucket=<bucket> --s3-folder=<folder> --s3-region=<region> --s3-access-key=xxxx --s3-secret-key="xxxx" --debug
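Since the report notes that a retry some 30 to 60 seconds after the failed first attempt succeeds, a simple retry wrapper can serve as a stopgap until the fix lands. This is a hedged sketch, not a k3s feature; `retry_with_backoff` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical workaround sketch (not part of k3s): retry a command with a
# fixed delay between attempts, since in this report the first snapshot save
# hits a 10 s client deadline but a later attempt succeeds.
retry_with_backoff() {
  # $1 = max attempts, $2 = delay between attempts in seconds; rest = command
  attempts=$1; delay=$2; shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage against the reproduction command (S3 flags elided as in the report):
# retry_with_backoff 3 30 sudo /usr/local/bin/k3s etcd-snapshot save --s3 ...
```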

Expected behavior:

The snapshot save should succeed on the first attempt. Instead, the first attempt times out after 10 seconds, and a retry roughly 30 to 60 seconds later succeeds.

Reproducing Results/Observations:

  • k3s version used for reproduction:
$ k3s -v
k3s version v1.29.4-rc1+k3s1 (d973fadb)
go version go1.21.9
$ sudo /usr/local/bin/k3s etcd-snapshot save --s3 --s3-bucket=<bucket> --s3-folder=<folder> --s3-region=<region> --s3-access-key=xxxx --s3-secret-key="xxxx" --debug 
time="2024-04-19T18:04:31Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2024-04-19T18:04:31Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-04-19T18:04:31Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-04-19T18:04:31Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2024-04-19T18:04:31Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
time="2024-04-19T18:04:41Z" level=fatal msg="see server log for details: Post \"https://127.0.0.1:6443/db/snapshot\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Also, the journal logs show that the server was still performing the save during this timeframe, and no error was reported there when the client timed out.
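The failure mode described above (a client-side deadline firing while the server keeps working) can be reproduced in isolation. In this sketch, a 5 s handler sleep stands in for the slow `/db/snapshot` save and curl's `--max-time 2` plays the role of the CLI's 10 s HTTP client timeout; this is an illustration of the mechanism, not the k3s code path, and it assumes python3 and curl are available and port 18099 is free:

```shell
#!/bin/sh
# Start a local HTTP server whose POST handler responds slower than the
# client's deadline allows (assumed-free port 18099).
python3 - <<'EOF' &
import http.server, time

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        time.sleep(5)                 # server is still working...
        self.send_response(200)
        self.end_headers()
    def log_message(self, *args):     # keep the demo quiet
        pass

http.server.HTTPServer(("127.0.0.1", 18099), SlowHandler).serve_forever()
EOF
SRV=$!
sleep 1  # let the server bind

# ...but the client gives up first, just as the k3s CLI did at 10 s.
curl -s --max-time 2 -X POST http://127.0.0.1:18099/db/snapshot
RC=$?
echo "curl exit code: $RC"  # 28 = operation timed out

kill "$SRV" 2>/dev/null
```

The client reports a timeout even though the server never logged an error, matching the journal-log observation above.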

@aganesh-suse (Author) commented:

Validated on master branch with commit d3b6054

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 server/ 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='d3b60543e7df924881854108984593aafb557d3c' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Perform etcd-snapshot save with S3 details provided:
$ sudo /usr/local/bin/k3s etcd-snapshot save --s3 --s3-bucket=<bucket> --s3-folder=<folder> --s3-region=<region> --s3-access-key=xxxx --s3-secret-key="xxxx" --debug

Expected behavior:

The etcd-snapshot save should succeed and not time out after 10 seconds.

Validation Results:

  • k3s version used for validation:
$ k3s -v
k3s version v1.29.4+k3s-d3b60543 (d3b60543)
go version go1.21.9
$ sudo /usr/local/bin/k3s etcd-snapshot save --s3 --s3-bucket=<s3-bucket> --s3-region=<s3-region> --s3-access-key=xxxx --s3-secret-key="xxxx" --debug
time="2024-04-22T20:19:49Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2024-04-22T20:19:49Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-04-22T20:19:49Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-04-22T20:19:49Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2024-04-22T20:19:49Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
time="2024-04-22T20:20:19Z" level=info msg="Snapshot on-demand-ip-172-31-16-180-1713817190 saved."
time="2024-04-22T20:20:19Z" level=info msg="Snapshot on-demand-ip-172-31-16-180-1713817190 saved."

As the log timestamps above show, the save did not time out after 10 seconds; the client waited for the save to complete, and the save succeeded. Closing the bug.
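The elapsed time can be read straight off the timestamps in the validation log above; a small sketch to compute it (GNU `date` assumed for `-d`):

```shell
#!/bin/sh
# Timestamps copied from the validation log output above.
START="2024-04-22T20:19:49Z"
END="2024-04-22T20:20:19Z"

# Convert each timestamp to epoch seconds and subtract (GNU date assumed).
ELAPSED=$(( $(date -u -d "$END" +%s) - $(date -u -d "$START" +%s) ))
echo "save took ${ELAPSED}s"  # 30s: well past the old 10 s client deadline
```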
