error restoring snapshot #3326

Closed
pvmarius opened this issue Jul 26, 2017 · 1 comment
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature
@pvmarius

Hi
We are running into an error trying to restore a snapshot:

time consul snapshot restore test.snap
Error restoring snapshot: Unexpected response code: 500 (Raft error when restoring snapshot: timed out enqueuing operation)

real	3m14.793s
user	0m0.704s
sys	0m4.228s

And then a different error:

time consul snapshot restore test.snap
Error restoring snapshot: Put http://127.0.0.1:8500/v1/snapshot: EOF

real	5m36.831s
user	0m0.180s
sys	0m4.668s

Config:

  • consul 0.9.0
  • single node cluster
  • instance AWS m4.large and m4.xlarge (initially)
  • snapshot size ~ 3.8GB on disk / compressed
  • OS Amazon Linux AMI 2017.03

Workaround:

  • used AWS i3.large instance
  • mounted /tmp and /var/lib/consul on the nvme disk
    /dev/nvme0n1p1 on /tmp type ext4 (rw,relatime,data=ordered)
    /dev/nvme0n1p2 on /var/lib/consul type ext4 (rw,relatime,data=ordered)
    
  • run the restore again
    time consul snapshot restore test.snap
    Restored snapshot
    
    real	1m37.715s
    user	0m0.440s
    sys	0m3.980s
    

Issue seems to be the fixed timeout here:

Is it possible to make this timeout configurable?

Thanks

@slackpad slackpad added type/enhancement Proposed improvement or new feature theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner labels Sep 2, 2017
@slackpad
Contributor

slackpad commented Sep 2, 2017

Seems like we should be able to plumb that down.

@slackpad slackpad added this to the 1.0.2 milestone Dec 4, 2017
slackpad added a commit that referenced this issue Dec 13, 2017
Fixes #3326

Originally I started out making this configurable, but realized it wasn't
worth the complexity. If you are restoring a snapshot, you really need to
wait until it's done, or something bad happens like losing leadership, which
will already trigger an error. Rather than have operators have to tune this
to cover whatever their snapshot size is, we just make it block here. There's
also already a blocking barrier right after the restore that hasn't been a
problem.