New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected end of JSON input after server crash #1221

Closed
amir-stratoscale opened this Issue Sep 7, 2015 · 20 comments

Comments

Projects
None yet
10 participants
@amir-stratoscale

amir-stratoscale commented Sep 7, 2015

After rebooting a server running Consul agent (in server mode), attempting to restart Consul's agent with its previously used parameters throws these errors:
==> Starting Consul agent...
==> Error starting agent: unexpected end of JSON input

Looking in the directories: <consul_data_dir>/checks & <consul_data_dir>/services, I see several files with size 0.

The server's reboot occurred at the same time that we registered health checks to Consul, so I believe that the files are corrupt (empty actually) because they are not being written atomically.

  1. Can the files written to the directories mentioned be written atomically?
  2. As a workaround, is it safe to delete these directories and restart Consul? It seems to work.

Thanks,
Amir

@slackpad

This comment has been minimized.

Show comment
Hide comment
@slackpad

slackpad Sep 8, 2015

Contributor

Yeah this could probably done in a little safer manner to reduce the chances of corrupting them. It should be safe to delete them and restart, as long as you re-register the checks since the agent is the source of truth about what's registered.

Contributor

slackpad commented Sep 8, 2015

Yeah this could probably done in a little safer manner to reduce the chances of corrupting them. It should be safe to delete them and restart, as long as you re-register the checks since the agent is the source of truth about what's registered.

@JeanMertz

This comment has been minimized.

Show comment
Hide comment
@JeanMertz

JeanMertz Sep 26, 2015

👍 for this. We ran into this as well:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting raft data migration...
==> Starting Consul agent...
==> Error starting agent: failed decoding service from /data/services/3d2e6bded046ca2e3f906f04235b279b: unexpected end of JSON input

JeanMertz commented Sep 26, 2015

👍 for this. We ran into this as well:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting raft data migration...
==> Starting Consul agent...
==> Error starting agent: failed decoding service from /data/services/3d2e6bded046ca2e3f906f04235b279b: unexpected end of JSON input
@krestjaninoff

This comment has been minimized.

Show comment
Hide comment
@krestjaninoff

krestjaninoff Oct 21, 2015

The same problem :(

krestjaninoff commented Oct 21, 2015

The same problem :(

@slackpad slackpad added the bug label Jan 8, 2016

@MrMMorris

This comment has been minimized.

Show comment
Hide comment
@MrMMorris

MrMMorris Feb 12, 2016

saw this as well with 0.6.0. I just made a config change then restarted consul and a couple of the clients had empty files in services/ that when I deleted and restarted, solved the problem.

MrMMorris commented Feb 12, 2016

saw this as well with 0.6.0. I just made a config change then restarted consul and a couple of the clients had empty files in services/ that when I deleted and restarted, solved the problem.

@MrMMorris

This comment has been minimized.

Show comment
Hide comment
@MrMMorris

MrMMorris Feb 15, 2016

This happened about half the times I made changes to node_ttl on a running consul client and then restarted consul. Let me know if I can help debug.

I did see serf: Failed to update snapshot: write /opt/consul/data/serf/local.snapshot: no space left on device before the restart. Not sure if related.

When there was 2 empty files, I actually got a useful error for the second one:

Error starting agent: failed decoding service file "/opt/consul/data/services/4e2b1752dd646044507e9ac846479a90": unexpected end of JSON input

MrMMorris commented Feb 15, 2016

This happened about half the times I made changes to node_ttl on a running consul client and then restarted consul. Let me know if I can help debug.

I did see serf: Failed to update snapshot: write /opt/consul/data/serf/local.snapshot: no space left on device before the restart. Not sure if related.

When there was 2 empty files, I actually got a useful error for the second one:

Error starting agent: failed decoding service file "/opt/consul/data/services/4e2b1752dd646044507e9ac846479a90": unexpected end of JSON input

@alistanis

This comment has been minimized.

Show comment
Hide comment
@alistanis

alistanis Feb 24, 2016

I have a pull request up for this in #1750

alistanis commented Feb 24, 2016

I have a pull request up for this in #1750

@camerondavison

This comment has been minimized.

Show comment
Hide comment
@camerondavison

camerondavison Aug 3, 2016

Contributor

This looks like the same problem that nomad had in hashicorp/nomad#1357 which IMHO is a two part solution. Firstly fix the write to be atomic like in #1750 but secondly still allow startup of corrupt state.

Contributor

camerondavison commented Aug 3, 2016

This looks like the same problem that nomad had in hashicorp/nomad#1357 which IMHO is a two part solution. Firstly fix the write to be atomic like in #1750 but secondly still allow startup of corrupt state.

camerondavison added a commit to novilabs/consul that referenced this issue Aug 3, 2016

camerondavison added a commit to novilabs/consul that referenced this issue Aug 3, 2016

camerondavison added a commit to novilabs/consul that referenced this issue Aug 3, 2016

@alistanis

This comment has been minimized.

Show comment
Hide comment
@alistanis

alistanis Aug 4, 2016

Out of curiosity, did you find something in my implementation that you didn't agree with, and is startup with corrupt state a desirable thing for a system responsible for dns and service discovery? I personally feel like consul's config files need to be as guaranteed as possible.

alistanis commented Aug 4, 2016

Out of curiosity, did you find something in my implementation that you didn't agree with, and is startup with corrupt state a desirable thing for a system responsible for dns and service discovery? I personally feel like consul's config files need to be as guaranteed as possible.

@flegler

This comment has been minimized.

Show comment
Hide comment
@flegler

flegler Aug 15, 2016

Not sure if hashicorp/nomad#1367 is the same thing so I am also posting here.

+1 for this fix.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

flegler commented Aug 15, 2016

Not sure if hashicorp/nomad#1367 is the same thing so I am also posting here.

+1 for this fix.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

@flegler

This comment has been minimized.

Show comment
Hide comment
@flegler

flegler Aug 29, 2016

I just realized that my previous comment was for a similar issue with Nomad. Now I have experienced it with Consul too on 0.6.3.dev.

# /path/consul agent -config-file /path/config.json
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/path/services/18501c56aa3648940425d400c66dad9a": unexpected end of JSON input

# ls -la /path/services/18501c56aa3648940425d400c66dad9a
-rw------- 1 root root 0 Aug 29 19:50 /path/services/18501c56aa3648940425d400c66dad9a

# /path/consul version
Consul 0.6.3.dev
Consul Protocol: 3 (Understands back to: 1)

flegler commented Aug 29, 2016

I just realized that my previous comment was for a similar issue with Nomad. Now I have experienced it with Consul too on 0.6.3.dev.

# /path/consul agent -config-file /path/config.json
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/path/services/18501c56aa3648940425d400c66dad9a": unexpected end of JSON input

# ls -la /path/services/18501c56aa3648940425d400c66dad9a
-rw------- 1 root root 0 Aug 29 19:50 /path/services/18501c56aa3648940425d400c66dad9a

# /path/consul version
Consul 0.6.3.dev
Consul Protocol: 3 (Understands back to: 1)
@zg

This comment has been minimized.

Show comment
Hide comment
@zg

zg Jan 8, 2018

This issue is still valid in 0.8.5 with an empty JSON file in /var/consul/services:

# consul version
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# cat /var/consul/services/foo
# /usr/bin/consul agent -config-dir=/etc/consul.d -pid-file /tmp/consul.pid
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/var/consul/services/foo": unexpected end of JSON input

zg commented Jan 8, 2018

This issue is still valid in 0.8.5 with an empty JSON file in /var/consul/services:

# consul version
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# cat /var/consul/services/foo
# /usr/bin/consul agent -config-dir=/etc/consul.d -pid-file /tmp/consul.pid
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/var/consul/services/foo": unexpected end of JSON input
@slackpad

This comment has been minimized.

Show comment
Hide comment
@slackpad

slackpad Jan 10, 2018

Contributor

Hi @zg did you experience a crash/running out of disk space to get the empty file, or did it occur some other way?

Contributor

slackpad commented Jan 10, 2018

Hi @zg did you experience a crash/running out of disk space to get the empty file, or did it occur some other way?

@zg

This comment has been minimized.

Show comment
Hide comment
@zg

zg Jan 10, 2018

I created the tmp file myself as a proof of concept, but previously there was an incident where consul couldn’t start due to zero byte tmp files that were in the data directory that we previously were unaware of.

zg commented Jan 10, 2018

I created the tmp file myself as a proof of concept, but previously there was an incident where consul couldn’t start due to zero byte tmp files that were in the data directory that we previously were unaware of.

@slackpad slackpad modified the milestones: 0.7.1, 1.0.3 Jan 10, 2018

@slackpad

This comment has been minimized.

Show comment
Hide comment
@slackpad

slackpad Jan 10, 2018

Contributor

I see. I'll kick this back open so we can take a look.

Contributor

slackpad commented Jan 10, 2018

I see. I'll kick this back open so we can take a look.

@slackpad slackpad reopened this Jan 10, 2018

@pjelar

This comment has been minimized.

Show comment
Hide comment
@pjelar

pjelar Jan 15, 2018

@zg how did you resolve this? I have an issue where I just pushed a rolling cluster update using kops and one of my consul instances failed to start as it's looking for a service file that doesn't exist.

==> Error starting agent: failed decoding service file "/var/lib/consul/services/a369186d812d2a6ea01b87b6956ef828-340fa39d-14d1-e3f2-1305-d2228422425d.tmp": unexpected end of JSON input

pjelar commented Jan 15, 2018

@zg how did you resolve this? I have an issue where I just pushed a rolling cluster update using kops and one of my consul instances failed to start as it's looking for a service file that doesn't exist.

==> Error starting agent: failed decoding service file "/var/lib/consul/services/a369186d812d2a6ea01b87b6956ef828-340fa39d-14d1-e3f2-1305-d2228422425d.tmp": unexpected end of JSON input
@MrMMorris

This comment has been minimized.

Show comment
Hide comment
@MrMMorris

MrMMorris Jan 15, 2018

@pjelar delete the empty services file and restart consul

MrMMorris commented Jan 15, 2018

@pjelar delete the empty services file and restart consul

@pjelar

This comment has been minimized.

Show comment
Hide comment
@pjelar

pjelar Jan 15, 2018

@MrMMorris there wasn't a file there. I upgraded to 1.0.0 and the problem went away.

pjelar commented Jan 15, 2018

@MrMMorris there wasn't a file there. I upgraded to 1.0.0 and the problem went away.

@zg

This comment has been minimized.

Show comment
Hide comment
@zg

zg Jan 15, 2018

It could be that an upgrade deletes the tmp files.

zg commented Jan 15, 2018

It could be that an upgrade deletes the tmp files.

@pjelar

This comment has been minimized.

Show comment
Hide comment
@pjelar

pjelar Jan 15, 2018

The files weren't there @zg I deleted them. I tried placing valid .json files there too. More likely that 1.0.0 handles this issue better than 0.8.5, it wasn't a normal situation as I also had to do a peers.json manual recovery.

pjelar commented Jan 15, 2018

The files weren't there @zg I deleted them. I tried placing valid .json files there too. More likely that 1.0.0 handles this issue better than 0.8.5, it wasn't a normal situation as I also had to do a peers.json manual recovery.

@zg

This comment has been minimized.

Show comment
Hide comment
@zg

zg Jan 15, 2018

That’s how I resolved my issue as well, by deleting the empty JSON files.

zg commented Jan 15, 2018

That’s how I resolved my issue as well, by deleting the empty JSON files.

kyhavlov added a commit that referenced this issue Jan 19, 2018

Add graceful handling of malformed persisted service/check files.
Previously a change was made to make the file writing atomic,
but that wasn't enough to cover something like an OS crash so we
needed something here to handle the situation more gracefully.

Fixes #1221.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment