Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected end of JSON input after server crash #1221

Closed
amir-stratoscale opened this issue Sep 7, 2015 · 23 comments · Fixed by #2240
Closed

unexpected end of JSON input after server crash #1221

amir-stratoscale opened this issue Sep 7, 2015 · 23 comments · Fixed by #2240
Labels
type/bug Feature does not function as expected
Milestone

Comments

@amir-stratoscale
Copy link

After rebooting a server running Consul agent (in server mode), attempting to restart Consul's agent with its previously used parameters throws these errors:
==> Starting Consul agent...
==> Error starting agent: unexpected end of JSON input

Looking in the directories: <consul_data_dir>/checks & <consul_data_dir>/services, I see several files with size 0.

The server's reboot occurred at the same time that we registered health checks to Consul, so I believe that the files are corrupt (empty actually) because they are not being written atomically.

  1. Can the files written to the directories mentioned be written atomically?
  2. As a workaround, is it safe to delete these directories and restart Consul? It seems to work.

Thanks,
Amir

@slackpad
Copy link
Contributor

slackpad commented Sep 8, 2015

Yeah this could probably done in a little safer manner to reduce the chances of corrupting them. It should be safe to delete them and restart, as long as you re-register the checks since the agent is the source of truth about what's registered.

@JeanMertz
Copy link

👍 for this. We ran into this as well:

==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting raft data migration...
==> Starting Consul agent...
==> Error starting agent: failed decoding service from /data/services/3d2e6bded046ca2e3f906f04235b279b: unexpected end of JSON input

@krestjaninoff
Copy link

The same problem :(

@slackpad slackpad added the type/bug Feature does not function as expected label Jan 8, 2016
@MrMMorris
Copy link

saw this as well with 0.6.0. I just made a config change then restarted consul and a couple of the clients had empty files in services/ that when I deleted and restarted, solved the problem.

@MrMMorris
Copy link

This happened about half the times I made changes to node_ttl on a running consul client and then restarted consul. Let me know if I can help debug.

I did see serf: Failed to update snapshot: write /opt/consul/data/serf/local.snapshot: no space left on device before the restart. Not sure if related.

When there was 2 empty files, I actually got a useful error for the second one:

Error starting agent: failed decoding service file "/opt/consul/data/services/4e2b1752dd646044507e9ac846479a90": unexpected end of JSON input

@alistanis
Copy link

I have a pull request up for this in #1750

@camerondavison
Copy link
Contributor

This looks like the same problem that nomad had in hashicorp/nomad#1357 which IMHO is a two part solution. Firstly fix the write to be atomic like in #1750 but secondly still allow startup of corrupt state.

@alistanis
Copy link

Out of curiosity, did you find something in my implementation that you didn't agree with, and is startup with corrupt state a desirable thing for a system responsible for dns and service discovery? I personally feel like consul's config files need to be as guaranteed as possible.

@flegler-zz
Copy link

Not sure if hashicorp/nomad#1367 is the same thing so I am also posting here.

+1 for this fix.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

@flegler-zz
Copy link

I just realized that my previous comment was for a similar issue with Nomad. Now I have experienced it with Consul too on 0.6.3.dev.

# /path/consul agent -config-file /path/config.json
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/path/services/18501c56aa3648940425d400c66dad9a": unexpected end of JSON input

# ls -la /path/services/18501c56aa3648940425d400c66dad9a
-rw------- 1 root root 0 Aug 29 19:50 /path/services/18501c56aa3648940425d400c66dad9a

# /path/consul version
Consul 0.6.3.dev
Consul Protocol: 3 (Understands back to: 1)

@zg
Copy link

zg commented Jan 8, 2018

This issue is still valid in 0.8.5 with an empty JSON file in /var/consul/services:

# consul version
Consul v0.8.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# cat /var/consul/services/foo
# /usr/bin/consul agent -config-dir=/etc/consul.d -pid-file /tmp/consul.pid
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> Starting Consul agent...
==> Error starting agent: failed decoding service file "/var/consul/services/foo": unexpected end of JSON input

@slackpad
Copy link
Contributor

Hi @zg did you experience a crash/running out of disk space to get the empty file, or did it occur some other way?

@zg
Copy link

zg commented Jan 10, 2018

I created the tmp file myself as a proof of concept, but previously there was an incident where consul couldn’t start due to zero byte tmp files that were in the data directory that we previously were unaware of.

@slackpad slackpad modified the milestones: 0.7.1, 1.0.3 Jan 10, 2018
@slackpad
Copy link
Contributor

I see. I'll kick this back open so we can take a look.

@slackpad slackpad reopened this Jan 10, 2018
@pjelar
Copy link

pjelar commented Jan 15, 2018

@zg how did you resolve this? I have an issue where I just pushed a rolling cluster update using kops and one of my consul instances failed to start as it's looking for a service file that doesn't exist.

==> Error starting agent: failed decoding service file "/var/lib/consul/services/a369186d812d2a6ea01b87b6956ef828-340fa39d-14d1-e3f2-1305-d2228422425d.tmp": unexpected end of JSON input

@MrMMorris
Copy link

@pjelar delete the empty services file and restart consul

@pjelar
Copy link

pjelar commented Jan 15, 2018

@MrMMorris there wasn't a file there. I upgraded to 1.0.0 and the problem went away.

@zg
Copy link

zg commented Jan 15, 2018

It could be that an upgrade deletes the tmp files.

@pjelar
Copy link

pjelar commented Jan 15, 2018

The files weren't there @zg I deleted them. I tried placing valid .json files there too. More likely that 1.0.0 handles this issue better than 0.8.5, it wasn't a normal situation as I also had to do a peers.json manual recovery.

@zg
Copy link

zg commented Jan 15, 2018

That’s how I resolved my issue as well, by deleting the empty JSON files.

kyhavlov added a commit that referenced this issue Jan 19, 2018
Previously a change was made to make the file writing atomic,
but that wasn't enough to cover something like an OS crash so we
needed something here to handle the situation more gracefully.

Fixes #1221.
@nh2
Copy link

nh2 commented Sep 21, 2023

Consul v1.15.3 still does not write the server_metadata.json file atomically, here is the strace showing the straight openat(..., O_TRUNC), and write(), no rename():

openat(AT_FDCWD</>, "/var/lib/consul/server_metadata.json", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0600 <unfinished ...>
<... openat resumed>)     = 7</var/lib/consul/server_metadata.json>
fcntl(7</var/lib/consul/server_metadata.json>, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fcntl(7</var/lib/consul/server_metadata.json>, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 7</var/lib/consul/server_metadata.json>, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=1970443432, u64=140103803638952}}) = -1 EPERM (Operation not permitted)
fcntl(7</var/lib/consul/server_metadata.json>, F_GETFL) = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl(7</var/lib/consul/server_metadata.json>, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
write(7</var/lib/consul/server_metadata.json>, "{\"last_seen_unix\":1695257357}", 29) = 29
close(7</var/lib/consul/server_metadata.json>) = 0

@soupdiver
Copy link
Contributor

I also still have the same problem

@NathanPB
Copy link

In my case, I rm'd server-metadata.json in my Consul data directory, it was a plain empty file. Later, had to rm all the consul data directory, but not sure if this is related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.