Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to start up the cluster when upgrading to 1.13.0 #14107

Closed
tuxillo opened this issue Aug 10, 2022 · 13 comments · Fixed by #14149
Closed

Failed to start up the cluster when upgrading to 1.13.0 #14107

tuxillo opened this issue Aug 10, 2022 · 13 comments · Fixed by #14149
Labels
type/bug Feature does not function as expected

Comments

@tuxillo
Copy link

tuxillo commented Aug 10, 2022

Overview of the Issue

Upgrading from 1.12.3 to 1.13.0 leads to a failure to start the cluster.

Reproduction Steps

Steps to reproduce this issue, eg:

  1. We have a 3-node cluster
  2. Upgrade nodes to 1.13.0
  3. The nodes won't start up due to an error (see log fragments below)

Consul info for both Client and Server

Client info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused
Server info
$ consul info
Error querying agent: Get "https://localhost:8501/v1/agent/self": dial tcp [::1]:8501: connect: connection refused

Operating system and Environment details

Centos 7 LXC running on Proxmox.

Linux consul-01.mysite 5.11.22-5-pve #1 SMP PVE 5.11.22-10 (Tue, 28 Sep 2021 08:15:41 +0200) x86_64 x86_64 x86_64 GNU/Linux

Log Fragments

Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Starting Consul agent...
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Version: '1.13.0'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:         Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Node ID: '1aaaf50c-d91e-8d64-ec70-ed9d68341ff3'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:          Node name: 'consul-01'
Aug 10 07:40:59 consul-01.mysite consul[1683882]:         Datacenter: 'mysite' (Segment: '<all>')
Aug 10 07:40:59 consul-01.mysite consul[1683882]:             Server: true (Bootstrap: false)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: 8501, gRPC: 8502, DNS: 8600)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:       Cluster Addr: 192.168.10.11 (LAN: 8301, WAN: 8302)
Aug 10 07:40:59 consul-01.mysite consul[1683882]:            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 10 07:40:59 consul-01.mysite consul[1683882]: ==> Log data will now stream in as it occurs:
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.402Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.410Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO]  agent.server.raft: starting restore from snapshot: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.489Z [INFO]  agent.server.raft: snapshot restore progress: id=1651-7004807-1660117093745 last-index=7004807 last-term=1651 size-in-bytes=149375 read-bytes=70 percent-complete=0.05%
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [ERROR] agent.server.raft: failed to restore snapshot: id=1651-6988421-1660047942935 last-index=6988421 last-term=1651 size-in-bytes=149004 error="object missing primary index"
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.490Z [INFO]  agent.server: shutting down server
Aug 10 07:40:59 consul-01.mysite consul[1683882]: 2022-08-10T07:40:59.491Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"

@Amier3 Amier3 added the type/bug Feature does not function as expected label Aug 10, 2022
@huikang
Copy link
Collaborator

huikang commented Aug 11, 2022

@tuxillo , I couldn't reproduce the error probably I have a very small snapshot from v1.12.3

2022-08-10T22:31:56.355-0400 [INFO]  agent.server.raft: starting restore from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
2022-08-10T22:31:56.360-0400 [INFO]  agent.server.raft: snapshot restore progress: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564 read-bytes=8564 percent-complete=100.00%
2022-08-10T22:31:56.360-0400 [INFO]  agent.server.raft: restored from snapshot: id=2-22-1660185101472 last-index=22 last-term=2 size-in-bytes=8564
  • Will you be able to restart the server using the old version of consul, v1.12.3?

  • What is the output of consul snapshot inspect <snap-file>. The snapshot file should be located in the consul data dir, e.g., /path-to-data-dir/raft/snapshots/2-22-1660185101472/state.bin

@optiz0r
Copy link

optiz0r commented Aug 11, 2022

Seeing the same (repeatably). Downgrade to 1.12.3 works fine.

# consul snapshot inspect /opt/consul/raft/snapshots/1598-16767572-1660192717741/state.bin
 ID           1598-16767572-1660192717741
 Size         5882700
 Index        16767572
 Term         1598
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 KVS                         3331       5.4MB
 Register                    173        220.3KB
 ACLToken                    25         10.7KB
 ACLPolicy                   10         3.2KB
 Index                       81         2.8KB
 CoordinateBatchUpdate       14         2.3KB
 Tombstone                   11         1.7KB
 ConnectCA                   1          1.2KB
 ConnectCAProviderState      1          1.2KB
 Session                     3          586B
 ConfigEntry                 1          353B
 FederationState             2          306B
 SystemMetadata              3          215B
 ConnectCAConfig             1          205B
 Autopilot                   1          199B
 ServiceVirtualIP            2          114B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  5.6MB

With the perhaps slightly more useful log message: [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"

Full startup logs below.

Aug 11 07:52:03 consulserver.example.com systemd[1]: Started Consul Agent.
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Starting Consul agent...
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Version: '1.13.0'
Aug 11 07:52:03 consulserver.example.com consul[268773]:         Build Date: '2022-08-09 18:28:11 +0000 UTC'
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Node ID: '8bcdd91a-4810-60ab-b2a3-f127b32dfc69'
Aug 11 07:52:03 consulserver.example.com consul[268773]:          Node name: 'consulserver'
Aug 11 07:52:03 consulserver.example.com consul[268773]:         Datacenter: 'dc1' (Segment: '<all>')
Aug 11 07:52:03 consulserver.example.com consul[268773]:             Server: true (Bootstrap: false)
Aug 11 07:52:03 consulserver.example.com consul[268773]:        Client Addr: [127.0.0.1] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: 8600)
Aug 11 07:52:03 consulserver.example.com consul[268773]:       Cluster Addr: 81.187.154.149 (LAN: 8301, WAN: 8302)
Aug 11 07:52:03 consulserver.example.com consul[268773]:            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: true, Auto-Encrypt-TLS: true
Aug 11 07:52:03 consulserver.example.com consul[268773]: ==> Log data will now stream in as it occurs:
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.648Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.662Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.670Z [INFO]  agent.server.raft: starting restore from snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.679Z [INFO]  agent.server.raft: snapshot restore progress: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [ERROR] agent.server.raft: failed to restore snapshot: id=1598-16767572-1660192717741 last-index=16767572 last-term=1598 size-in-bytes=5882700 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.680Z [INFO]  agent.server.raft: starting restore from snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent.server.raft: snapshot restore progress: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 read-bytes=62 percent-complete=0.00%
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent.server.raft: failed to restore snapshot: id=1596-16751183-1660135204652 last-index=16751183 last-term=1596 size-in-bytes=5887294 error="object missing primary index"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent.server: shutting down server
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
Aug 11 07:52:03 consulserver.example.com consul[268773]: 2022-08-11T07:52:03.689Z [INFO]  agent: Exit code: code=1
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Scheduled restart job, restart counter is at 6.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Stopped Consul Agent.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Start request repeated too quickly.
Aug 11 07:52:03 consulserver.example.com systemd[1]: consul.service: Failed with result 'exit-code'.
Aug 11 07:52:03 consulserver.example.com systemd[1]: Failed to start Consul Agent.

Just re-checked and no additional logs are available with level set to debug.

@tuxillo
Copy link
Author

tuxillo commented Aug 11, 2022

I can confirm that we had to downgrade to 1.12.3 to bring back the cluster. In another cluster we have we left 2 nodes in 1.12.3 and one in 1.13.0, which doesn't boot.

Here's the inspect of the snapshot that fails to restore:

$ consul snapshot inspect /var/consul/raft/snapshots/1651-6973752-1659985741788/state.bin
 ID           1651-6973752-1659985741788
 Size         148533
 Index        6973752
 Term         1651
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 Register                    72         58.6KB
 KVS                         5          50.1KB
 ACLToken                    33         14.5KB
 ACLPolicy                   23         8.7KB
 ConfigEntry                 13         4KB
 ConnectCA                   1          2.7KB
 ConnectCAProviderState      1          1.9KB
 CoordinateBatchUpdate       11         1.8KB
 Index                       30         1KB
 Session                     3          612B
 FederationState             3          441B
 SystemMetadata              3          211B
 Autopilot                   1          199B
 ConnectCAConfig             1          180B
 Tombstone                   1          89B
 ServiceVirtualIP            1          70B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  145.1KB

@dani
Copy link

dani commented Aug 11, 2022

Same problem on a 3 nodes test cluster, running on AlmaLinux 8. Downgraded to 1.12.3 to have the agent able to start

@MikeN123
Copy link

FYI - Same here on a single-node dev cluster. Snapshot info:

 ID           32-2392224-1659978506821
 Size         73385
 Index        2392224
 Term         32
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 Register                    56         60KB
 ConfigEntry                 13         4.2KB
 Index                       48         2.1KB
 KVS                         12         1.6KB
 ConnectCA                   1          1.2KB
 ConnectCAProviderState      1          1.2KB
 ServiceVirtualIP            8          471B
 SystemMetadata              3          215B
 ConnectCAConfig             1          205B
 Autopilot                   1          199B
 CoordinateBatchUpdate       1          174B
 FederationState             1          139B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  71.7KB

@huikang
Copy link
Collaborator

huikang commented Aug 11, 2022

@MikeN123 , thanks so much for sharing; it's really useful.
To help troubleshooting this issue, do you mind sharing the agent configuration and the step to reproduce?

@dani
Copy link

dani commented Aug 11, 2022

Here's mine

data_dir = "/opt/consul/data"
bind_addr = "0.0.0.0"
client_addr = "0.0.0.0"

advertise_addr = "10.99.5.18"

ports {
  dns = 8600
  http = 8500
  https = 8501
  grpc = 8502
  serf_lan = 8301
  serf_wan = 8302
  server = 8300
  sidecar_min_port = 21000
  sidecar_max_port = 21255
  expose_min_port = 21500
  expose_max_port = 21755
}

retry_join = [
  "nomad1.example.org",
  "nomad2.example.org",
  "nomad3.example.org",
]

server = true
bootstrap_expect = 3
performance {
  raft_multiplier = 1
}

encrypt = "<hidden>"

ui_config {
  enabled = true
}

recursors = [
  "10.99.5.1",
]

telemetry {
  prometheus_retention_time = "1h"
}

connect {
  enabled = true
}

log_level = "INFO"
  
node_meta {
}

Started by this systemd unit

[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/opt/consul/etc/consul.hcl

[Service]
Type=notify
EnvironmentFile=-/opt/consul/etc/consul.env
User=consul
Group=consul
ExecStart=/opt/consul/bin/consul agent -config-dir=/opt/consul/etc/
ExecReload=/bin/kill --signal HUP $MAINPID
SuccessExitStatus=1
Restart=on-failure
RestartSec=2
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Nothing special to reproduce, just running a consul cluster in 1.12.3 (which has been installed and bootstraped with 1.12.3), change the binary to the 1.13.0 version and restart the service

@MikeN123
Copy link

MikeN123 commented Aug 11, 2022

Config here (1.12.3 works fine, 1.13.0 does not start) - almost completely default except for enabling Connect.

data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui = true
server = true

bootstrap_expect = 1

connect {
  enabled = true
}

ports {
  grpc = 8502
}
bind_addr = "{{ GetInterfaceIP \"ens192\" }}"

Steps to reproduce - no idea, just upgraded the node which has been running for several versions and never had issues with upgrades.

Maybe it's related to Connect, as all installs above seem to have that in common?

@huikang
Copy link
Collaborator

huikang commented Aug 11, 2022

We have found the root cause and are on fixing the issue. Thanks to all for helping troubleshoot the issue!

@kisunji
Copy link
Contributor

kisunji commented Aug 12, 2022

Thank you everyone for the reports. We narrowed down the issue to a breaking change that has been addressed in 1.13.1.

This bug in 1.13.0 affects those upgrading from Consul 1.11+ AND using Connect service mesh (i.e. Connect proxies registered). Everyone is advised to upgrade to 1.13.1 instead.

@ahjohannessen
Copy link

ahjohannessen commented Sep 29, 2022

@kisunji I am trying to upgrade from 1.12.4 to 1.13.2 and get this:

2022-09-29T14:03:33.851Z [WARN]  agent: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'tls_min_version' field is deprecated. Use the 'tls.defaults.tls_min_version' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'verify_incoming_https' field is deprecated. Use the 'tls.https.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
2022-09-29T14:03:33.852Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
2022-09-29T14:03:33.852Z [WARN]  agent: if auto_encrypt.allow_tls is turned on, tls.internal_rpc.verify_incoming should be enabled (either explicitly or via tls.defaults.verify_incoming). It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'tls_min_version' field is deprecated. Use the 'tls.defaults.tls_min_version' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'verify_incoming_https' field is deprecated. Use the 'tls.https.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
2022-09-29T14:03:33.859Z [WARN]  agent.auto_config: if auto_encrypt.allow_tls is turned on, tls.internal_rpc.verify_incoming should be enabled (either explicitly or via tls.defaults.verify_incoming). It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2022-09-29T14:03:33.862Z [INFO]  agent.server.raft: starting restore from snapshot: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474
2022-09-29T14:03:33.863Z [INFO]  agent.server.raft: snapshot restore progress: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474 read-bytes=53 percent-complete=0.05%
2022-09-29T14:03:33.863Z [ERROR] agent.server.raft: failed to restore snapshot: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474 error="object missing primary index"
2022-09-29T14:03:33.863Z [INFO]  agent.server.raft: starting restore from snapshot: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474
2022-09-29T14:03:33.863Z [INFO]  agent.server.raft: snapshot restore progress: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474 read-bytes=53 percent-complete=0.05%
2022-09-29T14:03:33.863Z [ERROR] agent.server.raft: failed to restore snapshot: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474 error="object missing primary index"
2022-09-29T14:03:33.863Z [INFO]  agent.server: shutting down server
2022-09-29T14:03:33.863Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
2022-09-29T14:03:33.863Z [INFO]  agent: Exit code: code=1

I use Consul Connect with Nomad 1.3.5.

@ahjohannessen
Copy link

@kisunji @huikang

I tried to upgrade from 1.12.4 to 1.14.0 and still get the same error as I mentioned above, error="object missing primary index" - I suspect it has something to do with I use Consul Connect and Nomad.

A snapshot of raft from 1.12.4 before attempting the upgrade looks like this:

[core@f03 ~]$ consul snapshot inspect backup.snap
 ID           79661-3562292-1668597879567
 Size         77849
 Index        3562292
 Term         79661
 Version      1

 Type                        Count      Size
 ----                        ----       ----
 Register                    52         45.8KB
 KVS                         17         8.5KB
 ACLToken                    15         5.7KB
 ConfigEntry                 14         4.9KB
 Index                       69         2.7KB
 ACLPolicy                   9          2.2KB
 ACLRole                     4          1.6KB
 ConnectCA                   1          1.2KB
 ConnectCAProviderState      1          1.1KB
 CoordinateBatchUpdate       5          835B
 Session                     3          528B
 Autopilot                   1          199B
 ConnectCAConfig             1          195B
 SystemMetadata              3          191B
 Tombstone                   2          183B
 ServiceVirtualIP            3          151B
 FederationState             1          139B
 FreeVirtualIP               1          33B
 ChunkingState               1          12B
 ----                        ----       ----
 Total                                  76KB

@kisunji
Copy link
Contributor

kisunji commented Nov 17, 2022

@ahjohannessen Could you please raise a new issue with your Consul versions (OSS or Enterprise?) and log fragments (ideally containing the agent version)?

From a quick glance I didn't find any diffs between v1.12.3..v1.12.4 or v1.13.1..1.14.0 that would break snapshot restore.

If there is an issue with specific versions it would be ideal to track it separately since this one is considered closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants