How to debug already sealed Vault pod (1.3.0) #8155

Closed
Aracki opened this issue Jan 14, 2020 · 7 comments
Labels: bug, storage/consul

Comments

@Aracki commented Jan 14, 2020

In my case, one of my three pods keeps getting sealed over time, but the only error I'm getting is:

failed to lookup token: error="failed to look up namespace from the token: no namespace"

Do you know what could be the cause of the sealing and, most importantly, how do I debug an already sealed pod?

@michelvocks (Contributor)

Hi @Aracki!

I think the most important source for identifying failures is the Vault server logs.
Would it be possible to get the complete logs from a sealed Vault instance?
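
Something along these lines should pull them (just a sketch; the pod and container names are taken from your log prefix, devops-vault-0 and vault):

# Full logs from the sealed pod (add --previous if the container restarted)
kubectl logs devops-vault-0 -c vault --timestamps

# Confirm the seal status from inside the pod
kubectl exec -it devops-vault-0 -c vault -- vault status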

Cheers,
Michel

@Aracki (Author) commented Jan 15, 2020

Apart from the error I've already copied, there were a lot of WARN messages regarding the Consul backend:

devops-vault-0 vault 2020-01-15T21:37:23.533Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:24.535Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:25.536Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:26.538Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:27.540Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:28.542Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"
devops-vault-0 vault 2020-01-15T21:37:29.545Z [WARN]  storage.consul: check unable to talk with Consul backend: error="Unexpected response code: 500 (CheckID "vault:<IP>:8200:vault-sealed-check" does not have associated TTL)"

...but I've already opened #6876 in the Consul repo.

@michelvocks (Contributor)

It looks like this is a duplicate of hashicorp/consul#5439.

Until we've figured out what the problem is and come up with a fix, I would try to work around this issue using the approach described in hashicorp/consul#5439 (comment).
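
In the meantime you can check directly on the affected Consul agent whether the vault-sealed-check really lost its TTL registration (a sketch, assuming the agent HTTP API is reachable on the default 127.0.0.1:8500):

# List all checks registered on the local Consul agent; look for the
# vault:<IP>:8200:vault-sealed-check entry and whether it carries a TTL
curl -s http://127.0.0.1:8500/v1/agent/checks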

@michelvocks added the bug, storage/consul, and waiting-for-response labels and removed the question label on Jan 17, 2020
@Aracki (Author) commented Jan 21, 2020

Sorry @michelvocks, it happened again and I've found logs with the related error:

2020-01-20T23:28:07.327Z [ERROR] expiration: error restoring leases: error="failed to read lease entry auth/github/login/<ID>: Unexpected response code: 500"
2020-01-20T23:28:07.327Z [ERROR] core: shutting down
2020-01-20T23:28:07.327Z [INFO]  core: marked as sealed
2020-01-20T23:28:07.566Z [ERROR] expiration: failed to revoke lease: lease_id=auth/kubernetes/login/<ID> error="failed to revoke token: failed to revoke entry: failed to delete entry: Unexpected response code: 500 (No cluster leader)"
2020-01-20T23:28:07.567Z [INFO]  core: pre-seal teardown starting
2020-01-20T23:28:08.069Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:08.070Z [INFO]  rollback: stopping rollback manager
2020-01-20T23:28:08.071Z [INFO]  core: pre-seal teardown complete
2020-01-20T23:28:08.089Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:08.113Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:08.234Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:09.132Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:09.185Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:09.304Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:10.327Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:10.602Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:10.729Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:11.122Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:12.832Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:13.241Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:13.408Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:13.443Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2020-01-20T23:28:13.754Z [INFO]  core: stopping cluster listeners
2020-01-20T23:28:13.754Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
2020-01-20T23:28:13.943Z [INFO]  core.cluster-listener: rpc listeners successfully shut down
2020-01-20T23:28:13.943Z [INFO]  core: cluster listeners successfully shut down
2020-01-20T23:28:13.943Z [INFO]  core: vault is sealed
2020-01-20T23:28:17.567Z [ERROR] expiration: core context canceled, not attempting further revocation of lease: lease_id=auth/kubernetes/login/<ID>

@michelvocks (Contributor)

Hi @Aracki!

failed to delete entry: Unexpected response code: 500 (No cluster leader)

Have you checked your Consul logs? It looks like Consul had an issue during that time.
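
"No cluster leader" usually means the Consul servers lost leadership around that time, so it's worth looking at the servers as well (a sketch, assuming your Consul servers run as pods named consul-server-0 and so on):

# Consul server logs around the time of the failure
kubectl logs consul-server-0 --timestamps | grep -i leader

# Show the current Raft peers and which one is the leader
kubectl exec -it consul-server-0 -- consul operator raft list-peers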

Cheers,
Michel

@Aracki (Author) commented Jan 30, 2020

@michelvocks Just a lot of the following ones:

[ERR] http: Request PUT /v1/agent/check/pass/vault:<IP>:vault-sealed-check?note=Vault+Unsealed, error: CheckID "vault:<IP>:vault-sealed-check" does not have associated TTL from=<IP>

@ncabatoff (Collaborator)

This isn't a Vault issue; it's expected behaviour for Vault to seal itself if it gets a storage error while restoring leases at startup:

2020-01-20T23:28:07.327Z [ERROR] expiration: error restoring leases: error="failed to read lease entry auth/github/login/<ID>: Unexpected response code: 500"
2020-01-20T23:28:07.327Z [ERROR] core: shutting down
2020-01-20T23:28:07.327Z [INFO]  core: marked as sealed
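
Once Consul has a cluster leader again, the sealed instance just needs to be unsealed; roughly (a sketch, assuming a Shamir seal, so the command is repeated with different key shares until the threshold is met):

# Prompts for an unseal key; repeat until the status shows Sealed: false
kubectl exec -it devops-vault-0 -c vault -- vault operator unseal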

I suggest asking for help on the discuss forums if you're still experiencing this.
