
elasticsearch: readiness probe extension #609

Merged
merged 1 commit into openshift:master on Sep 7, 2017

Conversation

@wozniakjan (Contributor) commented Aug 25, 2017

Interim readiness probe until the multi-role Elasticsearch topology from openshift/openshift-ansible#5001 is introduced.

Proposed changes are based on this document: https://docs.google.com/document/d/1m9-LXr-UThs9WhRfkFRbwf1Lppxa5Opw_PYRY1wuDbY/

List of implemented readiness probe checks:

  1. localhost:9200/ returns OK 200
  2. localhost:9200/.searchguard.${DC_NAME} returns OK 200
  3. localhost:9200/_template/${template} returns OK 200 for each $template in $ES_HOME/index_templates/*.json
  4. localhost:9200/.searchguard.$DC_NAME/_count returns count of 5 documents in the response body
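
For illustration, a minimal sketch of how checks 1-3 could be expressed (the https endpoint, cert mount path, and helper name below are assumptions, not the actual contents of readiness.sh in this PR):

```bash
#!/bin/bash
# Illustrative sketch only; the real readiness.sh in this PR may differ.
# Assumes the searchguard admin certs are mounted at $secret_dir and that
# DC_NAME and ES_HOME are provided by the pod environment.
secret_dir="/etc/elasticsearch/secret"
es_url="https://localhost:9200"

http_code() {
  # Print only the HTTP status code of a GET against the local ES instance.
  curl -s -o /dev/null -w '%{http_code}' \
    --cacert "${secret_dir}/admin-ca" \
    --cert "${secret_dir}/admin-cert" \
    --key "${secret_dir}/admin-key" \
    "$1"
}

# 1. the root endpoint must answer 200
[ "$(http_code "${es_url}/")" = "200" ] || exit 1

# 2. the per-deployment searchguard index must exist
[ "$(http_code "${es_url}/.searchguard.${DC_NAME}")" = "200" ] || exit 1

# 3. every bundled index template must be installed
for f in "${ES_HOME}"/index_templates/*.json; do
  template=$(basename "$f" .json)
  [ "$(http_code "${es_url}/_template/${template}")" = "200" ] || exit 1
done

exit 0
```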

Depends on this PR

@wozniakjan force-pushed the es_readiness_probe branch 2 times, most recently from 169cb95 to 0540d24 on August 25, 2017 14:31
@lukas-vlcek (Contributor) left a comment

LGTM, left minor comments.

As for the missing feature defined as:

> We need to ensure pod is still accessible by admin or by using admin certs

I am not sure what this means; we are already using admin certs with curl, so isn't it implicitly fulfilled?

@jcantrill (Contributor)

@wozniakjan Regarding 'We need to ensure pod is still accessible by admin or by using admin certs': this is not intended to be implemented in the probe. It is more a note that we need to ensure that when a pod is in a not-ready state, we can still access that member of the cluster to obtain information (e.g. _cat/indices). I believe the ACL backend registry will allow you to do this, assuming you have access to the admin certs.
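
For illustration, such an out-of-band check might look roughly like this (a sketch; the secret mount path and the https endpoint are assumptions, not taken from this repo):

```bash
# Query one (possibly not-ready) ES member directly from inside its pod,
# bypassing the service, using the searchguard admin certs.
# The mount path below is illustrative.
curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/indices?v"
```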

@jcantrill (Contributor) left a comment

LGTM. We will need to backport this to 3.6; let's make sure not to forget that PR.

@wozniakjan wozniakjan changed the title WIP elasticsearch: readiness probe extension elasticsearch: readiness probe extension Aug 28, 2017
In elasticsearch/probe/readiness.sh:

    | python -c "import sys, json; print json.load(sys.stdin)['count']")

    if [ "$sg_doc_count" != $EXPECTED_SG_DOCUMENT_COUNT ]; then
        echo "Incorrect SG document count, expected $EXPECTED_SG_DOCUMENT_COUNT [received doc count: ${sg_doc_count}]"
see above comment about errors
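
For context, a hedged reconstruction of the surrounding check; only the count parsing and the comparison come from the excerpt above, while the curl invocation and cert paths are assumptions:

```bash
EXPECTED_SG_DOCUMENT_COUNT=5
secret_dir="/etc/elasticsearch/secret"   # assumed mount point

# Ask ES how many documents the per-deployment searchguard index holds
# and extract the "count" field from the JSON response.
sg_doc_count=$(curl -s \
  --cacert "${secret_dir}/admin-ca" \
  --cert "${secret_dir}/admin-cert" \
  --key "${secret_dir}/admin-key" \
  "https://localhost:9200/.searchguard.${DC_NAME}/_count" \
  | python -c "import sys, json; print json.load(sys.stdin)['count']")

if [ "$sg_doc_count" != "$EXPECTED_SG_DOCUMENT_COUNT" ]; then
  echo "Incorrect SG document count, expected $EXPECTED_SG_DOCUMENT_COUNT [received doc count: ${sg_doc_count}]"
  exit 1
fi
```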

@jcantrill (Contributor)

[test]

@wozniakjan (Contributor, Author) commented Sep 6, 2017

@jcantrill I updated the version of the elasticsearch cloud kubernetes plugin and tested again. Everything seems to be ready for merge and QE tests.

additional checks in readiness probe:
1) localhost:9200/ returns OK 200
2) localhost:9200/.searchguard.${DC_NAME} returns OK 200
3) localhost:9200/_template/${template} returns OK 200 for each template
   in $ES_HOME/index_templates/*.json
4) localhost:9200/.searchguard.$DC_NAME/_search?size=0 returns count of
   5 documents in the response body
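
Note that check 4 here uses _search?size=0 rather than the /_count endpoint mentioned in the PR description; with a size-0 search the document count is reported under hits.total. A hedged illustration (TLS options omitted for brevity):

```bash
# With a size-0 search the total document count appears under hits.total
# rather than in a top-level "count" field as it does for /_count.
sg_doc_count=$(curl -s "https://localhost:9200/.searchguard.${DC_NAME}/_search?size=0" \
  | python -c "import sys, json; print json.load(sys.stdin)['hits']['total']")
```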
@wozniakjan (Contributor, Author)

[test]

@openshift-bot

Evaluated for origin aggregated logging test up to c691f2a

@openshift-bot

Origin Aggregated Logging Test Results: SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_aggregated_logging/378/) (Base Commit: 0b6cc33) (PR Branch Commit: c691f2a)

@wozniakjan (Contributor, Author)

@jcantrill the tests are passing and my local tests also succeeded; do you think we can merge it?

@jcantrill (Contributor)

[merge]

@openshift-bot

Evaluated for origin aggregated logging merge up to c691f2a

@openshift-bot commented Sep 7, 2017

Origin Aggregated Logging Merge Results: SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin_aggregated_logging/158/) (Base Commit: b3418b5) (PR Branch Commit: c691f2a)

@openshift-bot openshift-bot merged commit ca917ff into openshift:master Sep 7, 2017
@portante (Contributor) commented Sep 7, 2017

I have some reservations about this probe. This relies on the cluster being operational for the probes to work. If the searchguard index is sharded across all nodes of the cluster, and one of the nodes is not available, then will this probe work?

Fundamentally, readiness probes should not rely on the cluster-wide state of Elasticsearch to determine if one pod is ready to join the service.

@wozniakjan (Contributor, Author)

> I have some reservations about this probe. This relies on the cluster being operational for the probes to work. If the searchguard index is sharded across all nodes of the cluster, and one of the nodes is not available, then will this probe work?

Each probe verifies only its own searchguard index. If fewer than half of the ES pods are unavailable, the rest of the service remains operational: the malfunctioning ES pods are removed from the ES service, and the service load-balances only across the working pods that form the cluster.

> Fundamentally, readiness probes should not rely on the cluster-wide state of Elasticsearch to determine if one pod is ready to join the service.

I am afraid there is no way around it. It is an ES-internal mechanism, and a single pod is not ready until it negotiates with a subset of its peers to form an operational cluster according to its configuration.
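
For example, that negotiation can be observed from inside a pod via the cluster health API (a hedged example; the cert mount path is an assumption):

```bash
# A node that has not yet joined a master-elected cluster typically answers
# this with an error (no master discovered); once the cluster has formed it
# reports a status and the number of nodes that have joined.
curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cluster/health?pretty"
```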

@richm (Contributor) commented Sep 8, 2017

How can we test this on a real, live, multi-node cluster, so we can try to reproduce some of these scenarios @portante is describing?

@portante (Contributor) commented Sep 8, 2017

> Each probe verifies only its own searchguard index.

Aren't indices in Elasticsearch maintained cluster-wide? Don't they have state that is managed across all nodes in the cluster? If that is not the case, please share where that is documented.

Assuming it is cluster-wide state, making it part of a pod's readiness probe does not make sense.

There is a way to drop the per-ES-pod searchguard indices, which would make this particular issue go away.

But it still holds that a readiness probe is per pod and, as such, should not rely on service or ES cluster state to determine its readiness to join the service.

@wozniakjan (Contributor, Author) commented Sep 8, 2017

@richm locally I tested the searchguard index miss by creating an image that had this commented out and deploying a cluster with 3 nodes.

Then I built another image with this line back in and redeployed 2 out of the 3. Two pods formed an operational cluster; one was still waiting for the probe to succeed and was therefore dropped out of the service.
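
For anyone reproducing this, one hedged way to see which pods remain behind the service (the logging-es service name and the component=es label are assumptions about the deployment, not taken from this PR):

```bash
# Pods whose readiness probe fails are removed from the service endpoints,
# so only the members that formed the cluster should be listed here.
oc get endpoints logging-es
oc get pods -l component=es -o wide
```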
