Added leader election to JH image #100
Conversation
LGTM, you'll need to add the actual leader election image to the pod though: https://github.com/Xaenalt/jupyterhub-odh-ha/blob/abad9fe9519c18866767deebb2821289d12d3dd1/jupyterhub-dc.yaml#L101-L103 Also, is that s2i run script executed before any of the others in the pod? I added it originally to /opt/app-root/builder/run since that's what the JH pod invokes |
https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
Does this mean it won't invoke the actual jupyterhub run script? |
@Xaenalt This is the associated PR to odh-manifests with the sidecar: opendatahub-io/odh-manifests#460 |
yeah, I totally forgot to implement the exec command, it's just running the exec to notify when to start running the jupyter notebook, so instead of the |
I think it's done now, it will only be called by the leader, as the |
As we are still discussing the sleep time I'm committing the fixes in separate commits; once it's sorted out I'll squash them. |
Oh, it didn't put these comments out there until I hit unresolve, my bad
Ok thanks, so I would then resolve @vpavlin's comments. Sorry, I haven't been able to test it because my cluster is still down. |
Yeah, as nice as it'd be if the -http method worked, if you can find what's wrong and fix it, go for it. To reproduce, use the http method with a replicaset and delete the current leader: the record will never be updated to the new leader with the http method, but it'll update properly using the get endpoint method. |
ok perfect 👍 |
Ok, I've implemented your solution @maulikjs, as you stated the service was load balancing between the 3 pods even though two of them were not ready, hence the API route error. |
I can confirm this is due to the pods being reported as missing and the proxy routing the request to pods which are waiting for leadership and aren't running JH yet. |
Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied? |
Nope, we either do not add a livenessProbe or add an empty livenessProbe which always evaluates to true. Once we add Traefik into the mix this would mean user pods will be accessible no matter what, and JupyterHub will at most have a downtime of ~12 seconds: 6 seconds for it to go through the leadership check and approximately 6 more (as per my testing on OSD) for the standby pod to start running and become the leader. I haven't tested this with Traefik yet, but the only thing we need to look into is:
why is it adding the route again and again, and will this behaviour continue once we have Traefik in the mix? |
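The probe setup described in this thread could look roughly like the following snippet. This is a sketch only: the port number, delays, and field values are assumptions based on the discussion, not the PR's actual manifest.

```yaml
# Hypothetical probe config for the JupyterHub container: the pod is only
# marked Ready (and receives Service traffic) once JupyterHub is actually
# listening, i.e. once this pod has won the election. No livenessProbe is
# set, so standby pods waiting on the election are never restarted.
readinessProbe:
  tcpSocket:
    port: 8080        # JupyterHub's HTTP port (assumed)
  initialDelaySeconds: 5
  periodSeconds: 6    # roughly matches the ~6 s leadership check cadence
```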
That's from the publish service, which we are removing, so don't worry about it :) |
perfect! |
/lgtm |
Things look good on Maulik's cluster. 🚢
Awesome job |
Related Issues and Dependencies
RHODS-767
This introduces a breaking change
Added the s2i run step to check whether the pod is the leader. By checking the port opened by the leader-elector sidecar container, only the leader will start execution.
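The run step described above might be sketched roughly as follows. This is hypothetical: the sidecar URL, the JSON field name, the sleep interval, and the `/opt/app-root/builder/run` hand-off are assumptions pieced together from this conversation, not the PR's actual code.

```shell
#!/bin/bash
# Hypothetical sketch of an s2i run wrapper implementing the leader check.
# Assumptions: the leader-elector sidecar serves JSON like {"name":"<pod>"}
# on localhost:4040, and /opt/app-root/builder/run is the original JH run
# script invoked by the pod.

ELECTOR_URL="${ELECTOR_URL:-http://localhost:4040}"
SLEEP_SECONDS="${SLEEP_SECONDS:-6}"

# Return the pod name of the current leader as reported by the sidecar.
current_leader() {
    curl -s "$ELECTOR_URL" | sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
}

# Block until this pod's hostname matches the elected leader.
wait_for_leadership() {
    until [ "$(current_leader)" = "$HOSTNAME" ]; do
        echo "Not the leader yet, sleeping ${SLEEP_SECONDS}s..."
        sleep "$SLEEP_SECONDS"
    done
}

# In the real run script this would be invoked as:
#   wait_for_leadership
#   exec /opt/app-root/builder/run   # exec so JupyterHub receives signals directly
```

The `exec` at the end matters: it replaces the wrapper shell with JupyterHub itself, so signals from the kubelet reach the hub process directly.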
This Pull Request implements
Description
Based on the work of Sean, this adds a check method for the Kubernetes leader election pattern.
This PR is linked to the "Added sidecar leader election on JH" PR.
Testing
This image could be built with the following command:
For further info you can check this test log