Added leader election to JH image #100

lucferbux · 2021-07-16T18:36:15Z

Related Issues and Dependencies

This introduces a breaking change

Yes
No

Added the s2i run step to check if the image is the leader. By checking the port opened by the sidecar leader container only the leader will start the execution.

This Pull Request implements

Add script to check the leader election strategy

Description

Based on the work of of Sean add check method for the kubernetes leader election pattern.

This PR is linked to the Added sidecar leader election on JH PR

Testing

This image could be built with the following command:

s2i build git@github.com:lucferbux/jupyterhub-odh.git --ref feature/add-leader-election --context-dir / quay.io/odh-jupyterhub/jupyterhub:v3.5.3 [registry]

For further info you can check this test log

Xaenalt · 2021-07-16T20:23:10Z

LGTM, you'll need to add the actual leader election image to the pod though, https://github.com/Xaenalt/jupyterhub-odh-ha/blob/abad9fe9519c18866767deebb2821289d12d3dd1/jupyterhub-dc.yaml#L101-L103

Also, is that s2i run script executed before any of the others in the pod, I added it originally to /opt/app-root/builder/run since that's what the JH pod invokes

maulikjs · 2021-07-19T05:09:28Z

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning

When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.

Does this mean it wont invoke the actual jupyterhub run script?

maulikjs · 2021-07-19T05:11:00Z

@Xaenalt This is the associated PR to odh-manifests with the sidecar: opendatahub-io/odh-manifests#460

lucferbux · 2021-07-19T07:18:50Z

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning
When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.
Does this mean it wont invoke the actual jupyterhub run script?

yeah, I totally forgot to implement the exec command, it's just running the exec to notify when to start running the jupyter notebook, so instead of the echo "Became leader! Start the program!" we should call the script.

lucferbux · 2021-07-19T09:09:08Z

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning
When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.
Does this mean it wont invoke the actual jupyterhub run script?

I think is done now, it will only be called by the leader, as the echo ... was only executed by the leader, I'm gonna try this approach to check if it works

.s2i/bin/run

lucferbux · 2021-07-19T11:17:35Z

As we are still discussing the sleep time I'm committing the fixes in separate commits, once it's sorted out ill squash them.

Xaenalt

Oh, it didn't put these comments out there until I hit unresolve, my bad

.s2i/bin/run

lucferbux · 2021-07-19T14:58:04Z

Oh, it didn't put these comments out there until I hit unresolve, my bad

Ok thanks, so I would then resolve @vpavlin comments, sorry, I haven't been able to test it cause my cluster is still down.

Xaenalt · 2021-07-19T15:05:09Z

Yeah, as nice as it'd be if the -http method worked, if you can find what's wrong and fix it, go for it

To reproduce, do the http method with a replicaset and delete the current leader, it'll never be updated in the new leader on the http method, but it'll update properly using the get endpoint method

lucferbux · 2021-07-19T15:09:06Z

Yeah, as nice as it'd be if the -http method worked, if you can find what's wrong and fix it, go for it

To reproduce, do the http method with a replicaset and delete the current leader, it'll never be updated in the new leader on the http method, but it'll update properly using the get endpoint method

ok perfect 👍

.s2i/bin/run

maulikjs · 2021-07-22T08:27:07Z

Also sometimes I get this error:

Refreshing the page multiple times takes care of it though which is odd.
Another thing that is odd is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

It keeps trying to add route for a pod which exists.

lucferbux · 2021-07-22T11:03:55Z

Also sometimes I get this error:

Refreshing the page multiple times takes care of it though which is odd.
Another thing that is odd is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

It keeps trying to add route for a pod which exists.

Ok, I've implemented your solution @maulikjs, as you stated the service was load balancing between the 3 pods even though two of them were not ready, hence the API route error.
Adding readinessProbe to check if the service is up, so it can start accept traffic. Changes are live here: opendatahub-io/odh-manifests#460

maulikjs · 2021-07-22T11:08:44Z

I can confirm this is due to the pods being reported as missing and the proxy routing the request to pods which are waiting for leader and arent running JH yet.
We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

vpavlin · 2021-07-22T13:01:23Z

We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?

maulikjs · 2021-07-22T13:09:47Z

We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?

Nope, we add either do not add a livenessProbe or empty livenessprobe which always evaluates to true.
Based on how @lucferbux configured the deployment, the readinessProbe will check for a service running on port 8081 which is started when jupyterhub actually starts and thanks to this jupyterhub will only start when its the leader. So we will always only have 1 active pod with both containers running and 2 standby pods waiting in a not-ready state to get a lock and become the leader.

Once we add traefik into the mix this would mean user pods will be accessible no matter what and jupyterhub will at max have a downtime of ~12 seconds, 6 seconds for it to go through the leadership check and approx 6 more (as per my testing on osd) for the standby pod to start running and become the leader.

I haven't tested this with traefik yet but the only thing we need to look into is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

why Is it adding the route again and again and will this behaviour continue once we have traefik in the mix.

vpavlin · 2021-07-22T13:16:25Z

why Is it adding the route again and again and will this behaviour continue once we have traefik in the mix.

That's from the publish service which we are removing, so don't worry about it:)

maulikjs · 2021-07-22T13:32:08Z

perfect!

.s2i/bin/run

maulikjs · 2021-07-22T17:36:11Z

/lgtm

anishasthana

Things look good on Maulik's cluster. 🚢

Xaenalt · 2021-07-22T18:46:12Z

Awesome job

lucferbux requested review from crobby, anishasthana, Xaenalt and maulikjs July 16, 2021 18:36

vpavlin reviewed Jul 19, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

vpavlin reviewed Jul 19, 2021

View reviewed changes

.s2i/bin/run Show resolved Hide resolved

vpavlin reviewed Jul 19, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

lucferbux requested a review from vpavlin July 19, 2021 12:04

Xaenalt reviewed Jul 19, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

.s2i/bin/run Outdated Show resolved Hide resolved

vpavlin reviewed Jul 20, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

lucferbux requested review from vpavlin and Xaenalt July 20, 2021 12:38

lucferbux changed the title ~~WIP - Added leader election to JH image~~ Added leader election to JH image Jul 20, 2021

lucferbux marked this pull request as ready for review July 20, 2021 12:41

Xaenalt reviewed Jul 20, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

.s2i/bin/run Show resolved Hide resolved

lucferbux requested a review from Xaenalt July 21, 2021 11:13

Xaenalt reviewed Jul 21, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

maulikjs reviewed Jul 22, 2021

View reviewed changes

.s2i/bin/run Outdated Show resolved Hide resolved

Added leader election to JH image

9f3dac0

maulikjs approved these changes Jul 22, 2021

View reviewed changes

anishasthana approved these changes Jul 22, 2021

View reviewed changes

anishasthana merged commit e1dd86f into opendatahub-io-contrib:master Jul 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added leader election to JH image #100

Added leader election to JH image #100

lucferbux commented Jul 16, 2021 •

edited

Xaenalt commented Jul 16, 2021 •

edited

maulikjs commented Jul 19, 2021

maulikjs commented Jul 19, 2021

lucferbux commented Jul 19, 2021

lucferbux commented Jul 19, 2021

lucferbux commented Jul 19, 2021

Xaenalt left a comment

lucferbux commented Jul 19, 2021

Xaenalt commented Jul 19, 2021

lucferbux commented Jul 19, 2021

maulikjs commented Jul 22, 2021

lucferbux commented Jul 22, 2021 •

edited

maulikjs commented Jul 22, 2021

vpavlin commented Jul 22, 2021

maulikjs commented Jul 22, 2021

vpavlin commented Jul 22, 2021

maulikjs commented Jul 22, 2021

maulikjs commented Jul 22, 2021

anishasthana left a comment

Xaenalt commented Jul 22, 2021

Added leader election to JH image #100

Added leader election to JH image #100

Conversation

lucferbux commented Jul 16, 2021 • edited

Related Issues and Dependencies

This introduces a breaking change

This Pull Request implements

Description

Testing

Xaenalt commented Jul 16, 2021 • edited

maulikjs commented Jul 19, 2021

maulikjs commented Jul 19, 2021

lucferbux commented Jul 19, 2021

lucferbux commented Jul 19, 2021

lucferbux commented Jul 19, 2021

Xaenalt left a comment

Choose a reason for hiding this comment

lucferbux commented Jul 19, 2021

Xaenalt commented Jul 19, 2021

lucferbux commented Jul 19, 2021

maulikjs commented Jul 22, 2021

lucferbux commented Jul 22, 2021 • edited

maulikjs commented Jul 22, 2021

vpavlin commented Jul 22, 2021

maulikjs commented Jul 22, 2021

vpavlin commented Jul 22, 2021

maulikjs commented Jul 22, 2021

maulikjs commented Jul 22, 2021

anishasthana left a comment

Choose a reason for hiding this comment

Xaenalt commented Jul 22, 2021

lucferbux commented Jul 16, 2021 •

edited

Xaenalt commented Jul 16, 2021 •

edited

lucferbux commented Jul 22, 2021 •

edited