Skip to content
This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

Added leader election to JH image #100

Merged
merged 1 commit into from
Jul 22, 2021
Merged

Added leader election to JH image #100

merged 1 commit into from
Jul 22, 2021

Conversation

lucferbux
Copy link

@lucferbux lucferbux commented Jul 16, 2021

Related Issues and Dependencies

RHODS-767

This introduces a breaking change

  • Yes
  • No

Added the s2i run step to check if the image is the leader. By checking the port opened by the sidecar leader container only the leader will start the execution.

This Pull Request implements

Description

Based on the work of of Sean add check method for the kubernetes leader election pattern.

This PR is linked to the Added sidecar leader election on JH PR

Testing

This image could be built with the following command:

s2i build git@github.com:lucferbux/jupyterhub-odh.git --ref feature/add-leader-election --context-dir / quay.io/odh-jupyterhub/jupyterhub:v3.5.3 [registry]

For further info you can check this test log

@Xaenalt
Copy link

Xaenalt commented Jul 16, 2021

LGTM, you'll need to add the actual leader election image to the pod though, https://github.com/Xaenalt/jupyterhub-odh-ha/blob/abad9fe9519c18866767deebb2821289d12d3dd1/jupyterhub-dc.yaml#L101-L103

Also, is that s2i run script executed before any of the others in the pod, I added it originally to /opt/app-root/builder/run since that's what the JH pod invokes

@maulikjs
Copy link

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning

When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.

Does this mean it wont invoke the actual jupyterhub run script?

@maulikjs
Copy link

@Xaenalt This is the associated PR to odh-manifests with the sidecar: opendatahub-io/odh-manifests#460

@lucferbux
Copy link
Author

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning

When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.

Does this mean it wont invoke the actual jupyterhub run script?

yeah, I totally forgot to implement the exec command, it's just running the exec to notify when to start running the jupyter notebook, so instead of the echo "Became leader! Start the program!" we should call the script.

@lucferbux
Copy link
Author

https://docs.openshift.com/container-platform/4.7/openshift_images/using_images/customizing-s2i-images.html#images-using-customizing-s2i-images-scripts-embedded_customizing-s2i-images
One thing we would want to note is if you scroll to the bottom of the above page and read the warning

When wrapping the run script, you must use exec for invoking it to ensure signals are handled properly. The use of exec also precludes the ability to run additional commands after invoking the default image run script.

Does this mean it wont invoke the actual jupyterhub run script?

I think is done now, it will only be called by the leader, as the echo ... was only executed by the leader, I'm gonna try this approach to check if it works

.s2i/bin/run Outdated Show resolved Hide resolved
.s2i/bin/run Show resolved Hide resolved
.s2i/bin/run Outdated Show resolved Hide resolved
@lucferbux
Copy link
Author

As we are still discussing the sleep time I'm committing the fixes in separate commits, once it's sorted out ill squash them.

@lucferbux lucferbux requested a review from vpavlin July 19, 2021 12:04
Copy link

@Xaenalt Xaenalt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it didn't put these comments out there until I hit unresolve, my bad

.s2i/bin/run Outdated Show resolved Hide resolved
.s2i/bin/run Outdated Show resolved Hide resolved
@lucferbux
Copy link
Author

Oh, it didn't put these comments out there until I hit unresolve, my bad

Ok thanks, so I would then resolve @vpavlin comments, sorry, I haven't been able to test it cause my cluster is still down.

@Xaenalt
Copy link

Xaenalt commented Jul 19, 2021

Yeah, as nice as it'd be if the -http method worked, if you can find what's wrong and fix it, go for it

To reproduce, do the http method with a replicaset and delete the current leader, it'll never be updated in the new leader on the http method, but it'll update properly using the get endpoint method

@lucferbux
Copy link
Author

Yeah, as nice as it'd be if the -http method worked, if you can find what's wrong and fix it, go for it

To reproduce, do the http method with a replicaset and delete the current leader, it'll never be updated in the new leader on the http method, but it'll update properly using the get endpoint method

ok perfect 👍

.s2i/bin/run Outdated Show resolved Hide resolved
@lucferbux lucferbux changed the title WIP - Added leader election to JH image Added leader election to JH image Jul 20, 2021
@lucferbux lucferbux marked this pull request as ready for review July 20, 2021 12:41
.s2i/bin/run Outdated Show resolved Hide resolved
.s2i/bin/run Show resolved Hide resolved
@lucferbux lucferbux requested a review from Xaenalt July 21, 2021 11:13
.s2i/bin/run Outdated Show resolved Hide resolved
@maulikjs
Copy link

Also sometimes I get this error:
image

Refreshing the page multiple times takes care of it though which is odd.
Another thing that is odd is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

It keeps trying to add route for a pod which exists.

@lucferbux
Copy link
Author

lucferbux commented Jul 22, 2021

Also sometimes I get this error:
image

Refreshing the page multiple times takes care of it though which is odd.
Another thing that is odd is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

It keeps trying to add route for a pod which exists.

Ok, I've implemented your solution @maulikjs, as you stated the service was load balancing between the 3 pods even though two of them were not ready, hence the API route error.
Adding readinessProbe to check if the service is up, so it can start accept traffic. Changes are live here: opendatahub-io/odh-manifests#460

@maulikjs
Copy link

I can confirm this is due to the pods being reported as missing and the proxy routing the request to pods which are waiting for leader and arent running JH yet.
We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

@vpavlin
Copy link

vpavlin commented Jul 22, 2021

We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?

@maulikjs
Copy link

We need to add a readinessProbe which wont report the pod as running until it is actually the leader.

Will that result in failing/restarting pods when the readinessProbe check potentially never gets satisfied?

Nope, we add either do not add a livenessProbe or empty livenessprobe which always evaluates to true.
Based on how @lucferbux configured the deployment, the readinessProbe will check for a service running on port 8081 which is started when jupyterhub actually starts and thanks to this jupyterhub will only start when its the leader. So we will always only have 1 active pod with both containers running and 2 standby pods waiting in a not-ready state to get a lock and become the leader.

Once we add traefik into the mix this would mean user pods will be accessible no matter what and jupyterhub will at max have a downtime of ~12 seconds, 6 seconds for it to go through the leadership check and approx 6 more (as per my testing on osd) for the standby pod to start running and become the leader.

I haven't tested this with traefik yet but the only thing we need to look into is:

/user/mshah@redhat.com/public
08:25:51.480 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:51.481 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:25:56.486 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:25:56.490 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public
08:26:01.495 [ConfigProxy] �[32minfo�[39m: 200 GET /api/routes
http://10.128.2.33:9090!=http://10.128.2.33:8080
ParseResult(scheme='http', netloc='10.128.2.33:8080', path='', params='', query='', fragment='')
/user/mshah@redhat.com/public
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Adding route /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: Route added /user/mshah@redhat.com/public -> http://10.128.2.33:9090
08:26:01.499 [ConfigProxy] �[32minfo�[39m: 201 POST /api/routes/user/mshah@redhat.com/public

why Is it adding the route again and again and will this behaviour continue once we have traefik in the mix.

@vpavlin
Copy link

vpavlin commented Jul 22, 2021

why Is it adding the route again and again and will this behaviour continue once we have traefik in the mix.

That's from the publish service which we are removing, so don't worry about it:)

@maulikjs
Copy link

perfect!

.s2i/bin/run Outdated Show resolved Hide resolved
@maulikjs
Copy link

/lgtm

Copy link

@anishasthana anishasthana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things look good on Maulik's cluster. 🚢

@anishasthana anishasthana merged commit e1dd86f into opendatahub-io-contrib:master Jul 22, 2021
@Xaenalt
Copy link

Xaenalt commented Jul 22, 2021

Awesome job

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants