Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Availability modes #1095

Merged
merged 8 commits into from
Jun 27, 2022
Merged

Conversation

kevin-bates
Copy link
Member

#1086 references an issue whereby the loading of persisted kernel sessions at EG's startup was commented out when the changes for #737 were merged. PR #737 essentially enabled the ability to, so to speak, have multiple instances of EG running simultaneously emulating an active-active availability. The previous code, on the other hand, emulated more of an active-passive behavior where only a single EG instance is running but introducing a higher degree of resiliency, as pointed out in #1086. Some users have found that functionality helpful and we should try to accommodate that use case as well.

This pull request introduces a configurable option named availability_mode that can hold one of three values: None (default), active-active, and active-passive. Both non-none values require that kernel session persistence also be enabled. Since 'active-active' was essentially the default behavior (when kernel session persistence was enabled), we will automatically set the availability_mode to active-active whenever kernel session persistence is enabled and availability mode is not - thereby providing a form of backward compatibility.

Users desiring a single-instanced EG that is capable of restarting following an unexpected failure can now use the availability mode of 'active-passive'.

These modes (including kernel session persistence) can be enabled via a configuration file, command line, or environment variables as noted in the documentation or when running jupyter enterprisegateway --help-all.

As noted in the companion documentation, this functionality should be considered experimental!

Resolves: #1086

enterprise_gateway/mixins.py Outdated Show resolved Hide resolved

# If we're using active-passive availability, attempt to start persisted sessions
if self.availability_mode == "active-passive":
self.kernel_session_manager.start_sessions()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevin-bates : I went over the previous comments on this decision, but I am still do not understand completely on "why we wouldn't want to load all the sessions from persistence at server start" irrespective of the availability_mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've had similar discussions with myself. 😄 (It's good to have someone else to talk with about this!)

I think the primary issue here is affinity and if we load the sessions in active-active then all EG nodes will have a KernelManager thinking they are managing the kernel. In active-active, a "second" node will only manage a "previously managed" kernel when the "previously-managing node" has gone down, so there's still only one node managing the kernel (because we always require "node affinity".

Perhaps the terms active-active and active-passive to describe these modes are not quite correct. As @dnwe pointed out in the issue, they use active-passive as more of a form of resilency than HA (I suppose more for DR). Perhaps we could spin active-active to HA and active-passive to DR?

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my thoughts on the naming the availability modes:
active-passive -> single_instance
"active-active" -> multi_instance

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these names as they essentially describe the expected configuration of each and don't attempt to overload or conflate the meanings of the classic HA/DR terms.

I would like to continue using hyphens as the separators in the string values. (I view underscores more for variable names and constants.) So let's go with "single-instance" ("active-passive" is used) and "multi-instance" (where "active-active" is used). Does that sound okay?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the values to use the instance references. Note that I also added code to auto-enable kernel session persistence if not set when availability mode is set. It felt a little overbearing to require the persistence setting when it's required to use "availability". So, rather than throw an exception, we'll log an informational message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevin-bates : I was going over some other service documentation where I found 2 new terms used to describe the similar availability scenarios:

  1. active-passive -> standalone
  2. active-active -> replication

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - I think I like these names over "single-instance" and "multi-instance", especially if there's precedent.

@lresende - you just approved this PR. Are you okay with going with the names "Standalone" and "Replication"?

Known issues include:
1. Culling configurations do not account for different nodes and therefore could result in the premature culling of kernels.
2. Each "node switch" requires a manual reconnect to the kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the above issues only with "active-active" mode and not with "active-passive" mode of EG?

Copy link
Member Author

@kevin-bates kevin-bates Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reconnecting is necessary for both forms.

Even with "active-active", because we still expect/advise affinity with the managed kernel, you shouldn't run into an issue where the kernel is culled prematurely because it should always stay on the originating node. Only if the affinity is not configured (or not working) could the kernel be culled prematurely from the previous node.

I'll look into some better wording for this, but we should probably better understand where things are with this before merging. Thanks for this comment.

)

# If we're using single-instance availability, attempt to start persisted sessions
if self.availability_mode == "single-instance":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we define constants / static variables for these availability_mode so that it can be used across modules / files.

@@ -162,7 +162,9 @@ def check_kernel_id(self, kernel_id):
self.parent.kernel_session_manager.delete_session(kernel_id)
raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

def _refresh_kernel(self, kernel_id):
def _refresh_kernel(self, kernel_id) -> bool:
if not self.parent.availability_mode or self.parent.availability_mode == "single-instance":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thought here is, incase of s_i mode, the kernels are already hydrated when the EG server starts..so there is not need to check the persistence for kernel ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The multi-kernel manager should be aware of all active kernels in this case.

Copy link
Contributor

@rahul26goyal rahul26goyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Approving it.

@kevin-bates
Copy link
Member Author

I still need to apply the final name changes with "Standalone" and "Replication" so let's not merge yet.

@kevin-bates
Copy link
Member Author

Need to rework the docs now that #1101 has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Persisted sessions are not restored at startup, only if requested by kernel ID, leaking pods?
3 participants