Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating FileSessionKernelManager to load&save kernel sessions to mul… #737

Merged
merged 7 commits into from
Jan 2, 2020

Conversation

Gsbreddy
Copy link
Contributor

@Gsbreddy Gsbreddy commented Sep 25, 2019

…tiple files(#735)

This fixes #735 which updates FileSessionKernelManager to load and save kernel sessions to multiple files as discussed in #562.

…tiple files(#735)

This fixes #735 which updates FileSessionKernelManager to load and save kernel sessions to multiple files as discussed in #562.
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Gsbreddy - thank you for working on this.

I think we have a disconnect on breaking the sessions into individual files. I believe this breakdown needs to more granular. Since EG now controls the handlers, we are now free to intercept 404 errors for kernels a server doesn't know about and attempt to load the persisted kernel session for that specific kernel_id. I'm not saying that needs to be part of this PR, but we need to have finer granularity on the files we're saving and loading.

I think the base class will need to define additional abstract methods like load_session(kernel_id) and save_session(kernel_id) since all subclasses (FileKernelSessionManager and others) should behave in this same manner.

I am happy to help you with this.

@kevin-bates
Copy link
Member

@Gsbreddy - is this ready for review? I see you've made some recent changes. If it's not ready, please prefix the PR's title with [WIP] and, once that prefix is removed, I'll be happy to review things.

At quick glance, I think we will want the ability to start a single session as well (given a kernel_id). Also, the removal code in save_sessions() doesn't make sense to me. In fact, I'm not sure we even need save_sessions since all sessions will be saved as a given kernel starts. Unlike load_sessions() where the startup of a new server may want to load all persisted sessions. This would apply to Active/Passive configurations.

This is coming along nicely - thank you!

@Gsbreddy Gsbreddy changed the title Updating FileSessionKernelManager to load&save kernel sessions to mul… [WIP] Updating FileSessionKernelManager to load&save kernel sessions to mul… Oct 15, 2019
@Gsbreddy
Copy link
Contributor Author

@kevin-bates To answer your question, load_session(kernel_id) will have the ability to start a single session. In next PR, I will handle the same in remotemanager.py. I get that the removal code in save_sessions() doesn't make sense, it's just the name actually. Let me rename it to something more meaningful like commit_sessions

@kevin-bates
Copy link
Member

Not really following your last statement. Methods that ONLY delete persisted sessions shouldn't be named "save_" or "commit_". That said, I'll wait for [WIP] to be removed before reviewing. Thanks again.

@Gsbreddy
Copy link
Contributor Author

Sure. Thanks.

@Gsbreddy Gsbreddy changed the title [WIP] Updating FileSessionKernelManager to load&save kernel sessions to mul… Updating FileSessionKernelManager to load&save kernel sessions to mul… Dec 13, 2019
@Gsbreddy
Copy link
Contributor Author

Please retrigger travis.

@kevin-bates
Copy link
Member

I restarted the build and it still failed rather quickly. The issue is due to a linting problem:

enterprise_gateway/services/kernels/remotemanager.py:373:37: E131 continuation line unaligned for hanging indent
Makefile:52: recipe for target 'lint' failed
make: *** [lint] Error 1

I recommend you run make lint prior to pushing changes.

These changes the methods to handle the kernel creation and deletion by overriding the api call to the Kernel handler(/api/kernel/<kernel-id>)
@Gsbreddy
Copy link
Contributor Author

Please review the changes @kevin-bates

@kevin-bates
Copy link
Member

Hi @Gsbreddy - I just wanted you to know that I'm not ignoring this. I'm on vacation until 2020, but I hope to sit down and take a good look at this in the next few days. This will be a great addition and I'm excited to take a look!

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great stuff! I had a few comments - mostly surrounding isolation and some refactoring, but the basic premise here looks really good - thank you!

Since we're now actively detecting and loading kernel sessions, do we still need this code to occur when EG starts? This is more of the active/passive approach. I suppose it doesn't hurt to have this as this would save 'on-demand' loads, but I'm not sure its necessary.

enterprise_gateway/services/kernels/handlers.py Outdated Show resolved Hide resolved
enterprise_gateway/services/kernels/remotemanager.py Outdated Show resolved Hide resolved
enterprise_gateway/services/kernels/remotemanager.py Outdated Show resolved Hide resolved
@Gsbreddy
Copy link
Contributor Author

I will make these changes and update.

This commit refactors some private methods and adds a new method(start_session(kernel_id)) in KernelSessionManager.
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look great - thank you. I just had the one comment about self.sessions_to_remove.

I was thinking about this. So this essentially provides active/active capabilities - enabling one to configure a farm of EG servers. I think kernel shutdown might be an issue, since there are no mechanisms for communicating with other EG instances that kernel X has been shutdown. As a result, kernel managers on the other EG servers will detect the kernel has died and attempt to automatically restart it.

So, at a minimum, I think auto-restarts may need to be disabled in such configurations. (I don't find the auto-restart logic to be valuable anyway since the auto-restart sequence rarely succeeds.) Then, assuming culling is configured on each EG server, the other EG servers will eventually cull the kernel because it will appear idle.

Where this is all heading is some form of periodic callback method that consults the persisted session state and "culls" kernels that have been terminated via a different EG server.

Hmm. If we could 'hook' the auto-restart logic, we could check if the kernel is still present in the persisted store and, if not (due to its termination on another server), abort the auto-restart and trigger the culling of the kernel within that EG instance.

In addition, rather than use a periodic callback, instead have an event that fires whenever the persisted store is updated. Events could be for create and termination where this issue only applies to termination. EG instances is creation could, perhaps, prepare for that kernel's usage, etc.

I don't think this should stop us in any way from moving forward, but I think we might need an answer to this issue shortly. If its as simple as disabling auto-restarts, then that would be fine for now.

@Gsbreddy
Copy link
Contributor Author

Gsbreddy commented Dec 19, 2019

@kevin-bates I agree, we have to think this through. I will take this up and see if I can come up with a solution. Meanwhile, we should be good on this PR.

…735)

Now that delete_sessions takes a list of kernel_ids, we don't need a member variable for removing sessions in KernelSessionManager.
@kevin-bates
Copy link
Member

I was thinking a few hours after my last response that using sticky sessions will go a long way here. That way, the lifecycle of the kernel would stay pinned to a single instance. Should that instance go down, then your changes would kick in on the other instance - and subsequent requests on that kernel would stay pinned to that server. Since there's still only one instance ever operating against a given kernel, we should still be okay. Of course, if the original instance didn't entirely go down (say due to a web-level network issue), then we could be facing some chaotic behaviors once the kernel is terminated.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great addition - thank you Gottam!

@kevin-bates
Copy link
Member

Shoot, there's one thing I wanted to followup on and I mentioned it above in this comment. In thinking more about this, I don't think we want this code at all. Even users deploying an active/passive operation would still benefit from your change and leaving that code in place will certainly mess things up unless all EG instances started when no kernels are hosted by other instances. So I think we should probably remove that call as part of this PR.

@Gsbreddy
Copy link
Contributor Author

Gsbreddy commented Dec 19, 2019

I will remove the call to start_sessions() from enterprisegatewayapp.py and update the PR.

@Gsbreddy
Copy link
Contributor Author

Gsbreddy commented Dec 20, 2019

Shoot, there's one thing I wanted to followup on and I mentioned it above in this comment. In thinking more about this, I don't think we want this code at all. Even users deploying an active/passive operation would still benefit from your change and leaving that code in place will certainly mess things up unless all EG instances started when no kernels are hosted by other instances. So I think we should probably remove that call as part of this PR.

@kevin-bates Made the changes.

@kevin-bates kevin-bates merged commit d4401fa into jupyter-server:master Jan 2, 2020
@kevin-bates
Copy link
Member

Sorry for the delayed response as I've been out on holiday (Happy New Year!). Thank you for this awesome contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Changing saving of kernels from single to multiple files.
2 participants