Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cull idle kernels #2215

Merged
merged 4 commits into from Apr 5, 2017
Merged

Cull idle kernels #2215

merged 4 commits into from Apr 5, 2017

Conversation

@kevin-bates
Copy link
Member

@kevin-bates kevin-bates commented Feb 22, 2017

(Updated to include review recommendations below...)

This change introduces the ability to cull kernels that have been idle for a specified amount of time. Off by default, it is enabled by setting --MappingKernelManager.cull_idle_timeout to a positive value representing the number of seconds a kernel must remain inactive to be culled. This feature leverages activity tracking's last_activity value to determine when the last activity for a given kernel occurred. Typical values for cull_idle_timeout will be on the order of 43200 (12 hours) or 86400 (24 hours) seconds as described in this JKG issue. These changes originally targeted Jupyter Kernel Gateway per issue #226 but really belongs in the Notebook server.

An optional property has been introduced to adjust the interval in which idle kernels are culled provided they have exceeded the aforementioned timeout property. This property is adjusted by setting --MappingKernelManager.cull_interval to a positive value. If not set or non-positive, the default value of 300 (5 minutes) seconds will be used.

JKG Issue 227 has been opened to remove the activity tracking that was also in place in the JKG. Prior to that, there could be false reports of active kernels that have been culled by the underlying MappingKernelManager. If not removed, we should minimally adjust the JKG's activity tracking handler to consult the active kernel manager class on each request.

@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES'

This comment has been minimized.

@parente

parente Feb 23, 2017
Member

Providing a way to set traitlets with environment variables is done in the kernel gateway code base, but it's not common in the notebook codebase.

Copy link
Member

@minrk minrk left a comment

Thanks for the PR! I left a few comments inline. Let me know if you'd like any help finishing it up, or clarifying things.

You can push new commits to this same branch and it will update the PR.

@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES'
cull_kernels_after_minutes_default = 0

This comment has been minimized.

@minrk

minrk Feb 23, 2017
Member

Let's simplify the trait declarations for now. @parente is right that the environment variable declarations aren't generally used here, so we can have a single assignment:

cull_idle_timeout = Integer(0, config=True,
    help="...
)

without the @default generator or other variables defined.

Plus, I think it might be clearer to use seconds everywhere, rather than a _minutes timeout here and a _seconds timeout below. How about we call them:

cull_idle_timeout: timeout (in seconds) after which a kernel is considered idle and ready to be culled.
cull_interval: interval (in seconds) on which to check for idle kernels to cleanup...

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

Thank you - this has been addressed in the subsequent commit.

for kId, kernel in self._kernels.items():
self.cull_kernel(kId, kernel)

def cull_kernel(self, kId, kernel):

This comment has been minimized.

@minrk

minrk Feb 23, 2017
Member

cull_kernel can just take a kernel_id argument, and the above can be for kernel_id in list(self._kernels). The list will make a copy of the keys, which may be needed, as shutting down a kernel will remove it from the _kernels dictionary and you can't iterate over a dictionary while changing it, for example:

for kernel_id in list(self._kernels):
    self.cull_kernel_if_idle(kernel_id)

def cull_kernel_if_idle(self, kernel_id):
    kernel = self._kernels[kernel_id]
    ...

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

Good point - thanks.

if self._culler_callback is None:
loop = IOLoop.current()
if self.kernel_culling_interval_seconds <= 0: #handle case where user set invalid value
self.log.warn("Invalid value for 'kernel_culling_interval_seconds' detected (%s) - using default value (%s).",

This comment has been minimized.

@minrk

minrk Feb 23, 2017
Member

log.warn is deprecated in Python in favor of log.warning

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

+1

if activity is not None:
dtNow = utcnow()
#dtActivity = datetime.strptime(activity,'%Y-%m-%dT%H:%M:%S.%f')
dtIdle = dtNow - activity

This comment has been minimized.

@minrk

minrk Feb 23, 2017
Member

In Python, we try to use _ to separate words, such as kernel_id, idle_time instead of camelCase

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

I had noticed the convention previously but am finding this to be a very difficult habit to break - thanks for pointing it out.

Copy link
Member

@Carreau Carreau left a comment

+1 I think many people will be happy to see this ! It deserved a good note in what's new.

A couple of suggestion, but it's look ok as is for 5.1

if self.cull_interval <= 0: #handle case where user set invalid value
self.log.warning("Invalid value for 'cull_interval' detected (%s) - using default value (%s).",
self.cull_interval, self.cull_interval_default)
self.cull_interval = self.cull_interval_default

This comment has been minimized.

@Carreau

Carreau Feb 23, 2017
Member

Should we also warn for cull_interval < 60s ? it's technically not wrong, but it's likely that user might have thoughs it was minutes. Even anything below 300s seem it can be a mistake.

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

Yeah, I agree. I thought implementing a minimum timeout would be good (now that I've tested it :-)). 300 strikes me as good minimum. If folks are making changes in these areas and need shorter timeouts for testing, they can temporarily adjust the constant.

Positive values less than the minimum would result in a warning and auto-adjustment to the minimum (similar to what happens if the interval is <= 0 and culling is enabled).

So, to your comment above, the description would be...

"Timeout (in seconds) after which a kernel is considered idle and ready to be culled. Values of 0 or lower disable culling. The minimum timeout is 300 seconds. Positive values less than the minimum will be adjusted to the minimum."

(or is that too verbose?)

This comment has been minimized.

@minrk

minrk Feb 24, 2017
Member

@Carreau a warning is a good idea.

@kevin-bates that sounds perfect. You might add (5 minutes) next to 300 seconds.

This comment has been minimized.

@kevin-bates

kevin-bates Feb 24, 2017
Author Member

Will do.

@@ -52,6 +58,15 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_idle_timeout = Integer(0, config=True,
help="""Timeout (in seconds) after which a kernel is considered idle and ready to be culled."""

This comment has been minimized.

@Carreau

Carreau Feb 23, 2017
Member

Nitpicking, I think a "Values of 0 or lower will disable culling" could be nice. But we can improve later in a subsequent PR.

This comment has been minimized.

@kevin-bates

kevin-bates Feb 23, 2017
Author Member

+1

def cull_kernels(self):
self.log.debug("Polling every %s seconds for kernels idle > %s seconds...",
self.cull_interval, self.cull_idle_timeout)
for kernel_id in list(self._kernels):

This comment has been minimized.

@Carreau

Carreau Feb 23, 2017
Member

I'm guessing the list() here is to reify the things you want to try to cull and make sure you don't modify the list while culling. Is that right ? If so it seem like something easily broken.

What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?
That should make the flow a little bit easier to follow (and also to change the predicate if necessary).

This comment has been minimized.

@minrk

minrk Feb 24, 2017
Member

Either way is fine. I don't think getting a list of dictionary keys is easily broken.

Would this be clearer?

# identify the idle kernels to be culled
to_cull = [ kid for kid in self._kernels if self.should_cull(kid) ]
# cull them
for kernel_id in to_cull:
    self.shutdown_kernel(kernel_id)

This comment has been minimized.

@parente

parente Feb 24, 2017
Member

What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?

Sounds good to me as long as there's appropriate checks in place so that an problem getting the status of any one kernel (if that's possible) doesn't prevent cleanup of others that should be culled.

This comment has been minimized.

@kevin-bates

kevin-bates Feb 24, 2017
Author Member

I guess I'm not really following the benefit of obtaining the list of cullable kernels, then walking that list to perform the cull when we could just as easily encounter an issue determining the cullable kernels. So, to address the point raised by @parente, what about something like the following? (Is that the right way to handle an unknown set of exceptions in python?)

        for kernel_id in list(self._kernels):
            try:
                self.cull_kernel_if_idle(kernel_id)
            except:
                self.log.error("The following error was encountered while checking the idle duration of kernel %s: %s",
                    kernel_id, sys.exc_info()[0])

This comment has been minimized.

@minrk

minrk Feb 24, 2017
Member

That works, too. There are plenty of valid ways to do it. A few notes:

  • a comment about copying the kernel id list so we don't modify the dict during iteration might help clarity
  • you can use log.exception to automatically log a traceback, so you don't need to use exc_info.
  • always use at least except Exception:, because except: catches things like KeyboardInterrupts that shouldn't be caught.

This comment has been minimized.

@kevin-bates

kevin-bates Feb 24, 2017
Author Member

Thank you @minrk - very helpful. Here's what I've got...

        """Create a separate list of kernels to avoid conflicting updates while iterating"""
        for kernel_id in list(self._kernels):
            try:
                self.cull_kernel_if_idle(kernel_id)
            except Exception as e:
                self.log.exception("The following exception was encountered while checking the idle duration of kernel %s: %s",
                    kernel_id, e)
@minrk
Copy link
Member

@minrk minrk commented Feb 26, 2017

@kevin-bates excellent, thanks! We're in the process of finishing up the 5.0 release from the master branch, but this can be merged once that release has been pushed out.

@minrk
minrk approved these changes Feb 26, 2017
@minrk
Copy link
Member

@minrk minrk commented Apr 5, 2017

5.0 is out, long live 5.1!

@minrk minrk merged commit 8d6460a into jupyter:master Apr 5, 2017
4 checks passed
4 checks passed
codecov/patch 38.09% of diff hit (target 0%)
Details
codecov/project 76.72% (-0.22%) compared to 9dd3818
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@kevin-bates kevin-bates deleted the kevin-bates:cull-idle-kernels branch Apr 5, 2017
@parente
Copy link
Member

@parente parente commented May 17, 2017

After finally getting a test bed setup for this change, I realize that it does not take into account whether the kernel is busy or not, or whether there are connections to the kernel or not. I'm going to submit a follow-on PR that adds settings for whether these attributes should affect the culling decision or not.

@bigquant
Copy link

@bigquant bigquant commented Aug 3, 2017

when will 5.1 go release?

@kevin-bates
Copy link
Member Author

@kevin-bates kevin-bates commented Aug 3, 2017

@bigquant - Sounds like its very soon per the latest Jupyter Weekly Summary (2017, week 29): https://groups.google.com/forum/#!topic/jupyter/NgR4XpUf66E

Notebook (Grant)

    Grant is closing out outstanding issues marked with 5.1 milestone
    Planning on releasing a 5.1 release candidate by EOW
@takluyver
Copy link
Member

@takluyver takluyver commented Aug 3, 2017

Yup, we're hoping to put the release candidate out any day now.

@birdstar
Copy link

@birdstar birdstar commented Sep 24, 2017

This is exactly the feature I wanted!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants
You can’t perform that action at this time.