Merge pull request #366 from GeorgianaElena/addIdleCuller
Add idle culler
yuvipanda committed Jun 14, 2019
2 parents 4daa965 + 3467015 commit ba86dcb
Showing 6 changed files with 642 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -124,6 +124,7 @@ Topic guides provide in-depth explanations of specific topics.
topic/tljh-config
topic/authenticator-configuration
topic/escape-hatch
topic/idle-culler


Troubleshooting
114 changes: 114 additions & 0 deletions docs/topic/idle-culler.rst
@@ -0,0 +1,114 @@
.. _topic/idle-culler:

=============================
Culling idle notebook servers
=============================

The idle culler automatically shuts down user notebook servers when they have
not been used for a certain time period, in order to reduce the total resource
usage on your JupyterHub.


JupyterHub pings the user's notebook server at regular intervals. If no response
is received from the server during these checks and the timeout expires, the server is
considered *inactive (idle)* and is culled.


Default settings
================

By default, JupyterHub will ping the user notebook servers every 60s to check their
status. Every server found to be idle for more than 10 minutes will be culled.

.. code-block:: python

   services.cull.every = 60
   services.cull.timeout = 600

Because the servers don't have a maximum age set, an active server will not be shut down
regardless of how long it has been up and running.

.. code-block:: python

   services.cull.max_age = 0

If, after the culling process, some users have no active notebook servers, those users
are not culled alongside their notebooks by default and continue to exist.

.. code-block:: python

   services.cull.users = False

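The defaults above apply only where nothing else has been set. As a minimal sketch, assuming values set through ``tljh-config`` are merged recursively over the defaults, the effective settings could be computed as follows. The name ``merge_over_defaults`` is hypothetical and is not TLJH's own function; the default values are copied from this commit.

```python
import copy

# Defaults copied from the 'services.cull' entry added in this commit.
CULL_DEFAULTS = {
    "services": {
        "cull": {
            "enabled": True,
            "timeout": 600,
            "every": 60,
            "concurrency": 5,
            "users": False,
            "max_age": 0,
        }
    }
}


def merge_over_defaults(defaults, overrides):
    """Hypothetical helper: return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling keys keep their defaults.
            merged[key] = merge_over_defaults(merged[key], value)
        else:
            merged[key] = value
    return merged


# Only 'every' is overridden; every other key keeps its default value.
effective = merge_over_defaults(CULL_DEFAULTS, {"services": {"cull": {"every": 10}}})
print(effective["services"]["cull"])
```

Overriding a single key this way leaves the remaining defaults, such as the 600 second timeout, in place.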
Configuring the idle culler
===========================

The available configuration options are:

Idle timeout
------------
The idle timeout is the maximum time (in seconds) a server can be inactive before it
will be culled. The timeout can be configured using:

.. code-block:: bash

   sudo tljh-config set services.cull.timeout <max-idle-sec-before-server-is-culled>
   sudo tljh-config reload

Idle check interval
-------------------
The idle check interval is how frequently (in seconds) the Hub
checks whether there are any idle servers to cull. It can be configured using:

.. code-block:: bash

   sudo tljh-config set services.cull.every <number-of-sec-this-check-is-done>
   sudo tljh-config reload

Maximum age
-----------
The maximum age is the longest time (in seconds) a server may keep running.
Servers that exceed the maximum age are culled even if they are active.
A maximum age of 0 disables this option.
The maximum age can be configured using:

.. code-block:: bash

   sudo tljh-config set services.cull.max_age <server-max-age>
   sudo tljh-config reload

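The interaction between the idle timeout and the maximum age can be illustrated with a small decision function. This is only a sketch of the rules described in this guide, not the culler script's actual code; the function name ``should_cull`` and its parameters are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def should_cull(last_activity, started, now, timeout=600, max_age=0):
    """Illustrative rule: cull when idle too long, or when older than max_age."""
    idle_seconds = (now - last_activity).total_seconds()
    if idle_seconds >= timeout:
        return True
    # A max_age of 0 disables the age check entirely.
    if max_age and (now - started).total_seconds() >= max_age:
        return True
    return False


now = datetime(2019, 6, 14, 12, 0, 0)

# Idle for 15 minutes with the default 600 s timeout.
idle_server = should_cull(now - timedelta(minutes=15), now - timedelta(hours=1), now)

# Active one minute ago, running for two hours, no maximum age set.
active_server = should_cull(now - timedelta(minutes=1), now - timedelta(hours=2), now)

# Same active server, but with a one hour maximum age.
old_server = should_cull(
    now - timedelta(minutes=1), now - timedelta(hours=2), now, max_age=3600
)

print(idle_server, active_server, old_server)
```

With these inputs the idle server and the server past its maximum age are marked for culling, while the active server without a maximum age is left running.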
User culling
------------
In addition to servers, it is also possible to cull the users. This is usually
suited for temporary-user cases such as *tmpnb*.
User culling can be activated using the following command:

.. code-block:: bash

   sudo tljh-config set services.cull.users True
   sudo tljh-config reload

Concurrency
-----------
Deleting a lot of users at the same time can slow down the Hub.
The number of concurrent requests made to the Hub can be configured using:

.. code-block:: bash

   sudo tljh-config set services.cull.concurrency <number-of-concurrent-hub-requests>
   sudo tljh-config reload

Because TLJH is meant for a small number of users, you should rarely need to
change the concurrency limit.

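To show what a concurrency limit means in practice, here is a minimal sketch that caps the number of simultaneous requests with ``asyncio.Semaphore``. It is an assumption-laden illustration, not the culler's implementation: ``cull_all`` and ``cull_one`` are hypothetical names, and ``asyncio.sleep`` stands in for a real HTTP request to the Hub.

```python
import asyncio


async def cull_all(names, concurrency=5):
    """Run one simulated request per name, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def cull_one(name):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a request to the Hub API
            in_flight -= 1
        return name

    culled = await asyncio.gather(*(cull_one(name) for name in names))
    return culled, peak


culled, peak = asyncio.run(cull_all([f"user-{i}" for i in range(12)], concurrency=5))
print(len(culled), peak)
```

All twelve simulated requests complete, but the recorded peak never exceeds the limit of five.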


Disabling the idle culler
=========================

The idle culling service is enabled by default. To disable it, use the following
command:

.. code-block:: bash

   sudo tljh-config set services.cull.enabled False
   sudo tljh-config reload

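The options in this guide end up as command-line flags for the culler service. The sketch below mirrors the logic of ``set_cull_idle_service`` and ``update_services`` from this commit in a single self-contained function; the name ``build_cull_services`` and the ``DEFAULT_CULL`` constant are hypothetical and exist only for this illustration.

```python
import sys

# Default values taken from the 'services.cull' entry in this commit.
DEFAULT_CULL = {
    "enabled": True,
    "timeout": 600,
    "every": 60,
    "concurrency": 5,
    "users": False,
    "max_age": 0,
}


def build_cull_services(cull_config, script="/srv/src/tljh/cull_idle_servers.py"):
    """Return the list of Hub services implied by a 'services.cull' config."""
    if not cull_config["enabled"]:
        # A disabled culler contributes no service at all.
        return []
    command = [
        sys.executable,
        script,
        "--timeout=%d" % cull_config["timeout"],
        "--cull-every=%d" % cull_config["every"],
        "--concurrency=%d" % cull_config["concurrency"],
        "--max-age=%d" % cull_config["max_age"],
    ]
    if cull_config["users"]:
        command.append("--cull-users")
    return [{"name": "cull-idle", "admin": True, "command": command}]


default_services = build_cull_services(DEFAULT_CULL)
disabled_services = build_cull_services(dict(DEFAULT_CULL, enabled=False))
print(default_services[0]["command"][2:])
print(disabled_services)
```

Setting ``enabled`` to ``False`` yields an empty service list, which corresponds to the culler not being registered with the Hub.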
98 changes: 97 additions & 1 deletion integration-tests/test_hub.py
@@ -1,6 +1,7 @@
import requests
from hubtraf.user import User
from hubtraf.auth.dummy import login_dummy
from jupyterhub.utils import exponential_backoff
import secrets
import pytest
from functools import partial
@@ -137,4 +138,99 @@ async def test_long_username():
'-u', 'jupyterhub',
'--no-pager'
])
raise


@pytest.mark.asyncio
async def test_idle_server_culled():
    """
    User logs in, starts a server & stays idle for 1 min.
    (the user's server should be culled during this period)
    """
    # This *must* be localhost, not an IP
    # aiohttp throws away cookies if we are connecting to an IP!
    hub_url = 'http://localhost'
    username = secrets.token_hex(8)

    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'auth.type', 'dummyauthenticator.DummyAuthenticator')).wait()
    # Check every 10s for idle servers to cull
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.every', "10")).wait()
    # Apart from servers, also cull users
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.users', "True")).wait()
    # Cull servers and users after 60s of activity
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.max_age', "60")).wait()
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'reload')).wait()

    async with User(username, hub_url, partial(login_dummy, password='')) as u:
        await u.login()
        # Start user's server
        await u.ensure_server()
        # Assert that the user exists
        assert pwd.getpwnam(f'jupyter-{username}') is not None

        # Check that we can get to the user's server
        r = await u.session.get(u.hub_url / 'hub/api/users' / username,
                                headers={'Referer': str(u.hub_url / 'hub/')})
        assert r.status == 200

        async def _check_culling_done():
            # Check that after 60s, the user and server have been culled and are not reachable anymore
            r = await u.session.get(u.hub_url / 'hub/api/users' / username,
                                    headers={'Referer': str(u.hub_url / 'hub/')})
            print(r.status)
            return r.status == 403

        await exponential_backoff(
            _check_culling_done,
            "Server culling failed!",
            timeout=100,
        )

@pytest.mark.asyncio
async def test_active_server_not_culled():
    """
    User logs in, starts a server & stays idle for 30s
    (the user's server should not be culled during this period).
    """
    # This *must* be localhost, not an IP
    # aiohttp throws away cookies if we are connecting to an IP!
    hub_url = 'http://localhost'
    username = secrets.token_hex(8)

    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'auth.type', 'dummyauthenticator.DummyAuthenticator')).wait()
    # Check every 10s for idle servers to cull
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.every', "10")).wait()
    # Apart from servers, also cull users
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.users', "True")).wait()
    # Cull servers and users after 60s of activity
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'set', 'services.cull.max_age', "60")).wait()
    assert 0 == await (await asyncio.create_subprocess_exec(*TLJH_CONFIG_PATH, 'reload')).wait()

    async with User(username, hub_url, partial(login_dummy, password='')) as u:
        await u.login()
        # Start user's server
        await u.ensure_server()
        # Assert that the user exists
        assert pwd.getpwnam(f'jupyter-{username}') is not None

        # Check that we can get to the user's server
        r = await u.session.get(u.hub_url / 'hub/api/users' / username,
                                headers={'Referer': str(u.hub_url / 'hub/')})
        assert r.status == 200

        async def _check_culling_done():
            # Check that after 30s, we can still reach the user's server
            r = await u.session.get(u.hub_url / 'hub/api/users' / username,
                                    headers={'Referer': str(u.hub_url / 'hub/')})
            print(r.status)
            return r.status != 200

        try:
            await exponential_backoff(
                _check_culling_done,
                "User's server is still reachable!",
                timeout=30,
            )
        except TimeoutError:
            # During the 30s timeout the user's server wasn't culled, which is what we intended.
            pass
44 changes: 44 additions & 0 deletions tests/test_configurer.py
@@ -3,6 +3,7 @@
"""

import os
import sys

from tljh import configurer

@@ -187,6 +188,49 @@ def test_set_traefik_api():
assert c.TraefikTomlProxy.traefik_api_password == '1234'


def test_cull_service_default():
    """
    Test default cull service settings with no overrides
    """
    c = apply_mock_config({})

    cull_cmd = [
        sys.executable, '/srv/src/tljh/cull_idle_servers.py',
        '--timeout=600', '--cull-every=60', '--concurrency=5',
        '--max-age=0'
    ]
    assert c.JupyterHub.services == [{
        'name': 'cull-idle',
        'admin': True,
        'command': cull_cmd,
    }]


def test_set_cull_service():
    """
    Test setting cull service options
    """
    c = apply_mock_config({
        'services': {
            'cull': {
                'every': 10,
                'users': True,
                'max_age': 60
            }
        }
    })
    cull_cmd = [
        sys.executable, '/srv/src/tljh/cull_idle_servers.py',
        '--timeout=600', '--cull-every=10', '--concurrency=5',
        '--max-age=60', '--cull-users'
    ]
    assert c.JupyterHub.services == [{
        'name': 'cull-idle',
        'admin': True,
        'command': cull_cmd,
    }]


def test_load_secrets(tljh_dir):
"""
Test loading secret files
44 changes: 44 additions & 0 deletions tljh/configurer.py
@@ -9,6 +9,7 @@
"""

import os
import sys

from .config import CONFIG_FILE, STATE_DIR
from .yaml import yaml
@@ -55,6 +56,16 @@
    'user_environment': {
        'default_app': 'classic',
    },
    'services': {
        'cull': {
            'enabled': True,
            'timeout': 600,
            'every': 60,
            'concurrency': 5,
            'users': False,
            'max_age': 0
        }
    }
}

def load_config(config_file=CONFIG_FILE):
@@ -86,6 +97,7 @@ def apply_config(config_overrides, c):
    update_user_environment(c, tljh_config)
    update_user_account_config(c, tljh_config)
    update_traefik_api(c, tljh_config)
    update_services(c, tljh_config)


def set_if_not_none(parent, key, value):
@@ -191,6 +203,38 @@ def update_traefik_api(c, config):
c.TraefikTomlProxy.traefik_api_password = config['traefik_api']['password']


def set_cull_idle_service(config):
    """
    Set Idle Culler service
    """
    cull_cmd = [
        sys.executable, '/srv/src/tljh/cull_idle_servers.py'
    ]
    cull_config = config['services']['cull']

    # Translate each config option into the matching command-line flag
    cull_cmd += ['--timeout=%d' % cull_config['timeout']]
    cull_cmd += ['--cull-every=%d' % cull_config['every']]
    cull_cmd += ['--concurrency=%d' % cull_config['concurrency']]
    cull_cmd += ['--max-age=%d' % cull_config['max_age']]
    if cull_config['users']:
        cull_cmd += ['--cull-users']

    cull_service = {
        'name': 'cull-idle',
        'admin': True,
        'command': cull_cmd,
    }

    return cull_service


def update_services(c, config):
    c.JupyterHub.services = []
    # Register the culler only when it is enabled in the config
    if config['services']['cull']['enabled']:
        c.JupyterHub.services.append(set_cull_idle_service(config))


def _merge_dictionaries(a, b, path=None, update=True):
"""
Merge two dictionaries recursively.