This repository has been archived by the owner. It is now read-only.

Spawning pool #69

Merged
merged 47 commits into jupyter:master from smashwilson:spawnpool on Oct 18, 2014

@smashwilson
Contributor

smashwilson commented Oct 14, 2014

Time to shave off that final 2-second delay. This PR implements #32 by pre-launching a configured (by which I mean hardcoded to "3") set of containers and handing them out to incoming requests. When a container is culled for inactivity, it's scrapped and a new one is launched and added to the pool in its place.

If you hit orchestrate.py with a path already set (using a stale link, for example, or an appear.in-style link like the one #67 implements), a personal container is created and linked for you, as happens now.

/cc @rgbkrk so he can see just how terrible this is right now, and suggest embetterment techniques 😉

  • Remove any remaining now-dead code. Container launching should be centralized and managed by the SpawnPool.
  • Get culling working again.
  • Fix the race condition in the culling routine that causes the pool to grow without bound if the culling interval is shorter than the time that it takes to cull a container.
  • Display a friendlier message (meaning, not a stack trace) when the node is out of capacity. A non-200 response code might be useful for upstream load balancing, too. If it's possible, it would be neat to do things like fail over to the next node when 500s are received.
  • Since we're using the pool size as an absolute capacity, defaulting to zero is not optimal. make dev should probably default to something like 3.
  • Clean out existing containers and proxy entries on restart of orchestrate.py.
  • Catch and log Docker API errors.
  • Change the "I'm full" page to indicate that it'll also auto-refresh for you to try again.
  • Refactor _clean_orphaned_containers. Bring more of that functionality into cull as well, so we also clean up zombies and dead proxy entries periodically.
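
For reference, here is a minimal, synchronous sketch of the hand-out/replace cycle described above. The names (SpawnPool.acquire, spawner.launch, spawner.destroy) are illustrative only; the PR's real SpawnPool is coroutine-based and wired to Docker and the proxy.

from collections import deque

class SpawnPool(object):
    """Illustrative sketch of the pre-launch pool, not the PR's actual class."""

    def __init__(self, spawner, capacity=3):
        self.spawner = spawner          # hypothetical object that launches/destroys containers
        self.capacity = capacity
        self.available = deque()        # pre-launched, unclaimed containers
        self.taken = set()              # containers already handed to users

    def prepare(self, count):
        """Pre-launch `count` containers so requests can be served instantly."""
        for _ in range(count):
            self.available.append(self.spawner.launch())

    def acquire(self):
        """Hand an idle container to an incoming request."""
        container = self.available.popleft()   # raises IndexError when the pool is empty
        self.taken.add(container)
        return container

    def release(self, container):
        """Cull an inactive container and backfill the pool in its place."""
        self.taken.discard(container)
        self.spawner.destroy(container)
        self.available.append(self.spawner.launch())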
@rgbkrk
Member

rgbkrk commented Oct 14, 2014

Awesome. That 2-second delay turns into a 15-second delay under high activity. That's the real problem.

blocking_docker_client = docker.Client(base_url=docker_host,
                                       version=version,
                                       timeout=timeout)
executor = ThreadPoolExecutor(max_workers=max_workers)

@rgbkrk

rgbkrk Oct 14, 2014

Member

Those Atom users.

@smashwilson

smashwilson Oct 14, 2014

Contributor

I need a more passive-aggressive emoji here. ➡️

@rgbkrk

rgbkrk Oct 16, 2014

Member

🙇

orchestrate.py
- app_log.debug("redirecting %s -> %s", self.request.path, url)
- self.redirect(url, permanent=False)
+ prefix = path.lstrip('/').split('/', 1)[0]
+ app_log.info("Initializing a new ad-hoc container for [%s].", prefix)

@rgbkrk

rgbkrk Oct 14, 2014

Member

I'm ok with saying this is "provisioning a new ad-hoc container".

orchestrate.py
+ yield self.wait_for_server(ip, port, prefix)
+
+ if path is None:
+ url = "/%s/notebooks/Welcome.ipynb" % prefix

@smashwilson

smashwilson Oct 14, 2014

Contributor

Oh right, I should use self.redirect_uri here, too.

orchestrate.py
-
+
+ # Pre-launch a set number of containers, ready to serve.
+ pool.prepare(3)

@rgbkrk

rgbkrk Oct 14, 2014

Member

I think if we have a --pool-size option it would work well. It could default to either None or 0, I guess.
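
A minimal sketch of what that option could look like with tornado.options; the flag name and default are assumptions, not what the PR ships (it currently hardcodes pool.prepare(3)).

from tornado.options import define, options

# Hypothetical flag name and default.
define("pool_size", default=3, type=int,
       help="Number of notebook containers to pre-launch and keep ready")

# ...later, during startup:
# pool.prepare(options.pool_size)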

@smashwilson

smashwilson Oct 14, 2014

Contributor

Works for me. I was assuming we'd do something like that, but I thought you had a PR in flight to tighten up the configuration, so I just hardcoded it for now.

@smashwilson
Contributor

smashwilson commented Oct 14, 2014

One question: what do we do when the pool is empty? Right now, it raises an exception.

There are two directions I can take this in:

  1. Make the pool own all containers: count the ad-hoc containers that are launched, too, and use the spawning pool as a kind of resource-management strategy. When all of the container slots on a host are filled, show a template.
  2. Make the pool responsible only for the containers within it, and trust the system to fail gracefully if we get a spike in concurrent users for some reason.

2 makes for cleaner conceptual boundaries, but I like 1 because it feels more controlled. Maybe I can even do some things to balance ad-hoc vs. prelaunched containers. What do you think?

@smashwilson
Contributor

smashwilson commented Oct 14, 2014

Also, I think it'd clean up quite a bit of redundancy if I refactored out a class to interact with the proxy - something you can instantiate with an endpoint and a token and pass around orchestrate.py and the spawn pool.
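
For illustration, a rough sketch of such a proxy wrapper against the configurable-http-proxy REST API. The class name, method names, and route details here are my assumptions, not the code this PR adds.

import json

from tornado import gen
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

class ProxyClient(object):
    """Illustrative wrapper around the proxy's routes API."""

    def __init__(self, endpoint, auth_token):
        self.endpoint = endpoint          # e.g. "http://127.0.0.1:8001"
        self.auth_token = auth_token
        self.client = AsyncHTTPClient()

    def _request(self, path, method, body=None):
        return HTTPRequest(
            "{}/api/routes/{}".format(self.endpoint, path.lstrip('/')),
            method=method,
            headers={"Authorization": "token {}".format(self.auth_token)},
            body=body,
        )

    @gen.coroutine
    def add_route(self, path, target):
        # Register a container's base path with the proxy.
        yield self.client.fetch(self._request(path, "POST",
                                              json.dumps({"target": target})))

    @gen.coroutine
    def delete_route(self, path):
        # Remove a culled container's route.
        yield self.client.fetch(self._request(path, "DELETE"))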

@smashwilson
Contributor

smashwilson commented Oct 14, 2014

Something else I could explore is a decaying notebook expiration time. Under heavy load, we could start expiring containers that have been alive for 45 minutes, 30 minutes, and so on, down to some minimum, before we just give up.
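
Purely to illustrate the idea (the numbers and the shape of the decay are made up; nothing like this is implemented in the PR):

def decayed_cull_minutes(pool_capacity, containers_in_use,
                         base_minutes=60, floor_minutes=10):
    """Illustrative only: shrink the inactivity timeout as the host fills up."""
    load = float(containers_in_use) / max(pool_capacity, 1)
    scaled = int(base_minutes * (1.0 - load))
    return max(floor_minutes, scaled)

# e.g. at 25% load: 45 minutes; at 50%: 30 minutes; at 90%+: the 10-minute floor.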

@rgbkrk
Member

rgbkrk commented Oct 14, 2014

I'd like to let ad-hoc containers get created outside the bounds of the pool, but then that could easily overload the system. The real reason we want the pooling is speed, though.

What if we just remove one of the allocated containers to make room for the ad-hoc container? That user does have to wait for theirs to spin up. The only other way around that would be to somehow turn an allocated container's base_path into the ad-hoc container's requested base_path.

@rgbkrk
Member

rgbkrk commented Oct 14, 2014

I'd leave out the decaying notebook expiration time for now. That wouldn't work well for the use cases that resonate really well with people (tutorials, classes). It would work well for demos though. I guess post an issue as a feature request for now.

@rgbkrk
Member

rgbkrk commented Oct 14, 2014

Refactoring a class out for the proxy sounds good to me. I was originally going to do that, up until it became apparent how few calls we were going to make against the proxy.

What probably needs to happen, though, is for the pool to handle creating the new ad-hoc containers and to handle the empty-pool case, all centralized.

orchestrate.py
+ # Wait for the notebook server to come up.
+ yield self.wait_for_server(ip, port, prefix)
+
+ if path is None:

@rgbkrk

rgbkrk Oct 14, 2014

Member

This piece is redundant now, since path is not None.

orchestrate.py
+ if not url.startswith('/'):
+ url = '/' + url
+ app_log.debug("Redirecting [%s] -> [%s].", self.request.path, url)
+ self.redirect(url, permanent=False)

@rgbkrk

rgbkrk Oct 14, 2014

Member

This whole section seems like it should get refactored in a util method that gets handled by the pool so we're not repeating how we invoke create_notebook_server and the other options.

@smashwilson

smashwilson Oct 15, 2014

Contributor

> This whole section seems like it should get refactored in a util method that gets handled by the pool so we're not repeating how we invoke create_notebook_server and the other options.

Yeah, excellent 👍

In general, I think I still have a lot of cleanup to do in terms of functionality that I moved into the SpawnPool but neglected to remove from its original location. I left the original culling function around, too, for example.

orchestrate.py
+
+ url = "/{}/{}".format(prefix, redirect_uri)
+ app_log.debug("Redirecting [%s] -> [%s].", self.request.path, url)
+ self.redirect(url, permanent=False)

@rgbkrk

rgbkrk Oct 14, 2014

Member

I would bring these last two lines out of the if/else block, since they're the same for both paths.

spawnpool.py
+ self.docker = dockworker.DockerSpawner(docker_host, version, timeout, max_workers)
+
+ self.available = deque()
+ self.taken = set()

@rgbkrk

rgbkrk Oct 14, 2014

Member

This makes me wonder about restarts. If I set my pool size to 1000 and restart 3 times, do I now have 3000 containers?

@rgbkrk

rgbkrk Oct 14, 2014

Member

Idea: Check if a running notebook container is not on the proxy.

Problem with that idea is when the proxy restarts we don't know if the leftover containers were used or not.

@smashwilson

smashwilson Oct 15, 2014

Contributor

> This makes me wonder about restarts. If I set my pool size to 1000 and restart 3 times, do I now have 3000 containers?

Yeah. One potential workaround would be to scan the proxy setup at launch and use proxy route metadata to reconstruct the state of the spawner. Of course, that assumes that the proxy and Docker are in sync...

@smashwilson

smashwilson Oct 15, 2014

Contributor

> Idea: Check if a running notebook container is not on the proxy.
>
> Problem with that idea is when the proxy restarts we don't know if the leftover containers were used or not.

Really, we're now storing state in three places: Docker, the proxy, and the orchestration script. If any of the three fall out of sync, we're going to have problems.

I'm not sure what the best way forward is here. If I picked one as The Source of Truth, I'd go with Docker. Maybe I should look at docker-py's documentation a little more closely to see how difficult it would be to reconstruct everything from the running containers on launch.
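
A rough sketch of what reconstructing state from Docker might look like with the older docker-py Client API this project uses. The name filter ("tmp.") and the client setup are assumptions for illustration only.

import docker

def discover_containers(docker_host, name_prefix="tmp."):
    """Illustrative only: list running containers whose names match the
    spawner's naming scheme, so the pool could re-adopt them at startup."""
    client = docker.Client(base_url=docker_host)
    found = []
    for container in client.containers(all=False):   # running containers only
        names = container.get('Names') or []
        if any(name.lstrip('/').startswith(name_prefix) for name in names):
            found.append(container)
    return found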

spawnpool.py
+ except HTTPError as e:
+ app_log.error("Failed to delete route [%s]: %s", container, e)
+
+ if replace:

@rgbkrk

rgbkrk Oct 14, 2014

Member

I'm liking this replacement setup. 👍

@rgbkrk
Member

rgbkrk commented Oct 16, 2014

@smashwilson THIS IS TOO FAST!!!

@rgbkrk
Member

rgbkrk commented Oct 16, 2014

Oh my this is incredible Ash. Now to kick it up a notch.

@rgbkrk
Member

rgbkrk commented Oct 16, 2014

First error, launched 128 containers:

[I 141016 20:47:58 spawnpool:225] Adding container [PooledContainer(id=u'2f0a516d033aa8e6cfc425e1443935ff9fd18c7650ed191b6146ea1d390e0076', path='user-a6EUf97zWFi3')] to the pool.
Traceback (most recent call last):
  File "orchestrate.py", line 181, in <module>
    main()
  File "orchestrate.py", line 160, in main
    ioloop.run_sync(setup_pool)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 631, in run
    yielded = self.gen.throw(*sys.exc_info())
  File "orchestrate.py", line 158, in setup_pool
    yield pool.prelaunch()
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 628, in run
    value = future.result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 631, in run
    yielded = self.gen.throw(*sys.exc_info())
  File "/srv/tmpnb/spawnpool.py", line 69, in prelaunch
    yield [self._launch_container() for i in xrange(0, count)]
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 628, in run
    value = future.result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 464, in callback
    result_list = [i.result() for i in children]
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 631, in run
    yielded = self.gen.throw(*sys.exc_info())
  File "/srv/tmpnb/spawnpool.py", line 196, in _launch_container
    container_config=self.container_config)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 628, in run
    value = future.result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 631, in run
    yielded = self.gen.throw(*sys.exc_info())
  File "/srv/tmpnb/dockworker.py", line 103, in create_notebook_server
    yield self.docker_client.start(container_id, port_bindings=port_bindings)
  File "/usr/local/lib/python2.7/dist-packages/tornado/gen.py", line 628, in run
    value = future.result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/_base.py", line 400, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/_base.py", line 359, in __get_result
    reraise(self._exception, self._traceback)
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/_compat.py", line 107, in reraise
    exec('raise exc_type, exc_value, traceback', {}, locals_)
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/thread.py", line 61, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/docker/client.py", line 886, in start
    self._raise_for_status(res)
  File "/usr/local/lib/python2.7/dist-packages/docker/client.py", line 88, in _raise_for_status
    raise errors.APIError(e, response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("Cannot start container 9c155c45e01fbef502de9bdb4f971188a0fc98d07ebb8c1dacd516755a8c4e01: Error starting userland proxy: listen tcp 127.0.0.1:49265: bind: address already in use")

Note that I am running the latest copy of Docker.

@rgbkrk
Member

rgbkrk commented Oct 16, 2014

When that happened, tmpnb died overall.

@rgbkrk
Member

rgbkrk commented Oct 16, 2014

Need to catch that APIError on creation.

@rgbkrk
Member

rgbkrk commented Oct 17, 2014

I live-demoed this branch at the New York City Python meetup at Mongo HQ. IT WAS AMAZING. ~0 seconds to server, live in front of an audience. Ran it multiple times. 👍 Mega awesome.

@Carreau
Member

Carreau commented Oct 17, 2014

You guys are on fire; all the notifications I received when I woke up are from tmpnb :-) Super awesome.

@smashwilson
Contributor

smashwilson commented Oct 17, 2014

You already gave a demo on this? Awesome. Also: terrifying. 😁

I've updated the PR checklist with a few remaining bits of cleanup I'd like to do. Hopefully I can get this to a mergeable state today 🏁

@smashwilson
Contributor

smashwilson commented Oct 17, 2014

@rgbkrk I've pushed my work so far if you want to take a look. I think stale container culling is a bit broken, though.

@smashwilson
Contributor

smashwilson commented Oct 18, 2014

@rgbkrk Actually... I gave this another spin this morning, and now everything seems to be working fine. For extra fun, kill the tmpnb container and re-launch it with a different --pool_size to watch it self-heal ⚡️

+ '''Shut down a container and delete its proxy entry.
+
+ Destroy the container in an orderly fashion. If requested and capacity is remaining, create
+ a new one to take its place.'''

@rgbkrk

rgbkrk Oct 18, 2014

Member

I had a bunch of dead containers that needed to be removed (after a reboot). This didn't pick them up and I nuked them by hand.

@smashwilson

smashwilson Oct 18, 2014

Contributor

Hmm, that could be a bug in heartbeat, or it could be "working as intended."

When you restart the tmpnb container, the new process has no way to distinguish the old process's pooled containers from its active containers, so to be safe, I assume that all of them are potentially active. They'll eventually be reaped and replaced with fresh ones once the normal culling time has elapsed, but it can mean the pool is erroneously full for several heartbeats after a restart.

To correct this, we'd need to store data externally to track which containers have been handed out and which are just waiting in the pool. I thought this would be okay for now, though. In the meantime, you might want to drop the cull time to something like ten or fifteen minutes if you're restarting often.

@rgbkrk

rgbkrk Oct 18, 2014

Member

Yeah, it's ok if we explore this piece later. I wasn't expecting a fully healing pool as part of this PR. 😜

@rgbkrk
Member

rgbkrk commented Oct 18, 2014

This is excellent work @smashwilson, thank you so much. I am 👍 on merge. When ready, take your [wip] tag off.

dockworker.py
-AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
+
+ContainerConfig = namedtuple('ImageConfig', [

@Carreau

Carreau Oct 18, 2014

Member

Want to match the name of the named tuple with the variable?

@smashwilson

smashwilson Oct 18, 2014

Contributor

Haha, yes. I just renamed one and not the other. I blame insufficient ☕️

"--ip=0.0.0.0",
"--NotebookApp.base_url=/{}".format(base_path),
"--NotebookApp.tornado_settings=\"{}\"".format(tornado_settings)
]
- ipython_command = ipython_executable + " " + " ".join(ipython_args)
+ ipython_command = container_config.ipython_executable + " " + " ".join(ipython_args)

@Carreau

Carreau Oct 18, 2014

Member

You might want not to join, and just extend command below with these.

@smashwilson

smashwilson Oct 18, 2014

Contributor

Ultimately I'd want to see as much of this as possible end up externally configurable (because then tmpnb works for any Docker container that exposes a port, not just ipython images 😉). That's a task for another PR, though 🐎 🐎 🐎

@rgbkrk

rgbkrk Oct 18, 2014

Member

> Ultimately I'd want to see as much of this as possible end up externally configurable because then tmpnb works for any Docker container that exposes a port, not just ipython images

Well, they'd also need to set a base path; otherwise all resources are on the wrong path, including CSS, JavaScript, and images. That's the key piece that, some number of commits ago, was passed to the underlying container as the environment variable $RAND_BASE.

@smashwilson

smashwilson Oct 18, 2014

Contributor

Oh, I was thinking we could have a set of keyword parameters .format()'ed into the command, so you'd provide a template like "--NotebookApp.base_url=/{base_path}".
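
A tiny sketch of that templated-command idea; the template string and keyword names are illustrative, not anything in the PR.

# Hypothetical operator-supplied template.
command_template = (
    "ipython3 notebook --no-browser --port {port} --ip=0.0.0.0 "
    "--NotebookApp.base_url=/{base_path}"
)

# tmpnb would fill in the per-container values before starting the container.
command = command_template.format(port=8888, base_path="user-a6EUf97zWFi3")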

dockworker.py
+ raise gen.Return(matching)
+
+ @gen.coroutine
+ def _with_retries(self, max_tries, fn, *args, **kwargs):

@Carreau

Carreau Oct 18, 2014

Member

You could almost remove max_tries and default it to RETRIES :-)

@smashwilson

smashwilson Oct 18, 2014

Contributor

> You could almost remove max_tries and default it to RETRIES :-)

So I could! Nice.
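
For example, the defaulted version being suggested might look roughly like this (a sketch, not the merged code):

RETRIES = 3

# Sketch only: the retry budget defaults to RETRIES instead of being a required
# positional argument; callers can still override it with max_tries=...
def _with_retries(self, fn, *args, **kwargs):
    max_tries = kwargs.pop('max_tries', RETRIES)   # Python 2.7-friendly default
    # ...existing retry loop continues unchanged...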

- def cpu_shares(self):
- return self.settings['cpu_shares']
+ def pool(self):
+ return self.settings['pool']

@Carreau

Carreau Oct 18, 2014

Member

Are you sure 'pool' is always defined? Want to use .get('pool', default)? Same question below, but I'm not familiar enough with the code; it might always be defined.

@Carreau

Carreau Oct 18, 2014

Member

And BTW, is it pool_size?

@smashwilson

smashwilson Oct 18, 2014

Contributor

Yep - pool is part of the Tornado settings that are configured unconditionally here during process launch, so it should always be there. Good catch, though 😁

spawner=spawner,
+ pool=pool,

@Carreau

Carreau Oct 18, 2014

Member

Ah, no, it's pool. Ok.

@Carreau
Member

Carreau commented Oct 18, 2014

That's a lot of code I'm not comfortable with.
I'm disappointed, though: I saw a lot of yield, but no yield from :-P

I'll take another look later.

@smashwilson
Contributor

smashwilson commented Oct 18, 2014

> That's a lot of code I'm not comfortable with.
> I'm disappointed, though: I saw a lot of yield, but no yield from :-P
>
> I'll take another look later.

Understood, it's a big, sprawling PR 😉 Thanks for looking it over!

Also, isn't yield from Python 3.3+? As far as I know, we're still running orchestrate with 2.7. I'm also not entirely sure how it would play with tornado.gen. If it would make things more elegant, I'm all ears 😁

@Carreau
Member

Carreau commented Oct 18, 2014

> Also, isn't yield from Python 3.3+? As far as I know, we're still running orchestrate with 2.7. I'm also not entirely sure how it would play with tornado.gen. If it would make things more elegant, I'm all ears

I still need to start using yield from myself. I have a few side projects where I've tried it, but I think it will take me some time before I can write code = yield from brain.instance() :-)

@smashwilson smashwilson changed the title from [wip] Spawning pool to Spawning pool Oct 18, 2014

@smashwilson
Contributor

smashwilson commented Oct 18, 2014

@rgbkrk You can merge it whenever you like, especially considering you've already used it in a demo! I'd be happy to fix things here or in subsequent PRs as we find them.

@rgbkrk
Member

rgbkrk commented Oct 18, 2014

> I'd be happy to fix things here or in subsequent PRs as we find them.

😄

@rgbkrk
Member

rgbkrk commented Oct 18, 2014

:shipit:

rgbkrk added a commit that referenced this pull request Oct 18, 2014

@rgbkrk rgbkrk merged commit d21a34b into jupyter:master Oct 18, 2014

@smashwilson smashwilson deleted the smashwilson:spawnpool branch Oct 18, 2014

@rgbkrk
Member

rgbkrk commented Oct 19, 2014

Thanks again @Carreau for the review, and @smashwilson for this PR and for addressing the review comments!

@rgbkrk rgbkrk referenced this pull request Nov 6, 2014

Closed

Create a spawnpool #32
