
Spawning pool #69

Merged
merged 47 commits into jupyter:master from smashwilson:spawnpool on Oct 18, 2014

Conversation

@smashwilson (Contributor) commented Oct 14, 2014

Time to shave off that final 2-second delay. This PR implements #32 by pre-launching a configured (by which I mean hardcoded to "3") set of containers and handing them out to incoming requests. When a container is culled for inactivity, it's scrapped and a new one is launched and added to the pool in its place.
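A minimal sketch of that flow, with hypothetical names (SpawnPool, launch_container, shutdown) standing in for the actual implementation in this PR:

from collections import deque
from tornado import gen

class SpawnPool(object):
    """Sketch only: pre-launch containers and hand them to incoming requests."""

    def __init__(self, spawner):
        self.spawner = spawner
        self.available = deque()   # pre-launched containers, ready to serve
        self.taken = set()         # containers already handed to a user

    @gen.coroutine
    def prepare(self, count):
        # Launch `count` containers up front so requests skip the launch delay.
        for _ in range(count):
            container = yield self.spawner.launch_container()
            self.available.append(container)

    def acquire(self):
        # Hand out a ready container; raises IndexError if the pool is empty.
        container = self.available.popleft()
        self.taken.add(container)
        return container

    @gen.coroutine
    def cull(self, container):
        # When a container is reaped for inactivity, scrap it and launch a
        # replacement so the pool stays at its configured size.
        self.taken.discard(container)
        yield self.spawner.shutdown(container)
        replacement = yield self.spawner.launch_container()
        self.available.append(replacement)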

If you hit orchestrate.py with a path already set (using a stale link, for example, or an appear.in-style link like #67 implements), a personal container is created and linked for you, just as happens now.

/cc @rgbkrk so he can see just how terrible this is right now, and suggest embetterment techniques 😉

  • Remove any remaining now-dead code. Container launching should be centralized and managed by the SpawnPool.
  • Get culling working again.
  • Fix the race condition in the culling routine that causes the pool to grow without bound when the culling interval is shorter than the time it takes to cull a container (see the sketch after this list).
  • Display a friendlier message (meaning, not a stack trace) when the node is out of capacity. A non-200 response code might be useful for upstream load balancing, too. If it's possible, it would be neat to do things like fail over to the next node when 500s are received.
  • Since we're using the pool size as an absolute capacity, defaulting to zero is not optimal. make dev should probably default to something like 3.
  • Clean out existing containers and proxy entries on restart of orchestrate.py.
  • Catch and log Docker API errors.
  • Change the "I'm full" page to indicate that it'll also auto-refresh for you to try again.
  • Refactor _clean_orphaned_containers. Bring more of that functionality into cull as well, so we also clean up zombies and dead proxy entries periodically.
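For the race-condition item above, the guard could be as small as a re-entrancy flag around the periodic cull; cull_stale_containers and the other names below are illustrative, not this PR's code:

from tornado import gen, ioloop
from tornado.log import app_log

class CullGuard(object):
    """Sketch: keep a slow cull pass from overlapping with the next interval."""

    def __init__(self, pool, interval_seconds):
        self.pool = pool
        self.interval_seconds = interval_seconds
        self._culling = False

    @gen.coroutine
    def cull(self):
        if self._culling:
            # The previous pass is still replacing containers; skipping this
            # tick prevents double replacements from growing the pool.
            app_log.warning("Cull still in progress; skipping this interval.")
            return
        self._culling = True
        try:
            yield self.pool.cull_stale_containers()
        finally:
            self._culling = False

    def start(self):
        ioloop.PeriodicCallback(self.cull, self.interval_seconds * 1000).start()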
@smashwilson smashwilson force-pushed the smashwilson:spawnpool branch from 832a61e to 6e60415 Oct 14, 2014
@rgbkrk (Member) commented Oct 14, 2014

Awesome. That 2-second delay turns into a 15-second delay under high activity. That's the real problem.

blocking_docker_client = docker.Client(base_url=docker_host,
                                       version=version,
                                       timeout=timeout)

executor = ThreadPoolExecutor(max_workers=max_workers)

@rgbkrk (Member) commented Oct 14, 2014

Those Atom users.

@smashwilson (Contributor, Author) commented Oct 14, 2014

I need a more passive-aggressive emoji here. ➡️

@rgbkrk (Member) commented Oct 16, 2014

🙇

@smashwilson smashwilson force-pushed the smashwilson:spawnpool branch from 6e60415 to f2dea5e Oct 14, 2014
app_log.debug("redirecting %s -> %s", self.request.path, url)
self.redirect(url, permanent=False)
prefix = path.lstrip('/').split('/', 1)[0]
app_log.info("Initializing a new ad-hoc container for [%s].", prefix)

@rgbkrk (Member) commented Oct 14, 2014

I'm ok with saying this is "provisioning a new ad-hoc container".

yield self.wait_for_server(ip, port, prefix)

if path is None:
url = "/%s/notebooks/Welcome.ipynb" % prefix

@smashwilson (Contributor, Author) commented Oct 14, 2014

Oh right, I should use self.redirect_uri here, too.



# Pre-launch a set number of containers, ready to serve.
pool.prepare(3)

@rgbkrk (Member) commented Oct 14, 2014

I think if we have a --pool-size option it would work well. It could default to either None or 0 I guess.
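Something along these lines would do it, assuming tmpnb keeps its configuration in tornado.options; the flag name and default here are placeholders, not the final interface:

from tornado.options import define, options

# Hypothetical flag: number of containers to pre-launch and keep warm.
define('pool_size', default=3,
       help="Number of containers to pre-launch for the spawn pool.")

# ...later, at startup, instead of the hardcoded value:
# pool.prepare(options.pool_size)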

@smashwilson (Contributor, Author) commented Oct 14, 2014

Works for me. I was assuming we'd do something like that, but I thought you had a PR in flight to tighten up the configuration, so I just hardcoded for now.

@smashwilson (Contributor, Author) commented Oct 14, 2014

One question: what do we do when the pool is empty? Right now, it raises an exception.

There are two directions I can take this in:

  1. I can make the pool own all containers: count the ad-hoc containers that are launched, as well. Use the spawning pool as a kind of resource management strategy. When all of the container slots are filled on a host, show a template.
  2. I can make the pool only responsible for the containers within it, and trust the system to fail gracefully if we get a spike in concurrent users for some reason.

2 makes for cleaner conceptual boundaries, but I like 1 because it feels more controlled. Maybe I can even do some things to balance ad-hoc vs. prelaunched containers. What do you think?
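To make option 1 concrete, here's a rough sketch of one shared capacity budget; the class and the CapacityExceededError are hypothetical names, not code in this PR:

from collections import deque

class CapacityExceededError(Exception):
    """Every container slot on this host is in use."""

class PoolCapacity(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.available = deque()   # pre-launched, unclaimed containers
        self.taken = set()         # containers handed out (pooled or ad-hoc)

    def total(self):
        return len(self.available) + len(self.taken)

    def claim_prelaunched(self):
        # Serve a request from the pool, or signal that the host is full so
        # the handler can render a friendly "at capacity" page.
        if not self.available:
            raise CapacityExceededError()
        container = self.available.popleft()
        self.taken.add(container)
        return container

    def register_adhoc(self, container):
        # Ad-hoc containers count against the same ceiling as pooled ones.
        if self.total() >= self.capacity:
            raise CapacityExceededError()
        self.taken.add(container)

Option 2 would simply drop the register_adhoc check and let ad-hoc launches bypass the budget.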

@smashwilson (Contributor, Author) commented Oct 14, 2014

Also, I think it'd clean up quite a bit of redundancy if I refactored out a class to interact with the proxy - something you can instantiate with an endpoint and a token and pass around orchestrate.py and the spawn pool.
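Something like the sketch below, with the caveat that the route-API paths and method names are assumptions about how tmpnb talks to the proxy, not code from this PR:

import json
from tornado import gen
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

class ProxyClient(object):
    """Sketch: one object holding the proxy endpoint and auth token."""

    def __init__(self, endpoint, auth_token):
        self.endpoint = endpoint
        self.auth_token = auth_token
        self.http = AsyncHTTPClient()

    def _headers(self):
        return {"Authorization": "token {}".format(self.auth_token)}

    @gen.coroutine
    def add_route(self, path, target):
        # Register a container's URL prefix with the proxy.
        req = HTTPRequest("{}/api/routes/{}".format(self.endpoint, path.lstrip('/')),
                          method="POST",
                          headers=self._headers(),
                          body=json.dumps({"target": target}))
        yield self.http.fetch(req)

    @gen.coroutine
    def delete_route(self, path):
        # Drop a stale proxy entry when its container is culled.
        req = HTTPRequest("{}/api/routes/{}".format(self.endpoint, path.lstrip('/')),
                          method="DELETE",
                          headers=self._headers())
        yield self.http.fetch(req)

One instance could then be constructed in orchestrate.py and handed to both the handlers and the spawn pool.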

@smashwilson (Contributor, Author) commented Oct 14, 2014

Something else I could explore is a decaying notebook expiration time. Under heavy load, we could start expiring containers that have been alive for 45 minutes, 30 minutes, and so on, down to some minimum, before we just give up.
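Purely illustrative, but the decay could be as simple as scaling the allowed age by current load, with a floor:

def effective_cull_seconds(base_seconds, minimum_seconds, slots_used, slots_total):
    # Shrink the allowed container age linearly as the host fills up, but never
    # below the configured minimum.
    if slots_total <= 0:
        return base_seconds
    load = float(slots_used) / slots_total        # 0.0 (idle) .. 1.0 (full)
    return max(minimum_seconds, int(base_seconds * (1.0 - load)))

# With a 60-minute base and a 15-minute floor at 80% load:
# effective_cull_seconds(3600, 900, 8, 10) == 900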

@rgbkrk (Member) commented Oct 14, 2014

I'd like to let ad-hoc containers get created outside the bounds of the pool, but then that could easily overload the system. The real reason we want the pooling is for speed though.

What if we just remove one of the allocated containers to make room for the ad-hoc container? That user does have to wait for theirs to spin up. The only other way around that would be to somehow turn an allocated container's base_path into the ad-hoc container's requested base_path.

@rgbkrk (Member) commented Oct 14, 2014

I'd leave out the decaying notebook expiration time for now. That wouldn't work well for the use cases that resonate really well with people (tutorials, classes). It would work well for demos though. I guess post an issue as a feature request for now.

@rgbkrk (Member) commented Oct 14, 2014

Refactoring a class out for the proxy sounds good to me. I was originally going to do that up until it became apparent how few calls we were going to make against the proxy.

What probably needs to happen, though, is for the pool to handle creating new ad-hoc containers and to handle the case where the pool is empty, all in one centralized place.

# Wait for the notebook server to come up.
yield self.wait_for_server(ip, port, prefix)

if path is None:

@rgbkrk (Member) commented Oct 14, 2014

This piece is redundant now, since path is not None.

if not url.startswith('/'):
    url = '/' + url
app_log.debug("Redirecting [%s] -> [%s].", self.request.path, url)
self.redirect(url, permanent=False)

@rgbkrk (Member) commented Oct 14, 2014

This whole section seems like it should get refactored into a util method handled by the pool, so we're not repeating how we invoke create_notebook_server and the other options.

@smashwilson (Contributor, Author) commented Oct 15, 2014

This whole section seems like it should get refactored into a util method handled by the pool, so we're not repeating how we invoke create_notebook_server and the other options.

Yeah, excellent 馃憤

In general, I think I still have a lot of cleanup to do in terms of functionality that I moved into the SpawnPool, but neglected to remove from its original location. I left the original culling function around, too, for example.


url = "/{}/{}".format(prefix, redirect_uri)
app_log.debug("Redirecting [%s] -> [%s].", self.request.path, url)
self.redirect(url, permanent=False)

@rgbkrk (Member) commented Oct 14, 2014

I would bring these last two lines out of the if/else block, since they're the same for both paths.
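Roughly what that would look like, paraphrasing the surrounding diff rather than quoting the exact code (prefix, redirect_uri, and app_log are already in scope in the handler):

# Inside the handler, where prefix and redirect_uri are already defined:
if path is None:
    url = "/%s/notebooks/Welcome.ipynb" % prefix
else:
    url = "/{}/{}".format(prefix, redirect_uri)

if not url.startswith('/'):
    url = '/' + url

# Shared by both branches, so it lives outside the if/else.
app_log.debug("Redirecting [%s] -> [%s].", self.request.path, url)
self.redirect(url, permanent=False)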

self.docker = dockworker.DockerSpawner(docker_host, version, timeout, max_workers)

self.available = deque()
self.taken = set()

@rgbkrk (Member) commented Oct 14, 2014

This makes me wonder about restarts. If I set my pool size to 1000 and restart 3 times, do I now have 3000 containers?

@rgbkrk (Member) commented Oct 14, 2014

Idea: Check if a running notebook container is not on the proxy.

Problem with that idea is when the proxy restarts we don't know if the leftover containers were used or not.

@smashwilson (Contributor, Author) commented Oct 15, 2014

This makes me wonder about restarts. If I set my pool size to 1000 and restart 3 times, do I now have 3000 containers?

Yah. One potential workaround for that would be to scan the proxy setup at launch and use proxy route metadata to reconstruct the state of the spawner. Of course, that assumes that the proxy and Docker are in sync...

@smashwilson (Contributor, Author) commented Oct 15, 2014

Idea: Check if a running notebook container is not on the proxy.

Problem with that idea is when the proxy restarts we don't know if the leftover containers were used or not.

Really, we're now storing state in three places: Docker; the proxy; and the orchestration script. If any of the three fall out of sync, we're going to have problems.

I'm not sure what the best way forward would be, here. If I picked one as The Source of Truth I'd go with Docker. Maybe I should look at docker-py's documentation a little more closely to see how difficult it would be to reconstruct everything from your running containers on launch.
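If Docker were the source of truth, the reconstruction could start from docker-py's container listing; the name-prefix filter below is an assumption about how tmpnb names its containers, not this PR's code:

import docker

def existing_notebook_containers(docker_host, name_prefix="tmp."):
    # List every container Docker knows about, running or not, and keep the ones
    # that look like ours; orchestrate.py could then re-register or remove them.
    client = docker.Client(base_url=docker_host)
    matching = []
    for container in client.containers(all=True):
        names = container.get('Names') or []
        if any(name.lstrip('/').startswith(name_prefix) for name in names):
            matching.append(container)
    return matching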

except HTTPError as e:
    app_log.error("Failed to delete route [%s]: %s", container, e)

if replace:

@rgbkrk (Member) commented Oct 14, 2014

I'm liking this replacement setup. 👍

@smashwilson (Contributor, Author) commented Oct 17, 2014

@rgbkrk I've pushed my work so far if you want to take a look. I think stale container culling is a bit broken, though.

@smashwilson (Contributor, Author) commented Oct 18, 2014

@rgbkrk Actually... I gave this another spin this morning, and now everything seems to be working fine. For extra fun, kill the tmpnb container and re-launch it with a different --pool_size to watch it self-heal.

'''Shut down a container and delete its proxy entry.
Destroy the container in an orderly fashion. If requested and capacity is remaining, create
a new one to take its place.'''

@rgbkrk (Member) commented Oct 18, 2014

I had a bunch of dead containers that needed to be removed (after a reboot). This didn't pick them up and I nuked them by hand.

@smashwilson (Contributor, Author) commented Oct 18, 2014

Hmm, that could be a bug in heartbeat, or it could be "working as intended."

When you restart the tmpnb container, the new process has no way to distinguish the old process's pooled containers from its active containers, so to be safe, I assume that all of them are potentially active. They'll eventually be reaped and replaced with fresh ones once the normal culling time has elapsed. But it does mean the pool can appear erroneously full until several heartbeats have elapsed after a restart.

To correct this, we'd need to store data somewhere externally to track which containers have been handed out and which are just waiting in the pool. I thought this would be okay for now, though. In the meantime, you might want to drop the cull time to something like ten or fifteen minutes if you're restarting often.

@rgbkrk (Member) commented Oct 18, 2014

Yeah, it's ok if we explore this piece later. I wasn't expecting a fully healing pool as part of this PR. 😜

@rgbkrk (Member) commented Oct 18, 2014

This is excellent work @smashwilson, thank you so much. I am 👍 on merge. When ready, take your [wip] tag off.


AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")

ContainerConfig = namedtuple('ImageConfig', [

@Carreau (Contributor) commented Oct 18, 2014

Want to match the name of the named tuple with the variable?

@smashwilson (Contributor, Author) commented Oct 18, 2014

Haha, yes. I just renamed one and not the other. I blame insufficient

"--ip=0.0.0.0",
"--NotebookApp.base_url=/{}".format(base_path),
"--NotebookApp.tornado_settings=\"{}\"".format(tornado_settings)
]

ipython_command = ipython_executable + " " + " ".join(ipython_args)
ipython_command = container_config.ipython_executable + " " + " ".join(ipython_args)

@Carreau (Contributor) commented Oct 18, 2014

You might not want to join these; instead, just extend command below with them.

@smashwilson (Contributor, Author) commented Oct 18, 2014

Ultimately I'd want to see as much of this as possible end up externally configurable (because then tmpnb works for any Docker container that exposes a port, not just ipython images 😉). That's a task for another PR, though 🐎 🐎 🐎

@rgbkrk (Member) commented Oct 18, 2014

Ultimately I'd want to see as much of this as possible end up externally configurable because then tmpnb works for any Docker container that exposes a port, not just ipython images

Well, they'd also need to set a base path; otherwise all resources are on the wrong path, including CSS, JavaScript, and images. That's the key piece that, some number of commits ago, was passed to the underlying container as an environment variable, $RAND_BASE.

@smashwilson (Contributor, Author) commented Oct 18, 2014

Oh, I was thinking we could have a set of keyword parameters .format()'ed into the command, so you'd provide a template like "--NotebookApp.base_url=/{base_path}".
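A sketch of that idea, with made-up example values; {base_path} and {tornado_settings} come from the diff above, while the command string and {port} are assumed for illustration:

# Operator-supplied template; tmpnb fills in the values it controls.
command_template = (
    "ipython3 notebook --no-browser --port {port} --ip=0.0.0.0 "
    "--NotebookApp.base_url=/{base_path} "
    "--NotebookApp.tornado_settings=\"{tornado_settings}\""
)

command = command_template.format(
    port=8888,
    base_path="user-abc123",
    tornado_settings="{'template_path': '/srv/templates'}",
)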

raise gen.Return(matching)

@gen.coroutine
def _with_retries(self, max_tries, fn, *args, **kwargs):

@Carreau (Contributor) commented Oct 18, 2014

You could almost remove max_tries and default it to RETRIES :-)

@smashwilson (Contributor, Author) commented Oct 18, 2014

You could almost remove max_tries and default it to RETRIES :-)

So I could! Nice.
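For reference, a sketch of that simplification with a module-level constant; the retry loop itself is illustrative, not the PR's implementation:

from tornado import gen
from tornado.log import app_log

RETRIES = 4  # hypothetical module-level default

@gen.coroutine
def with_retries(fn, *args, **kwargs):
    # Call fn, retrying transient failures; callers no longer pass max_tries.
    for attempt in range(1, RETRIES + 1):
        try:
            result = yield fn(*args, **kwargs)
        except Exception as e:
            if attempt == RETRIES:
                raise
            app_log.warning("Attempt %i of %i failed: %s", attempt, RETRIES, e)
        else:
            raise gen.Return(result)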

def cpu_shares(self):
    return self.settings['cpu_shares']

def pool(self):
    return self.settings['pool']

@Carreau (Contributor) commented Oct 18, 2014

Are you sure 'pool' is always defined? Want to use .get('pool', default)? Same below, but I'm not familiar enough with the code; it might always be defined.

@Carreau (Contributor) commented Oct 18, 2014

And BTW, is it pool_size?

@smashwilson (Contributor, Author) commented Oct 18, 2014

Yep - pool is part of the Tornado settings that are configured unconditionally here during process launch, so it should be there. Good catch though 😁

spawner=spawner,
pool=pool,

@Carreau (Contributor) commented Oct 18, 2014

Ah, no, it's pool. Ok.

@Carreau (Contributor) commented Oct 18, 2014

That's a lot of code I'm not comfortable with.
I'm disappointed though, I saw a lot of yield, but no yield from :-P.

I'll re-take a look later.

@smashwilson (Contributor, Author) commented Oct 18, 2014

That's a lot of code I'm not comfortable with.
I'm disappointed though, I saw a lot of yield, but no yield from :-P.

I'll re-take a look later.

Understood, it's a big sprawling PR 😉 Thanks for looking it over!

Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears 😁

@Carreau (Contributor) commented Oct 18, 2014

Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears

I still need to use yield from myself; I have a few side projects where I tried, but I think it will take me some time to be able to do code = yield from brain.instance() :-)

@smashwilson smashwilson changed the title from [wip] Spawning pool to Spawning pool Oct 18, 2014
@smashwilson (Contributor, Author) commented Oct 18, 2014

@rgbkrk You can merge it whenever you like, especially considering you've already used it in a demo! I'd be happy to fix things here or in subsequent PRs as we find them.

@rgbkrk (Member) commented Oct 18, 2014

I'd be happy to fix things here or in subsequent PRs as we find them.

😄

@rgbkrk (Member) commented Oct 18, 2014

:shipit:

rgbkrk added a commit that referenced this pull request Oct 18, 2014
@rgbkrk rgbkrk merged commit d21a34b into jupyter:master Oct 18, 2014
@smashwilson smashwilson deleted the smashwilson:spawnpool branch Oct 18, 2014
@rgbkrk (Member) commented Oct 19, 2014

Thanks again to @Carreau for the review, and to @smashwilson for this PR and for addressing the review comments!
