
Spawning pool #69

Merged
rgbkrk merged 47 commits into jupyter:master from the spawnpool branch on Oct 18, 2014

Conversation

@smashwilson (Contributor) commented Oct 14, 2014

Time to shave off that final 2-second delay. This PR implements #32 by pre-launching a configured (by which I mean hardcoded to "3") set of containers and handing them out to incoming requests. When a container is culled for inactivity, it's scrapped and a new one is launched and added to the pool in its place.

If you hit orchestrate.py with a path already set (using a stale link, for example, or an appear.in-style link like the one #67 implements), a personal container is created and linked for you, just as it happens now.

/cc @rgbkrk so he can see just how terrible this is right now, and suggest embetterment techniques 😉

  • Remove any remaining now-dead code. Container launching should be centralized and managed by the SpawnPool.
  • Get culling working again.
  • Fix the race condition in the culling routine that causes the pool to grow without bound if the culling interval is shorter than the time that it takes to cull a container.
  • Display a friendlier message (meaning, not a stack trace) when the node is out of capacity. A non-200 response code might be useful for upstream load balancing, too. If it's possible, it would be neat to do things like fail over to the next node when 500s are received.
  • Since we're using the pool size as an absolute capacity, defaulting to zero is not optimal. make dev should probably default to something like 3.
  • Clean out existing containers and proxy entries on restart of orchestrate.py.
  • Catch and log Docker API errors.
  • Change the "I'm full" page to indicate that it'll also auto-refresh for you to try again.
  • Refactor _clean_orphaned_containers. Bring more of that functionality into cull as well, so we also clean up zombies and dead proxy entries periodically.
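To make the lifecycle described above concrete, here's a minimal sketch of the prelaunch / acquire / cull-and-replace cycle. All names here (SpawnPool, acquire, cull, EmptyPoolError, the spawner object) are illustrative assumptions, not the actual code in this PR.

# Illustrative sketch only; class, method, and attribute names are assumptions.
from tornado import gen

class EmptyPoolError(Exception):
    '''Raised when a request arrives and no prelaunched container is available.'''

class SpawnPool(object):
    def __init__(self, spawner, capacity=3):
        self.spawner = spawner    # knows how to launch and shut down containers
        self.capacity = capacity  # hardcoded to 3 for now, per the description
        self.available = []       # prelaunched containers waiting for a user

    @gen.coroutine
    def prelaunch(self):
        '''Fill the pool up to capacity before any requests arrive.'''
        while len(self.available) < self.capacity:
            container = yield self.spawner.launch()
            self.available.append(container)

    def acquire(self):
        '''Hand a prelaunched container to an incoming request.'''
        if not self.available:
            raise EmptyPoolError()
        return self.available.pop()

    @gen.coroutine
    def cull(self, container):
        '''Scrap an idle container and launch a replacement in its place.'''
        yield self.spawner.shutdown(container)
        replacement = yield self.spawner.launch()
        self.available.append(replacement)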

@rgbkrk (Member) commented Oct 14, 2014

Awesome. That 2-second delay turns into a 15-second delay under high activity. That's the real problem.

blocking_docker_client = docker.Client(base_url=docker_host,
                                       version=version,
                                       timeout=timeout)

executor = ThreadPoolExecutor(max_workers=max_workers)

@rgbkrk (Member), Oct 14, 2014

Those Atom users.

@smashwilson (Contributor, Author), Oct 14, 2014

I need a more passive-aggressive emoji here. ➡️

@rgbkrk (Member), Oct 16, 2014

🙏
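For context on the hunk above: a blocking docker-py client can be driven from Tornado by pushing its calls onto a thread pool. A rough sketch of that pattern (assuming docker-py's old Client API and concurrent.futures; not the PR's exact code):

# Sketch: keep the IOLoop unblocked by running docker-py calls on worker threads.
import docker
from concurrent.futures import ThreadPoolExecutor
from tornado import gen

blocking_docker_client = docker.Client(base_url='unix://var/run/docker.sock',
                                       version='1.12',
                                       timeout=20)
executor = ThreadPoolExecutor(max_workers=64)

@gen.coroutine
def list_containers():
    # The blocking call runs on a worker thread; the coroutine resumes when
    # the returned Future resolves.
    containers = yield executor.submit(blocking_docker_client.containers, all=True)
    raise gen.Return(containers)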

app_log.debug("redirecting %s -> %s", self.request.path, url)
self.redirect(url, permanent=False)
prefix = path.lstrip('/').split('/', 1)[0]
app_log.info("Initializing a new ad-hoc container for [%s].", prefix)

@rgbkrk (Member), Oct 14, 2014

I'm ok with saying this is "provisioning a new ad-hoc container".

@smashwilson (Contributor, Author) commented Oct 14, 2014

One question: what do we do when the pool is empty? Right now, it raises an exception.

There are two directions I can take this in:

  1. I can make the pool own all containers: count the ad-hoc containers that are launched, as well. Use the spawning pool as a kind of resource management strategy. When all of the container slots are filled on a host, show a template.
  2. I can make the pool only responsible for the containers within it, and trust the system to fail gracefully if we get a spike in concurrent users for some reason.

Option 2 makes for cleaner conceptual boundaries, but I like option 1 because it feels more controlled. Maybe I can even do some things to balance ad-hoc vs. prelaunched containers. What do you think?
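To illustrate the difference, the accounting in option 1 might look roughly like this (hypothetical names; a sketch of the idea, not a proposed API):

# Option 1 sketch: pooled and ad-hoc containers count against one capacity limit.
class CapacityError(Exception):
    pass

class PoolAccounting(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.pooled = set()  # prelaunched, waiting for a user
        self.adhoc = set()   # launched on demand for a specific path

    def total(self):
        return len(self.pooled) + len(self.adhoc)

    def claim_slot(self, container_id, adhoc=False):
        if self.total() >= self.capacity:
            # Surface this as a friendly "node is full" template, not a stack trace.
            raise CapacityError("all %d container slots are in use" % self.capacity)
        (self.adhoc if adhoc else self.pooled).add(container_id)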

@smashwilson (Contributor, Author) commented Oct 14, 2014

Also, I think it'd clean up quite a bit of redundancy if I refactored out a class to interact with the proxy: something you can instantiate with an endpoint and a token and pass around to orchestrate.py and the spawn pool.
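Something along these lines, perhaps: a sketch of the shape such a class could take, built on Tornado's AsyncHTTPClient and the proxy's /api/routes endpoint (the names and exact calls here are assumptions, not code from this PR).

# Sketch: one object owns the proxy endpoint and auth token, shared by
# orchestrate.py and the spawn pool. Hypothetical names.
import json
from tornado import gen
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

class ProxyClient(object):
    def __init__(self, endpoint, auth_token):
        self.endpoint = endpoint          # e.g. "http://127.0.0.1:8001"
        self.auth_token = auth_token
        self.client = AsyncHTTPClient()

    def _request(self, path, method, body=None):
        return HTTPRequest(self.endpoint + "/api/routes/" + path.lstrip("/"),
                           method=method,
                           headers={"Authorization": "token " + self.auth_token},
                           body=body)

    @gen.coroutine
    def add_route(self, path, container_url):
        yield self.client.fetch(self._request(path, "POST",
                                              json.dumps({"target": container_url})))

    @gen.coroutine
    def delete_route(self, path):
        yield self.client.fetch(self._request(path, "DELETE"))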

@smashwilson (Contributor, Author) commented Oct 14, 2014

Something else I could explore is a decaying notebook expiration time. Under heavy load, we could start expiring containers that have been alive for 45 minutes, 30 minutes, and so on, down to some minimum, before we just give up.
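As a rough illustration of that idea (purely hypothetical numbers and names; see the next comment, where this gets deferred):

# Hypothetical sketch: the fuller the node, the shorter the allowed lifetime,
# down to a floor.
def expiration_seconds(containers_in_use, capacity,
                       max_age=60 * 60, min_age=10 * 60):
    load = float(containers_in_use) / capacity  # 0.0 (idle) .. 1.0 (full)
    return max(min_age, int(max_age * (1.0 - load)))

# e.g. expiration_seconds(0, 8) -> 3600, expiration_seconds(8, 8) -> 600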

@rgbkrk (Member) commented Oct 14, 2014

I'd like to let ad-hoc containers get created outside the bounds of the pool, but then that could easily overload the system. The real reason we want the pooling is for speed though.

What if we just remove one of the allocated containers to make room for the ad-hoc container? That user does have to wait for theirs to spin up. The only other way around that would be to somehow turn an allocated container's base_path into the ad-hoc container's requested base_path.

@rgbkrk (Member) commented Oct 14, 2014

I'd leave out the decaying notebook expiration time for now. That wouldn't work well for the use cases that resonate really well with people (tutorials, classes). It would work well for demos though. I guess post an issue as a feature request for now.

@rgbkrk (Member) commented Oct 14, 2014

Refactoring a class out for the proxy sounds good to me. I was originally going to do that up until it became apparent how few calls we were going to make against the proxy.

What probably needs to happen, though, is for the pool to handle creating the new ad-hoc containers and the empty-pool case as well, all centralized.

# Wait for the notebook server to come up.
yield self.wait_for_server(ip, port, prefix)

if path is None:

@rgbkrk (Member), Oct 14, 2014

This piece is redundant now, since path is not None.

@smashwilson (Contributor, Author) commented Oct 17, 2014

@rgbkrk I've pushed my work so far if you want to take a look. I think stale container culling is a bit broken, though.

@smashwilson (Contributor, Author) commented Oct 18, 2014

@rgbkrk Actually... I gave this another spin this morning, and now everything seems to be working fine. For extra fun, kill the tmpnb container and re-launch it with a different --pool_size to watch it self-heal.

'''Shut down a container and delete its proxy entry.

Destroy the container in an orderly fashion. If requested and capacity is remaining, create
a new one to take its place.'''

@rgbkrk (Member), Oct 18, 2014

I had a bunch of dead containers that needed to be removed (after a reboot). This didn't pick them up and I nuked them by hand.

@smashwilson (Contributor, Author), Oct 18, 2014

Hmm, that could be a bug in heartbeat, or it could be "working as intended."

When you restart the tmpnb container, the new process has no way to distinguish the old process's pooled containers from its active ones, so to be safe, I assume that all of them are potentially active. They'll eventually be reaped and replaced with fresh ones once the normal culling time has elapsed. But it does mean the pool can appear erroneously full for several heartbeats after a restart.

To correct this, we'd need to store data somewhere externally to track which containers have been handed out and which are just waiting in the pool. I thought this would be okay for now, though. In the meantime, you might want to drop the cull time to something like ten or fifteen minutes if you're restarting often.

@rgbkrk (Member), Oct 18, 2014

Yeah, it's ok if we explore this piece later. I wasn't expecting a fully healing pool as part of this PR. 😜

@rgbkrk (Member) commented Oct 18, 2014

This is excellent work @smashwilson, thank you so much. I am 👍 on merge. When ready, take your [wip] tag off.


AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")

ContainerConfig = namedtuple('ImageConfig', [

@Carreau (Member), Oct 18, 2014

Want to match the name of the namedtuple with the variable?

@smashwilson (Contributor, Author), Oct 18, 2014

Haha, yes. I just renamed one and not the other. I blame insufficient

raise gen.Return(matching)

@gen.coroutine
def _with_retries(self, max_tries, fn, *args, **kwargs):

@Carreau (Member), Oct 18, 2014

You could almost remove max_tries and default it to RETRIES :-)

@smashwilson (Contributor, Author), Oct 18, 2014

You could almost remove max_tries and default it to RETRIES :-)

So I could! Nice.
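For reference, the shape of that suggestion might be roughly as follows (a sketch assuming a module-level RETRIES constant; written as a free function rather than a method, and not the exact code in the PR):

# Sketch: max_tries becomes optional and defaults to RETRIES.
from tornado import gen
from tornado.log import app_log

RETRIES = 3  # assumed module-level constant

@gen.coroutine
def with_retries(fn, *args, **kwargs):
    '''Call fn(*args, **kwargs), retrying on failure up to RETRIES times.'''
    max_tries = kwargs.pop('max_tries', RETRIES)
    last_error = None
    for attempt in range(max_tries):
        try:
            result = yield fn(*args, **kwargs)
        except Exception as e:
            last_error = e
            app_log.warning("Attempt %d/%d failed: %s", attempt + 1, max_tries, e)
            continue
        raise gen.Return(result)
    raise last_error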

@Carreau (Member) commented Oct 18, 2014

That's a lot of code I'm not comfortable with.
I'm disappointed, though: I saw a lot of yield, but no yield from :-P.

I'll take another look later.

@smashwilson (Contributor, Author) commented Oct 18, 2014

That's a lot of code I'm not comfortable with.
I'm disappointed, though: I saw a lot of yield, but no yield from :-P.

I'll take another look later.

Understood, it's a big sprawling PR 😉 Thanks for looking it over!

Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears 😁
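(For reference, the Python 2.7-compatible pattern with tornado.gen is plain yield plus gen.Return, as in this generic sketch; pool.list_containers is a made-up example, and on Python 3.3+ the gen.Return could become a plain return.)

# Python 2.7-compatible Tornado coroutine: `yield` on Futures plus gen.Return,
# since `yield from` and returning values from generators need Python 3.3+.
from tornado import gen

@gen.coroutine
def count_containers(pool):
    containers = yield pool.list_containers()  # yields a Future
    raise gen.Return(len(containers))          # Py3.3+: return len(containers)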

@Carreau (Member) commented Oct 18, 2014

Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears

I still need to use yield from myself. I have a few side projects where I've tried, but I think it will take me some time before I'm able to write code = yield from brain.instance() :-)

@smashwilson changed the title from "[wip] Spawning pool" to "Spawning pool" on Oct 18, 2014
@smashwilson (Contributor, Author) commented Oct 18, 2014

@rgbkrk You can merge it whenever you like, especially considering you've already used it in a demo! I'd be happy to fix things here or in subsequent PRs as we find them.

@rgbkrk (Member) commented Oct 18, 2014

I'd be happy to fix things here or in subsequent PRs as we find them.

😄

@rgbkrk (Member) commented Oct 18, 2014

:shipit:

rgbkrk added a commit that referenced this issue Oct 18, 2014
@rgbkrk merged commit d21a34b into jupyter:master on Oct 18, 2014
@smashwilson deleted the spawnpool branch on Oct 18, 2014
@rgbkrk (Member) commented Oct 19, 2014

Thanks again to @Carreau for the review, and to @smashwilson for this PR and for addressing the review comments!
