This repository was archived by the owner on Jan 24, 2018. It is now read-only.

Spawning pool #69

Merged
rgbkrk merged 47 commits into jupyter:master from smashwilson:spawnpool on Oct 18, 2014

Conversation

@smashwilson

Contributor

Time to shave off that final 2-second delay. This PR implements #32 by pre-launching a configured (by which I mean hardcoded to "3") set of containers and handing them out to incoming requests. When a container is culled for inactivity, it's scrapped and a new one is launched and added to the pool in its place.

If you hit orchestrate.py with a path already set (using a stale link, for example, or an appear.in-style link like #67 implements), you have a personal container created and linked as it happens now.
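As a rough illustration of the hand-out-and-replace cycle described above, here is a minimal synchronous sketch. The `SimplePool` class, its `launch` callback, and the method names are hypothetical; they are not taken from this PR's `spawnpool.py`, which is asynchronous and talks to Docker.

```python
from collections import deque
import itertools


class EmptyPoolError(Exception):
    """Raised when no pre-launched container is available."""


class SimplePool:
    """Hypothetical, synchronous stand-in for a pool of pre-launched containers."""

    def __init__(self, launch, size=3):
        self._launch = launch  # callable that starts one container and returns its id
        self._available = deque(launch() for _ in range(size))
        self.in_use = set()

    def acquire(self):
        """Hand a pre-launched container to an incoming request."""
        if not self._available:
            raise EmptyPoolError("no pre-launched containers left")
        container = self._available.popleft()
        self.in_use.add(container)
        return container

    def cull(self, container):
        """Scrap an inactive container and launch a replacement into the pool."""
        self.in_use.discard(container)
        self._available.append(self._launch())

    def __len__(self):
        return len(self._available)


# Usage with a fake launcher that just hands out sequential ids.
counter = itertools.count(1)
pool = SimplePool(lambda: "container-%d" % next(counter), size=3)
first = pool.acquire()  # served immediately, no launch delay
pool.cull(first)        # replaced by a freshly launched container
```

The point of the design is that the launch cost is paid before a request arrives, so `acquire` only has to pop an entry.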

/cc @rgbkrk so he can see just how terrible this is right now, and suggest embetterment techniques 😉

  • Remove any remaining now-dead code. Container launching should be centralized and managed by the SpawnPool.
  • Get culling working again.
  • Fix the race condition in the culling routine that causes the pool to grow without bound if the culling interval is shorter than the time that it takes to cull a container.
  • Display a friendlier message (meaning, not a stack trace) when the node is out of capacity. A non-200 response code might be useful for upstream load balancing, too. If it's possible, it would be neat to do things like fail over to the next node when 500s are received.
  • Since we're using the pool size as an absolute capacity, defaulting to zero is not optimal. make dev should probably default to something like 3, maybe.
  • Clean out existing containers and proxy entries on restart of orchestrate.py.
  • Catch and log Docker API errors.
  • Change the "I'm full" page to indicate that it'll also auto-refresh for you to try again.
  • Refactor _clean_orphaned_containers. Bring more of that functionality into cull as well, so we also clean up zombies and dead proxy entries periodically.
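The race condition in the third item can be pictured with a small sketch: if a culling pass takes longer than the culling interval, a second pass starts before the first has finished, and both launch replacements. One possible guard, shown here with a hypothetical `CullGuard` wrapper that is not part of this PR, is to skip a pass while the previous one is still running.

```python
class CullGuard:
    """Hypothetical wrapper that refuses to start a cull while one is running."""

    def __init__(self, cull):
        self._cull = cull
        self._running = False
        self.skipped = 0

    def __call__(self):
        if self._running:
            # A previous pass has not finished yet; do not start a second one.
            self.skipped += 1
            return False
        self._running = True
        try:
            self._cull()
        finally:
            self._running = False
        return True


runs = []


def slow_cull():
    runs.append("cull")
    # Simulate the timer firing again while the first pass is still in progress.
    if len(runs) == 1:
        guarded()


guarded = CullGuard(slow_cull)
guarded()
```

In a single-threaded event loop a plain flag like this is usually enough, because passes can only interleave at yield points; code that uses threads would need a lock instead.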

@rgbkrk

rgbkrk commented Oct 14, 2014

Member

Awesome. That 2-second delay turns into a 15 second delay with high activity. That's the real problem.

Comment thread dockworker.py

Member

Those Atom users.

Contributor Author

I need a more passive-aggressive emoji here. ➡️

Member

🙇

Comment thread orchestrate.py Outdated

Member

I'm ok with saying this is "provisioning a new ad-hoc container".

@smashwilson

Contributor Author

One question: what do we do when the pool is empty? Right now, it raises an exception.

There are two directions I can take this in:

  1. I can make the pool own all containers: count the ad-hoc containers that are launched, as well. Use the spawning pool as a kind of resource management strategy. When all of the container slots are filled on a host, show a template.
  2. I can make the pool only responsible for the containers within it, and trust the system to fail gracefully if we get a spike in concurrent users for some reason.

2 makes for cleaner conceptual boundaries, but I like 1 because it feels more controlled. Maybe I can even do some things to balance ad-hoc vs. prelaunched containers. What do you think?
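To make option 1 concrete, here is a minimal sketch of the bookkeeping it implies: pre-launched and ad-hoc containers both count against one fixed capacity, and a request that would exceed it is refused. `CapacityTracker` and `CapacityError` are hypothetical names used only for this illustration.

```python
class CapacityError(Exception):
    """Raised when every container slot on the host is taken."""


class CapacityTracker:
    """Hypothetical bookkeeping for option 1: every container counts against capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.prelaunched = 0
        self.adhoc = 0

    @property
    def free_slots(self):
        return self.capacity - self.prelaunched - self.adhoc

    def _claim(self):
        if self.free_slots <= 0:
            raise CapacityError("all %d container slots are in use" % self.capacity)

    def add_prelaunched(self):
        self._claim()
        self.prelaunched += 1

    def add_adhoc(self):
        self._claim()
        self.adhoc += 1


tracker = CapacityTracker(capacity=3)
tracker.add_prelaunched()
tracker.add_prelaunched()
tracker.add_adhoc()
try:
    tracker.add_adhoc()
    full = False
except CapacityError:
    full = True  # this is where a "node is full" page would be rendered
```

Under option 2 the `adhoc` counter would simply not exist, and nothing would stop ad-hoc launches from exceeding what the host can carry.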

@smashwilson

Contributor Author

Also, I think it'd clean up quite a bit of redundancy if I refactored out a class to interact with the proxy - something you can instantiate with an endpoint and a token and pass around orchestrate.py and the spawn pool.
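A sketch of what such a class might look like follows. It assumes the proxy exposes a REST API in the style of configurable-http-proxy (routes under `/api/routes/<path>`, an `Authorization: token ...` header); the thread itself does not spell out the API, so treat the URL layout, the header format, and the `ProxyClient` name as assumptions. It is written against Python 3's standard library and only builds requests rather than sending them.

```python
import json
from urllib.request import Request


class ProxyClient:
    """Hypothetical helper that builds requests against the proxy's REST API."""

    def __init__(self, endpoint, token):
        self.endpoint = endpoint.rstrip("/")
        self.token = token

    def _request(self, method, path, body=None):
        data = json.dumps(body).encode("utf-8") if body is not None else None
        return Request(
            "%s/api/routes/%s" % (self.endpoint, path.lstrip("/")),
            data=data,
            method=method,
            headers={"Authorization": "token %s" % self.token},
        )

    def add_route(self, path, target):
        """Build the request that maps `path` to a container at `target`."""
        return self._request("POST", path, {"target": target})

    def remove_route(self, path):
        """Build the request that drops the route for `path`."""
        return self._request("DELETE", path)


proxy = ProxyClient("http://127.0.0.1:8001", "example-token")
req = proxy.add_route("/user/abc123", "http://127.0.0.1:9000")
# The request is only built here; urllib.request.urlopen(req) would send it.
```

Keeping the endpoint and token inside one object means `orchestrate.py` and the spawn pool can share a single instance instead of each assembling headers and URLs.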

@smashwilson

Contributor Author

Something else I could explore is a decaying notebook expiration time. Under heavy load, we could start expiring containers that have been alive for 45 minutes, 30 minutes, and so on, down to some minimum, before we just give up.
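The decaying expiration idea could be sketched roughly as below. The function name, the 60-minute starting point, the 15-minute step, and the 15-minute floor are all assumptions chosen to match the "45 minutes, 30 minutes, and so on" progression; `pressure` stands for however many times in a row the node was found to be full.

```python
def decayed_timeout(pressure, base=60, step=15, minimum=15):
    """Return the cull timeout in minutes for a given load pressure.

    Each unit of pressure shortens the timeout by `step` minutes,
    but never below `minimum`.
    """
    return max(minimum, base - step * pressure)


timeouts = [decayed_timeout(p) for p in range(5)]
```

Once the timeout has reached the floor and the node is still full, the only option left is to turn new users away.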

@rgbkrk

rgbkrk commented Oct 14, 2014

Member

I'd like to let ad-hoc containers get created outside the bounds of the pool, but then that could easily overload the system. The real reason we want the pooling is for speed though.

What if we just remove one of the allocated containers to make room for the ad-hoc container? That user does have to wait for theirs to spin up. Only other way around that would be to somehow turn an allocated container's base_path into the adhoc's requested base_path.

@rgbkrk

rgbkrk commented Oct 14, 2014

Member

I'd leave out the decaying notebook expiration time for now. That wouldn't work well for the use cases that resonate really well with people (tutorials, classes). It would work well for demos though. I guess post an issue as a feature request for now.

@rgbkrk

rgbkrk commented Oct 14, 2014

Member

Refactoring a class out for the proxy sounds good to me. I was originally going to do that up until it became apparent how few calls we were going to make against the proxy.

What probably needs to happen, though, is for the pool to handle both the creation of new ad-hoc containers and the case where the pool is empty, all in one centralized place.

Comment thread orchestrate.py Outdated

Member

This piece is redundant now, since path is not None.

@smashwilson

Contributor Author

@rgbkrk I've pushed my work so far if you want to take a look. I think stale container culling is a bit broken, though.

@smashwilson

Contributor Author

@rgbkrk Actually... I gave this another spin this morning, and now everything seems to be working fine. For extra fun, kill the tmpnb container and re-launch it with a different --pool_size to watch it self-heal ⚡

Comment thread spawnpool.py

Member

I had a bunch of dead containers that needed to be removed (after a reboot). This didn't pick them up and I nuked them by hand.

Contributor Author

Hmm, that could be a bug in heartbeat, or it could be "working as intended."

When you restart the tmpnb container, the new process has no way to distinguish the old process' pooled containers from its active containers, so to be safe, I assume that all of them are potentially active. They'll eventually be reaped and replaced with fresh ones once the normal culling time has elapsed. But, it can mean that the pool can be erroneously full until several heartbeats have elapsed on a restart.

To correct this, we'd need to store data somewhere externally to track which containers have been handed out and which are just waiting in the pool. I thought this would be okay for now, though. In the meantime, you might want to drop the cull time to something like ten or fifteen minutes if you're restarting often.
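The conservative restart behaviour described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that adopted containers are stamped with the restart time as their last activity, which is a simplification of whatever the real heartbeat reads from the proxy.

```python
def adopt_existing(container_ids, now):
    """Treat every container found at startup as potentially active."""
    return {cid: now for cid in container_ids}


def stale(last_seen, now, cull_after):
    """Return ids whose last known activity is at least cull_after seconds old."""
    return sorted(cid for cid, seen in last_seen.items() if now - seen >= cull_after)


CULL_AFTER = 15 * 60  # seconds; an arbitrary value for the example

last_seen = adopt_existing(["a", "b", "c"], now=0)
early = stale(last_seen, now=5 * 60, cull_after=CULL_AFTER)   # nothing reaped yet
late = stale(last_seen, now=15 * 60, cull_after=CULL_AFTER)   # all adopted ones reaped
```

This is why a shorter cull time helps when restarting often: the adopted containers occupy slots until `CULL_AFTER` has passed, whether or not anyone is using them.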

Member

Yeah, it's ok if we explore this piece later. I wasn't expecting a fully healing pool as part of this PR. 😜

@rgbkrk

rgbkrk commented Oct 18, 2014

Member

This is excellent work @smashwilson, thank you so much. I am 👍 on merge. When ready, take your [wip] tag off.

Comment thread dockworker.py Outdated

Member

Want to match the name of the named tuple with the variable?

Contributor Author

Haha, yes. I just renamed one and not the other. I blame insufficient ☕

Comment thread dockworker.py Outdated

Member

You could almost remove max_tries and default it to RETRIES :-)

Contributor Author

> You could almost remove max_tries and default it to RETRIES :-)

So I could! Nice.
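The code under review is not shown in this thread, so the following is only a guess at the shape of the suggestion: a retry helper whose `max_tries` parameter defaults to a module-level `RETRIES` constant, so callers rarely need to pass it. The `retry` function and the value of `RETRIES` are hypothetical.

```python
RETRIES = 5  # hypothetical module-level default


def retry(operation, max_tries=RETRIES):
    """Call operation until it succeeds or max_tries attempts have failed."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for _ in range(max_tries):
        try:
            return operation()
        except Exception as error:  # a real helper would catch something narrower
            last_error = error
    raise last_error


calls = []


def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("not yet")
    return "ok"


result = retry(flaky)
```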

@Carreau

Carreau commented Oct 18, 2014

Member

that's a lot of code I'm not comfortable with.
I'm disappointed though, I saw a lot of yield, but no yield from :-P.

I'll re-take a look later.

@smashwilson

Contributor Author

> that's a lot of code I'm not comfortable with.
> I'm disappointed though, I saw a lot of yield, but no yield from :-P.
>
> I'll re-take a look later.

Understood, it's a big sprawling PR 😉 Thanks for looking it over!

Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears 😁
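For reference, `yield from` was indeed added in Python 3.3 (PEP 380). The sketch below uses plain generators, not Tornado coroutines, to show what it adds over re-yielding by hand; the function names are made up for the example, and the code needs Python 3.3 or later.

```python
def inner():
    yield 1
    yield 2
    return "done"  # returning a value from a generator is also Python 3.3+


def outer_with_yield_from():
    # Delegates to inner() and captures its return value.
    result = yield from inner()
    yield result


def outer_manual():
    # Roughly what 2.7-compatible code does for simple iteration:
    # re-yield each value by hand. The return value is not available this way.
    for value in inner():
        yield value


delegated = list(outer_with_yield_from())
manual = list(outer_manual())
```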

@Carreau

Carreau commented Oct 18, 2014

Member

> Also, isn't yield from Python 3.3+? As far as I know we're still running orchestrate with 2.7. Also I'm not entirely sure how it would play with tornado.gen. If it would make things more elegant I'm all ears

I still need to use yield from myself. I have a few side projects where I tried, but I think it will take me some time to be able to code = yield from brain.instance() :-)

@smashwilson smashwilson changed the title from "[wip] Spawning pool" to "Spawning pool" on Oct 18, 2014
@smashwilson

Contributor Author

@rgbkrk You can merge it whenever you like, especially considering you've already used it in a demo! I'd be happy to fix things here or in subsequent PRs as we find them.

@rgbkrk

rgbkrk commented Oct 18, 2014

Member

> I'd be happy to fix things here or in subsequent PRs as we find them.

😄

@rgbkrk

rgbkrk commented Oct 18, 2014

Member

:shipit:

rgbkrk added a commit that referenced this pull request Oct 18, 2014
@rgbkrk rgbkrk merged commit d21a34b into jupyter:master Oct 18, 2014
@smashwilson smashwilson deleted the spawnpool branch October 18, 2014 20:48
@rgbkrk

rgbkrk commented Oct 19, 2014

Member

Thanks again to @Carreau for the review, and to @smashwilson for this PR and for addressing the review comments!
