Spawning pool #69
832a61e to 6e60415
|
Awesome. That 2-second delay turns into a 15-second delay with high activity. That's the real problem. |
I need a more passive-aggressive emoji here. ➡️
6e60415 to f2dea5e
I'm ok with saying this is "provisioning a new ad-hoc container".
|
One question: what do we do when the pool is empty? Right now, it raises an exception. There are two directions I can take this in:
Option 2 makes for cleaner conceptual boundaries, but I like option 1 because it feels more controlled. Maybe I can even do some things to balance ad-hoc vs. prelaunched containers. What do you think? |
|
Also, I think it'd clean up quite a bit of redundancy if I refactored out a class to interact with the proxy - something you can instantiate with an endpoint and a token and pass around |
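A minimal sketch of what such a proxy class might look like. `ProxyClient`, its method names, and the injectable `send` hook are hypothetical and not taken from this PR. The `/api/routes/...` paths and the `Authorization: token ...` header assume a configurable-http-proxy style REST API, so check them against the proxy actually in use.

```python
import json
from urllib.request import Request, urlopen


class ProxyClient:
    """Hypothetical wrapper around the proxy's REST API (names are assumptions)."""

    def __init__(self, endpoint, token, send=urlopen):
        self.endpoint = endpoint.rstrip("/")
        self.token = token
        self._send = send  # injectable so the class can be exercised offline

    def _request(self, method, path, body=None):
        # Build an authenticated request for a single route.
        data = json.dumps(body).encode("utf-8") if body is not None else None
        return Request(
            self.endpoint + "/api/routes/" + path.lstrip("/"),
            data=data,
            method=method,
            headers={"Authorization": "token " + self.token},
        )

    def add_route(self, path, target):
        """Point `path` on the proxy at the container listening on `target`."""
        return self._send(self._request("POST", path, {"target": target}))

    def remove_route(self, path):
        """Drop the proxy entry for `path`."""
        return self._send(self._request("DELETE", path))
```

An instance built once with the endpoint and token could then be passed to whatever launches and culls containers, instead of each call site rebuilding headers and URLs.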
|
Something else I could explore is a decaying notebook expiration time. Under heavy load, we could start expiring containers that have been alive for 45 minutes, 30 minutes, and so on, down to some minimum, before we just give up. |
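For illustration only, the decaying expiration idea could be sketched like this. The function names, the 15-minute step, and the 15-minute floor are assumptions; the comment above only mentions 45 minutes, 30 minutes, "and so on, down to some minimum".

```python
def decaying_thresholds(start=45, step=15, minimum=15):
    """Yield expiration thresholds in minutes: 45, 30, ... down to `minimum`."""
    threshold = start
    while threshold >= minimum:
        yield threshold
        threshold -= step


def pick_expired(ages_minutes, needed, **kwargs):
    """Pick `needed` containers to expire, relaxing the threshold step by step.

    `ages_minutes` maps a container name to how long it has been alive.
    Returns an empty list if even the minimum threshold frees too few
    containers (the "just give up" case).
    """
    for threshold in decaying_thresholds(**kwargs):
        expired = [name for name, age in ages_minutes.items() if age >= threshold]
        if len(expired) >= needed:
            # Prefer the oldest containers first.
            return sorted(expired, key=ages_minutes.get, reverse=True)[:needed]
    return []
```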
|
I'd like to let ad-hoc containers get created outside the bounds of the pool, but then that could easily overload the system. The real reason we want the pooling is for speed though. What if we just remove one of the allocated containers to make room for the ad-hoc container? That user does have to wait for theirs to spin up. Only other way around that would be to somehow turn an allocated container's |
|
I'd leave out the decaying notebook expiration time for now. That wouldn't work well for the use cases that resonate really well with people (tutorials, classes). It would work well for demos though. I guess post an issue as a feature request for now. |
|
Refactoring a class out for the proxy sounds good to me. I was originally going to do that up until it became apparent how few calls we were going to make against the proxy. What probably needs to happen, though, is for the pool to handle both the creation of new ad-hoc containers and the empty-pool case, all in one place. |
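One way the centralized version might look, as a rough sketch: the pool owns both the prelaunched containers and a small budget of ad-hoc launches, and only raises once both are exhausted. `ContainerPool`, `EmptyPoolError`, `max_adhoc`, and the `launch` callable are all hypothetical names, not code from this PR.

```python
from collections import deque


class EmptyPoolError(Exception):
    """Raised when no container is available and no ad-hoc launch is allowed."""


class ContainerPool:
    def __init__(self, launch, capacity, max_adhoc=0):
        self._launch = launch  # hypothetical callable that starts one container
        self._max_adhoc = max_adhoc
        self._adhoc = 0
        # Prelaunch `capacity` containers up front.
        self._available = deque(launch() for _ in range(capacity))

    def acquire(self):
        """Hand out a prelaunched container, or fall back to an ad-hoc one."""
        if self._available:
            return self._available.popleft()
        if self._adhoc < self._max_adhoc:
            # Pool is empty: the caller waits for a fresh container instead.
            self._adhoc += 1
            return self._launch()
        raise EmptyPoolError("pool exhausted and ad-hoc budget spent")
```

Capping ad-hoc launches keeps the fallback from overloading the host, which is the concern raised earlier in the thread.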
This piece is redundant now, since path is not None.
|
@rgbkrk I've pushed my work so far if you want to take a look. I think stale container culling is a bit broken, though. |
|
@rgbkrk Actually... I gave this another spin this morning, and now everything seems to be working fine. For extra fun, kill the tmpnb container and re-launch it with a different |
I had a bunch of dead containers that needed to be removed (after a reboot). This didn't pick them up and I nuked them by hand.
Hmm, that could be a bug in heartbeat, or it could be "working as intended."
When you restart the tmpnb container, the new process has no way to distinguish the old process's pooled containers from its active containers, so to be safe, I assume that all of them are potentially active. They'll eventually be reaped and replaced with fresh ones once the normal culling time has elapsed. It does mean, though, that after a restart the pool can look erroneously full until several heartbeats have elapsed.
To correct this, we'd need to store data somewhere externally to track which containers have been handed out and which are just waiting in the pool. I thought this would be okay for now, though. In the meantime, you might want to drop the cull time to something like ten or fifteen minutes if you're restarting often.
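As a hedged sketch of that external tracking, a small JSON file could record which containers are pooled and which have been handed out, so a restarted process can tell them apart. `PoolState`, its methods, and the file layout are hypothetical; nothing like this exists in the PR.

```python
import json
import os


class PoolState:
    """Hypothetical on-disk record of pooled vs. handed-out container ids."""

    def __init__(self, path):
        self.path = path
        self.pooled, self.handed_out = set(), set()
        if os.path.exists(path):
            # A previous process left state behind: reload it.
            with open(path) as f:
                data = json.load(f)
            self.pooled = set(data["pooled"])
            self.handed_out = set(data["handed_out"])

    def _save(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"pooled": sorted(self.pooled),
                       "handed_out": sorted(self.handed_out)}, f)
        # Swap the file in one step so a reader never sees a half-written file.
        os.replace(tmp, self.path)

    def add_pooled(self, container_id):
        self.pooled.add(container_id)
        self._save()

    def hand_out(self, container_id):
        self.pooled.discard(container_id)
        self.handed_out.add(container_id)
        self._save()
```

On restart, anything still listed under `pooled` could go straight back into the pool rather than being treated as potentially active.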
Yeah, it's ok if we explore this piece later. I wasn't expecting a fully healing pool as part of this PR. 😜
|
This is excellent work @smashwilson, thank you so much. I am 👍 on merge. When ready, take your |
Want to match the name of the named tuple with the variable?
Haha, yes. I just renamed one and not the other. I blame insufficient ☕
You could almost remove max_tries and default it to RETRIES :-)
> You could almost remove max_tries and default it to RETRIES :-)
So I could! Nice.
|
That's a lot of code I'm not comfortable with. I'll take another look later. |
Understood, it's a big sprawling PR 😉 Thanks for looking it over! Also, isn't |
I still need to use |
|
@rgbkrk You can merge it whenever you like, especially considering you've already used it in a demo! I'd be happy to fix things here or in subsequent PRs as we find them. |
😄 |
Thanks again @Carreau for the review, and @smashwilson for this PR and for addressing the review comments!
Time to shave off that final 2-second delay. This PR implements #32 by pre-launching a configured (by which I mean hardcoded to "3") set of containers and handing them out to incoming requests. When a container is culled for inactivity, it's scrapped and a new one is launched and added to the pool in its place.
If you hit orchestrate.py with a path already set (using a stale link, for example, or an appear.in-style link like #67 implements), you have a personal container created and linked as it happens now.
/cc @rgbkrk so he can see just how terrible this is right now, and suggest embetterment techniques 😉
make dev should probably default to something like 3, maybe.
orchestrate.py._clean_orphaned_containers. Bring more of that functionality into cull as well, so we also clean up zombies and dead proxy entries periodically.
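The cull-and-replace cycle described above can be sketched roughly as follows. This is not the code in the PR: `PrelaunchedPool`, the `launch` and `destroy` callables, the `clock` hook, and the one-hour `cull_after` default are all assumptions made for the example. Only the pool size of 3 comes from the description.

```python
import time


class PrelaunchedPool:
    def __init__(self, launch, destroy, size=3, cull_after=3600,
                 clock=time.monotonic):
        self._launch = launch      # hypothetical: starts and returns a container
        self._destroy = destroy    # hypothetical: stops and removes a container
        self._clock = clock
        self.cull_after = cull_after  # seconds of inactivity before culling
        self.available = [launch() for _ in range(size)]
        self.in_use = {}           # container -> time it was last seen active

    def acquire(self):
        """Hand a prelaunched container to an incoming request.

        Raises IndexError when the pool is empty.
        """
        container = self.available.pop()
        self.in_use[container] = self._clock()
        return container

    def cull(self):
        """Scrap inactive containers and launch a fresh one for each."""
        now = self._clock()
        stale = [c for c, seen in self.in_use.items()
                 if now - seen >= self.cull_after]
        for container in stale:
            del self.in_use[container]
            self._destroy(container)
            self.available.append(self._launch())
        return stale
```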