Bundling containers for a kernel session #78
Labels: comp:agent (Related to Agent component), comp:manager (Related to Manager component), type:feature (Add new features)
achimnol referenced this issue in lablup/backend.ai-manager on Aug 23, 2017:
* Now the manager + gateway runs on multiple CPU cores with sane transaction semantics, thanks to aiotools!
  - It no longer depends on Redis as a pub-sub broker nor as a database. All communications are done via ZeroMQ with no centralized queue server.
  - Redis is used only for per-keypair rate-limiting.
* Now the manager searches for an available agent to spawn new containers based on available memory / CPU / GPU capacity units. No more hard-coded instance types!
* The DB schema is now prepared for multi-container kernel sessions.
  - User-facing APIs now use a "session ID" which is directed to the master container of the given session.
  - Each container has a unique "kernel ID" and is managed individually.
* Replace asyncpg + asyncpgsa with aiopg for better SQLAlchemy support (especially custom type decorators).
* TODOs:
  - Stabilize accounting of used/available resource units.
  - Some parts still confuse session and kernel IDs.
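To illustrate the session/kernel split described in this changelog, here is a minimal SQLAlchemy sketch of the idea (not the actual schema; the `role` and `status` columns are assumptions for illustration):

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Sketch: each row is one container ("kernel"); rows sharing the same
# sess_id form one user-facing session whose master container receives
# the user-facing API traffic.
kernels = sa.Table(
    "kernels", metadata,
    sa.Column("id", sa.String, primary_key=True),    # unique kernel ID
    sa.Column("sess_id", sa.String, index=True),     # shared session ID
    sa.Column("role", sa.String, default="master"),  # assumed: master / sub
    sa.Column("status", sa.String),                  # assumed lifecycle state
)
```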
achimnol referenced this issue in lablup/backend.ai-manager on Aug 23, 2017:
* Refactor `app.dbpool` to use the recommended custom context format: `app['dbpool']`
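For context, aiohttp recommends storing shared resources in the application mapping under string keys rather than as ad-hoc attributes. A minimal sketch with aiopg (the DSN is a placeholder):

```python
import aiopg.sa
from aiohttp import web

async def init_db(app: web.Application) -> None:
    # Store the pool under a string key instead of an attribute
    # (app.dbpool -> app['dbpool']), as recommended by aiohttp.
    app["dbpool"] = await aiopg.sa.create_engine(dsn="postgresql://...")

async def close_db(app: web.Application) -> None:
    app["dbpool"].close()
    await app["dbpool"].wait_closed()

app = web.Application()
app.on_startup.append(init_db)
app.on_cleanup.append(close_db)
```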
achimnol referenced this issue in lablup/backend.ai-manager on Aug 23, 2017.
achimnol referenced this issue in lablup/backend.ai-manager on Aug 29, 2017.
achimnol referenced this issue in lablup/backend.ai-manager on Jan 4, 2018:
There were many places that missed appropriate filtering conditions when fetching active sessions. This bug has been the major source of concurrency tracking errors.
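A sketch of the kind of filter that was missing (illustrative only: the table layout, column names, and status values here are assumptions, not the actual schema):

```python
import sqlalchemy as sa

metadata = sa.MetaData()
kernels = sa.Table(  # trimmed-down stand-in for the real table
    "kernels", metadata,
    sa.Column("sess_id", sa.String),
    sa.Column("access_key", sa.String),
    sa.Column("status", sa.String),
)

ACTIVE_STATUSES = ("PREPARING", "RUNNING", "RESTARTING")  # assumed states

# The missing piece: restrict to active states when counting a keypair's
# sessions, otherwise terminated kernels keep inflating the concurrency
# counters.
query = (
    sa.select([kernels.c.sess_id])
    .where(kernels.c.access_key == "EXAMPLE-ACCESS-KEY")  # placeholder
    .where(kernels.c.status.in_(ACTIVE_STATUSES))         # the missing filter
)
```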
Now the server-side implementation is done with the v20.09.0 release, with the following work: …

Closing as completed; let's handle fine-grained UI improvements in separate issues.
For large-scale computations, sometimes we need to run multiple containers on different hosts for resource aggregation and distributed/parallel processing.
In the past, this was very difficult to implement because Docker's networking was limited to linking one container to another via a hostname alias (`--link`), which is essentially a one-to-one private link. Now it's 2017, and Docker offers a distributed coordination mode called "Swarm" which includes overlay networking. Docker Swarm uses the Raft algorithm to share node information, and any new Docker daemon can join an existing swarm via a host:port address and a secret token. Once joined, containers of any daemon in the swarm can be connected to volatile overlay networks created and destroyed at runtime.
Let's try this and support multi-container distributed computing!
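As a rough sketch of the moving parts, using the docker-py SDK (the addresses, token, and network name below are placeholders):

```python
import docker

client = docker.from_env()

# Join an existing swarm via a manager address and a secret token
# (both placeholders here).
client.swarm.join(
    remote_addrs=["10.0.0.1:2377"],
    join_token="SWMTKN-1-...",
)

# Create an attachable overlay network at runtime; containers on any
# daemon in the swarm can then be attached to it, and the network can
# be torn down when the session ends.
net = client.networks.create(
    "sess-abc123", driver="overlay", attachable=True,
)
```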
Update for 2020!
Docker Swarm has problems with overlapping IP addresses across different overlay networks, and creating/destroying and attaching/detaching networks has proven to be unstable.
After some testing by @kyujin-cho, we decided to fall back to the "classic" Swarm mode, which uses an external etcd to manage multi-host networks, and to use namespaced container hostname aliases to access other containers in the same overlay network.
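A sketch of the aliasing idea with docker-py (the network, container, and alias names are placeholders, and the namespacing scheme shown is an assumption):

```python
import docker

client = docker.from_env()
net = client.networks.get("sess-abc123")  # the per-session overlay network

# Attach a container under a namespaced alias so that peers in the same
# session can reach it as e.g. "sub1.sess-abc123" regardless of its
# actual container name.
container = client.containers.get("kernel-xyz")
net.connect(container, aliases=["sub1.sess-abc123"])
```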
Basically, we keep the same "kernels" table extension as we prototyped in 2017–2018. A single record of the kernels table corresponds to a container, and multiple records may share the same `sess_id`, indicating that they belong to an overlay cluster.

Phase 1
… the `scripts` directory of this meta repository.

Phase 2
… (`BACKEND_CLUSTER_ROLE`, `BACKEND_CLUSTER_ROLE_IDX`)
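A hypothetical sketch of how a container could consume these variables; only the variable names come from this issue, and the alias scheme is assumed:

```python
import os

# Each container learns its role within the cluster and its index among
# same-role peers from its environment (defaults here are assumptions).
role = os.environ.get("BACKEND_CLUSTER_ROLE", "main")
idx = int(os.environ.get("BACKEND_CLUSTER_ROLE_IDX", "0"))
hostname_alias = f"{role}{idx + 1}"  # e.g. "main1", "sub2" (assumed scheme)
```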
Phase 3
… `asyncio.gather()` with proper interruption handling.
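A minimal sketch of what "proper interruption handling" around `asyncio.gather()` could look like: if one spawn fails, cancel the in-flight siblings before propagating the error, so a session never comes up half-created. `spawn_container` is a hypothetical stand-in for the actual agent call:

```python
import asyncio

async def spawn_container(kernel_id: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the actual agent RPC call
    return f"container-for-{kernel_id}"

async def spawn_session(kernel_ids: list[str]) -> list[str]:
    tasks = [asyncio.ensure_future(spawn_container(k)) for k in kernel_ids]
    try:
        # Without return_exceptions, the first failure propagates here.
        return await asyncio.gather(*tasks)
    except BaseException:
        # Interruption handling: cancel whatever is still running and wait
        # for the cancellations to settle before re-raising.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise

asyncio.run(spawn_session(["kern-1", "kern-2", "kern-3"]))
```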