Nodes must be able to be halted, destroyed and then re-integrated, possibly with a "light provision" or "no provision" alternative #6
Comments
Transcription of timing data. Some timings: Vagrant 2.2.9, VirtualBox 6.1.6; underlying platform: Fedora 30 (5.6.13-100).
vmtouch, to lock memory pages in the host: https://github.com/hoytech/vmtouch. Preliminary timings don't indicate much of a gain, implying the problem isn't host I/O, but rather host-to-VM I/O or in-VM I/O.
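A small wall-clock harness is one way to collect comparable timings with and without vmtouch. This is only a sketch, not the project's tooling: `time_command` is a hypothetical helper, and the command it runs here is a placeholder for whatever provisioning step is being measured.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (exit_code, elapsed_seconds) by wall clock."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command (an assumption); substitute the provisioning
    # step under test, once with and once without vmtouch on the host.
    code, elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"exit={code} elapsed={elapsed:.3f}s")
```

Repeating each measurement several times and comparing medians would reduce noise from host caching effects.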
Contacting the EPEL repository is a source of delay in node provisioning. This is a wait for network I/O rather than any real productive work. It may be best addressed by a snapshot-checkpoint-suspend/resume cycle. It has since been addressed by careful caching of RPMs to avoid EPEL and yum search timeouts, especially in the early RPM installation step.
Compute nodes may be unprovisioned and reprovisioned safely, independently of other nodes.
ESTALE monitoring code added, allowing most nodes (except vcdb) to be reprovisioned. In particular, vcfs may disappear and reappear, much like a traditional NFS server.
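For illustration, a stale-handle check can be as small as a `stat` call that inspects the error code. This is a minimal sketch and not the monitoring code referred to above: `is_stale` and its `stat_fn` parameter are hypothetical names, and the mount path in the usage line is an assumption.

```python
import errno
import os


def is_stale(path, stat_fn=os.stat):
    """Return True if stat on path fails with ESTALE (stale file handle).

    Any other outcome, including other errors such as ENOENT, returns False.
    stat_fn is injectable so the check can be exercised without a real mount.
    """
    try:
        stat_fn(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False


if __name__ == "__main__":
    # Hypothetical mount point exported by vcfs.
    print(is_stale("/home"))
```

A monitor built on this would poll the mount and remount when the check returns True, which is what lets the file server disappear and reappear without reprovisioning its clients.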
This issue arose from a discussion regarding the USRC Resilience team's requirements for fault detection and signature generation.
It may be a justification for a lighter-weight underlying provider, such as libvirt/oVirt or docker/container.
This would allow node configuration changes to be tested in a CI/CD pipeline, self-validating an hpc-collab node recipe & configuration change.
Another alternative: , returning node back to an earlier full state
Perhaps an underlying vagrant snapshot/suspend mechanism could be utilized, provided:
there are sufficient hooks to trigger a mini-provision, consisting mostly of verification, and
there is a mechanism to do multi-machine snapshot/resume while preserving dependencies.
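The dependency-preserving part could be sketched as a topological sort over the nodes, with snapshots taken dependents-first and restores done dependencies-first. Everything here is an assumption for illustration: the dependency map, the `compute1` node name, and the helper functions are hypothetical, and the plans are only built as argument lists, not executed. The verification-only mini-provision hook is not covered.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: node -> nodes that must be running first.
DEPENDS_ON = {
    "vcfs": set(),
    "vcdb": {"vcfs"},
    "compute1": {"vcfs", "vcdb"},
}


def resume_order(depends_on):
    """Nodes ordered so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def snapshot_plan(depends_on, name):
    """Snapshot commands, dependents first so servers outlive their clients."""
    return [["vagrant", "snapshot", "save", node, name]
            for node in reversed(resume_order(depends_on))]


def restore_plan(depends_on, name):
    """Restore commands, dependencies first."""
    return [["vagrant", "snapshot", "restore", node, name]
            for node in resume_order(depends_on)]


if __name__ == "__main__":
    for argv in restore_plan(DEPENDS_ON, "clean"):
        print(" ".join(argv))
```

Each argument list could be handed to `subprocess.run` once the ordering has been checked against the real cluster recipe.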