
Nodes must be able to be halted, destroyed and then re-integrated, possibly with a "light provision" or "no provision" alternative #6

Open
ssenator opened this issue May 13, 2020 · 5 comments
Labels
beneficial/enabler Fixing this issue will enable multiple capabilities or provide benefit beyond the immediate problem. bug Something isn't working

Comments

@ssenator
Collaborator

ssenator commented May 13, 2020

This issue arose from a discussion regarding the USRC Resilience team's requirements for fault detection and signature generation.

It may be a justification for a lighter-weight underlying provider, such as libvirt/ovirt or docker/container.

This would allow node configuration changes to be tested in a CI/CD pipeline, self-validating an hpc-collab node recipe & configuration change.

Another alternative: returning a node to an earlier fully provisioned state.
Perhaps an underlying vagrant snapshot/suspend mechanism could be utilized, provided:
there are sufficient hooks to trigger a mini-provision, consisting mostly of verification, and
there is a mechanism to do a multi-machine snapshot/resume while preserving dependencies.
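A minimal sketch of what such a wrapper might look like, assuming the node names below match the Vagrantfile and that a verification-only provisioner (the name "verify" here is hypothetical) exists in the recipe:

```sh
# Save a snapshot of every node after a full, known-good provision,
# walking the nodes in dependency order.
NODES="vcfs vcsvc vcbuild vcdb vcsched vc1 vc2 vclogin"

for n in $NODES; do
  vagrant snapshot save "$n" fully-provisioned
done

# Later: restore in the same order, then run only the verification step
# (the "mini-provision") instead of a full provision.
for n in $NODES; do
  vagrant snapshot restore "$n" fully-provisioned --no-provision
  vagrant provision "$n" --provision-with verify   # "verify" is an assumed provisioner name
done
```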

@ssenator ssenator added technical debt beneficial/enabler Fixing this issue will enable multiple capabilities or provide benefit beyond the immediate problem. and removed technical debt labels May 13, 2020
@ssenator
Collaborator Author

ssenator commented May 26, 2020

transcription of timing data...

Some timings: Vagrant 2.2.9, VirtualBox 6.1.6; underlying platform: Fedora 30, kernel 5.6.13-100
VirtualBox VMs file system: xfs over luks, underlying SSD/nvme
tarballs file system: xfs, underlying SSD/nvme

vcfs: 9m, ingest/createrepo repos.tgz, yum install/timeouts, selinux homedirs
vcsvc: 3m, yum install/timeouts, rsyslog selinux
vcbuild: 11m, yum (epel/priorities) install/timeouts, slurm build - if triggered, prereq timeouts
vcdb: 7m, yum (epel/priorities) install/timeouts, mysql (community mysql repo?) install, prereq timeouts
vcaltdb: 9m, yum (epel/priorities?) install/timeouts, mysql (community mysql repo?) install, prereq timeouts
vcsched: 6m, yum (epel/priorities?) install/timeouts, prereq timeouts
vc1: 13m, yum (epel/priorities?) install/timeouts, prereq timeouts
vc2: 4m, yum (epel/priorities?) install timeouts, prereq timeouts
vclogin: 22m, slurm test jobs, slurm db/config, yum (epel/priorities?) install timeouts
vxsched: 3m, yum install (epel/priorities?) timeouts
vx1: 5m, yum install (epel/priorities?) timeouts
vx2: 4m, yum install (epel/priorities?) timeouts
vxlogin: 10m, slurm test jobs, slurm db/config, yum (epel/priorities?) install timeouts

The muon/CentOS host shows similar but slower timings; VirtualBox VMs & tarballs on zfs, raidz1 over HDD.

So, a first cut (a sketch of items 1 and 2 follows this list):

  1. cache more from epel into local repositories, attempting to remove the epel dependency from most/all nodes,
  2. find another mechanism besides yum-plugin-priorities to force the local repo first, with remote fallback,
  3. tune prereq timeouts,
  4. everything else, such as vclogin, slurm test jobs, selinux.
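A hedged sketch of items 1 and 2: mirror the needed epel packages into a local repository once, then point nodes at it with a lower yum `cost` so the local repo is preferred without yum-plugin-priorities. The paths and repo ids below are assumptions, not the repo's actual layout.

```sh
# Mirror epel locally and index it.
reposync --repoid=epel --download_path=/var/repos --newest-only
createrepo /var/repos/epel

# Prefer the local mirror: yum's "cost" option defaults to 1000,
# and lower-cost repositories win, so no priorities plugin is needed.
cat > /etc/yum.repos.d/local-epel.repo <<'EOF'
[local-epel]
name=Local EPEL mirror
baseurl=file:///var/repos/epel
enabled=1
gpgcheck=0
cost=500
EOF
```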

@ssenator ssenator added bug Something isn't working summer institute https://docs.google.com/document/d/1HEWkCig6d6-heDXwzmEO81pB8lda9ie0AXOlzVDYTGc/edit labels May 26, 2020
@ssenator
Collaborator Author

ssenator commented Jun 1, 2020

VMtouch, to lock memory pages in the host: https://github.com/hoytech/vmtouch
User-level NFS server, replacing vboxsf
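A minimal sketch of the vmtouch idea, assuming the default VirtualBox VM directory layout on the host; the vboxsf replacement would hang off Vagrant's synced-folder mechanism, noted in the comment below.

```sh
# Keep the VirtualBox disk images resident in the host page cache:
# -d daemonize, -l lock the mapped pages in memory.
vmtouch -dl ~/VirtualBox\ VMs/*/*.vdi

# For replacing vboxsf, Vagrant's per-folder synced-folder type is the
# hook point, e.g. in the Vagrantfile:
#   config.vm.synced_folder ".", "/vagrant", type: "nfs"
```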

Preliminary timings don't indicate much of a gain, implying the problem isn't host I/O, but rather host-to-VM I/O or in-VM I/O.
CPU load is not high when this is occurring, implying that neither the host nor the guests are thrashing.

@ssenator
Collaborator Author

ssenator commented Jun 8, 2020

Contacting the epel repository is a source of delay in node provisioning. This is a wait for network I/O, rather than any real productive work. This may be best addressed by a snapshot-checkpoint-suspend/resume cycle.

This has been addressed by careful caching of RPMs to avoid epel and yum search timeouts, especially in the early RPM installation step.
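One way the pre-caching can be done ahead of provisioning, shown here as a hedged example (the package names and destination directory are placeholders, not the recipe's actual list):

```sh
# Pull the packages and their dependency closure into a local directory,
# then index it, so nodes never wait on epel during the early install step.
yumdownloader --resolve --destdir=/var/repos/prefetch munge mariadb-server
createrepo /var/repos/prefetch
```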

@ssenator
Collaborator Author

Compute nodes may be unprovisioned and reprovisioned safely, separately from the other nodes.
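For example, cycling a single compute node while the rest of the cluster stays up ("vc2" here stands in for any compute node):

```sh
vagrant halt vc2
vagrant destroy -f vc2
vagrant up vc2        # full re-provision of just this node
```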

@ssenator ssenator removed the summer institute https://docs.google.com/document/d/1HEWkCig6d6-heDXwzmEO81pB8lda9ie0AXOlzVDYTGc/edit label Jul 11, 2020
@ssenator
Collaborator Author

ESTALE monitoring code added, allowing most nodes (except vcdb) to be reprovisioned. In particular, vcfs may disappear and reappear, much like a traditional NFS server.
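A rough sketch of what such ESTALE monitoring could look like (not the actual code added to the repo): probe an NFS-backed path and remount it when the handle has gone stale, e.g. after vcfs disappears and reappears. The mount point is an assumption.

```sh
MNT=/home    # assumed NFS mount exported by vcfs
while sleep 30; do
  # A stale handle makes stat fail with "Stale file handle"; lazily
  # unmount and remount (per fstab) to recover.
  if stat "$MNT" 2>&1 | grep -q 'Stale file handle'; then
    umount -l "$MNT" && mount "$MNT"
  fi
done
```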
