
Nodes must be able to be halted, destroyed and then re-integrated, possibly with a "light provision" or "no provision" alternative #6

Open
ssenator opened this issue May 13, 2020 · 5 comments
Labels
beneficial/enabler Fixing this issue will enable multiple capabilities or provide benefit beyond the immediate problem. bug Something isn't working

Comments

@ssenator
Collaborator

ssenator commented May 13, 2020

This issue arose from a discussion regarding the USRC Resilience team's requirements for fault detection and signature generation.

It may be a justification for a lighter-weight underlying provider, such as libvirt/ovirt or docker/container.

This would allow node configuration changes to be tested in a CI/CD pipeline, self-validating an hpc-collab node recipe & configuration change.

Another alternative: returning a node to an earlier fully provisioned state.
Perhaps an underlying vagrant snapshot/suspend mechanism could be utilized, provided:
there are sufficient hooks to trigger a mini-provision, consisting mostly of verification, and
there is a mechanism to do a multi-machine snapshot/resume while preserving dependencies.
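A minimal sketch of what such a wrapper might look like, assuming the node names below match the Vagrantfile and that a verification-only provisioner (the name "verify" here is hypothetical) exists in the recipe:

```sh
# Save a snapshot of every node after a full, known-good provision,
# walking the nodes in dependency order.
NODES="vcfs vcsvc vcbuild vcdb vcsched vc1 vc2 vclogin"

for n in $NODES; do
  vagrant snapshot save "$n" fully-provisioned
done

# Later: restore in the same order, then run only the verification step
# (the "mini-provision") instead of a full provision.
for n in $NODES; do
  vagrant snapshot restore "$n" fully-provisioned --no-provision
  vagrant provision "$n" --provision-with verify   # "verify" is an assumed provisioner name
done
```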

@ssenator ssenator added technical debt beneficial/enabler Fixing this issue will enable multiple capabilities or provide benefit beyond the immediate problem. and removed technical debt labels May 13, 2020
@ssenator
Collaborator Author

ssenator commented May 26, 2020

transcription of timing data...

Some timings: Vagrant 2.2.9, VirtualBox 6.1.6; underlying platform: Fedora 30, kernel 5.6.13-100
VirtualBox VMs file system: xfs over luks, underlying SSD/nvme
tarballs file system: xfs, underlying SSD/nvme

vcfs: 9m, ingest/createrepo repos.tgz, yum install/timeouts, selinux homedirs
vcsvc: 3m, yum install/timeouts, rsyslog selinux
vcbuild: 11m, yum (epel/priorities) install/timeouts, slurm build - if triggered, prereq timeouts
vcdb: 7m, yum (epel/priorities) install/timeouts, mysql (community mysql repo?) install, prereq timeouts
vcaltdb: 9m, yum (epel/priorities?) install/timeouts, mysql (community mysql repo?) install, prereq timeouts
vcsched: 6m, yum (epel/priorities?) install/timeouts, prereq timeouts
vc1: 13m, yum (epel/priorities?) install/timeouts, prereq timeouts
vc2: 4m, yum (epel/priorities?) install timeouts, prereq timeouts
vclogin: 22m, slurm test jobs, slurm db/config, yum (epel/priorities?) install timeouts
vxsched: 3m, yum install (epel/priorities?) timeouts
vx1: 5m, yum install (epel/priorities?) timeouts
vx2: 4m, yum install (epel/priorities?) timeouts
vxlogin: 10m, slurm test jobs, slurm db/config, yum (epel/priorities?) install timeouts

The muon/CentOS host shows similar but slower timings; VirtualBox VMs & tarballs on zfs, raidz1 over HDD.

So, a first cut (a sketch of items 1 and 2 follows this list):

  1. cache more from epel into local repositories, attempting to remove the epel dependency from most/all nodes,
  2. find another mechanism besides yum-plugin-priorities to force the local repo first, with remote fallback,
  3. tune prereq timeouts,
  4. everything else, such as vclogin, slurm test jobs, selinux.
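A hedged sketch of items 1 and 2: mirror the needed epel packages into a local repository once, then point nodes at it with a lower yum `cost` so the local repo is preferred without yum-plugin-priorities. The paths and repo ids below are assumptions, not the repo's actual layout.

```sh
# Mirror epel locally and index it.
reposync --repoid=epel --download_path=/var/repos --newest-only
createrepo /var/repos/epel

# Prefer the local mirror: yum's "cost" option defaults to 1000,
# and lower-cost repositories win, so no priorities plugin is needed.
cat > /etc/yum.repos.d/local-epel.repo <<'EOF'
[local-epel]
name=Local EPEL mirror
baseurl=file:///var/repos/epel
enabled=1
gpgcheck=0
cost=500
EOF
```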

@ssenator ssenator added bug Something isn't working summer institute https://docs.google.com/document/d/1HEWkCig6d6-heDXwzmEO81pB8lda9ie0AXOlzVDYTGc/edit labels May 26, 2020
@ssenator
Collaborator Author

ssenator commented Jun 1, 2020

VMtouch, to lock memory pages in the host: https://github.com/hoytech/vmtouch
User-level NFS server, replacing vboxsf
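A minimal sketch of the vmtouch idea, assuming the default VirtualBox VM directory layout on the host; the vboxsf replacement would hang off Vagrant's synced-folder mechanism, noted in the comment below.

```sh
# Keep the VirtualBox disk images resident in the host page cache:
# -d daemonize, -l lock the mapped pages in memory.
vmtouch -dl ~/VirtualBox\ VMs/*/*.vdi

# For replacing vboxsf, Vagrant's per-folder synced-folder type is the
# hook point, e.g. in the Vagrantfile:
#   config.vm.synced_folder ".", "/vagrant", type: "nfs"
```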

Preliminary timings don't indicate much of a gain, implying the problem isn't host I/O, but rather host-to-VM I/O or in-VM I/O.
CPU load is not high when this is occurring, implying that neither the host nor the guests are thrashing.

@ssenator
Collaborator Author

ssenator commented Jun 8, 2020

Contacting the epel repository is a source of delay in node provisioning. This is a wait for network I/O, rather than any real productive work. This may be best addressed by a snapshot-checkpoint-suspend/resume cycle.

This has been addressed by careful caching of RPMs to avoid epel and yum search timeouts, especially in the early RPM installation step.
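One way the pre-caching can be done ahead of provisioning, shown here as a hedged example (the package names and destination directory are placeholders, not the recipe's actual list):

```sh
# Pull the packages and their dependency closure into a local directory,
# then index it, so nodes never wait on epel during the early install step.
yumdownloader --resolve --destdir=/var/repos/prefetch munge mariadb-server
createrepo /var/repos/prefetch
```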

@ssenator
Collaborator Author

Compute nodes may be unprovisioned and reprovisioned safely, separately from the other nodes.
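For example, cycling a single compute node while the rest of the cluster stays up ("vc2" here stands in for any compute node):

```sh
vagrant halt vc2
vagrant destroy -f vc2
vagrant up vc2        # full re-provision of just this node
```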

@ssenator ssenator removed the summer institute https://docs.google.com/document/d/1HEWkCig6d6-heDXwzmEO81pB8lda9ie0AXOlzVDYTGc/edit label Jul 11, 2020
@ssenator
Collaborator Author

ESTALE monitoring code added, allowing most nodes (except vcdb) to be reprovisioned. In particular, vcfs may disappear and reappear, much like a traditional NFS server.
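A rough sketch of what such ESTALE monitoring could look like (not the actual code added to the repo): probe an NFS-backed path and remount it when the handle has gone stale, e.g. after vcfs disappears and reappears. The mount point is an assumption.

```sh
MNT=/home    # assumed NFS mount exported by vcfs
while sleep 30; do
  # A stale handle makes stat fail with "Stale file handle"; lazily
  # unmount and remount (per fstab) to recover.
  if stat "$MNT" 2>&1 | grep -q 'Stale file handle'; then
    umount -l "$MNT" && mount "$MNT"
  fi
done
```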
