Hardware errors on toxis #51

Closed
mtelvers opened this issue Jun 14, 2023 · 10 comments

Comments

@mtelvers
Collaborator

The machine toxis has multiple hardware issues. The following services have been affected:

Issues:

  • The BTRFS volume has been corrupted, causing it to be marked read-only.
  • Multiple ECC memory errors have been reported.

The machine has a spare spinning disk, which has been brought into service with a copy of /var/lib/docker, but due to size constraints, the job log output, var/job, has not been copied.

The current configuration is 2 x 18-core CPUs (72 threads) with 512GB RAM and a 1.8TB SSD. Historically, toxis also performed the solves locally, but this has recently been migrated to the solver-service; therefore, a smaller machine is sufficient. We know that Opam Repo CI requires > 32GB of RAM, which is why it was migrated to toxis.

The suggested new specification is:

  • 64GB RAM
  • 4 x vCPU
  • ~500GB of disk (or larger, depending on how many historic logs we wish to retain)

@avsm
Member

avsm commented Jun 14, 2023

@mtelvers I've launched an instance, but what should I call it? Is toxis a replacement for what was formerly ci3.ocamllabs.io (opam.ci.ocaml.org)?

@mtelvers
Collaborator Author

toxis was ci.ocamllabs.io before becoming ocaml.ci.dev. opam-repo.ci.ocaml.org was then moved on to it. ci3.ocamllabs.io is the OCluster scheduler.

@mtelvers
Collaborator Author

We only need an internal name for it at Scaleway, as it will have many external DNS names. Why not just ci, or is that already taken?

@avsm
Member

avsm commented Jun 14, 2023

OK, I set it up with the internal name of opam-repo-ci.sw.ocaml.org, and we can clean up all the other names once you've got it running. We should move the OCluster scheduler into this namespace as well at some point...

@avsm
Member

avsm commented Jun 14, 2023

The new instance is an experimental ARM-based (Graviton2) setup that Scaleway says is on a trial basis, so we may have to migrate again in the future. But it's half the price and half the energy usage, so much better than the old toxis!

@mtelvers
Collaborator Author

@avsm I'm nearly ready to make the switchover. I will try to maintain the current state from toxis. I have migrated the data. We need to shut down the services on toxis and perform a small incremental copy, then bring up the services on the new server. Are you available to do the DNS switchover? The entries that I would need you to change are opam-repo.ci.ocaml.org and opam.ci.ocaml.org; I can do the others.
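
The stop / incremental copy / start sequence described above can be sketched as an ordered list of commands. This is only an illustration, not the procedure actually used: the thread does not say how the services are managed or which directories were copied, so the data directory and the stop/start commands below are hypothetical parameters, and `rsync` over SSH is an assumed copy tool. The function only builds the commands; it does not run them.

```python
import shlex
from typing import List


def switchover_plan(data_dir: str, new_host: str,
                    stop_cmd: List[str], start_cmd: List[str]) -> List[List[str]]:
    """Return the commands for a stop / incremental copy / start switchover, in order.

    All arguments are assumptions supplied by the caller; nothing is executed here.
    """
    # Trailing slash so rsync copies the directory's contents, not the directory itself.
    src = data_dir.rstrip("/") + "/"
    return [
        # 1. Stop the services on the old machine so the data stops changing.
        stop_cmd,
        # 2. Incremental copy: only files changed since the bulk copy are transferred.
        ["rsync", "-aH", "--delete", "--numeric-ids", src, f"{new_host}:{src}"],
        # 3. Start the services on the new machine.
        ["ssh", new_host, shlex.join(start_cmd)],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    plan = switchover_plan("/var/lib/docker", "opam-repo-ci.sw.ocaml.org",
                           ["systemctl", "stop", "docker"],
                           ["systemctl", "start", "docker"])
    for cmd in plan:
        print(shlex.join(cmd))
```

Stopping the services before the final copy keeps the downtime limited to the size of the incremental difference rather than the full data set.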

For reference, the complete list is:

  • opam-repo.ci.ocaml.org
  • opam.ci.ocaml.org
  • ocaml.ci.dev
  • status.ocaml.ci.dev
  • ci.ocamllabs.io (legacy)
  • status.ci.ocamllabs.io (legacy)
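
One way to check which of the names above already point at the new machine is to compare resolved addresses. The following is a minimal sketch, assuming that every name should share at least one address with the internal name opam-repo-ci.sw.ocaml.org mentioned earlier in the thread; the function names are made up for this example, and cached DNS answers may lag behind a recent change.

```python
import socket
from typing import Callable, List, Set

# Names taken from the list above; TARGET is the internal name from this thread.
HOSTNAMES = [
    "opam-repo.ci.ocaml.org",
    "opam.ci.ocaml.org",
    "ocaml.ci.dev",
    "status.ocaml.ci.dev",
    "ci.ocamllabs.io",
    "status.ci.ocamllabs.io",
]
TARGET = "opam-repo-ci.sw.ocaml.org"


def resolve(name: str) -> Set[str]:
    """Return the addresses a name resolves to (empty set if resolution fails)."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def stale_names(names: List[str], target: str,
                resolver: Callable[[str], Set[str]] = resolve) -> List[str]:
    """Return the names that share no address with the target host."""
    target_addrs = resolver(target)
    return [n for n in names if not (resolver(n) & target_addrs)]


if __name__ == "__main__":
    for name in stale_names(HOSTNAMES, TARGET):
        print(f"{name} does not resolve to {TARGET}")
```

The resolver is passed in as a parameter so the comparison logic can be exercised without network access.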

@avsm
Member

avsm commented Jun 15, 2023

@mtelvers I've switched over opam-repo.ci.ocaml.org and opam.ci.ocaml.org to point to opam-repo-ci.sw.ocaml.org now.

@mtelvers
Collaborator Author

@avsm Thank you for your help. The switchover of these services is complete.

@avsm
Member

avsm commented Jun 16, 2023

Splendid! I'll keep an eye on the new ARM infrastructure VM. It seems like a good addition to their lineup.

@tmcgilchrist
Collaborator

@avsm thank you very much for your help with provisioning a new machine and getting this all switched over.
