Get hardware with avx2 or avx512 capabilities in smaug instance #419
It was an internal OCP4 cluster.
rick
Hey @rbo, WDYT, can we request additional machines at Hetzner and plug them into Rick for this? Right now rick can do:
Let me check next week; we have a limit on the number of nodes because of operate-first/hetzner-baremetal-openshift/issues/8. If we cannot add more nodes, we can replace one. But let me check next week.
Still "next week", but unfortunately Friday :-( We cannot add more nodes to the Rick cluster because of limitations at Hetzner and/or OpenShift.
The only option I can imagine is to replace all worker nodes, step by step, with new ones that have the feature. The only risky part is the OCS/ODF storage.
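The step-by-step replacement could be sketched roughly like this. Node names are placeholders, and `DRY_RUN=1` only echoes the `oc` commands instead of running them against a real cluster:

```shell
#!/bin/sh
# Hypothetical rolling worker replacement; node names are placeholders.
# DRY_RUN=1 prints each oc command instead of executing it.
DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

for node in worker-0 worker-1 worker-2; do
  run oc adm cordon "$node"
  run oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
  # ...provision the replacement machine, let it join the cluster,
  # wait for OCS/ODF to rebalance, then remove the old node:
  run oc delete node "$node"
done
```

The OCS/ODF risk mentioned above is why each drain should wait for the storage cluster to report healthy before the next node is touched.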
What about going ahead and replacing the workload cluster's nodes with beefier machines? @durandom wdyt?
Current node overview:
Current usage:
Potential new node options: At Serverbörse, there are no machines available with the CPU feature. The only option we have is to choose the PX93 with an Intel® Xeon® W-2295 18-Core CPU - I guess avx512 is available, can anyone confirm? @pacospace ? Pricing, incl. VAT: CPU: Intel® Xeon® W-2295 18-Core
RAM is expensive, not the disks. Based on the RAM consumption above, I suggest 256 GB, which means ~14 GB RAM per core (256/18) and should be enough, plus the 1x 1.92 TB NVMe SSD version. (233.24 / 233.24 per month) ToDos:
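The RAM-per-core figure quoted above can be double-checked with a one-liner:

```shell
# Sanity check of the RAM-per-core estimate from the comment above:
# 256 GB of RAM spread over the W-2295's 18 cores.
awk 'BEGIN { printf "%.1f GB/core\n", 256 / 18 }'
# -> 14.2 GB/core
```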
I'm OK with spending more. But before we do this, I'd like to understand the overall utilization of the clusters better, which is somewhat blocked by the diagrams that @HumairAK is working on :)
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Our new Morty cluster has CPUs with the AVX2 feature:
$ oc get nodes -L cpu-feature.node.kubevirt.io/avx2 -L cpu-feature.node.kubevirt.io/avx -L cpu-feature.node.kubevirt.io/avx512
NAME STATUS ROLES AGE VERSION AVX2 AVX AVX512
morty-compute-0-private.emea.operate-first.cloud Ready worker 14d v1.22.3+fdba464 true true
morty-compute-1-private.emea.operate-first.cloud Ready worker 14d v1.22.3+fdba464 true true
morty-compute-2-private.emea.operate-first.cloud Ready worker 14d v1.22.3+fdba464 true true
morty-master-0-private.emea.operate-first.cloud Ready master 14d v1.22.3+fdba464
morty-master-1-private.emea.operate-first.cloud Ready master 14d v1.22.3+fdba464
morty-master-2-private.emea.operate-first.cloud Ready master 14d v1.22.3+fdba464
morty-storage-0-private.emea.operate-first.cloud Ready infra,worker 14d v1.22.3+fdba464
morty-storage-1-private.emea.operate-first.cloud Ready infra,worker 14d v1.22.3+fdba464
morty-storage-2-private.emea.operate-first.cloud Ready infra,worker 14d v1.22.3+fdba464
$
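Since the AVX features show up as plain node labels in the output above, a workload can be pinned to the AVX2-capable compute nodes with a `nodeSelector`. A minimal sketch, with a hypothetical pod name and placeholder image:

```shell
# Sketch: generate a pod manifest pinned to AVX2 nodes via the
# cpu-feature label visible in the `oc get nodes -L` output above.
# Pod name and image are placeholders.
cat <<'EOF' > avx2-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: avx2-workload
spec:
  nodeSelector:
    cpu-feature.node.kubevirt.io/avx2: "true"
  containers:
  - name: main
    image: quay.io/example/ml-model:latest
EOF
# oc apply -f avx2-pod.yaml
```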
@pacospace feel free to create a ticket or PR to onboard onto the Morty cluster. I will close this ticket; feel free to reopen it if needed.
Great timing: with version 0.11.0, Neural Magic introduced new speedups also on older CPUs with avx2 only: https://github.com/neuralmagic/deepsparse/releases/ Thanks a lot @rbo!
Is your feature request related to a problem? Please describe.
Some ML models are optimized for certain architectures. It would be nice to get hardware with avx2 or avx512 capabilities in the smaug instance.
Describe the solution you'd like
Describe alternatives you've considered
Deploy on rick cluster which has hardware with avx512.
Additional context
Related-To: #409
Related-To: #408
Related-To: AICoE/elyra-aidevsecops-tutorial#297 (comment)
From cat /proc/cpuinfo in a pod on the smaug instance, a source for this chip shows that the Intel(R) Xeon(R) CPU E5-2667 v2 does not support avx2, only avx. avx2 is available in the Intel(R) Xeon(R) CPU E5-2667 >=v3 (https://www.cpu-world.com/Compare/422/Intel_Xeon_E5-2667_v2_vs_Intel_Xeon_E5-2667_v3.html), but avx512 is not.
cc @riekrh @durandom @goern
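The AVX variants a CPU advertises can be filtered out of the /proc/cpuinfo flags line like this. The sample line below is an abbreviated, hypothetical flags line resembling an E5-2667 v2 (avx but no avx2/avx512); on a real node, use the commented `grep` against /proc/cpuinfo instead:

```shell
# On a real node/pod: grep -m1 '^flags' /proc/cpuinfo
# Abbreviated sample flags line (hypothetical E5-2667 v2-like CPU):
flags='flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx f16c'
echo "$flags" | tr ' ' '\n' | grep '^avx' | sort -u
# -> avx
```

On an AVX2- or AVX-512-capable chip the same pipeline would also print `avx2` and the various `avx512*` sub-features.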