gpuCost computation error #2657

Open
ElieLiabeuf opened this issue Mar 22, 2024 · 1 comment
Labels
E2: Estimated level of Effort (1 is easiest, 4 is hardest)
kubecost: Relevant to Kubecost's downstream project
needs-follow-up
opencost: OpenCost issues vs. external/downstream
P1: Estimated Priority (P0 is highest, P4 is lowest)

Comments

ElieLiabeuf commented Mar 22, 2024

Describe the bug
gpuCost in /costDataModel is not computed correctly.

To Reproduce
I'm using a g5.2xlarge instance on AWS with the NVIDIA GPU Operator v23.9.1.
The hourly instance cost is 1.3529600000.

Specs:

{
   "memory":"32 GiB",
   "vcpu":"8",
   "gpu":"1"
}

The node has the following status.capacity:

apiVersion: v1
kind: Node
status:
  capacity:
    cpu: '8'
    memory: 32500272Ki
    nvidia.com/gpu: '1'

and the following labels:

apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g5.2xlarge
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-cpuid.ADX: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AESNI: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX: 'true'
    feature.node.kubernetes.io/cpu-cpuid.AVX2: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CLZERO: 'true'
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FMA3: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FP256: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBPB: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBRS: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED: 'true'
    feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.LAHF: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.MOVU: 'true'
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.RDPRU: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SHA: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SSE4A: 'true'
    feature.node.kubernetes.io/cpu-cpuid.STIBP: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SYSCALL: 'true'
    feature.node.kubernetes.io/cpu-cpuid.SYSEE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.TOPEXT: 'true'
    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD: 'true'
    feature.node.kubernetes.io/cpu-cpuid.X87: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: 'true'
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: 'true'
    feature.node.kubernetes.io/cpu-hardware_multithreading: 'true'
    feature.node.kubernetes.io/cpu-model.family: '23'
    feature.node.kubernetes.io/cpu-model.id: '49'
    feature.node.kubernetes.io/cpu-model.vendor_id: AMD
    feature.node.kubernetes.io/kernel-config.NO_HZ: 'true'
    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE: 'true'
    feature.node.kubernetes.io/kernel-version.full: 5.15.0-1045-aws
    feature.node.kubernetes.io/kernel-version.major: '5'
    feature.node.kubernetes.io/kernel-version.minor: '15'
    feature.node.kubernetes.io/kernel-version.revision: '0'
    feature.node.kubernetes.io/pci-10de.present: 'true'
    feature.node.kubernetes.io/pci-1d0f.present: 'true'
    feature.node.kubernetes.io/storage-nonrotationaldisk: 'true'
    feature.node.kubernetes.io/system-os_release.ID: ubuntu
    feature.node.kubernetes.io/system-os_release.VERSION_ID: '20.04'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: '20'
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: '04'
    gpu-g5-2xlarge: 'true'
    k8s.amazonaws.com/accelerator: nvidia-a10g
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: g5.2xlarge
    nvidia.com/cuda.driver.major: '535'
    nvidia.com/cuda.driver.minor: '129'
    nvidia.com/cuda.driver.rev: '03'
    nvidia.com/cuda.runtime.major: '12'
    nvidia.com/cuda.runtime.minor: '2'
    nvidia.com/gfd.timestamp: '1709634570'
    nvidia.com/gpu.compute.major: '8'
    nvidia.com/gpu.compute.minor: '6'
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    nvidia.com/gpu.deploy.nvsm: ''
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.family: ampere
    nvidia.com/gpu.machine: g5.2xlarge
    nvidia.com/gpu.memory: '23028'
    nvidia.com/gpu.present: 'true'
    nvidia.com/gpu.product: NVIDIA-A10G
    nvidia.com/gpu.replicas: '1'
    nvidia.com/mig.capable: 'false'
    nvidia.com/mig.strategy: single

The default.json values (cost model ratios) are the following:

{
   "CPU":"0.031611",
   "RAM":"0.004237",
   "GPU":"0.95"
}

/costDataModel returns:

{
   "hourlyCost":"1.3529600000",
   "CPU":"8",
   "CPUHourlyCost":"0.111315",
   "RAM":"32 GiB",
   "RAMBytes":"33280282624.000000",
   "RAMGBHourlyCost":"0.014920",
   "storage":"1 x 450 GB NVMe SSD",
   "storageHourlyCost":"",
   "usesDefaultPrice":false,
   "baseCPUPrice":"0.031611",
   "baseRAMPrice":"0.004237",
   "baseGPUPrice":"0.95",
   "usageType":"ondemand",
   "gpu":"1",
   "gpuName":"",
   "gpuCost":"3.345316",
   "vgpu":"1",
   "instanceType":"g5.2xlarge",
   "region":"eu-west-1",
   "providerID":"aws:///eu-west-1a/i-xxxxxx",
   "archType":"amd64"
}
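
A quick consistency check on this response, as a minimal sketch: it assumes that the per-unit costs multiplied by the resource counts should add up to roughly hourlyCost. The names sample, nodeCost and componentSum are hypothetical and are not part of OpenCost.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// sample is a trimmed copy of the /costDataModel response quoted above.
const sample = `{
  "hourlyCost": "1.3529600000",
  "CPU": "8",
  "CPUHourlyCost": "0.111315",
  "RAMBytes": "33280282624.000000",
  "RAMGBHourlyCost": "0.014920",
  "gpu": "1",
  "gpuCost": "3.345316"
}`

// nodeCost holds only the fields needed for the check.
// Every value is a JSON string in the response, so they are parsed below.
type nodeCost struct {
	HourlyCost      string `json:"hourlyCost"`
	CPU             string `json:"CPU"`
	CPUHourlyCost   string `json:"CPUHourlyCost"`
	RAMBytes        string `json:"RAMBytes"`
	RAMGBHourlyCost string `json:"RAMGBHourlyCost"`
	GPU             string `json:"gpu"`
	GPUCost         string `json:"gpuCost"`
}

// componentSum returns the summed CPU, RAM and GPU hourly cost
// together with the reported hourly price of the node.
func componentSum(raw []byte) (float64, float64, error) {
	var n nodeCost
	if err := json.Unmarshal(raw, &n); err != nil {
		return 0, 0, err
	}
	fields := []string{n.HourlyCost, n.CPU, n.CPUHourlyCost,
		n.RAMBytes, n.RAMGBHourlyCost, n.GPU, n.GPUCost}
	v := make([]float64, len(fields))
	for i, s := range fields {
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return 0, 0, err
		}
		v[i] = f
	}
	const bytesPerGiB = 1024.0 * 1024.0 * 1024.0
	ramGiB := v[3] / bytesPerGiB
	sum := v[1]*v[2] + ramGiB*v[4] + v[5]*v[6]
	return sum, v[0], nil
}

func main() {
	sum, hourly, err := componentSum([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("components: %.6f, node price: %.6f, exceeds node price: %t\n",
		sum, hourly, sum > hourly)
}
```

With the values above, the CPU and RAM terms alone already come to about 1.353, the full instance price, so the GPU cost is added on top instead of being carved out of it.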

Expected behavior
gpuCost should be 0.96028296. Instead it is reported as 3.345316, which is greater than the hourly cost of the whole instance.

Potential solution
GPU cost is computed in this part: https://github.com/opencost/opencost/blob/v1.109.0/pkg/costmodel/costmodel.go#L1172
Judging by the cost I get, the computation behaves as if gpuc were 0.
It seems that, given the node labels, we enter this case: https://github.com/opencost/opencost/blob/v1.109.0/pkg/costmodel/costmodel.go#L1087, where gpuc is never set.
Adding gpuc = float64(q.Value()) at https://github.com/opencost/opencost/blob/v1.109.0/pkg/costmodel/costmodel.go#L1098 solves the problem.
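
Both numbers can be reproduced with a short sketch. It assumes the node price is split across CPU, RAM and GPU in proportion to the default.json prices, which is a reading of the linked code rather than a copy of it. The names splitNodePrice and approxEqual are hypothetical.

```go
package main

import "fmt"

// splitNodePrice distributes nodePrice across CPU, RAM and GPU in proportion
// to the default per-unit prices and returns the per-unit hourly prices.
// This is an assumed model of the calculation, not OpenCost's actual code.
func splitNodePrice(nodePrice, cpus, ramGiB, gpus, defCPU, defRAM, defGPU float64) (cpuPrice, ramPrice, gpuPrice float64) {
	cpuToRAM := defCPU / defRAM
	gpuToRAM := defGPU / defRAM
	// Express the whole node in "GiB of RAM" equivalents.
	ramMultiple := gpus*gpuToRAM + cpus*cpuToRAM + ramGiB
	ramPrice = nodePrice / ramMultiple
	return ramPrice * cpuToRAM, ramPrice, ramPrice * gpuToRAM
}

// approxEqual reports whether a and b differ by at most tol.
func approxEqual(a, b, tol float64) bool {
	d := a - b
	if d < 0 {
		d = -d
	}
	return d <= tol
}

func main() {
	const (
		nodePrice = 1.35296
		defCPU    = 0.031611
		defRAM    = 0.004237
		defGPU    = 0.95
		// RAMBytes from the /costDataModel response, converted to GiB.
		ramGiB = 33280282624.0 / (1024.0 * 1024.0 * 1024.0)
	)

	// GPU count left at 0: the GPU is ignored when the price is split.
	_, _, buggy := splitNodePrice(nodePrice, 8, ramGiB, 0, defCPU, defRAM, defGPU)
	// GPU count set to 1, using the nominal 32 GiB from the instance specs.
	_, _, fixed := splitNodePrice(nodePrice, 8, 32, 1, defCPU, defRAM, defGPU)

	fmt.Printf("gpuCost with gpuc=0: %.6f (matches reported value: %t)\n",
		buggy, approxEqual(buggy, 3.345316, 1e-4))
	fmt.Printf("gpuCost with gpuc=1: %.6f (matches expected value: %t)\n",
		fixed, approxEqual(fixed, 0.96028296, 1e-4))
}
```

Under this model the expected value 0.96028296 corresponds to the nominal 32 GiB of RAM. With the RAMBytes value actually reported for the node, the same formula gives approximately 0.963.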

Which version of OpenCost are you using?
v1.109.0


@AjayTripathy
Contributor

Thanks for the find! If you could send a PR with that suggested change I'm happy to get it reviewed ASAP.

@mattray added the opencost, P1, kubecost and E2 labels on Mar 22, 2024