
Refactor kserve metrics solution to work with prometheus annotations #200

Merged: 4 commits merged into opendatahub-io:main from the RHOAIENG-6264 branch on May 10, 2024

Conversation

spolti
Member

@spolti spolti commented Apr 29, 2024

chore: Make sure that the vLLM metrics are getting correctly exposed when the
prometheus annotations are set.
Fixes [RHOAIENG-6264] - Ensure Metrics work in vLLM

Signed-off-by: Spolti fspolti@redhat.com

Description

This change removes the hardcoded monitoring Service, ServiceMonitor, and PeerAuthentication resources in favor of a centralized configuration that uses the PodMonitor and Istio's ServiceMonitor resources, making the metrics available on the aggregated metrics endpoint: http://:15020/stats/prometheus
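As an illustrative sketch (not taken from this PR), annotation-based Prometheus discovery relies on the conventional prometheus.io/* pod annotations; a runtime pod template might carry something like the following, where the port and path values are assumptions, not values confirmed by this change:

```yaml
# Hypothetical pod-template metadata. The annotation keys follow the
# common prometheus.io convention; the port/path values are assumed.
metadata:
  annotations:
    prometheus.io/scrape: "true"    # opt the pod in to annotation-based scraping
    prometheus.io/port: "8080"      # port where the runtime serves metrics (assumed)
    prometheus.io/path: "/metrics"  # metrics path (assumed)
```

With Istio sidecar metrics merging enabled, scrapes of the aggregated endpoint on port 15020 then include both the sidecar's and the application's metrics.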

How Has This Been Tested?

To test it, follow these steps (it can also be tested with the OVMS ServingRuntime):

  • Have a cluster with GPU nodes available (only if testing with vLLM)
  • Deploy the following ServingRuntime (If testing with vLLM): Available in the jira
  • Deploy the following ISVC (If testing with vLLM): Available in the jira
  • Use the Following DSC:
```yaml
spec:
  components:
    codeflare:
      managementState: Removed
    kserve:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: ''
            uri: 'https://github.com/spolti/odh-model-controller/tarball/RHOAIENG-6264-test'
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    trustyai:
      managementState: Removed
    ray:
      managementState: Removed
    kueue:
      managementState: Removed
    workbenches:
      managementState: Removed
    dashboard:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
```

When everything is deployed, check the metrics in the Prometheus console; it should show the vllm-prefixed metrics:
(Screenshot: Prometheus console showing the vllm-prefixed metrics, 2024-04-29)

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Contributor

@VedantMahabaleshwarkar VedantMahabaleshwarkar left a comment


I think we need to add logic such that existing resources related to the old solution are deleted.

@VedantMahabaleshwarkar
Contributor

I'd also recommend changing the commit/PR title to reflect the generic solution rather than being vLLM specific.
Maybe something like refactor kserve metrics solution to work with prometheus annotations

@spolti spolti force-pushed the RHOAIENG-6264 branch 3 times, most recently from 3b0bed7 to a79bd61 Compare April 30, 2024 18:49
@spolti spolti force-pushed the RHOAIENG-6264 branch 2 times, most recently from 4aa1bc4 to 5da6f8f Compare April 30, 2024 18:59
@spolti
Member Author

spolti commented Apr 30, 2024

> refactor kserve metrics solution to work with prometheus annotations

Sure, updated.

@spolti spolti changed the title [RHOAIENG-6264] - Ensure Metrics work in vLLM Refactor kserve metrics solution to work with prometheus annotations Apr 30, 2024
@VedantMahabaleshwarkar
Contributor

Code changes look OK to me (see minor comment about the function name GetDSCIName). Testing on a cluster now.

@spolti spolti force-pushed the RHOAIENG-6264 branch 3 times, most recently from 2124edf to e91aa6c Compare April 30, 2024 19:15
Review comments on controllers/utils/utils.go (outdated; resolved)
Contributor

@VedantMahabaleshwarkar VedantMahabaleshwarkar left a comment


lgtm

Contributor

@israel-hdez israel-hdez left a comment


The code looks good.
I haven't been able to try the changes. I'll use the upcoming ODH release to do any needed testing there. It is not ideal, but it is what I have time for.

/lgtm
Great work @spolti & @VedantMahabaleshwarkar .

Comment on lines +93 to +95

```go
Key:      "component",
Operator: metav1.LabelSelectorOpIn,
Values:   []string{"predictor", "explainer", "transformer"},
```
Contributor


For later improvement, I inspected the pod and it also gets the serving.kserve.io/inferenceservice label.
So, maybe we can change the selector similarly to what you did in the webhook fix.
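As a hedged sketch of that suggestion (not part of this PR), a PodMonitor selector could match on the mere presence of the serving.kserve.io/inferenceservice label instead of enumerating component values; the resource name and endpoint port below are assumptions for illustration:

```yaml
# Hypothetical PodMonitor; only the selector illustrates the suggestion.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kserve-inferenceservice-metrics  # assumed name
spec:
  selector:
    matchExpressions:
      - key: serving.kserve.io/inferenceservice
        operator: Exists                 # match any pod belonging to an InferenceService
  podMetricsEndpoints:
    - port: metrics                      # assumed port name
```

Selecting on label presence would pick up any InferenceService pod automatically, at the cost of losing the explicit component allowlist.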

Contributor

openshift-ci bot commented May 10, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, spolti, VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar,israel-hdez,spolti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit a3df072 into opendatahub-io:main May 10, 2024
5 checks passed
@spolti spolti deleted the RHOAIENG-6264 branch May 10, 2024 00:38