Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot deploy Kserve and Modelmesh together #573

Closed
VaishnaviHire opened this issue Jul 17, 2023 · 12 comments · Fixed by opendatahub-io/kserve#120
Closed

Cannot deploy Kserve and Modelmesh together #573

VaishnaviHire opened this issue Jul 17, 2023 · 12 comments · Fixed by opendatahub-io/kserve#120

Comments

@VaishnaviHire
Copy link
Member

Describe the bug
Kserve manifests and Modelmesh manifests cannot be deployed together. There is a conflict when deploying Inferenceservice CRD

Steps To Reproduce
Steps to reproduce the behavior:

  1. Deploy ODH operator
  2. Deploy Kfdef with modelmesh and kserve

Expected behavior
Kserve and modelmesh should be deployed together

Workaround (if any)
Update kserve manifest to match the version of modelmesh manifests.

Open Data Hub Version
Please attach relevant kfdef manifest if applicable

OpenShift Version
Version #:
Provider (Baremetal, OpenStack, RHV, AWS, OKD, CodeReady Containers, ...):

Additional context
Add any other context about the problem here.

@VaishnaviHire
Copy link
Member Author

/cc @Jooho

Created this bug for reference.

@VaishnaviHire VaishnaviHire changed the title Cannot deploy Kserve and Modelmesh at a time Cannot deploy Kserve and Modelmesh together Jul 17, 2023
@israel-hdez israel-hdez assigned heyselbi and unassigned israel-hdez Sep 26, 2023
@LaVLaS
Copy link
Member

LaVLaS commented Sep 27, 2023

Transferring this to odh-operator issue board for tracking before opendatahub-io/opendatahub-community/issues/77 is complete

@LaVLaS LaVLaS transferred this issue from opendatahub-io/odh-manifests Sep 27, 2023
@heyselbi
Copy link

heyselbi commented Oct 31, 2023

This issue has been addressed in ODH 1.11 and RHODS 1.34. It can be closed (I don't have permissions to close the issue).

@zdtsw
Copy link
Member

zdtsw commented Oct 31, 2023

This issue has been addressed in ODH 1.11 and RHODS 1.34. It can be closed (I don't have permissions to close the issue).

Does it mean, in RHODS 2.4, we can have both kserve and modelmesh components enabled at the same time?

@bdattoma
Copy link

bdattoma commented Nov 6, 2023

@zdtsw according to opendatahub-io/odh-model-controller#87, this should be solved.

I checked on a cluster with 2.4, and in fact I was able to deploy a kserve model and a modelmesh one in the same cluster at the same moment.
However, when I disabled KServe, MM wasn't able to deploy models anymore bcos of kserve webhook

Internal error occurred: failed calling webhook "inferenceservice.kserve-webhook-server.defaulter": failed to call webhook: Post "https://kserve-webhook-server-service.redhat-ods-applications.svc:443/mutate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s": service "kserve-webhook-server-service" not found

(same error which was reported by Wen in #508)
@heyselbi @VedantMahabaleshwarkar

@VedantMahabaleshwarkar
Copy link
Contributor

@bdattoma opendatahub-io/odh-model-controller#87 this particular issue is separate from the webhook/CRD mismatch error.
We will take a look at the webhook/CRD error

@VedantMahabaleshwarkar
Copy link
Contributor

@bdattoma @zdtsw

The above error is because when Kserve is removed, the MutatingWebhookConfiguration "inferenceservice.serving.kserve.io" is not removed by the operator.
I was able to successfully replicate the issue. I can also see that it has the correct labels

  labels:
    app.kubernetes.io/part-of: kserve
    app.opendatahub.io/kserve: 'true'

AFAIK, these labels are what the operator uses to clean up resources when a component is removed. Not sure why this isn't getting cleaned up.

@zdtsw
Copy link
Member

zdtsw commented Nov 7, 2023

@bdattoma @zdtsw

The above error is because when Kserve is removed, the MutatingWebhookConfiguration "inferenceservice.serving.kserve.io" is not removed by the operator. I was able to successfully replicate the issue. I can also see that it has the correct labels

  labels:
    app.kubernetes.io/part-of: kserve
    app.opendatahub.io/kserve: 'true'

AFAIK, these labels are what the operator uses to clean up resources when a component is removed. Not sure why this isn't getting cleaned up.

I think it is related to the fix by this PR #669
we had a bug which did not clean up all the left-over resources by DSC

@bdattoma
Copy link

bdattoma commented Nov 7, 2023

understood @VedantMahabaleshwarkar @zdtsw . Could we target the resolution of this issue for the very next release? 2.5 then.

Anyways, @VedantMahabaleshwarkar I wonder why the webhook doesn't bother MM when kserve is enabled but it bothers when it is disabled. Does it mean that the KServe webhook is always called, even when we deploy a MM model?

@zdtsw
Copy link
Member

zdtsw commented Nov 7, 2023

understood @VedantMahabaleshwarkar @zdtsw . Could we target the resolution of this issue for the very next release? 2.5 then.

Anyways, @VedantMahabaleshwarkar I wonder why the webhook doesn't bother MM when kserve is enabled but it bothers when it is disabled. Does it mean that the KServe webhook is always called, even when we deploy a MM model?

sure, we will have the fix from our side in asap and have some local test first.

@bdattoma
Copy link

bdattoma commented Nov 8, 2023

should this ticket marked as completed?

@VedantMahabaleshwarkar
Copy link
Contributor

If this issue is going to track the bug related to the ODH operator not deleting the MutatingWebhookConfiguration , then I'd say it can be closed.
The remaining issue with installing Kserve and ModelMesh together related to the CRD conflicts is being tracked here : opendatahub-io/kserve#121

@zdtsw zdtsw closed this as completed Nov 9, 2023
@heyselbi heyselbi removed their assignment Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

7 participants