Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance controller setup based on available CRDs #3472

Merged
merged 2 commits into from
May 4, 2024

Conversation

israel-hdez
Copy link
Contributor

What this PR does / why we need it:

This enhances the setup of the InferenceService controller and the InferenceGraph controller. Instead of relying on the defaultDeploymentMode configuration to determine what CRDs to watch, the setup now checks whether KNative Services and Istio VirtualServices are available in the cluster and setup the watches (invoke Owns) accordingly.

This enhancement has the following advantages:

  • A crashloop is prevented if the CRDs are missing in the cluster. The user would still be able to create InferenceServices by taking care of annotating the ISVC for RawDeployment mode.
  • If RawDeployment mode is configured as the default mode, the controllers would still watch for KNative and Istio resources if these components are available. This will let the controller watch for changes for the dependent resources if the user uses Serverless mode for some of the InferenceServices.
  • In the InferenceService controller, the watch for the VirtualServices is still conditioned to the value of the disableVirtualHost configuration.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3470

Type of changes
Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing:

  • Install KServe in a cluster where Istio and KNative are missing, and still configure Serverless as default deployment mode. See the controller outputting in the logs messages like The InferenceGraph controller won't watch serving.knative.dev/v1/Service resources because the CRD is not available. and verified there no errors are logged.
    • Verify that InferenceServices with the RawDeployment mode annotation deploy successfully.
  • Do a regular Serverless install but change the default deployment mode to Raw
    • Verify that an InferenceService can both be created in Raw mode and Serverless mode.
    • For the InferenceService deployed in Serverless mode, verify that modifying the KServe created VirtualService and/or the KNative Service would trigger a reconcile and KServe would fix the resources.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Copy link
Contributor

@Jooho Jooho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by the way, don't we need tests for this change? crashlooping is a big issue so adding tests would be a good idea.

For(&v1alpha1api.InferenceGraph{}).
Owns(&appsv1.Deployment{}).
Complete(r)
ksvcFound, err := utils.IsCrdAvailable(mgr.GetConfig(), "serving.knative.dev/v1", "Service")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If istio changes api version, it would behavior something wrong. Why don't we check all api versions that have the kind of CRD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To manipulate the KNative-related and Istio-related resources, their API libraries are imported and used in go.mod. Right now, Istio library version 1.19.4 is being used and KNative 0.39.3 is being used. Such libraries would not contain any future versions of the Istio or KNative APIs. Furthermore, we can watch (invoke Own) only on a specific API version of the CRDs.

So, if either Istio or KNative change their APIs, KServe would require more code changes than just changing this check (see an example in #3150). So, to me, it does not make sense to check for all versions, while the rest of the code would only work against a specific version.

BTW, there is a recommended version matrix in the website which seems to be outdated. I'd think it should work as a compatibility matrix, as I don't think it is feasible to support any combinations.

@danielezonca
Copy link

@israel-hdez
Nice job with the PR, I have a couple of questions

  • What about a scenario where Knative/Istio are not installed when KServe is installed but then they are? I think it is fine to mention this scenario is not supported for now if it makes too complex this PR, I just want to understand the overall behaviour.
  • Have you tested the error if a user tries to deploy a Serverless deployment InferenceService without having Knative after this PR? Before the PR the controller was failing checking the CRD but what about now? I would expect a proper error in the InferenceService status/event to mention that it is not possible to deploy but at least that the controller doesn't crash and the IS status reports the failed status

Comment on lines 299 to 304
ksvcFound, err := utils.IsCrdAvailable(mgr.GetConfig(), "serving.knative.dev/v1", "Service")
if err != nil {
return err
}

vsFound, err := utils.IsCrdAvailable(mgr.GetConfig(), "networking.istio.io/v1beta1", "VirtualService")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move these strings to constants and consider using a loop to check all of these CRDs since we may need to check them for different versions, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree on moving to constants, I'll do it.

About a loop for checking different versions, I'm not sure that's useful. Read more in my reply to Jooho.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess the best way is to have a const (array in case of more versions) and check those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified to using constants and 3rd-party library-provided variables

For(&v1alpha1api.InferenceGraph{}).
Owns(&knservingv1.Service{}).
Complete(r)
r.Log.Info("The InferenceGraph controller won't watch serving.knative.dev/v1/Service resources because the CRD is not available.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The r.Log does not have a warning function. Only Info and Error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess warnings would looks like:

V(0).Info(...

Copy link
Contributor Author

@israel-hdez israel-hdez Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would V(0) be meaningful?
I'm not sure it will have any outcome in the logs.

Copy link
Contributor

@spolti spolti Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear in the documentation, but digging a little bit about it, V(0) seems to be a way to add verbosity in the logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's the verbosity level.

Copy link
Member

@terrytangyuan terrytangyuan Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you run the controller you can do something like --gloglevel 5 to print out only logs that falls into that verbosity boundary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this and I think we should not raise the verbosity for this one as it helps to see why the controller would not watch the involved CRD -- and it is printed only once for the pod lifetime.

So, I think it is worth to have it as a regular Info, given r.Log does not have a warning function.

@israel-hdez
Copy link
Contributor Author

israel-hdez commented Feb 26, 2024

@danielezonca

@israel-hdez Nice job with the PR, I have a couple of questions

* What about a scenario where Knative/Istio are not installed when KServe is installed but then they are? I think it is fine to mention this scenario is not supported for now if it makes too complex this PR, I just want to understand the overall behaviour.

I think detecting the change in the setup would add some complexity. I'm not sure if it is worth to do that work, given changes to the stack may not be that common.

For this PR, if a Serverless setup is done after KServe is installed, the only additional action needed would be to restart the KServe controller pod and it should start using KNative as normally.

* Have you tested the error if a user tries to deploy a Serverless deployment `InferenceService` without having Knative after this PR? Before the PR the controller was failing checking the CRD but what about now? I would expect a proper error in the `InferenceService` status/event to mention that it is not possible to deploy but at least that the controller doesn't crash and the IS status reports the failed status

I just tried it. There is no crash, although the controller will try to reconcile continuously. I do see that it tries to write something in the status field, but it fails for some reason (looks like a bug) - so, looks like the error message is hidden by another one.

I agree that it should fail more gracefully. Given the detected CRDs it shouldn't reconcile continuously a Serverless InferenceService when the stack is absent. I'll do the needed code changes for this.

}

return ctrlBuilder.Complete(r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to add tests to check if the configuration is correctly applied after the reconcile runs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What configuration are you referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the isvc, e.g.
if you change the isvc and force knative, and if it is disabled, the isvc should not be reconciled, right? so the confs should be the same than before running the reconcile loop again.

does it make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point in time, I added to this PR some units in pkg/controller/v1beta1/inferenceservice/controller_test.go and pkg/controller/v1alpha1/inferencegraph/controller_test.go. Maybe that's enough?

@israel-hdez
Copy link
Contributor Author

  • I would expect a proper error in the InferenceService status/event to mention that it is not possible to deploy

I just added some code to abort reconciliation early and log an Event that should be visible with kubectl describẹ .

@israel-hdez israel-hdez force-pushed the k-3470-fix branch 5 times, most recently from a3446a2 to 9be59e7 Compare February 27, 2024 20:55
@israel-hdez israel-hdez marked this pull request as ready for review February 27, 2024 23:09
@oss-prow-bot oss-prow-bot bot requested a review from theofpa February 27, 2024 23:09
@israel-hdez
Copy link
Contributor Author

@Jooho @terrytangyuan @spolti The code is ready. Please, do another review round.

I added tests around the impacted Reconcile functions.

I failed to find a way to add tests for the SetupWithManager functions. If you know how to do it, I'll appreciate your advice.

return false, err
}

found := false
Copy link
Contributor

@spolti spolti Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be reduced to something like:

return gvResources != nil && any(crd.Kind == kind for crd := range gvResources.APIResources), nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied and pasted and it does not build....
I guess this is pseudocode. If it is, then yes, what you provided is the short form of all that code.

return nil, getGvResourcesErr
}

SetAvailableResourcesForApi(groupVersion, gvResources)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is SetAvailableResourcesForApi used outside this package?
If not you could, instead create a func to store it, just put them in the cache here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, it is. :)

if ksvcFound {
ctrlBuilder = ctrlBuilder.Owns(&knservingv1.Service{})
} else {
r.Log.Info("The InferenceService controller won't watch serving.knative.dev/v1/Service resources because the CRD is not available.")
Copy link
Contributor

@spolti spolti Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't be good to publish it to events as well?
Most of the time we don't usually see logs, having it in the events would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rarely deal with the Events API. So, I don't know.
Personally, looking at the logs would be the first thing I'd do to diagnose things. Then, I typically look at Events of some resource. If we want to record an event, I'm not sure which resource the event should belong to?

@spolti
Copy link
Contributor

spolti commented Mar 7, 2024

/approve


ksvcFound, ksvcCheckErr := utils.IsCrdAvailable(cfg, knservingv1.SchemeGroupVersion.String(), constants.KnativeServiceKind)
if ksvcCheckErr != nil {
setupLog.Error(ksvcCheckErr, "error when checking if Knative Services are available")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: seems like not all changes are synced from opendatahub-io#265?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@israel-hdez
Copy link
Contributor Author

@terrytangyuan @spolti I finally had time to finish this PR.
Please, take another look at this.

@spolti
Copy link
Contributor

spolti commented Apr 16, 2024

/lgtm

@israel-hdez
Copy link
Contributor Author

I pushed an empty commit to re-run CI, which is now passing.
@spolti @terrytangyuan Please, re-approve if you are still OK with this PR.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@israel-hdez
Copy link
Contributor Author

Per prow instructions:

/assign @yuzisun

This enhances the setup of the InferenceService controller and the InferenceGraph controller. Instead of relying on the `defaultDeploymentMode` configuration to determine what CRDs to watch, the setup now checks whether KNative Services and Istio VirtualServices are available in the cluster and setup the watches (invoke `Owns`) accordingly.

This enhancement has the following advantages:
* A crashloop is prevented if the CRDs are missing in the cluster. The user would still be able to create InferenceServices by taking care of annotating the ISVC for RawDeployment mode.
* If RawDeployment mode is configured as the default mode, the controllers would still watch for KNative and Istio resources if these components are available. This will let the controller watch for changes for the dependent resources if the user uses Serverless mode for some of the InferenceServices.
* In the InferenceService controller, the watch for the VirtualServices is still conditioned to the value of the `disableVirtualHost` configuration.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Since KServe controllers are modified to watch resources based on
available CRDs, a similar change in the setup of the manager is needed:
schemas need to be added to the manager based on available CRDs rather
than based only on the values in the inferenceservice-config ConfigMap.
This would keep both manager setup and controller setup in sync with
regards schemas and watches around the CRDs.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
@spolti
Copy link
Contributor

spolti commented Apr 25, 2024

/lgtm

@oss-prow-bot oss-prow-bot bot added the lgtm label Apr 25, 2024
@israel-hdez
Copy link
Contributor Author

israel-hdez commented Apr 25, 2024

@yuzisun This is rebased to resolve conflicts, and CI is passing.

@yuzisun
Copy link
Member

yuzisun commented May 4, 2024

Thanks @israel-hdez for the nice PR descriptions for easier review!

/approve

Copy link

oss-prow-bot bot commented May 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, spolti, terrytangyuan, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oss-prow-bot oss-prow-bot bot added the approved label May 4, 2024
@yuzisun yuzisun merged commit 59d705b into kserve:master May 4, 2024
56 of 57 checks passed
@israel-hdez israel-hdez deleted the k-3470-fix branch May 6, 2024 18:46
israel-hdez added a commit to israel-hdez/kserve that referenced this pull request May 6, 2024
* Enhance controller setup based on available CRDs

This enhances the setup of the InferenceService controller and the InferenceGraph controller. Instead of relying on the `defaultDeploymentMode` configuration to determine what CRDs to watch, the setup now checks whether KNative Services and Istio VirtualServices are available in the cluster and setup the watches (invoke `Owns`) accordingly.

This enhancement has the following advantages:
* A crashloop is prevented if the CRDs are missing in the cluster. The user would still be able to create InferenceServices by taking care of annotating the ISVC for RawDeployment mode.
* If RawDeployment mode is configured as the default mode, the controllers would still watch for KNative and Istio resources if these components are available. This will let the controller watch for changes for the dependent resources if the user uses Serverless mode for some of the InferenceServices.
* In the InferenceService controller, the watch for the VirtualServices is still conditioned to the value of the `disableVirtualHost` configuration.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Controller setup - add schemas based on CRDs available

Since KServe controllers are modified to watch resources based on
available CRDs, a similar change in the setup of the manager is needed:
schemas need to be added to the manager based on available CRDs rather
than based only on the values in the inferenceservice-config ConfigMap.
This would keep both manager setup and controller setup in sync with
regards schemas and watches around the CRDs.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
asd981256 pushed a commit to asd981256/kserve that referenced this pull request May 14, 2024
* Enhance controller setup based on available CRDs

This enhances the setup of the InferenceService controller and the InferenceGraph controller. Instead of relying on the `defaultDeploymentMode` configuration to determine what CRDs to watch, the setup now checks whether KNative Services and Istio VirtualServices are available in the cluster and setup the watches (invoke `Owns`) accordingly.

This enhancement has the following advantages:
* A crashloop is prevented if the CRDs are missing in the cluster. The user would still be able to create InferenceServices by taking care of annotating the ISVC for RawDeployment mode.
* If RawDeployment mode is configured as the default mode, the controllers would still watch for KNative and Istio resources if these components are available. This will let the controller watch for changes for the dependent resources if the user uses Serverless mode for some of the InferenceServices.
* In the InferenceService controller, the watch for the VirtualServices is still conditioned to the value of the `disableVirtualHost` configuration.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Controller setup - add schemas based on CRDs available

Since KServe controllers are modified to watch resources based on
available CRDs, a similar change in the setup of the manager is needed:
schemas need to be added to the manager based on available CRDs rather
than based only on the values in the inferenceservice-config ConfigMap.
This would keep both manager setup and controller setup in sync with
regards schemas and watches around the CRDs.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Signed-off-by: asd981256 <asd981256@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance manager setup based on available CRDs
6 participants