Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ServiceMeshMember resources to enroll namespaces to the service mesh #219

Merged

Conversation

israel-hdez
Copy link
Contributor

@israel-hdez israel-hdez commented Jun 6, 2024

Currently, odh-model-controller is editing the ServiceMeshMemberRoll resource to enroll and remove namespaces from the service mesh. Unfortunately, such resource is typically managed by a mesh administrator (it is unique per control plane), and it is not expected to be edited by 3rd parties.

With these changes odh-model-controller will stop doing changes to the ServiceMeshMemberRoll resource. Instead, namespaces are enrolled to the mesh by using the ServiceMeshMember resource (no Roll suffix). This resource is created inside the namespace that is wanted to be enrolled to the mesh.

Unfortunately, it is not possible to migrate previous setups to the ServiceMeshMember resource. Since the ServiceMeshMemberRoll resource is unique per mesh control plane, it is not possible to determine if a namespace was enrolled by odh-model-controller or if it was enrolled by the user. Thus, a migration is more risky than leaving the SMCP as is (i.e. there is the risk to remove namespaces where there are no InferenceServices, but the user enrolled and mesh features are needed there).

Fixes: https://issues.redhat.com/browse/RHOAIENG-1043
Fixes: https://issues.redhat.com/browse/RHOAIENG-6711

How Has This Been Tested?

Manifests for testing can be taken from here: https://github.com/israel-hdez/odh-model-controller/blob/j1043-smmr-to-smm-manifests/config/base/params.env

The DSC of the operator can be as follows:

spec:
  components:
    dashboard:
      managementState: Removed
    workbenches:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      serving:
        managementState: Managed
      devFlags:
        manifests:
        - uri: https://github.com/israel-hdez/odh-model-controller/tarball/j1043-smmr-to-smm-manifests
          contextDir: config
          sourcePath: ''
    kueue:
      managementState: Removed
    codeflare:
      managementState: Removed
    ray:
      managementState: Removed
    trustyai:
      managementState: Removed
    modelregistry:
      managementState: Removed
    trainingoperator:
      managementState: Removed

Test cases:

  1. Test a namespace is correctly enrolled into the mesh
    • Create a new namespace
    • Deploy a sample model into the namespace
    • Verify the pod of the model has an istio-proxy container.
    • Verify a ServiceMeshMember resource has been created in the namespace.
  2. Test a namespace is correctly removed from the mesh
    • Building (continuing) on the previous test...
    • Remove all inference services from the namespace.
    • Verify the ServiceMeshMember resource is deleted.
    • Check against the ServiceMeshMemberRoll resource that the namespace has been removed from the mesh (look at the status.members field).
  3. Test models can be deployed on a namespace that the user has enrolled to the mesh
    • Create a new namespace
    • Enroll the namespace in the service mesh. You can enroll either by adding it to the members of the ServiceMeshMemberRoll (docs) or by creating a ServiceMeshMember resource in the newly created namespace (docs)
    • Deploy a sample model into the namespace
    • Verify the pod of the model has an istio-proxy container.
  4. Test a namespaced enrolled to the mesh by the user won't be removed from the mesh
    • Building (continuing) on the previous test...
    • Remove all inference services from the namespace.
    • If you opted to enroll the namespace using the ServiceMeshMember resource, check that it won't be deleted.
    • Check in the ServiceMeshMemberRoll resource that the namespace is still in the mesh (look at the status.members field).

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

openshift-ci bot commented Jun 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label Jun 6, 2024
@israel-hdez israel-hdez force-pushed the j1043-smmr-to-smm branch 2 times, most recently from d3e5110 to 313ec20 Compare June 7, 2024 20:29
@israel-hdez israel-hdez marked this pull request as ready for review June 7, 2024 20:30
@israel-hdez israel-hdez requested a review from lugi0 June 7, 2024 20:47
Currently, odh-model-controller is editing the `ServiceMeshMemberRoll` resource to enroll and remove namespaces from the service mesh. Unfortunately, such resource is typically managed by a mesh administrator (it is unique per control plane), and it is not expected to be edited by 3rd parties.

With these changes odh-model-controller will stop doing changes to the `ServiceMeshMemberRoll` resource. Instead, namespaces are enrolled to the mesh by using the `ServiceMeshMember` resource (no `Roll` suffix). This resource is created inside the namespace that is wanted to be enrolled to the mesh.

Unfortunately, it is not possible to migrate previous setups to the `ServiceMeshMember` resource. Since the `ServiceMeshMemberRoll` resource is unique per mesh control plane, it is not possible to determine if a namespace was enrolled by odh-model-controller or if it was enrolled by the user. Thus, a migration is more risky than leaving the SMCP as is (i.e. there is the risk to remove namespaces where there are no InferenceServices, but the user enrolled and mesh features are needed there).

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
"smm.desired.smcp_name", desiredSMM.Spec.ControlPlaneRef.Name,
"smm.current.smcp_namespace", existingSMM.Spec.ControlPlaneRef.Namespace,
"smm.current.smcp_name", existingSMM.Spec.ControlPlaneRef.Name)
// Don't return error because it is not recoverable. It does not make sense
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we, we shouldn't touch resources owned by another component, instead a manual change needs to be done.

@mwaykole
Copy link

hey @ederign is it possible to get manifest for early testing ?

@spolti
Copy link
Member

spolti commented Jun 10, 2024

@mwaykole did you mean @israel-hdez instead?

@israel-hdez
Copy link
Contributor Author

hey @ederign is it possible to get manifest for early testing ?

Which kind of manifests would you need?

@israel-hdez
Copy link
Contributor Author

@lugi0 @mwaykole I provided the needed manifests. Instructions are in the main comment.

@lugi0
Copy link

lugi0 commented Jun 13, 2024

@israel-hdez Here's my testing results:

Testing on 2.9.1:

  • First scenario works as expected: can create a NS, deploy a model which has an istio-proxy container attached, SMM resource is created in the NS, can query the model, NS is added to SMMR resource in istio-system

  • Second scenario

    • Cannot add NS to SMMR directly. I can edit the member roll resource to add my NS, but it seems to be immediately reconciled away.

    • Can create SMM resource in NS, NS gets correctly added to the SMMR resource.

      • However, I can create a SMM resource which references a non-existing control plane and even though there's a reconciliation error in the SMM and SMMR itself, the NS gets added to the SMMR regardless. (potential bug?)
    • Can deploy a model, can query it - I believe odh-model-controller finds the SMM resource and is happy with it

2024-06-13T10:58:02Z DEBUG controllers.InferenceService Verifying that the namespace is enrolled to the mesh {"InferenceService": "flanenroll", "namespace": "luca-enroll"}
2024-06-13T10:58:02Z DEBUG Istio Control Plane name and namespace read from environment variables
2024-06-13T10:58:02Z DEBUG controllers.InferenceService Successfully fetch deployed ServiceMeshMember {"InferenceService": "flanenroll", "namespace": "luca-enroll", "smm.name": "default"}
  • Can remove the IS from the NS and the SMM is still present, NS is still referenced in SMMR

Testing on 2.10 (latest build available at time of testing):

  • First scenario does not work as expected - The IS pods go in ready state, the service mesh member resource is also created (and the NS is shown in the member roll resource) - however the IS never moves to Loaded state and the inference URL never becomes responsive / shown on the dashboard. (retested, see below)

  • Second scenario

    • Cannot enroll NS by modifying SMMR

    • Can create SMM resource in NS, NS gets added to SMMR resource

      • However, I can create a SMM resource which references a non-existing control plane and even though there's a reconciliation error in the SMM and SMMR itself, the NS gets added to the SMMR regardless. (potential bug?)
    • Can deploy a model, can query it (IS gets Loaded / the URL almost immediately - why did it fail on scenario 1??)

odh-model-controller appears to find the SMM resource and is happy with it
2024-06-13T12:20:07Z DEBUG controllers.InferenceService Verifying that the namespace is enrolled to the mesh {"InferenceService": "flan-210", "namespace": "luca-enroll-210"}
2024-06-13T12:20:07Z DEBUG Istio Control Plane name and namespace read from environment variables
2024-06-13T12:20:07Z INFO admission.KsvcValidatingWebhook Validating Knative Service {"webhookGroup": "serving.knative.dev", "webhookKind": "Service", "Service": {"name":"flan-210-predictor","namespace":"luca-enroll-210"}, "namespace": "luca-enroll-210", "name": "flan-210-predictor", "resource": {"group":"serving.knative.dev","version":"v1","resource":"services"}, "user": "system:serviceaccount:redhat-ods-applications:kserve-controller-manager", "requestID": "4155b2b7-3dbf-4681-aaaa-747572c37660", "namespace": "luca-enroll-210", "ksvc": "flan-210-predictor"}
2024-06-13T12:20:07Z DEBUG controllers.InferenceService Successfully fetch deployed ServiceMeshMember {"InferenceService": "flan-210", "namespace": "luca-enroll-210", "smm.name": "default"}
  • Can remove IS from NS, SMM resource remains and NS is still referenced in SMMR

After retesting first scenario on 2.10, it is working as expected (same as what is reported above, but now the IS goes into Loaded state and I can query the model) - might've been a fluke on the first go, I'll keep an eye on it for 2.10

Due to all of the above, this PR LGTM with the suggestion to keep an eye on test results for 2.10

Comment on lines +74 to +77
"app.kubernetes.io/name": "odh-model-controller",
"app.kubernetes.io/component": "kserve",
"app.kubernetes.io/part-of": "odh-model-serving",
"app.kubernetes.io/managed-by": "odh-model-controller",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there reusable constants for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

openshift-ci bot commented Jun 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, lugi0, spolti, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [israel-hdez,spolti,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 4ffbb33 into opendatahub-io:main Jun 13, 2024
5 checks passed
@israel-hdez israel-hdez deleted the j1043-smmr-to-smm branch June 14, 2024 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants