Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert edit permissions on appropriate mutations in namespaceUtils, fix promise logic when deploying a model or adding a model server, show inner error messages from the backend when present #2319

Merged

Conversation

mturley
Copy link
Contributor

@mturley mturley commented Jan 3, 2024

Resolves RHOAIENG-548 and RHOAIENG-556.

Description

When a user has "edit" permission on a data science project they don't own, they don't have permission to update that namespace. When adding the first model or model server to a project, the namespace needs to be updated by addSupportServingPlatformProject to add labels that describe which serving platform the project is using, which fails with a 403 error for users in this position (as expected). Currently, even though this namespace update fails, the ServingRuntime gets created anyway. It does not become visible in the UI because it still shows the platform-selection view (choosing single- or multi-model serving), but it is visible in the cluster.

The submitServingRuntimeResources function that is called to make these updates and create the ServingRuntime currently operates like this:

  • Execute a dry-run of the operations concurrently with a Promise.all.
  • If no error in the dry-run, execute the same operations with another Promise.all, but this time include the namespace update (because the namespace update performed by addSupportServingPlatformProject has no dry-run option).

This means that if addSupportServingPlatformProject fails, the other requests will run anyway. The fix is to add a step in between the dry-run and the actual run of these requests where we perform the namespace update before proceeding, which means if it fails the ServingRuntime will not be created. As part of the fix, we convert the submitServingRuntimeResources to an async function so we can use await for readability.

There are also a few other notable changes:

  • Removes the allowCreate condition on updating the namespace, which was a relic of an earlier workaround that is no longer needed (@lucferbux can provide more context on this -- thanks for your help Lucas!).
  • Changes the applyNamespaceChange function on the backend to perform a more accurate selfSubjectAccessReview: users with "edit" permission should still be able to create ServingRuntimes when the namespace update is not needed (a serving platform has already been chosen for the project), so we check for that permission only for the MODEL_MESH_PROMOTION and KSERVE_PROMOTION cases.
  • When deploying a Kserve model, there was another Promise.all for concurrently calling submitServingRuntimeResources and submitInferenceServiceResource, which means if the error described above happens here the InferenceService will also still be created. Changing this to be a sequential .then also fixes that problem. (Edit: I hadn't realized this was the separate issue RHOAIENG-556, so this PR fixes that now too :))
  • Adds a new throwErrorFromAxios utility and uses it in addSupportServingPlatformProject (and in createProject, the only other place we're directly calling axios()). This will ensure that if an error thrown by the axios() call has a response object containing a message from the backend, that message will be used instead of the top-level message property from Axios ("Request failed with status code X"). If the error being displayed does not contain an inner error message in this expected structure, the current behavior is retained and the top-level error.message is used.

Considering that last point above, the error message that used to look like this:

Screenshot 2024-01-15 at 5 44 47 PM

now looks like this:

Screenshot 2024-01-16 at 3 07 39 PM

@vconzola, does that error text look good to you?

How Has This Been Tested?

With @lucferbux's help, I reproduced the error in a cluster and verified the fix by using the PR image in that cluster. Testing locally gave us trouble because apparently an impersonated regular user cannot list templates in a namespace they don't own (which does not seem to match the behavior in production), so we can't get the "Models and model servers" section of the data science project page to render when trying to reproduce the bug this way.

Test Impact

New Cypress tests have been added to submit the "deploy model" and "add model server" modals and intercept their requests, ensuring that the request to create ServingRuntimes and InferenceServices are not made when there is an error updating the namespace.

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Commits have been squashed into descriptive, self-contained units of work (e.g. 'WIP' and 'Implements feedback' style messages have been removed)
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit tests & storybook for related changes)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change (find relevant UX in the SMEs section).

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Jan 3, 2024
Copy link
Contributor

openshift-ci bot commented Jan 3, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

@lucferbux lucferbux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me, just take a look at the package-lock.json situation (I mean, it could be something expected but it's kinda weird).
You can undraft it if you want!

backend/src/routes/api/namespaces/namespaceUtils.ts Outdated Show resolved Hide resolved
frontend/package-lock.json Outdated Show resolved Hide resolved
@mturley mturley force-pushed the 548-deploy-model-403-error branch 4 times, most recently from 3ceb1e7 to ba6a311 Compare January 9, 2024 16:12
@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

Sorry for the delay here, I'm back from PTO. @lucferbux I only still have this in draft because I figured it should have updated unit tests before it's ready to merge. Getting back to that now.

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

/test all

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

Hey @andrewballantyne @lucferbux, quick question here: I've been poking around and reading code trying to figure out how to mock things correctly to test the submitServingRuntimeResources function in frontend/src/pages/modelServing/screens/projects/__tests__/utils.spec.ts by somehow mocking a user with and without the appropriate permissions and then calling the function and inspecting the returned promise(s). Do you know if I'm on the right track there? I'm a bit stuck on that, and it may be telling that there aren't any tests for that function yet. Maybe instead I should add an integration test for ManageServingRuntimeModal or ManageKServeModal. Or I wonder if we should just let this PR merge without new tests.

@mturley mturley marked this pull request as ready for review January 9, 2024 19:28
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress This PR is in WIP state label Jan 9, 2024
@mturley mturley requested a review from lucferbux January 9, 2024 19:29
@andrewballantyne
Copy link
Member

andrewballantyne commented Jan 9, 2024

We don't have e2e tests... why do you need to mock user permissions? @christianvogt can you help out here?

EDIT: I bumped this to slack partially as well... I'm feeling there isn't a lot we can do here, but Christian should have the final say in it.

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

It also looks like the accessibility tests are failing to run in CI but they pass locally :(

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

We don't have e2e tests... why do you need to mock user permissions? @christianvogt can you help out here?

I figured since the fix here has to do with behavior being different for users with different permissions, I would want to mock each of the two cases (should create model server if user can update the namespace, should not create model server if user cannot update the namespace). My apologies for my inexperience with the repo getting in the way here.

@christianvogt
Copy link
Contributor

@mturley as far as mocking goes. Use jest to mock whatever you need on the frontend being used by the function you're testing. It's sometimes best to mock down to the k8s apis from @openshift/dynamic-plugin-sdk-utils - eg k8sListResource, k8sGetResource, etc...

But sometimes those mocks end up being very deep in the call tree and therefore you may want to mock another function that's used directly by your function. The choice is yours.

The other option is to add a new cypress test where you'd end up mocking the network calls.

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

/test all

@mturley
Copy link
Contributor Author

mturley commented Jan 9, 2024

Ok thanks @christianvogt , I'll keep playing with it.

@lucferbux
Copy link
Contributor

@mturley @christianvogt I wouldn't wast too much time trying to mock e2e within the backend, the main point i would cover here would be this call. Just try to mock it making sure it's failing so we can prove the rest of the requests are not being executed.

@mturley mturley changed the title Assert edit permissions on appropriate mutations in namespaceUtils, fix promise logic in serving runtime create [WIP] Assert edit permissions on appropriate mutations in namespaceUtils, fix promise logic in serving runtime create Jan 11, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Jan 11, 2024
@mturley mturley force-pushed the 548-deploy-model-403-error branch 4 times, most recently from cfc7243 to 39dc5e2 Compare January 15, 2024 19:19
Comment on lines 94 to 106
default:
throw createCustomError('Unknown configuration', 'Cannot apply namespace change', 400);
}

const selfSubjectAccessReview = await checkPermissionsFn(fastify, request, name);
if (isK8sStatus(selfSubjectAccessReview)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll fall into the next if-statement after this -- but I am wondering if we want to put a "dev setup" if statement first -- that basically says "checkPermissionsFn is null -- you need to set permissions for all actions" -- basically as a "fail immediately" sorta deal with a dev message so the dev knows the new case they added needs to have a check permissions.

Just my two cents. It should fail in some fashion on the checkPermissionsFn() call but I dunno how that will look.

Copy link
Contributor Author

@mturley mturley Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough. Just another throw createCustomError maybe? 500 error code I guess, since it shouldn't ever happen? It would be impossible to hit with the current code but that's a good point about future switch cases. Alternatively though we could just have it fall back to checkAdminNamespacePermission which is the behavior in all cases before this PR? I'm leaning towards your idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, like a 500 error "invalid backend state -- dev broken workflow" or something like that. Something annoying and blunt it was a dev setup. It will be easy to search in the codebase and should prevent anything from being poorly configured.

It should never happen in production, but it won't mask when code is poorly written -- so it gets caught in the PR for the change. Or in our test infra when we have that going nightly or whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed with this change as well.

frontend/src/concepts/dashboard/DashboardModalFooter.tsx Outdated Show resolved Hide resolved
@vconzola
Copy link

@mturley Can you make the error message more specific by replacing "labels" with "serving platform labels". A normal DS might not know what a label is, but they'll know it has something to do with a serving platform and their permissions. Also, put a period at the end.

@mturley mturley force-pushed the 548-deploy-model-403-error branch 4 times, most recently from c88568e to d93dbd3 Compare January 16, 2024 20:42
@mturley
Copy link
Contributor Author

mturley commented Jan 16, 2024

@andrewballantyne I've rebase and pushed with changes to address your feedback. Updated the PR description to match. Let me know what you think of the new error message approach.

@vconzola you got it, updated the message and the screenshot.

@mturley mturley force-pushed the 548-deploy-model-403-error branch 2 times, most recently from 8f1a68a to 3e49d71 Compare January 16, 2024 20:45
@mturley
Copy link
Contributor Author

mturley commented Jan 16, 2024

Fixed some lint errors I missed. Should be ready for re-review.

@mturley mturley force-pushed the 548-deploy-model-403-error branch 2 times, most recently from 3ba54a7 to 1652a56 Compare January 16, 2024 21:19
…ix promise logic in serving runtime create

Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
@mturley
Copy link
Contributor Author

mturley commented Jan 17, 2024

Rebased and addressed remaining comments.

Copy link
Contributor

@lucferbux lucferbux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Ok, after andrew's suggestion, I think is ready to go, the message is descriptive enough to get a grasp of what's happening. Works great in both kserve and modelmesh, great work!

Screenshot 2024-01-17 at 18 27 11

@openshift-ci openshift-ci bot added the lgtm label Jan 17, 2024
Copy link
Contributor

openshift-ci bot commented Jan 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucferbux

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit e328be0 into opendatahub-io:main Jan 17, 2024
6 checks passed
@mturley mturley deleted the 548-deploy-model-403-error branch January 17, 2024 18:21
mturley added a commit to mturley/odh-dashboard that referenced this pull request Jan 18, 2024
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
openshift-merge-bot bot added a commit that referenced this pull request Jan 18, 2024
Followup to PR #2319 - permission check typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants