Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove cluster level list/watch for configmaps, serviceaccounts, secrets #3469

Conversation

sivanantha321
Copy link
Member

@sivanantha321 sivanantha321 commented Feb 22, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3467
Fixes #366

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

Remove cluster level list/watch for configmaps, serviceaccounts, secrets

@sivanantha321 sivanantha321 force-pushed the remove-cluster-level-watch-credentials branch 4 times, most recently from 561d305 to 5af1c83 Compare February 23, 2024 06:44
@sivanantha321 sivanantha321 marked this pull request as ready for review February 23, 2024 06:46
@oss-prow-bot oss-prow-bot bot requested a review from njhill February 23, 2024 06:46
@sivanantha321
Copy link
Member Author

@israel-hdez Can you help test whether this fixes the out-of-memory issue ?

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing this!

@israel-hdez
Copy link
Contributor

@israel-hdez Can you help test whether this fixes the out-of-memory issue ?

I'll test on my Wednesday.

@sivanantha321 sivanantha321 force-pushed the remove-cluster-level-watch-credentials branch from 5af1c83 to b8e07c0 Compare February 28, 2024 09:18
@terrytangyuan
Copy link
Member

terrytangyuan commented Feb 29, 2024

@sivanantha321 Could you point me to where you removed the list calls for secrets? Perhaps I missed it somewhere but I only see the removal from manifests.

@israel-hdez
Copy link
Contributor

israel-hdez commented Mar 1, 2024

@israel-hdez Can you help test whether this fixes the out-of-memory issue ?

@sivanantha321 I'm still replicating the setup. I tried today, and my cluster died while on it.
I'll re-try on my Friday.

@sivanantha321
Copy link
Member Author

@sivanantha321 Could you point me to where you removed the list calls for secrets? Perhaps I missed it somewhere but I only see the removal from manifests.

We are not using list calls in the code base.

@israel-hdez
Copy link
Contributor

israel-hdez commented Mar 1, 2024

I've just replicated the OOMKill with master branch to ensure I have the needed setup.

I used a script from @skonto available in this gist: https://gist.github.com/skonto/188b2c45c5af449629caaa69a190392c, with a minor modification: edit line 3 to NC=64, because that was enough in my environment. The script would create about 8192 secrets, to a total of about 9528 secrets in my env.

I'm now trying the code changes in this PR.

@israel-hdez
Copy link
Contributor

I confirm the kserve-controller pod no longer crashes with OOMKill.
I'll do code review soon.

Notice that I only tried around Secrets which is what I reported in #3467. I didn't try the other resources.

@terrytangyuan
Copy link
Member

We are not using list calls in the code base.

I meant "list/watch for secrets" as described in the PR title but thanks I got it now.

Copy link
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a few minor comments.

@sivanantha321 sivanantha321 force-pushed the remove-cluster-level-watch-credentials branch 2 times, most recently from 7f50571 to 610f77b Compare March 4, 2024 17:08
@israel-hdez
Copy link
Contributor

/lgtm

@israel-hdez
Copy link
Contributor

Per request in prow comment:

/assign @rachitchauhan43

Copy link
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@oss-prow-bot oss-prow-bot bot added the lgtm label Mar 6, 2024
Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@israel-hdez
Copy link
Contributor

I pinged Rachit previously.
But now I'm also pinging @yuzisun in case Rachit is not available.

Copy link
Contributor

@rachitchauhan43 rachitchauhan43 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small change requested. Rest LGTM.

cmd/manager/main.go Show resolved Hide resolved
@rachitchauhan43
Copy link
Contributor

Looks like lots of imports reorganization happened in this PR. @sivanantha321 : Possible to add as an action itself that fails if imports are not as per standard in go files ?

@rachitchauhan43
Copy link
Contributor

Also, @israel-hdez , if we have reproducibility, what do you think of adding a test case to make sure regression always catches such issues ?

@israel-hdez
Copy link
Contributor

Also, @israel-hdez , if we have reproducibility, what do you think of adding a test case to make sure regression always catches such issues ?

@rachitchauhan43 I do know how to reproduce, but IMO that should be solved under its own ticket, because we would need to find a way to replicate within CI -- I crashed my dev cluster a few times before I could replicate successfully, so what I did on my device may not be suitable for GitHub Actions. And, of course, we would need to ensure all relevant code paths are covered.

@israel-hdez
Copy link
Contributor

Also, @israel-hdez , if we have reproducibility, what do you think of adding a test case to make sure regression always catches such issues ?

@rachitchauhan43 I do know how to reproduce, but IMO that should be solved under its own ticket, because we would need to find a way to replicate within CI -- I crashed my dev cluster a few times before I could replicate successfully, so what I did on my device may not be suitable for GitHub Actions. And, of course, we would need to ensure all relevant code paths are covered.

Actually, I think contributors and reviewers should be careful when a watch or list privilege is being added to the RBAC role ---and this cannot be covered by tests.

This PR is around ConfigMaps, ServiceAccounts and Secrets and is already removing the no longer needed privileges. But any resource that the controller is watching may lead to an OOMKill, if the watch/informer fills the available memory.

@sivanantha321
Copy link
Member Author

Looks like lots of imports reorganization happened in this PR. @sivanantha321 : Possible to add as an action itself that fails if imports are not as per standard in go files ?

We can do it in a separate PR as many files requires reorganization.

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
@sivanantha321 sivanantha321 force-pushed the remove-cluster-level-watch-credentials branch from 06d911a to 48fcc80 Compare March 11, 2024 06:59
@oss-prow-bot oss-prow-bot bot removed the lgtm label Mar 11, 2024
Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
@sivanantha321 sivanantha321 force-pushed the remove-cluster-level-watch-credentials branch from 48fcc80 to a886fbb Compare March 11, 2024 07:04
Copy link
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving

/lgtm

@oss-prow-bot oss-prow-bot bot added the lgtm label Mar 13, 2024
@rachitchauhan43
Copy link
Contributor

Looks like lots of imports reorganization happened in this PR. @sivanantha321 : Possible to add as an action itself that fails if imports are not as per standard in go files ?

We can do it in a separate PR as many files requires reorganization.

Sure, I meant that only. Don't want to have multi-purpose PR.

@rachitchauhan43
Copy link
Contributor

Also, @israel-hdez , if we have reproducibility, what do you think of adding a test case to make sure regression always catches such issues ?

@rachitchauhan43 I do know how to reproduce, but IMO that should be solved under its own ticket, because we would need to find a way to replicate within CI -- I crashed my dev cluster a few times before I could replicate successfully, so what I did on my device may not be suitable for GitHub Actions. And, of course, we would need to ensure all relevant code paths are covered.

Agreed. Please create separate ISsue for that.

@rachitchauhan43
Copy link
Contributor

/lgtm

Copy link

oss-prow-bot bot commented Mar 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: sivanantha321, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oss-prow-bot oss-prow-bot bot merged commit ba100b4 into kserve:master Mar 13, 2024
58 checks passed
tjandy98 pushed a commit to tjandy98/kserve that referenced this pull request Apr 10, 2024
…ets (kserve#3469)

* Remove cluster level list/watch for configmaps, serviceaccounts, secrets

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

---------

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>
israel-hdez pushed a commit to israel-hdez/kserve that referenced this pull request May 3, 2024
…ets (kserve#3469)

* Remove cluster level list/watch for configmaps, serviceaccounts, secrets

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

---------

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
israel-hdez pushed a commit to israel-hdez/kserve that referenced this pull request May 4, 2024
…ets (kserve#3469)

* Remove cluster level list/watch for configmaps, serviceaccounts, secrets

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

---------

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
israel-hdez pushed a commit to israel-hdez/kserve that referenced this pull request May 4, 2024
…ets (kserve#3469)

* Remove cluster level list/watch for configmaps, serviceaccounts, secrets

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

---------

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
israel-hdez pushed a commit to israel-hdez/kserve that referenced this pull request May 6, 2024
…ets (kserve#3469)

* Remove cluster level list/watch for configmaps, serviceaccounts, secrets

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

* Resolve comments

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>

---------

Signed-off-by: Sivanantham Chinnaiyan <sivanantham.chinnaiyan@ideas2it.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants