Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix operator crashing on first startup #28997

Merged
merged 1 commit into from
Jul 12, 2023
Merged

Conversation

hugoShaka
Copy link
Contributor

@hugoShaka hugoShaka commented Jul 12, 2023

The operator crashes on startup because of a regression introduced in #28391

This currently affects 13.2.0.

The issue happens because:

  • the DeleteBot() endpoint returns an aggregate with 2+ NotFoundError errors
  • ToGRPC does not support checking the aggregate type and sets the Unknown status code
  • FromGRPC doesn't unpack into a trace.NotFoundError because status is not 5
  • trace.IsNotFound returns false as this is not a NotFoundError anymore
  • we expect a NotFoundError and fail because we got something else

The issue got unnoticed because:

  • it was added during review and tested less extensively than the first changeset
  • there's no integration test for this (actionable item: Add operator sidecar test coverage #29006, but for now we need to fix the bug)
  • this doesn't trigger if the bot is already here, this passed smoke tests

@hugoShaka
Copy link
Contributor Author

hugoShaka commented Jul 12, 2023

Another approach suggested by @strideynet is to call DeleteUser() and DeleteRole() directly instead of DeleteBot().

@zmb3
Copy link
Collaborator

zmb3 commented Jul 12, 2023

Can we write a test for this?

@hugoShaka
Copy link
Contributor Author

hugoShaka commented Jul 12, 2023

Can we write a test for this?

Not within a day, this requires setting up a full integration test suite. There's no integration test framework covering this part, so this will delay the fix. I suggest we add the test the next time we touch the operator code.

@zmb3
Copy link
Collaborator

zmb3 commented Jul 12, 2023

Please file an issue then so we don't lose track of this (who knows when the next time we touch this code will be?)

Regressions indicate a lack of test coverage, so any time we fix one we should be asking why it wasn't caught with automated testing and how we can fix it.

@hugoShaka
Copy link
Contributor Author

Please file an issue then so we don't lose track of this (who knows when the next time we touch this code will be?)

#29006

@hugoShaka hugoShaka added this pull request to the merge queue Jul 12, 2023
Merged via the queue into master with commit b5c11a2 Jul 12, 2023
23 checks passed
@hugoShaka hugoShaka deleted the hugo/fix-operator-crash branch July 12, 2023 16:23
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v12 Create PR
branch/v13 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants