Skip to content

Add regenerate-bot-tokens operation for JWT key rotation#26477

Merged
pmbrull merged 12 commits intomainfrom
task/fix-openmetadata-collate-issue-3157-30901f0a
Mar 16, 2026
Merged

Add regenerate-bot-tokens operation for JWT key rotation#26477
pmbrull merged 12 commits intomainfrom
task/fix-openmetadata-collate-issue-3157-30901f0a

Conversation

@pmbrull
Copy link
Collaborator

@pmbrull pmbrull commented Mar 13, 2026

Summary

  • Adds a new regenerate-bot-tokens subcommand to OpenMetadataOperations that regenerates JWT tokens for all bot users
  • Needed when rotating JWT signing keys or changing the cluster name (e.g., DR scenario moving from POV to PROD)
  • Paginates through all non-deleted bots, regenerates JWT tokens for those using JWT auth, and displays results in an ASCII table

Test plan

  • Run ./openmetadata-ops.sh regenerate-bot-tokens -c conf/openmetadata.yaml against a local environment with bots configured
  • Verify tokens are regenerated for all JWT-authenticated bots
  • Verify bots without JWT auth or without associated users are correctly skipped
  • Verify the new tokens work by authenticating with the regenerated bot token

Closes open-metadata/openmetadata-collate#3157

Add a new `regenerate-bot-tokens` subcommand to OpenMetadataOperations
that regenerates JWT tokens for all bot users. This is needed when
rotating JWT signing keys or changing the cluster name (e.g., during
disaster recovery from POV to PROD).

The command:
- Iterates through all non-deleted bots with pagination
- Regenerates JWT tokens only for bots using JWT authentication
- Skips bots without associated users or non-JWT auth mechanisms
- Displays results in an ASCII table with status per bot

Closes open-metadata/openmetadata-collate#3157

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Mar 13, 2026
pmbrull and others added 2 commits March 13, 2026 12:34
- Token expiry is now a CLI parameter (--expiry) defaulting to Unlimited,
  so operators can choose OneHour, One, Seven, Thirty, Sixty, Ninety, or
  Unlimited when regenerating
- Add null guard on getPaging() to prevent NPE on empty result sets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 13, 2026

🟡 Playwright Results — all passed (23 flaky)

✅ 3330 passed · ❌ 0 failed · 🟡 23 flaky · ⏭️ 183 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 453 0 2 2
🟡 Shard 2 304 0 1 1
🟡 Shard 3 653 0 8 33
🟡 Shard 4 720 0 7 47
✅ Shard 5 591 0 0 67
🟡 Shard 6 609 0 5 33
🟡 23 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Dashboard - customization should work (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from welcome screen (shard 1, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should navigate through pages (shard 2, 1 retry)
  • Features/DataProductRenameConsolidation.spec.ts › Rename then add tags - assets should be preserved (shard 3, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values Sum To Be Between (shard 3, 1 retry)
  • Features/DataQuality/DataQualityPermissions.spec.ts › User with TEST_CASE.VIEW_BASIC can view test case CONTENT details in UI (shard 3, 1 retry)
  • Features/DataQuality/TestCaseIncidentPermissions.spec.ts › User with only VIEW cannot PATCH incidents (shard 3, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 3, 1 retry)
  • Features/EntitySummaryPanel.spec.ts › should display summary panel for table (shard 3, 1 retry)
  • Features/ImpactAnalysis.spec.ts › Verify Upstream connections (shard 3, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Test Suite alert (shard 4, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Verify domain and subdomain asset count accuracy (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with deeply nested subdomains (3+ levels) verifies FQN propagation (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tag Add, Update and Remove for child entities (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Glossary Term Add, Update and Remove (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Inactive Announcement create & delete (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Glossary Term Update in Glossary Page should persist tree (shard 6, 1 retry)
  • Pages/Login.spec.ts › Refresh should work (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

pmbrull and others added 4 commits March 13, 2026 17:11
Move the token regeneration logic from the ops CLI into a new REST
endpoint PUT /v1/users/regenerateBotTokens (admin-only). The ops
command now calls the server API via HTTP, following the same pattern
as deploy-pipelines.

This keeps business logic in the server and the ops CLI as a thin
HTTP client.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After JWT key rotation, all existing bot tokens are invalid — including
the ingestion-bot token needed to authenticate against the server API.
This command must operate directly on the database to recover from
that state. Added a doc comment explaining the reasoning.

Also reverts the UserResource endpoint added in the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
listAll/listAfter use setFieldsInBulk which only calls registered
fieldFetchers (tags, owners, etc.) — not the entity-specific setFields
override. BotRepository.setFields populates botUser via a relationship
lookup, but this is never called in bulk listing. Fetch each bot
individually with getByName which triggers setFields properly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover token regeneration producing new tokens, respecting
expiry parameters, and handling multiple bots independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pmbrull and others added 2 commits March 13, 2026 17:54
Unlimited tokens generated in the same second produce identical JWTs
(same iat, null exp). Changed test to use different expiry values so
the tokens are guaranteed to differ.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gitar-bot
Copy link

gitar-bot bot commented Mar 16, 2026

Code Review ✅ Approved 5 resolved / 5 findings

Adds regenerate-bot-tokens operation for JWT key rotation, addressing hardcoded expiry overrides, null pointer exceptions on pagination boundaries, URL injection via CLI params, and error handling for failed regenerations. All findings resolved.

✅ 5 resolved
Bug: Hardcoded Unlimited expiry overrides bots' existing token expiry

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2519
At line 2519, JWTTokenExpiry.Unlimited is hardcoded when regenerating tokens. This silently overrides whatever expiry a bot's token originally had (e.g., 7-day, 30-day, 90-day) with Unlimited, which is a security regression. The canonical pattern (used in UserResource.java) is to read the existing expiry from the bot's stored JWTAuthMechanism config and pass it through.

Bug: NPE when getPaging() returns null on empty/last page

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2533
At line 2533, botPage.getPaging().getAfter() is called without a null guard. The ResultList class has constructors that leave paging = null (no-arg constructor and single-arg ResultList(List<T> data)). If listAfter returns a ResultList with null paging (e.g., when there are no bots), this will throw a NullPointerException, causing the command to fail and fall into the outer catch block with a confusing error message.

Security: CLI expiry param is not URL-encoded, enabling URL injection

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2500 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2478
The expiry CLI parameter is concatenated directly into the URL at line 2500 without URL-encoding or validation against the JWTTokenExpiry enum values. A value like Unlimited&admin=true would be passed verbatim into URI.create(...), injecting arbitrary query parameters into the request.

Additionally, the CLI declares expiry as a plain String (not the JWTTokenExpiry enum), so picocli won't reject invalid values at parse time. The user only gets a confusing 404 from the server when JAX-RS fails to bind an invalid enum value.

Fix both by (a) URL-encoding the value and (b) declaring the CLI option type as JWTTokenExpiry so picocli validates it upfront.

Edge Case: Endpoint returns 200 even when all bots fail regeneration

📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java:1023
The regenerateBotTokens endpoint always returns Response.ok() (HTTP 200) regardless of how many bots failed. If every bot fails, the caller still gets a 200 with failedBots populated but no HTTP-level indication of failure. The CLI side handles this correctly (returns exit code 1 if failedBots is non-empty), but other HTTP callers (e.g., UI, scripts) would need to inspect the response body to detect failures.

Consider returning 207 Multi-Status when there are partial failures, or 500 when all bots fail, to give HTTP-level callers a clear signal.

Bug: CLI returns exit code 0 even when all bot token regenerations fail

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2552
The old server-API-based code correctly returned exit code 1 when any bot failed (return result.getOrDefault("failedBots", List.of()).isEmpty() ? 0 : 1;). The new direct-DB code at line 2552 always returns 0, regardless of how many bots ended up in FAILED status. This makes the command unsuitable for scripted/CI use where the exit code signals success or failure.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link

@pmbrull pmbrull merged commit f0602ff into main Mar 16, 2026
37 of 42 checks passed
@pmbrull pmbrull deleted the task/fix-openmetadata-collate-issue-3157-30901f0a branch March 16, 2026 16:12
JRDuncan pushed a commit to JRDuncan/OpenMetadata that referenced this pull request Mar 17, 2026
…ta#26477)

* Add regenerate-bot-tokens operation for JWT key rotation

Add a new `regenerate-bot-tokens` subcommand to OpenMetadataOperations
that regenerates JWT tokens for all bot users. This is needed when
rotating JWT signing keys or changing the cluster name (e.g., during
disaster recovery from POV to PROD).

The command:
- Iterates through all non-deleted bots with pagination
- Regenerates JWT tokens only for bots using JWT authentication
- Skips bots without associated users or non-JWT auth mechanisms
- Displays results in an ASCII table with status per bot

Closes open-metadata/openmetadata-collate#3157

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix review: add --expiry CLI parameter and null-guard paging

- Token expiry is now a CLI parameter (--expiry) defaulting to Unlimited,
  so operators can choose OneHour, One, Seven, Thirty, Sixty, Ninety, or
  Unlimited when regenerating
- Add null guard on getPaging() to prevent NPE on empty result sets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Document regenerate-bot-tokens expiry values in help output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move bot token regeneration logic to server API

Move the token regeneration logic from the ops CLI into a new REST
endpoint PUT /v1/users/regenerateBotTokens (admin-only). The ops
command now calls the server API via HTTP, following the same pattern
as deploy-pipelines.

This keeps business logic in the server and the ops CLI as a thin
HTTP client.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert server API approach, use direct DB access for token regeneration

After JWT key rotation, all existing bot tokens are invalid — including
the ingestion-bot token needed to authenticate against the server API.
This command must operate directly on the database to recover from
that state. Added a doc comment explaining the reasoning.

Also reverts the UserResource endpoint added in the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix compilation: use fully-qualified ResultList to match existing pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix botUser always null: fetch each bot individually via getByName

listAll/listAfter use setFieldsInBulk which only calls registered
fieldFetchers (tags, owners, etc.) — not the entity-specific setFields
override. BotRepository.setFields populates botUser via a relationship
lookup, but this is never called in bulk listing. Fetch each bot
individually with getByName which triggers setFields properly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add integration tests for regenerate-bot-tokens operation

Tests cover token regeneration producing new tokens, respecting
expiry parameters, and handling multiple bots independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix flaky test: use different expiries to guarantee distinct tokens

Unlimited tokens generated in the same second produce identical JWTs
(same iat, null exp). Changed test to use different expiry values so
the tokens are guaranteed to differ.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* format

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants