HYPERFLEET-774 - feat: update aggregation logic #82
rh-amarin wants to merge 1 commit into openshift-hyperfleet:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the relevant setting. Use the following commands to manage reviews, or the checkboxes below for quick actions.
Walkthrough

This PR overhauls status aggregation and its validation/lifecycle semantics. Key factual changes: extensive docs updates describing aggregation rules; config loader env-var parsing refinements; cluster and nodepool services now delegate status recomputation to internal updaters that accept observedTime and isLifecycleChange; ProcessAdapterStatus enforces pre-checks (generation staleness, mandatory conditions, Unknown Available discard), upserts adapter statuses, and triggers aggregation; status_aggregation was refactored to snapshot-driven Available computation, context-aware ComputeReadyCondition and BuildSyntheticConditions signatures, and many internal helpers; tests and new e2e curl scripts were expanded to cover rules and edge cases. ComputeAvailableCondition was removed; UpdateNodePoolStatusFromAdapters was added.

Sequence Diagram

```mermaid
sequenceDiagram
    actor Adapter
    participant API
    participant Service as Cluster/NodePool Service
    participant Agg as Status Aggregation
    participant DB as Database
    participant Resource
    Adapter->>API: POST adapter status (generation, Available, conditions)
    API->>Service: ProcessAdapterStatus(ctx, adapterStatus)
    alt Pre-processing validation
        Service->>Service: Check stale generation (P1)
        Service->>Service: Validate mandatory conditions (P2)
        Service->>Service: Reject Unknown Available (P3)
    end
    alt Discarded
        Service-->>API: HTTP 204 No Content
        API-->>Adapter: 204
    else Accepted
        Service->>DB: Upsert adapter status
        Service->>Service: update...StatusFromAdapters(ctx, id, observedTime, isLifecycleChange)
        Service->>Agg: BuildSyntheticConditions(ctx, existingConditions, adapterStatuses, requiredAdapters, resourceGen, now, observedTime, isLifecycleChange)
        Agg-->>Service: (Available, Ready synthetic conditions)
        Service->>DB: Persist resource conditions
        Service->>Resource: Update in-memory/resource view
        Service-->>API: Return updated resource (201/200)
        API-->>Adapter: 201/200
    end
```
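The pre-processing order shown in the diagram (P1 generation staleness against the fetched resource, P2 mandatory conditions, P3 Unknown Available) can be sketched as a small Go function. This is a simplified illustration; `adapterStatus`, `preCheck`, and the string-map condition shape are hypothetical stand-ins, not the service's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// adapterStatus is a simplified, illustrative shape of an adapter report.
type adapterStatus struct {
	ObservedGeneration int64
	Conditions         map[string]string // condition type -> status
}

var errNotFound = errors.New("resource not found")

// preCheck applies the documented order: P1 (resource exists, report not ahead
// of its generation) runs before P2 (mandatory conditions present) and
// P3 (Available=Unknown is discarded). It returns (discard, error), where
// discard=true maps to HTTP 204 No Content.
func preCheck(resourceGen int64, found bool, st adapterStatus) (bool, error) {
	// P1: the resource must exist and the report must not be from the future.
	if !found {
		return false, errNotFound
	}
	if st.ObservedGeneration > resourceGen {
		return true, nil // future-generation report: discard
	}
	// P2: mandatory conditions must be present.
	if _, ok := st.Conditions["Available"]; !ok {
		return false, errors.New("missing mandatory condition: Available")
	}
	// P3: Available=Unknown reports are discarded.
	if st.Conditions["Available"] == "Unknown" {
		return true, nil
	}
	return false, nil
}

func main() {
	// A report ahead of the resource generation is discarded before any
	// condition validation runs.
	discard, err := preCheck(2, true, adapterStatus{
		ObservedGeneration: 3,
		Conditions:         map[string]string{"Available": "True"},
	})
	fmt.Println(discard, err)
}
```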
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 11
🧹 Nitpick comments (2)
pkg/config/loader.go (1)
243-257: Warn when canonical and alias env vars conflict.

If both env vars are set with different non-empty values, the current behavior silently picks the first one. Adding a warning would reduce misconfiguration risk and improve operability.
Suggested adjustment
```diff
 setFromEnvVars := func(viperKey string, envVars []string) error {
+	if len(envVars) >= 2 {
+		primary := os.Getenv(envVars[0])
+		alias := os.Getenv(envVars[1])
+		if primary != "" && alias != "" && primary != alias {
+			logger.With(ctx, "viper_key", viperKey, "primary_env", envVars[0], "alias_env", envVars[1]).
+				Warn("Conflicting env vars detected; using primary env var by precedence")
+		}
+	}
 	for _, envVar := range envVars {
 		jsonValue := os.Getenv(envVar)
 		if jsonValue == "" {
 			continue
 		}
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/config/loader.go` around lines 243 - 257, In setFromEnvVars, detect when multiple envVars in the provided slice are set to different non-empty JSON values before returning: iterate all envVars, collect non-empty jsonValue strings, parse each into arrayValue (reuse the existing json.Unmarshal logic or parse first then compare raw strings), if more than one distinct value exists emit a warning via logger.With(ctx, "env_vars", envVars, "values", <list-of-values>).Warn("Conflicting environment variables for same config key; using first value"), then proceed to set viper.Set(viperKey, arrayValue) with the value chosen (first in the slice) and continue returning as before; reference setFromEnvVars, viper.Set and logger.With to locate where to add the conflict check and warning.

test/e2e-curl/12-unknown-first/test.sh (1)
82-90: Re-assert "not stored" after the second Unknown POST.

Right now the second event only asserts HTTP 204. Add a second `/statuses` count check to guard against regressions where subsequent Unknown reports are accidentally persisted.

Proposed hardening
```diff
 # ── Event 2: second Unknown report ────────────────────────────────────────────
 echo ""
 log_step "POST ${ADAPTER2}@gen1=Unknown (second report — also discarded)"
 CODE2=$(post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 1 "Unknown")
 echo ""
 echo "  API response (second Unknown also discarded):"
 assert_http "second Unknown discarded (204)" "204" "$CODE2"
+
+log_step "GET /statuses — adapter record must still NOT exist"
+STATUSES2=$(get_statuses "$CLUSTER_ID")
+VAL_COUNT2=$(count_adapter_status "$STATUSES2" "$ADAPTER2")
+assert_eq "${ADAPTER2} status record still NOT stored" "0" "$VAL_COUNT2"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/12-unknown-first/test.sh` around lines 82 - 90, After posting the second Unknown (stored in CODE2 via post_adapter_status), add a repeat check against the /statuses collection to assert the count did not change (i.e., the Unknown was not persisted). Concretely, immediately after the existing assert_http "second Unknown discarded (204)" call, perform the same GET /statuses count assertion used earlier in the script (the helper that verifies number of statuses) to ensure the total count remains unchanged for CLUSTER_ID/ADAPTER2.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/api-resources.md`:
- Line 553: The documented contract for last_updated_time is outdated and
misleading: update the docs so last_updated_time for the Available synthetic
condition is derived from adapter report times (use min(last_report_time) of
reporting adapters or the triggering observed_time on a True→False transition),
and for Ready=False set last_updated_time to min(last_report_time) across
required adapters with a fallback to now when no reports exist; reference the
fields last_updated_time, Available, Ready, last_report_time, observed_time and
describe those exact rules in place of the old sentence.
In `@pkg/services/CLAUDE.md`:
- Around line 24-30: The documentation is outdated:
UpdateClusterStatusFromAdapters no longer computes Available by taking a minimum
across mixed generations—modify the text to state that Available only updates
when all required adapters report the same observed generation (i.e., it
requires a shared observedGeneration), and remove the sentence about using
minimum observed generation; also update the description of ProcessAdapterStatus
to reflect that Available=Unknown is discarded on every report (not allowed even
on the first report); finally, adjust the note about Ready/LastUpdatedTime in
status_aggregation.computeReadyLastUpdated only if any behavior around
generation alignment or Available calculation changed there to reference the new
generation-alignment rule.
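The generation-alignment rule above (Available is recomputed only when all required adapters report the same observed generation) could look roughly like this. The `report` struct and `sharedGeneration` helper are hypothetical names for illustration only.

```go
package main

import "fmt"

// report is an illustrative stand-in for a stored adapter status.
type report struct {
	Adapter            string
	ObservedGeneration int64
	Available          bool
}

// sharedGeneration returns the common observed generation of all required
// adapters, or (0, false) when any adapter is missing or generations diverge.
// Under the documented rule, Available is recomputed only in the true case;
// otherwise the existing synthetic condition is left unchanged.
func sharedGeneration(required []string, reports map[string]report) (int64, bool) {
	var gen int64
	for i, name := range required {
		r, ok := reports[name]
		if !ok {
			return 0, false // a required adapter has not reported yet
		}
		if i == 0 {
			gen = r.ObservedGeneration
		} else if r.ObservedGeneration != gen {
			return 0, false // mixed generations: no recomputation
		}
	}
	return gen, true
}

func main() {
	reports := map[string]report{
		"dns":        {Adapter: "dns", ObservedGeneration: 2, Available: true},
		"validation": {Adapter: "validation", ObservedGeneration: 1, Available: true},
	}
	gen, aligned := sharedGeneration([]string{"dns", "validation"}, reports)
	fmt.Println(gen, aligned) // mixed generations, so Available stays as-is
}
```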
In `@pkg/services/cluster.go`:
- Around line 286-324: The code applies P2/P3 condition validation before P1
(observed_generation) which causes reports to a missing cluster to be swallowed
instead of returning NotFound; move the cluster fetch and the
observed-generation check (the P1 logic using s.clusterDao.Get and the
adapterStatus.ObservedGeneration > cluster.Generation comparison) up so it runs
immediately after unmarshalling adapterStatus.Conditions (or even before
condition validation), and only proceed to ValidateMandatoryConditions and the
Available==Unknown loop after confirming the cluster exists and
observed_generation is not ahead; update references to adapterStatus,
ValidateMandatoryConditions, s.clusterDao.Get, and the ObservedGeneration check
accordingly.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime; change it to prefer the adapter-reported
observed_time field instead. In the "Trigger aggregation" block, set
observedTime from upsertedStatus.ObservedTime (if non-nil), otherwise fall back
to upsertedStatus.LastReportTime, then time.Now(); replace the current
observedTime assignment that references upsertedStatus.LastReportTime with this
new preference so aggregation uses the adapter's reported observed_time for
transition timestamps.
In `@pkg/services/node_pool.go`:
- Around line 337-340: The aggregation error from
updateNodePoolStatusFromAdapters is only logged and not returned, letting the
caller see success while synthesized conditions remain stale; change the method
to propagate the failure instead of masking it: after calling
updateNodePoolStatusFromAdapters (symbol: updateNodePoolStatusFromAdapters)
check aggregateErr and return it (or a wrapped error) rather than just logging
via logger.With(...).WithError(aggregateErr).Warn(...); ensure the surrounding
Upsert/handler method returns the error so callers observe the failure and can
retry or surface it.
- Around line 287-324: The code performs P2/P3 checks before P1; move the node
pool fetch and the observed-generation check (the logic using s.nodePoolDao.Get
/ nodePool and the comparison adapterStatus.ObservedGeneration >
nodePool.Generation) to run before condition validation so P1 is applied prior
to P2/P3; specifically, call s.nodePoolDao.Get(ctx, nodePoolID) and handle
errors via handleGetError(...) and then perform the observed_generation vs
nodePool.Generation check (logging and returning nil) before invoking
ValidateMandatoryConditions(conditions) or iterating conditions for
Available==Unknown, ensuring malformed or deleted-node-pool reports return
NotFound instead of being silently discarded.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime, but you must use the adapter's reported
observed_time instead; change the assignment to pull the adapter's observed_time
field from upsertedStatus (e.g., upsertedStatus.ObservedTime or the nested
status field that contains the adapter's observed_time) and only fall back to
LastReportTime or time.Now when that adapter-reported observed_time is nil,
keeping the variable name observedTime and the same fallback behavior.
In `@test/e2e-curl/common.sh`:
- Around line 11-14: The default adapter names in the helper are inconsistent
with the documented test setup; update the ADAPTER1/ADAPTER2 defaults so they
match the documented server adapters (use ADAPTER1 default "dns" and ADAPTER2
default "validation") so that run_all.sh posts the correct adapter names and
aligns with HYPERFLEET_CLUSTER_ADAPTERS; locate the ADAPTER1 and ADAPTER2
assignments in the common helper (the lines setting
ADAPTER1="${ADAPTER1:-adapter1}" and ADAPTER2="${ADAPTER2:-adapter2}") and
change their fallback values accordingly.
In `@test/e2e-curl/README.md`:
- Around line 55-62: The fenced code block in test/e2e-curl/README.md is missing
a language tag (MD040); update the opening fence to include a language like
"text" so the block becomes ```text and keep the contents unchanged; locate the
unlabeled fence in the e2e-curl directory structure example in README.md and add
the language tag to the opening triple backticks.
In `@test/e2e-curl/run_all.sh`:
- Around line 42-50: When applying filter prefixes to TESTS (the SELECTED
population logic in the for-loop that builds SELECTED from TESTS), detect the
case where no tests matched the provided filters and exit non-zero with an error
message instead of proceeding and reporting 0/0 success; after the selection
loop that fills SELECTED, check if SELECTED is empty and print a clear error
like "no tests matched filters: $@" and exit with a failure code (e.g., exit 2).
Apply the same empty-SELECTED check and failure behavior to the duplicate
selection block used later (the other SELECTED population around the second
filter-handling section).
---
Nitpick comments:
In `@pkg/config/loader.go`:
- Around line 243-257: In setFromEnvVars, detect when multiple envVars in the
provided slice are set to different non-empty JSON values before returning:
iterate all envVars, collect non-empty jsonValue strings, parse each into
arrayValue (reuse the existing json.Unmarshal logic or parse first then compare
raw strings), if more than one distinct value exists emit a warning via
logger.With(ctx, "env_vars", envVars, "values",
<list-of-values>).Warn("Conflicting environment variables for same config key;
using first value"), then proceed to set viper.Set(viperKey, arrayValue) with
the value chosen (first in the slice) and continue returning as before;
reference setFromEnvVars, viper.Set and logger.With to locate where to add the
conflict check and warning.
In `@test/e2e-curl/12-unknown-first/test.sh`:
- Around line 82-90: After posting the second Unknown (stored in CODE2 via
post_adapter_status), add a repeat check against the /statuses collection to
assert the count did not change (i.e., the Unknown was not persisted).
Concretely, immediately after the existing assert_http "second Unknown discarded
(204)" call, perform the same GET /statuses count assertion used earlier in the
script (the helper that verifies number of statuses) to ensure the total count
remains unchanged for CLUSTER_ID/ADAPTER2.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 78db7f36-6f41-4908-bfde-4dce5d34730b
📒 Files selected for processing (25)
docs/api-resources.md
pkg/config/loader.go
pkg/services/CLAUDE.md
pkg/services/cluster.go
pkg/services/cluster_test.go
pkg/services/node_pool.go
pkg/services/node_pool_test.go
pkg/services/status_aggregation.go
pkg/services/status_aggregation_test.go
test/e2e-curl/01-initial-state/test.sh
test/e2e-curl/02-partial-adapters/test.sh
test/e2e-curl/03-all-adapters-ready/test.sh
test/e2e-curl/04-generation-bump/test.sh
test/e2e-curl/05-mixed-generations/test.sh
test/e2e-curl/06-stale-report/test.sh
test/e2e-curl/07-all-adapters-new-gen/test.sh
test/e2e-curl/08-adapter-goes-false/test.sh
test/e2e-curl/09-stable-true/test.sh
test/e2e-curl/10-stable-false/test.sh
test/e2e-curl/11-unknown-subsequent/test.sh
test/e2e-curl/12-unknown-first/test.sh
test/e2e-curl/README.md
test/e2e-curl/common.sh
test/e2e-curl/run_all.sh
test/integration/adapter_status_test.go
Force-pushed 2d77e35 to f825425 (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (10)
docs/api-resources.md (1)
553-553: ⚠️ Potential issue | 🟠 Major

`last_updated_time` contract is still outdated for synthetic conditions.

This line still says `Available` always uses evaluation time and `Ready=False` uses evaluation time, which contradicts the updated aggregation semantics. Update this sentence to match the current rules (report-time-based with transition/fallback behavior).

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/api-resources.md` at line 553, Update the `last_updated_time` sentence to reflect report-time-based semantics: state that `last_updated_time` is primarily driven by adapters' report times (`last_report_time`) rather than always using evaluation time; for `Available` use the adapter's `last_report_time` when present (fall back to evaluation time only if no report-time exists), and for `Ready` when Ready=True use the minimum `last_report_time` across all required adapters reporting `Available=True` at the current generation, while when Ready=False use the report-time that caused the Ready transition (or fall back to evaluation time if no such report exists).

pkg/services/node_pool.go (3)
337-340: ⚠️ Potential issue | 🟠 Major

Don't return success when aggregation fails.

If `Upsert` succeeds and `updateNodePoolStatusFromAdapters(...)` fails, Lines 337-340 only log a warning and still return success. That leaves the stored adapter status and the synthetic node-pool conditions out of sync until another write happens.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 337 - 340, The current logic logs a warning when updateNodePoolStatusFromAdapters(ctx, nodePoolID, observedTime, false) returns an error but still treats the overall operation as successful; change this so that when updateNodePoolStatusFromAdapters returns a non-nil error the caller returns that error (or wraps it) instead of proceeding as success. Locate the call to updateNodePoolStatusFromAdapters in the same function (after the Upsert) and propagate aggregateErr upward (e.g., return aggregateErr or fmt.Errorf("...: %w", aggregateErr)) so the Upsert result is not reported as successful when aggregation fails, preserving consistency between stored adapter status and synthetic node-pool conditions.
332-336: ⚠️ Potential issue | 🟠 Major

Use the adapter's observed_time, not `LastReportTime`.

The model documents `LastReportTime` as the API-managed receipt time ("Updated on every POST"). Using it on Lines 333-335 shifts synthetic `Available`/`Ready` timestamps for delayed or retried reports. Pass the request's `observed_time` through here and only fall back to `LastReportTime`/`time.Now()` when it is missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 332 - 336, The code currently sets observedTime from upsertedStatus.LastReportTime; instead use the adapter/request-provided observed_time first (the adapter's observed_time field passed into the handler) and only fall back to upsertedStatus.LastReportTime and then time.Now() if the request observed_time is nil/missing; update the observedTime assignment in pkg/services/node_pool.go (where observedTime is defined and upsertedStatus is referenced) to read the incoming request's observed_time value (or its wrapper/parameter) before falling back to upsertedStatus.LastReportTime and time.Now().
287-324: ⚠️ Potential issue | 🟠 Major

Run P1 before P2/P3.

The docstring on Lines 263-267 says `stale → P1 → P2 → P3`, but Lines 287-315 still parse and apply P2/P3 before `s.nodePoolDao.Get(...)`. A malformed or `Available=Unknown` report for a deleted node pool is swallowed before `handleGetError(...)` can return `NotFound`, and future-generation reports are checked later than the spec requires.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 287 - 324, The P1 observed_generation check must run before P2/P3; fetch the NodePool via s.nodePoolDao.Get(ctx, nodePoolID) and perform the observed_generation comparison (using adapterStatus.ObservedGeneration vs nodePool.Generation and returning handleGetError on Get errors) before calling ValidateMandatoryConditions(conditions) or iterating conditions for Available==Unknown; keep the existing json.Unmarshal of adapterStatus.Conditions if needed but relocate the nodePool fetch and the if adapterStatus.ObservedGeneration > nodePool.Generation check to occur prior to ValidateMandatoryConditions and the Available-Unknown loop.

pkg/services/cluster.go (3)
337-340: ⚠️ Potential issue | 🟠 Major

Don't return success when aggregation fails.

If `Upsert` succeeds and `updateClusterStatusFromAdapters(...)` fails, Lines 337-340 only log a warning and still return success. That leaves the stored adapter status and the synthetic cluster conditions out of sync until another write happens.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 337 - 340, The call to s.updateClusterStatusFromAdapters(ctx, clusterID, observedTime, false) currently only logs a warning on error, causing Upsert to return success despite aggregation failure; change this so that when aggregateErr != nil you propagate/return an error (e.g., wrap aggregateErr with context like "failed to aggregate cluster status for clusterID") from the enclosing function instead of only logging, using the same ctx via logger.WithClusterID and logger.WithError to preserve logs; locate the call to updateClusterStatusFromAdapters and replace the warning-only branch with a return of the wrapped aggregateErr so the caller sees the failure.
332-336: ⚠️ Potential issue | 🟠 Major

Use the adapter's observed_time, not `LastReportTime`.

The model documents `LastReportTime` as the API-managed receipt time ("Updated on every POST"). Using it on Lines 333-335 shifts synthetic `Available`/`Ready` timestamps for delayed or retried reports. Pass the request's `observed_time` through here and only fall back to `LastReportTime`/`time.Now()` when it is missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 332 - 336, The code currently sets observedTime from upsertedStatus.LastReportTime; instead prefer the adapter's reported observed_time and only fall back to LastReportTime or time.Now() if that field is missing. Update the observedTime assignment in pkg/services/cluster.go (the observedTime variable) to first use the incoming report/request observed_time (e.g., report.ObservedTime or req.ObservedTime from the handler), then if nil use upsertedStatus.LastReportTime, and finally time.Now(); keep the variable name observedTime and the existing fallbacks order as described.
286-324: ⚠️ Potential issue | 🟠 Major

Run P1 before P2/P3.

The docstring on Lines 262-269 says `stale → P1 → P2 → P3`, but Lines 286-314 still parse and apply P2/P3 before `s.clusterDao.Get(...)`. A malformed or `Available=Unknown` report for a deleted cluster is swallowed before `handleGetError(...)` can return `NotFound`, and future-generation reports are checked later than the spec requires.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 286 - 324, Unmarshal conditions stays, but perform P1 (fetch resource and check observed_generation) immediately after unmarshalling and before calling ValidateMandatoryConditions/Available checks: call s.clusterDao.Get(ctx, clusterID), handle errors with handleGetError("Cluster","id",clusterID,err), then if adapterStatus.ObservedGeneration > cluster.Generation log and return nil (the existing observed_generation check using adapterStatus.ObservedGeneration and cluster.Generation); only after that run ValidateMandatoryConditions(conditions) and the loop checking api.ConditionTypeAvailable == api.AdapterConditionUnknown.

test/e2e-curl/common.sh (1)
11-14: ⚠️ Potential issue | 🟠 Major

Align the defaults with the documented adapters.

Lines 13-14 still default to `adapter1`/`adapter2`, but this helper says they must match the server's required adapters and the header documents `dns`/`validation`. Running `run_all.sh` against the documented setup will post the wrong adapter names and never satisfy the required set.

Suggested fix
```diff
-ADAPTER1="${ADAPTER1:-adapter1}"
-ADAPTER2="${ADAPTER2:-adapter2}"
+ADAPTER1="${ADAPTER1:-dns}"
+ADAPTER2="${ADAPTER2:-validation}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/common.sh` around lines 11 - 14, The default adapter names ADAPTER1 and ADAPTER2 in test/e2e-curl/common.sh currently use "adapter1"/"adapter2" which mismatch the documented required adapters; update the default assignments ADAPTER1="${ADAPTER1:-...}" and ADAPTER2="${ADAPTER2:-...}" to use "dns" and "validation" respectively so the helper aligns with the server's HYPERFLEET_CLUSTER_ADAPTERS expectations while preserving the ability to override via env vars.

test/e2e-curl/README.md (1)
55-62: ⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block (MD040).

Line 55 starts an unlabeled fence; markdownlint will keep warning until it is tagged.

Proposed fix

````diff
-```
+```text
 e2e-curl/
 common.sh        # Shared helpers: API wrappers, assertions, logging
 run_all.sh       # Suite runner with per-test pass/fail tracking
 01-initial-state/
 test.sh
 ...
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In
`@test/e2e-curl/README.md` around lines 55 - 62, The unlabeled fenced code
block in README.md should include a language tag to satisfy markdownlint MD040;
change the triple-backtick fence that precedes the directory listing (the code
block showing "e2e-curl/ common.sh run_all.sh 01-initial-state/ test.sh ...") to
include an appropriate language tag such as text or bash (e.g., ```text or ```bash).

test/e2e-curl/run_all.sh (1)
42-50: ⚠️ Potential issue | 🟠 Major

Fail fast when no tests match provided filters.

If filters match nothing, the runner reports success with `0/0`, which can mask misconfigured executions.

Proposed fix
```diff
 SELECTED=()
 if [ $# -gt 0 ]; then
   for filter in "$@"; do
     for t in "${TESTS[@]}"; do
       [[ "$t" == "${filter}"* ]] && SELECTED+=("$t")
     done
   done
 else
   SELECTED=("${TESTS[@]}")
 fi
+
+if [ "${#SELECTED[@]}" -eq 0 ]; then
+  echo -e "${RED}ERROR${NC}: no tests matched filters: $*"
+  echo "Available tests:"
+  for t in "${TESTS[@]}"; do
+    echo "  - $t"
+  done
+  exit 2
+fi
```

Also applies to: 81-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/run_all.sh` around lines 42 - 50, The runner currently silently succeeds when no tests match filters because SELECTED remains empty; after the filtering block (where TESTS is iterated and SELECTED is populated) add a check that if SELECTED is empty you print a clear error like "No tests match provided filters" and exit with a non-zero status (e.g., exit 1) so the CI fails fast; apply the same empty-SELECTED check in the alternate filtering section mentioned (lines 81-86) to cover both code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/hyperfleet-api/migrate/cmd.go`:
- Around line 61-66: The migrate command currently calls
db.Migrate(connection.New(ctx)) directly which can race concurrent invocations;
wrap the migration call with a DB-backed lock (e.g., an advisory lock or a
dedicated migrations lock table) so only one process runs migrations at a time:
obtain the lock before calling db.Migrate, fail gracefully or wait with timeout
if the lock cannot be acquired, defer releasing the lock after migrations
complete (or on error), and log lock acquisition/failure; add or reuse a helper
like AcquireMigrationLock/ReleaseMigrationLock (or implement this logic in
pkg/db/migrations.go) and call it from the migrate command around db.Migrate so
migrations are serialized.
In `@docs/api-resources.md`:
- Around line 312-313: The table's Creation rows use observed_generation = 1
which conflicts with the Cluster and NodePool examples that show
observed_generation: 0; pick one source of truth and make them consistent:
either change the Creation table rows (the two "Creation" entries) to set
observed_generation (obs_gen) to 0 to match the examples, or update the Cluster
and NodePool example initial synthetic conditions to use observed_generation: 1;
update all instances of observed_generation/obs_gen (in the table and in the
Cluster/NodePool examples) so they match and add no other semantic changes.
- Around line 287-343: The aggregation matrix currently derives lut from
statuses[].lut (last_update_time); update the docs so lut is explicitly defined
as the adapter's last_report_time (replace all occurrences of statuses[].lut
with adapter last_report_time), and adjust each matrix rule: use observed_time
for synthetic last_report_time when a True→False transition is triggered, and
use now as the fallback last_report_time for Ready=False when any required
adapter lacks a stored status; update the notation lines (lut/last_update_time)
and all rows in the "Adapter Report Aggregation Matrix" and "Lifecycle Events"
tables to reflect this new source.
---
Duplicate comments:
In `@docs/api-resources.md`:
- Line 553: Update the `last_updated_time` sentence to reflect report-time-based
semantics: state that `last_updated_time` is primarily driven by adapters'
report times (`last_report_time`) rather than always using evaluation time; for
`Available` use the adapter's `last_report_time` when present (fall back to
evaluation time only if no report-time exists), and for `Ready` when Ready=True
use the minimum `last_report_time` across all required adapters reporting
`Available=True` at the current generation, while when Ready=False use the
report-time that caused the Ready transition (or fall back to evaluation time if
no such report exists).
In `@pkg/services/cluster.go`:
- Around line 337-340: The call to s.updateClusterStatusFromAdapters(ctx,
clusterID, observedTime, false) currently only logs a warning on error, causing
Upsert to return success despite aggregation failure; change this so that when
aggregateErr != nil you propagate/return an error (e.g., wrap aggregateErr with
context like "failed to aggregate cluster status for clusterID") from the
enclosing function instead of only logging, using the same ctx via
logger.WithClusterID and logger.WithError to preserve logs; locate the call to
updateClusterStatusFromAdapters and replace the warning-only branch with a
return of the wrapped aggregateErr so the caller sees the failure.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime; instead prefer the adapter's reported
observed_time and only fall back to LastReportTime or time.Now() if that field
is missing. Update the observedTime assignment in pkg/services/cluster.go (the
observedTime variable) to first use the incoming report/request observed_time
(e.g., report.ObservedTime or req.ObservedTime from the handler), then if nil
use upsertedStatus.LastReportTime, and finally time.Now(); keep the variable
name observedTime and the existing fallbacks order as described.
- Around line 286-324: Unmarshal conditions stays, but perform P1 (fetch
resource and check observed_generation) immediately after unmarshalling and
before calling ValidateMandatoryConditions/Available checks: call
s.clusterDao.Get(ctx, clusterID), handle errors with
handleGetError("Cluster","id",clusterID,err), then if
adapterStatus.ObservedGeneration > cluster.Generation log and return nil (the
existing observed_generation check using adapterStatus.ObservedGeneration and
cluster.Generation); only after that run ValidateMandatoryConditions(conditions)
and the loop checking api.ConditionTypeAvailable == api.AdapterConditionUnknown.
In `@pkg/services/node_pool.go`:
- Around line 337-340: The current logic logs a warning when
updateNodePoolStatusFromAdapters(ctx, nodePoolID, observedTime, false) returns
an error but still treats the overall operation as successful; change this so
that when updateNodePoolStatusFromAdapters returns a non-nil error the caller
returns that error (or wraps it) instead of proceeding as success. Locate the
call to updateNodePoolStatusFromAdapters in the same function (after the Upsert)
and propagate aggregateErr upward (e.g., return aggregateErr or fmt.Errorf("...:
%w", aggregateErr)) so the Upsert result is not reported as successful when
aggregation fails, preserving consistency between stored adapter status and
synthetic node-pool conditions.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime; instead use the adapter/request-provided
observed_time first (the adapter's observed_time field passed into the handler)
and only fall back to upsertedStatus.LastReportTime and then time.Now() if the
request observed_time is nil/missing; update the observedTime assignment in
pkg/services/node_pool.go (where observedTime is defined and upsertedStatus is
referenced) to read the incoming request's observed_time value (or its
wrapper/parameter) before falling back to upsertedStatus.LastReportTime and
time.Now().
- Around line 287-324: The P1 observed_generation check must run before P2/P3;
fetch the NodePool via s.nodePoolDao.Get(ctx, nodePoolID) and perform the
observed_generation comparison (using adapterStatus.ObservedGeneration vs
nodePool.Generation and returning handleGetError on Get errors) before calling
ValidateMandatoryConditions(conditions) or iterating conditions for
Available==Unknown; keep the existing json.Unmarshal of adapterStatus.Conditions
if needed but relocate the nodePool fetch and the if
adapterStatus.ObservedGeneration > nodePool.Generation check to occur prior to
ValidateMandatoryConditions and the Available-Unknown loop.
In `@test/e2e-curl/common.sh`:
- Around line 11-14: The default adapter names ADAPTER1 and ADAPTER2 in
test/e2e-curl/common.sh currently use "adapter1"/"adapter2" which mismatch the
documented required adapters; update the default assignments
ADAPTER1="${ADAPTER1:-...}" and ADAPTER2="${ADAPTER2:-...}" to use "dns" and
"validation" respectively so the helper aligns with the server's
HYPERFLEET_CLUSTER_ADAPTERS expectations while preserving the ability to
override via env vars.
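The suggested default assignments are a standard `${VAR:-default}` pattern; a minimal sketch, assuming the adapter names "dns" and "validation" from the review:

```shell
#!/usr/bin/env bash
# Sketch of the suggested defaults for test/e2e-curl/common.sh: "dns" and
# "validation" match the documented HYPERFLEET_CLUSTER_ADAPTERS set, while
# an exported env var still overrides either one.
ADAPTER1="${ADAPTER1:-dns}"
ADAPTER2="${ADAPTER2:-validation}"
echo "adapters: $ADAPTER1 $ADAPTER2"
```

Running with `ADAPTER1=custom ./script.sh` would keep the override; running with nothing exported yields the documented defaults.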
In `@test/e2e-curl/README.md`:
- Around line 55-62: The unlabeled fenced code block in README.md should include
a language tag to satisfy markdownlint MD040; change the triple-backtick fence
that precedes the directory listing (the code block showing "e2e-curl/ common.sh
run_all.sh 01-initial-state/ test.sh ...") to include an appropriate language
tag such as text or bash (e.g., ```text or ```bash) so the fence is labeled and
the linter warning is resolved.
In `@test/e2e-curl/run_all.sh`:
- Around line 42-50: The runner currently silently succeeds when no tests match
filters because SELECTED remains empty; after the filtering block (where TESTS
is iterated and SELECTED is populated) add a check that if SELECTED is empty you
print a clear error like "No tests match provided filters" and exit with a
non-zero status (e.g., exit 1) so the CI fails fast; apply the same
empty-SELECTED check in the alternate filtering section mentioned (lines 81-86)
to cover both code paths.
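The empty-selection guard described above can be sketched like this; `TESTS` and the prefix matching are simplified stand-ins for the runner's real filtering logic:

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast guard: if no test matches the provided filters,
# report an error and fail instead of passing 0/0.
select_tests() {
  local -a tests=(01-initial-state 02-partial-adapters 03-all-adapters-ready)
  local -a selected=()
  local t f
  for t in "${tests[@]}"; do
    for f in "$@"; do
      case "$t" in "$f"*) selected+=("$t") ;; esac
    done
  done
  if [ "${#selected[@]}" -eq 0 ]; then
    echo "ERROR: no tests matched filters: $*" >&2
    return 1
  fi
  printf '%s\n' "${selected[@]}"
}

select_tests 02 # prints 02-partial-adapters
```

In the real runner, the guard would `exit 1` instead of `return 1` so CI fails fast on a typoed filter.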
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5da2d72e-4d9a-4a2c-89e5-415cc9acd725
📒 Files selected for processing (38)
- cmd/hyperfleet-api/migrate/cmd.go
- docs/api-resources.md
- pkg/config/db.go
- pkg/config/loader.go
- pkg/db/advisory_locks.go
- pkg/db/context.go
- pkg/db/db_session/default.go
- pkg/db/db_session/test.go
- pkg/db/db_session/testcontainer.go
- pkg/db/migrations.go
- pkg/db/mocks/session_factory.go
- pkg/db/session.go
- pkg/logger/fields.go
- pkg/services/CLAUDE.md
- pkg/services/cluster.go
- pkg/services/cluster_test.go
- pkg/services/node_pool.go
- pkg/services/node_pool_test.go
- pkg/services/status_aggregation.go
- pkg/services/status_aggregation_test.go
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/05-mixed-generations/test.sh
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/07-all-adapters-new-gen/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/10-stable-false/test.sh
- test/e2e-curl/11-unknown-subsequent/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/README.md
- test/e2e-curl/common.sh
- test/e2e-curl/run_all.sh
- test/helper.go
- test/integration/adapter_status_test.go
- test/integration/advisory_locks_test.go
💤 Files with no reviewable changes (11)
- pkg/db/mocks/session_factory.go
- pkg/db/db_session/default.go
- pkg/db/db_session/test.go
- pkg/db/session.go
- pkg/db/db_session/testcontainer.go
- pkg/db/migrations.go
- test/integration/advisory_locks_test.go
- pkg/db/context.go
- pkg/db/advisory_locks.go
- test/helper.go
- pkg/logger/fields.go
🚧 Files skipped from review as they are similar to previous changes (9)
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
- pkg/services/CLAUDE.md
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- pkg/services/status_aggregation_test.go
- test/e2e-curl/10-stable-false/test.sh
f825425 to 3ba7df1
Actionable comments posted: 6
♻️ Duplicate comments (6)
pkg/services/cluster.go (2)
337-340: ⚠️ Potential issue | 🟠 Major

Return the aggregation error instead of swallowing it.

After Line 327 stores the adapter status, a failure at Line 337 leaves the cluster’s synthetic conditions stale while this method still returns success. Please propagate `aggregateErr` so callers can retry or surface the partial failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 337 - 340, The call to s.updateClusterStatusFromAdapters(ctx, clusterID, observedTime, false) currently logs aggregateErr and continues, which swallows aggregation failures; change the flow so that if aggregateErr != nil the method returns that error (or wraps it) instead of only logging—locate the block around the call to updateClusterStatusFromAdapters and replace the logger-only branch with a return of aggregateErr (use fmt.Errorf or errors.Wrap if you need context) so callers can observe and retry on aggregation failures.
332-336: ⚠️ Potential issue | 🟠 Major

Use the adapter-reported `observed_time` here.

Line 334 still derives `observedTime` from `LastReportTime`, which is the API receipt timestamp. The synthetic transition rules are keyed off the adapter’s `observed_time`, so delayed or retried reports will stamp `Ready`/`Available` with the wrong time unless this prefers the stored observed time and only falls back when it is absent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 332 - 336, The code currently sets observedTime using upsertedStatus.LastReportTime (API receipt timestamp); change it to prefer the adapter-reported timestamp by checking upsertedStatus.ObservedTime first, then falling back to upsertedStatus.LastReportTime, and finally time.Now() if both are nil. Update the observedTime assignment in the aggregation/transition block (where observedTime is defined) to use upsertedStatus.ObservedTime -> upsertedStatus.LastReportTime -> time.Now() order so synthetic transition rules use the adapter's observed_time.

pkg/services/node_pool.go (2)
337-340: ⚠️ Potential issue | 🟠 Major

Return the aggregation error instead of swallowing it.

After Line 327 stores the adapter status, a failure at Line 337 leaves the node pool’s synthetic conditions stale while this method still returns success. Please propagate `aggregateErr` so callers can retry or surface the partial failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 337 - 340, The call to s.updateNodePoolStatusFromAdapters currently logs aggregateErr and swallows it; instead propagate that error to the caller so callers can retry or surface the partial failure. Modify the function containing this snippet to return aggregateErr when s.updateNodePoolStatusFromAdapters(ctx, nodePoolID, observedTime, false) yields an error (i.e., propagate the aggregateErr rather than only logging), ensuring the enclosing function's signature returns an error and updating any callers to handle the propagated error if necessary; keep the existing logging but return the error after logging.
332-336: ⚠️ Potential issue | 🟠 Major

Use the adapter-reported `observed_time` here.

Line 334 still derives `observedTime` from `LastReportTime`, which is the API receipt timestamp. The synthetic transition rules are keyed off the adapter’s `observed_time`, so delayed or retried reports will stamp `Ready`/`Available` with the wrong time unless this prefers the stored observed time and only falls back when it is absent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 332 - 336, The code currently sets observedTime from upsertedStatus.LastReportTime (API receipt timestamp) which is incorrect; update the logic that sets observedTime (the variable) to prefer the adapter-reported observed_time stored on upsertedStatus (the stored adapter observed time field) and only fall back to upsertedStatus.LastReportTime or time.Now() if that stored adapter observed time is nil/zero; change the assignment near where observedTime is initialized so it first checks the adapter observed-time field on upsertedStatus (instead of LastReportTime) before using LastReportTime/time.Now().

docs/api-resources.md (2)
310-313: ⚠️ Potential issue | 🟠 Major

Creation `obs_gen` here conflicts with the examples above.

The lifecycle table initializes both synthetic conditions at `obs_gen=1`, while the Cluster and NodePool create examples in this file still show `observed_generation: 0`. Please align these to one contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/api-resources.md` around lines 310 - 313, The lifecycle table sets synthetic condition column obs_gen to 1 but the Cluster and NodePool examples still use observed_generation: 0 causing a mismatch; pick one contract and make examples consistent (either change the table to show obs_gen=0 or update the Cluster/NodePool examples to observed_generation: 1). Locate occurrences of obs_gen in the lifecycle table and the examples referencing observed_generation and update them so both use the same value (prefer updating the Cluster and NodePool examples to observed_generation: 1 to match the lifecycle table) ensuring all instances of obs_gen/observed_generation are aligned.
282-305: ⚠️ Potential issue | 🟠 Major

The spec still documents the old `last_updated_time` inputs.

These sections still drive `Ready`/`Available` from all `statuses[]` and `statuses[].lut`, but `pkg/services/status_aggregation.go` now computes synthetic `last_updated_time` from required adapters only, using adapter `last_report_time` and the triggering `observed_time` on True→False transitions. In particular, Line 553 is wrong for both synthetic conditions, and `Ready=False` is no longer “evaluation time” except for the fallback case.

Also applies to: 320-343, 553-553
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/api-resources.md` around lines 282 - 305, Update the docs to stop treating statuses[].lut (last_updated_time) as the source for Ready/Available synthetic conditions and instead describe the new computation: synthetic last_updated_time is derived from required adapters only using each adapter's last_report_time and, for True→False transitions, the report's observed_time as the triggering time; state that Ready=False uses that synthetic last_updated_time (not “evaluation time”) except in the explicit fallback case, and remove or replace any language that says Ready/Available are driven by all statuses[] or statuses[].lut (including the sections currently referencing statuses[], lut, obs_time and obs_gen). Reference the aggregation logic that computes the synthetic last_updated_time and the Ready/Available synthetic conditions (the status aggregation implementation that synthesizes last_updated_time from adapter last_report_time and observed_time on transitions) when updating the text.
🧹 Nitpick comments (1)
test/e2e-curl/common.sh (1)
134-148: Let callers provide `observed_time` for deterministic scenarios.

This helper always stamps `observed_time` itself, which is why several new scenarios have to `sleep 1`/`2` just to make timestamp assertions reliable (`test/e2e-curl/05-mixed-generations/test.sh` Line 49, `test/e2e-curl/07-all-adapters-new-gen/test.sh` Lines 35-40 and 58, `test/e2e-curl/10-stable-false/test.sh` Line 51, `test/e2e-curl/11-unknown-subsequent/test.sh` Line 49). Accepting an optional timestamp here would make the suite faster and less timing-sensitive.

♻️ Proposed refactor

```diff
 post_adapter_status() {
-  local cluster_id="$1" adapter="$2" gen="$3" available="$4"
+  local cluster_id="$1" adapter="$2" gen="$3" available="$4" observed_time="${5:-$(now_rfc3339)}"
   curl -s -o /dev/null -w "%{http_code}" \
     -X POST "$API/clusters/$cluster_id/statuses" \
     -H "Content-Type: application/json" \
     -d "{
       \"adapter\": \"$adapter\",
       \"observed_generation\": $gen,
-      \"observed_time\": \"$(now_rfc3339)\",
+      \"observed_time\": \"$observed_time\",
       \"conditions\": [
         {\"type\": \"Available\", \"status\": \"$available\", \"reason\": \"Testing\"},
         {\"type\": \"Applied\", \"status\": \"True\", \"reason\": \"Testing\"},
         {\"type\": \"Health\", \"status\": \"True\", \"reason\": \"Testing\"}
       ]
     }"
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/common.sh` around lines 134 - 148, The post_adapter_status helper currently always calls now_rfc3339 for observed_time; change post_adapter_status to accept an optional observed_time parameter (e.g., add a fifth parameter or make the fourth optional) and set a local variable like observed_time="${5:-$(now_rfc3339)}"; then use that variable in the JSON payload for "observed_time" instead of calling now_rfc3339 inline. Update callers that need deterministic timestamps to pass their desired timestamp to post_adapter_status (leave existing callers functioning when they omit the arg).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/services/cluster.go`:
- Around line 281-327: The stale-generation check in clusterService's status
handling is racy because the read-side check in s.clusterDao/Get +
existingStatus check is not atomic with the write in s.adapterStatusDao.Upsert;
modify adapterStatusDao.Upsert (the Upsert implementation in
pkg/dao/adapter_status.go) to perform the stale-filtering atomically by
including a WHERE predicate on observed_generation (e.g., UPDATE ... WHERE
adapter = ? AND observed_generation <= ?) or by loading-and-writing within a
single DB transaction so an older observed_generation cannot overwrite a newer
one; ensure the Upsert returns whether the write was applied (or an error) so
callers such as the code invoking s.adapterStatusDao.Upsert can detect a no-op
and avoid assuming success.
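The effect of the guarded upsert can be illustrated with an in-memory simulation. The SQL shape in the comment below is an assumption; the essential points are that the staleness guard and the write happen in one atomic statement, and that the caller can observe whether the write applied:

```go
package main

import "fmt"

// adapterStore simulates the effect of a guarded upsert such as
//   INSERT ... ON CONFLICT (adapter) DO UPDATE SET ...
//   WHERE adapter_statuses.observed_generation <= EXCLUDED.observed_generation
// (illustrative SQL, not the project's actual statement): an older
// observed_generation can never overwrite a newer one.
type adapterStore struct {
	generations map[string]int64
}

func (s *adapterStore) upsert(adapter string, observedGen int64) (applied bool) {
	current, exists := s.generations[adapter]
	if exists && observedGen < current {
		return false // older generation: the guarded statement is a no-op
	}
	s.generations[adapter] = observedGen
	return true
}

func main() {
	s := &adapterStore{generations: map[string]int64{}}
	fmt.Println(s.upsert("dns", 2)) // true: first write for this adapter
	fmt.Println(s.upsert("dns", 1)) // false: stale write cannot overwrite gen 2
}
```

Returning `applied` is what lets the caller detect a no-op instead of assuming the write succeeded.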
In `@test/e2e-curl/08-adapter-goes-false/test.sh`:
- Around line 31-34: Tests call post_adapter_status with hardcoded generation 1
which can mismatch actual cluster generation; change calls to use the captured
RESOURCE_GEN variable instead of literal 1. Update the two invocations of
post_adapter_status (the ones posting for ADAPTER1 and ADAPTER2) to pass
RESOURCE_GEN as the generation arg, ensuring the function calls use CLUSTER_ID,
ADAPTER1/ADAPTER2 and RESOURCE_GEN so status reports reflect the actual cluster
generation.
- Around line 31-34: The two setup POST calls using post_adapter_status (for
ADAPTER1 and ADAPTER2 with CLUSTER_ID) currently discard HTTP responses and may
hide failures; update the calls to capture the command exit status or HTTP
response and if it indicates failure, log an error including adapter and cluster
info and exit immediately (fail fast) instead of redirecting output to
/dev/null. Specifically, run post_adapter_status "$CLUSTER_ID" "$ADAPTER1" 1
"True" and post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 1 "True" into a
variable or check their return code, emit a descriptive error via log_step or
echo when non-zero (include $ADAPTER1/$ADAPTER2 and $CLUSTER_ID), and call exit
1 to stop the script on failure.
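The fail-fast pattern described above can be sketched as a wrapper. `post_adapter_status` below is a stand-in stub for the helper in `test/e2e-curl/common.sh`, which prints the HTTP status code of the POST; `must_post` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Stand-in for the real helper, which echoes the HTTP status code.
post_adapter_status() { echo "201"; }

# must_post wraps the helper and fails fast on anything but 201, including
# adapter and cluster context in the error message.
must_post() {
  local cluster_id="$1" adapter="$2" gen="$3" available="$4" code
  code=$(post_adapter_status "$cluster_id" "$adapter" "$gen" "$available")
  if [ "$code" != "201" ]; then
    echo "ERROR: adapter=$adapter cluster=$cluster_id status POST failed (HTTP $code)" >&2
    exit 1
  fi
}

must_post demo-cluster dns 1 True && echo "setup ok"
```

Setup steps that use such a wrapper cannot silently proceed on a failed POST the way a `>/dev/null` redirect can.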
In `@test/e2e-curl/10-stable-false/test.sh`:
- Around line 30-49: The gen2 setup POST for ${ADAPTER2} must be verified so the
test proves this is a re-report, not the first gen2 report: capture the result
of post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 2 "True" (the second call that
currently discards output), assert it succeeded (nonzero exit should fail the
test) and then call get_cluster and inspect the returned state to confirm
ADAPTER2 already reported for generation 2 before continuing (use get_cluster
and condition_field or whatever adapter-status accessor you have to prove an
existing gen=2 entry). If the POST failed or the cluster snapshot does not show
ADAPTER2@gen2=True, fail the test so the later POST cannot be mistaken for the
first gen2 report.
In `@test/e2e-curl/11-unknown-subsequent/test.sh`:
- Around line 29-47: The second setup POST to post_adapter_status for ADAPTER1
is ignored and the test never asserts the baseline cluster is Ready/Available;
capture and assert the ADAPTER1 POST response (like CODE=$(post_adapter_status
... ) and assert_http "initial ${ADAPTER1} True accepted" "201" "$CODE") and
then assert PRE_AVAIL_STATUS == "True" and PRE_READY_STATUS == "True" (use the
test suite's assertion helper such as assert_equals or equivalent) after
computing PRE_AVAIL_STATUS and PRE_READY_STATUS to fail early if the
precondition wasn't met.
In `@test/e2e-curl/README.md`:
- Around line 17-30: The README examples for running tests assume you are in the
test/e2e-curl directory but don't state that; update the prose or commands to
make the working directory explicit: either add a line telling the user to cd
test/e2e-curl before the examples, or change the invocations to use
repo-root-relative paths (e.g., test/e2e-curl/run_all.sh and ADAPTER1=...
test/e2e-curl/run_all.sh). Make the change around the examples that reference
run_all.sh so the commands work from the repo root.
---
Duplicate comments:
In `@docs/api-resources.md`:
- Around line 310-313: The lifecycle table sets synthetic condition column
obs_gen to 1 but the Cluster and NodePool examples still use
observed_generation: 0 causing a mismatch; pick one contract and make examples
consistent (either change the table to show obs_gen=0 or update the
Cluster/NodePool examples to observed_generation: 1). Locate occurrences of
obs_gen in the lifecycle table and the examples referencing observed_generation
and update them so both use the same value (prefer updating the Cluster and
NodePool examples to observed_generation: 1 to match the lifecycle table)
ensuring all instances of obs_gen/observed_generation are aligned.
- Around line 282-305: Update the docs to stop treating statuses[].lut
(last_updated_time) as the source for Ready/Available synthetic conditions and
instead describe the new computation: synthetic last_updated_time is derived
from required adapters only using each adapter's last_report_time and, for
True→False transitions, the report's observed_time as the triggering time; state
that Ready=False uses that synthetic last_updated_time (not “evaluation time”)
except in the explicit fallback case, and remove or replace any language that
says Ready/Available are driven by all statuses[] or statuses[].lut (including
the sections currently referencing statuses[], lut, obs_time and obs_gen).
Reference the aggregation logic that computes the synthetic last_updated_time
and the Ready/Available synthetic conditions (the status aggregation
implementation that synthesizes last_updated_time from adapter last_report_time
and observed_time on transitions) when updating the text.
In `@pkg/services/cluster.go`:
- Around line 337-340: The call to s.updateClusterStatusFromAdapters(ctx,
clusterID, observedTime, false) currently logs aggregateErr and continues, which
swallows aggregation failures; change the flow so that if aggregateErr != nil
the method returns that error (or wraps it) instead of only logging—locate the
block around the call to updateClusterStatusFromAdapters and replace the
logger-only branch with a return of aggregateErr (use fmt.Errorf or errors.Wrap
if you need context) so callers can observe and retry on aggregation failures.
- Around line 332-336: The code currently sets observedTime using
upsertedStatus.LastReportTime (API receipt timestamp); change it to prefer the
adapter-reported timestamp by checking upsertedStatus.ObservedTime first, then
falling back to upsertedStatus.LastReportTime, and finally time.Now() if both
are nil. Update the observedTime assignment in the aggregation/transition block
(where observedTime is defined) to use upsertedStatus.ObservedTime ->
upsertedStatus.LastReportTime -> time.Now() order so synthetic transition rules
use the adapter's observed_time.
In `@pkg/services/node_pool.go`:
- Around line 337-340: The call to s.updateNodePoolStatusFromAdapters currently
logs aggregateErr and swallows it; instead propagate that error to the caller so
callers can retry or surface the partial failure. Modify the function containing
this snippet to return aggregateErr when s.updateNodePoolStatusFromAdapters(ctx,
nodePoolID, observedTime, false) yields an error (i.e., propagate the
aggregateErr rather than only logging), ensuring the enclosing function's
signature returns an error and updating any callers to handle the propagated
error if necessary; keep the existing logging but return the error after
logging.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime (API receipt timestamp) which is incorrect; update
the logic that sets observedTime (the variable) to prefer the adapter-reported
observed_time stored on upsertedStatus (the stored adapter observed time field)
and only fall back to upsertedStatus.LastReportTime or time.Now() if that stored
adapter observed time is nil/zero; change the assignment near where observedTime
is initialized so it first checks the adapter observed-time field on
upsertedStatus (instead of LastReportTime) before using
LastReportTime/time.Now().
---
Nitpick comments:
In `@test/e2e-curl/common.sh`:
- Around line 134-148: The post_adapter_status helper currently always calls
now_rfc3339 for observed_time; change post_adapter_status to accept an optional
observed_time parameter (e.g., add a fifth parameter or make the fourth
optional) and set a local variable like observed_time="${5:-$(now_rfc3339)}";
then use that variable in the JSON payload for "observed_time" instead of
calling now_rfc3339 inline. Update callers that need deterministic timestamps to
pass their desired timestamp to post_adapter_status (leave existing callers
functioning when they omit the arg).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 697b363b-f760-4866-b2d2-98965b4d6ddb
📒 Files selected for processing (38)
- cmd/hyperfleet-api/migrate/cmd.go
- docs/api-resources.md
- pkg/config/db.go
- pkg/config/loader.go
- pkg/db/advisory_locks.go
- pkg/db/context.go
- pkg/db/db_session/default.go
- pkg/db/db_session/test.go
- pkg/db/db_session/testcontainer.go
- pkg/db/migrations.go
- pkg/db/mocks/session_factory.go
- pkg/db/session.go
- pkg/logger/fields.go
- pkg/services/CLAUDE.md
- pkg/services/cluster.go
- pkg/services/cluster_test.go
- pkg/services/node_pool.go
- pkg/services/node_pool_test.go
- pkg/services/status_aggregation.go
- pkg/services/status_aggregation_test.go
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/05-mixed-generations/test.sh
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/07-all-adapters-new-gen/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/10-stable-false/test.sh
- test/e2e-curl/11-unknown-subsequent/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/README.md
- test/e2e-curl/common.sh
- test/e2e-curl/run_all.sh
- test/helper.go
- test/integration/adapter_status_test.go
- test/integration/advisory_locks_test.go
💤 Files with no reviewable changes (11)
- pkg/logger/fields.go
- pkg/db/db_session/default.go
- test/integration/advisory_locks_test.go
- pkg/db/mocks/session_factory.go
- pkg/db/db_session/test.go
- pkg/db/migrations.go
- pkg/db/context.go
- pkg/db/advisory_locks.go
- test/helper.go
- pkg/db/db_session/testcontainer.go
- pkg/db/session.go
🚧 Files skipped from review as they are similar to previous changes (8)
- test/e2e-curl/12-unknown-first/test.sh
- pkg/services/CLAUDE.md
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/run_all.sh
- pkg/services/node_pool_test.go
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/02-partial-adapters/test.sh
3ba7df1 to 41963b7
Actionable comments posted: 1
♻️ Duplicate comments (10)
test/e2e-curl/README.md (1)
23-31: ⚠️ Potential issue | 🟡 Minor

Clarify the working directory for running tests.

The commands `./run_all.sh` and `./run_all.sh 03 07` assume the reader is in `test/e2e-curl/`. From the repo root, these paths won't work. Add an explicit `cd test/e2e-curl` instruction or use repo-root-relative paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/README.md` around lines 23 - 31, The README's run instructions assume the user is already in test/e2e-curl; update the text to make the working directory explicit by adding a step that instructs the user to change into test/e2e-curl (e.g., "cd test/e2e-curl") before running ./run_all.sh, or alternatively present repo-root-relative commands that call test/e2e-curl/run_all.sh; reference the existing run_all.sh examples so readers know which scripts to run.

test/e2e-curl/run_all.sh (1)
42-50: ⚠️ Potential issue | 🟠 Major

Exit with error when filter arguments match zero tests.

If filter arguments don't match any test prefixes, `SELECTED` will be empty and the script will report "ALL TESTS PASSED (0/0)", which masks misconfiguration. Add a check after the selection loop:

```bash
if [ "${#SELECTED[@]}" -eq 0 ]; then
  echo -e "${RED}ERROR${NC}: no tests matched filters: $*"
  exit 1
fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/run_all.sh` around lines 42 - 50, After filtering tests into SELECTED from TESTS, add a post-selection guard that checks if SELECTED is empty (length zero) and, if so, prints an error including the provided filters ($*) using the existing color variables (RED and NC) and exits with status 1; place this check immediately after the else/fi block that populates SELECTED so any invocation with filters that match zero tests fails fast instead of reporting 0/0 passed.

pkg/services/node_pool.go (2)
337-340: ⚠️ Potential issue | 🟠 Major

Aggregation failure is logged but not returned to caller.

If `Upsert` succeeds but `updateNodePoolStatusFromAdapters` fails, the method returns success while the node pool's synthetic conditions remain stale. Without an event/retry mechanism, this inconsistency can persist until another write occurs.

Consider returning the aggregation error so callers can observe the failure and handle it appropriately.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 337 - 340, The call to updateNodePoolStatusFromAdapters currently logs aggregation errors but swallows them, causing callers (e.g., the Upsert flow) to return success while nodepool conditions may be stale; change the surrounding method to propagate that aggregation error: after calling updateNodePoolStatusFromAdapters(nodePoolID, observedTime, false) check if aggregateErr != nil and return or wrap it (preserving any prior success/err semantics) instead of only logging; ensure you use the same error wrapping pattern used elsewhere in this file so callers can observe and handle the failure.
332-336: ⚠️ Potential issue | 🟠 Major

Consider using the adapter's reported `observed_time` instead of `LastReportTime`.

`LastReportTime` is the API-managed receipt timestamp (when the server received the report), while the adapter's `observed_time` is when the adapter actually performed its work. For delayed or retried reports, these can differ significantly.

The comment at line 332 says "using the adapter's observed_time" but the code uses `LastReportTime`. If the spec intends for transition timestamps to reflect when the adapter observed the state (not when the API received the report), this should use `adapterStatus.ObservedTime` or equivalent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 332 - 336, The code currently sets observedTime from upsertedStatus.LastReportTime but the spec requires using the adapter's reported observed_time; change the selection to prefer adapterStatus.ObservedTime (or the struct field that carries the adapter's observed_time) when non-nil/zero, falling back to upsertedStatus.LastReportTime and finally time.Now() if neither is present; update the observedTime assignment logic (referencing observedTime, upsertedStatus.LastReportTime and adapterStatus.ObservedTime) and ensure proper nil checks and types when reading adapterStatus.ObservedTime.

test/e2e-curl/11-unknown-subsequent/test.sh (1)
33-47: ⚠️ Potential issue | 🟠 Major

Assert baseline preconditions before the Unknown event.

The setup POST for `ADAPTER1` at line 34 is not verified, and lines 36-47 don't assert that the baseline shows `Available=True` and `Ready=True`. If the precondition isn't met, the later "state unchanged" assertions could pass against an incorrect baseline.

Add assertions after capturing the baseline:

```bash
assert_eq "baseline Available.status" "True" "$PRE_AVAIL_STATUS"
assert_eq "baseline Ready.status" "True" "$PRE_READY_STATUS"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/11-unknown-subsequent/test.sh` around lines 33 - 47, After posting the adapter status with post_adapter_status and capturing the baseline with get_cluster and condition_field into PRE_AVAIL_STATUS and PRE_READY_STATUS, add assertions using assert_eq to verify the baseline preconditions (Available.status == "True" and Ready.status == "True") so the test fails early if setup didn't produce the expected Ready/Available state; place these assert_eq checks immediately after the assignments to PRE_AVAIL_STATUS and PRE_READY_STATUS (i.e., after show_state "baseline" "$BASELINE") to validate the POST succeeded before continuing.

docs/api-resources.md (2)
553-553: ⚠️ Potential issue | 🟠 Major

`last_updated_time` documentation for Available is inconsistent with PR objectives.

The current text states "For Available, always the evaluation time." However, the PR objectives specify that Available uses `min(LastReportTime)` across required adapters, and True→False transitions use the triggering adapter's `observed_time`. This mismatch will mislead API consumers.
Verify each finding against the current code and only fix it if needed. In `@docs/api-resources.md` at line 553, The docs for last_updated_time are incorrect for the Available condition: update the description for `last_updated_time` so that for Available it is the minimum of `LastReportTime` across all required adapters that report Available=True at the current generation, and specify that a True→False transition uses the triggering adapter's `observed_time` (while keeping the Ready semantics as already described: when Ready=True use min(`last_report_time`) across required adapters that report Available=True at the current generation, when Ready=False use the evaluation time).
312-313: ⚠️ Potential issue | 🟡 Minor

`observed_generation` at creation is inconsistent with examples. The Lifecycle Events table sets `obs_gen` to `1` for Creation, but the Cluster and NodePool creation examples (lines 58, 68, 404, 414) show `observed_generation: 0`. This inconsistency makes the API contract ambiguous for clients. Either update the table to use `obs_gen=0` or update the examples to use `observed_generation: 1`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/api-resources.md` around lines 312 - 313, The docs show an inconsistency between the Lifecycle Events table (which lists obs_gen = 1 on Creation) and the Cluster/NodePool creation examples (which set observed_generation: 0); pick one convention and make all occurrences match—for example, update the Cluster and NodePool creation examples (the creation YAML/JSON examples that include observed_generation) to use observed_generation: 1 so they align with the Lifecycle Events table (or alternatively change the table rows for Creation to obs_gen = 0 if you prefer that convention); ensure every example and the table use the same field name (`observed_generation` / `obs_gen`) and value.

pkg/services/cluster.go (2)
337-340: ⚠️ Potential issue | 🟠 Major

Do not return success when post-upsert aggregation fails.
After upsert succeeds, failing aggregation is only logged. That leaves cluster synthetic conditions stale with no guaranteed retry path.
🛠️ Suggested fix
```diff
 if _, aggregateErr := s.updateClusterStatusFromAdapters(ctx, clusterID, observedTime, false); aggregateErr != nil {
 	ctx = logger.WithClusterID(ctx, clusterID)
 	logger.WithError(ctx, aggregateErr).Warn("Failed to aggregate cluster status")
+	return upsertedStatus, aggregateErr
 }
 return upsertedStatus, nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 337 - 340, After a successful upsert, the code currently swallows failures from s.updateClusterStatusFromAdapters (called with clusterID and observedTime) and only logs a warning; instead propagate the failure so the caller sees the operation as failed. Replace the log-only branch so that when updateClusterStatusFromAdapters returns an aggregateErr you attach cluster context (using logger.WithClusterID/WithError) and return or wrap aggregateErr from the enclosing function (rather than continuing as success), ensuring callers and retry logic can handle the failed post-upsert aggregation.
332-336: ⚠️ Potential issue | 🟠 Major

Use the adapter-reported observed timestamp for transition timing.
Line 334 currently derives `observedTime` from `LastReportTime` (receipt time). Delayed/retried reports can skew synthetic transition times; aggregation should use the adapter's reported observed timestamp and only fall back to receipt time/now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 332 - 336, The code sets observedTime from upsertedStatus.LastReportTime but should prefer the adapter-reported observed timestamp to avoid skew from delayed receipts; update the logic that assigns observedTime (currently using observedTime and upsertedStatus.LastReportTime) to first check for the adapter-reported observed timestamp field on the status (e.g., upsertedStatus.ObservedTime or the actual adapter-observed timestamp property on the status struct), use that if present, otherwise fall back to upsertedStatus.LastReportTime, and finally default to time.Now(); ensure you reference and use the actual adapter-reported field name on upsertedStatus in place of the placeholder.

test/e2e-curl/10-stable-false/test.sh (1)
34-48: ⚠️ Potential issue | 🟠 Major

Verify the gen2 precondition before the "re-report" event.
Line 35 drops the setup response, and current baseline checks do not prove `${ADAPTER2}@gen2=True` exists before Line 55. The event can become the first gen2 report, so this test may stop covering stable-false re-evaluation.

🛠️ Suggested fix
```diff
 log_step "Setup: POST ${ADAPTER2}@gen2=True (${ADAPTER1} still missing → False@rG2)"
-post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 2 "True" > /dev/null
+CODE=$(post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 2 "True")
+assert_http "initial ${ADAPTER2}@gen2 accepted" "201" "$CODE"
@@
 assert_eq "baseline Available.status" "False" "$(condition_field "$BASELINE" Available status)"
 assert_eq "baseline Available.obsgen" "1" "$(condition_field "$BASELINE" Available observed_generation)"
 assert_eq "baseline Ready.status" "False" "$(condition_field "$BASELINE" Ready status)"
+assert_eq "baseline Ready.obsgen" "2" "$(condition_field "$BASELINE" Ready observed_generation)"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e-curl/10-stable-false/test.sh` around lines 34 - 48, The test currently posts "${ADAPTER2}@gen2=True" with post_adapter_status and discards the response, then relies on baseline checks that don't prove gen2 has been reported; change the setup to capture and verify the post result and/or re-poll the cluster until the gen2 condition is observed before proceeding: call post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 2 "True" and save its output, then use get_cluster "$CLUSTER_ID" (or re-run condition_field checks) to assert that the gen2 report is reflected (e.g., verify the appropriate condition/status or observed_generation for ADAPTER2 is present) before running the baseline assertions.
🧹 Nitpick comments (1)
pkg/services/status_aggregation.go (1)
461-464: Align the Ready LUT comment with the implemented behavior.

Line 462 says Ready=False uses evaluation time, but the code now uses `min(LastReportTime)` across required adapters with fallback to `now`. Updating this comment will prevent incorrect future changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/status_aggregation.go` around lines 461 - 464, Update the comment on the Ready LUT in the buildReadyCondition function to reflect the implemented behavior: when Ready=False, LastUpdatedTime is computed as the minimum LastReportTime across required adapters with a fallback to now (rather than always using the evaluation time), and when Ready=True it uses the earliest adapter report time; adjust the wording to mention min(LastReportTime) and the fallback to now so future readers won't assume evaluation-time-only semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e-curl/README.md`:
- Around line 55-62: The fenced code block that lists the test directory
structure (the block starting with the line showing "e2e-curl/" and the entries
common.sh, run_all.sh, 01-initial-state/test.sh) is missing a language tag;
update the opening fence from ``` to ```text in the README.md's example block so
the block is tagged as text and satisfies markdownlint MD040.
---
Duplicate comments:
In `@docs/api-resources.md`:
- Line 553: The docs for last_updated_time are incorrect for the Available
condition: update the description for `last_updated_time` so that for Available
it is the minimum of `LastReportTime` across all required adapters that report
Available=True at the current generation, and specify that a True→False
transition uses the triggering adapter's `observed_time` (while keeping the
Ready semantics as already described: when Ready=True use
min(`last_report_time`) across required adapters that report Available=True at
the current generation, when Ready=False use the evaluation time).
- Around line 312-313: The docs show an inconsistency between the Lifecycle
Events table (which lists obs_gen = 1 on Creation) and the Cluster/NodePool
creation examples (which set observed_generation: 0); pick one convention and
make all occurrences match—for example, update the Cluster and NodePool creation
examples (the creation YAML/JSON examples that include observed_generation) to
use observed_generation: 1 so they align with the Lifecycle Events table (or
alternatively change the table rows for Creation to obs_gen = 0 if you prefer
that convention); ensure every example and the table use the same field name
(`observed_generation` / `obs_gen`) and value.
In `@pkg/services/cluster.go`:
- Around line 337-340: After a successful upsert, the code currently swallows
failures from s.updateClusterStatusFromAdapters (called with clusterID and
observedTime) and only logs a warning; instead propagate the failure so the
caller sees the operation as failed. Replace the log-only branch so that when
updateClusterStatusFromAdapters returns an aggregateErr you attach cluster
context (using logger.WithClusterID/WithError) and return or wrap aggregateErr
from the enclosing function (rather than continuing as success), ensuring
callers and retry logic can handle the failed post-upsert aggregation.
- Around line 332-336: The code sets observedTime from
upsertedStatus.LastReportTime but should prefer the adapter-reported observed
timestamp to avoid skew from delayed receipts; update the logic that assigns
observedTime (currently using observedTime and upsertedStatus.LastReportTime) to
first check for the adapter-reported observed timestamp field on the status
(e.g., upsertedStatus.ObservedTime or the actual adapter-observed timestamp
property on the status struct), use that if present, otherwise fall back to
upsertedStatus.LastReportTime, and finally default to time.Now(); ensure you
reference and use the actual adapter-reported field name on upsertedStatus in
place of the placeholder.
In `@pkg/services/node_pool.go`:
- Around line 337-340: The call to updateNodePoolStatusFromAdapters currently
logs aggregation errors but swallows them, causing callers (e.g., the Upsert
flow) to return success while nodepool conditions may be stale; change the
surrounding method to propagate that aggregation error: after calling
updateNodePoolStatusFromAdapters(nodePoolID, observedTime, false) check if
aggregateErr != nil and return or wrap it (preserving any prior success/err
semantics) instead of only logging; ensure you use the same error wrapping
pattern used elsewhere in this file so callers can observe and handle the
failure.
- Around line 332-336: The code currently sets observedTime from
upsertedStatus.LastReportTime but the spec requires using the adapter's reported
observed_time; change the selection to prefer adapterStatus.ObservedTime (or the
struct field that carries the adapter's observed_time) when non-nil/zero,
falling back to upsertedStatus.LastReportTime and finally time.Now() if neither
is present; update the observedTime assignment logic (referencing observedTime,
upsertedStatus.LastReportTime and adapterStatus.ObservedTime) and ensure proper
nil checks and types when reading adapterStatus.ObservedTime.
In `@test/e2e-curl/10-stable-false/test.sh`:
- Around line 34-48: The test currently posts "${ADAPTER2}@gen2=True" with
post_adapter_status and discards the response, then relies on baseline checks
that don't prove gen2 has been reported; change the setup to capture and verify
the post result and/or re-poll the cluster until the gen2 condition is observed
before proceeding: call post_adapter_status "$CLUSTER_ID" "$ADAPTER2" 2 "True"
and save its output, then use get_cluster "$CLUSTER_ID" (or re-run
condition_field checks) to assert that the gen2 report is reflected (e.g.,
verify the appropriate condition/status or observed_generation for ADAPTER2 is
present) before running the baseline assertions.
In `@test/e2e-curl/11-unknown-subsequent/test.sh`:
- Around line 33-47: After posting the adapter status with post_adapter_status
and capturing the baseline with get_cluster and condition_field into
PRE_AVAIL_STATUS and PRE_READY_STATUS, add assertions using assert_eq to verify
the baseline preconditions (Available.status == "True" and Ready.status ==
"True") so the test fails early if setup didn't produce the expected
Ready/Available state; place these assert_eq checks immediately after the
assignments to PRE_AVAIL_STATUS and PRE_READY_STATUS (i.e., after show_state
"baseline" "$BASELINE") to validate the POST succeeded before continuing.
In `@test/e2e-curl/README.md`:
- Around line 23-31: The README's run instructions assume the user is already in
test/e2e-curl; update the text to make the working directory explicit by adding
a step that instructs the user to change into test/e2e-curl (e.g., "cd
test/e2e-curl") before running ./run_all.sh, or alternatively present
repo-root-relative commands that call test/e2e-curl/run_all.sh; reference the
existing run_all.sh examples so readers know which scripts to run.
In `@test/e2e-curl/run_all.sh`:
- Around line 42-50: After filtering tests into SELECTED from TESTS, add a
post-selection guard that checks if SELECTED is empty (length zero) and, if so,
prints an error including the provided filters ($*) using the existing color
variables (RED and NC) and exits with status 1; place this check immediately
after the else/fi block that populates SELECTED so any invocation with filters
that match zero tests fails fast instead of reporting 0/0 passed.
---
Nitpick comments:
In `@pkg/services/status_aggregation.go`:
- Around line 461-464: Update the comment on the Ready LUT in the
buildReadyCondition function to reflect the implemented behavior: when
Ready=False, LastUpdatedTime is computed as the minimum LastReportTime across
required adapters with a fallback to now (rather than always using the
evaluation time), and when Ready=True it uses the earliest adapter report time;
adjust the wording to mention min(LastReportTime) and the fallback to now so
future readers won't assume evaluation-time-only semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c09c2e74-99f1-45e3-bcfe-518874f87d0e
📒 Files selected for processing (25)
- docs/api-resources.md
- pkg/config/loader.go
- pkg/services/CLAUDE.md
- pkg/services/cluster.go
- pkg/services/cluster_test.go
- pkg/services/node_pool.go
- pkg/services/node_pool_test.go
- pkg/services/status_aggregation.go
- pkg/services/status_aggregation_test.go
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/05-mixed-generations/test.sh
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/07-all-adapters-new-gen/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/10-stable-false/test.sh
- test/e2e-curl/11-unknown-subsequent/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/README.md
- test/e2e-curl/common.sh
- test/e2e-curl/run_all.sh
- test/integration/adapter_status_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
- test/e2e-curl/12-unknown-first/test.sh
- pkg/services/CLAUDE.md
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/common.sh
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
Force-pushed from 41963b7 to 5247788
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/services/node_pool.go (1)
33-36: ⚠️ Potential issue | 🟡 Minor

Stale interface documentation contradicts new behavior.
Same issue as in `cluster.go`: the interface comment still describes accepting Unknown on first report, but the implementation now always discards Unknown Available reports per the P3 rule.

📝 Suggested documentation update
```diff
 // ProcessAdapterStatus handles the business logic for adapter status:
-// - First report: accepts Unknown Available condition, skips aggregation
-// - Subsequent reports: rejects Unknown Available condition
+// - Discards reports with Available=Unknown (P3 rule)
+// - Discards reports ahead of resource generation (P1 rule)
+// - Validates mandatory conditions (P2 rule)
 // - Otherwise: upserts the status and triggers aggregation
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 33 - 36, Update the stale interface comment for ProcessAdapterStatus in node_pool.go to reflect the current behavior: remove the note about accepting Unknown Available on the first report and instead state that Unknown Available reports are always discarded per the P3 rule; keep that the function upserts non-Unknown statuses and triggers aggregation. Also mirror the same documentation change in cluster.go where the same outdated comment appears so both ProcessAdapterStatus implementations describe the new consistent behavior.

pkg/services/cluster.go (1)
33-36: ⚠️ Potential issue | 🟡 Minor

Stale interface documentation contradicts new behavior.
The interface comment states:
- First report: accepts Unknown Available condition, skips aggregation
- Subsequent reports: rejects Unknown Available condition

But the implementation (lines 302-310) now always discards Unknown Available reports per the P3 rule, regardless of whether it's the first report. The comment should be updated to reflect the current behavior.
📝 Suggested documentation update
// ProcessAdapterStatus handles the business logic for adapter status: -// - First report: accepts Unknown Available condition, skips aggregation -// - Subsequent reports: rejects Unknown Available condition +// - Discards reports with Available=Unknown (P3 rule) +// - Discards reports ahead of resource generation (P1 rule) +// - Validates mandatory conditions (P2 rule) // - Otherwise: upserts the status and triggers aggregation🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 33 - 36, Update the stale doc comment for ProcessAdapterStatus to reflect current behavior: instead of saying the first report accepts Unknown Available and subsequent reports reject it, state that Unknown Available reports are always discarded (per the P3 rule) and thus never trigger aggregation; keep the rest of the description about upserting status and triggering aggregation for non-Unknown reports and reference the function name ProcessAdapterStatus so reviewers can find and update the comment accordingly.
♻️ Duplicate comments (2)
pkg/services/cluster.go (1)
337-340: ⚠️ Potential issue | 🟠 Major

Aggregation failure is still swallowed despite being marked as addressed.
The past review flagged that swallowing `updateClusterStatusFromAdapters` errors leaves the cluster's synthesized conditions stale. The current code still only logs the error and returns success:

```go
if _, aggregateErr := s.updateClusterStatusFromAdapters(...); aggregateErr != nil {
	logger.WithError(ctx, aggregateErr).Warn("Failed to aggregate cluster status")
}
return upsertedStatus, nil
```

This means the adapter status is persisted but cluster synthetic conditions may be inconsistent. Without a retry mechanism, this inconsistency persists until another write occurs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/services/cluster.go` around lines 337 - 340, The call to updateClusterStatusFromAdapters currently logs aggregateErr and continues, causing synthesized cluster conditions to remain stale; change the behavior in the function (the block using updateClusterStatusFromAdapters and returning upsertedStatus) to propagate the aggregation error instead of swallowing it—i.e., when updateClusterStatusFromAdapters returns a non-nil aggregateErr, return upsertedStatus together with that error (or wrap it) so callers can retry/handle the failure, ensuring callers see the failure and do not assume success.pkg/services/node_pool.go (1)
337-340: ⚠️ Potential issue | 🟠 Major

Aggregation failure is still swallowed.
Same issue as in `cluster.go`: the aggregation error is logged but not returned. The adapter status is persisted while nodepool synthetic conditions may remain stale.

```go
if _, aggregateErr := s.updateNodePoolStatusFromAdapters(...); aggregateErr != nil {
	logger.With(ctx, logger.FieldNodePoolID, nodePoolID).
		WithError(aggregateErr).Warn("Failed to aggregate nodepool status")
}
return upsertedStatus, nil
```
Verify each finding against the current code and only fix it if needed. In `@pkg/services/node_pool.go` around lines 337 - 340, The aggregation error from s.updateNodePoolStatusFromAdapters is currently only logged and swallowed; change the control flow so that when updateNodePoolStatusFromAdapters returns a non-nil aggregateErr you both log it and propagate it to the caller instead of returning nil error. Locate the call to s.updateNodePoolStatusFromAdapters in the node pool reconciliation path (the block that logs "Failed to aggregate nodepool status") and modify it to return the aggregateErr (after logging) so synthetic nodePool conditions are not left stale.
🧹 Nitpick comments (1)
test/integration/adapter_status_test.go (1)
435-498: Consider consolidating overlapping test coverage.
`TestClusterStatusPost_UnknownAvailableAlwaysDiscarded` (lines 435-498) and `TestClusterStatusPost_UnknownAvailableNeverStored` (lines 878-941) test very similar scenarios: both verify that Unknown Available reports are discarded with 204 and not stored. The main difference is the adapter name used. This is not blocking, but consolidating or differentiating these tests more clearly would improve maintainability.
Also applies to: 878-941
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/integration/adapter_status_test.go` around lines 435 - 498, The two tests TestClusterStatusPost_UnknownAvailableAlwaysDiscarded and TestClusterStatusPost_UnknownAvailableNeverStored overlap by asserting that reports with Available=Unknown are discarded (204) and not persisted; consolidate them by removing one and ensuring the remaining test covers both first and subsequent reports for multiple adapters or, alternatively, make their intent distinct by renaming one and adding an explicit differentiator (e.g., different adapter lifecycle or preexisting stored status) so each exercises a unique code path; update references to the helper newAdapterStatusRequest and calls to PostClusterStatusesWithResponse/GetClusterStatusesWithResponse accordingly to keep coverage and avoid duplicate assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e-curl/common.sh`:
- Line 6: The header comment in common.sh uses HYPERFLEET_CLUSTER_ADAPTERS but
the check_server() function prints HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER, causing
a docs mismatch; update the header comment to reference the same environment
variable name used by check_server() or list both aliases
(HYPERFLEET_CLUSTER_ADAPTERS and HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER) so they
are consistent and unambiguous for users, and ensure any examples match the
variable name used in the check_server() output.
---
Outside diff comments:
In `@pkg/services/cluster.go`:
- Around line 33-36: Update the stale doc comment for ProcessAdapterStatus to
reflect current behavior: instead of saying the first report accepts Unknown
Available and subsequent reports reject it, state that Unknown Available reports
are always discarded (per the P3 rule) and thus never trigger aggregation; keep
the rest of the description about upserting status and triggering aggregation
for non-Unknown reports and reference the function name ProcessAdapterStatus so
reviewers can find and update the comment accordingly.
In `@pkg/services/node_pool.go`:
- Around line 33-36: Update the stale interface comment for ProcessAdapterStatus
in node_pool.go to reflect the current behavior: remove the note about accepting
Unknown Available on the first report and instead state that Unknown Available
reports are always discarded per the P3 rule; keep that the function upserts
non-Unknown statuses and triggers aggregation. Also mirror the same
documentation change in cluster.go where the same outdated comment appears so
both ProcessAdapterStatus implementations describe the new consistent behavior.
---
Duplicate comments:
In `@pkg/services/cluster.go`:
- Around line 337-340: The call to updateClusterStatusFromAdapters currently
logs aggregateErr and continues, causing synthesized cluster conditions to
remain stale; change the behavior in the function (the block using
updateClusterStatusFromAdapters and returning upsertedStatus) to propagate the
aggregation error instead of swallowing it—i.e., when
updateClusterStatusFromAdapters returns a non-nil aggregateErr, return
upsertedStatus together with that error (or wrap it) so callers can retry/handle
the failure, ensuring callers see the failure and do not assume success.
In `@pkg/services/node_pool.go`:
- Around line 337-340: The aggregation error from
s.updateNodePoolStatusFromAdapters is currently only logged and swallowed;
change the control flow so that when updateNodePoolStatusFromAdapters returns a
non-nil aggregateErr you both log it and propagate it to the caller instead of
returning nil error. Locate the call to s.updateNodePoolStatusFromAdapters in
the node pool reconciliation path (the block that logs "Failed to aggregate
nodepool status") and modify it to return the aggregateErr (after logging) so
synthetic nodePool conditions are not left stale.
---
Nitpick comments:
In `@test/integration/adapter_status_test.go`:
- Around line 435-498: The two tests
TestClusterStatusPost_UnknownAvailableAlwaysDiscarded and
TestClusterStatusPost_UnknownAvailableNeverStored overlap by asserting that
reports with Available=Unknown are discarded (204) and not persisted;
consolidate them by removing one and ensuring the remaining test covers both
first and subsequent reports for multiple adapters or, alternatively, make their
intent distinct by renaming one and adding an explicit differentiator (e.g.,
different adapter lifecycle or preexisting stored status) so each exercises a
unique code path; update references to the helper newAdapterStatusRequest and
calls to PostClusterStatusesWithResponse/GetClusterStatusesWithResponse
accordingly to keep coverage and avoid duplicate assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 91a97282-e828-4c7b-8bed-4c3adf7c0b31
📒 Files selected for processing (25)
- docs/api-resources.md
- pkg/config/loader.go
- pkg/services/CLAUDE.md
- pkg/services/cluster.go
- pkg/services/cluster_test.go
- pkg/services/node_pool.go
- pkg/services/node_pool_test.go
- pkg/services/status_aggregation.go
- pkg/services/status_aggregation_test.go
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/02-partial-adapters/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/05-mixed-generations/test.sh
- test/e2e-curl/06-stale-report/test.sh
- test/e2e-curl/07-all-adapters-new-gen/test.sh
- test/e2e-curl/08-adapter-goes-false/test.sh
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/10-stable-false/test.sh
- test/e2e-curl/11-unknown-subsequent/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/README.md
- test/e2e-curl/common.sh
- test/e2e-curl/run_all.sh
- test/integration/adapter_status_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
- test/e2e-curl/11-unknown-subsequent/test.sh
- test/e2e-curl/05-mixed-generations/test.sh
- test/e2e-curl/run_all.sh
- pkg/services/node_pool_test.go
- pkg/services/CLAUDE.md
- test/e2e-curl/04-generation-bump/test.sh
- test/e2e-curl/07-all-adapters-new-gen/test.sh
- test/e2e-curl/01-initial-state/test.sh
- test/e2e-curl/12-unknown-first/test.sh
- test/e2e-curl/09-stable-true/test.sh
- test/e2e-curl/03-all-adapters-ready/test.sh
- test/e2e-curl/06-stale-report/test.sh
Force-pushed from 5247788 to b0c2cfc
xueli181114 left a comment
Code review: the Unknown discard change (P3) introduces a behavioral difference in the gen-bump scenario that could leave Available frozen at True when it should go False. See inline comment for details.
```go
ctx = logger.WithClusterID(ctx, clusterID)
logger.Info(ctx, fmt.Sprintf("Discarding adapter status update from %s: Available=Unknown reports are not processed",
	adapterStatus.Adapter))
return nil, nil
```
Uncovered scenario: gen bump + Unknown@gen2 = Available frozen at True
Discarding Unknown means the adapter's previous status (e.g. True@gen1) stays in the DB. This creates a behavioral difference vs the old code when a generation bump is involved:
Setup: Adapters [A, B], both True@gen1 → Available=True, Ready=True
Event: Generation bumps to gen2
A reports Unknown@gen2 → discarded here, A stays True@gen1 in DB
B reports True@gen2 → stored, triggers aggregation
New code — scanAdapterAvailability sees A@gen1 vs B@gen2 → consistent=false → Available frozen at True@gen1.
Old code — Unknown@gen2 would have been stored (first report at gen2). Aggregation sees A@gen2(Unknown) vs B@gen2 → consistent=true, allAvailable=false → Available set to False.
The cluster appears Available=True when one adapter is actually in an uncertain state at the new generation. This persists until A sends a definitive True/False at gen2.
Suggested test to add:
```go
// TestUnknownAtNewGen_AvailableFrozen verifies that after a generation bump,
// if one adapter reports Unknown@gen2 (discarded) and the other reports True@gen2,
// Available is correctly frozen (not actively set to False) and Ready=False.
func TestUnknownAtNewGen_AvailableFrozen(t *testing.T) {
	// 1. Both adapters True@gen1 → Available=True, Ready=True
	// 2. Bump generation to 2
	// 3. Adapter A reports Unknown@gen2 → discarded (204)
	// 4. Adapter B reports True@gen2 → stored, aggregation runs
	// 5. Assert Available still True@gen1 (frozen), Ready=False
	// — or decide this should be Available=False and fix the code
}
```

The question for the author: is "frozen at True" the intended behavior here, or should the code detect that a required adapter hasn't given a definitive answer at the new generation and actively set Available=False?
Old code: only when there is no adapter status is the Unknown status stored; it means that for an adapter deployed from the beginning, you can only see Unknown@gen1 and should never see Unknown@gen2.
"Available frozen at True at an old generation" is the correct behavior; it means:
- The resource was True for this generation, and that was updated until observed_date.
- Getting Unknown does not provide more useful information.
- Getting False, on the other hand, means that the resource is not Available for some generation later.
pkg/config/loader.go
Outdated
```go
// viper key -> ordered list of env var names (first one set wins)
clusterEnvVars := []string{
	EnvPrefix + "_ADAPTERS_REQUIRED_CLUSTER",
	EnvPrefix + "_CLUSTER_ADAPTERS", // alias for user convenience
```
I previously migrated from HYPERFLEET_CLUSTER_ADAPTERS to HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER for better clarity. Is there a strong reason to maintain two env variables for the same config? It also raises the question of what happens if both are set.
This seems an error from my side when rebasing. I'm restoring the code you introduced
pkg/services/status_aggregation.go
Outdated
```go
numReady := 0
for _, name := range requiredAdapters {
	info, exists := adapterMap[name]
	if !exists || (resourceGeneration > 0 && info.observedGeneration != resourceGeneration) {
```
Do we need `resourceGeneration > 0`, given that the default value is 1?
Not needed, I simplified it.
```go
// - P3: discards if Available == Unknown (not processed per spec)
//
// Otherwise: upserts the status and triggers aggregation.
// Returns (nil, nil) for discarded/rejected updates.
```
Maybe defining a custom error type `ErrDiscarded` (or an `isDiscarded` flag) would be better than returning nil.
This function is invoked just once; adding it, IMO (and after chatting with the LLM), introduces more complexity than it is worth.
Force-pushed from b0c2cfc to 9ddceb8
Force-pushed from 9ddceb8 to 42d9b60
https://redhat.atlassian.net/browse/HYPERFLEET-774
Important for review:
The logic for aggregation is explained in api-resources.md in the section named "Aggregation logic"
I added some external black-box test scripts (curl scripts) to manually test the state changes
Corrects LastUpdatedTime semantics for the Ready and Available synthetic conditions. Previously, LastUpdatedTime was set uniformly to time.Now() regardless of the event. Now it is computed from the actual adapter report times:
- … adapter's observed_time.
- … §2 P3). This removes a latent inconsistency where an unknown first status could linger without ever being recomputed.
- … generation.
- … adapter's observed_time through to the synthetic condition builders.
- … HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER / HYPERFLEET_ADAPTERS_REQUIRED_NODEPOOL, with first-wins precedence.
- … tests; 12 new e2e-curl scenario scripts covering initial state through stable-false and unknown transitions.
- … computeReadyLastUpdated semantics.
Test plan
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests