docs: NO-JIRA: clarify serialization tag behaviour in api/AGENTS.md #8328

Merged
enxebre merged 1 commit into openshift:main from enxebre:enxebre/api-agents-serialization-tags
Apr 27, 2026
Conversation

@enxebre
Member

@enxebre enxebre commented Apr 24, 2026

Summary

  • Explain why omitempty/omitzero should be set on every field regardless of +required or +optional: the tag controls serialization (what goes on the wire), not validation (enforced at admission via the schema). Without the tag, zero-value fields serialize as explicit values, breaking defaulting, server-side apply, and strategic merge patch.
  • Add upstream Kubernetes CRD docs link alongside the downstream OpenShift API conventions, with explicit note that downstream wins on conflicts.

Test plan

  • Docs-only change, no code impact

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation

  • Updated API convention guidance to rely on the linter as the enforcement source and clarified precedence between OpenShift and upstream Kubernetes guidance.
  • Strengthened serialization guidance: fields must use explicit omit-style tagging to control wire representation, with notes on scalar vs struct usage, behavioral impacts when zero values are serialized, and tooling/runtime expectations.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot

@enxebre: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2026
@enxebre enxebre force-pushed the enxebre/api-agents-serialization-tags branch from af8e045 to bad810b Compare April 24, 2026 11:18
@enxebre enxebre changed the title docs: NO-JIRA: clarify serialization tag conventions in api/AGENTS.md NO-JIRA: clarify serialization tag conventions in api/AGENTS.md Apr 24, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and sjenning April 24, 2026 11:19
@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API and removed do-not-merge/needs-area labels Apr 24, 2026
@enxebre enxebre force-pushed the enxebre/api-agents-serialization-tags branch from bad810b to add4f98 Compare April 24, 2026 11:20
@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3b9c21c0-381e-4946-9048-10034fe1b8a0

📥 Commits

Reviewing files that changed from the base of the PR and between 21b69ae and 7adff1e.

📒 Files selected for processing (1)
  • api/AGENTS.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/AGENTS.md

📝 Walkthrough

Walkthrough

The pull request updates api/AGENTS.md to (1) replace the static conventions link with guidance to rely on kube-api-linter (use make api-lint-fix) as the enforcement source, referencing OpenShift conventions as authoritative and upstream Kubernetes conventions as informational with downstream (this repo) taking precedence when conflicts occur, and (2) strengthen serialization rules by requiring every API field to have either omitempty or omitzero regardless of +required/+optional, clarifying these tags control on‑the‑wire representation (not admission validation), noting concrete behavioral breakages if zero values are serialized, and refining guidance on scalar vs struct usage and repository Go version expectations for omitzero.

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: clarifying serialization tag behavior in api/AGENTS.md documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only api/AGENTS.md documentation file with no Ginkgo test patterns found, making test name check not applicable. |
| Test Structure And Quality | ✅ Passed | This PR is a documentation-only change to api/AGENTS.md with no test files or test code, making the custom Ginkgo test code quality check inapplicable. |
| Microshift Test Compatibility | ✅ Passed | This PR is a documentation-only change to api/AGENTS.md with no new Ginkgo e2e tests added. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR is a documentation-only change to api/AGENTS.md with no tests added. The SNO compatibility check targets new Ginkgo e2e tests, making it not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Pull request exclusively updates api/AGENTS.md documentation file with no changes to deployment manifests, operator code, controllers, or scheduling configurations. |
| Ote Binary Stdout Contract | ✅ Passed | This pull request is a documentation-only change that updates api/AGENTS.md with no modifications to any Go source code files. The OTE Binary Stdout Contract check specifically targets stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), etc.) that would corrupt test listing JSON output. Since this PR contains only documentation changes (+3/-4 lines in a markdown file) with no code modifications, there is no executable code to evaluate for stdout contract violations. The check is not applicable to documentation-only changes. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The PR modifies only api/AGENTS.md documentation; no Ginkgo e2e tests are added or modified, so this check does not apply. |

@enxebre enxebre force-pushed the enxebre/api-agents-serialization-tags branch from add4f98 to 21b69ae Compare April 24, 2026 11:21
@enxebre enxebre changed the title NO-JIRA: clarify serialization tag conventions in api/AGENTS.md docs: NO-JIRA: clarify serialization tag behaviour in api/AGENTS.md Apr 24, 2026
Comment thread api/AGENTS.md Outdated
Comment on lines +11 to +12
- **Downstream (authoritative):** [OpenShift API conventions](https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md)
- **Upstream (informational):** [Kubernetes API conventions](https://github.com/kubernetes/community/blob/main/contributors/devel/sig-architecture/api-conventions.md)
Contributor

Both of these docs are huge. Is telling the agent to read so much context going to cause context rot? Some of the serialization material you add here is already mentioned in these docs, but it's roughly in the middle, so I suspect the agent forgets it because we are giving it too much.

I think those docs need a lot of work to make them agent accessible.

Worth teaching the agent to trust the linter more? And save it from having to consume any conventions docs that cannot be linted?

Member Author

Worth teaching the agent to trust the linter more? And save it from having to consume any conventions docs that cannot be linted?

I see those as different goals. We can and do require the agent building blocks to run the linter, and we can make that more deterministic with git/cc hooks or similar. What I'm after here is improving the agent's ability to reason about API machinery and understand why the linter does what it does, not enforcing that the linter runs.

I agree on the signal/noise potential problem, and this has not given good results so far, so let's change it.

Comment thread api/AGENTS.md Outdated
### Serialization

- `omitempty` **does nothing for non-pointer structs.** Only `omitzero` correctly omits a struct field when it equals its zero value. This is a Go encoding/json behavior, not a Kubernetes convention.
- **Always set `omitempty` or `omitzero` on every field, regardless of whether it is `+required` or `+optional`.** These tags control serialization, not validation. `+required` is a schema constraint enforced at admission time; the serialization tag controls what goes on the wire. Without a tag, a zero-value field serializes as an explicit value (e.g., `"pullSecret": {"name": ""}`), which makes the API server unable to distinguish "not set" from "explicitly set to empty." This breaks defaulting, server-side apply field ownership, and strategic merge patch — all of which rely on field absence to mean "don't touch this."
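
A minimal Go sketch of the two serialization points quoted above, assuming Go 1.24 or later (the release in which `encoding/json` began recognizing `omitzero`). The type names `LocalObjectReference`, `NoTag`, `WithOmitEmpty`, and `WithOmitZero` are hypothetical and exist only for illustration; they are not taken from the repository's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LocalObjectReference is a hypothetical reference type used for illustration.
type LocalObjectReference struct {
	Name string `json:"name"`
}

// NoTag has no omit-style tag on its struct field.
type NoTag struct {
	PullSecret LocalObjectReference `json:"pullSecret"`
}

// WithOmitEmpty uses omitempty, which encoding/json ignores for struct values.
type WithOmitEmpty struct {
	PullSecret LocalObjectReference `json:"pullSecret,omitempty"`
}

// WithOmitZero uses omitzero, which omits the field when it equals its zero value.
type WithOmitZero struct {
	PullSecret LocalObjectReference `json:"pullSecret,omitzero"`
}

// marshal returns the JSON encoding of v as a string and panics on error.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(NoTag{}))         // {"pullSecret":{"name":""}}
	fmt.Println(marshal(WithOmitEmpty{})) // {"pullSecret":{"name":""}}
	fmt.Println(marshal(WithOmitZero{}))  // {} (on Go 1.24+)
}
```

The first two outputs are identical, which is the sense in which `omitempty` has no effect on a non-pointer struct; only the `omitzero` variant leaves the key off the wire.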
Contributor

Be explicit about which tags?

Suggested change
- **Always set `omitempty` or `omitzero` on every field, regardless of whether it is `+required` or `+optional`.** These tags control serialization, not validation. `+required` is a schema constraint enforced at admission time; the serialization tag controls what goes on the wire. Without a tag, a zero-value field serializes as an explicit value (e.g., `"pullSecret": {"name": ""}`), which makes the API server unable to distinguish "not set" from "explicitly set to empty." This breaks defaulting, server-side apply field ownership, and strategic merge patch — all of which rely on field absence to mean "don't touch this."
- **Always set `omitempty` or `omitzero` on every field, regardless of whether it is `+required` or `+optional`.** `omitempty`/`omitzero` tags control serialization, not validation. `+required` is a schema constraint enforced at admission time; the serialization tag controls what goes on the wire. Without a tag, a zero-value field serializes as an explicit value (e.g., `"pullSecret": {"name": ""}`), which makes the API server unable to distinguish "not set" from "explicitly set to empty." This breaks defaulting, server-side apply field ownership, and strategic merge patch — all of which rely on field absence to mean "don't touch this."

The other element of this is that, without omission, a structured client serializes the empty object as you've said, and that passes a +required check, which is based on key presence and does not inspect the value. Most often folks add +required expecting it to mean non-empty, which it doesn't. So an API user can forget to set a value for the required field, the object still passes the requiredness check, and the reader then sees a required field with an empty value, which is unexpected.
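
The key-presence point can be sketched in Go as follows. This is an illustration under stated assumptions, not the API server's validation code: `hasKey` is a hypothetical, simplified stand-in for a presence-based required check, and `SpecNoOmit` and `SpecOmit` are made-up types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SpecNoOmit is hypothetical: a required field with no omit-style tag.
type SpecNoOmit struct {
	// +required
	Name string `json:"name"`
}

// SpecOmit is hypothetical: the same required field with omitempty.
type SpecOmit struct {
	// +required
	Name string `json:"name,omitempty"`
}

// hasKey reports whether key is present in the top-level JSON object.
// It only checks presence, never the value, mimicking a required check
// that is based on key presence.
func hasKey(data []byte, key string) bool {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(data, &obj); err != nil {
		return false
	}
	_, ok := obj[key]
	return ok
}

// mustMarshal returns the JSON encoding of v and panics on error.
func mustMarshal(v any) []byte {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// The caller forgot to set Name in both cases.
	a := mustMarshal(SpecNoOmit{})
	b := mustMarshal(SpecOmit{})
	fmt.Printf("%s key present: %t\n", a, hasKey(a, "name")) // {"name":""} key present: true
	fmt.Printf("%s key present: %t\n", b, hasKey(b, "name")) // {} key present: false
}
```

With the omit tag in place, the forgotten value results in a missing key, so a presence-based check has something to reject; without it, the empty string is sent and the check is satisfied.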

@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.08%. Comparing base (6daa9ce) to head (7adff1e).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8328   +/-   ##
=======================================
  Coverage   36.08%   36.08%           
=======================================
  Files         767      767           
  Lines       93485    93485           
=======================================
  Hits        33737    33737           
  Misses      57041    57041           
  Partials     2707     2707           

Explain why omitempty/omitzero should be set on every field regardless
of whether it is required or optional: the tag controls serialization
behaviour (what goes on the wire), not validation (which is enforced
at admission time via the schema). Without the tag, zero-value fields
serialize as explicit values, breaking defaulting, server-side apply,
and strategic merge patch.

Add upstream Kubernetes API conventions link alongside the downstream
OpenShift API conventions, with a note that downstream wins on
conflicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@enxebre enxebre force-pushed the enxebre/api-agents-serialization-tags branch from 21b69ae to 7adff1e Compare April 24, 2026 12:26
@JoelSpeed
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 24, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Apr 24, 2026

Test Results

e2e-aks

e2e-aws

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2047663076571877376 | Cost: $2.03 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2047663076479602688 | Cost: $3.83 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

@enxebre: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-v2-aws | 7adff1e | link | true | /test e2e-v2-aws |
| ci/prow/e2e-kubevirt-aws-ovn-reduced | 7adff1e | link | true | /test e2e-kubevirt-aws-ovn-reduced |
| ci/prow/e2e-aws | 7adff1e | link | true | /test e2e-aws |
| ci/prow/e2e-aks | 7adff1e | link | true | /test e2e-aks |
| ci/prow/e2e-aws-upgrade-hypershift-operator | 7adff1e | link | true | /test e2e-aws-upgrade-hypershift-operator |
| ci/prow/e2e-azure-self-managed | 7adff1e | link | true | /test e2e-azure-self-managed |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed
  • Build ID: 2047663076571877376
  • Target: e2e-azure-self-managed
  • Result: 18 failures out of 59 tests (31% failure rate)
  • Duration: Test step ran for 2h0m49s (hit the 2h timeout)

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2047663076479602688
  • Target: e2e-aws
  • Result: 28 failures out of 97 tests (29% failure rate)
  • Duration: Test step ran for 2h0m1s (hit the 2h timeout)

Test Failure Analysis

Error

Multiple hosted clusters failed to become available across both jobs.
Common errors:
- "dial tcp: lookup api-*.service.ci.hypershift.devcluster.openshift.com: no such host"
- "dial tcp <ip>:443: i/o timeout"
- "client rate limiter Wait returned an error: context deadline exceeded"
- "EtcdAvailable=False: EtcdWaitingForQuorum(Waiting for etcd to reach quorum)"
- "KubeletNotReady: container runtime network not ready: NetworkPluginNotReady: no CNI configuration file"
- "Failed to wait for control plane components to complete rollout in 30m0s: context deadline exceeded"

Summary

Both jobs (AWS and Azure) experienced widespread, identical infrastructure-level failures unrelated to PR #8328 (a documentation-only change to api/AGENTS.md). The failures affect 10+ independent test cases across both cloud providers with the same pattern: hosted cluster API servers fail DNS resolution, then time out on TCP connections, nodes fail to become ready because CNI is not configured, etcd cannot reach quorum, and control plane components stall waiting for dependencies (primarily kube-apiserver and openshift-apiserver). Both jobs ran in the same CI namespace (ci-op-spt6fn1c) on build01, and both hit the 2-hour test timeout. Apart from TestHAEtcdChaos, which uses a pre-provisioned cluster, the only test that did not fail was TestNodePoolMultiArch, which was skipped. This is a CI infrastructure flake, not a code regression.

Root Cause

The root cause is CI infrastructure instability affecting the management cluster's ability to provision and manage hosted clusters. The failure chain is:

  1. DNS resolution failures: Immediately after hosted cluster creation, API server DNS entries (api-*.service.ci.hypershift.devcluster.openshift.com for AWS, api-*.aks-e2e.hypershift.azure.devcluster.openshift.com for Azure) failed to resolve via the cluster DNS (172.30.0.10:53), returning "no such host" errors.

  2. Network connectivity failures: After DNS eventually resolved, TCP connections to the API server IPs timed out (multiple different IPs tried across retries), indicating the load balancers or network paths were not functional.

  3. Control plane not bootstrapping: Due to connectivity issues, etcd could not reach quorum (EtcdWaitingForQuorum), KubeAPIServer deployments were not found, and the hosted control plane could not be created (kubeconfig never published for several clusters).

  4. Worker nodes not joining: Nodes that did come up had KubeletNotReady with NetworkPluginNotReady — the CNI plugin never received configuration because the control plane was not available to provide it.

  5. Cascading timeouts: The API client rate limiter became saturated from repeated retry attempts across 10+ parallel hosted clusters, causing "client rate limiter Wait returned an error: context deadline exceeded" across all tests. Control plane upgrade tests failed because dependent components (kube-apiserver, openshift-apiserver) never rolled out, blocking all downstream components.

  6. AWS credential issue (AWS job only): TestCreateClusterRequestServingIsolation showed AWSEndpointAvailable=False due to missing service account token (/var/run/secrets/openshift/serviceaccount/token: no such file or directory), suggesting pod-level credential mounting issues on the management cluster.

The PR (docs: NO-JIRA: clarify serialization tag behaviour in api/AGENTS.md) changes only a Markdown documentation file and cannot have caused these failures. The identical failure pattern across two independent cloud providers (AWS and Azure) running in the same CI namespace confirms this is an infrastructure issue.

Recommendations
  1. Retest the PR — These failures are CI infrastructure flakes unrelated to the documentation change. A /retest should be sufficient.

  2. No code changes needed — PR docs: NO-JIRA: clarify serialization tag behaviour in api/AGENTS.md #8328 modifies only api/AGENTS.md (documentation) and cannot affect hosted cluster provisioning, DNS resolution, or network connectivity.

  3. If failures persist on retest, investigate:

    • The build01 CI cluster's DNS resolution health and external DNS operator status
    • Management cluster node resource pressure (both jobs ran in ci-op-spt6fn1c namespace)
    • AWS/Azure credential provisioning for the devcluster.openshift.com management clusters
    • Whether the service.ci.hypershift.devcluster.openshift.com or aks-e2e.hypershift.azure.devcluster.openshift.com DNS zones are healthy
Evidence

| Evidence | Detail |
| --- | --- |
| PR Content | Documentation-only: docs: NO-JIRA: clarify serialization tag behaviour in api/AGENTS.md |
| Failure scope | 18/59 tests failed (Azure), 28/97 tests failed (AWS); ~30% failure rate across both |
| Shared CI namespace | Both jobs ran in ci-op-spt6fn1c on build01 |
| DNS failures (AWS) | lookup api-node-pool-hv5nq.service.ci.hypershift.devcluster.openshift.com on 172.30.0.10:53: no such host (10+ clusters affected) |
| DNS failures (Azure) | lookup api-azure-oauth-lb-r4pnj.aks-e2e.hypershift.azure.devcluster.openshift.com on 172.30.0.10:53: no such host |
| TCP timeouts (AWS) | dial tcp 34.237.231.140:443: i/o timeout, dial tcp 98.83.123.233:443: i/o timeout (multiple IPs) |
| Etcd quorum | EtcdAvailable=False: EtcdWaitingForQuorum across TestCreateCluster, TestCreateClusterPrivate, TestUpgradeControlPlane, TestCreateClusterRequestServingIsolation |
| CNI not ready (AWS) | KubeletNotReady: NetworkPluginNotReady: no CNI configuration file in /etc/kubernetes/cni/net.d/ on nodes ip-10-0-1-67.ec2.internal, ip-10-0-0-58.ec2.internal |
| Rate limiter saturation | 24 rate limiter errors (AWS), 12 rate limiter errors (Azure) |
| Control plane upgrade stuck | 15+ components stuck at WaitingForDependencies(kube-apiserver, openshift-apiserver) |
| AWS credential error | open /var/run/secrets/openshift/serviceaccount/token: no such file or directory in TestCreateClusterRequestServingIsolation |
| Pod restarts (AWS) | etcd-metrics (12 restarts), ovnkube-control-plane (5 restarts), packageserver (3 restarts), cluster-node-tuning-operator (2 restarts) |
| Test timeout | Both jobs hit 2h test step timeout: Azure at 15:52:47Z, AWS at 15:51:57Z |
| Tests that passed | Only TestHAEtcdChaos passed (uses pre-provisioned cluster), confirming new cluster provisioning was broken |

@enxebre enxebre merged commit b4192d4 into openshift:main Apr 27, 2026
29 of 36 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

4 participants