OCPBUGS-18287,OCPBUGS-19480: Update to Kubernetes 1.26.9 #1715

Merged

Conversation

suleymanakbas91

No description provided.

tnqn and others added 30 commits April 13, 2023 11:17
… creation

The topology.kubernetes.io/zone label may be added by the cloud provider
asynchronously after the Node is created. The previous code didn't
update the topology cache after receiving the Node update event, causing
TopologyAwareHint to not work until kube-controller-manager restarts or
other Node events trigger the update.

Signed-off-by: Quan Tian <qtian@vmware.com>
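For context, a minimal sketch (not the actual EndpointSlice controller code) of the kind of check the fix implies: on a Node update, compare the zone label and refresh the topology cache only when it has changed. The helper names here are illustrative.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

const zoneLabel = "topology.kubernetes.io/zone"

// zoneChanged reports whether a Node update changed the zone label, which is
// the signal to refresh the topology cache (hypothetical helper).
func zoneChanged(oldNode, newNode *v1.Node) bool {
	return oldNode.Labels[zoneLabel] != newNode.Labels[zoneLabel]
}

func onNodeUpdate(oldNode, newNode *v1.Node) {
	if zoneChanged(oldNode, newNode) {
		fmt.Printf("zone label changed on node %s; refreshing topology cache\n", newNode.Name)
		// The real controller would update its topology cache here.
	}
}

func main() {
	oldNode := &v1.Node{}
	oldNode.Name = "worker-0"
	newNode := oldNode.DeepCopy()
	// Simulate the cloud provider adding the zone label after Node creation.
	newNode.Labels = map[string]string{zoneLabel: "us-east-1a"}
	onNodeUpdate(oldNode, newNode)
}
```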
The member variable `cpuRatiosByZone` should be accessed with the lock
acquired, as it could be updated by `SetNodes` concurrently.

Signed-off-by: Quan Tian <qtian@vmware.com>
Co-authored-by: Antonio Ojea <aojea@google.com>
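A generic illustration of the locking pattern the fix applies, with a simplified stand-in type; the field and method names mirror the description above but are not the real TopologyCache implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// topologyCache is a simplified stand-in for the real cache type.
type topologyCache struct {
	lock            sync.Mutex
	cpuRatiosByZone map[string]float64
}

// SetNodes may be called concurrently, so it takes the lock before writing.
func (c *topologyCache) SetNodes(ratios map[string]float64) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.cpuRatiosByZone = ratios
}

// HasPopulatedHints reads cpuRatiosByZone; before the fix this kind of read
// happened without the lock, racing with SetNodes.
func (c *topologyCache) HasPopulatedHints() bool {
	c.lock.Lock()
	defer c.lock.Unlock()
	return len(c.cpuRatiosByZone) > 0
}

func main() {
	c := &topologyCache{}
	c.SetNodes(map[string]float64{"zone-a": 0.5, "zone-b": 0.5})
	fmt.Println(c.HasPopulatedHints()) // true
}
```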
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This allows for a small time jump backward after
certificate generation.

Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
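The general idea, sketched with the standard library's crypto/x509; the 5-minute backdate below is only an illustrative value, not necessarily the offset kubeadm uses.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}

	// Backdate NotBefore so a small clock jump backward right after generation
	// does not make the CA appear "not yet valid". The 5-minute offset is an
	// illustrative value.
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-5 * time.Minute),
		NotAfter:              time.Now().Add(10 * 365 * 24 * time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, _ := x509.ParseCertificate(der)
	fmt.Println("CA valid from:", cert.NotBefore)
}
```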
When attempting to record a new Event and a new Series on the apiserver
at the same time, the patch of the Series might happen before the Event
is actually created. In that case, we handle the error and try to create
the Event. But the Event might also be created during that window, and
today that is treated as an error. So in order to handle that scenario,
we need to retry when a Create call for a Series results in an
AlreadyExists error.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
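A sketch of the retry flow described above, using the real apimachinery error helpers but hypothetical createEvent/patchEventSeries placeholders in place of the actual client calls.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// createEvent and patchEventSeries are hypothetical placeholders standing in
// for the real client calls.
var (
	createEvent      = func() error { return nil }
	patchEventSeries = func() error { return nil }
)

// recordSeries illustrates the flow: patch the series, fall back to creating
// the Event if it does not exist yet, and retry the patch if the Create races
// with another writer and returns AlreadyExists.
func recordSeries() error {
	err := patchEventSeries()
	if err == nil {
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	// The Event was not there when we patched; try to create it.
	if err := createEvent(); err != nil {
		if apierrors.IsAlreadyExists(err) {
			// Someone created the Event in the meantime: patch again
			// instead of treating this as a failure.
			return patchEventSeries()
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(recordSeries())
}
```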
There was a data race in the recordToSink function that caused changes
to the events cache to be overridden if events were emitted
simultaneously via Eventf calls.

The race lies in the fact that when recording an Event, there might be
multiple calls updating the cache simultaneously. The lock period is
optimized so that after updating the cache with the new Event, the lock
is unlocked until the Event is recorded on the apiserver side and then
the cache is locked again to be updated with the new value returned by
the apiserver.

There are a few problems with this approach:

1. If two identical Events are emitted successively, the changes of the
   second Event will override the first one. In code, the following
   happens:
   1. Eventf(ev1)
   2. Eventf(ev2)
   3. Lock cache
   4. Set cache[getKey(ev1)] = &ev1
   5. Unlock cache
   6. Lock cache
   7. Update cache[getKey(ev2)] = &ev1 + Series{Count: 1}
   8. Unlock cache
   9. Start attempting to record the first event &ev1 on the apiserver side.

   This can be mitigated by recording a copy of the Event stored in the
   cache instead of reusing the pointer from the cache (see the sketch
   after this commit message).

2. When the Event has been recorded on the apiserver the cache is
   updated again with the value of the Event returned by the server.
   This update will override any changes made to the cache entry when
   attempting to record the new Event since the cache was unlocked at
   that time. This might lead to some inconsistencies when dealing with
   EventSeries since the count may be overridden or the client might even
   try to record the first isomorphic Event multiple times.

   This could be mitigated with a lock that has a larger scope, but we
   shouldn't reflect the Event returned by the apiserver in the cache in
   the first place, since that mutation could mess with the aggregation
   by either allowing users to manipulate values to update a different
   cache entry or even ending up with two cache entries for the same
   Event.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
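A minimal illustration of mitigation 1 (record a deep copy of the cached Event rather than the cached pointer), with the cache type simplified for the example.

```go
package main

import (
	"fmt"
	"sync"

	eventsv1 "k8s.io/api/events/v1"
)

type eventCache struct {
	mu    sync.Mutex
	items map[string]*eventsv1.Event
}

// eventToRecord returns a deep copy of the cached Event so that later cache
// updates (e.g. bumping the series count for a new isomorphic Event) cannot
// mutate the object that is in flight to the apiserver.
func (c *eventCache) eventToRecord(key string) *eventsv1.Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	ev, ok := c.items[key]
	if !ok {
		return nil
	}
	return ev.DeepCopy()
}

func main() {
	c := &eventCache{items: map[string]*eventsv1.Event{
		"key": {Reason: "Scheduled"},
	}}
	inFlight := c.eventToRecord("key")
	// A concurrent Eventf-style update to the cache does not touch the copy.
	c.items["key"].Reason = "changed-after-the-fact"
	fmt.Println(inFlight.Reason) // still "Scheduled"
}
```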
The kube-apiserver validation expects the Count of an EventSeries to be
at least 2, otherwise it rejects the Event. There was a discrepancy
between the client and the server, since the client was initializing an
EventSeries with a count of 1.

According to the original KEP, the first event emitted should have an
EventSeries set to nil and the second isomorphic event should have an
EventSeries with a count of 2. Thus, we should match the behavior
defined by the KEP and update the client.

Also, as an effort to make the old clients compatible with the servers,
we should allow Events with an EventSeries count of 1 to prevent any
unexpected rejections.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
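A sketch of the corrected client-side behavior: the first Event carries no Series and the second isomorphic Event starts the Series at 2. This is a simplified helper, not the actual client-go events code.

```go
package main

import (
	"fmt"

	eventsv1 "k8s.io/api/events/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// seriesForIsomorphicEvent returns the Series to attach when an isomorphic
// Event is observed again. The first occurrence has Series == nil; the second
// starts the count at 2, matching the KEP and the apiserver validation.
func seriesForIsomorphicEvent(cached *eventsv1.Event) *eventsv1.EventSeries {
	if cached.Series == nil {
		return &eventsv1.EventSeries{Count: 2, LastObservedTime: metav1.NowMicro()}
	}
	return &eventsv1.EventSeries{
		Count:            cached.Series.Count + 1,
		LastObservedTime: metav1.NowMicro(),
	}
}

func main() {
	first := &eventsv1.Event{Reason: "Scheduled"}    // first occurrence: no Series
	fmt.Println(seriesForIsomorphicEvent(first).Count) // 2
}
```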
…ax seats

Max seats from the priority & fairness work estimator is now min(0.15 x
nominalCL, nominalCL/handSize)

'Max seats' calculated by work estimator is currently hard coded to 10.
When using lower values for --max-requests-inflight, a single
LIST request taking up 10 seats could end up using all if not most seats in
the priority level. This change updates the default work estimator
config such that 'max seats' is at most 10% of the
maximum concurrency limit for a priority level, with an upper limit of 10.
This ensures the seats taken by a LIST request are proportional to the
total available seats.

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
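A worked sketch of the formula: min(0.15 x nominalCL, nominalCL/handSize), capped at 10 as described above. The flooring and the minimum of one seat below are illustrative assumptions, not necessarily the exact estimator code.

```go
package main

import (
	"fmt"
	"math"
)

// maxSeats sketches min(0.15 * nominalCL, nominalCL / handSize), capped at 10.
// The cap comes from the text above; the flooring and minimum of 1 are
// illustrative assumptions.
func maxSeats(nominalCL, handSize int) int {
	seats := math.Min(0.15*float64(nominalCL), float64(nominalCL)/float64(handSize))
	seats = math.Min(seats, 10)
	if seats < 1 {
		return 1
	}
	return int(seats)
}

func main() {
	// With a small nominal concurrency limit, a LIST no longer consumes most
	// of the priority level's seats.
	fmt.Println(maxSeats(20, 8))  // min(3, 2.5) -> 2
	fmt.Println(maxSeats(200, 8)) // min(30, 25) -> capped at 10
}
```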
image_list.go is one of the files included in the non-test variant of the Go build list, but its getSampleDevicePluginPod function references the readDaemonSetV1OrDie function defined in device_plugin_test.go, which is included only in the test variant of the Go build list (its file name matches *_test.go).

As a result, "go build" fails with an undefined reference error.

In practice, that may not be an issue, since k8s project contributors aren't meant to run go build on this package. However, tools that rely on go build to operate - e.g., gopls or govulncheck ./... - will report this as an error.

Fix this error and make the test/e2e package pass go build by making this file test-only source code as well.
…y-pick-of-#118601-origin-release-1.26

Automated cherry pick of kubernetes#118601: priority & fairness: support dynamic max seats
…f-#118549-upstream-release-1.26

Automated cherry pick of kubernetes#118549: fix 'pod' in kubelet prober metrics
…ick-of-#118922-upstream-release-1.26

Automated cherry pick of kubernetes#118922: kubeadm: backdate generated CAs
…-pick-of-#114237-kubernetes#114236-kubernetes#112334-upstream-release-1.26

Automated cherry pick of kubernetes#114237: tools/events: retry on AlreadyExist for Series
kubernetes#114236: tools/events: fix data race when emitting series
kubernetes#112334: events: fix EventSeries starting count discrepancy
…of-#117245-kubernetes#117249-upstream-release-1.26

Automated cherry pick of kubernetes#117245: Fix TopologyAwareHint not working when zone label is added
kubernetes#117249: Fix a data race in TopologyCache
…ck-of-#117710-upstream-release-1.26

Automated cherry pick of kubernetes#117710: e2e_node: move getSampleDevicePluginPod to
When kubelet initializes, it runs admission for pods and possibly
allocates the requested resources. We need to distinguish between a
node reboot (no containers running) and a kubelet restart (containers
potentially running).

Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step
and handle it explicitly. We need to inform the devicemanager about
which pods are already running.

Note that if the container runtime is down when kubelet restarts, the
approach implemented here won't work: on kubelet restart, containers
will again fail admission, hitting kubernetes#118559 again. This
scenario should, however, be pretty rare.

Signed-off-by: Francesco Romani <fromani@redhat.com>
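A highly simplified sketch of the decision described above, with hypothetical types and names (the real devicemanager code is structured differently): containers already running at kubelet startup keep their existing device assignment instead of going through allocation again.

```go
package main

import "fmt"

// containerState is a hypothetical stand-in for the information the
// devicemanager needs at admission time.
type containerState struct {
	name    string
	running bool // reported by the container runtime at kubelet startup
}

// admitContainer sketches the decision: on kubelet restart, containers that
// are already running keep their existing device assignment and are not
// re-run through allocation; on node reboot nothing is running, so devices
// are allocated normally.
func admitContainer(c containerState, allocate func(name string) error) error {
	if c.running {
		fmt.Printf("container %s already running, keeping existing device assignment\n", c.name)
		return nil
	}
	return allocate(c.name)
}

func main() {
	allocate := func(name string) error {
		fmt.Printf("allocating devices for %s\n", name)
		return nil
	}
	_ = admitContainer(containerState{name: "gpu-workload", running: true}, allocate)
	_ = admitContainer(containerState{name: "gpu-workload", running: false}, allocate)
}
```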
Fix e2e device manager tests.
Most notably, the workload pods need to survive a kubelet
restart. Update tests to reflect that.

--- 1.26 backport notice ---
moved from custom gomega matcher to check functions
because in 1.26 and below the needed gomega deps (types, matcher)
were not added, and we didn't want to pull them in for this PR.

This is a reimplementation of the same concepts, no change in test
scope or behavior is expected.

Signed-off-by: Francesco Romani <fromani@redhat.com>
The recently added e2e device plugin test covering node reboot works
fine when it runs on a fresh environment each time (e.g. CI), but it
doesn't correctly handle a partial setup when run repeatedly on the
same instance (developer setup).

To accommodate both flows, we extend the error handling, checking more
error conditions in the flow.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Make sure orphaned pods (pods deleted while kubelet is down) are
handled correctly.
Outline:
1. create a pod (not a static pod)
2. stop kubelet
3. while kubelet is down, force delete the pod on the API server (see
   the sketch after this commit message)
4. restart kubelet
The pod becomes an orphaned pod and is expected to be killed by
HandlePodCleanups.

There is a similar test already, but here we want to check device
assignment.

Signed-off-by: Francesco Romani <fromani@redhat.com>
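Step 3 of the outline corresponds to a zero-grace-period delete; below is a minimal client-go sketch of such a force delete, assuming a kubeconfig at the default path and hypothetical namespace/pod names (the e2e test uses the framework client instead).

```go
package main

import (
	"context"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig at ~/.kube/config.
	cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Force delete: grace period 0 removes the API object immediately, even
	// though kubelet is down and cannot acknowledge the deletion. The pod then
	// becomes an orphan that HandlePodCleanups must reap after restart.
	grace := int64(0)
	err = client.CoreV1().Pods("default").Delete(context.TODO(), "device-test-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &grace})
	if err != nil {
		panic(err)
	}
}
```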
One of the factors that made issues kubernetes#118559 and
kubernetes#109595 hard to debug and fix is that the devicemanager has
very few logs in important flows, so it's unnecessarily hard to
reconstruct the state from logs.

We add minimal logs to improve troubleshooting, keeping the change
backport-friendly and deferring a more comprehensive review of logging
to later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
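For illustration, structured logging in the klog style used across kubelet; the messages and keys below are examples, not the exact log lines added by this change.

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Structured key/value logging; illustrative messages only.
	klog.InfoS("Device allocation requested",
		"pod", "default/gpu-workload", "resource", "example.com/gpu", "devices", 2)
	klog.V(4).InfoS("Pod already running, reusing existing device assignment",
		"pod", "default/gpu-workload")
}
```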
…20.7 and update protoc

Signed-off-by: Jeremy Rickard <jeremyrrickard@gmail.com>
[release-1.26] releng/go: Bump images, versions and deps to use Go 1.20.7
openshift-ci-robot removed the backports/validated-commits label Sep 26, 2023
@openshift-ci-robot

@suleymanakbas91: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.


soltysh commented Sep 26, 2023

/test unit
/test integration
/hold cancel

openshift-ci bot removed the do-not-merge/hold label Sep 26, 2023

soltysh commented Sep 26, 2023

/override ci/prow/verify-commits
this never passes on a k8s bump PR


soltysh commented Sep 26, 2023

/remove-label backports/unvalidated-commits
/label backports/validated-commits

openshift-ci bot added the backports/validated-commits label and removed the backports/unvalidated-commits label Sep 26, 2023

openshift-ci bot commented Sep 26, 2023

@soltysh: Overrode contexts on behalf of soltysh: ci/prow/verify-commits

In response to this:

/override ci/prow/verify-commits
this never passes on a k8s bump PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


openshift-ci bot commented Sep 26, 2023

@suleymanakbas91: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify-commits | 46910c6 | link | true | /test verify-commits |

Full PR test history. Your PR dashboard.



soltysh left a comment


/lgtm
/approve


soltysh commented Sep 26, 2023

/label backport-risk-assessed

openshift-ci bot added the backport-risk-assessed label Sep 26, 2023
openshift-ci bot added the lgtm label Sep 26, 2023

openshift-ci bot commented Sep 26, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, suleymanakbas91

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Sep 26, 2023

gangwgr commented Sep 26, 2023

/label cherry-pick-approved

openshift-ci bot added the cherry-pick-approved label Sep 26, 2023
openshift-merge-robot merged commit 52589e6 into openshift:release-4.13 Sep 26, 2023
22 checks passed
@openshift-ci-robot

@suleymanakbas91: Jira Issue OCPBUGS-18287: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-18287 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-19480: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-19480 has been moved to the MODIFIED state.

In response to this:


suleymanakbas91 deleted the bump-1.26.9 branch September 26, 2023 16:39
@openshift-merge-robot

Fix included in accepted release 4.13.0-0.nightly-2023-09-27-193040

Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
- backports/validated-commits: Indicates that all commits come to merged upstream PRs.
- cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
- jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
- jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
- lgtm: Indicates that a PR is ready to be merged.
- qe-approved: Signifies that QE has signed off on this PR.
- vendor-update: Touching vendor dir or related files.