OCPBUGS-18287,OCPBUGS-19480: Update to Kubernetes 1.26.9 #1715
Conversation
… creation The topology.kubernetes.io/zone label may be added by the cloud provider asynchronously after the Node is created. The previous code didn't update the topology cache after receiving the Node update event, causing TopologyAwareHint to not work until kube-controller-manager restarts or another Node event triggers the update. Signed-off-by: Quan Tian <qtian@vmware.com>
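A minimal sketch (not the upstream implementation) of the check the Node update handler needs: compare the zone label between the old and new Node labels and refresh the topology cache when it changes. The `zoneChanged` helper is hypothetical; the real controller works on `*v1.Node` objects.

```go
package main

import "fmt"

const zoneLabel = "topology.kubernetes.io/zone"

// zoneChanged reports whether the topology.kubernetes.io/zone label differs
// between the old and new copies of a Node's labels; when it does, the
// controller needs to refresh its topology cache.
func zoneChanged(oldLabels, newLabels map[string]string) bool {
	return oldLabels[zoneLabel] != newLabels[zoneLabel]
}

func main() {
	oldLabels := map[string]string{}                        // zone label not yet set by the cloud provider
	newLabels := map[string]string{zoneLabel: "us-east-1a"} // label added asynchronously later
	fmt.Println("refresh topology cache:", zoneChanged(oldLabels, newLabels))
}
```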
The member variable `cpuRatiosByZone` should be accessed with the lock acquired, as it could be updated by `SetNodes` concurrently. Signed-off-by: Quan Tian <qtian@vmware.com> Co-authored-by: Antonio Ojea <aojea@google.com>
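Illustrative only: the locking pattern described above, with a zone-ratio map guarded by a mutex so readers cannot race with a concurrent `SetNodes` writer. The struct below is a simplified stand-in, not the upstream TopologyCache type.

```go
package main

import (
	"fmt"
	"sync"
)

// topologyCache sketches the rule: every access to cpuRatiosByZone,
// read or write, happens with the mutex held.
type topologyCache struct {
	mu              sync.RWMutex
	cpuRatiosByZone map[string]float64
}

// SetNodes replaces the per-zone CPU ratios; it may run concurrently
// with readers, so it takes the write lock.
func (c *topologyCache) SetNodes(ratios map[string]float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cpuRatiosByZone = ratios
}

// ratioFor reads a single zone's ratio under the read lock.
func (c *topologyCache) ratioFor(zone string) float64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.cpuRatiosByZone[zone]
}

func main() {
	c := &topologyCache{cpuRatiosByZone: map[string]float64{}}
	c.SetNodes(map[string]float64{"us-east-1a": 0.5})
	fmt.Println(c.ratioFor("us-east-1a"))
}
```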
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This allows for a small backward time jump after certificate generation. Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
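A rough sketch of the backdating idea: set the certificate's NotBefore slightly in the past so a small backward clock jump right after generation doesn't make the CA appear not-yet-valid. The one-hour offset below is an assumption for illustration, not necessarily the value kubeadm uses.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	// Backdating NotBefore tolerates clock skew or a small backward time
	// jump immediately after the CA is generated. The one-hour value is an
	// illustrative assumption.
	backdate := time.Hour
	now := time.Now()

	template := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example-ca"},
		NotBefore:    now.Add(-backdate),
		NotAfter:     now.AddDate(10, 0, 0),
		IsCA:         true,
	}
	fmt.Println("certificate valid from:", template.NotBefore.Format(time.RFC3339))
}
```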
When attempting to record a new Event and a new Series on the apiserver at the same time, the patch of the Series might happen before the Event is actually created. In that case, we handle the error and try to create the Event. But the Event might have been created in the meantime, and today that is treated as an error. To handle that scenario, we need to retry when a Create call for a Series results in an AlreadyExists error. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
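A simplified sketch of the retry flow described above. The `eventClient` interface and the sentinel error are stand-ins; real client-go code detects the conflict with `apierrors.IsAlreadyExists`.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the apiserver's AlreadyExists error,
// which real code detects with apierrors.IsAlreadyExists.
var errAlreadyExists = errors.New("already exists")

type event struct{ name string }

// eventClient is a hypothetical stand-in for the events client.
type eventClient interface {
	Patch(name string) error
	Create(ev event) error
}

// recordSeries first tries to patch the existing Event with the new series
// data. If that fails it creates the Event; if the create races with another
// writer and returns AlreadyExists, it retries the patch instead of treating
// the race as a fatal error.
func recordSeries(c eventClient, ev event) error {
	if err := c.Patch(ev.name); err == nil {
		return nil
	}
	if err := c.Create(ev); err != nil {
		if errors.Is(err, errAlreadyExists) {
			return c.Patch(ev.name) // the Event appeared in the meantime: retry the patch
		}
		return err
	}
	return nil
}

func main() { fmt.Println("sketch of the patch -> create -> retry-patch flow") }
```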
There was a data race in the recordToSink function that caused changes to the events cache to be overridden if events were emitted simultaneously via Eventf calls. The race lies in the fact that when recording an Event, there might be multiple calls updating the cache simultaneously. The lock period is optimized so that after updating the cache with the new Event, the lock is released until the Event is recorded on the apiserver side, and then the cache is locked again to be updated with the value returned by the apiserver. There are a few problems with this approach:
1. If two identical Events are emitted in quick succession, the changes of the second Event override the first one. In code, the following happens:
   1. Eventf(ev1)
   2. Eventf(ev2)
   3. Lock cache
   4. Set cache[getKey(ev1)] = &ev1
   5. Unlock cache
   6. Lock cache
   7. Update cache[getKey(ev2)] = &ev1 + Series{Count: 1}
   8. Unlock cache
   9. Start attempting to record the first event &ev1 on the apiserver side.
   This can be mitigated by recording a copy of the Event stored in the cache instead of reusing the pointer from the cache.
2. When the Event has been recorded on the apiserver, the cache is updated again with the value of the Event returned by the server. This update overrides any changes made to the cache entry while the Event was being recorded, since the cache was unlocked at that time. This might lead to inconsistencies when dealing with EventSeries, since the count may be overridden or the client might even try to record the first isomorphic Event multiple times. This could be mitigated with a lock that has a larger scope, but we shouldn't reflect the Event returned by the apiserver in the cache in the first place, since mutation could mess with the aggregation by either allowing users to manipulate values to update a different cache entry or even having two cache entries for the same Event.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
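The first mitigation mentioned above, sketched with simplified stand-in types: take a value copy of the cached Event while holding the lock and send the copy to the apiserver, so later cache updates cannot be clobbered through the shared pointer.

```go
package main

import (
	"fmt"
	"sync"
)

type event struct {
	key   string
	count int
}

type recorder struct {
	mu    sync.Mutex
	cache map[string]*event
}

// snapshotForSend updates the cache entry and returns a copy of it while the
// lock is held; the copy is what gets sent to the apiserver, so concurrent
// Eventf calls that mutate the cached entry cannot affect the in-flight request.
func (r *recorder) snapshotForSend(key string) event {
	r.mu.Lock()
	defer r.mu.Unlock()
	ev, ok := r.cache[key]
	if !ok {
		ev = &event{key: key}
		r.cache[key] = ev
	}
	ev.count++
	return *ev // value copy, detached from the cache entry
}

func main() {
	r := &recorder{cache: map[string]*event{}}
	first := r.snapshotForSend("pod-failed")
	second := r.snapshotForSend("pod-failed")
	fmt.Println(first.count, second.count) // 1 2 -- each send sees its own snapshot
}
```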
The kube-apiserver validation expects the Count of an EventSeries to be at least 2, otherwise it rejects the Event. There was a discrepancy between the client and the server, since the client was initializing an EventSeries with a count of 1. According to the original KEP, the first event emitted should have an EventSeries set to nil, and the second isomorphic event should have an EventSeries with a count of 2. Thus, we should match the behavior defined by the KEP and update the client. Also, in an effort to keep old clients compatible with the servers, we should allow Events with an EventSeries count of 1 to prevent any unexpected rejections. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
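A tiny sketch of the counting rule described above: the first occurrence carries no Series, the second isomorphic occurrence starts the Series at 2. The types are simplified stand-ins for the Events API, not the real client-go structs.

```go
package main

import "fmt"

// eventSeries mirrors the idea of the Events API EventSeries: it only exists
// once an event has been observed more than once.
type eventSeries struct{ Count int32 }

type event struct {
	Reason string
	Series *eventSeries
}

// observe records one more occurrence of an isomorphic event. The first
// occurrence has Series == nil; the second occurrence creates the series
// with Count == 2, matching the kube-apiserver validation (Count >= 2).
func observe(prev *event, reason string) *event {
	if prev == nil {
		return &event{Reason: reason} // first occurrence: no series
	}
	if prev.Series == nil {
		prev.Series = &eventSeries{Count: 2}
	} else {
		prev.Series.Count++
	}
	return prev
}

func main() {
	var ev *event
	for i := 0; i < 3; i++ {
		ev = observe(ev, "BackOff")
	}
	fmt.Println(ev.Series.Count) // 3
}
```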
…ax seats Max seats from the priority & fairness work estimator is now min(0.15 x nominalCL, nominalCL/handSize). 'Max seats' calculated by the work estimator is currently hard coded to 10. When using lower values for --max-requests-inflight, a single LIST request taking up 10 seats could end up using all, if not most, of the seats in the priority level. This change updates the default work estimator config such that 'max seats' is at most 10% of the maximum concurrency limit for a priority level, with an upper limit of 10. This ensures the seats taken by a LIST request are proportional to the total available seats. Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
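A hedged worked example of the formula quoted above, min(0.15 x nominalCL, nominalCL/handSize) capped at the old hard-coded limit of 10, with made-up numbers; the exact rounding and clamping in the upstream work estimator may differ.

```go
package main

import "fmt"

// maxSeats sketches the dynamic limit described above: at most 15% of the
// priority level's nominal concurrency limit, no more than nominalCL/handSize,
// and never above the old hard-coded cap of 10.
func maxSeats(nominalCL, handSize int) int {
	seats := int(0.15 * float64(nominalCL))
	if byHand := nominalCL / handSize; byHand < seats {
		seats = byHand
	}
	if seats > 10 {
		seats = 10
	}
	if seats < 1 {
		seats = 1
	}
	return seats
}

func main() {
	// With --max-requests-inflight low (e.g. nominalCL = 20), a LIST no longer
	// takes 10 of 20 seats; it is limited to 3 here.
	fmt.Println(maxSeats(20, 6))  // 3
	fmt.Println(maxSeats(600, 6)) // 10 (capped)
}
```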
image_list.go is one of the files included in the non-test variant of the Go build list, but its getSampleDevicePluginPod function references the readDaemonSetV1OrDie function defined in device_plugin_test.go, which is included only in the test variant of the Go build list (the file name is *_test.go). As a result, "go build" fails with an undefined reference error. In practice, that may not be an issue, since k8s project contributors aren't meant to run go build on this package. However, tools that depend on go build to operate (e.g., gopls or govulncheck ./...) will report this as an error. Fix this error and make the test/e2e package pass go build by moving this code into test-only source code as well.
…y-pick-of-#118601-origin-release-1.26 Automated cherry pick of kubernetes#118601: priority & fairness: support dynamic max seats
…f-#118549-upstream-release-1.26 Automated cherry pick of kubernetes#118549: fix 'pod' in kubelet prober metrics
…ick-of-#118922-upstream-release-1.26 Automated cherry pick of kubernetes#118922: kubeadm: backdate generated CAs
…-pick-of-#114237-kubernetes#114236-kubernetes#112334-upstream-release-1.26 Automated cherry pick of kubernetes#114237: tools/events: retry on AlreadyExist for Series kubernetes#114236: tools/events: fix data race when emitting series kubernetes#112334: events: fix EventSeries starting count discrepancy
…of-#117245-kubernetes#117249-upstream-release-1.26 Automated cherry pick of kubernetes#117245: Fix TopologyAwareHint not working when zone label is added kubernetes#117249: Fix a data race in TopologyCache
…ck-of-#117710-upstream-release-1.26 Automated cherry pick of kubernetes#117710: e2e_node: move getSampleDevicePluginPod to
When kubelet initializes, it runs admission for pods and possibly allocates requested resources. We need to distinguish between a node reboot (no containers running) and a kubelet restart (containers potentially running). Running pods should always survive a kubelet restart. This means that device allocation should not be attempted on admission, because if a container requires devices and is still running when kubelet restarts, that container already has devices allocated and working. Thus, we need to properly detect this scenario in the allocation step and handle it explicitly. We need to inform the devicemanager about which pods are already running. Note that if the container runtime is down when kubelet restarts, the approach implemented here won't work: on kubelet restart, containers will again fail admission, hitting kubernetes#118559 again. This scenario should however be pretty rare. Signed-off-by: Francesco Romani <fromani@redhat.com>
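A rough sketch of the distinction described above, with hypothetical names (`runningPods`, `podDevices`): on a kubelet restart, pods the runtime reports as already running keep their existing device assignments instead of going through allocation again.

```go
package main

import "fmt"

// deviceManager is a simplified stand-in for the kubelet device manager state:
// which pods the container runtime reports as running, and which devices each
// pod already has assigned (recovered from the checkpoint).
type deviceManager struct {
	runningPods map[string]bool
	podDevices  map[string][]string
}

// allocate decides whether a pod needs fresh device allocation during admission.
// On node reboot nothing is running, so allocation proceeds; on a bare kubelet
// restart, a pod that is still running keeps its previously assigned devices.
func (m *deviceManager) allocate(podUID string, requested int) ([]string, error) {
	if m.runningPods[podUID] {
		// Kubelet restart with the container still running: reuse the
		// existing assignment rather than re-admitting and failing.
		return m.podDevices[podUID], nil
	}
	if requested > 0 {
		// Placeholder for real allocation from the healthy device pool.
		return []string{"dev-0"}, nil
	}
	return nil, nil
}

func main() {
	m := &deviceManager{
		runningPods: map[string]bool{"pod-a": true},
		podDevices:  map[string][]string{"pod-a": {"dev-7"}},
	}
	devs, _ := m.allocate("pod-a", 1)
	fmt.Println("pod-a keeps:", devs) // pod-a keeps: [dev-7]
}
```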
Fix e2e device manager tests. Most notably, the workload pods need to survive a kubelet restart. Update the tests to reflect that. --- 1.26 backport notice --- Moved from a custom gomega matcher to check functions, because in 1.26 and below the needed gomega deps (types, matcher) were not added, and we didn't want to pull them in for this PR. This is a reimplementation of the same concepts; no change in test scope or behavior is expected. Signed-off-by: Francesco Romani <fromani@redhat.com>
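A sketch of the check-function style mentioned in the backport note, with hypothetical names: a plain function that returns an error replaces the custom gomega matcher, so no extra gomega packages are needed.

```go
package main

import "fmt"

// podResources is a simplified stand-in for a pod-resources API response.
type podResources struct {
	PodName string
	Devices []string
}

// checkPodHasDevices shows the check-function style: a plain function
// returning an error, instead of a custom gomega matcher, so the test can
// assert with a simple nil check.
func checkPodHasDevices(pr podResources, want int) error {
	if len(pr.Devices) != want {
		return fmt.Errorf("pod %q: expected %d devices, got %d", pr.PodName, want, len(pr.Devices))
	}
	return nil
}

func main() {
	if err := checkPodHasDevices(podResources{PodName: "sample-device-plugin-pod"}, 1); err != nil {
		fmt.Println("check failed:", err)
	}
}
```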
The recently added e2e device plugin test covering node reboot works fine when it always runs in a fresh environment (e.g. CI), but it doesn't correctly handle the partial setup left behind when it is run repeatedly on the same instance (developer setup). To accommodate both flows, we extend the error management, checking more error conditions in the flow. Signed-off-by: Francesco Romani <fromani@redhat.com>
Make sure orphaned pods (pods deleted while kubelet is down) are handled correctly. Outline:
1. create a pod (not a static pod)
2. stop kubelet
3. while kubelet is down, force delete the pod on the API server
4. restart kubelet
The pod becomes an orphaned pod and is expected to be killed by HandlePodCleanups. There is a similar test already, but here we want to check device assignment. Signed-off-by: Francesco Romani <fromani@redhat.com>
One of the factors making issues kubernetes#118559 and kubernetes#109595 hard to debug and fix is that the devicemanager has very few logs in important flows, so it is unnecessarily hard to reconstruct the state from the logs. We add minimal logs to improve troubleshooting while remaining backport-friendly, deferring a more comprehensive review of logging to later PRs. Signed-off-by: Francesco Romani <fromani@redhat.com>
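An illustrative example of the kind of structured log lines added for troubleshooting, using klog/v2 structured logging; the message text and key names here are assumptions, not the exact upstream strings.

```go
package main

import (
	"k8s.io/klog/v2"
)

func main() {
	// Example of backport-friendly structured log lines for the devicemanager
	// flows discussed above; messages and keys are illustrative only.
	podUID, containerName, device := "1234-abcd", "gpu-worker", "nvidia.com/gpu-0"

	klog.V(3).InfoS("devicemanager: reusing existing device assignment",
		"podUID", podUID, "containerName", containerName, "device", device)
	klog.InfoS("devicemanager: pods to be removed from allocation state",
		"podUIDs", []string{podUID})
	klog.Flush()
}
```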
…20.7 and update protoc Signed-off-by: Jeremy Rickard <jeremyrrickard@gmail.com>
[release-1.26] releng/go: Bump images, versions and deps to use Go 1.20.7
/test unit
/override ci/prow/verify-commits
/remove-label backports/unvalidated-commits
@soltysh: Overrode contexts on behalf of soltysh: ci/prow/verify-commits
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@suleymanakbas91: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/lgtm
/approve
/label backport-risk-assessed
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: soltysh, suleymanakbas91. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/label cherry-pick-approved
Merged commit 52589e6 into openshift:release-4.13
@suleymanakbas91: Jira Issue OCPBUGS-18287: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18287 has been moved to the MODIFIED state. Jira Issue OCPBUGS-19480: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-19480 has been moved to the MODIFIED state. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Fix included in accepted release 4.13.0-0.nightly-2023-09-27-193040
No description provided.