kubelet: fix fsnotify CHMOD event in arm architecture #107634
Conversation
/sig node
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-node-e2e-containerd
case fsnotify.Remove:
	fallthrough
case fsnotify.Chmod:
	return true, true, nil
default:
	klog.ErrorS(nil, "Received unexpected fsnotify event, retrying", "event", e)
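To make the alternatives being debated concrete, here is a self-contained sketch of the event classification (the `Op` type and `classify` are illustrative stand-ins, not the kubelet's actual API), with Chmod treated as an expected no-op rather than an error or a recreate:

```go
package main

import "fmt"

// Op models fsnotify's event bitmask with illustrative values; the real
// library defines Create, Write, Remove, Rename, and Chmod similarly.
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
	Rename
	Chmod
)

// classify mirrors waitLogs' two booleans: (new log found, file recreated).
// The proposed behavior: Chmod is an expected event but requires no action,
// so it returns false, false instead of falling into the error default.
func classify(op Op) (found, recreated, expected bool) {
	switch op {
	case Write:
		return true, false, true // new bytes to read
	case Create, Rename, Remove:
		return true, true, true // log file was rotated or recreated
	case Chmod:
		return false, false, true // mode change: nothing to re-read
	default:
		return false, false, false // truly unexpected; caller logs and retries
	}
}

func main() {
	found, recreated, _ := classify(Chmod)
	fmt.Println(found, recreated) // false false
}
```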
If case fsnotify.Chmod: is deleted, the fsnotify.Chmod event falls into default: and is logged as an unexpected fsnotify event error. We should consider whether that is appropriate: according to your report, fsnotify.Chmod may arrive at unexpected times, but it is not an unexpected kind of event, and after this change it would be indistinguishable from a truly unexpected one.
How about keeping case fsnotify.Chmod: but not processing anything for it?
However, I think we need to check that ignoring this event does not affect existing behavior on any architecture, because there might be a reason it has been handled so far.
@cyclinder can you check this comment?
I found the following comment that explains why case fsnotify.Chmod: is there:
#73041 (comment)
"I've added it as I've seen the Chmod event coming from fsnotify instead of Rename or Create when the log file was replaced with rename(2)."
However, when I tested rename(2) on Ubuntu, no Chmod event was delivered. It might depend on the environment.
Sounds a bit scary, no?
/cc @giuseppe
It seems to me that we're not actually using whether the log file was found anywhere:
kubernetes/pkg/kubelet/kuberuntime/logs/logs.go
Lines 429 to 432 in fbffe05
// waitLogs wait for the next log write. It returns two booleans and an error. The first boolean
// indicates whether a new log is found; the second boolean if the log file was recreated;
// the error is error happens during waiting new logs.
func waitLogs(ctx context.Context, id string, w *fsnotify.Watcher, runtimeService internalapi.RuntimeService) (bool, bool, error) {
I would expect chmod would neither create a new file nor recreate the existing one, so I'd expect a return value of something like false, false, nil for this case.
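To see why those booleans matter, here is a hedged sketch (followLoop is a hypothetical stand-in for the kubelet's follow loop, not its actual code): if Chmod reports true, true, the follower reopens the file on every mode change, which is exactly the endless re-read behavior from the issue, while false, false just keeps waiting.

```go
package main

import "fmt"

// event carries the two booleans a waitLogs-style call would return.
type event struct{ found, recreated bool }

// followLoop sketches how a log follower might act on them: recreated
// means reopen the file from the start, found means read the new bytes,
// and neither means keep waiting for the next notification.
func followLoop(events []event) []string {
	var actions []string
	for _, e := range events {
		switch {
		case e.recreated:
			actions = append(actions, "reopen")
		case e.found:
			actions = append(actions, "read")
		default:
			actions = append(actions, "wait")
		}
	}
	return actions
}

func main() {
	fmt.Println(followLoop([]event{
		{true, false},  // Write
		{false, false}, // Chmod under the proposed fix
		{true, true},   // Create/Rename: file rotated
	})) // [read wait reopen]
}
```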
I agree that Chmod is an expected case and we should handle it.
Are there no accompanying test changes to ensure this fix works?
I agree with @ehashman, the chmod event should be handled.
I don't see why it should behave differently on arm; it could be a kernel bug, or an issue in github.com/fsnotify/fsnotify.
Do you have a reproducer with fsnotify that returns the chmod event (which is mapped to the kernel event IN_ATTRIB) when there are no changes happening to the file?
And if it behaves differently on arm, could you strace it and see what events are received from the kernel?
/triage accepted
I extracted the key code from the kubelet and wrote a demo to test the fix; please refer to https://github.com/cyclinder/mock-kubelet/blob/main/main.go. Based on the test results, I think we should change what the Chmod case returns: with the current `true, true, nil` the log file is re-read on every Chmod event, and when I change the Chmod return to `false, false, nil` the behavior is correct. Here are some tests I did on arm:
[root@dce-18-18-10-202 kubelet-fsnotify]# ls
go.mod go.sum main.go
[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:14:22.965921 373162 main.go:534] WebServer Starting on 127.0.0.1:18080...
[root@dce-18-18-10-202 ~]# curl http://127.0.0.1:18080/container/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1?tail=50
E0303 05:11:59.562169 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:12:26.778512 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Unauthorized
E0303 05:12:57.433730 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:13:27.673790 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:13:59.205837 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:14:22.354639 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:14:44.073696 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:12.474206 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:40.233722 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:55.923740 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:16:30.474135 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: no route to host
[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:14:22.965921 373162 main.go:534] WebServer Starting on 127.0.0.1:18080...
I0303 13:16:19.852128 373162 main.go:286] success to watch file "/var/lib/containers/docker/containers/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1-json.log"
I0303 13:16:21.584216 373162 main.go:407] fsnotify.Chmod
I0303 13:16:21.591680 373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.443785 373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.451412 373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.462365 373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.469724 373162 main.go:407] fsnotify.Chmod
I0303 13:16:26.596240 373162 main.go:407] fsnotify.Chmod
I0303 13:16:26.604875 373162 main.go:407] fsnotify.Chmod
I0303 13:16:30.474513 373162 main.go:404] fsnotify.Write
I0303 13:16:31.609368 373162 main.go:407] fsnotify.Chmod
I0303 13:16:31.617111 373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.490666 373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.499424 373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.509655 373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.518956 373162 main.go:407] fsnotify.Chmod
fsnotify.Write corresponds to the generation of a new log, while Chmod does not affect it. (I don't know why Chmod keeps being reported :) ) If I change the return value of the Chmod case to `false, false, nil`:
[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:25:34.242341 418994 main.go:534] WebServer Starting on 127.0.0.1:18080...
I0303 13:25:43.286513 418994 main.go:286] success to watch file "/var/lib/containers/docker/containers/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1-json.log"
I0303 13:25:43.635576 418994 main.go:407] fsnotify.Chmod
[root@dce-18-18-10-202 ~]# curl http://127.0.0.1:18080/container/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1?tail=50
E0303 05:22:05.434183 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:22:35.514768 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: no route to host
E0303 05:22:56.448400 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:23:18.362679 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:23:45.838833 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Unauthorized
E0303 05:24:04.953697 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:24:24.350545 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:24:39.591346 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:24:56.950443 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:25:15.266495 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
[root@dce-18-18-10-202 ~]#
If I don't change anything and just keep the same code as in master, following the pod log re-reads the log in an endless loop, just as reported in the issue. So I suggest changing the return value to `false, false, nil`.
✋ I am from the v1.24 bug triage team. Just a reminder that code freeze is on 03/30. @cyclinder, if this PR is ready, can it be merged before that? :)
Hi @DiptoChakrabarty, this PR is ready; it currently needs a reviewer to review it or lgtm.
/lgtm
/lgtm cancel
I don't think my or @giuseppe's comments have been adequately addressed, and we don't have a test for this, so I'm not comfortable merging at this point.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
We still meet this issue on arm. According to the discussion with the fsnotify community, this is most likely a kernel bug, so what should we do to avoid the problem in the meantime? I still think the change in this PR makes sense; it's true that we currently don't have a test for it, but I can add one. What do you think? Any thoughts? Thanks.
@cyclinder: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: cyclinder, matthyx
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
1. The change LGTM except for the missing test case.
2. I still want to ask why CHMOD was treated as recreated in the first place. This fix looks reasonable to me: for inotify, a CHMOD (IN_ATTRIB) does not imply the file was recreated.
3. As the fsnotify owner confirms, this should be a kernel problem. There are more and more arm users, and the kernel bug fix is still not in progress.
In my opinion the kernel problem in the arm environment just surfaced the issue. In short, if the fix is logically correct and has no side effects, I think we should approve it.
Thank you for the comment, @pacoxu.
Yes, a long time ago I ran a test for this on amd: when the permissions of the container log file change, a Chmod event fires, so chmod != recreate. Treating Chmod as a recreate is also what indirectly causes the log file to be re-read all the time on arm; that is the original issue this PR is trying to fix.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: cyclinder qifeng.guo@daocloud.io
What type of PR is this?
/kind bug
What this PR does / why we need it:
Under an ARM64 environment, following a pod's log re-reads the log in an endless loop.
kubernetes/pkg/kubelet/kuberuntime/logs/logs.go
Lines 438 to 454 in fbffe05
A CHMOD on the file should not cause the kubelet to re-read the entire log file.
Refer to fsnotify/fsnotify#421: "The fsnotify.WRITE and fsnotify.CHMOD events are generated when a file is changed in arm architecture".
After my testing, each file update resulted in one WRITE event and two CHMOD events, as shown below:
![wecom-temp-2f08e0a10e1f8406a39e5278f360fa90](https://user-images.githubusercontent.com/59680092/150116600-0c352766-e511-4a27-bcc3-ffb44955b46a.png)
So I don't think we need to pay attention to the fsnotify.CHMOD event.
Which issue(s) this PR fixes:
Fixes #103500
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: