
kubelet: fix fsnotify CHMOD event in arm architecture #107634

Closed
wants to merge 1 commit

Conversation

cyclinder
Contributor

Signed-off-by: cyclinder qifeng.guo@daocloud.io

What type of PR is this?

/kind bug

What this PR does / why we need it:

On ARM64, following a pod's log with kubectl logs -f causes the log output to loop indefinitely.

for {
	select {
	case <-ctx.Done():
		return false, false, fmt.Errorf("context cancelled")
	case e := <-w.Events:
		switch e.Op {
		case fsnotify.Write:
			return true, false, nil
		case fsnotify.Create:
			fallthrough
		case fsnotify.Rename:
			fallthrough
		case fsnotify.Remove:
			fallthrough
		case fsnotify.Chmod:
			return true, true, nil
		default:
			klog.ErrorS(nil, "Received unexpected fsnotify event, retrying", "event", e)
		}
	}
}

  1. A CHMOD on the file should not force the kubelet to re-read the entire log file.

  2. See fsnotify/fsnotify#421, "The fsnotify.WRITE and fsnotify.CHMOD events are generated when a file is changed in arm architecture".

In my testing, each file update resulted in one WRITE event and two CHMOD events, as shown below:
(screenshot of the observed fsnotify events omitted)

So I don't think we need to pay attention to the fsnotify.CHMOD event.
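For reference, the mapping in the current code can be modeled as a small stand-alone function. This is a sketch, not the kubelet source: the Op constants here are local stand-ins for the github.com/fsnotify/fsnotify package, not its real values.

```go
package main

import "fmt"

// Op is a local stand-in for fsnotify.Op (illustration only; the real
// constants live in github.com/fsnotify/fsnotify).
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
	Rename
	Chmod
)

// classifyCurrent models the mapping in the current code: Write means
// "new logs found" (found=true, recreated=false), while
// Create/Rename/Remove/Chmod all report the file as recreated
// (found=true, recreated=true), which makes the follower re-read the
// whole file.
func classifyCurrent(op Op) (found, recreated bool) {
	switch op {
	case Write:
		return true, false
	case Create, Rename, Remove, Chmod:
		return true, true
	}
	return false, false
}

func main() {
	// The observed arm64 pattern: one update delivers 1 Write + 2 Chmod.
	rereads := 0
	for _, op := range []Op{Chmod, Write, Chmod} {
		if _, recreated := classifyCurrent(op); recreated {
			rereads++
		}
	}
	// Two spurious full re-reads per update explain the endless replay.
	fmt.Println(rereads) // prints 2
}
```

With Chmod treated as a recreation, every metadata-only event replays the whole file, which matches the looping behavior reported in the issue.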

Which issue(s) this PR fixes:

Fixes #103500

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 19, 2022
@cyclinder
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 19, 2022
@cyclinder
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@cyclinder
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6

@cyclinder
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@dims
Member

dims commented Feb 11, 2022

/milestone v1.24
/assign @ehashman @klueska

@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 11, 2022
Comment on lines 450 to 461
case fsnotify.Remove:
fallthrough
case fsnotify.Chmod:
return true, true, nil
default:
klog.ErrorS(nil, "Received unexpected fsnotify event, retrying", "event", e)
Member

If case fsnotify.Chmod: is deleted, a fsnotify.Chmod event falls through to default: and is logged as an unexpected fsnotify event error.
It seems we need to consider whether that is appropriate.
Although a fsnotify.Chmod event may arrive at unexpected times according to your report, it is not an unexpected kind of event.
And fsnotify.Chmod would become indistinguishable from a truly unexpected event.

How about keeping case fsnotify.Chmod: and not processing anything?
However, I think we need to check that ignoring this event will not affect existing processing on any architecture, because there might be a reason it has been handled so far.
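The suggestion above (keep an explicit Chmod case that deliberately does nothing) could look roughly like this. It is a self-contained sketch with stand-in Op constants, not the actual kubelet code:

```go
package main

import "fmt"

// Op is a local stand-in for fsnotify.Op (illustration only).
type Op uint32

const (
	Create Op = 1 << iota
	Write
	Remove
	Rename
	Chmod
)

// waitNext sketches the reviewer's suggestion: Chmod gets its own case
// that does nothing, so it keeps the loop waiting instead of being
// logged as an unexpected event or treated as a file recreation.
// It returns (found, recreated) once a meaningful event arrives.
func waitNext(events <-chan Op) (bool, bool) {
	for op := range events {
		switch op {
		case Write:
			return true, false
		case Create, Rename, Remove:
			return true, true
		case Chmod:
			// Metadata-only change: deliberately ignored, keep waiting.
		default:
			// Real code would log an unexpected event here and retry.
		}
	}
	return false, false
}

func main() {
	events := make(chan Op, 3)
	events <- Chmod
	events <- Chmod
	events <- Write
	close(events)

	found, recreated := waitNext(events)
	fmt.Println(found, recreated) // the Chmods are swallowed; prints true false
}
```

This keeps Chmod distinguishable from truly unexpected events while avoiding the spurious full re-read.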

Contributor

@cyclinder can you check this comment?


I found the following comment that explains why there is fsnotify.Chmod.

#73041 (comment)
I've added it as I've seen the Chmod event coming from fsnotify instead of Rename or Create when the log file was replaced with rename(2).

However, when I tested rename(2) on Ubuntu, no Chmod event was notified. It might depend on the environment.

Contributor

Sounds a bit scary, no?

Member

/cc @giuseppe
from earlier change

It seems to me that we're not actually using whether the log file was found anywhere:

// waitLogs wait for the next log write. It returns two booleans and an error. The first boolean
// indicates whether a new log is found; the second boolean if the log file was recreated;
// the error is error happens during waiting new logs.
func waitLogs(ctx context.Context, id string, w *fsnotify.Watcher, runtimeService internalapi.RuntimeService) (bool, bool, error) {

I would expect chmod would neither create a new file nor recreate the existing one, so I'd expect a return value of something like false, false, nil for this case.

I agree that Chmod is an expected case and we should handle it.

Are there no accompanying test changes to ensure this fix works?

Member

I agree with @ehashman, the chmod event should be handled.

I don't see why it should behave differently on arm; it could be a kernel bug or an issue in github.com/fsnotify/fsnotify.

Do you have a reproducer with fsnotify that returns the chmod event (which is mapped to the kernel event IN_ATTRIB) when there are no changes happening to the file?

And if it behaves differently on arm, could you strace it and see what events are received from the kernel?

@matthyx
Contributor

matthyx commented Feb 22, 2022

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 22, 2022
@matthyx matthyx moved this from Triage to Waiting on Author in SIG Node PR Triage Feb 22, 2022
@cyclinder
Contributor Author

I extracted the key code from the kubelet and wrote a demo to test the fix; please refer to the following: https://github.com/cyclinder/mock-kubelet/blob/main/main.go

Based on the test results, I think we should refer to the following fix:
https://github.com/cyclinder/mock-kubelet/blob/fbd1d5d1f8d5e3bf4a9b166af97f7f917be340e5/main.go#L404-418

If chmod returns false,false,nil, then found=false, which causes the kubectl logs -f command to exit (according to both the code logic and the actual test results).

When I change the chmod return value to true,false,nil, it looks as if a new log was found even though no new logs were actually generated, and everything works, so I suggest changing the return value to true,false,nil.

Here are some tests I did on the arm:

[root@dce-18-18-10-202 ~]# uname -a
Linux dce-18-18-10-202 4.19.90-23.8.v2101.ky10.aarch64 #1 SMP Mon May 17 17:07:38 CST 2021 aarch64 aarch64 aarch64 GNU/Linux
  1. go run main.go
[root@dce-18-18-10-202 kubelet-fsnotify]# ls
go.mod  go.sum  main.go
[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:14:22.965921  373162 main.go:534] WebServer Starting on 127.0.0.1:18080...
  2. Open a new terminal and simulate kubectl logs -f:
[root@dce-18-18-10-202 ~]# curl http://127.0.0.1:18080/container/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1?tail=50
E0303 05:11:59.562169 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:12:26.778512 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Unauthorized
E0303 05:12:57.433730 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:13:27.673790 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:13:59.205837 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:14:22.354639 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:14:44.073696 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:12.474206 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:40.233722 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:15:55.923740 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused





E0303 05:16:30.474135 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: no route to host
[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:14:22.965921  373162 main.go:534] WebServer Starting on 127.0.0.1:18080...
I0303 13:16:19.852128  373162 main.go:286] success to watch file "/var/lib/containers/docker/containers/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1-json.log"
I0303 13:16:21.584216  373162 main.go:407] fsnotify.Chmod
I0303 13:16:21.591680  373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.443785  373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.451412  373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.462365  373162 main.go:407] fsnotify.Chmod
I0303 13:16:22.469724  373162 main.go:407] fsnotify.Chmod
I0303 13:16:26.596240  373162 main.go:407] fsnotify.Chmod
I0303 13:16:26.604875  373162 main.go:407] fsnotify.Chmod
I0303 13:16:30.474513  373162 main.go:404] fsnotify.Write
I0303 13:16:31.609368  373162 main.go:407] fsnotify.Chmod
I0303 13:16:31.617111  373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.490666  373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.499424  373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.509655  373162 main.go:407] fsnotify.Chmod
I0303 13:16:32.518956  373162 main.go:407] fsnotify.Chmod

fsnotify.Write corresponds to a new log line being written, while Chmod has no effect. (I don't know why Chmod is reported so constantly :) )

If I change the return value for chmod to false,false,nil, the request exits as soon as a chmod event occurs.

[root@dce-18-18-10-202 kubelet-fsnotify]# go run main.go
I0303 13:25:34.242341  418994 main.go:534] WebServer Starting on 127.0.0.1:18080...


I0303 13:25:43.286513  418994 main.go:286] success to watch file "/var/lib/containers/docker/containers/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1-json.log"
I0303 13:25:43.635576  418994 main.go:407] fsnotify.Chmod
[root@dce-18-18-10-202 ~]# curl http://127.0.0.1:18080/container/9f1ecdb55e78881c458f7f0808c0a4fee3aa7f9108632614a5b9af27740d0dc1?tail=50
E0303 05:22:05.434183 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:22:35.514768 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: no route to host
E0303 05:22:56.448400 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:23:18.362679 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:23:45.838833 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Unauthorized
E0303 05:24:04.953697 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection timed out
E0303 05:24:24.350545 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:24:39.591346 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:24:56.950443 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
E0303 05:25:15.266495 3890480 leaderelection.go:320] error retrieving resource lock kube-system/dce-uds-local-storage-worker-dce-18-18-10-202: Get "https://172.31.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/dce-uds-local-storage-worker-dce-18-18-10-202": dial tcp 172.31.0.1:443: connect: connection refused
[root@dce-18-18-10-202 ~]#
[root@dce-18-18-10-202 ~]#

And if I don't change anything, keeping the same code as in master, kubectl logs -f loops over the log output indefinitely, just as reported in the issue.

So I suggest changing the return value to true,false,nil.
@ehashman @matthyx
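The reasoning above can be made concrete with a tiny model of the follower loop, based purely on the behavior described in this comment (simplified; not the actual kubelet code): the first boolean keeps the follower alive, the second forces a full re-read.

```go
package main

import "fmt"

// step models how the follower reacts to waitLogs' two booleans, per
// the description above (a simplified sketch, not kubelet source).
func step(found, recreated bool) string {
	switch {
	case !found:
		return "exit" // follower gives up: kubectl logs -f terminates
	case recreated:
		return "reread" // reopen the file and replay it from the start
	default:
		return "continue" // keep tailing for new lines only
	}
}

func main() {
	fmt.Println(step(true, true))   // current Chmod mapping: full replay
	fmt.Println(step(false, false)) // dropping the case entirely: exit
	fmt.Println(step(true, false))  // proposed true,false,nil: keep tailing
}
```

Under this model, only the true,false,nil mapping lets a metadata-only Chmod pass without either replaying the file or killing the follow.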

@DiptoChakrabarty
Member

✋ I am from the v1.24 bug triage team. Just a reminder that code freeze is on 03/30. @cyclinder, if this PR is ready, can it be merged before then? :)

@cyclinder
Contributor Author

cyclinder commented Mar 23, 2022

Hi @DiptoChakrabarty, this PR is ready; it currently needs a reviewer to review it or lgtm.
Can you take a look? Thanks.
/cc @ehashman @matthyx @giuseppe

Contributor

@matthyx matthyx left a comment


/lgtm

@matthyx matthyx moved this from Waiting on Author to Needs Approver in SIG Node PR Triage Mar 24, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2022
@ehashman
Member

/lgtm cancel
/milestone clear

I don't think my or @giuseppe's comments have been adequately addressed, and we don't have a test for this, so I'm not comfortable merging at this point.

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Mar 28, 2022
@ehashman ehashman moved this from Needs Approver to Waiting on Author in SIG Node PR Triage Mar 28, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 28, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

SIG Node PR Triage automation moved this from Waiting on Author to Done Aug 25, 2022
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cyclinder
Contributor Author

/reopen

We still hit this issue on arm.

According to the discussion with the fsnotify community, this is most likely a kernel bug, so what should we do to avoid the problem? I still think the change in this PR makes sense; we currently don't have a test for it, but I can write one.

What do you think? Any thoughts?

@klueska @pacoxu

Thanks.

@k8s-ci-robot k8s-ci-robot reopened this Nov 10, 2022
@k8s-ci-robot
Contributor

@cyclinder: Reopened this PR.

In response to this:

/reopen

We still hit this issue on arm.

According to the discussion with the fsnotify community, this is most likely a kernel bug, so what should we do to avoid the problem? I still think the change in this PR makes sense; we currently don't have a test for it, but I can write one.

What do you think? Any thoughts?

@klueska @pacoxu

Thanks.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node PR Triage automation moved this from Done to Triage Nov 10, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cyclinder, matthyx
Once this PR has been reviewed and has the lgtm label, please ask for approval from klueska by writing /assign @klueska in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pacoxu
Member

pacoxu commented Nov 10, 2022

1. The change LGTM except for the missing test case.
However, per your screenshot and description, you have already tested this.

  • a test case would be better
  • TODO: I would test for the log rotate scenario and see if this has any side effects.

2. I still want to ask why CHMOD should have meant true, true, nil in the first place. Logically, this fix is correct.

the second boolean if the log file was recreated;

Why should CHMOD mean recreated? This fix looks reasonable to me.

For inotify, chmod() is only a change to the file's metadata (its permissions).

IN_ATTRIB (*)
       Metadata changed—for example, permissions (e.g., chmod(2)),
       timestamps (e.g., utimensat(2)), extended attributes
       (setxattr(2)), link count (since Linux 2.6.25; e.g., for the
       target of link(2) and for unlink(2)), and user/group ID
       (e.g., chown(2)).
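The IN_ATTRIB point above can be checked directly: chmod(2) touches only metadata, so the file's contents are byte-identical before and after, and a log follower has nothing new to read. A small stdlib-only Go illustration:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// contentUnchangedAfterChmod writes a file, changes its mode bits, and
// reports whether the contents are identical afterwards. chmod(2) only
// changes metadata (the IN_ATTRIB event), never the data itself.
func contentUnchangedAfterChmod() bool {
	path := filepath.Join(os.TempDir(), "chmod-demo.log")
	defer os.Remove(path)

	if err := os.WriteFile(path, []byte("one log line\n"), 0o644); err != nil {
		panic(err)
	}
	before, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	if err := os.Chmod(path, 0o600); err != nil {
		panic(err)
	}
	after, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	return bytes.Equal(before, after)
}

func main() {
	fmt.Println(contentUnchangedAfterChmod()) // prints true
}
```

Since the data is untouched, treating CHMOD as a file recreation (and replaying the whole log) has no justification.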

3. As the fsnotify owner confirms, this should be a kernel problem.

There are more and more arm users, and the kernel bug fix is still not in progress.

  • I prefer to fix it on the kubelet side as soon as possible.

In my opinion, the kernel problem in the arm environment merely surfaced the issue.
For amd64 users, if someone chmods a log file, kubectl logs -f will reprint logs that have already been printed. It is still a big problem.


In short, if there is no side effect and the change is logically correct, I think we should approve this fix.

  • Let me do some further tests on log rotation and get back to you.

@cyclinder
Contributor Author

Thank you for the comment. @pacoxu

In my opinion, the kernel problem in the arm environment merely surfaced the issue.
For amd64 users, if someone chmods a log file, kubectl logs -f will reprint logs that have already been printed. It is still a big problem.

Yes, I ran a test for this on amd a long time ago: when the permissions of the container log file changed, kubectl logs -f re-read the entire file, which obviously doesn't make sense.

So it should be: chmod != recreate.

This also indirectly causes the log file to be replayed continuously on arm, which is the original issue this PR is trying to fix.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node PR Triage automation moved this from Triage to Done Dec 10, 2022
Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kubectl logs with the --follow args, the log loops indefinitely