Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaking test- TestRotateLogs from kubelet/logs/container_log_manager_test #123414

Closed
Rajalakshmi-Girish opened this issue Feb 21, 2024 · 19 comments · Fixed by #123469
Closed

Flaking test- TestRotateLogs from kubelet/logs/container_log_manager_test #123414

Rajalakshmi-Girish opened this issue Feb 21, 2024 · 19 comments · Fixed by #123469
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Rajalakshmi-Girish
Copy link
Contributor

Rajalakshmi-Girish commented Feb 21, 2024

Which jobs are flaking?

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-unit-windows-master
https://prow.k8s.io/job-history/gs/ppc64le-kubernetes/logs/periodic-kubernetes-unit-test-ppc64le

Which tests are flaking?

TestRotateLogs from kubelet/logs/container_log_manager_test

Since when has it been flaking?

Probably since the merge of change #114301

Testgrid link

https://testgrid.k8s.io/sig-windows-signal#windows-unit-master

https://testgrid.k8s.io/ibm-unit-tests-ppc64le#Periodic%20unit%20test%20suite%20on%20ppc64le

Reason for failure (if possible)

=== FAIL: pkg/kubelet/logs TestRotateLogs (0.08s)
    container_log_manager_test.go:176: 
        	Error Trace:	C:/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:176
        	Error:      	Not equal: 
        	            	expected: "test-log-3.00000000-000001.gz"
        	            	actual  : "test-log-3"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.00000000-000001.gz
        	            	+test-log-3
        	Test:       	TestRotateLogs
    container_log_manager_test.go:177: 
        	Error Trace:	C:/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:177
        	Error:      	Not equal: 
        	            	expected: "test-log-3.20240218-033956"
        	            	actual  : "test-log-3.00000000-000001"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.20240218-033956
        	            	+test-log-3.00000000-000001
        	Test:       	TestRotateLogs

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@Rajalakshmi-Girish Rajalakshmi-Girish added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 21, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 21, 2024
@kannon92
Copy link
Contributor

/triage accepted
/cc @dims

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 21, 2024
@kannon92
Copy link
Contributor

/cc @harshanarayana

@kannon92
Copy link
Contributor

/priority important-longterm

Seems to impact on this arch only?

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 21, 2024
@dims
Copy link
Member

dims commented Feb 21, 2024

cc @harshanarayana

@kannon92
Copy link
Contributor

/sig windows

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Feb 21, 2024
@harshanarayana
Copy link
Contributor

@dims @kannon92 ack.. Checking

@harshanarayana
Copy link
Contributor

@dims @kannon92 what is the easiest way for me to test this one locally ? using qemu based vm ?

@dims
Copy link
Member

dims commented Feb 22, 2024

@harshanarayana poke @Rajalakshmi-Girish and @mkumatag on slack please

@Rajalakshmi-Girish
Copy link
Contributor Author

@harshanarayana unfortunately, I couldn't reproduce this flake locally.
Will update here once I am able to.

@Rajalakshmi-Girish
Copy link
Contributor Author

I am now able to see the flake,
when I run the test within a Pod with a resource limit as our CI job does.
Steps to reproduce:

  1. git clone https://github.com/kubernetes/kubernetes
  2. cd kubernetes
  3. go test k8s.io/kubernetes/pkg/kubelet/logs.TestRotateLogs -c -race
  4. stress ./logs.test -test.run TestRotateLogs
root@testpod1:/workspace/kubernetes# stress ./logs.test -test.run TestRotateLogs
Please find the attached trace file 
[stress_output.txt](https://github.com/kubernetes/kubernetes/files/14374677/stress_output.txt)

@harshanarayana
Copy link
Contributor

@Rajalakshmi-Girish This is with ppc64 or on a native x86_64 as well ?

@Rajalakshmi-Girish
Copy link
Contributor Author

@Rajalakshmi-Girish This is with ppc64 or on a native x86_64 as well ?

I could reproduce this even on x86_64.
Also, the failures do happen even when run on the VM itself and NOT ONLY within the pod as I mentioned above.

The flakiness % is higher when run from a pod rather than a VM.

@harshanarayana
Copy link
Contributor

I was able to reproduce this on regular machines as well with -count 1000 while running tests. Let me check and see what needs to be done to fix this.

@harshanarayana
Copy link
Contributor

diff --git a/pkg/kubelet/logs/container_log_manager_test.go b/pkg/kubelet/logs/container_log_manager_test.go
index 7d6e7055274..a2c5a097b25 100644
--- a/pkg/kubelet/logs/container_log_manager_test.go
+++ b/pkg/kubelet/logs/container_log_manager_test.go
@@ -157,12 +157,12 @@ func TestRotateLogs(t *testing.T) {
 		},
 	}
 	f.SetFakeContainers(testContainers)
-	go c.processQueueItems(ctx, 1)
 	require.NoError(t, c.rotateLogs(ctx))
+	go c.processQueueItems(ctx, 1)
 
 	pollTimeoutCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
 	defer cancel()
-	err = wait.PollUntilContextCancel(pollTimeoutCtx, 20*time.Millisecond, false, func(ctx context.Context) (done bool, err error) {
+	err = wait.PollUntilContextCancel(pollTimeoutCtx, 50*time.Millisecond, false, func(ctx context.Context) (done bool, err error) {
 		return c.queue.Len() == 0, nil
 	})
 	require.NoError(t, err)

This seem to take care of the issue. Do you think the poll interval of 50ms is too much ? cc @dims @kannon92 @Rajalakshmi-Girish

@harshanarayana
Copy link
Contributor

❯ stress ./logs.test -test.run TestRotateLogs -test.v
5s: 0 runs so far, 0 failures
10s: 48 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 146 runs so far, 0 failures
25s: 196 runs so far, 0 failures
30s: 244 runs so far, 0 failures
35s: 292 runs so far, 0 failures
40s: 339 runs so far, 0 failures
45s: 387 runs so far, 0 failures

@harshanarayana
Copy link
Contributor

Spoke too early.

1m30s: 812 runs so far, 4 failures (0.49%)

/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/go-stress-20240223T144024-4097897728
=== RUN   TestRotateLogs
    container_log_manager_test.go:173:
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:173
        	Error:      	"[- test-log-1 - test-log-2.20240223-144155 - test-log-3 - test-log-3.00000000-000001 - test-log-3.00000000-000001.tmp - test-log-4]" should have 5 item(s), but has 6
        	Test:       	TestRotateLogs
    container_log_manager_test.go:176:
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:176
        	Error:      	Not equal:
        	            	expected: "test-log-3.00000000-000001.gz"
        	            	actual  : "test-log-3"

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.00000000-000001.gz
        	            	+test-log-3
        	Test:       	TestRotateLogs
    container_log_manager_test.go:177:
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:177
        	Error:      	Not equal:
        	            	expected: "test-log-3.20240223-144155"
        	            	actual  : "test-log-3.00000000-000001"

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.20240223-144155
        	            	+test-log-3.00000000-000001
        	Test:       	TestRotateLogs
    container_log_manager_test.go:178:
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:178
        	Error:      	Not equal:
        	            	expected: "test-log-4"
        	            	actual  : "test-log-3.00000000-000001.tmp"

        	            	Diff:
        	            	--- Expected

…
1m35s: 851 runs so far, 5 failures (0.59%)

@harshanarayana
Copy link
Contributor

/assign

@harshanarayana
Copy link
Contributor

/triage accepted

@harshanarayana
Copy link
Contributor

I0223 15:36:45.203710   20383 container_log_manager.go:242] "Prrocessing container for log rotation" id="container-not-need-rotate" name="" labels=null
I0223 15:36:45.203788   20383 container_log_manager.go:242] "Prrocessing container for log rotation" id="container-need-rotate" name="" labels=null
I0223 15:36:45.203811   20383 container_log_manager.go:242] "Prrocessing container for log rotation" id="container-has-excess-log" name="" labels=null
I0223 15:36:45.203830   20383 container_log_manager.go:242] "Prrocessing container for log rotation" id="container-is-not-running" name="" labels=null
I0223 15:36:45.203964   20383 container_log_manager.go:302] "log file doesn't need to be rotated" worker=1 containerID="container-not-need-rotate" path="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-1" currentSize=5 maxSize=10
I0223 15:36:45.203991   20383 container_log_manager.go:264] "Processing of container log completed" worker=1 containerID="container-not-need-rotate" ok=true
I0223 15:36:45.204294   20383 container_log_manager.go:322] "Found logs to rotate" containerID="container-need-rotate" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-2" logs=null
I0223 15:36:45.204323   20383 container_log_manager.go:328] "Cleaned up unused logs" containerID="container-need-rotate" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-2" logs=null
I0223 15:36:45.204345   20383 container_log_manager.go:334] "Removed excess logs" containerID="container-need-rotate" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-2" logs=null
I0223 15:36:45.204366   20383 container_log_manager.go:464] "Rotating latest log" containerID="container-need-rotate" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-2"
I0223 15:36:45.204822   20383 container_log_manager.go:310] "Rotated log for container" worker=1 containerID="container-need-rotate" path="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-2" currentSize=20 maxSize=10
I0223 15:36:45.204849   20383 container_log_manager.go:264] "Processing of container log completed" worker=1 containerID="container-need-rotate" ok=true
I0223 15:36:45.205135   20383 container_log_manager.go:322] "Found logs to rotate" containerID="container-has-excess-log" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3" logs=["/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000000.gz","/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001"]
I0223 15:36:45.205183   20383 container_log_manager.go:328] "Cleaned up unused logs" containerID="container-has-excess-log" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3" logs=["/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000000.gz","/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001"]
I0223 15:36:45.205521   20383 container_log_manager.go:334] "Removed excess logs" containerID="container-has-excess-log" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3" logs=["/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001"]
I0223 15:36:45.205551   20383 container_log_manager.go:420] "Compressing log" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001"
I0223 15:36:45.205733   20383 container_log_manager.go:427] "Creating temporary log" log="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp"
    container_log_manager_test.go:174: 
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:174
        	Error:      	"[- test-log-1 - test-log-2.20240223-153645 - test-log-3 - test-log-3.00000000-000001 - test-log-3.00000000-000001.tmp - test-log-4]" should have 5 item(s), but has 6
        	Test:       	TestRotateLogs
    container_log_manager_test.go:177: 
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:177
        	Error:      	Not equal: 
        	            	expected: "test-log-3.00000000-000001.gz"
        	            	actual  : "test-log-3"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.00000000-000001.gz
        	            	+test-log-3
        	Test:       	TestRotateLogs
    container_log_manager_test.go:178: 
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:178
        	Error:      	Not equal: 
        	            	expected: "test-log-3.20240223-153645"
        	            	actual  : "test-log-3.00000000-000001"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-3.20240223-153645
        	            	+test-log-3.00000000-000001
        	Test:       	TestRotateLogs
    container_log_manager_test.go:179: 
        	Error Trace:	/Users/harshnar/Work/Tesseract/Maglev/mks/k8s/kubernetes/pkg/kubelet/logs/container_log_manager_test.go:179
        	Error:      	Not equal: 
        	            	expected: "test-log-4"
        	            	actual  : "test-log-3.00000000-000001.tmp"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-test-log-4
        	            	+test-log-3.00000000-000001.tmp
        	Test:       	TestRotateLogs
I0223 15:36:45.227321   20383 container_log_manager.go:447] "Renaming temporary log to compressed log" tmpLog="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp" compressedLog="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.gz"
E0223 15:36:45.227999   20383 container_log_manager.go:449] "Failed to rename temporary log to compressed log" err="rename /var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp /var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.gz: no such file or directory" tmpLog="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp" compressedLog="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.gz"
E0223 15:36:45.228164   20383 container_log_manager.go:307] "Failed to rotate log for container" err="failed to compress log \"/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001\": failed to rename \"/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp\" to \"/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.gz\": rename /var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.tmp /var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3.00000000-000001.gz: no such file or directory" worker=1 containerID="container-has-excess-log" path="/var/folders/l3/tf0xj3gj67bfmk13h462668w0000gn/T/test-rotate-logs3784354061/test-log-3" currentSize=20 maxSize=10
I0223 15:36:45.228193   20383 container_log_manager.go:264] "Processing of container log completed" worker=1 containerID="container-has-excess-log" ok=true
I0223 15:36:45.228223   20383 container_log_manager.go:260] "Did not get any new entry to process, terminating worker" workerID=1
--- FAIL: TestRotateLogs (0.03s)

Had to add a few additional logs to identify the issue, but has a fix now. PTAL when possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Done
5 participants