
File descriptors and internal connection leak (reopen) #15033

Closed
lihaohua66 opened this issue Jun 3, 2022 · 22 comments

@lihaohua66

NOTE

After upgrading to RELEASE.2022-05-19T18-20-59Z, we encountered the following issue in both the UI and clients after running the cluster for a few days (the cluster has a very low workload):

 A timeout exceeded while waiting to proceed with the request, please reduce your request rate.

We could observe the file descriptor count steadily increasing over the last few days.
[screenshot: file descriptor count over time]

We could also see thousands of file descriptors like the ones below:

minio   20109 minio  178u     sock                0,9      0t0   513460923 protocol: TCPv6
minio   20109 minio  179u     sock                0,9      0t0   513255278 protocol: TCPv6
minio   20109 minio  180u     sock                0,9      0t0   513603050 protocol: TCPv6
minio   20109 minio  181u     sock                0,9      0t0   513340630 protocol: TCPv6
minio   20109 minio  182u     sock                0,9      0t0   513542780 protocol: TCPv6
minio   20109 minio  183u     sock                0,9      0t0   513382558 protocol: TCPv6
minio   20109 minio  184u     sock                0,9      0t0   513726529 protocol: TCPv6

With @harshavardhana's suggestion in #15022, we upgraded to RELEASE.2022-06-02T02-11-04Z, but the file descriptor count still keeps increasing. (There is a cliff here because we restarted the cluster at that time.)
[screenshot: file descriptor count after the upgrade]

Expected Behavior

The cluster should recycle file descriptors/connections and should not stop working.

Current Behavior

The cluster cannot handle client requests; performance is degraded.

Possible Solution

Steps to Reproduce (for bugs)

  1. Run MinIO bare metal with RELEASE.2022-05-19T18-20-59Z for a few days.
  2. Check the file descriptor count on each server.

Context

Regression

Your Environment

  • Version used (minio --version): RELEASE.2022-06-02T02-11-04Z
  • Server setup and configuration: 6 servers with Minio Baremetal
  • Operating System and version (uname -a): Linux 5.4.0-54-generic #60~18.04.1-Ubuntu SMP Fri Nov 6 17:25:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@harshavardhana
Member

We need to see profiling output here @lihaohua66

@harshavardhana added the incomplete info label and removed the triage label Jun 3, 2022
@lihaohua66
Author

The profiles for 10 minutes are here:
profile.zip

@lihaohua66
Author

FYI. with version 20211209061941.0.0, we don't see this issue.
[screenshot: file descriptor count on the older version]

@harshavardhana
Member

FYI. with version 20211209061941.0.0, we don't see this issue.

I have no idea what this version is; we do not publish such versions @lihaohua66

@lihaohua66
Author

@harshavardhana it is RELEASE.2021-12-09T06-19-41Z

@harshavardhana
Member

@klauspost

goroutine profile: total 1205
320 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
#       0x1eb8daf       github.com/minio/minio/cmd.mergeEntryChannels+0x68f                     github.com/minio/minio/cmd/metacache-entries.go:698
#       0x1ec0bc6       github.com/minio/minio/cmd.(*erasureServerPools).listMerged+0x4e6       github.com/minio/minio/cmd/metacache-server-pool.go:583
#       0x1ebd07a       github.com/minio/minio/cmd.(*erasureServerPools).listPath.func3+0xda    github.com/minio/minio/cmd/metacache-server-pool.go:236

320 @ 0x439b96 0x44accc 0x44aca6 0x4675e5 0x4776d1 0x1ebcae9 0x1df0fd6 0x1defa5c 0x1cc3b5a 0x1e75113 0x1e7510e 0x6ce34f 0x7949a7 0x1e72e79 0x1e72e54 0x1e755d2 0x1e755b9 0x6ce34f 0x1e6d5d8 0x6ce34f 0x1e6e142 0x6ce34f 0x1e6d16a 0x6ce34f 0x1e6c20e 0x6ce34f 0x1e6b5af 0x6ce34f 0x1d36d4f 0x6ce34f 0x1e6b8c3 0x6ce34f
#       0x4675e4        sync.runtime_Semacquire+0x24                                            runtime/sema.go:56
#       0x4776d0        sync.(*WaitGroup).Wait+0x70                                             sync/waitgroup.go:130
#       0x1ebcae8       github.com/minio/minio/cmd.(*erasureServerPools).listPath+0x1428        github.com/minio/minio/cmd/metacache-server-pool.go:242
#       0x1df0fd5       github.com/minio/minio/cmd.(*erasureServerPools).ListObjects+0x795      github.com/minio/minio/cmd/erasure-server-pool.go:1280
#       0x1defa5b       github.com/minio/minio/cmd.(*erasureServerPools).ListObjectsV2+0xdb     github.com/minio/minio/cmd/erasure-server-pool.go:1160
#       0x1cc3b59       github.com/minio/minio/cmd.objectAPIHandlers.ListObjectsV2Handler+0x679 github.com/minio/minio/cmd/bucket-listobjects-handlers.go:241

@lihaohua66 you have 300+ listing calls going on across all servers - this is why there is a connection build-up - this is normal. Are you sure you have the right application workload here?

You can't really be doing 300*6 = 1800 concurrent List operations on a tiny cluster like this.
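
For reference, aggregated stacks in the format quoted above come from Go's built-in goroutine profile. A minimal standalone sketch (illustrative only, not MinIO code) that produces the same kind of dump for any Go process:

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=1 aggregates identical stacks with a count in front,
	// matching the "320 @ 0x439b96 ..." lines quoted above.
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 1); err != nil {
		panic(err)
	}
}

In this thread the dumps came from the profile.zip attached earlier; the sketch only shows where that format originates.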

@lihaohua66
Author

lihaohua66 commented Jun 3, 2022

@harshavardhana from the metrics, we did not see many list operations in the cluster.
[screenshot: list operation metrics]

@harshavardhana
Member

Yes, but why @lihaohua66? What is the purpose of doing 1800 list calls concurrently?

grep -B1 mergeEntry profile*-goroutines.txt
profile-host1:9000-goroutines.txt-320 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host1:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698
--
profile-host2:9000-goroutines.txt-295 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host2:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698
--
profile-host3:9000-goroutines.txt-325 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host3:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698
--
profile-host4:9000-goroutines.txt-345 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host4:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698
--
profile-host5:9000-goroutines.txt-336 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host5:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698
--
profile-host6:9000-goroutines.txt-338 @ 0x439b96 0x4057c5 0x40537d 0x1eb8db0 0x1ec0bc7 0x1ebd07b 0x46b901
profile-host6:9000-goroutines.txt:#	0x1eb8daf	github.com/minio/minio/cmd.mergeEntryChannels+0x68f			github.com/minio/minio/cmd/metacache-entries.go:698

@lihaohua66
Author

I don't think we really perform 1800 list calls concurrently. Is it possible that the list operation leads to a leak?

@lihaohua66
Author

lihaohua66 commented Jun 3, 2022

For the open file descriptors, most of the entries look like minio 8955 minio 630u sock 0,9 0t0 515895585 protocol: TCPv6:

lsof -p 8955 | grep TCPv6 | wc -l
345
lsof -p 8955 |  wc -l
485
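
As a side note, the same counts can be gathered without lsof by walking /proc. A minimal Go sketch, assuming Linux and using PID 8955 from the output above purely as an example:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	pid := "8955" // example PID taken from the lsof output above
	fdDir := filepath.Join("/proc", pid, "fd")
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		panic(err)
	}
	sockets := 0
	for _, e := range entries {
		// Every fd entry is a symlink; sockets resolve to "socket:[<inode>]".
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err == nil && strings.HasPrefix(target, "socket:") {
			sockets++
		}
	}
	fmt.Printf("total fds: %d, sockets: %d\n", len(entries), sockets)
}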

@harshavardhana
Member

git diff
diff --git a/cmd/metacache-entries.go b/cmd/metacache-entries.go
index 41c3da3b1..3bad979cf 100644
--- a/cmd/metacache-entries.go
+++ b/cmd/metacache-entries.go
@@ -695,8 +695,12 @@ func mergeEntryChannels(ctx context.Context, in []chan metaCacheEntry, out chan<
                        }
                }
                if best.name > last {
-                       out <- *best
-                       last = best.name
+                       select {
+                       case <-ctxDone:
+                               return ctx.Err()
+                       case out <- *best:
+                               last = best.name
+                       }
                } else if serverDebugLog {
                        console.Debugln("mergeEntryChannels: discarding duplicate", best.name, "<=", last)
                }
diff --git a/cmd/metacache-server-pool.go b/cmd/metacache-server-pool.go
index 82998f86f..96c213073 100644
--- a/cmd/metacache-server-pool.go
+++ b/cmd/metacache-server-pool.go
@@ -554,8 +554,8 @@ func (z *erasureServerPools) listMerged(ctx context.Context, o listPathOptions,
        for _, pool := range z.serverPools {
                for _, set := range pool.sets {
                        wg.Add(1)
-                       results := make(chan metaCacheEntry, 100)
-                       inputs = append(inputs, results)
+                       innerResults := make(chan metaCacheEntry, 100)
+                       inputs = append(inputs, innerResults)
                        go func(i int, set *erasureObjects) {
                                defer wg.Done()
                                err := set.listPath(listCtx, o, results)

Can you apply this fix and test again, @lihaohua66? How long does it take to reproduce the problem?

@harshavardhana
Member

Actually, this is the bug: the context is canceled, but the send on the results channel is hung at this point, and that is what leaks connections.
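
To make the failure mode concrete: the send blocks forever because nobody is reading anymore, which pins the producing goroutine and its internal connections. A minimal sketch of the context-aware send pattern the diff above introduces (illustrative only, not the exact MinIO code; assumes Go 1.18+ for the type parameter):

package sketch

import "context"

// sendOrCancel forwards v to out, but gives up as soon as ctx is canceled,
// so the sender cannot hang forever once the receiver has gone away.
func sendOrCancel[T any](ctx context.Context, out chan<- T, v T) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // caller gave up; stop instead of leaking
	case out <- v:
		return nil
	}
}

This is the same shape as the select added to mergeEntryChannels in the diff above.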

@lihaohua66
Author

@harshavardhana thank you for the quick action. We use the released deb package to deploy our MinIO. Do you know of any easy way to apply this fix?

@lihaohua66
Author

The file descriptor count increases gradually, so it usually takes 5-10 hours to see an obvious increase in the dashboard.

@harshavardhana
Member

@harshavardhana thank you for the quick action. We use the released deb package to deploy our MinIO. Do you know of any easy way to apply this fix?

You will have to compile it. I can compile one for you to deploy so you can test it out.

@lihaohua66
Author

That would be great, thank you.

@harshavardhana
Member

You can download the binary with mc cp -a play/testbucket/minio.gz . - deploy it on all the nodes and let us know in 5 hours.

@lihaohua66
Author

sure, let me test it.

@harshavardhana
Member

Please do not deploy this yet; there seems to be some problem. Let me make a more thorough fix @lihaohua66

@harshavardhana
Member

Okay, updated the binary once again, same location mc cp -a play/testbucket/minio.gz . - you are good to use this.

@harshavardhana added the priority: high label and removed the incomplete info label Jun 3, 2022
@harshavardhana self-assigned this Jun 3, 2022
@harshavardhana
Member

Closing this as fixed in #15034

@lihaohua66
Author

Cool, let me test with the latest again.

github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2023