
fs: get inodes and disk usage via pure go #2171

Merged
merged 1 commit into google:master from replace-du-and-find on Feb 25, 2019

Conversation

namreg
Contributor

@namreg namreg commented Feb 15, 2019

Hello there,
I would like to return to the PR I was working on almost two years ago (#1576). @euank proposed getting rid of the find command. I took his PR as a basis and also got rid of the du command. Now we collect both disk space and inode stats in a single pass.

Why it matters for us: both du and find overwhelm our audit log.

I have also benchmarked the du command against the pure Go implementation, with the following results:

Benchmark_duDiskUsage-8             1000           1597743 ns/op           22551 B/op        165 allocs/op
Benchmark_osStatDiskUsage-8        20000             56153 ns/op           53624 B/op         88 allocs/op

As we can see, the pure Go implementation is ~20x faster.
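For reference, here is a minimal sketch of the single-pass approach this PR describes. This is not the PR's actual code: `getDirUsage` and `usageInfo` are names invented for the sketch, and it assumes Linux, where `os.FileInfo.Sys()` yields a `*syscall.Stat_t`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// usageInfo holds both stats gathered in one traversal, instead of
// shelling out to `du` (bytes) and `find` (inodes) separately.
type usageInfo struct {
	bytes  uint64
	inodes uint64
}

// getDirUsage walks dir once, summing allocated bytes (st_blocks is
// always in 512-byte units) and counting inodes. Hardlinked regular
// files (nlink > 1) are counted once, keyed by inode number, which
// assumes the tree lives on a single filesystem.
func getDirUsage(dir string) (usageInfo, error) {
	var u usageInfo
	seen := map[uint64]struct{}{} // inodes of hardlinked files already counted
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported stat type for %s", path)
		}
		if !info.IsDir() && st.Nlink > 1 {
			if _, dup := seen[st.Ino]; dup {
				return nil // hardlink to a file we already counted
			}
			seen[st.Ino] = struct{}{}
		}
		u.inodes++
		u.bytes += uint64(st.Blocks) * 512
		return nil
	})
	return u, err
}

func main() {
	u, err := getDirUsage(".")
	if err != nil {
		panic(err)
	}
	fmt.Printf("bytes=%d inodes=%d\n", u.bytes, u.inodes)
}
```

A production version would also need to key hardlinks by (device, inode) and decide how to handle files that disappear mid-walk.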

@k8s-ci-robot
Collaborator

Hi @namreg. Thanks for your PR.

I'm waiting for a google or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Collaborator

dims commented Feb 15, 2019

/ok-to-test

@dashpole
Collaborator

How much data is in the directory it is getting disk and inodes on?

@namreg
Contributor Author

namreg commented Feb 16, 2019

@dashpole Do you mean how much data was in the directory during the benchmark? To be honest, I ran it on a small directory containing ~10 items, and I only benchmarked the du command.
I have redone the benchmarks and now get the following results:

goos: linux
goarch: amd64
pkg: github.com/namreg/find-du-bench
BenchmarkDiskUsageNew-2   	   20000	     85928 ns/op	   22720 B/op	      90 allocs/op
BenchmarkDiskUsageOld-2   	      50	  34123838 ns/op	   46605 B/op	     182 allocs/op

The benchmarks were run on the cadvisor source tree, including .git (2557 files, ~57M). The benchmark source code is in the repo https://github.com/namreg/find-du-bench

@dashpole
Collaborator

Can you try it on a large directory? Say 10 GB?

Thanks a bunch for running these experiments.

@euank
Collaborator

euank commented Feb 19, 2019

The best data to use is probably a /var/lib/docker directory using the overlay/overlay2 graphdriver. That's what this code is meant to count in Kubernetes, after all, and that directory will often contain quite a few hardlinks depending on the container images lying around. It could be worth populating it with something approximating a typical k8s workload, either by running the benchmark on a k8s node, or by downloading similar images and running similar sets of containers.
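The hardlink point matters for correctness, not just performance: two paths under overlay2 can share one inode, and counting both double-reports usage. A small standalone demo (the helper names `countUniqueInodes` and `demoHardlinks` are invented for this sketch; Linux assumed):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// countUniqueInodes returns how many distinct inodes back the given
// paths, so hardlinked files are not double-counted.
func countUniqueInodes(paths []string) (int, error) {
	seen := map[uint64]struct{}{}
	for _, p := range paths {
		var st syscall.Stat_t
		if err := syscall.Lstat(p, &st); err != nil {
			return 0, err
		}
		seen[st.Ino] = struct{}{}
	}
	return len(seen), nil
}

// demoHardlinks creates a file and a hardlink to it in a temp dir,
// mimicking shared layer files under /var/lib/docker/overlay2.
func demoHardlinks() (int, error) {
	dir, err := os.MkdirTemp("", "hardlink-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	a := filepath.Join(dir, "a")
	b := filepath.Join(dir, "b")
	if err := os.WriteFile(a, []byte("layer data"), 0o644); err != nil {
		return 0, err
	}
	if err := os.Link(a, b); err != nil {
		return 0, err
	}
	return countUniqueInodes([]string{a, b})
}

func main() {
	n, err := demoHardlinks()
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // prints 1: two paths, one inode
}
```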

@namreg
Contributor Author

namreg commented Feb 20, 2019

@dashpole @euank So, I've run the benchmark on one of the Kubernetes nodes against the /var/lib/docker directory (~17GB in size), as proposed above. Here are the results:

goos: linux
goarch: amd64
BenchmarkDiskUsageNew-4   	       5	 249946403 ns/op	96558353 B/op	  245327 allocs/op
BenchmarkDiskUsageOld-4   	       1	1132808839 ns/op	   63032 B/op	     238 allocs/op

As we can see, the new implementation is still faster, but it appears to consume a lot more memory. I believe this is because the Go runtime does not track memory for processes spawned via exec.Command, so the old implementation's allocations are undercounted.
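That accounting caveat is easy to demonstrate: allocations made by an exec'd child live in the child's own address space and never appear in Go's allocator statistics. A sketch with an invented `execHeapDelta` helper, using `true` as a stand-in for `du`:

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// execHeapDelta runs a trivial child process and reports how many bytes
// the Go heap allocated around the call. The child's own memory never
// shows up here, which is why the exec-based benchmark reports so few
// B/op despite `du` doing all the real work. Illustrative only.
func execHeapDelta() (uint64, error) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	if err := exec.Command("true").Run(); err != nil {
		return 0, err
	}
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc, nil
}

func main() {
	delta, err := execHeapDelta()
	if err != nil {
		panic(err)
	}
	// Only the small Go-side bookkeeping (Cmd struct, pipes) is counted.
	fmt.Printf("Go-heap bytes allocated around exec: %d\n", delta)
}
```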

Nevertheless, here are pprof results:

Showing nodes accounting for 777.70MB, 99.49% of 781.71MB total
Dropped 8 nodes (cum <= 3.91MB)
Showing top 10 nodes out of 26
      flat  flat%   sum%        cum   cum%
  548.19MB 70.13% 70.13%   557.21MB 71.28%  os.(*File).readdirnames
   85.52MB 10.94% 81.07%   145.03MB 18.55%  os.lstatNolog
   64.51MB  8.25% 89.32%    64.51MB  8.25%  syscall.ByteSliceFromString
   61.01MB  7.80% 97.12%    61.01MB  7.80%  strings.Join
    9.03MB  1.15% 98.28%     9.03MB  1.15%  syscall.ParseDirent
    5.50MB   0.7% 98.98%     5.50MB   0.7%  os.newFile
    3.96MB  0.51% 99.49%     3.96MB  0.51%  _/tmp/find-du-bench.DiskUsageNew.func1
         0     0% 99.49%   781.20MB 99.94%  _/tmp/find-du-bench.BenchmarkDiskUsageNew
         0     0% 99.49%   781.20MB 99.94%  _/tmp/find-du-bench.DiskUsageNew
         0     0% 99.49%   557.21MB 71.28%  os.(*File).Readdirnames
(pprof) list .readdirnames
Total: 781.71MB
ROUTINE ======================== os.(*File).readdirnames in /usr/local/go/src/os/dir_unix.go
  548.19MB   557.21MB (flat, cum) 71.28% of Total
         .          .     45:}
         .          .     46:
         .          .     47:func (f *File) readdirnames(n int) (names []string, err error) {
         .          .     48:	// If this file has no dirinfo, create one.
         .          .     49:	if f.dirinfo == nil {
    2.50MB     2.50MB     50:		f.dirinfo = new(dirInfo)
         .          .     51:		// The buffer must be at least a block long.
  454.03MB   454.03MB     52:		f.dirinfo.buf = make([]byte, blockSize)
         .          .     53:	}
         .          .     54:	d := f.dirinfo
         .          .     55:
         .          .     56:	size := n
         .          .     57:	if size <= 0 {
         .          .     58:		size = 100
         .          .     59:		n = -1
         .          .     60:	}
         .          .     61:
   91.66MB    91.66MB     62:	names = make([]string, 0, size) // Empty with room to grow.
         .          .     63:	for n != 0 {
         .          .     64:		// Refill the buffer if necessary
         .          .     65:		if d.bufp >= d.nbuf {
         .          .     66:			d.bufp = 0
         .          .     67:			var errno error
         .          .     68:			d.nbuf, errno = f.pfd.ReadDirent(d.buf)
         .          .     69:			runtime.KeepAlive(f)
         .          .     70:			if errno != nil {
         .          .     71:				return names, wrapSyscallError("readdirent", errno)
         .          .     72:			}
         .          .     73:			if d.nbuf <= 0 {
         .          .     74:				break // EOF
         .          .     75:			}
         .          .     76:		}
         .          .     77:
         .          .     78:		// Drain the buffer
         .          .     79:		var nb, nc int
         .     9.03MB     80:		nb, nc, names = syscall.ParseDirent(d.buf[d.bufp:d.nbuf], n, names)
         .          .     81:		d.bufp += nb
         .          .     82:		n -= nc
         .          .     83:	}
         .          .     84:	if n >= 0 && len(names) == 0 {
         .          .     85:		return names, io.EOF
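Most of that 454MB is a fresh block-sized dirinfo buffer allocated for every directory opened during the walk. If this ever became a problem, directory names can be read through syscall.ReadDirent with a caller-owned buffer reused across directories. A sketch (`listNames` is a hypothetical helper, Linux-only, not part of this PR):

```go
package main

import (
	"fmt"
	"syscall"
)

// listNames reads the entry names of dir using a caller-supplied buffer,
// so a deep walk can reuse one buffer instead of os.(*File).readdirnames
// allocating a block-sized buffer per directory, as the profile shows.
func listNames(dir string, buf []byte) ([]string, error) {
	fd, err := syscall.Open(dir, syscall.O_RDONLY|syscall.O_DIRECTORY, 0)
	if err != nil {
		return nil, err
	}
	defer syscall.Close(fd)

	var names []string
	for {
		n, err := syscall.ReadDirent(fd, buf)
		if err != nil {
			return nil, err
		}
		if n <= 0 {
			break // EOF
		}
		// ParseDirent skips "." and ".." and appends the rest to names.
		_, _, names = syscall.ParseDirent(buf[:n], -1, names)
	}
	return names, nil
}

func main() {
	buf := make([]byte, 8192) // reused across every directory visited
	names, err := listNames(".", buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d entries\n", len(names))
}
```

The names slices themselves (the 91.66MB line) would still be allocated; only the per-directory read buffer is saved.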

@dashpole
Collaborator

I am going to run our eviction tests with the PR as another data point.

You mentioned at the beginning: "both du and find overwhelm our audit log". Have you done any experiments to show that this helps your issue?

@dashpole
Collaborator

The eviction tests all pass. The latency of du and find looks about the same as this implementation in the inode eviction test (lots of small files).

@namreg
Contributor Author

namreg commented Feb 21, 2019

@dashpole

You mentioned at the beginning: "both du and find overwhelm our audit log". Have you done any experiments to show that this helps your issue?

Yes, I have deployed this patch to a Kubernetes cluster. As a result, the number of audit log entries decreased drastically. It turns out ionice and nice were showing up there as well, so this patch removes four classes of records from the audit logs: du, find, ionice, and nice.

@dashpole
Collaborator

Ah, I get it now. It was the calls themselves, not the "du and find took ..." log messages, that were overwhelming your audit logging.

@namreg
Contributor Author

namreg commented Feb 21, 2019

@dashpole yeah, exactly :)

Does anything else concern you in this PR?

@dashpole
Collaborator

/retest

@namreg
Contributor Author

namreg commented Feb 25, 2019

@dashpole is there anything else I should fix?

@dashpole
Collaborator

dashpole left a comment

lgtm

@dashpole dashpole merged commit 05529e2 into google:master Feb 25, 2019
@PaulFurtado

By moving this into the cadvisor process, we actually lose a couple of things:

  1. du/find were being wrapped in `ionice -c3 nice -n 19` to make them lower priority. Now the work runs at the same priority as cadvisor/kubelet.
  2. If storage hung, the syscall would hang in the du/find child processes instead of in cadvisor/kubelet. Now kubelet will be unkillable in this situation.

Many monitoring solutions fork off child processes for these tasks for exactly these reasons. I like that this is moving to pure Go, but I think it should still fork and lower the niceness values with syscalls.
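For what it's worth, the niceness half of this does not strictly require forking: on Linux a process can lower its own CPU priority with setpriority(2), and its I/O priority with the ioprio_set(2) syscall, which has no wrapper in the Go syscall package. A sketch assuming amd64, where ioprio_set is syscall number 251 (verify for other architectures); only a separate child process would also give the hang-isolation from point 2:

```go
package main

import (
	"fmt"
	"syscall"
)

const (
	sysIoprioSet     = 251 // x86_64 syscall number for ioprio_set(2)
	ioprioWhoProcess = 1   // IOPRIO_WHO_PROCESS
	ioprioClassIdle  = 3   // IOPRIO_CLASS_IDLE, as `ionice -c3` sets
	ioprioClassShift = 13  // class is stored in the top bits of the value
)

// lowerPriority applies the equivalent of `nice -n 19 ionice -c3` to the
// calling process/thread. Illustrative sketch, not the PR's code.
func lowerPriority() error {
	// nice -n 19 equivalent.
	if err := syscall.Setpriority(syscall.PRIO_PROCESS, 0, 19); err != nil {
		return err
	}
	// ionice -c3 equivalent: idle I/O scheduling class (who=0 means the
	// calling thread; unprivileged since Linux 2.6.25).
	if _, _, errno := syscall.Syscall(sysIoprioSet,
		ioprioWhoProcess, 0, ioprioClassIdle<<ioprioClassShift); errno != 0 {
		return errno
	}
	return nil
}

func main() {
	if err := lowerPriority(); err != nil {
		panic(err)
	}
	fmt.Println("running at nice 19, I/O class idle")
}
```

One caveat: I/O priority is per-thread on Linux and goroutines migrate across OS threads, so a real in-process version would need runtime.LockOSThread around the walk, which is another argument for the forked-child approach suggested above.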

@namreg namreg deleted the replace-du-and-find branch April 3, 2019 14:44