Avoid calling mget with massive number of keys in Readdir #110

suzaku · 2021-01-22T12:57:59Z

This PR is related to #95 .

In the original implementation, mget is called once with the keys of every file in a directory. When a directory contains many files, this single call can block the Redis server process.

In this PR, I changed it to call mget in smaller, fixed-size batches. This reduces the chance of blocking the Redis server, but Readdir becomes slower when the number of files in a directory exceeds the batch size (currently 4096).
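
A rough sketch of the batching idea, not the actual patch. `mgetFunc` and `batchedMGet` are hypothetical names that stand in for the Redis client call and the loop inside Readdir; only the batch size of 4096 comes from the description above.

```go
package main

// mgetFunc stands in for a Redis MGET call: it takes keys and returns
// one value per key, in the same order. (Hypothetical type, not from the PR.)
type mgetFunc func(keys []string) ([]interface{}, error)

// batchedMGet fetches values in chunks of at most batchSize keys, so no
// single MGET has to carry every key of a large directory.
func batchedMGet(keys []string, batchSize int, mget mgetFunc) ([]interface{}, error) {
	if batchSize <= 0 {
		batchSize = 4096 // batch size mentioned in the PR description
	}
	values := make([]interface{}, 0, len(keys))
	for start := 0; start < len(keys); start += batchSize {
		end := start + batchSize
		if end > len(keys) {
			end = len(keys)
		}
		vals, err := mget(keys[start:end])
		if err != nil {
			return nil, err
		}
		values = append(values, vals...)
	}
	return values, nil
}
```

With a directory smaller than the batch size the loop runs once, which matches the near-identical small-directory timings; above it, every extra batch costs one more round trip.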

For small directories, the latency difference is trivial:

Before

In [24]: %timeit os.listdir("/Users/satoru//jfs/some-files/")
1.26 ms ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

In [25]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
1.27 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When I run the benchmark in a directory with more than 200,000 files, the new version is measurably slower (about 9%):

Before

In [15]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
446 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

In [19]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
488 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

davies · 2021-01-22T15:15:25Z

pkg/meta/redis.go

-			Name:  []byte(name),
-			Attr:  &Attr{Typ: typ},
-		})
+		ent := newEntries[i]


ent should be a pointer

davies · 2021-01-22T15:15:42Z

pkg/meta/redis.go

+		ent := newEntries[i]
+		ent.Inode = inode
+		ent.Name = []byte(name)
+		attr := newAttrs[i]


attr should be a pointer
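
A minimal illustration of what the two "should be a pointer" comments are most likely about, assuming `newEntries` and `newAttrs` are slices of struct values. `Entry` and `Attr` here are simplified stand-ins, not the real definitions in pkg/meta.

```go
package main

// Attr and Entry are simplified stand-ins for the types in pkg/meta.
type Attr struct{ Typ uint8 }

type Entry struct {
	Inode uint64
	Name  []byte
	Attr  *Attr
}

// fillByValue copies each element before modifying it, so the
// preallocated slices are left untouched.
func fillByValue(entries []Entry, attrs []Attr) {
	for i := range entries {
		ent := entries[i] // copy of the struct
		ent.Inode = uint64(i + 1)
		attr := attrs[i] // another copy
		ent.Attr = &attr
		_ = ent // the modified copy is discarded here
	}
}

// fillByPointer takes the address of each element, so the writes land
// in the preallocated slices and no extra allocation is needed.
func fillByPointer(entries []Entry, attrs []Attr) {
	for i := range entries {
		ent := &entries[i]
		ent.Inode = uint64(i + 1)
		attr := &attrs[i]
		attr.Typ = 1
		ent.Attr = attr
	}
}
```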

davies · 2021-01-22T15:16:03Z

pkg/meta/redis.go

-				if a, ok := re.(string); ok {
-					r.parseAttr([]byte(a), (*entries)[i].Attr)
+		batchSize := 4096
+		if batchSize > len(*entries) {


suzaku · 2021-01-24T07:42:58Z

I've changed the code to call mget in two goroutines. It turns out to be faster than the original implementation when reading a folder with 300,000 files.

Before

In [2]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
657 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
654 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
648 ms ± 8.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

In [5]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
525 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
613 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
617 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
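
One way the concurrent variant could look, as a sketch under assumptions rather than the code in this PR. `mgetFunc` and `parallelMGet` are hypothetical names, and the worker count is a parameter here, where the comment above uses two goroutines.

```go
package main

import "sync"

// mgetFunc stands in for a Redis MGET call (hypothetical type).
type mgetFunc func(keys []string) ([]interface{}, error)

// parallelMGet splits keys into batches and lets `workers` goroutines
// issue the MGET calls concurrently. Each batch writes to its own range
// of the result slice, so order is preserved without extra locking.
func parallelMGet(keys []string, batchSize, workers int, mget mgetFunc) ([]interface{}, error) {
	if batchSize <= 0 {
		batchSize = 4096
	}
	if workers <= 0 {
		workers = 2
	}
	values := make([]interface{}, len(keys))
	starts := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for start := range starts {
				end := start + batchSize
				if end > len(keys) {
					end = len(keys)
				}
				vals, err := mget(keys[start:end])
				if err != nil {
					once.Do(func() { firstErr = err })
					continue // keep draining so the producer never blocks
				}
				copy(values[start:end], vals)
			}
		}()
	}
	for start := 0; start < len(keys); start += batchSize {
		starts <- start
	}
	close(starts)
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return values, nil
}
```

Starting goroutines and a channel has a fixed cost, which fits the observation that this approach helps with 300,000 files but not with 100.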

suzaku · 2021-01-24T07:44:20Z

For folders with fewer files (100 in my benchmark), it's slower than the original implementation:

Before

In [18]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
685 µs ± 8.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [19]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
666 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
656 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

In [15]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
752 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [16]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
748 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [17]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
722 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

davies · 2021-01-24T15:56:23Z

@suzaku Based on the benchmark results, we should have a fast path for small directories. Thanks.
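
A fast path could be as small as the sketch below. The names `readAttrs` and `threshold` are hypothetical, and the `batched` argument is assumed to be whatever batched or concurrent implementation is used for large directories.

```go
package main

// mgetFunc stands in for a Redis MGET call (hypothetical type).
type mgetFunc func(keys []string) ([]interface{}, error)

// readAttrs issues a single MGET for small directories and only falls
// back to the batched (possibly concurrent) path above the threshold.
func readAttrs(keys []string, threshold int, mget, batched mgetFunc) ([]interface{}, error) {
	if len(keys) <= threshold {
		return mget(keys) // fast path: one round trip, no goroutines
	}
	return batched(keys)
}
```

Using the batch size as the threshold would keep the small-directory behavior identical to the original implementation.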

davies · 2021-01-24T15:57:33Z

Could you also add a benchmark in Go (for both small and large directories)?
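
A possible shape for such a benchmark, assuming an in-memory stand-in for Redis. `fakeMGet`, `makeKeys`, and `readInBatches` are hypothetical helpers, so this measures only client-side overhead; a real benchmark would need a Redis instance and the actual Readdir. The sizes 100 and 300,000 mirror the Python timings in this thread.

```go
package main

import (
	"strconv"
	"testing"
)

// fakeMGet is an in-memory stand-in for Redis MGET (hypothetical).
func fakeMGet(keys []string) ([]interface{}, error) {
	out := make([]interface{}, len(keys))
	for i := range keys {
		out[i] = "attr"
	}
	return out, nil
}

// makeKeys builds n distinct placeholder keys.
func makeKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = "i" + strconv.Itoa(i)
	}
	return keys
}

// readInBatches fetches all keys in fixed-size batches and returns how
// many values came back.
func readInBatches(keys []string, batchSize int) (int, error) {
	total := 0
	for start := 0; start < len(keys); start += batchSize {
		end := start + batchSize
		if end > len(keys) {
			end = len(keys)
		}
		vals, err := fakeMGet(keys[start:end])
		if err != nil {
			return total, err
		}
		total += len(vals)
	}
	return total, nil
}

func benchmarkReaddir(b *testing.B, n int) {
	keys := makeKeys(n)
	b.ResetTimer() // exclude key generation from the measurement
	for i := 0; i < b.N; i++ {
		got, err := readInBatches(keys, 4096)
		if err != nil || got != n {
			b.Fatalf("got %d values (err=%v), want %d", got, err, n)
		}
	}
}

func BenchmarkReaddirSmall(b *testing.B) { benchmarkReaddir(b, 100) }
func BenchmarkReaddirLarge(b *testing.B) { benchmarkReaddir(b, 300000) }
```

Placed in a `_test.go` file, these would run with `go test -bench Readdir`.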

suzaku · 2021-01-24T23:15:55Z

@davies Is it safe to use multiple HSCAN calls here instead of one huge HGETALL?

davies · 2021-01-25T01:09:31Z

@suzaku It's OK to use HSCAN. What's the current behavior with 5 million files?

suzaku · 2021-01-25T01:43:21Z

If HSCAN is OK, we can set up a goroutine that runs HSCAN in smaller batches instead of calling HGETALL upfront.
I haven't tested 5 million files yet. Creating that many files is quite time-consuming, and I didn't wait that long...
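
The HSCAN idea might look roughly like the sketch below. `hscanFunc` is a hypothetical wrapper around one HSCAN call on the directory hash, and `scanDir` and `dirBatch` are illustrative names; nothing here is taken from the JuiceFS code.

```go
package main

// hscanFunc mimics one HSCAN call: given a cursor and a COUNT hint, it
// returns a flat field/value list and the next cursor (0 when the scan
// is complete). Hypothetical type for illustration.
type hscanFunc func(cursor uint64, count int64) (pairs []string, next uint64, err error)

// dirBatch is one chunk of directory entries, or an error.
type dirBatch struct {
	Pairs []string // name1, value1, name2, value2, ...
	Err   error
}

// scanDir walks the hash in a goroutine and streams each HSCAN reply
// over a channel, so the caller can process entries while the next
// batch is being fetched.
func scanDir(hscan hscanFunc, count int64) <-chan dirBatch {
	out := make(chan dirBatch, 1)
	go func() {
		defer close(out)
		var cursor uint64
		for {
			pairs, next, err := hscan(cursor, count)
			if err != nil {
				out <- dirBatch{Err: err}
				return
			}
			if len(pairs) > 0 {
				out <- dirBatch{Pairs: pairs}
			}
			if next == 0 {
				return // Redis signals the end of the scan with cursor 0
			}
			cursor = next
		}
	}()
	return out
}
```

Two caveats: the caller has to drain the channel, otherwise the goroutine stays blocked, and because HSCAN can return the same field more than once and treats COUNT only as a hint, a consumer would probably need to deduplicate entries.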

davies · 2021-01-27T07:04:21Z

Let's merge this one first and optimize the hgetall() call later.

suzaku added 3 commits January 24, 2021 15:39

Optimize Readdir

fda7869

Bigger batch

6568e81

Call mget in two goroutines

c9826c1

suzaku force-pushed the fix-#95-split-mget-calls branch from 0da0ac5 to c9826c1 Compare January 24, 2021 07:40

suzaku changed the title ~~WIP: Avoid calling mget with massive number of keys in Readdir~~ Avoid calling mget with massive number of keys in Readdir Jan 24, 2021

davies merged commit c914708 into juicedata:main Jan 27, 2021

xiaogaozi mentioned this pull request Jan 28, 2021

Support directories with millions of files. #95

Closed

2 tasks
