Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calling mget with massive number of keys in Readdir #110

Merged
merged 3 commits into from Jan 27, 2021

Conversation

suzaku
Copy link
Contributor

@suzaku suzaku commented Jan 22, 2021

This PR is related to #95 .

In the original implementation, mget is called once with all the keys which correspond to files in a directory. When there are many files in a directory, this call might block the Redis server process.

In this PR, I changed it to call mget in smaller fixed batch. The consequence is that we reduce the chance of blocking the Redis server, but ReadDir is made slower when the number of files in a directory exceeds the batch size (which is set to 4096 now).

For small directories, the latency difference is trivial:

  • Before
In [24]: %timeit os.listdir("/Users/satoru//jfs/some-files/")
1.26 ms ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • After
In [25]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
1.27 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When I run the benchmark in a directory with more than 200,000 files, the new version is obviously slower:

  • Before
In [15]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
446 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • After
In [19]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
488 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Name: []byte(name),
Attr: &Attr{Typ: typ},
})
ent := newEntries[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ent should be a pointer

ent := newEntries[i]
ent.Inode = inode
ent.Name = []byte(name)
attr := newAttrs[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attr should be a pointer

if a, ok := re.(string); ok {
r.parseAttr([]byte(a), (*entries)[i].Attr)
batchSize := 4096
if batchSize > len(*entries) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

@suzaku
Copy link
Contributor Author

suzaku commented Jan 24, 2021

I've changed the code to call mget in two goroutines, it turns out to be faster than the original implementation when reading a folder with 300,000 files.

  • Before
In [2]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
657 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
654 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
648 ms ± 8.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • After
In [5]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
525 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
613 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit os.listdir("/Users/satoru/jfs/many-files/")
617 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@suzaku
Copy link
Contributor Author

suzaku commented Jan 24, 2021

For folders with less files (100 in my benchmark), it's slower than the original implementation:

  • Before
In [18]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
685 µs ± 8.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [19]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
666 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
656 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • After
In [15]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
752 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [16]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
748 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [17]: %timeit os.listdir("/Users/satoru/jfs/some-files/")
722 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@suzaku suzaku changed the title WIP: Avoid calling mget with massive number of keys in Readdir Avoid calling mget with massive number of keys in Readdir Jan 24, 2021
@davies
Copy link
Contributor

davies commented Jan 24, 2021

@suzaku Based on the result from benchmark, we should have a fast path for small directorym, thanks.

@davies
Copy link
Contributor

davies commented Jan 24, 2021

Could you also add a benchmark in Go (for both of small and large directory)?

@suzaku
Copy link
Contributor Author

suzaku commented Jan 24, 2021

@davies Is it safe here to use multiple HSCAN instead of a huge HGETALL here?

@davies
Copy link
Contributor

davies commented Jan 25, 2021

@suzaku It's OK to use HSCAN. Right now, what's behavior for 5 millions files?

@suzaku
Copy link
Contributor Author

suzaku commented Jan 25, 2021

If HSCAN is OK, we can set up a goroutine to do HSCAN on smaller batches instead of calling HGETALL upfront.
I haven't tested 5 million files yet. Creating files can be quite time consuming, I didn't wait that long...

@davies
Copy link
Contributor

davies commented Jan 27, 2021

Let's merge this one first, then optimize the hgetall() later.

@davies davies merged commit c914708 into juicedata:main Jan 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants