klient/machine: improve indexing performance with concurrent file scanning #10434
What about using another hashing algorithm that gives 10x better performance? 😛
Could you also send the code for measuring these?
And in case we need parity with git index we can force index recreation
Cihangir <notifications@github.com> wrote on Fri, Jan 27, 2017 at 10:01:
… most time consuming operation is SHA-1 sum
what about using another hashing algo while having 10x more performance 😛
func (idx *Index) addEntry(wg *sync.WaitGroup, root, path string, info os.FileInfo) {
	idx.limitC <- struct{}{}

	go func() {
Could we change the structure here a bit to use the goroutine worker pool pattern? Instead of creating a goroutine per file operation, we can limit the number of workers to 2*runtime.NumCPU().
just to clarify what i mean by worker pool
@cihangir no problem at all - this was the simplest solution. I will leave the chan Type as a member of the index so that at most 2*NumCPU goroutines are running simultaneously (e.g. when doing N concurrent calls to Apply).
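The shared-limit idea can be sketched as a buffered channel used as a semaphore. This is an illustration under assumptions, not the PR's implementation: `limiter`, `newLimiter`, `run`, and `maxConcurrent` are hypothetical names, with `limiter` standing in for the index that owns the channel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// limiter is a hypothetical stand-in for the index: it owns the semaphore
// channel, so every caller shares the same concurrency budget.
type limiter struct {
	limitC chan struct{}
}

func newLimiter(n int) *limiter {
	return &limiter{limitC: make(chan struct{}, n)}
}

// run blocks until a slot is free, then executes fn in a new goroutine.
func (l *limiter) run(wg *sync.WaitGroup, fn func()) {
	l.limitC <- struct{}{} // acquire a slot
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() { <-l.limitC }() // release the slot
		fn()
	}()
}

// maxConcurrent pushes `tasks` trivial tasks through a limiter of size n
// and reports the highest number observed running at the same time.
func maxConcurrent(n, tasks int) int64 {
	l := newLimiter(n)
	var wg sync.WaitGroup
	var running, peak int64
	for i := 0; i < tasks; i++ {
		l.run(&wg, func() {
			cur := atomic.AddInt64(&running, 1)
			for {
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			runtime.Gosched()
			atomic.AddInt64(&running, -1)
		})
	}
	wg.Wait()
	return peak
}

func main() {
	n := 2 * runtime.NumCPU()
	fmt.Println("limit:", n, "peak:", maxConcurrent(n, 1000))
}
```

Because the channel lives on the shared value rather than being created per call, several concurrent callers still compete for the same n slots.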
@cihangir - after switching to CRC-32 I got the following results:
This was surprising, so I profiled the index; most of the time is spent on I/O during hashing.
After using byte pools:
CRC-32 +
I'm stopping at this point - I need to focus on serialized file size optimization. Moreover, the rest of the time is I/O:
f559a51 to b22588d
b22588d to 5cc12a0
💯
The first step when creating a mount is to create the index from the remote FS. For the entire Koding repository we have the following numbers:
Scanning time: 43.349127583s
Files: 212972
Files time: 2.838µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.514726ms
The most time-consuming operation is computing the SHA-1 sum of stored files (which is expected, and the result is cached after the first scan). This operation can be performed concurrently, and that's what this PR introduces. Time measurements after the change:
Scanning time: 19.260137022s (about 2.25x faster)
Files: 212972
Files time: 8.605µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.457452ms
Motivation and Context
Decrease mount initialization time (which currently times out after one minute due to the Kite connection deadline).
How Has This Been Tested?
Existing unit tests passed.
Types of changes