klient/machine: improve indexing performance with concurrent file scanning #10434

ppknap · 2017-01-27T04:34:50Z

The first step when creating a mount is to build the index from the remote FS. For the entire Koding repository we get the following numbers:

Scanning time: 43.349127583s
Files: 212972
Files time: 2.838µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.514726ms

The most time-consuming operation is computing the SHA-1 sum of stored files (which is expected, and the result is cached after the first scan). This operation can be done concurrently, and that's what this PR introduces. Time measurements after the change:

Scanning time: 19.260137022s (improvement: 225%, i.e. about 2.25x faster)
Files: 212972
Files time: 8.605µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.457452ms
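
For illustration only, here is a minimal sketch of bounded concurrent hashing, using a buffered channel as a counting semaphore in the spirit of the `limitC` channel in the reviewed diff. The names `scan` and `hashFile`, the returned map, and the limit of 8 are assumptions made for this example and are not taken from the PR's code.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// hashFile returns the hex-encoded SHA-1 sum of the file at path.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// scan walks root and hashes every regular file, keeping at most limit
// hash operations in flight. Hypothetical helper, not the PR's code.
func scan(root string, limit int) (map[string]string, error) {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		sums     = make(map[string]string)
		sem      = make(chan struct{}, limit) // counting semaphore
	)

	walkErr := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}

		sem <- struct{}{} // blocks while limit hashes are already running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()

			sum, err := hashFile(path)

			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			sums[path] = sum
		}()
		return nil
	})

	wg.Wait() // wait for in-flight hashes even if the walk failed
	if walkErr != nil {
		return nil, walkErr
	}
	if firstErr != nil {
		return nil, firstErr
	}
	return sums, nil
}

func main() {
	sums, err := scan(".", 8)
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	fmt.Printf("hashed %d files\n", len(sums))
}
```

The directory walk itself stays sequential in this sketch. Only the hashing, which the measurements above identify as the dominant cost, runs in parallel.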

Motivation and Context

Decrease mount initialization time, which currently times out after one minute due to the Kite connection deadline.

How Has This Been Tested?

Existing unit tests passed.

Types of changes

New feature (non-breaking change which adds functionality)

cihangir · 2017-01-27T09:01:58Z

most time consuming operation is SHA-1 sum

what about using another hashing algo while having 10x more performance 😛

Scanning time: 19.260137022s (improvement: 225%)
Files: 212972
Files time: 8.605µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.457452ms

could you also send the code for measuring these?

rjeczalik · 2017-01-27T09:03:47Z

And in case we need parity with git index we can force index recreation.

cihangir · 2017-01-27T09:09:35Z

go/src/koding/klient/machine/index/index.go

+func (idx *Index) addEntry(wg *sync.WaitGroup, root, path string, info os.FileInfo) {
+	idx.limitC <- struct{}{}
+
+	go func() {


Could we change the structure here a bit to use the goroutine worker pool pattern? Instead of creating a goroutine per file operation, we can limit the pool to 2*runtime.NumCPU() workers.

just to clarify what i mean by worker pool
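
A minimal sketch of the worker pool pattern, assuming a fixed number of goroutines that pull paths from a shared channel. `processAll` and the callback `fn` are hypothetical names chosen for this example, and the code is not taken from this PR.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processAll runs fn on every path using a fixed pool of worker goroutines
// and returns the results keyed by path. Hypothetical helper for illustration.
func processAll(paths []string, workers int, fn func(string) int) map[string]int {
	if workers < 1 {
		workers = 1 // with no workers the sends below would block forever
	}
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		out  = make(map[string]int, len(paths))
		jobs = make(chan string)
	)

	// Start the workers once; each one drains jobs until the channel closes.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				v := fn(p)
				mu.Lock()
				out[p] = v
				mu.Unlock()
			}
		}()
	}

	// Feed the pool, then signal that no more work is coming.
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)

	wg.Wait()
	return out
}

func main() {
	paths := []string{"a.go", "bb.go", "ccc.go"}
	lengths := processAll(paths, 2*runtime.NumCPU(), func(p string) int {
		return len(p) // stand-in for the real per-file work
	})
	fmt.Println(len(lengths), "paths processed")
}
```

Compared with starting one goroutine per file, the pool creates its goroutines once, so the number of live goroutines stays constant no matter how many files are scanned.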

@cihangir no problem at all - this was the simplest solution. I will leave the chan Type as a member of the index in order to always have at most 2*NumCPU goroutines running simultaneously (e.g. when doing N concurrent calls to Apply).

ppknap · 2017-01-27T16:00:51Z

what about using another hashing algo while having 10x more performance 😛

And in case we need parity with git index we can force index recreation

OK, let's use CRC-32 as @cihangir suggested here. We version the index type, so it should be easy to switch anyway.
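
As a rough illustration of the trade-off, the sketch below computes both checksums with the Go standard library. The helper names are hypothetical, and the IEEE polynomial is an assumption, since the thread does not say which CRC-32 table the index uses.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"hash/crc32"
)

// crc32Sum returns the CRC-32 (IEEE polynomial) checksum of data.
// The choice of polynomial is an assumption for this example.
func crc32Sum(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

// sha1Sum returns the 20-byte SHA-1 digest of data.
func sha1Sum(data []byte) [sha1.Size]byte {
	return sha1.Sum(data)
}

func main() {
	data := []byte("123456789")
	fmt.Printf("crc32 (4 bytes):  %08x\n", crc32Sum(data))
	fmt.Printf("sha1  (20 bytes): %x\n", sha1Sum(data))
}
```

CRC-32 produces a 4-byte value and is meant for detecting changes, not for resisting deliberate collisions, so it fits an index that only needs to notice modified files.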

ppknap · 2017-01-27T16:04:00Z

could you also send the code for measuring these?

@cihangir this is more of a dummy program than a real benchmark (I'm planning to add a real benchmark when we have more time), but if you are interested, the code is here

ppknap · 2017-01-27T18:08:38Z

@cihangir - after switching to CRC-32 I got the following results:
SHA-1: 13.191289437s
CRC-32: 11.610115897s

This was surprising, so I profiled the indexing, and most of the time is spent on I/O during hashing. After using byte pools:

CRC-32 + sync.Pool: Scanning time: 7.604643191s
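
A minimal sketch of reusing read buffers through `sync.Pool` while hashing, along the lines of the commit titled "use sync.Pool to buffer copy slices". The 32 KiB buffer size, the names `bufPool`, `checksum`, and `checksumString`, and the explicit read loop are assumptions for this example, not the PR's actual code.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"io"
	"strings"
	"sync"
)

// bufPool hands out reusable 32 KiB buffers so that each hashed file does
// not allocate a fresh slice. Pointers are stored to avoid an extra
// allocation when the slice header is boxed into an interface.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 32*1024)
		return &b
	},
}

// checksum streams r through CRC-32 using a pooled buffer.
func checksum(r io.Reader) (uint32, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)
	buf := *bp

	h := crc32.NewIEEE()
	for {
		n, err := r.Read(buf)
		if n > 0 {
			h.Write(buf[:n]) // hash.Hash writes never return an error
		}
		if err == io.EOF {
			return h.Sum32(), nil
		}
		if err != nil {
			return 0, err
		}
	}
}

// checksumString is a small convenience wrapper used by the demo below.
func checksumString(s string) (uint32, error) {
	return checksum(strings.NewReader(s))
}

func main() {
	sum, err := checksumString("123456789")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%08x\n", sum) // prints "cbf43926"
}
```

The explicit read loop guarantees the pooled buffer is the one actually used. `io.CopyBuffer` may bypass the supplied buffer when the source implements `io.WriterTo` or the destination implements `io.ReaderFrom`.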

I'm stopping at this point since I need to focus on serialized file size optimization. Moreover, the rest of the time is I/O.

cihangir

💯

ppknap added the Waiting for Review label Jan 27, 2017

ppknap assigned cihangir and rjeczalik Jan 27, 2017

ppknap requested review from cihangir and rjeczalik January 27, 2017 04:34

cihangir reviewed Jan 27, 2017

ppknap added the wip label Jan 27, 2017

ppknap force-pushed the index_scan_performance branch from f559a51 to b22588d Compare January 27, 2017 20:22

ppknap removed the wip label Jan 27, 2017

Pawel Knap added 4 commits January 28, 2017 05:53

klient/machine: improve indexing performance with concurrent file scanning

878af21

klient/machine: use CRC-32 checksum instead of SHA-1

58d0a50

klient/machine: use sync.Pool to buffer copy slices

0f98765

klient/machine: use workers to add index entries

5cc12a0

ppknap force-pushed the index_scan_performance branch from b22588d to 5cc12a0 Compare January 28, 2017 04:53

cihangir approved these changes Jan 29, 2017

klient/machine: rename Index SHA1 field since we no longer use SHA-1

0a18934

rjeczalik approved these changes Jan 30, 2017

rjeczalik merged commit 460e884 into master Jan 30, 2017

rjeczalik deleted the index_scan_performance branch January 30, 2017 16:04
