Skip to content
This repository has been archived by the owner on Aug 15, 2022. It is now read-only.

klient/machine: improve indexing performance with concurrent file scanning #10434

Merged
merged 5 commits into from Jan 30, 2017

Conversation

ppknap
Copy link
Contributor

@ppknap ppknap commented Jan 27, 2017

First step when creating a mount is to create the index from remote FS. For entire Koding repository we have following numbers:

Scanning time: 43.349127583s
Files: 212972
Files time: 2.838µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.514726ms

The most time consuming operation is SHA-1 sum of stored files (which is expected and cached after the first scan). This operation can be made concurrently and that's what this PR introduces. Time measurements after the change:

Scanning time: 19.260137022s (improvement: 225%)
Files: 212972
Files time: 8.605µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.457452ms

Motivation and Context

Decrease mount initialization time (which currently times out after one minute - due to Kite connection deadline).

How Has This Been Tested?

Existing unit tests passed.

Types of changes

  • New feature (non-breaking change which adds functionality)

@cihangir
Copy link
Contributor

cihangir commented Jan 27, 2017

most time consuming operation is SHA-1 sum

what about using another hashing algo while having 10x more performance 😛

Scanning time: 19.260137022s (improvement: 225%)
Files: 212972
Files time: 8.605µs
Files Disk Size: 3.6 GiB
Files Disk Size time: 16.457452ms

could you also send the code for measuring these?

@rjeczalik
Copy link
Member

rjeczalik commented Jan 27, 2017 via email

func (idx *Index) addEntry(wg *sync.WaitGroup, root, path string, info os.FileInfo) {
idx.limitC <- struct{}{}

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we change the structure here a bit with goroutine worker pool pattern? Instead of creating goroutines per file operation we can limit it by 2*runtime.NumCPU()

just to clarify what i mean by worker pool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cihangir no problem at all - this was the simplest solution. I will left chan Type as a member of indexin order to always have max 2*NumCPU goroutins running simultaneously (eg when doing N concurrent calls to Apply).

@ppknap
Copy link
Contributor Author

ppknap commented Jan 27, 2017

what about using another hashing algo while having 10x more performance 😛

And in case we need parity with git index we can force index recreation

OK, lets use crc32 as @cihangir sugessted here. We version the index type so it should be easy to switch anyway.

@ppknap
Copy link
Contributor Author

ppknap commented Jan 27, 2017

could you also send the code for measuring these?

@cihangir this is rather a dummy program than a real benchmark (I planning to add real bench when we have more time) but if you are interested, the code is here

@ppknap
Copy link
Contributor Author

ppknap commented Jan 27, 2017

@cihangir - after switching to CRC-32 i got following results:
SHA-1: 13.191289437s
CRC-32: 11.610115897s

This was surprising so I profiled the index and most of the time spent is I/O during hashing. After using byte pools:

CRC-32 + BytePoo sync.Pool: Scanning time: 7.604643191s

I'm stopping at this point - need to focus on serialized file size optimization. Moreover, rest of the time is IO:
image

@ppknap ppknap added the wip label Jan 27, 2017
@ppknap ppknap removed the wip label Jan 27, 2017
Copy link
Contributor

@cihangir cihangir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@rjeczalik rjeczalik merged commit 460e884 into master Jan 30, 2017
@rjeczalik rjeczalik deleted the index_scan_performance branch January 30, 2017 16:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants