Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

command: use external sort for comparison in sync #483

Merged
merged 34 commits into from
Jun 16, 2023

Conversation

kucukaslan
Copy link
Contributor

@kucukaslan kucukaslan commented Aug 9, 2022

It uses external sort instead of in-memory sort to dramatically reduce memory usage in the expense of the speed.
It uses encoding/gob format to store to disk and restore from there.

Fixes #441
Fixes #447

@kucukaslan
Copy link
Contributor Author

kucukaslan commented Aug 10, 2022

I ran a few tests with sync (when it was using encoding/xml, but now uses encoding/json).

I created 1,000,000 files each with 16 bytes size (using truncate & split). Then I run sync command with old version and current (with extsort) version that copies these 1,000,000 to empty s3 bucket (prefix).

Using old version, sync operation took about 500 seconds. It used about 10 GB RAM.
Using external sort with chunk size1 of 50,000, sync operation took about 600 seconds. It used up to 1 GB RAM.
Using external sort with chunk size of 100,000, sync operation took about 600 seconds. It used up to 2 GB RAM.
External sort's channel buffer size was 1,000 by default. Increasing it 5,000 did not have an observable change in speed.

I'm planning to set default chunk size to 100,000 which uses up to 2 GB RAM. Together with the file's content (when real files with significant size not merely 16 bytes are used), I guess 8 GB RAM would be sufficient in general.
I also plan to add flags to specify chunk size and channel buffer size.

Footnotes

  1. "amount of records to store in each chunk [subset] which will be written to disk". see also

@kucukaslan
Copy link
Contributor Author

kucukaslan commented Aug 11, 2022

Updates

it was using encoding/xml, but now uses encoding/json format to write disk.
it now allows specifying chunk size. If a nonpositive chunk size is provided than, it sorts objects in memory.

1,000,000 objects, each one is 16 bytes

For sync'ing 1,000,000 local objects to s3.

  • Internal sort uses upto 13 GB RAM.
  • External sort with chunk size 100,000 uses upto 1.5 GB RAM.

When all those 1,000,000 objects are supposed to be uploaded

  • External sort takes about 600±20 seconds (current default)
  • lanrat/extsort's internal sort takes about 575±20 seconds
  • Ordinary internal sort1 takes about 575±20 seconds (current internal sort algorithm)
  • Previous algoritmh takes 490±20 seconds

When none of the objects are supposed to be uploaded, with

  • External sort takes about 90±5 seconds (current default)
  • lanrat/extsort's internal sort takes about 85±5 seconds
  • my custom internal sort takes about 80±5 seconds (current internal sort algorithm)
  • Previous algoritmh takes 75±5 seconds

Extra commentaries

  • changing size of channel buffers don't have a measurable effect,
  • changing extsort's NumWorkers has no measurable effect (2 vs 16)

9.1 GB semi-natural directory

A directory of 9.1 GB in 52250 objects. Directory content consists of the some repositories cloned from github, an .iso. and a script file:

QuantumLibraries Telegram digitalgov.gov fireship.io go openimage.py s5cmd sql-server-samples ubuntu.iso

Hyperfine single run results (no reproducible difference):

'Old internal sort' ran
1.01 times faster than 'New internal sort'
1.02 times faster than 'External Sort'

hyperfine --show-output -w 0 -r 1
-n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0/'
-n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1/'
-n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2/'

Benchmark 1: Old internal sort

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        42.306 s               [User: 178.126 s, System: 41.943 s]

Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        43.070 s               [User: 195.419 s, System: 44.529 s]

Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        42.603 s               [User: 182.621 s, System: 42.220 s]

Summary
  'Old internal sort' ran
    1.01 times faster than 'New internal sort'
    1.02 times faster than 'External Sort'
'External Sort' ran
1.07 times faster than 'Old internal sort'
1.10 times faster than 'New internal sort''

hyperfine --show-output -w 0 -r 1
-n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0f/'
-n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1f/'
-n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2f/'

Benchmark 1: Old internal sort

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        43.098 s               [User: 179.197 s, System: 40.396 s]

Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1f/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        40.405 s               [User: 196.294 s, System: 44.523 s]

Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2f/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        44.270 s               [User: 176.710 s, System: 40.735 s]

Summary
  'External Sort' ran
    1.07 times faster than 'Old internal sort'
    1.10 times faster than 'New internal sort'

Footnotes

  1. I didn't know how to say "just using the go's sort library as it was done earlier, except that reading from channels and sending to a channel".

@kucukaslan
Copy link
Contributor Author

kucukaslan commented Aug 15, 2022

I guess using StableSort would make difference only if the filepath.ToSlash was called before the sort operation, not after.

Furthermore I think using filepath.ToSlash is problematic. Because it, potentially, breaks the assumption that both of the lists/channels are listed. For example, in a windows device we may have such an order:
"ab"
"a\b"
but in compareObjects we use them after applying filepath.ToSlash which converts them to
"ab"
"a/b"
but in ascii "/" comes before "b", so practically it is not sorted as we expect and assume.

So I propose these changes as fix: kucukaslan@40a22c9

Warning
I'm wrong. The filepath.ToSlash is already called at the url.New so the path seperators are converted to slash well in advance. My, so-called, fix would be redundant, so I won't include that in this PR. For the record, I'm not sure if this behaviour of url.New is a good practice or not.

s5cmd/storage/url/url.go

Lines 78 to 81 in 2a6c7cc

if runtime.GOOS == "windows" {
url.Path = filepath.ToSlash(url.Path)
}
return url, nil

@kucukaslan kucukaslan marked this pull request as ready for review August 16, 2022 10:07
@kucukaslan kucukaslan requested a review from a team as a code owner August 16, 2022 10:07
@kucukaslan kucukaslan requested review from igungor and ilkinulas and removed request for a team August 16, 2022 10:07
command/sync.go Outdated Show resolved Hide resolved
@kucukaslan kucukaslan changed the title use external sort for comparison in sync command: use external sort for comparison in sync Aug 27, 2022
The destination only objects are only needed if they're going to be deleted.  So the consumer of onlyDest channel worked if they're going to be deleted.
This naive behaviour caused channel not to accept URLs and block send requests when the channel buffer is filled!

Moral of the story, don't be naive. The fact that you need channel in one case does not mean you need to consume channel only in that case.
@kucukaslan kucukaslan marked this pull request as ready for review September 1, 2022 07:39
@kucukaslan kucukaslan marked this pull request as draft September 6, 2022 07:31
@kucukaslan
Copy link
Contributor Author

kucukaslan commented Sep 7, 2022

I've implemented another variant that uses gob encoding instead of the json encoding in this branch.

I've also written a test code to compare their respective encoding/decoding1 speeds.
The gob encoding uses much less space than the json encoding (88 vs ~150 bytes for the test input).

In my first tests, the json encoding was much faster than the gob encoding. For 1 million objects the json encoding took about 8.5 seconds, while gob encoding took about 30 seconds. After perusing their implementations2, I identified the main reason of the difference: The gob's default time.Time encoding seems to be slower than storing the time as RFC3339Nano formatted string. Now gob takes about 10 seconds for the same test values.

I currently do not know how the encoded slice sizes will affect the external sorting.

Update: Tested to sync. 500000. local files to s3 bucket which has only one file.
Benchmark 1: s5cmdgob --dry-run sync /tmp/s5cmd/small500K/ s3://...
Time (abs ≡): 1266.216 s [User: 803.825 s, System: 1065.423 s]

Benchmark 2: s5cmdjson --dry-run sync /tmp/s5cmd/small500K/ s3://...
Time (abs ≡): 1299.062 s [User: 956.893 s, System: 1111.326 s]

Summary
's5cmdgob' ran
1.03 times faster than 's5cmdjson'

test code:

func TestFromToBytes(t *testing.T) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	count := 1000000

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		t.Fail()
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}
	b := o1.ToBytes()
	fmt.Println(len(b), b)
	start := time.Now()
	for i := 0; i < count; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)

	}
	elapsed := time.Since(start)
	fmt.Printf("Processing %d objects took %s", count, elapsed)
	t.Fail()
}

benchmark

cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz

case sample size time per op ? ?
json encoding 125584 8531 ns/op 3996 B/op 77 allocs/op
improved gob encoding 107913 10309 ns/op 5856 B/op 113 allocs/op
raw gob encoding 26679 41970 ns/op 11462 B/op 263 allocs/op
func BenchmarkFromToBytes(b *testing.B) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		b.Fail()
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
}

Footnotes

  1. More precisely the (Object).ToBytes and the storage.FromBytes methods.

  2. almost by luck

@kucukaslan kucukaslan marked this pull request as ready for review September 8, 2022 07:45
@ilkinulas
Copy link
Member

Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:

  • ToBytes and FromBytes functions are more readable and type-safe.
  • Gop implementation uses less space although it has almost the same speed as json.

kucukaslan and others added 4 commits September 15, 2022 22:11
…ding to increase speed

This change reduced the time it takes for 1.000.000 encode/decode operation from 30 seconds to 10 seconds.
@kucukaslan
Copy link
Contributor Author

kucukaslan commented Sep 15, 2022

Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:

  • ToBytes and FromBytes functions are more readable and type-safe.
  • Gop implementation uses less space although it has almost the same speed as json.

My pleasure. I've rebased the gob encoding branch to this PR.

ps. Time is converted to RFC3339Nano formatted string before gob encoding is called, and parsed from that string after gob decode.

command/sync.go Outdated Show resolved Hide resolved
@arturrez
Copy link

hello. any chance this fix will be merged and released? it's very useful for us

@kbarrette
Copy link

I'd also be interested in seeing this merged!

@igungor igungor merged commit f0ce87d into peak:master Jun 16, 2023
@igungor
Copy link
Member

igungor commented Jun 16, 2023

Sorry for the long wait @kucukaslan, and thank you for your hard work on this 😄

@kucukaslan
Copy link
Contributor Author

kucukaslan commented Jun 16, 2023

Sorry for the long wait @kucukaslan, and thank you for your hard work on this 😄

It was my pleasure, thanks 😊

@antonsmelyanskiy
Copy link

I love you for this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync command issue with many files. Large memory utilisation when syncing many files
6 participants