import: Parallel fast-import processes #408
Conversation
Instead of referring directly to `Dataset2`, you should be using the constant `SUPPORTED_DATASET_CLASS`... for some reason. It maybe doesn't matter too much.
Force-pushed from 01e5e5b to f98c06b
It'd be interesting to see if this is actually much quicker - we should try it before we merge anything. How much CPU was Python using compared to git-fast-import, and where were the profiled hotspots in Python?

The repos appear to be legit; I get the same exact feature tree with … (update: I've actually checked a bunch of features against the source db; it really does look legit 👍)

Each git-fast-import always saturates a core; I haven't managed to get Python past about ~25% of a core, even with …
Force-pushed from 74e2925 to c7744ec
LGTM 🚀
sno/fast_import.py
Outdated
```python
for i, proc in enumerate(procs):
    if replace_ids is None:
        # Delete the existing dataset, before we re-import it.
        proc.stdin.write(f"D {source.dest_path}\n".encode("utf8"))
```
this is a global path, right? How does this work with multiple processes? So each process deletes everything independently, which clears the trees?
Correct - but they work on different refs and thus different root trees. There's no conflict here; each starts by deleting a ton of stuff and then reimports separate non-overlapping stuff.

e.g. process 0 will import tree `ab`, process 1 will import tree `ac`, process 0 will import tree `ad`, etc. Because process 1 started by completely deleting tree `ab`, that tree won't conflict with process 0's tree later when we merge them.
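(For anyone skimming: a minimal sketch of what that routing could look like. This is not the code in this PR; `procs`, the path layout, and the exact signature are assumptions based on the discussion above.)

```python
def proc_for_feature_path(path, procs):
    # Route each feature to a worker based on its parent tree (e.g. ".../ab"),
    # so every git-fast-import process owns a disjoint set of subtrees that
    # can later be merged without conflicts. The routing only needs to be
    # consistent within a single import run, so the built-in hash() is enough.
    subtree = path.rsplit("/", 1)[0]
    return procs[hash(subtree) % len(procs)]
```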
sno/fast_import.py
Outdated
```python
# delete all features not pertaining to this process.
# we also delete the features that *do*, but we do it further down
# so that we don't have to iterate the IDs more than once.
```
I wonder if there's any benefit to deleting at all, rather than just starting from empty trees and then combining them at the end?
Oh yeah, this is probably something to do with the enum `ReplaceExisting`:
- `DONT_REPLACE`: importing a brand new dataset you don't currently have (all existing data is retained)
- `GIVEN`: importing a dataset over the top of a single existing dataset (retaining all other datasets)
- `ALL`: importing a dataset over the top of the entire repo (no data is retained)

So, another way to look at it: if your repo has N datasets, `DONT_REPLACE` starts the import from a base of N, `GIVEN` starts from a base of N-1, and `ALL` starts from a base of 0.
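(Roughly, as an illustration of the above - the names are taken from this thread; the real definition lives in sno/fast_import.py and may differ:)

```python
from enum import Enum, auto

class ReplaceExisting(Enum):
    # Import a brand-new dataset; all existing data is retained.
    DONT_REPLACE = auto()
    # Replace only the given dataset; all other datasets are retained.
    GIVEN = auto()
    # Replace the entire repo contents; no existing data is retained.
    ALL = auto()
```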
Which type of `ReplaceExisting` do you use for mirroring? If it's `ALL`, then this delete code won't be run anyway for mirroring, so it doesn't make much difference to us - you'd just be cleaning this up for a (hypothetical) user who is using mode `GIVEN`, which is the only mode that needs to start from somewhere fancy where things need deleting to get down to N-1; starting from N (where we are now) or from 0 (the empty tree) is trivial. Consider switching to `ALL` if you're not using it already, since it should be slightly cheaper...

But Rob is right - perhaps with your new-found tree-merging skills, we could do some sanity checks and then always do a `ReplaceExisting.ALL` import of this dataset, merge all its trees into a single tree... and then merge that tree in with the rest of the repo for `ReplaceExisting.GIVEN` or `DONT_REPLACE`. It might even make the code a bit simpler.
But `--replace-ids` is a list of IDs to reimport. It doesn't cover the whole dataset; the rest of the dataset should be left untouched.
sno/fast_import.py
Outdated
```diff
@@ -274,7 +344,9 @@ def _ids():
         for pk in replace_ids:
             pk = source.schema.sanitise_pks(pk)
             path = dataset.encode_pks_to_path(pk)
-            p.stdin.write(f"D {path}\n".encode("utf8"))
+            proc_for_feature_path(path).stdin.write(
+                f"D {path}\n".encode("utf8")
```
Does it need deleting? Doesn't adding a new blob just replace the existing one?
The IDs in `--replace-ids` might no longer exist in the import source, and if that's the case then we have to delete them from the dataset. Then immediately after this, they're fetched from the import source and reimported if they do exist.
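(Sketching the shape of that, purely as an illustration - `fetch_features` and `write_feature` are hypothetical helpers here; only `sanitise_pks`, `encode_pks_to_path`, and `proc_for_feature_path` appear in the diff above:)

```python
# Delete every requested pk up front; a pk that no longer exists in the
# source stays deleted, the rest are re-added by the loop below.
for pk in replace_ids:
    pk = source.schema.sanitise_pks(pk)
    path = dataset.encode_pks_to_path(pk)
    proc_for_feature_path(path).stdin.write(f"D {path}\n".encode("utf8"))

# Re-fetch only the pks that still exist in the import source, and write
# them back out as blobs to the same worker process that deleted them.
for pk, feature in fetch_features(source, replace_ids):        # hypothetical helper
    path = dataset.encode_pks_to_path(pk)
    write_feature(proc_for_feature_path(path), path, feature)  # hypothetical helper
```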
Force-pushed from c7744ec to 3e49d25
This feeds features from the existing single import source connection to multiple (default=4) git-fast-import processes.
Force-pushed from 8f0a0e0 to 7aa4765
Description
I noticed we were pegging a CPU core for a git-fast-import process for many hours during a large import (tens of millions of features).
This feeds features from the existing single import source connection
to multiple (default=4) git-fast-import processes.
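(For the record, a heavily simplified sketch of that layout - the flags, routing, and helper names here are illustrative assumptions, not the PR's actual code:)

```python
import subprocess

NUM_PROCESSES = 4  # the --num-processes default

# One git-fast-import worker per process; each writes its own trees, which
# are merged into a single commit once all the workers have finished.
procs = [
    subprocess.Popen(
        ["git", "fast-import", "--quiet", f"--export-marks=marks-{i}"],
        cwd="path/to/repo.git",  # assumption: the repo being imported into
        stdin=subprocess.PIPE,
    )
    for i in range(NUM_PROCESSES)
]

# Commit headers ("commit refs/...", committer, message, etc.) omitted here;
# each worker would be given its own temporary ref before any file commands.
for path, blob in iter_source_features():  # hypothetical: single source connection
    proc = procs[hash(path.rsplit("/", 1)[0]) % NUM_PROCESSES]  # route by subtree
    proc.stdin.write(f"M 100644 inline {path}\n".encode("utf8"))
    proc.stdin.write(b"data %d\n%s\n" % (len(blob), blob))

for proc in procs:
    proc.stdin.close()
    proc.wait()
```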
Performance
The speedup is much better than I expected. I'm suspicious actually:
`--num-processes=1`

I killed it at this point; I don't have all day.

`--num-processes=4` (the default)

This doesn't quite make sense to me - that's over 30x faster with only 4x more processes! But it is repeatable.
My sample repo easily achieves 100% saturation of four cores, with sno still basically idling.

I wonder if this unexpectedly good speedup is something to do with disk write caching. Since each process is writing to a subset of the trees, rather than writing to any random tree, perhaps the kernel (or git-fast-import itself) has some kind of per-process write cache which is now performing much better than previously. I don't have any other ideas at the moment (but fast is good! I'll take it).
`--num-processes=8` (let's see how far we can take this)

... or 3x as fast as `--num-processes=4`, and 66x as fast as `--num-processes=1`.

`--num-processes=16`

`--num-processes=26`

I only have 26 cores available right now, so this is where I'll stop:

... 352x faster than `--num-processes=1` 🤨

Amazingly, I can still saturate 26 cores.
Checklist: