
import: Parallel fast-import processes #408

Merged: 4 commits merged into master on Apr 20, 2021
Conversation

@craigds (Member) commented on Apr 14, 2021

Description

I noticed we were pegging a CPU core for a git-fast-import process for many hours during a large import (tens of millions of features).

This feeds features from the existing single import source connection
to multiple (default=4) git-fast-import processes.
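For orientation, here is a minimal sketch of the general shape this implies: one git-fast-import child per worker, each fed its own stream over stdin and writing to its own temporary ref so the per-process trees can be combined afterwards. The flags, placeholder repo path, and overall structure are illustrative assumptions, not this PR's actual code.

    import subprocess

    NUM_PROCESSES = 4  # the default introduced by this PR

    # One git-fast-import child per worker. Each one receives its own
    # complete fast-import stream (commit header, M/D commands, then
    # "done") and writes to its own temporary ref; the resulting trees
    # are combined once all workers finish.
    procs = [
        subprocess.Popen(
            ["git", "fast-import", "--quiet", "--done"],
            cwd="path/to/repo.git",  # placeholder: the target repository
            stdin=subprocess.PIPE,
        )
        for _ in range(NUM_PROCESSES)
    ]

The single import-source connection then iterates the features once, writing each feature to exactly one of these stdin pipes.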

Performance

The speedup is much better than I expected. I'm suspicious actually:

--num-processes=1

  10,000 features... @317.6s

I killed it at this point; I don't have all day.

--num-processes=4 (the default)

  10,000 features... @9.8s
...
  50,000 features... @196.6s

This doesn't quite make sense to me - that's over 30x faster with only 4x more processes! But it is repeatable.

My sample repo easily achieves 100% saturation of four cores, with sno still basically idling:
(screenshot, 2021-04-16 11:07 AM: four cores fully saturated)

I wonder if this unexpectedly good speedup is something to do with disk write caching. Since each process is writing to a subset of the trees, rather than writing to any random tree, perhaps the kernel (or git-fast-import itself) has some kind of per-process write cache which is now performing much better than previously. I don't have any other ideas at the moment (but fast is good! I'll take it)

--num-processes=8, let's see how far we can take this

  10,000 features... @2.1s
...
  50,000 features... @46.0s

or 3x as fast as --num-processes=4 and 66x as fast as --num-processes=1.

--num-processes=16

  10,000 features... @0.8s
...
  50,000 features... @11.2s
...
  161,000 features... @38.5s

--num-processes=26

I only have 26 cores available right now so this is where I'll stop:

  10,000 features... @0.9s
...
  50,000 features... @4.9s
...
  161,000 features... @16.9s

... 352x faster than --num-processes=1 🤨

Amazingly, I can still saturate 26 cores:

(screenshot, 2021-04-16 12:39 PM: 26 cores fully saturated)

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

@olsen232 (Collaborator) left a comment:

Instead of referring directly to Dataset2, you should be using the constant SUPPORTED_DATASET_CLASS...

... for some reason. It maybe doesn't matter too much

@rcoup (Member) left a comment:

Be interesting to see if this is actually much quicker? — should try it before we merge anything.

How much CPU was python cf git-fast-import using, and where were the profiled hotspots in Python?

@craigds (Member, Author) commented on Apr 16, 2021:

The repos appear to be legit; I get the same exact feature tree with --num-processes=4 and --num-processes=26

Update: I've actually checked a bunch of features against the source db; it really does look legit 👍

@craigds (Member, Author) commented on Apr 16, 2021:

> How much CPU was python cf git-fast-import using, and where were the profiled hotspots in Python?

Each git-fast-import always saturates a core; I haven't managed to get Python past about 25% of a core, even with --num-processes=26.

@craigds marked this pull request as ready for review on April 16, 2021 03:08
@craigds requested review from rcoup and olsen232 on April 16, 2021 03:08
@rcoup (Member) left a comment:
LGTM 🚀

for i, proc in enumerate(procs):
    if replace_ids is None:
        # Delete the existing dataset, before we re-import it.
        proc.stdin.write(f"D {source.dest_path}\n".encode("utf8"))
@rcoup (Member) commented:
this is a global path, right? How does this work with multiple processes? So each process deletes everything independently, which clears the trees?

@craigds (Member, Author) replied on Apr 19, 2021:

correct - but they work on different refs and thus different root trees. There's no conflict here; each starts by deleting a ton of stuff and then reimports separate non-overlapping stuff.

E.g. process 0 will import tree ab, process 1 will import tree ac, process 0 will import tree ad, etc.

Because process 1 started by completely deleting tree ab, that tree won't conflict with process 0's tree later when we merge them.
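A minimal sketch of the kind of routing this implies, assuming features live under two-character subtrees of a feature tree. The path layout, hash choice, and signature are illustrative, not the actual proc_for_feature_path from this PR (which presumably closes over the worker list and takes only the path, as the diff further down shows).

    import zlib

    def proc_for_feature_path(path, procs):
        # Route a feature to a worker based on its leading subtree (the "ab"
        # in ".../feature/ab/..."): every feature under the same subtree goes
        # to the same worker, so the per-process trees never overlap.
        prefix = path.rsplit("/feature/", 1)[-1].split("/", 1)[0]
        return procs[zlib.crc32(prefix.encode("utf8")) % len(procs)]

With routing like this, every path under a given feature subtree always lands on the same process, which is why the per-process trees can be merged without conflicts.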

Comment on lines 304 to 306
# delete all features not pertaining to this process.
# we also delete the features that *do*, but we do it further down
# so that we don't have to iterate the IDs more than once.
@rcoup (Member) commented:

I wonder if there's any benefit to deleting at all - rather than just starting from empty trees and then combining them at the end?

@olsen232 (Collaborator) commented:

Oh yeah this is probably something to do with enum ReplaceExisting:

  1. DONT_REPLACE: importing a brand new dataset you don't currently have (all existing data is retained)
  2. GIVEN: importing a dataset over the top of a single existing dataset (retaining all other datasets)
  3. ALL: importing a dataset over the top of the entire repo (no data is retained)

So another way to look at it, if your repo has N datasets, DONT_REPLACE starts the import from a base of N, GIVEN starts from a base of N-1, and ALL starts from a base of 0.

Which type of ReplaceExisting do you use for mirroring? If it's ALL, then this delete code won't be run anyway for mirroring, so it doesn't make much difference to us... you'd just be cleaning this up for a (hypothetical) user who is using mode GIVEN, which is the only mode that needs to start from somewhere fancy where things need deleting to get to N-1 - starting from N (where we are now) or from 0 (the empty tree) is trivial. Consider switching to ALL if you are not, since it should be slightly cheaper...

But Rob is right - perhaps with your new found tree merging skills, we could do some sanity checks, and then we would always do a ReplaceExisting.ALL to import this dataset, merge all its trees into a single tree... and then, merge that tree in with the rest of the repo for a ReplaceExisting.GIVEN or DONT_REPLACE. It might even make the code a bit simpler
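As a reference point, here is a minimal sketch of the three modes described above. The member names follow this comment, but the actual definition in sno/fast_import.py may differ.

    from enum import Enum, auto

    class ReplaceExisting(Enum):
        # Importing a brand new dataset: everything already in the repo is
        # retained (start from a base of N datasets).
        DONT_REPLACE = auto()
        # Re-importing one dataset over the top of itself: all other
        # datasets are retained (base of N-1). This is the only mode that
        # needs the delete commands discussed in this thread.
        GIVEN = auto()
        # Re-importing over the entire repo: nothing is retained (base of
        # 0, i.e. the empty tree).
        ALL = auto()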

@craigds (Member, Author) replied:

But --replace-ids is a list of IDs to reimport. It doesn't cover the whole dataset. The rest of the dataset should be left untouched.

@@ -274,7 +344,9 @@ def _ids():
     for pk in replace_ids:
         pk = source.schema.sanitise_pks(pk)
         path = dataset.encode_pks_to_path(pk)
-        p.stdin.write(f"D {path}\n".encode("utf8"))
+        proc_for_feature_path(path).stdin.write(
+            f"D {path}\n".encode("utf8")
+        )
@rcoup (Member) commented:

does it need deleting? Adding a new blob doesn't just replace the existing one?

@craigds (Member, Author) replied:

The IDs in --replace-ids might no longer exist in the import source, and if that's the case then we have to delete them from the dataset.

Then, immediately after this, they're fetched from the import source and reimported if they do exist (see the sketch below).
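A sketch of that delete-then-reimport flow, written as a self-contained helper. dataset.encode_pks_to_path and proc_for_feature_path mirror the diff above; source.features_for_ids and the blob handling are hypothetical stand-ins for whatever the import source actually provides.

    def apply_replace_ids(replace_ids, source, dataset, proc_for_feature_path):
        # Phase 1: delete every requested PK via the worker that owns its
        # path. PKs that no longer exist in the import source stay deleted.
        for pk in replace_ids:
            path = dataset.encode_pks_to_path(pk)
            proc_for_feature_path(path).stdin.write(f"D {path}\n".encode("utf8"))

        # Phase 2: re-fetch the same PKs from the import source; only the
        # ones that still exist come back, and each is rewritten as a blob
        # using fast-import's inline filemodify command.
        # source.features_for_ids() is a hypothetical helper, not sno's API.
        for path, blob in source.features_for_ids(replace_ids):
            proc = proc_for_feature_path(path)
            proc.stdin.write(f"M 644 inline {path}\ndata {len(blob)}\n".encode("utf8"))
            proc.stdin.write(blob + b"\n")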

@craigds merged commit b6989f1 into master on Apr 20, 2021
@craigds deleted the parallel-imports branch on April 20, 2021 00:05