import: Parallel fast-import processes #408
Conversation
Instead of referring directly to `Dataset2`, you should be using the constant `SUPPORTED_DATASET_CLASS`... for some reason. It maybe doesn't matter too much.
Force-pushed from 01e5e5b to f98c06b
It'd be interesting to see if this is actually much quicker - we should try it before we merge anything. How much CPU was Python using compared to git-fast-import, and where were the profiled hotspots in Python?

The repos appear to be legit; I get the same exact feature tree with … (update: I've actually checked a bunch of features against the source db; it really does look legit 👍)

Each git-fast-import always saturates a core; I haven't managed to get Python past about ~25% of a core, even with …
Force-pushed from 74e2925 to c7744ec
LGTM 🚀
sno/fast_import.py
Outdated
```python
for i, proc in enumerate(procs):
    if replace_ids is None:
        # Delete the existing dataset, before we re-import it.
        proc.stdin.write(f"D {source.dest_path}\n".encode("utf8"))
```
this is a global path, right? How does this work with multiple processes? So each process deletes everything independently, which clears the trees?
Correct - but they work on different refs and thus different root trees. There's no conflict here; each starts by deleting a ton of stuff and then reimports separate non-overlapping stuff.

e.g. process 0 will import tree `ab`, process 1 will import tree `ac`, process 0 will import tree `ad`, etc. Because process 1 started by completely deleting tree `ab`, that tree won't conflict with process 0's tree later when we merge them.
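(For anyone skimming: a minimal sketch of what that routing could look like. This is not the code in this PR; `procs`, the path layout, and the exact signature are assumptions based on the discussion above.)

```python
def proc_for_feature_path(path, procs):
    # Route each feature to a worker based on its parent tree (e.g. ".../ab"),
    # so every git-fast-import process owns a disjoint set of subtrees that
    # can later be merged without conflicts. The routing only needs to be
    # consistent within a single import run, so the built-in hash() is enough.
    subtree = path.rsplit("/", 1)[0]
    return procs[hash(subtree) % len(procs)]
```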
sno/fast_import.py
Outdated
```python
# delete all features not pertaining to this process.
# we also delete the features that *do*, but we do it further down
# so that we don't have to iterate the IDs more than once.
```
I wonder if there's any benefit to deleting at all, rather than just starting from empty trees and then combining them at the end?
Oh yeah, this is probably something to do with the enum `ReplaceExisting`:
- `DONT_REPLACE`: importing a brand new dataset you don't currently have (all existing data is retained)
- `GIVEN`: importing a dataset over the top of a single existing dataset (retaining all other datasets)
- `ALL`: importing a dataset over the top of the entire repo (no data is retained)

So, another way to look at it: if your repo has N datasets, `DONT_REPLACE` starts the import from a base of N, `GIVEN` starts from a base of N-1, and `ALL` starts from a base of 0.
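(Roughly, as an illustration of the above - the names are taken from this thread; the real definition lives in sno/fast_import.py and may differ:)

```python
from enum import Enum, auto

class ReplaceExisting(Enum):
    # Import a brand-new dataset; all existing data is retained.
    DONT_REPLACE = auto()
    # Replace only the given dataset; all other datasets are retained.
    GIVEN = auto()
    # Replace the entire repo contents; no existing data is retained.
    ALL = auto()
```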
Which type of `ReplaceExisting` do you use for mirroring? If it's `ALL`, then this delete code won't be run anyway for mirroring, so it doesn't make much difference to us - you'd just be cleaning this up for a (hypothetical) user who is using mode `GIVEN`, which is the only mode that needs to start from somewhere fancy where things need deleting to get down to N-1; starting from N (where we are now) or from 0 (the empty tree) is trivial. Consider switching to `ALL` if you're not using it already, since it should be slightly cheaper...

But Rob is right - perhaps with your new-found tree-merging skills, we could do some sanity checks and then always do a `ReplaceExisting.ALL` import of this dataset, merge all its trees into a single tree... and then merge that tree in with the rest of the repo for `ReplaceExisting.GIVEN` or `DONT_REPLACE`. It might even make the code a bit simpler.
But `--replace-ids` is a list of IDs to reimport. It doesn't cover the whole dataset; the rest of the dataset should be left untouched.
sno/fast_import.py
Outdated
```diff
@@ -274,7 +344,9 @@ def _ids():
         for pk in replace_ids:
             pk = source.schema.sanitise_pks(pk)
             path = dataset.encode_pks_to_path(pk)
-            p.stdin.write(f"D {path}\n".encode("utf8"))
+            proc_for_feature_path(path).stdin.write(
+                f"D {path}\n".encode("utf8")
```
Does it need deleting? Doesn't adding a new blob just replace the existing one?
The IDs in `--replace-ids` might no longer exist in the import source, and if that's the case then we have to delete them from the dataset. Then immediately after this, they're fetched from the import source and reimported if they do exist.
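(Sketching the shape of that, purely as an illustration - `fetch_features` and `write_feature` are hypothetical helpers here; only `sanitise_pks`, `encode_pks_to_path`, and `proc_for_feature_path` appear in the diff above:)

```python
# Delete every requested pk up front; a pk that no longer exists in the
# source stays deleted, the rest are re-added by the loop below.
for pk in replace_ids:
    pk = source.schema.sanitise_pks(pk)
    path = dataset.encode_pks_to_path(pk)
    proc_for_feature_path(path).stdin.write(f"D {path}\n".encode("utf8"))

# Re-fetch only the pks that still exist in the import source, and write
# them back out as blobs to the same worker process that deleted them.
for pk, feature in fetch_features(source, replace_ids):        # hypothetical helper
    path = dataset.encode_pks_to_path(pk)
    write_feature(proc_for_feature_path(path), path, feature)  # hypothetical helper
```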
Force-pushed from c7744ec to 3e49d25
This feeds features from the existing single import source connection to multiple (default=4) git-fast-import processes.
Force-pushed from 8f0a0e0 to 7aa4765
Description
I noticed we were pegging a CPU core for a git-fast-import process for many hours during a large import (tens of millions of features).
This feeds features from the existing single import source connection
to multiple (default=4) git-fast-import processes.
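(For the record, a heavily simplified sketch of that layout - the flags, routing, and helper names here are illustrative assumptions, not the PR's actual code:)

```python
import subprocess

NUM_PROCESSES = 4  # the --num-processes default

# One git-fast-import worker per process; each writes its own trees, which
# are merged into a single commit once all the workers have finished.
procs = [
    subprocess.Popen(
        ["git", "fast-import", "--quiet", f"--export-marks=marks-{i}"],
        cwd="path/to/repo.git",  # assumption: the repo being imported into
        stdin=subprocess.PIPE,
    )
    for i in range(NUM_PROCESSES)
]

# Commit headers ("commit refs/...", committer, message, etc.) omitted here;
# each worker would be given its own temporary ref before any file commands.
for path, blob in iter_source_features():  # hypothetical: single source connection
    proc = procs[hash(path.rsplit("/", 1)[0]) % NUM_PROCESSES]  # route by subtree
    proc.stdin.write(f"M 100644 inline {path}\n".encode("utf8"))
    proc.stdin.write(b"data %d\n%s\n" % (len(blob), blob))

for proc in procs:
    proc.stdin.close()
    proc.wait()
```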
Performance
The speedup is much better than I expected. I'm suspicious actually:
`--num-processes=1`

I killed it at this point; I don't have all day.

`--num-processes=4` (the default)

This doesn't quite make sense to me - that's over 30x faster with only 4x more processes! But it is repeatable.
My sample repo easily achieves 100% saturation of four cores, with sno still basically idling.

I wonder if this unexpectedly good speedup is something to do with disk write caching. Since each process is writing to a subset of the trees, rather than writing to any random tree, perhaps the kernel (or git-fast-import itself) has some kind of per-process write cache which is now performing much better than previously. I don't have any other ideas at the moment (but fast is good! I'll take it).
`--num-processes=8` (let's see how far we can take this)

... or 3x as fast as `--num-processes=4`, and 66x as fast as `--num-processes=1`.

`--num-processes=16`

`--num-processes=26`

I only have 26 cores available right now, so this is where I'll stop:

... 352x faster than `--num-processes=1` 🤨

Amazingly, I can still saturate 26 cores.
Checklist: