Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove parallel import code #692

Merged
merged 1 commit into from
Aug 10, 2022
Merged

Remove parallel import code #692

merged 1 commit into from
Aug 10, 2022

Conversation

craigds
Copy link
Member

@craigds craigds commented Aug 10, 2022

Description

This code adds complexity and doesn't help since the repo layout
optimisations were introduced in Datasets V3, although
it probably does speed up datasets with string primary keys.

With sequential integer primary keys, it just results in idle processes
since features from a sequence are consistently fed into the same
process, resulting in no efficiency gain.

Since this code is quite complex, it doesn't make sense to keep it

On my laptop, I got these results:

--num-processes=10:

Added 2,328,809 Features to index in 87.5s
Overall rate: 26607 features/s)
Closed in 0s
Joining 10 parallel-imported trees...
Joined trees in 0s

--num-processes=1:

Added 2,328,809 Features to index in 88.4s
Overall rate: 26344 features/s)
Closed in 0s

The idle processes can be seen in Activity Monitor:

Screen Shot 2022-08-10 at 11 41 15 AM

Related links:

.

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

@craigds craigds requested a review from olsen232 August 10, 2022 00:34
This code adds complexity and doesn't help since the repo layout
optimisations were introduced in Datasets V3, although
it probably does speed up datasets with string primary keys.

With sequential integer primary keys, it just results in idle processes
since features from a sequence are consistently fed into the same
process, resulting in no efficiency gain.

Since this code is quite complex, it doesn't make sense to keep it
@craigds craigds merged commit 46147de into master Aug 10, 2022
@craigds craigds deleted the rm-parallel-import branch August 10, 2022 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants