Conversation

@tiennou (Contributor) commented Jun 18, 2019

#26 originally removed all traces of parallelism in favor of less file-descriptor contention. It turns out Ruby has gems, and some of those provide ready-made, easy-to-use wrappers for parallel processing.

Use https://github.com/grosser/parallel instead of rolling our own or removing parallelization altogether.
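
As a sketch of what that could look like (none of this is docurium's actual code; valid_tags and parse_version are hypothetical names), the gem turns the per-tag work into a one-liner:

```ruby
# Hypothetical sketch: let the parallel gem fork a bounded pool of
# workers (it defaults to one per CPU core) instead of hand-rolling IPC.
require 'parallel'

results = Parallel.map(valid_tags) do |tag|
  parse_version(tag) # runs inside a forked worker process
end
```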

(Note that this is based on #39, for historical reasons. I can rebase it on top of master if that makes it easier to land.)

@carlosmn (Member) commented Jul 3, 2019

Yeah, I'd rather see this against master; otherwise there are unrelated changes that make it hard to see what's actually changing here.

Is file descriptor contention an issue you're seeing with the current multi-process scheme rather than oversubscription of the processors?

@tiennou force-pushed the feature/parallelism branch from 11092ac to d66324a on July 4, 2019 09:56
@tiennou (Contributor, Author) commented Jul 4, 2019

I'm pretty sure the issue is what the current master code does:

  • spawn n processes, where n = the number of valid tags
  • each subprocess walks every "documentable" blob in the tagged commit, "unpacks" all the trees/blobs for each of those, sends the parse results back, and exits. In other words, we start an unbounded number of processes, each writing the same data over and over for every version (see the sketch below).
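
For illustration, a minimal sketch of that pattern (valid_tags and parse_tree are hypothetical names, and the real code also pipes results back to the parent):

```ruby
# Hypothetical sketch of the unbounded scheme described above:
# one forked child per tag, all running at once.
pids = valid_tags.map do |tag|
  fork do
    parse_tree(tag) # each child re-reads/unpacks the same trees and blobs
  end
end
pids.each { |pid| Process.wait(pid) } # n children, n = number of tags
```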

I remember the machine becoming sluggish, and it would sometimes start throwing "too many open file descriptors" errors. I don't know whether it's CPU contention, SSD thrashing, or something else, but docurium was definitely mostly unusable here: there is no guarantee any run will be a complete success, so entire versions were sometimes missing from the final repository.

@carlosmn (Member) commented
Right, I was curious how you pinpointed file descriptor contention as the source of the issue. I'm pretty sure it's due to CPU oversaturation, but I'd be impressed if we managed to thrash on an SSD ;)

The issue with too many open files is most likely due to lack of garbage collection on the repository, unless your ulimits are low.
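
A hedged sketch of that mitigation, assuming a Rugged-based worker and Rugged::Repository#close (repo_path and parse_version are illustrative names): closing the repository explicitly releases its descriptors instead of waiting for GC to collect it.

```ruby
require 'rugged'

repo = Rugged::Repository.new(repo_path)
begin
  parse_version(repo, tag)
ensure
  repo.close # frees the repository's resources, including open file handles
end
```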

This gem seems really useful, but I'd still want the ability to set how many processes I want to run.

@tiennou (Contributor, Author) commented Jul 30, 2019

It's possible, via either in_processes: or in_threads:. But I'd like to postpone exposing that until we have at least option support (part of #44); the auto-detection makes it less intrusive in the meantime.
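
To make the knob concrete, a minimal sketch (same hypothetical valid_tags/parse_version names as above) of pinning the worker count versus letting the gem auto-detect it:

```ruby
require 'parallel'

# Explicit worker counts via the gem's options:
Parallel.map(valid_tags, in_processes: 4) { |tag| parse_version(tag) }
Parallel.map(valid_tags, in_threads: 4)  { |tag| parse_version(tag) }

# Default: auto-detects the processor count and forks that many workers.
Parallel.map(valid_tags) { |tag| parse_version(tag) }
```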

@tiennou force-pushed the feature/parallelism branch from d66324a to a8b9abf on August 8, 2019 08:36
@tiennou force-pushed the feature/parallelism branch from 3a6b097 to 3856485 on November 30, 2019 21:16
@carlosmn (Member) commented
It looks like we had some mismerges into master, but as I'm currently restricted to my work laptop and the master branch makes it really unhappy, a patch is going to land here to fix things.

Due to the `if` we'd return `nil` if the message was not an array.
@carlosmn merged commit 14f08a1 into libgit2:master on Dec 19, 2019
@tiennou deleted the feature/parallelism branch on April 6, 2020 13:02