How to use multi core maximize? #37

Closed
jiamo opened this issue Jun 21, 2021 · 2 comments

jiamo commented Jun 21, 2021

I found this code in sortAndSave:

            boolean distinct, boolean usegzip, boolean parallel) throws IOException {
            if (parallel) {
              tmplist = tmplist.parallelStream().sorted(cmp).collect(Collectors.toCollection(ArrayList<String>::new));
            } else {
              Collections.sort(tmplist, cmp);
            }

Is this the only multi-core support in the library (the feature mentioned in the README)?
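For reference, the two branches quoted above can be reproduced in isolation. This is only a minimal sketch on an in-memory list: the class name `ParallelSortSketch` and the helper `sortLines` are made up for illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSortSketch {
    // Hypothetical helper mirroring the two branches quoted above.
    static List<String> sortLines(List<String> lines, Comparator<String> cmp, boolean parallel) {
        if (parallel) {
            // Parallel branch: the stream sort may spread work over several cores.
            return lines.parallelStream().sorted(cmp)
                    .collect(Collectors.toCollection(ArrayList<String>::new));
        }
        // Sequential branch: sort a copy so the caller's list is left untouched.
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy, cmp);
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pear", "apple", "fig", "banana");
        Comparator<String> cmp = Comparator.naturalOrder();
        System.out.println(sortLines(lines, cmp, true));   // [apple, banana, fig, pear]
        System.out.println(sortLines(lines, cmp, false));  // [apple, banana, fig, pear]
    }
}
```

Both branches give the same order; only the in-memory sort of a single block can use more than one core here.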

mergeSortedFiles seems to read the sorted files one line at a time.
sortInBatch seems to sort the blocks one after another: files.add(sortAndSave(tmplist, cmp, cs,tmpdirectory, distinct, usegzip, parallel));

Could we add concurrency to mergeSortedFiles (for example, reading blocks concurrently) and to sortInBatch (for example, one thread merging the smaller files while another merges the larger ones, so that 20 temporary files become 10, then 4, then 1)?
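To make the idea concrete, here is a rough sketch of merging sorted runs in rounds, with the pairs of one round merged concurrently. It is not the library's code: in-memory lists stand in for the temporary files, and `PairwiseMergeSketch`, `mergeTwo` and `mergeAll` are hypothetical names. It merges strictly in pairs, so 20 runs would go 20, 10, 5, 3, 2, 1 rather than the 20, 10, 4, 1 mentioned above, and it says nothing about memory use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PairwiseMergeSketch {
    // Standard two-way merge of two already-sorted runs.
    static List<String> mergeTwo(List<String> a, List<String> b, Comparator<String> cmp) {
        List<String> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // Taking from `a` on ties keeps the merge stable.
            out.add(cmp.compare(a.get(i), b.get(j)) <= 0 ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // Merges runs in rounds; the pairs within one round are merged concurrently.
    static List<String> mergeAll(List<List<String>> runs, Comparator<String> cmp, int threads)
            throws Exception {
        if (runs.isEmpty()) return new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<List<String>> current = runs;
            while (current.size() > 1) {
                List<Future<List<String>>> pending = new ArrayList<>();
                for (int k = 0; k + 1 < current.size(); k += 2) {
                    final List<String> left = current.get(k);
                    final List<String> right = current.get(k + 1);
                    Callable<List<String>> task = () -> mergeTwo(left, right, cmp);
                    pending.add(pool.submit(task));
                }
                List<List<String>> next = new ArrayList<>();
                for (Future<List<String>> f : pending) next.add(f.get());
                // An odd run out is carried to the next round unchanged.
                if (current.size() % 2 == 1) next.add(current.get(current.size() - 1));
                current = next;
            }
            return current.get(0);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "d"), Arrays.asList("b", "e"), Arrays.asList("c"));
        System.out.println(mergeAll(runs, Comparator.naturalOrder(), 2)); // [a, b, c, d, e]
    }
}
```

A file-based version would have to stream each pair from disk instead of holding whole runs in memory, which is where most of the real difficulty would be.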

jiamo changed the title from "Don't find function use multi core?" to "How to use multi core maximize?" on Jun 21, 2021
Owner

lemire commented Jun 21, 2021

As you have yourself observed, there are no parameters.

You seem to believe that we can do much better. I am sure that is true, but consider that the library is meant to sort very large files using very little memory, so it is not simply a matter of throwing more cores and more memory at the problem. Please consider the following points:

  • Memory usage should be kept constant. It is easy to improve performance by sorting multiple chunks in parallel, but that is not a fair comparison. The fair comparison is between sorting one chunk in memory, two half-chunks, or four quarter-chunks. That is, the more cores you use, the less memory you have on a per-core basis.
  • Your pull request should include reasonable benchmarks so we can measure the benefits as the number of cores grows.
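The arithmetic behind the first point can be sketched as follows. The numbers (a 1 GiB budget, a 64 GiB input) and the names `MemoryBudgetSketch`, `perWorkerBytes` and `tempFileCount` are made up for illustration, and the sketch assumes, simplistically, that a chunk is exactly as large as one worker's share of the budget.

```java
public class MemoryBudgetSketch {
    // Memory available to each worker when a fixed budget is shared evenly.
    static long perWorkerBytes(long budgetBytes, int workers) {
        return budgetBytes / workers;
    }

    // Number of sorted temporary files for an input of the given size
    // (ceiling division: a final partial chunk still needs its own file).
    static long tempFileCount(long inputBytes, long chunkBytes) {
        return (inputBytes + chunkBytes - 1) / chunkBytes;
    }

    public static void main(String[] args) {
        long budget = 1L << 30;  // hypothetical 1 GiB total memory budget
        long input = 64L << 30;  // hypothetical 64 GiB input file
        for (int workers : new int[] {1, 2, 4}) {
            long chunk = perWorkerBytes(budget, workers);
            System.out.println(workers + " worker(s): chunk = " + (chunk >> 20)
                    + " MiB, temporary files = " + tempFileCount(input, chunk));
        }
    }
}
```

With these assumed numbers, one worker sorts 1024 MiB chunks into 64 temporary files, two workers sort 512 MiB chunks into 128 files, and four workers sort 256 MiB chunks into 256 files. A benchmark at constant memory would therefore also have to account for the larger number of files to merge.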

Owner

lemire commented Jun 21, 2021

User is invited to provide a pull request. Closing.

lemire closed this as completed on Jun 21, 2021