Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<execution>: Parallelize more algorithms #7

Open
BillyONeal opened this issue Sep 5, 2019 · 6 comments
Open

<execution>: Parallelize more algorithms #7

BillyONeal opened this issue Sep 5, 2019 · 6 comments
Labels
performance Must go faster

Comments

@BillyONeal
Copy link
Member

BillyONeal commented Sep 5, 2019

Some strategies that can be used:

Just call parallel transform:

  • replace_copy
  • replace_copy_if

Scans:

  • copy_if
  • partition_copy
  • remove_copy
  • remove_copy_if
  • unique
  • unique_copy

Same as serial nth_element but call the parallel partition op for large N:

  • nth_element

Predicate tests (like all_of):

  • lexicographical_compare

Summary statistics (like find / find_end):

  • min_element
  • max_element
  • minmax_element

Divide and conquer:

  • inplace_merge
  • stable_partition

Divide range1 into chunks, binary search to find matching range2 chunks, scan:

  • merge
  • set_symmetric_difference
  • set_union

Other:

  • includes
@BillyONeal BillyONeal self-assigned this Sep 5, 2019
@StephanTLavavej StephanTLavavej added the enhancement Something can be improved label Sep 5, 2019
@StephanTLavavej StephanTLavavej changed the title Investigate parallelization of more algorithms <execution>: Parallelize more algorithms Sep 5, 2019
@StephanTLavavej StephanTLavavej added the performance Must go faster label Sep 19, 2019
@BillyONeal BillyONeal removed their assignment Oct 28, 2019
@StephanTLavavej StephanTLavavej removed the enhancement Something can be improved label Feb 6, 2020
@AlexGuteniev
Copy link
Contributor

Should generate and generate_n also be parallelized?

Here's how I try to compensate lack of transform with unseq

#include <vector>
#include <execution>

std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);

void sum(int* c, int* a, int* b)
{
    std::generate_n(std::execution::unseq, c, 16, [=]() mutable {
        return *(a++) + *(b++);
    });
}

int main()
{
    sum(v_c.data(), v_a.data(), v_b.data());
}

@CaseyCarter
Copy link
Member

Should generate and generate_n also be parallelized?

Your sample program is a great demonstration of why generate and generate_n can't be meaningfully parallelized: the function object is almost always stateful, and evaluations almost always order-dependent. I suspect the writer of this program would be very unhappy if the library spawned 16 threads with their own copies of the lambda resulting in all elements of v_c being equal to v_a[0] + v_b[0].

@AlexGuteniev
Copy link
Contributor

Why does execution policy parameter even exist for them?

@BillyONeal
Copy link
Member Author

Why does execution policy parameter even exist for them?

Because the 'design' process for the parallel algorithms was 'if nobody can come up with a reason why it can't be parallelized, without looking at real implementations, in 10 minutes, it gets a parallel one'. Note that partial_sort has such an overload too even though it is a heap algorithm and none of the other heap algorithms got parallel overloads.

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Aug 10, 2020

I'm trying to make this work without #pragma in my code:

#include <vector>

std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);

void sum(int c[], int a[], int b[])
{
#pragma loop(ivdep)
	for (int i = 0; i < 16; i++)
	{
		c[i] = a[i] + b[i];
	}
}

int main()
{
	sum(v_c.data(), v_a.data(), v_b.data());
}

Note that without a pragma the compilers still emits SIMD version, but goes to it conditionally at runtime.

Now that this is useless:

void sum(int * c, int* a, int * b)
{
   std::transform(std::execution::unseq, a, a + 16, b, c, std::plus{});
}

As well as this:

void sum(int* c, int* a, int* b)
{
    std::for_each(std::execution::unseq, c, c + 16, [=](int& v) {
        v = a[&v - c] + b[&v - c];
    });
}

How else I can say it?

@BillyONeal
Copy link
Member Author

I don't believe we have a way to say that without a pragma at this time. It's really on the optimizer team if they want to consume the unseq signal and they have declined to do so as of yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

4 participants