Skip to content

Conversation

@AlexeySachkov
Copy link
Contributor

This commit still doesn't bring an exhaustive coverage for the feature, but still improves the situation by checking the following scenarios:

  • using math built-ins from virtual functions
  • using group barriers from virtual functions
  • using virtual functions in nd-range kernels where every work-item calls a different virtual function
  • using virtual functions when the code is scattered across several translation units

Some tests are disabled, because we do not support those scenarios yet and more changes are required to make them work.

This commit still doesn't bring an exhaustive coverage for the feature,
but still improves the situation by checking the following scenarios:
- using math built-ins from virtual functions
- using group barriers from virtual functions
- using virtual functions in nd-range kernels where every work-item
  calls a different virtual function
- using virtual functions when the code is scattered across several
  translation units

Some tests are disabled, because we do not support those scenarios yet
and more changes are required to make them work.
@AlexeySachkov AlexeySachkov marked this pull request as ready for review August 14, 2024 17:45
@AlexeySachkov AlexeySachkov requested a review from a team as a code owner August 14, 2024 17:45
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::nd_item<1> It) {
LocalData[It.get_local_id()] += It.get_local_id();
sycl::group_barrier(It.get_group());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that people might copy-paste this example thoughtlessly in divergent control flow resulting in UB. I'm not sure if adding a comment here would be enough or if "convergent" functions should be prohibited under indirectly_callable by default and require explicit buy-in from the programmer (e.g. indirectly_callable_in_uniform_control_flow attribute).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't exactly share the concern. I.e. apply could have been a regular function which can also be blindly copy-pasted and called from a non-convergent/non-uniform context resulting in the very same UB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using apply(int *, sycl::group) would be a better pattern? group in arguments is what the spec uses for such interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more. I can't just pass group only, because I need local IDs which aren't available in group. And nd_item already includes group, so passing them both together would be a bit weird>

I suppose that we should assume that if nd_item is passed, then some group operations can be performed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is group::get_local_id in core SYCL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is group::get_local_id in core SYCL.

Didn't know that! Switched to use group instead of nd_item in 0ea83a8

Host reference calculation has also been fixed by that commit: I've verified it on CPU (test passes there with internal newer version of OCL CPU RT).

Comment on lines 111 to 131
// We can't call group_barrier on host and therefore here we have a
// reference function instead of calling the same methods on host.
for (size_t GID = 0; GID < G.size() / L.size(); ++GID) {
for (size_t LID = 0; LID < L.size(); ++LID)
HostData[GID * L.size() + LID] += LID;

int Res = (TestCase == 0) ? 0 : 1;
for (size_t LID = 0; LID < L.size(); ++LID) {
if (TestCase == 0)
Res += HostData[GID * L.size() + LID];
else
Res *= HostData[GID * L.size() + LID];
}

for (size_t LID = 0; LID < L.size(); ++LID)
HostData[GID * L.size() + LID] = Res;
}

sycl::host_accessor HostAcc(DataStorage);
for (size_t I = 0; I < HostData.size(); ++I)
assert(HostAcc[I] == HostData[I]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, this requires some focus to understand... Can we use #ifdef __SYCL_DEVICE_ONLY__ to unify the paths instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we can use #ifdefs here, because nd_item is not user-constructible, i.e. the diff between host and device version of the function would be too huge.

But I will try to add some comments here which should help map this function to apply functions that we have above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code was re-written (11db515) to be closer to apply function that we have

}
};

int main() try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, C++ never stops to surprise me with something I didn't know before...

Comment on lines +92 to +96
q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just create a derived subclass normally and then pass it into the next kernel through its baseclass pointer? That would eliminate the dependency on "helpers.hpp" in this "uniform" tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason the obj_storage_t helper was introduced is to make sure that the storage we allocated is large enough and has correct alignment.

As noted in #14209 (comment) attempting to construct an object in a misaligned memory is a UB.

Here we have two different classes instances of which we may construct: SumOp and MultipleOp. Even though they are the same in their layout, I would still prefer not to hardcode their size and alignment, but instead use this generic helper which allows to change them as we wish without worrying about alignment and allocation size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easier to write, or easier to debug when it will fail when a mistake is made in some future PR? I won't insist on the change here, but IMO, over-complicating simple tests usually leads to manually simplifying them in future whenever they catch regressions.

} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test generates two device images. One contains definitions of indirectly-callable functions and the other one contains kernel functions. In AOT mode they are not linked together before calling opencl-aot. When will this be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, AOT support for virtual functions is incomplete yet, that will be addressed in separate PRs. For now that's a second priority, because there is plenty enough bugs even on JIT path

public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::group<1> WG) {
LocalData[WG.get_local_id()] += WG.get_local_id();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is read/write, but I'm not sure "read" part is really important for this test. Can we change it to write-only (e.g. g.get_group_linear() + g.get_local_id()). Then we'd be able to create /* virtual ? */ int calc_ref_value(auto global_size, auto local_size) { return /* formula */ }.

That would simplify lines 109-157 a lot, and would also move the reference value compute close to the device code so that they'd fit in a single screen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks. I've applied that approach in f92ac85

}
}

return sycl::group_broadcast(WG, Res);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely contains another group_barrier inside. Would it make sense to change the code to store the leader's value in line 50, then have a barrier and then read leader's value in each of the WIs before returning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, group_broadcast implies group_barrier. Considering that the test is named group-barrier, I've replaced group_broadcast with "manual broadcast" in b948b36

Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think benefits of these in-tree outweigh perfecting the tests in the review, so formally LGTM.

q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
CGH.parallel_for(R, props, [=](auto it) {
// Select VF that corresponds to this work-item
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Select VF that corresponds to this work-item
// Select virtual function that corresponds to this work-item

although I'm biased here as VF usually means "vector factor" to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually select an object and not a virtual function here, fixed in 21adc25

@AlexeySachkov AlexeySachkov merged commit 6ba05b7 into intel:sycl Oct 11, 2024
12 checks passed
@AlexeySachkov AlexeySachkov deleted the private/asachkov/more-vf-tests branch October 11, 2024 08:52
@AlexeySachkov
Copy link
Contributor Author

Considering that there is a formal approval and all most recent comments were applied, I've merged the PR. If there is any other feedback, I will apply it as a follow-up PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants