[SYCL] Add __imf_rcp64h to intel math libdevice #11610
Conversation
Some deep learning frameworks use '__nv_rcp64h' in the CUDA backend. We need to provide equivalent functionality in the DPC++ compiler.

Signed-off-by: jinge90 <ge.jin@intel.com>
Hi, @zettai-reido
Hi, @tfzhu
Hi @jinge90. The issue is resolved.
Hello @jinge90, I ran the following CUDA test program:

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <initializer_list>
#include <vector>

extern "C" __device__ double __nv_rcp64h(double a);

__device__ void calculate(const double* a, double* y) {
    y[0] = __nv_rcp64h(a[0]);
}

__global__ void test_kernel(size_t n, const double* vec_a, double* vec_y) {
    size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) { return; } // tail
    calculate(vec_a + i, vec_y + i);
}

int main() {
    std::vector<double> a = { 2, 3, 5, 7, 11, 13, 17, 19 };
    std::initializer_list<uint64_t> specials = {
        0x7FF0'0000'0000'0000, // +Inf
        0xFFF0'0000'0000'0000, // -Inf
        0x7FF8'0000'0000'0000, // qNaN
        0xFFF8'0000'0000'0000, // -qNaN
        0x7FEF'FFFF'FFFF'FFFF, // huge
        0xFFEF'FFFF'FFFF'FFFF, // -huge
        0x0010'0000'0000'0000, // smallest normal
        0x001E'EEEE'EEEE'EEEE, // small normal
        0x000F'FFFF'FFFF'FFFF, // largest denormal
        0x0000'0000'0000'0001, // smallest denormal
        0x8010'0000'0000'0000, // -smallest normal
        0x800F'FFFF'FFFF'FFFF, // -largest denormal
        0x8000'0000'0000'0001, // -smallest denormal
        0x3FF0'0000'0000'0000, // +1
        0xBFF0'0000'0000'0000, // -1
        0x3FF8'0000'0000'0000, // +1.5
        0xBFF8'0000'0000'0000, // -1.5
    };
    for (auto e : specials) {
        double v;
        std::memcpy(&v, &e, sizeof(v));
        a.push_back(v);
    }
    std::vector<double> y(a.size(), 888);
    void* ptr_a;
    void* ptr_y;
    cudaMalloc(&ptr_a, a.size() * sizeof(double));
    cudaMalloc(&ptr_y, y.size() * sizeof(double));
    cudaMemcpy(ptr_a, a.data(), a.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(ptr_y, y.data(), y.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    int bs = 256;
    int nb = (a.size() + bs - 1) / bs;
    test_kernel<<<nb, bs>>>(a.size(), (const double*)ptr_a, (double*)ptr_y);
    cudaDeviceSynchronize();
    cudaMemcpy(y.data(), ptr_y, y.size() * sizeof(double), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    for (size_t i = 0; i < a.size(); ++i) {
        fprintf(stderr, "%24.13la => %24.13la\n", a[i], y[i]);
    }
    cudaFree(ptr_y); // pass the pointer itself, not its address
    cudaFree(ptr_a);
    return 0;
}
```

and obtained the following results:
Could you check your implementation against it?
Hi, @zettai-reido
I double-checked the mismatched cases and printed the corresponding reciprocal values; my local results align with the upper 32 bits of the corresponding reciprocal. Do you have any idea why CUDA's result is different?
@jinge90 I believe it flushes denormals to zero in source and destination and utilizes a slightly different table.
Hi, Andrey
@jinge90 rcp64h provides the initial value for a division algorithm to work. Sometimes such algorithms are implemented as table lookups.
It rounded up the last result but left the first and second as-is.
Hi, @zettai-reido
I suggest evening it out with the FTZ/DAZ behavior:

```cpp
void emulate(const double* a, double* y) {
    uint64_t xa = 0;
    uint64_t xy = 0;
    double sa = a[0];
    memcpy(&xa, &sa, sizeof(xa));
    int ea = (xa >> 52) & 0x7FF;
    if (0 == ea) { sa = 0.0; } // DAZ: treat denormal input as zero
    else if (0x7FF == ea && (xa & 0x000F'FFFF'FFFF'FFFF)) {
        // NaN input: pass the payload through with the sign cleared
        xy = xa & 0x7FFF'FFFF'FFFF'FFFF;
        memcpy(&y[0], &xy, sizeof(y[0]));
        return;
    }
    double sy = 1.0 / sa;
    memcpy(&xy, &sy, sizeof(xy));
    uint64_t xy_high = (xy >> 32);
    if (((xy >> 52) & 0x7FF) == 0) { xy_high = 0; } // FTZ: flush denormal result
    xy = (xy_high << 32);
    xy |= (xa & 0x8000'0000'0000'0000); // restore the input's sign
    memcpy(&sy, &xy, sizeof(sy));
    y[0] = sy;
}
```
Hi, @zettai-reido
Thanks, @jinge90!
As far as results and implementation are concerned, LGTM!
Hi, @intel/dpcpp-tools-reviewers, @intel/llvm-reviewers-runtime and @aelovikov-intel
LGTM!
Hi, @intel/dpcpp-tools-reviewers
Hi, @intel/dpcpp-tools-reviewers
Hi, @intel/dpcpp-tools-reviewers
@jinge90, we should probably consider adding more folks as code owners for the file to speed up PRs like this. The vast majority of the changes I've seen are this trivial.
Do you have any suggestions for whom else to add besides yourself? Feel free to just open a corresponding PR modifying the CODEOWNERS file.