
[SYCL] Add __imf_rcp64h to intel math libdevice #11610

Merged
merged 10 commits into intel:sycl on Feb 27, 2024

Conversation

@jinge90 (Contributor) commented Oct 20, 2023

Some deep learning frameworks use '__nv_rcp64h' in their CUDA backend. We need to provide equivalent functionality in the DPC++ compiler.
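
For context, a minimal usage sketch (not part of this PR) of how the new device function might be called from DPC++ device code. The extern "C" SYCL_EXTERNAL declaration and the queue setup are assumptions for illustration; the actual header or wrapper that exposes __imf_rcp64h may differ.

#include <sycl/sycl.hpp>
#include <cstdio>

// Assumed declaration of the libdevice symbol added by this PR; the actual
// header/wrapper that exposes it may differ.
extern "C" SYCL_EXTERNAL double __imf_rcp64h(double x);

int main() {
  sycl::queue q;
  double *y = sycl::malloc_shared<double>(1, q);
  q.single_task([=] {
     // 1/5 with only the upper 32 bits of the result's encoding kept.
     y[0] = __imf_rcp64h(5.0);
   }).wait();
  std::printf("%a\n", y[0]); // expected to be close to 0x1.99999p-3 (see results below)
  sycl::free(y, q);
  return 0;
}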

@jinge90 requested review from a team as code owners October 20, 2023 02:51
@jinge90 marked this pull request as draft October 20, 2023 03:02
Some deep learning frameworks use '__nv_rcp64h' in the CUDA backend.
We need to provide equivalent functionality in the DPC++ compiler.

Signed-off-by: jinge90 <ge.jin@intel.com>
Signed-off-by: jinge90 <ge.jin@intel.com>
Signed-off-by: jinge90 <ge.jin@intel.com>
@jinge90 marked this pull request as ready for review February 19, 2024 02:02
@jinge90 (Contributor, Author) commented Feb 19, 2024

Hi @zettai-reido,
This PR adds the equivalent of '__nv_rcp64h' to the imf libdevice. Could you help review it?
Thanks very much.

Signed-off-by: jinge90 <ge.jin@intel.com>
@jinge90 (Contributor, Author) commented Feb 19, 2024

Hi @tfzhu,
Could you help check whether the pre-CI failure in Jenkins/Precommit is an infrastructure issue?
Thanks very much.

@tfzhu (Contributor) commented Feb 19, 2024

> Could you help check whether the pre-CI failure in Jenkins/Precommit is an infrastructure issue?

Yes, @DoyleLi is working on the infra issue.

@DoyleLi (Contributor) commented Feb 19, 2024

Hi @jinge90. The issue is resolved.

@zettai-reido

Hello @jinge90

#include <cstdio>
#include <cstdint>
#include <cstring> // for memcpy
#include <initializer_list>
#include <vector>

extern "C"
__device__
double __nv_rcp64h(double a);

__device__
void calculate(const double* a, double* y) {
    y[0] = __nv_rcp64h(a[0]);
}

__global__
void test_kernel(size_t n, const double* vec_a, double* vec_y) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) { return; } // tail
    calculate(vec_a + i, vec_y + i);
}

int main() {
    std::vector<double> a = { 2, 3, 5, 7, 11, 13, 17, 19 };
    std::initializer_list<uint64_t> specials = {
        0x7FF0'0000'0000'0000, // +Inf
        0xFFF0'0000'0000'0000, // -Inf
        0x7FF8'0000'0000'0000, // sNaN
        0xFFF8'0000'0000'0000, // qNaN

        0x7FEF'FFFF'FFFF'FFFF, // Huge
        0xFFEF'FFFF'FFFF'FFFF, // -Huge

        0x0010'0000'0000'0000, // smallest Normal
        0x001E'EEEE'EEEE'EEEE, // small Normal
        0x000F'FFFF'FFFF'FFFF, // largest denormal
        0x0000'0000'0000'0001, // smallest denormal

        0x8010'0000'0000'0000, // smallest Normal
        0x800F'FFFF'FFFF'FFFF, // largest denormal
        0x8000'0000'0000'0001, // smallest denormal

        0x3FF0'0000'0000'0000, // +1
        0xBFF0'0000'0000'0000, // -1
        0x3FF8'0000'0000'0000, // +1.5
        0xBFF8'0000'0000'0000, // -1.5
    };

    for (auto e: specials) {
        double v;
        memcpy(&v, &e, sizeof(v));
        a.push_back(v);
    }

    std::vector<double> y(a.size(), 888);

    void* ptr_a;
    void* ptr_y;

    cudaMalloc(&ptr_a, a.size() * sizeof(double));
    cudaMalloc(&ptr_y, y.size() * sizeof(double));

    cudaMemcpy(ptr_a, a.data(), a.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(ptr_y, y.data(), y.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    int bs = 256;
    int nb = (a.size() + bs - 1) / bs;

    test_kernel<<<nb, bs>>>(a.size(), (const double*)ptr_a, (double*)ptr_y);
    cudaDeviceSynchronize();
    cudaMemcpy(y.data(), ptr_y, y.size() * sizeof(double), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    for (int i = 0; i < a.size(); ++i) {
        fprintf(stderr, "%24.13la => %24.13la\n", (double)a[i], (double)y[i]);
    }

    cudaFree(ptr_y);
    cudaFree(ptr_a);
    return 0;
}

and obtained the following results:

    0x1.0000000000000p+1 =>     0x1.0000000000000p-1
    0x1.8000000000000p+1 =>     0x1.5555500000000p-2
    0x1.4000000000000p+2 =>     0x1.9999900000000p-3
    0x1.c000000000000p+2 =>     0x1.2492400000000p-3
    0x1.6000000000000p+3 =>     0x1.745d100000000p-4
    0x1.a000000000000p+3 =>     0x1.3b13b00000000p-4
    0x1.1000000000000p+4 =>     0x1.e1e1e00000000p-5
    0x1.3000000000000p+4 =>     0x1.af28600000000p-5
                     inf =>     0x0.0000000000000p+0
                    -inf =>    -0x0.0000000000000p+0
                     nan =>                      nan
                    -nan =>                      nan
 0x1.fffffffffffffp+1023 =>     0x0.0000000000000p+0
-0x1.fffffffffffffp+1023 =>    -0x0.0000000000000p+0
 0x1.0000000000000p-1022 =>  0x1.0000000000000p+1022
 0x1.eeeeeeeeeeeeep-1022 =>  0x1.08d3e00000000p+1021
 0x0.fffffffffffffp-1022 =>                      inf
 0x0.0000000000001p-1022 =>                      inf
-0x1.0000000000000p-1022 => -0x1.0000000000000p+1022
-0x0.fffffffffffffp-1022 =>                     -inf
-0x0.0000000000001p-1022 =>                     -inf
    0x1.0000000000000p+0 =>     0x1.0000000000000p+0
   -0x1.0000000000000p+0 =>    -0x1.0000000000000p+0
    0x1.8000000000000p+0 =>     0x1.5555500000000p-1
   -0x1.8000000000000p+0 =>    -0x1.5555500000000p-1

Could you check your implementation against it?

@jinge90 (Contributor, Author) commented Feb 19, 2024

> (test program and results quoted above) Could you check your implementation against it?

Hi @zettai-reido,
I tried it and got the following results:

    0x1.0000000000000p+1 =>     0x1.0000000000000p-1
    0x1.8000000000000p+1 =>     0x1.5555500000000p-2
    0x1.4000000000000p+2 =>     0x1.9999900000000p-3
    0x1.c000000000000p+2 =>     0x1.2492400000000p-3
    0x1.6000000000000p+3 =>     0x1.745d100000000p-4
    0x1.a000000000000p+3 =>     0x1.3b13b00000000p-4
    0x1.1000000000000p+4 =>     0x1.e1e1e00000000p-5
    0x1.3000000000000p+4 =>     0x1.af28600000000p-5
                     inf =>     0x0.0000000000000p+0
                    -inf =>    -0x0.0000000000000p+0
                     nan =>                      nan
                    -nan =>                     **-nan**
 0x1.fffffffffffffp+1023 =>  **0x0.4000000000000p-1022**
-0x1.fffffffffffffp+1023 => **-0x0.4000000000000p-1022**
 0x1.0000000000000p-1022 =>  0x1.0000000000000p+1022
 0x1.eeeeeeeeeeeeep-1022 =>  0x1.08d3d00000000p+1021
 0x0.fffffffffffffp-1022 =>  **0x1.0000000000000p+1022**
 0x0.0000000000001p-1022 =>                      inf
-0x1.0000000000000p-1022 => -0x1.0000000000000p+1022
-0x0.fffffffffffffp-1022 => **-0x1.0000000000000p+1022**
-0x0.0000000000001p-1022 =>                     -inf
    0x1.0000000000000p+0 =>     0x1.0000000000000p+0
   -0x1.0000000000000p+0 =>    -0x1.0000000000000p+0
    0x1.8000000000000p+0 =>     0x1.5555500000000p-1
   -0x1.8000000000000p+0 =>    -0x1.5555500000000p-1

I double-checked the mismatched cases and printed the corresponding full reciprocal values; my local results match the upper 32 bits of those reciprocals. Do you have any idea why CUDA's results are different?
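
For reference, a minimal host-side sketch (not from the PR) of the "upper 32 bits of the reciprocal" interpretation described above; the helper name is illustrative, and as discussed below CUDA does not always truncate this way.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Compute 1/a and keep only the upper 32 bits of its IEEE-754 encoding,
// i.e. sign, exponent, and the top 20 mantissa bits; the low 32 bits are zeroed.
double rcp64h_upper32(double a) {
    double r = 1.0 / a;
    uint64_t bits;
    std::memcpy(&bits, &r, sizeof(bits));
    bits &= 0xFFFF'FFFF'0000'0000ull; // drop the low half of the significand
    std::memcpy(&r, &bits, sizeof(r));
    return r;
}

int main() {
    for (double a : {2.0, 3.0, 5.0, 7.0}) {
        std::printf("%a => %a\n", a, rcp64h_upper32(a)); // e.g. 5 => 0x1.99999p-3
    }
    return 0;
}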

@zettai-reido

@jinge90 I believe it flushes denormals to zero in source and destination and utilizes a slightly different table.
1-bit differences are unavoidable:

 0x1.eeeeeeeeeeeeep-1022 =>  0x1.08d3e00000000p+1021
vs
 0x1.eeeeeeeeeeeeep-1022 =>  0x1.08d3d00000000p+1021

@jinge90 (Contributor, Author) commented Feb 19, 2024

> I believe it flushes denormals to zero in source and destination and utilizes a slightly different table. 1-bit differences are unavoidable.

Hi, Andrey,
What do you mean by "utilizes a different table"?

@zettai-reido commented Feb 20, 2024

@jinge90 rcp64h provides an initial value for a division algorithm to work with. Sometimes such algorithms are implemented as a table lookup.

0x1.08d3e00000000p-1

real math                             infinite precision            machine result
1.0 / 5.0                           = 1.999999999999......p-3    => 0x1.9999900000000p-3
1.0 / 7.0                           = 1.249249249249....p-3      => 0x1.2492400000000p-3
2251799813685248 / 4353479639791479 = 1.08D3D.......CB08D3DD306  => 1.08D3Ep-1

It rounded up the last result but left the first and second as-is.
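
As a side note, a small sketch (not from the PR) of the kind of refinement such a seed feeds: the Newton-Raphson step y_{n+1} = y_n * (2 - a * y_n) roughly doubles the number of correct bits per iteration, which is why a coarse seed like rcp64h's result is enough to reach full double precision in a few steps. The function name and step count are illustrative.

#include <cstdio>

// Refine an approximate reciprocal seed y0 ~ 1/a with Newton-Raphson steps.
// Each step roughly doubles the number of correct bits.
double refine_reciprocal(double a, double y0, int steps) {
    double y = y0;
    for (int i = 0; i < steps; ++i) {
        y = y * (2.0 - a * y);
    }
    return y;
}

int main() {
    double a = 5.0;
    double seed = 0x1.99999p-3;   // truncated seed, as in the results above
    std::printf("seed    = %a\nrefined = %a\nexact   = %a\n",
                seed, refine_reciprocal(a, seed, 3), 1.0 / a);
    return 0;
}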

@jinge90 (Contributor, Author) commented Feb 20, 2024

> rcp64h provides an initial value for a division algorithm to work with. Sometimes such algorithms are implemented as a table lookup. (worked example quoted above)

Hi @zettai-reido,
I also observed that not all values use the same rounding rule for rcp64h on the CUDA platform, so it seems we can't guarantee exactly the same behavior as CUDA; a 1-bit difference is unavoidable. Do you think we can merge the current patch, or do you have any ideas on how to improve the current implementation?
Thanks very much.

@zettai-reido

I suggest evening it out with FTZ/DAZ behavior.
I wrote this quickly to test the new behavior.

#include <cstdint>
#include <cstring>

// Emulation of rcp64h: compute 1/a, keep the upper 32 bits, with FTZ/DAZ semantics.
void emulate(const double* a, double* y) {
    uint64_t xa = 0;
    uint64_t xy = 0;

    double sa = a[0];
    memcpy(&xa, &sa, sizeof(xa));

    int ea = (xa >> 52) & 0x7FF;
    if (0 == ea) { sa = 0.0; }                            // DAZ: treat a denormal input as zero
    else if (0x7FF == ea && (xa & 0x000F'FFFF'FFFF'FFFF)) {
        xy = xa & 0x7FFF'FFFF'FFFF'FFFF;                  // NaN in: return the NaN, sign cleared
        memcpy(&y[0], &xy, sizeof(y[0]));
        return;
    }

    double sy = 1.0 / sa;
    memcpy(&xy, &sy, sizeof(xy));
    uint64_t xy_high = (xy >> 32);                        // keep only the upper 32 bits

    if (((xy >> 52) & 0x7FF) == 0) { xy_high = 0; }       // FTZ: flush a denormal result to zero
    xy = (xy_high << 32);

    xy |= (xa & 0x8000'0000'0000'0000);                   // propagate the sign of the input
    memcpy(&sy, &xy, sizeof(sy));
    y[0] = sy;
}
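
A quick host-side driver for the sketch above (not from the PR), assuming emulate() is in scope and using a few of the inputs from the earlier test:

#include <cstdio>
#include <cmath>

int main() {
    // A few inputs from the earlier test: normals, the smallest normal,
    // the largest denormal, infinity, and NaN.
    double inputs[] = { 2.0, 5.0, 0x1p-1022, 0x0.fffffffffffffp-1022,
                        INFINITY, NAN };
    for (double a : inputs) {
        double y = 0.0;
        emulate(&a, &y);
        std::printf("%24.13la => %24.13la\n", a, y);
    }
    return 0;
}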

@jinge90 (Contributor, Author) commented Feb 20, 2024

> I suggest evening it out with FTZ/DAZ behavior. I wrote this quickly to test the new behavior. (emulate() sketch quoted above)

Hi @zettai-reido,
I updated the patch to apply FTZ/DAZ in the implementation, and the results are:

    0x1.0000000000000p+1 =>     0x1.0000000000000p-1
    0x1.8000000000000p+1 =>     0x1.5555500000000p-2
    0x1.4000000000000p+2 =>     0x1.9999900000000p-3
    0x1.c000000000000p+2 =>     0x1.2492400000000p-3
    0x1.6000000000000p+3 =>     0x1.745d100000000p-4
    0x1.a000000000000p+3 =>     0x1.3b13b00000000p-4
    0x1.1000000000000p+4 =>     0x1.e1e1e00000000p-5
    0x1.3000000000000p+4 =>     0x1.af28600000000p-5
                     inf =>     0x0.0000000000000p+0
                    -inf =>    -0x0.0000000000000p+0
                     nan =>                      nan
                    -nan =>                      nan
 0x1.fffffffffffffp+1023 =>     0x0.0000000000000p+0
-0x1.fffffffffffffp+1023 =>    -0x0.0000000000000p+0
 0x1.0000000000000p-1022 =>  0x1.0000000000000p+1022
 0x1.eeeeeeeeeeeeep-1022 =>  0x1.08d3d00000000p+1021
 0x0.fffffffffffffp-1022 =>                      inf
 0x0.0000000000001p-1022 =>                      inf
-0x1.0000000000000p-1022 => -0x1.0000000000000p+1022
-0x0.fffffffffffffp-1022 =>                     -inf
-0x0.0000000000001p-1022 =>                     -inf
    0x1.0000000000000p+0 =>     0x1.0000000000000p+0
   -0x1.0000000000000p+0 =>    -0x1.0000000000000p+0
    0x1.8000000000000p+0 =>     0x1.5555500000000p-1
   -0x1.8000000000000p+0 =>    -0x1.5555500000000p-1

@zettai-reido left a comment

Thanks, @jinge90!

As far as the results and implementation are concerned, LGTM!

Signed-off-by: jinge90 <ge.jin@intel.com>
@jinge90 (Contributor, Author) commented Feb 20, 2024

Hi @intel/dpcpp-tools-reviewers, @intel/llvm-reviewers-runtime, and @aelovikov-intel,
Could you help review this patch?
Thanks very much.

@steffenlarsen (Contributor) left a comment

LGTM!

@jinge90 (Contributor, Author) commented Feb 21, 2024

Hi, @intel/dpcpp-tools-reviewers
Could you help review this patch?
Thanks very much.

@jinge90 (Contributor, Author) commented Feb 23, 2024

Hi, @intel/dpcpp-tools-reviewers
Kind ping~.
Thanks very much.

@jinge90 (Contributor, Author) commented Feb 26, 2024

Hi, @intel/dpcpp-tools-reviewers
Kind ping~.
Thanks very much.

@AlexeySachkov (Contributor) left a comment

@jinge90, we should probably consider adding more folks as code owners for the file to speed up PRs like this. The vast majority of the changes I've seen are this trivial.

Do you have any suggestions for whom else to add besides yourself? Feel free to open a corresponding PR modifying the CODEOWNERS file.

@aelovikov-intel merged commit ce70cb5 into intel:sycl on Feb 27, 2024
10 of 11 checks passed