
dnn: relax intel only ocl4 dnn #11557

Merged
merged 1 commit into from May 29, 2018

Conversation

4 participants
@tomoaki0705 (Contributor) commented May 21, 2018

related #11494 #11397 #10639 #10717

I realized that the ocl4dnn kernels only work on Intel GPUs.
I wrote a workaround in #11494, but the root cause came down to two points.

  1. The kernel relied on platform-dependent behaviour of native_powr
  2. The fallback path taken when OpenCL fails at runtime performed the wrong normalization

Platform-specific native_powr

  • I confirmed that Test_TensorFlow_layers.batch_norm and Test_TensorFlow_layers.pooling pass only on Intel GPUs
    • PASS: Intel GPU
    • FAILURE: Intel CPU, NVIDIA GeForce, Arm Mali
  • I dug in for a while and realized that after calculating the mean with the OpenCL kernel, some of the elements became NaN
  • After some searching, I realized that this line seems wrong

vec_type dst_vec = native_powr(src_vec - (vec_type)mean_val, 2);

  • This line computes the squared difference from the mean value, but native_powr behaves differently depending on the platform
    • According to the references [1][2][3], native_powr requires the first parameter to be greater than or equal to 0
    • In this case, the first parameter can be negative, and thus in some situations the computed value resulted in NaN

native_powr: Computes x to the power of y, where x is ≥ 0. The range of x and y are implementation-defined. The maximum error is implementation-defined.

  • So pown is probably the function we should use here.

pown: Computes x to the power of y, where y is an integer.

fallback process of normalization

  • Still, the test was failing on Arm Mali (Tinkerboard, ODROID-XU4, Firefly RK3399)
    • It seems there was not enough memory to run the kernel, and it was falling back to the CPU implementation
    • The fallback is inevitable, but I suspect the fallback implementation was never verified correctly, and its normalization result differed from the others
    • I modified the code a bit and now the test passes, but it involves removing the following 0 fill from mvn_layer.cpp

if (inpBlob.total() == newRows)
{
// MVN is applied to single values at an every row.
outBlob.setTo(0);
return;
}

  • I'd like to confirm with @pengli on this point, just in case I'm going the wrong way.

other topics

  • This PR isn't squashed into a single commit, because there are other points to discuss
  • other native_powr

dot0 = native_powr(a0 - (Dtype4)sum.x, 2);
dot1 = native_powr(a1 - (Dtype4)sum.y, 2);
dot2 = native_powr(a2 - (Dtype4)sum.z, 2);
dot3 = native_powr(a3 - (Dtype4)sum.w, 2);

  • For this, I could make a separate PR
  • Also, I removed all the isIntel() calls from the dnn module
    • If I understand correctly, these calls were added because the kernels didn't work on platforms other than Intel GPUs
    • Now all the tests pass on
      • Intel GPU (HD Graphics)
      • Intel CPU (Core i7)
      • NVIDIA GPU (GeForce GTX 1060)
      • Arm GPU (Mali T628, T760, T860)
    • So I guess isIntel() is no longer needed in this module

[1] https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/pow.html
[2] https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/pow.html
[3] https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/pow.html

@alalek (Contributor) left a comment

Great investigation!

Please see my comments below.

@@ -89,7 +89,7 @@ __kernel void CALC_MEAN(__global const Dtype* src,

Dtype mean_val = mean[x];
vec_type src_vec = load(src, index);
-vec_type dst_vec = native_powr(src_vec - (vec_type)mean_val, 2);
+vec_type dst_vec = pown(src_vec - (vec_type)mean_val, 2);

@alalek (Contributor), May 21, 2018

Perhaps we should avoid any pow function here and below. Something like this:

-vec_type dst_vec = pown(src_vec - (vec_type)mean_val, 2);
+vec_type src0_vec = src_vec - (vec_type)mean_val;
+vec_type dst_vec = src0_vec * src0_vec;
inpRow.convertTo(outRow, outRow.type(), 1.0f, -mean[0]);
if (fuse_batch_norm)
{
outRow.convertTo(outRow, outRow.type(), alpha*weight, bias);

@alalek (Contributor), May 21, 2018

It is better to fuse two sequential convertTo() calls into one:

inp.convertTo(tmp, type, a, b); // tmp = inp * a + b
tmp.convertTo(res, type, c, d); // res = tmp * c + d
// ==>
// res = tmp * c + d
//     = (inp * a + b) * c + d
//     = inp * (a * c) + (b * c + d)
inp.convertTo(res, type, a * c, b * c + d); 
@@ -163,7 +163,7 @@ TEST_P(Test_TensorFlow_layers, pooling)
runTensorFlowNet("max_pool_even", targetId);
runTensorFlowNet("max_pool_odd_valid", targetId);
runTensorFlowNet("ave_pool_same", targetId);
-runTensorFlowNet("max_pool_odd_same", targetId);
+runTensorFlowNet("max_pool_odd_same", targetId, false, 3e-5, 3e-4);

@alalek (Contributor), May 21, 2018

Defaults are: double l1 = 1e-5, double lInf = 1e-4

What configuration requires updated thresholds?

Perhaps we should preserve default values for:

  • non-OpenCL targets (CPU code)
  • for Intel GPU OpenCL
// // MVN is applied to single values at an every row.
// outBlob.setTo(0);
// return;
//}

@alalek (Contributor), May 21, 2018

@dkurt Could you take a look at this?


float weight = 1.f;
float bias = 0.f;
if (fuse_batch_norm)

@dkurt (Member), May 22, 2018

We need to make it true for the CPU implementation too, at

virtual bool tryFuse(Ptr<Layer>& top) CV_OVERRIDE
{
if (preferableTarget == DNN_TARGET_OPENCL && !fuse_batch_norm)
{
top->getScaleShift(scale, shift);
fuse_batch_norm = !scale.empty() || !shift.empty();
return fuse_batch_norm;
}
return false;
}

@tomoaki0705 tomoaki0705 force-pushed the tomoaki0705:relaxIntelOnlyOCL4DNN branch from 212b04f to 4fa6737 May 22, 2018

@alalek (Contributor) commented May 22, 2018

BTW, please ignore the DNN EAST test failure on the "Linux Debug" configuration. A fix is in progress: #11563

@tomoaki0705 (Contributor, Author) commented May 22, 2018

So, the situation is getting complicated.

  • Removing the power function is now done, good.
    vec_type dst_vec = src_vec - (vec_type)mean_val;
    dst_vec = dst_vec * dst_vec;
  • The condition for loosening the thresholds is now written explicitly, good.
    cv::ocl::Device d = cv::ocl::Device::getDefault();
    bool loosenFlag = targetId == DNN_TARGET_OPENCL && d.isIntel() && d.type() == cv::ocl::Device::TYPE_CPU;
    runTensorFlowNet("max_pool_even", targetId);
    runTensorFlowNet("max_pool_odd_valid", targetId);
    runTensorFlowNet("ave_pool_same", targetId);
    runTensorFlowNet("max_pool_odd_same", targetId, false, loosenFlag ? 3e-5 : 1e-5, loosenFlag ? 3e-4 : 1e-4);
  • The scaling is now done once, good.
    inpRow.convertTo(outRow, outRow.type(), normalizationScale, normalizationShift);
  • The 0 fill was my misunderstanding. In some cases the layer becomes a single-column vector, which has only a single value in each row.
    In such cases the standard deviation cannot be computed correctly, and that's probably why the 0 fill existed.

So, there are still three points to solve.

  1. Comment from @dkurt
  2. Test failure of Test_TensorFlow_fp16.tests on Intel GPU
  3. Test failure of Test_TensorFlow_layers.batch_norm on Arm GPU

Some log of 2.

Current OpenCL device:
    Type = iGPU
    Name = Intel(R) UHD Graphics 620
    Version = OpenCL 2.1
    Driver version = 22.20.16.4840
(snip)
[----------] Global test environment tear-down
[==========] 261 tests from 52 test cases ran. (12757 ms total)
[  PASSED  ] 260 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Test_TensorFlow_fp16.tests/2, where GetParam() = DNN_TARGET_OPENCL_FP16

Some log of 3.

Current OpenCL device: 
    Type = iGPU
    Name = Mali-T760
(snip)
[----------] Global test environment tear-down
[==========] 261 tests from 52 test cases ran. (6898 ms total)
[  PASSED  ] 260 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Test_TensorFlow_layers.batch_norm/1, where GetParam() = DNN_TARGET_OPENCL

I knew the Test_TensorFlow_fp16 test fails, so I was hoping to address it in a separate PR, but now it seems the failures are somehow related, and I need to fix them both.
Let me dig a bit more.

@alalek (Contributor) commented May 22, 2018

Driver version = 22.20.16.4840

Try to upgrade graphics driver (we observed tests fails during FP16 integration on 23.20.16.4849): #11397 (comment)

@dkurt (Member) commented May 22, 2018

The 0 fill was my misunderstanding. In some case there layer becomes single column vector, which has only single value in each row.
In such case standard deviation cannot be computed correctly, and that's probably why the 0 fill existed.

@tomoaki0705, my local tests fail without this condition because of too-large values of alpha.

double alpha = (normVariance) ? 1/(eps + dev[0]) : 1;

@tomoaki0705 (Contributor, Author) commented May 24, 2018

@dkurt, may I ask for details, please?
1.1. Which commit are you talking about?
1.2. Which test are you talking about?

In the latest commit 4fa6737, I restored the 0 fill, and now the test passes on Intel GPU, Intel CPU and NVIDIA GPU.
The line you pointed to is from another commit, 3dd9168, and I'm not sure we are on the same page.

1.3. About the fuse_batch_norm flag: I experimented a bit, and I'm not sure we really need it.
I tried another commit, 0ba392a, where I added code to set fuse_batch_norm to true, but then the batch_norm test started failing.

else if (preferableTarget == DNN_TARGET_CPU)
{
fuse_batch_norm = true;
return fuse_batch_norm;
}

The current situation (4fa6737) seems good to me.
2. seems to be a driver-related problem, and
3. will take more time to investigate; it was failing before this PR, so the situation didn't get worse. IMHO it's probably better to handle this Arm GPU issue in a separate PR.

@dkurt (Member) commented May 25, 2018

@tomoaki0705, I just wanted to say that fuse_batch_norm is set to true only for DNN_TARGET_OPENCL, in the tryFuse method. But these conditions are in the forward method, which is executed on the CPU (fuse_batch_norm is always false there for now).

@alalek (Contributor) commented May 25, 2018

BTW, OpenCL's "forward" can call the CPU forward path too (as a fallback if something goes wrong or is not supported).

@dkurt (Member) commented May 27, 2018

@tomoaki0705, mvn_batch_norm_1x1 fails with fusion because it fills zeros, but we need to add a bias (shift) too. May I ask you to replace

virtual bool tryFuse(Ptr<Layer>& top) CV_OVERRIDE
{
    if (preferableTarget == DNN_TARGET_OPENCL && !fuse_batch_norm)
    {
        top->getScaleShift(scale, shift);
        fuse_batch_norm = !scale.empty() || !shift.empty();
        return fuse_batch_norm;
    }
    return false;
}

with

virtual bool tryFuse(Ptr<Layer>& top) CV_OVERRIDE
{
    if (!fuse_batch_norm)
    {
        top->getScaleShift(scale, shift);
        fuse_batch_norm = !scale.empty() || !shift.empty();
        return fuse_batch_norm;
    }
    return false;
}

and modify it to something like this:

Mat inpMat = inpBlob.reshape(1, newRows);
Mat outMat = outBlob.reshape(1, newRows);

if (inpBlob.total() == newRows)
{
    // MVN is applied to single values at an every row.
    if (shift.empty())
        outBlob.setTo(0);
    else
    {
        for (int i = 0; i < newRows; i++)
        {
            outMat.row(i).setTo(((float*)shift.data)[i]);
        }
    }
    return;
}

@tomoaki0705 tomoaki0705 force-pushed the tomoaki0705:relaxIntelOnlyOCL4DNN branch from 4fa6737 to 2e9e71a May 29, 2018

@alalek alalek added this to the 3.4.2 milestone May 29, 2018

@dkurt (Member) approved these changes May 29, 2018, and left a comment:

👍
Looks good. Many thanks!

@opencv-pushbot opencv-pushbot merged commit 2e9e71a into opencv:3.4 May 29, 2018

1 check passed (default: Required builds passed)

@tomoaki0705 tomoaki0705 deleted the tomoaki0705:relaxIntelOnlyOCL4DNN branch May 29, 2018

@tomoaki0705 (Contributor, Author) commented May 29, 2018

Thank you for the nice review! I wouldn't have been able to finish it by myself.
