
Is anyone trying to build CNTK with CUDA 11.1? #3835

Open

haryngod opened this issue Dec 17, 2020 · 42 comments

@haryngod

I know MS announced that they won't support CNTK anymore.
However, I would like to know who else is trying to build CNTK with CUDA 11.1 like me.
If someone is trying this and has some tips, I hope we can discuss them here.

So far I have changed:

  • cudnnGetConvolutionForwardAlgorithm -> cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnGetConvolutionBackwardDataAlgorithm -> cudnnGetConvolutionBackwardDataAlgorithm_v7
  • cudnnGetConvolutionBackwardFilterAlgorithm -> cudnnGetConvolutionBackwardFilterAlgorithm_v7
  • cudnnSetRNNDescriptor_v5 -> cudnnSetRNNDescriptor_v8
  • cusparseScsr2csc -> cusparseCsr2cscEx2_bufferSize
  • cusparseDcsr2csc -> cusparseCsr2cscEx2
  • .. etc.
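
For reference, the cuSPARSE change is the one whose shape changes most: the one-shot cusparseScsr2csc call becomes a query/allocate/convert sequence in CUDA 11. Below is a minimal sketch of such a wrapper (a hypothetical helper, not CNTK's actual code; error handling is kept to a minimum):

#include <cuda_runtime.h>
#include <cusparse.h>

// Hypothetical helper, not CNTK code: CSR -> CSC conversion for float data using the
// CUDA 11 cuSPARSE API, replacing the removed cusparseScsr2csc.
cusparseStatus_t Csr2CscFloat(cusparseHandle_t handle, int m, int n, int nnz,
                              const float* csrVal, const int* csrRowPtr, const int* csrColInd,
                              float* cscVal, int* cscColPtr, int* cscRowInd)
{
    // Step 1: ask cuSPARSE how much scratch memory the conversion needs.
    size_t bufferSize = 0;
    cusparseStatus_t status = cusparseCsr2cscEx2_bufferSize(
        handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscColPtr, cscRowInd,
        CUDA_R_32F, CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO,
        CUSPARSE_CSR2CSC_ALG1, &bufferSize);
    if (status != CUSPARSE_STATUS_SUCCESS)
        return status;

    // Step 2: allocate the scratch buffer and run the actual conversion.
    void* buffer = nullptr;
    if (cudaMalloc(&buffer, bufferSize) != cudaSuccess)
        return CUSPARSE_STATUS_ALLOC_FAILED;
    status = cusparseCsr2cscEx2(
        handle, m, n, nnz, csrVal, csrRowPtr, csrColInd, cscVal, cscColPtr, cscRowInd,
        CUDA_R_32F, CUSPARSE_ACTION_NUMERIC, CUSPARSE_INDEX_BASE_ZERO,
        CUSPARSE_CSR2CSC_ALG1, buffer);
    cudaFree(buffer);
    return status;
}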

So far I have built everything from Common through ReaderLib.
When I build CNTKv2LibraryDll, I get the errors below.

LNK2005: "unsigned __int64 __cdecl Microsoft::MSR::CNTK::GetCUDNNVersion(void)" (?GetCUDNNVersion@CNTK@MSR@Microsoft@@YA_KXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK2005: "protected: float * __cdecl Microsoft::MSR::CNTK::BaseMatrix<float>::Buffer(void)const " (?Buffer@?$BaseMatrix@M@CNTK@MSR@Microsoft@@IEBAPEAMXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll) 
LNK1169: multiply defined symbols

Note that I have already built everything from Common through CNTK with CUDA 10.1.

@haryngod
Author

My co-worker and I are working on this at https://github.com/haryngod/CNTK/tree/2.7-cuda-11.1
It may look like a mess right now, but our goal is to get the code to build without any errors.
If someone else wants to build CNTK, we could share our experiences with each other.

@delzac
Contributor

delzac commented Dec 18, 2020

If you manage to successfully build it, I'll definitely be using it! I'm still stuck using GTX 1000 series cards and would love to upgrade. Unfortunately, I have zero experience in compiling CNTK, so I can't help you with this.

@haixpham

Interesting! I hope you succeed in building CNTK with CUDA 11, and maybe a newer Python version too.

@kassinvin

LNK2005: "unsigned __int64 __cdecl Microsoft::MSR::CNTK::GetCUDNNVersion(void)" (?GetCUDNNVersion@CNTK@MSR@Microsoft@@YA_KXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK2005: "protected: float * __cdecl Microsoft::MSR::CNTK::BaseMatrix::Buffer(void)const " (?Buffer@?$BaseMatrix@M@CNTK@MSR@Microsoft@@IEBAPEAMXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK1169: multiply defined symbols


How is it going? I have run into the same error.

@haryngod
Author

@kassinvin I added /FORCE:MULTIPLE under CNTKv2LibraryDll > Properties > Linker > Command Line. That resolves it.

@dmagee

dmagee commented Feb 14, 2021

I'm also trying to get this working. The thing I'm stuck on is GPUTensor.cu: it gives a heap error. If you comment out some of the template instantiations (I tried the <float... ones) at the bottom, it compiles. I tried to split it into two files (with some instantiations in one and the rest in another), but then I get multiply defined symbols.

Also, I tried the repo linked by @haryngod above, but it seems to be set up to use CUDA 10 still. I'm not sure if it's supposed to be working yet?

Thanks!

@dmagee

dmagee commented Feb 14, 2021


In answer to my own question: I added the /FORCE:MULTIPLE flag (suggested by @haryngod) to the MathCUDA and Math projects too, and I seem to have a working cntk.exe! (The 01_OneHidden.cntk example in the Examples\Image\GettingStarted folder seems to run, anyway.) I achieved this by (a) commenting out various cuDNN calls in the SparseMatrix and RNN classes that I suspected I wasn't using (I only use CNNs), and (b) copying cublasLt64_11.dll over manually from the CUDA install. I also updated cuDNN calls to the _v7 variants where it was a simple replacement. This may be of help to some people.

The change to CUDA 11.1 was made by modifying various lines in CNTK.Cpp.props.

D.

@dmagee

dmagee commented Feb 15, 2021

OK, more advice from my experiments. It turned out I was using an older version of cuDNN (cudnn-10.0-v7.3.1), which isn't really designed to work with CUDA 11.x, and I suspect that while cntk.exe ran, it wasn't learning properly. I've now replaced it with cudnn-11.1-v8.0.5.39 (I needed to change the CUDNN_PATH environment variable to point to this). This then throws some new errors, as the following functions no longer exist:

cudnnGetConvolutionForwardAlgorithm
cudnnGetConvolutionBackwardFilterAlgorithm
cudnnGetConvolutionBackwardDataAlgorithm

These are all used in CuDnnConvolutionEngine.cu

I got past this by adding the following near the top of that file (after the includes):

#ifndef CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT
typedef enum
{
    CUDNN_CONVOLUTION_FWD_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionFwdPreference_t;

cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionForwardAlgorithm(cudnnHandle_t handle,
                                    const cudnnTensorDescriptor_t xDesc,
                                    const cudnnFilterDescriptor_t wDesc,
                                    const cudnnConvolutionDescriptor_t convDesc,
                                    const cudnnTensorDescriptor_t yDesc,
                                    cudnnConvolutionFwdPreference_t preference,
                                    size_t memoryLimitInBytes,
                                    cudnnConvolutionFwdAlgo_t* algo)
{
    cudnnConvolutionFwdAlgoPerf_t perfResults;
    int returnedAlgoCount;
    cudnnStatus_t rv;

    rv = cudnnGetConvolutionForwardAlgorithm_v7(
        handle,
        xDesc,
        wDesc,
        convDesc,
        yDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    if (rv != 0)
    {
        std::cerr << "cudnnGetConvolutionForwardAlgorithm_v7 failed: " << rv << std::endl;
    }

    *algo = perfResults.algo;

    std::cerr << "Using ConvolutionForwardAlgorithm: " << perfResults.algo << std::endl;

    return rv;

}
#endif

#ifndef CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT
typedef enum
{
    CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_BWD_DATA_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionBwdDataPreference_t;

typedef enum
{
    CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionBwdFilterPreference_t;

cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionBackwardFilterAlgorithm(cudnnHandle_t handle,
                                           const cudnnTensorDescriptor_t xDesc,
                                           const cudnnTensorDescriptor_t dyDesc,
                                           const cudnnConvolutionDescriptor_t convDesc,
                                           const cudnnFilterDescriptor_t dwDesc,
                                           cudnnConvolutionBwdFilterPreference_t preference,
                                           size_t memoryLimitInBytes,
                                           cudnnConvolutionBwdFilterAlgo_t* algo)
{
    cudnnConvolutionBwdFilterAlgoPerf_t perfResults;
    int returnedAlgoCount;
    cudnnStatus_t rv;

    rv = cudnnGetConvolutionBackwardFilterAlgorithm_v7(
        handle,
        xDesc,
        dyDesc,
        convDesc,
        dwDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    *algo = perfResults.algo;

    return rv;

}


cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionBackwardDataAlgorithm(cudnnHandle_t handle,
                                         const cudnnFilterDescriptor_t wDesc,
                                         const cudnnTensorDescriptor_t dyDesc,
                                         const cudnnConvolutionDescriptor_t convDesc,
                                         const cudnnTensorDescriptor_t dxDesc,
                                         cudnnConvolutionBwdDataPreference_t preference,
                                         size_t memoryLimitInBytes,
                                         cudnnConvolutionBwdDataAlgo_t* algo)
{
    cudnnConvolutionBwdDataAlgoPerf_t perfResults;
    int returnedAlgoCount;
    cudnnStatus_t rv;

    rv = cudnnGetConvolutionBackwardDataAlgorithm_v7(
        handle,
        wDesc,
        dyDesc,
        convDesc,
        dxDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    *algo = perfResults.algo;

    return rv;

}
#endif

@dmagee

dmagee commented Feb 15, 2021

Sorry for spamming everyone, but now, with cudnn-11.1-v8.0.5.39, I'm getting an exception thrown on the cudnnConvolutionForward call in CuDnnConvolutionEngine.cu. The output is:


...
Starting minibatch loop.

About to throw exception 'cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))'
cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))


[CALL STACK]
    > vcomp_reduction_r4
    - Microsoft::MSR::CNTK::CudaTimer::  Stop
    - Microsoft::MSR::CNTK::CuDnnConvolutionEngine<float>::  ForwardCore
    - Microsoft::MSR::CNTK::ConvolutionNode<float>::  ForwardProp
    - Microsoft::MSR::CNTK::ComputationNetwork::PARTraversalFlowControlNode::  ForwardProp
    - std::_Func_impl_no_alloc<<lambda_258018e62e82ba6c7f6055b001fc29b8>,void,std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const &>::  _Do_call
    - Microsoft::MSR::CNTK::ComputationNetwork::TravserseInSortedGlobalEvalOrder<std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>
    - Microsoft::MSR::CNTK::ComputationNetwork::ForwardProp<std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOneEpoch
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOrAdaptModel
    - Microsoft::MSR::CNTK::SGD<float>::  Train
    - DoTrain<Microsoft::MSR::CNTK::ConfigParameters,float>
    - DispatchThisAction<float>
    - DoCommands<float>
    - wmainOldCNTKConfig
    - wmain1

EXCEPTION occurred: cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))

(This is calling cntk.exe configFile=02_OneConv.cntk in Examples\Image\GettingStarted)

I checked that the algorithm being used (m_fwdAlgo.selectedAlgo) is #1, but workspace.BufferSize() is zero. Any ideas how to fix this gratefully received!

D.

@dmagee

dmagee commented Feb 16, 2021

No idea if I'm talking to myself, but the exception I reported above is due to the fact that the workspace size calculation in CNTK seems broken (too small) in three places in CuDnnConvolutionEngine.cu. It's slightly hacky, but replacing the CNTK workspace object with an inline allocation seems to have my C++ code training!

#if 1
		// TEMPORARY FIX: Try allocating workspace here, rather than using workspace object
		size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionForwardWorkspaceSize(
            *m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, m_fwdAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        CUDNN_CALL(cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ws_data, ws_size, &C::Zero, m_outT, ptr(out)));

		CUDA_CALL(cudaFree(ws_data));
#else
        CUDNN_CALL(cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out)));
#endif
#if 1
        // TEMPORARY FIX: Try allocating workspace here,rather than using workspace object
        size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionBackwardDataWorkspaceSize(
            *m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, m_backDataAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        CUDNN_CALL(cudnnConvolutionBackwardData(*m_cudnn, &C::One, *m_kernelT, ptr(kernel), m_outT, ptr(srcGrad), *m_conv, m_backDataAlgo.selectedAlgo, ws_data, ws_size, accumulateGradient ? &C::One : &C::Zero, m_inT, ptr(grad)));

		CUDA_CALL(cudaFree(ws_data));

#else
        CUDNN_CALL(cudnnConvolutionBackwardData(*m_cudnn, &C::One, *m_kernelT, ptr(kernel), m_outT, ptr(srcGrad), *m_conv, m_backDataAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), accumulateGradient ? &C::One : &C::Zero, m_inT, ptr(grad)));
#endif
#if 1
        // TEMPORARY FIX: Try allocating workspace here, rather than using workspace object 
        size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionBackwardFilterWorkspaceSize(
            *m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, m_backFiltAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        cerr << "Calling BackwardFilter:" << ws_size << " vs " << workspace.BufferSize() << endl;
        CUDNN_CALL(cudnnConvolutionBackwardFilter(*m_cudnn, &C::One, m_inT, ptr(in), m_outT, ptr(srcGrad), *m_conv, m_backFiltAlgo.selectedAlgo, ws_data, ws_size, accumulateGradient ? &C::One : &C::Zero, *m_kernelT, ptr(kernelGrad)));

		CUDA_CALL(cudaFree(ws_data));
#else
        CUDNN_CALL(cudnnConvolutionBackwardFilter(*m_cudnn, &C::One, m_inT, ptr(in), m_outT, ptr(srcGrad), *m_conv, m_backFiltAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), accumulateGradient ? &C::One : &C::Zero, *m_kernelT, ptr(kernelGrad)));
#endif

Maybe that helps someone!

@delzac
Contributor

delzac commented Feb 16, 2021

Thanks for sharing the info here @dmagee! Sounds like getting CNTK set up with the latest CUDA is a non-trivial task.

@dmagee

dmagee commented Feb 16, 2021


No worries. You're absolutely right: the NVIDIA cuDNN library it is based on has changed its API, so lots of things need updating. I've just fixed the bits needed for training CNNs with cntk.exe or the C++ interface. I've only commented out various other bits (to do with RNNs and sparse matrices), and I haven't touched any Python (I don't use the Python API). I'm afraid I don't really have time to package all this up, but hopefully posting what I've done here can help someone who does.

@haryngod
Author

Thanks, @dmagee, for sharing so much of your experience.

I'm not sure whether we hit the same problem.
I changed 'cudnnGetConvolutionBackwardFilterAlgorithm' to 'cudnnGetConvolutionBackwardFilterAlgorithm_v7', since the CUDA API was updated.
You can check the cuDNN documentation, and compare it with the previous version's documentation.

@dmagee

dmagee commented Feb 19, 2021

Another issue I found was in CuDnnCommon.cpp on the line:

auto err = cudnnDestroy(*src);

This causes a crash somewhere in the NVIDIA cuDNN library. Essentially, a single instance of cudnnHandle_t is allocated when doing prediction and assigned to a shared_ptr within an instance of the CNTK CuDnn class. It is destroyed at program exit as the destructors are called. I've no idea why this causes a crash (seemingly the same pointer that was allocated is destroyed, and there are no other relevant calls to cudnnDestroy), or why it only happens when doing prediction and not learning (in C++ anyway), but my solution was to comment out everything in this tidy-up code:

    static std::shared_ptr<cudnnHandle_t> m_instance = std::shared_ptr<cudnnHandle_t>(createNew(), [](cudnnHandle_t* src)
    {
#ifndef ORIGINAL_CODE
        UNUSED(src);
#else
        // For some reason the call to cudnnDestroy causes a crash.
        // As the handle is only allocated/destroyed once, it is ok to comment this out without causing a leak.

        assert(*src != nullptr);

        auto err = cudnnDestroy(*src);
        assert(err == CUDNN_STATUS_SUCCESS);
#ifdef NDEBUG
        UNUSED(err);
#endif
        delete src;
#endif
    });

Again (like my other solution above) this is a horrible hack, as it doesn't actually fix the bug, but it does mean my programs don't crash right at the end. If lots of instances of CuDnn were created it would obviously be a memory leak, but in my code at least the handle only seems to be created once and tidied up right at the end.

Hopefully helps someone!

@evo11x

evo11x commented Mar 15, 2021

I also get this error on an RTX 3060 if I have more than 512 neurons in one layer. With an RTX 2060 it works without any error, using the same files and NVIDIA drivers.

Loading data...
Using device: GPU[0] GeForce RTX 3060

About to throw exception 'CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)'
CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

Unhandled Exception: System.ApplicationException: CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

[CALL STACK]
> Microsoft::MSR::CNTK::TensorView:: Reshaped
- Microsoft::MSR::CNTK::CudaTimer:: Stop
- Microsoft::MSR::CNTK::GPUMatrix:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::Matrix:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::TensorView:: DoMatrixProductOf
- Microsoft::MSR::CNTK::TensorView:: AssignMatrixProductOf
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: shared_from_this (x3)
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- CNTK:: CreateTrainer
- CNTK::Trainer:: TotalNumberOfUnitsSeen
- CNTK::Trainer:: TrainMinibatch (x2)
- CSharp_CNTK_Trainer__TrainMinibatch__SWIG_2
- 00007FFF157C5E45 (SymFromAddr() error: The specified module could not be found.)

@haryngod
Author

@dmagee I've faced the same issue. I think this issue has occurred in PyTorch (issue link) as well, and there is a PyTorch PR for it. Even having read those, I have no idea how to fix this yet.

@nietras

nietras commented Jun 4, 2021

I'm trying to get CNTK working on the latest CUDA 11 too, on Windows. I was wondering why I can't find any Azure Pipelines yml files, so that I could use a custom pipeline agent for testing instead of local dev. Does anyone know of a link to the Azure DevOps pipelines?

Also very interested in whatever changes needed for CUDA 11 to work.

@nietras

nietras commented Aug 5, 2021

Hello everyone, based on the work by @haryngod and others I have managed to build CNTK with CUDA 11.4 and cuDNN 8.2.2, and have made NuGet packages for this. This is detailed in a quick blog post at:

https://nietras.com/2021/08/05/reviving-cntk-with-cuda-11-4/

As mentioned there, I had hoped to release the NuGet packages on nuget.org but could not due to the size limit. Instead, the packages can be downloaded and you can add them to your own feeds.

@delzac
Contributor

delzac commented Aug 5, 2021

@nietras Amazing work! Thank you for your contributions!!

@JeppeThagaardVP

JeppeThagaardVP commented Aug 8, 2021

@nietras amazing work. Thank you very much.
However, I can't get a 1x1x256 conv to work on a (1,1,3) input variable. cuDNN throws a cuDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED. The rest of my pipelines work, but I am a bit stuck here. Any ideas?

@nietras

nietras commented Aug 8, 2021

@JeppeThagaardVP thanks. I have not hit this issue myself. Do you have some simple reproduction code showing this, e.g. in C#, so I don't have to guess at dimension order etc.?

@JeppeThagaardVP

@nietras, not immediately, but I will work on getting it. In the meantime, I can try to explain what the failing pipeline is trying to achieve.
It's basically part of an image pooling layer from an atrous spatial pyramid pooling block (ASPP from DeepLabV3).

Input: dim = 2x2x3
CNTK::ReduceMean(Input, { CNTK::Axis(0),CNTK::Axis(1) }, true) //GlobalAvgPooling -> dim = 1x1x3
CNTK::Convolution(dim = {1,1,3,256}, strides = {1,1,3}) -> dim = 1x1x256

The convolution operation throws a cuDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED, which it did not before.

Any help would be super appreciated :)

@nietras

nietras commented Aug 9, 2021

@JeppeThagaardVP basically the only change to conv is to use cudnnGetConvolutionForwardAlgorithm_v7, as can be seen in https://github.com/nietras/CNTK/pull/6/files.

However, one thing I don't understand in your example code is the stride of 1,1,3. I understand that this worked before, but what is intended with this stride? I can understand 1,1 but not 1,1,3. I assume you are trying to do a 1x1 conv2D? :)

Based on our higher-level API I made some simple example code, and with stride 1,1 it does what I expect.

@JeppeThagaardVP

JeppeThagaardVP commented Aug 9, 2021

@nietras that's weird, it does not work on my end. Can I get you to post the exact convolution map and parameters for the CNTK::Convolution operation?

The stride params are similar to this example (https://github.com/microsoft/CNTK/blob/b7d4945a8e604268b344e6286e8993bacdba6e5c/Tests/EndToEndTests/CNTKv2Library/Common/Image.h):

auto convFunction = Convolution(convParams, input, { hStride, vStride, numInputChannels });

@nietras

nietras commented Aug 10, 2021

@JeppeThagaardVP I am not using the C++ API, hence they are not directly comparable. However, I have saved a simple model that does what I expect/assume you want in a cntk file. The conv, as viewed in Netron, does have the same stride as you provide, so perhaps the C++ API has a different way of expressing this.

conv.zip

Perhaps you can try to load this model and evaluate it. It works on my PC 😅 Note I am only doing evaluation, which might be the issue...

(screenshot of the model in Netron)

@sigfrid696

@nietras amazing work. Thank you very much.
However, I can't get a 1x1x256 conv to work on a (1,1,3) input variable. cuDNN throws a cuDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED. The rest of my pipelines work, but I am a bit stuck here. Any ideas?

Hello, I am trying CNTK with CUDA 11.4 on my project: CNTK with a Faster R-CNN model, C++ inference.
I get the same error that you have: CUDNN_STATUS_NOT_SUPPORTED on a forward conv, in the cudnnConvolutionForward() call. Before, on CUDA 10.0, it was working without any problem.
It seems something is broken here.
Do you have any news about it?

Thank you very much!!

@sigfrid696

I managed to do more tests, and I found that the code related to the change from cudnnGetConvolutionForwardAlgorithm to cudnnGetConvolutionForwardAlgorithm_v7 (and the same for the backward part) is wrong. The change just selects the first algorithm, without considering that the old call had a parameter to specify the maximum workspace size currently available. If you take the first algorithm returned, it can happen that the allocated workspace is not large enough, leading to a CUDNN_STATUS_NOT_SUPPORTED error. The correct way is to iterate over all returned algorithms and choose one whose required workspace size is <= the available workspace size.
Here is part of the code:

Old code:

/*
if (!noMem)
    return cudnnGetConvolutionForwardAlgorithm(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT, workspace.BufferSize(), &algo);
return cudnnGetConvolutionForwardAlgorithm(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, CUDNN_CONVOLUTION_FWD_NO_WORKSPACE, 0, &algo);
*/

New code:

int res_count = 0;
std::unique_ptr<cudnnConvolutionFwdAlgoPerf_t[]> fwd_perf(new cudnnConvolutionFwdAlgoPerf_t[100]);
cudnnStatus_t result;
result = cudnnGetConvolutionForwardAlgorithm_v7(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, 100, &res_count,
                                                fwd_perf.get());
size_t sizeBytes = 0;
if (!noMem)
    sizeBytes = workspace.BufferSize();

size_t tmpSize = 0;
cudnnStatus_t err = CUDNN_STATUS_EXECUTION_FAILED;
for (int i = 0; i < res_count; i++)
{
    auto err0 = cudnnGetConvolutionForwardWorkspaceSize(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT,
                                                        fwd_perf[i].algo, &tmpSize);
    if (err0 == CUDNN_STATUS_SUCCESS)
    {
        //printf("found algo sizeBytes %d algo size %d\n", sizeBytes, tmpSize);
        if (tmpSize <= sizeBytes)
        {
            algo = fwd_perf[i].algo;
            err = err0;
            break;
        }
    }
}
return err;

For the other parts (backward) I can make a pull request...
I still have some problems, in the sense that I can now correctly execute all the inference code, but only if I launch it in release mode with Ctrl+F5, i.e. release without debugging. If I launch the application directly, the app crashes in an unmanaged way, so I don't have any hint about where the remaining problem is. Usually these kinds of problems are related to uninitialized memory. Do you have any idea which tools to use to find these problems with CUDA?
The biggest rewrite for CUDA 11 is in the GPU sparse matrix module, so the problem could be there...

@sigfrid696

sigfrid696 commented Aug 27, 2021

No idea if I'm talking to myself, but the exception I reported above is due to the fact that the workspace size calculation in CNTK seems broken (too small) in 3 places in CuDnnConvolutionEngine.cu. Slightly hacky, but replacing the CNTK workspace object with an inline allocation seems to have my c++ code training!

@dmagee the problem that you saw is not a real one: CNTK grows the workspace incrementally as it starts processing images, so at the start of the program the workspace size is 0 and it then gets incremented. This is also the reason why the first images of a batch are usually slower than the others.
Your temporary fix works because it recalculates the workspace size based on the chosen algorithm.
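
A less intrusive variant of dmagee's temporary fix would be to grow the shared workspace to the selected algorithm's requirement instead of calling cudaMalloc/cudaFree on every forward pass. This is only a sketch, written as a fragment in the same style as the snippets above; it assumes workspace is the same matrix object whose BufferSize/Resize calls appear elsewhere in this thread, with BufferSize reporting bytes:

        // Sketch only, not a tested patch: grow the shared workspace when the selected
        // forward algorithm needs more scratch memory than the workspace currently holds.
        size_t ws_size = 0;
        CUDNN_CALL(cudnnGetConvolutionForwardWorkspaceSize(
            *m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, m_fwdAlgo.selectedAlgo, &ws_size));
        if (ws_size > workspace.BufferSize())
            workspace.Resize((ws_size + sizeof(ElemType) - 1) / sizeof(ElemType), 1);
        CUDNN_CALL(cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out)));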

@JohanDore

First of all: @nietras & @sigfrid696 thanks a lot for the great effort.

I am working with @JeppeThagaardVP on this and we are very close to having it working here, and I wonder if @sigfrid696 has a pull request I could use that also covers cudnnGetConvolutionBackwardDataAlgorithm_v7, cudnnGetConvolutionBackwardFilterAlgorithm_v7, ...

@JohanDore

JohanDore commented Aug 28, 2021

I gave @sigfrid696's proposal a shot myself, and the changes below made CNTK work in our application using CUDA 11.4.

cudnnGetConvolutionForwardAlgorithm_v7:

        // 2020.12.09 - mj.jo
        // cuda 11.1
        /*int res_count = 0;
        cudnnConvolutionFwdAlgoPerf_t fwd_perf;

        cudnnStatus_t result;
        if (!noMem)
            result = cudnnGetConvolutionForwardAlgorithm_v7(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, 1, &res_count, &fwd_perf);
        else
            result = cudnnGetConvolutionForwardAlgorithm_v7(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, 0, &res_count, &fwd_perf);

        algo = fwd_perf.algo;
        return result;
        */

        // 2021.02.28
        // cuda 11.4
        int res_count = 0;
        std::unique_ptr<cudnnConvolutionFwdAlgoPerf_t[]> fwd_perf(new cudnnConvolutionFwdAlgoPerf_t[100]);
        cudnnStatus_t result;
        result = cudnnGetConvolutionForwardAlgorithm_v7(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, 100, &res_count,
                                                        fwd_perf.get());
        size_t sizeBytes = 0;
        if (!noMem)
            sizeBytes = workspace.BufferSize();

        size_t tmpSize = 0;
        cudnnStatus_t err = CUDNN_STATUS_EXECUTION_FAILED;
        for (int i = 0; i < res_count; i++)
        {
            auto err0 = cudnnGetConvolutionForwardWorkspaceSize(*m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT,
                                                                fwd_perf[i].algo, &tmpSize);
            if (err0 == CUDNN_STATUS_SUCCESS)
            {
                //printf("found algo sizeBytes %d algo size %d\n", sizeBytes, tmpSize);
                if (tmpSize <= sizeBytes)
                {
                    algo = fwd_perf[i].algo;
                    err = err0;
                    break;
                }
            }
        }
        return err;
    };

cudnnGetConvolutionBackwardDataAlgorithm:

       // 2020.12.09 - mj.jo
        // cuda 10.0
        /*if (!noMem)
            return cudnnGetConvolutionBackwardDataAlgorithm(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT, workspace.BufferSize(), &algo);
        return cudnnGetConvolutionBackwardDataAlgorithm(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE, 0, &algo);*/

        // 2020.12.09 - mj.jo
        // cuda 11.1
        /*
        int res_count = 0;
        cudnnConvolutionBwdDataAlgoPerf_t bwd_perf;

        cudnnStatus_t result;
        if (!noMem)
            result = cudnnGetConvolutionBackwardDataAlgorithm_v7(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, 1, &res_count, &bwd_perf);
        else
            result = cudnnGetConvolutionBackwardDataAlgorithm_v7(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, 0, &res_count, &bwd_perf);

        algo = bwd_perf.algo;
        return result;
        */

        // 2021.02.28
        // cuda 11.4
        int res_count = 0;
        std::unique_ptr<cudnnConvolutionBwdDataAlgoPerf_t[]> bwd_perf(new cudnnConvolutionBwdDataAlgoPerf_t[100]);
        cudnnStatus_t result;
        result = cudnnGetConvolutionBackwardDataAlgorithm_v7(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, 100, &res_count,
                                                             bwd_perf.get());
        size_t sizeBytes = 0;
        if (!noMem)
          sizeBytes = workspace.BufferSize();

        size_t tmpSize = 0;
        cudnnStatus_t err = CUDNN_STATUS_EXECUTION_FAILED;
        for (int i = 0; i < res_count; i++)
        {
            auto err0 = cudnnGetConvolutionBackwardDataWorkspaceSize(*m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT,
                                                                     bwd_perf[i].algo, &tmpSize);
            if (err0 == CUDNN_STATUS_SUCCESS)
            {
                //printf("found algo sizeBytes %d algo size %d\n", sizeBytes, tmpSize);
                if (tmpSize <= sizeBytes)
                {
                    algo = bwd_perf[i].algo;
                    err = err0;
                    break;
                }
            }
        }
        return err;
    };

cudnnGetConvolutionBackwardFilterAlgorithm_v7:

        // 2020.12.09 - mj.jo
        // cuda 10.0
        //if(!noMem)
        //    return cudnnGetConvolutionBackwardFilterAlgorithm(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT, workspace.BufferSize(), &algo);
        //// special case for half/odd filter
        //if(m_kernelT->isOdd() && m_dataType == CUDNN_DATA_HALF)
        //{
        //    size_t tmpSize = 0;
        //    algo = (cudnnConvolutionBwdFilterAlgo_t) 1;
        //    auto err = cudnnGetConvolutionBackwardFilterWorkspaceSize(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, algo, &tmpSize);
        //    workspace.Resize((tmpSize + sizeof(ElemType) - 1) / sizeof(ElemType), 1);
        //    return err;
        //}
        //return cudnnGetConvolutionBackwardFilterAlgorithm(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE, 0, &algo);

        // 2020.12.09 - mj.jo
        // cuda 11.1
        /*
        int res_count = 0;
        cudnnConvolutionBwdFilterAlgoPerf_t bwd_perf;
        cudnnStatus_t result;

        if (!noMem)
        {
            result = cudnnGetConvolutionBackwardFilterAlgorithm_v7(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, 1, &res_count, &bwd_perf);
            algo = bwd_perf.algo;
            return result;
        }
        // special case for half/odd filter
        if (m_kernelT->isOdd() && m_dataType == CUDNN_DATA_HALF)
        {
            size_t tmpSize = 0;
            algo = (cudnnConvolutionBwdFilterAlgo_t) 1;
            auto err = cudnnGetConvolutionBackwardFilterWorkspaceSize(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, algo, &tmpSize);
            workspace.Resize((tmpSize + sizeof(ElemType) - 1) / sizeof(ElemType), 1);
            return err;
        }
        
        result = cudnnGetConvolutionBackwardFilterAlgorithm_v7(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, 0, &res_count, &bwd_perf);
        algo = bwd_perf.algo;
        return result;*/

        // 2021.02.28
        // cuda 11.4
        int res_count = 0;
        std::unique_ptr<cudnnConvolutionBwdFilterAlgoPerf_t[]> bwf_perf(new cudnnConvolutionBwdFilterAlgoPerf_t[100]);
        cudnnStatus_t result;
        result = cudnnGetConvolutionBackwardFilterAlgorithm_v7(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, 100, &res_count, bwf_perf.get());

        size_t sizeBytes = 0;
        if (!noMem)
            sizeBytes = workspace.BufferSize();

        size_t tmpSize = 0;
        cudnnStatus_t err = CUDNN_STATUS_EXECUTION_FAILED;
        for (int i = 0; i < res_count; i++)
        {
            auto err0 = cudnnGetConvolutionBackwardFilterWorkspaceSize(*m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT,
                                                                     bwf_perf[i].algo, &tmpSize);
            if (err0 == CUDNN_STATUS_SUCCESS)
            {
                if (tmpSize <= sizeBytes)
                {
                    algo = bwf_perf[i].algo;

                    if (m_kernelT->isOdd() && m_dataType == CUDNN_DATA_HALF)
                    {
                      workspace.Resize((tmpSize + sizeof(ElemType) - 1) / sizeof(ElemType), 1);
                    }

                    err = err0;
                    break;
                }
            }
        }

        return err;
    };

@sigfrid696 I wonder if you would mind reviewing this and making a pull request. After all, it was you who solved it :-)

I also wonder about the if (!noMem) checks: can the noMem case ever happen, and if so, I guess the functions will always return CUDNN_STATUS_EXECUTION_FAILED?

@sigfrid696

sigfrid696 commented Aug 29, 2021

Thank you @JohanDore! I'll review your code and make the pull request by tomorrow... It seems very similar to mine. The problem I have is that launching my app outside of Visual Studio (as opposed to release mode with Ctrl+F5) results in an unmanaged crash. I suspect the problem is not in the changes we're discussing here but in other parts of the porting (uninitialized memory).
Did you notice a similar problem on your side?

Regarding the noMem condition: it is triggered at the start of the application, but it is not a problem because the _v7 functions also return conv algorithms that don't need user workspace memory (size 0).

@JohanDore

@sigfrid696 I had to apply quite a few project edits to get it to compile. The changes probably don't deserve a pull request, but maybe they can help you: CNTK_2_8_1_6a20c25a0a8dd7ec21cbb8926f7085e11afd41b3.zip. BTW the zip also includes the changes above.

@sigfrid696


Thank you @JohanDore.
I managed to solve the problem: a CUDA dependency was missing from the working directory, but it was found when launching from Visual Studio.
I have created a pull request on @nietras's repo with the modifications for the cudnnGetConvolutionForwardAlgorithm API change.
Now my app is fully working...

@nietras

nietras commented Sep 5, 2021

@JohanDore @JeppeThagaardVP @sigfrid696 thanks to your work I have released a new version, 2.8.2, with your fixes. I hope this solves your issues. I still don't understand why an algo count of 100 is needed, given the algo enumerations have fewer than 10 values 😅 If someone cares to explain, I am all ears.

https://github.com/nietras/CNTK/releases/tag/v2.8.2

@sigfrid696

sigfrid696 commented Sep 5, 2021

Hi @nietras and everyone! The new _v7 functions are capable of giving back more than 10 algorithms... The old functions accepted a parameter to filter by workspace size, but the new ones don't... So if you don't increase the requested count, the algorithms that need no extra memory (size 0) may not be given back: for example, in my tests the algorithm I needed came back in twelfth position.

Hope that clarifies it :)
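
As a side note, cuDNN can report this upper bound itself, so the hard-coded 100 can be avoided. A minimal sketch follows (a hypothetical helper, not the merged CNTK change), assuming only the standard cuDNN 7+ API:

#include <vector>
#include <cudnn.h>

// Hypothetical helper: size the perf-results array from cudnnGetConvolutionForwardAlgorithmMaxCount
// instead of a hard-coded 100, then query the ranked algorithms with the _v7 API.
cudnnStatus_t QueryForwardAlgos(cudnnHandle_t handle,
                                cudnnTensorDescriptor_t inT,
                                cudnnFilterDescriptor_t kernelT,
                                cudnnConvolutionDescriptor_t conv,
                                cudnnTensorDescriptor_t outT,
                                std::vector<cudnnConvolutionFwdAlgoPerf_t>& perf)
{
    int maxAlgoCount = 0;
    cudnnStatus_t err = cudnnGetConvolutionForwardAlgorithmMaxCount(handle, &maxAlgoCount);
    if (err != CUDNN_STATUS_SUCCESS)
        return err;

    perf.resize(maxAlgoCount);
    int returnedAlgoCount = 0;
    err = cudnnGetConvolutionForwardAlgorithm_v7(handle, inT, kernelT, conv, outT,
                                                 maxAlgoCount, &returnedAlgoCount, perf.data());
    if (err == CUDNN_STATUS_SUCCESS)
        perf.resize(returnedAlgoCount); // keep only the entries cuDNN actually filled in
    return err;
}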

@nietras

nietras commented Sep 5, 2021

@sigfrid696 thanks, that helps! And thanks again for doing it, and apologies for generating extra work with my changes 😅

@sigfrid696

sigfrid696 commented Sep 6, 2021

With the aim of keeping the CNTK project alive, I have made some more changes to the original repo over the last few months.
I think this could be the place to discuss them further...
In particular, referring to the C++ inference code and R-CNN networks:

  • I added GPU support to the FRCNN proposal layer. Only NMS makes sense to execute on the GPU; the rest of the code is executed on the CPU with optimal performance
  • The FRCNN code can now be executed fully on the GPU (previously an exception was thrown when executing on the GPU)
  • In this context I made a change to the CNTK API, adding a method to check whether a function (in particular a user-defined function, i.e. the proposal layer) can be executed on a particular device: if it cannot, and for user-defined routines, the framework copies the memory to a supported device, executes the user-defined layer, then copies the results back to the original device; in this way the original CPU-only proposal layer can still be executed in the context of a run on a GPU device. This behavior is currently implemented only for the Forward function.

@nietras let me know if you are interested in merging these modifications as well... I think I'll update my repo with them in the next few days...

@nietras

nietras commented Sep 7, 2021

@sigfrid696 we don't use these specific features, but yes, I would be interested in a PR for that. If you could continue on a branch from my fork's master in your fork, that would make it easier, I think, to avoid merge hell.

It would be great if someone else following this was using these features and could test/verify that they work for others too. :)

@sigfrid696

I admit I'm not much of an expert with GitHub. I believe I would continue with the fork I made from your master... the same one as for the previous PR.

@nietras

nietras commented Sep 7, 2021

@sigfrid696 it can be a bit daunting for someone new to the pull-request flow: https://guides.github.com/introduction/flow/. If you are continuing from the latest nietras/master, and a branch from that, say sigfrid696/frcnn-gpu, then hopefully it should be good.

@nietras

nietras commented Sep 7, 2021

To be clear, what I mean is that you need to update to the latest nietras/master, with the latest code after your PR got merged. I bumped the version, etc.

@sigfrid696

I updated to the latest nietras/master and made a PR from my branch frcnn-gpu.
