
Converting uint64_t Host Descriptor to a compatible type for the Binary Comparator to work #1

Closed
buzzsuresh opened this issue Nov 5, 2016 · 24 comments


@buzzsuresh

Hi,

I'm trying to use the CUDA LATCH processing logic described in main.cpp to compare two different images for matching.

The objective is to use brute-force Hamming (any binary comparator) to identify matches between the two host descriptors obtained from images 1 and 2 respectively. For knnMatch to work, it expects the data type to be either CV_8UC1 or CV_32F.

To use the knnMatch matcher for comparison, I made a Mat object explicitly from the uint64_t host descriptor. Converting uint64_t to either 8U or 32F creates data loss, so the KNN matcher fails to identify matches across descriptors.

I would like your suggestion on how to convert the uint64_t host descriptor to 8U/32F without data loss, so that any binary comparator will work.

Kindly advise.


std::cout << "Desc 1 " << endl;
uint64_t** tempDesc1 = new uint64_t*[kps1.size()];
int nMark1 = 0;
for (int j=0; j<kps1.size(); j++)
{
    tempDesc1[j] = new uint64_t[8];  // 8 x uint64_t = 512 bits per keypoint
    for (int k = 0; k < 8; k++, nMark1++)
        tempDesc1[j][k] = h_GPUdesc1[nMark1];
}
cv::Mat matDesc1(kps1.size(), 8, CV_8UC1, tempDesc1);

std::cout << "Desc 2 " << endl;
uint64_t** tempDesc2 = new uint64_t*[kps2.size()];
int nMark2 = 0;
for (int j=0; j<kps2.size(); j++)
{
    tempDesc2[j] = new uint64_t[8];
    for ( int k = 0; k < 8; k++, nMark2++)
        tempDesc2[j][k] = h_GPUdesc2[nMark2];
}
cv::Mat matDesc2(kps2.size(), 8, CV_8UC1, tempDesc2);

vector<cv::DMatch> good_matches;
cv::BFMatcher matcher(cv::NORM_HAMMING);
vector< vector<cv::DMatch> > matches;
matcher.knnMatch(matDesc1, matDesc2, matches, 2);

Thanks.

@komrad36
Owner

komrad36 commented Nov 5, 2016

Hi,

The descriptors output from CLATCH are a contiguous block of 512-bit binary descriptors. You can shuffle the data around to make them work with OpenCV, but the whole point of my utilities is to improve on the very poor performance of OpenCV. My pipeline replaces OpenCV's, so you don't have to worry about trying to get the descriptors into a format suitable for OpenCV's matchers. Just use my K2NN matcher (for CPU matching) or my CUDAK2NN matcher (for GPU matching). They work naturally with the descriptors from CLATCH. As with all my computer vision projects, there's a test harness and demo.

My matchers are available here:
https://github.com/komrad36/K2NN
https://github.com/komrad36/CUDAK2NN
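If OpenCV interop is still wanted despite that, note that no lossy conversion is actually needed: a 512-bit binary descriptor is just 64 bytes, so the contiguous uint64_t buffer can be reinterpreted byte-for-byte as CV_8UC1 rows and matched with NORM_HAMMING. A minimal sketch of the reinterpretation (the `descriptorBytes` helper and the commented cv::Mat line are illustrative, not part of either project):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Each CLATCH descriptor is 512 bits: 8 x uint64_t = 64 bytes, stored
// contiguously for all keypoints. A Hamming matcher only needs the raw
// bits, so reinterpret the same bytes instead of converting values:
//   cv::Mat desc((int)numKps, 64, CV_8UC1, h_GPUdesc);  // zero-copy wrap
std::vector<uint8_t> descriptorBytes(const uint64_t* desc, size_t numKps) {
    std::vector<uint8_t> out(64 * numKps);      // 64 bytes per keypoint
    std::memcpy(out.data(), desc, out.size());  // no data loss: same bits
    return out;
}
```

The wrap is lossless because it never changes a bit, only the element type the matcher sees.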

Please let me know if I can help with anything else!

@komrad36 komrad36 closed this as completed Nov 5, 2016
@buzzsuresh
Author

Much appreciated, thank you.

To call it as described, using "CUDAK2NN(d_tvecs, size, tex_q, size, d_matches, threshold)":

I could pass in the first image's uint64 descriptor pointer for the first parameter and the second image's "cudaTextureObject_t" object in place of the fourth param. Please correct me if I'm wrong.

I just tried what I described above and am getting 0 matches (for this trial, images 1 and 2 are the same).

@komrad36
Owner

komrad36 commented Nov 6, 2016

Yeah, the main.cpp demo in CUDAK2NN contains an example showing the binding of a texture object and such. Furthermore, if you've just used CLATCH, the descriptors are already in GPU memory! So all you have to do is wrap the linear memory in a texture object and call CUDAK2NN.

If you do everything correctly, you should get a very high number of matches if you use the same image for both training and query; might even get 100% matches.
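For reference, a hedged sketch of that wrapping step (variable names follow this thread's conventions; the channel format and the descriptor-count size arguments are assumptions here, so take the exact values from the CUDAK2NN demo's main.cpp):

```cuda
// Hypothetical: wrap the query descriptors, already in GPU linear
// memory after CLATCH, in a texture object for CUDAK2NN.
cudaResourceDesc resdesc_q;
memset(&resdesc_q, 0, sizeof(resdesc_q));
resdesc_q.resType = cudaResourceTypeLinear;
resdesc_q.res.linear.devPtr = d_desc2;               // query descriptors
resdesc_q.res.linear.desc =
    cudaCreateChannelDesc(32, 32, 0, 0, cudaChannelFormatKindUnsigned);
resdesc_q.res.linear.sizeInBytes = 64 * kps2.size(); // 64 B per descriptor

cudaTextureDesc texdesc_q;
memset(&texdesc_q, 0, sizeof(texdesc_q));
texdesc_q.addressMode[0] = cudaAddressModeClamp;
texdesc_q.filterMode = cudaFilterModePoint;
texdesc_q.readMode = cudaReadModeElementType;
texdesc_q.normalizedCoords = 0;

cudaTextureObject_t tex_q = 0;
cudaCreateTextureObject(&tex_q, &resdesc_q, &texdesc_q, nullptr);

// Then match descriptors against descriptors, not the image texture:
// CUDAK2NN(d_desc1, (int)kps1.size(), tex_q, (int)kps2.size(),
//          d_matches, threshold);
```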

@buzzsuresh
Author

buzzsuresh commented Nov 7, 2016

Thanks for the suggestion.
To keep it simple, I used an independent set of variables for image 1 and image 2 respectively.
Code is below:


int main() {
    constexpr int runs = 500;
    constexpr int warmups = 100;
    constexpr int numkps = 10000;
    constexpr char name1[] = "1.png";
    constexpr char name2[] = "2.png";
    float reSizeFactor = 0.7;
    constexpr int size = 10000;
    constexpr int threshold = 5;
    clock_t t;

    cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

    cudaEvent_t latchFinishedEvent;
    cudaEventCreate(&latchFinishedEvent);

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    /************************* image loading **************************/
    cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image1.data) {
        std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
        return EXIT_FAILURE;
    }

    cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image2.data) {
        std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    /************************* image loading **************************/


    /************************* For image 1  **************************/
    std::cout << std::endl << "Detecting 1..." << std::endl;
    //cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
    cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints1;
    //double t = (double)getTickCount();
    orb1->detect(image1, keypoints1);
    //t = ((double)getTickCount() - t)/getTickFrequency();
    //cout << "ORB Detecting [ms] : " << (t/1.0)*1000 << endl;
    keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

    std::vector<KeyPoint> kps1;
    for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    uint64_t* d_desc1;
    cudaMalloc(&d_desc1, 64 * kps1.size());

    KeyPoint* d_kps1;
    cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets1;
    cudaMalloc(&d_triplets1, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets1, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip1 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr1;
    cudaMallocArray(&d_trip_arr1, &chandesc_trip1, 512);
    cudaMemcpyToArray(d_trip_arr1, 0, 0, d_triplets1, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip1;
    memset(&resdesc_trip1, 0, sizeof(resdesc_trip1));
    resdesc_trip1.resType = cudaResourceTypeArray;
    resdesc_trip1.res.array.array = d_trip_arr1;
    struct cudaTextureDesc texdesc_trip1;
    memset(&texdesc_trip1, 0, sizeof(texdesc_trip1));
    texdesc_trip1.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip1.filterMode = cudaFilterModePoint;
    texdesc_trip1.readMode = cudaReadModeElementType;
    texdesc_trip1.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex1 = 0;
    cudaCreateTextureObject(&d_trip_tex1, &resdesc_trip1, &texdesc_trip1, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr1;
    cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
    cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img1;
    memset(&resdesc_img1, 0, sizeof(resdesc_img1));
    resdesc_img1.resType = cudaResourceTypeArray;
    resdesc_img1.res.array.array = d_img_arr1;
    struct cudaTextureDesc texdesc_img1;
    memset(&texdesc_img1, 0, sizeof(texdesc_img1));
    texdesc_img1.addressMode[0] = cudaAddressModeClamp;
    texdesc_img1.addressMode[1] = cudaAddressModeClamp;
    texdesc_img1.filterMode = cudaFilterModePoint;
    texdesc_img1.readMode = cudaReadModeElementType;
    texdesc_img1.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex1 = 0;
    cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

    std::cout << "Warming up 1..." << std::endl;
    for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
    std::cout << "Testing 1..." << std::endl;
    high_resolution_clock::time_point start1 = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
    high_resolution_clock::time_point end1 = high_resolution_clock::now();
    std::cout << std::endl << "CLATCH 1 took " << static_cast<double>((end1 - start1).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps1.size())) << " us per desc over " << kps1.size() << " desc" << (kps1.size() == 1 ? "." : "s.") << std::endl << std::endl;
    uint64_t* h_GPUdesc1 = new uint64_t[8 * kps1.size()];
    cudaMemcpy(h_GPUdesc1, d_desc1, 64 * kps1.size(), cudaMemcpyDeviceToHost);
    //std::cout << "CUDA reports 1 " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    long long total1 = 0;
    for (size_t i = 0; i < 8 * kps1.size(); ++i) total1 += h_GPUdesc1[i];
    //std::cout << "Checksum 1: " << std::hex << total1 << std::endl << std::endl;
    std::cout << std::dec;
    /************************* For image 1  **************************/


    /************************* For image 2  **************************/
    std::cout << std::endl << "Detecting 2..." << std::endl;

    //cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
    cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints2;
    orb2->detect(image2, keypoints2);
    keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

    std::vector<KeyPoint> kps2;
    for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    uint64_t* d_desc2;
    cudaMalloc(&d_desc2, 64 * kps2.size());

    KeyPoint* d_kps2;
    cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets2;
    cudaMalloc(&d_triplets2, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets2, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip2 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr2;
    cudaMallocArray(&d_trip_arr2, &chandesc_trip2, 512);
    cudaMemcpyToArray(d_trip_arr2, 0, 0, d_triplets2, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip2;
    memset(&resdesc_trip2, 0, sizeof(resdesc_trip2));
    resdesc_trip2.resType = cudaResourceTypeArray;
    resdesc_trip2.res.array.array = d_trip_arr2;
    struct cudaTextureDesc texdesc_trip2;
    memset(&texdesc_trip2, 0, sizeof(texdesc_trip2));
    texdesc_trip2.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip2.filterMode = cudaFilterModePoint;
    texdesc_trip2.readMode = cudaReadModeElementType;
    texdesc_trip2.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex2 = 0;
    cudaCreateTextureObject(&d_trip_tex2, &resdesc_trip2, &texdesc_trip2, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr2;
    cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
    cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img2;
    memset(&resdesc_img2, 0, sizeof(resdesc_img2));
    resdesc_img2.resType = cudaResourceTypeArray;
    resdesc_img2.res.array.array = d_img_arr2;
    struct cudaTextureDesc texdesc_img2;
    memset(&texdesc_img2, 0, sizeof(texdesc_img2));
    texdesc_img2.addressMode[0] = cudaAddressModeClamp;
    texdesc_img2.addressMode[1] = cudaAddressModeClamp;
    texdesc_img2.filterMode = cudaFilterModePoint;
    texdesc_img2.readMode = cudaReadModeElementType;
    texdesc_img2.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex2 = 0;
    cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

    std::cout << "Warming up 2..." << std::endl;
    for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex2, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
    std::cout << "Testing 2..." << std::endl;
    high_resolution_clock::time_point start2 = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CLATCH(d_img_tex2, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
    high_resolution_clock::time_point end2 = high_resolution_clock::now();
    std::cout << std::endl << "CLATCH 2 took " << static_cast<double>((end2 - start2).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps2.size())) << " us per desc over " << kps2.size() << " desc" << (kps2.size() == 1 ? "." : "s.") << std::endl << std::endl;
    uint64_t* h_GPUdesc2 = new uint64_t[8 * kps2.size()];
    cudaMemcpy(h_GPUdesc2, d_desc2, 64 * kps2.size(), cudaMemcpyDeviceToHost);
    //std::cout << "CUDA reports 2" << cudaGetErrorString(cudaGetLastError()) << std::endl;

    long long total2 = 0;
    for (size_t i = 0; i < 8 * kps2.size(); ++i) total2 += h_GPUdesc2[i];
    //std::cout << "Checksum 2: " << std::hex << total2 << std::endl << std::endl;
    std::cout << std::dec;
    /************************* For image 2  **************************/

    /************************* Matching **************************/
    int* d_matches;
    cudaMalloc(&d_matches, 4 * size);

    std::cout << std::endl << "Warming up..." << std::endl;
    for (int i = 0; i < warmups; ++i) CUDAK2NN(d_desc1, size, d_img_tex2, size, d_matches, threshold);
    std::cout << "Testing..." << std::endl;
    high_resolution_clock::time_point start = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CUDAK2NN(d_desc1, size, d_img_tex2, size, d_matches, threshold);
    high_resolution_clock::time_point end = high_resolution_clock::now();

    int* h_matches = reinterpret_cast<int*>(malloc(4 * size));
    cudaMemcpy(h_matches, d_matches, 4 * size, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    std::vector<Match> hostMatches;
    for (int i = 0; i < size; ++i) {
        if (h_matches[i] != -1) hostMatches.emplace_back(i, h_matches[i]);
    }

    double sec = static_cast<double>(duration_cast<nanoseconds>(end - start).count()) * 1e-9 / static_cast<double>(runs);
    std::cout << "CUDAK2NN found " << hostMatches.size() << " matches in " << sec * 1e3 << " ms" << std::endl;
    std::cout << "Throughput: " << static_cast<double>(size)*static_cast<double>(size) / sec * 1e-9 << " billion comparisons/second." << std::endl << std::endl;
    /************************* Matching **************************/

}


**Results:**
Detecting 1...
Warming up 1...
Testing 1...
CLATCH 1 took 1.1825 us per desc over 9984 descs.

Detecting 2...
Warming up 2...
Testing 2...
CLATCH 2 took 1.18514 us per desc over 9984 descs.

Warming up...
Testing...
CUDAK2NN found 0 matches in 4.37826 ms
Throughput: 22.8401 billion comparisons/second.


Please help me see if I've missed anything fundamental that is resulting in 0 matches.
Thanks a lot.

@buzzsuresh
Author

To add: I just ran your "CUDAK2NN" project as-is. The same result here too - 0 matches.

In the Makefile I just changed "-std=gnu++17" to "-std=c++0x". Does this impact the result by any chance? When I leave it as "-std=gnu++17", the Makefile doesn't compile for me.

Please advise.

@komrad36
Owner

komrad36 commented Nov 7, 2016

Hi,

I see a number of problems in the code involving allocations and CUDA
transfers and such. First, though, let's address the fact that you get 0
matches even with the default CUDAK2NN project, as that shouldn't happen. I
am aware of a particular combination of CUDA driver, CUDA version, and GPU
that actually has a nasty synchronization bug that causes things like that
to happen on kernels that push the envelope as much as mine does. There's
nothing I can do about it, it's an NVIDIA bug. We can insert a manual
synchronization that is not necessary - this fixes the problem but causes a
large slowdown. Updating drivers to the latest and CUDA version to 8.0 is a
good start.
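(For concreteness, a manual synchronization of the kind I mean would look roughly like the following; this is a diagnostic sketch, not a recommended permanent fix:)

```cuda
// Hypothetical diagnostic: serialize after the kernel launch and
// surface any pending CUDA error. Slow, but isolates async bugs.
CUDAK2NN(d_tvecs, size, tex_q, size, d_matches, threshold);
cudaError_t err = cudaDeviceSynchronize();  // manual synchronization
if (err != cudaSuccess)
    std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
```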

Are you running on a 970m, 980m, 960, or 970, by any chance? I've seen a
970m have that problem.

I'd recommend either -std=gnu++17 or -std=gnu++14. Use a recent g++
compiler that can support one of those. (But that's not related to the
problem here, just a general recommendation for my computer vision stuff.)

Thanks,
Kareem


@buzzsuresh
Author

Yes, as commented, the GPU config is a "GeForce GTX 970", 4095 MB, sm_52, driver/runtime version 7.50/7.50.

Let me go through them one by one and try my best to fix them. Thanks for the review and comments.

@komrad36
Owner

komrad36 commented Nov 7, 2016

Ah, sorry, I missed your GPU config. Thanks. Okay, I think the 9xx series
has this strange problem. I do not have very much information on it, since
it's either a driver or hardware bug, and I also do not have a 9xx for
testing. Please try upgrading to CUDA 8 if possible. I do not know whether
that will fix it, but it's the first thing to try, and it will be useful
to know if CUDA 8 fixes it. If not, we can try manual synchronization.

For now it's probably best to just test this with the canonical CUDAK2NN
sources until we address this problem, then do the custom code later.

Thanks,
Kareem


@buzzsuresh
Author

Thanks, Kareem.

I'm getting CUDA 8 set up on the same machine (970), running Ubuntu 16.04.
I'll see whether that fixes the issue and will get back with the result in a short while.

If I understood correctly, you are referring to the device-to-host synchronization that is suspected to be failing.

Can you please shed some light on how to perform the manual synchronization, so that I can have that block of code ready in the meantime?

Thanks for your continued support.

@komrad36
Owner

komrad36 commented Nov 8, 2016

No problem! Please let me know whether CUDA 8 and the latest drivers fix it. That would be nice...

Meanwhile, I'm actually talking about a CUDA __syncthreads(), but when I looked back through my notes, that's actually in CLATCH, not in CUDAK2NN, which does not use shared memory. There is something else going on causing it to find 0 matches. Let's see if we can figure out what it is. Can you try deleting the CUDAK2NN sources and downloading a fresh copy from the latest branch? If that doesn't help, we'll start adding debug output to try to see why it's failing. I've never actually seen any problems with CUDAK2NN on any card before.

Thanks!

@buzzsuresh
Author

buzzsuresh commented Nov 8, 2016

Hi Kareem,
Upgrading to CUDA 8 (on Ubuntu 16.04, GCC 5.4) gave better results. Running CUDAK2NN, I was able to get a match count of 1334.


Warming up...
Testing...
CUDA reports no error
CUDAK2NN found 1334 matches in 4.23604 ms
Throughput: 23.607 billion comparisons/second.


But I still fail to see matches with the custom code combining CLATCH + CUDAK2NN.
As commented, for this trial images 1 and 2 are the same.


Detecting 1...
Warming up 1...
Testing 1...
CLATCH 1 took 1.26648 us per desc over 9984 descs.
Checksum 1: a2dc73dbca6075a3

Detecting 2...
Warming up 2...
Testing 2...
CLATCH 2 took 0.85819 us per desc over 9984 descs.
Checksum 2: d6d3d0936cab2618

Warming up...
Testing...
CUDAK2NN found 0 matches in 4.2319 ms
Throughput: 23.6301 billion comparisons/second.


Below is the code used.


int main()
{
constexpr int runs = 500;
constexpr int warmups = 100;
constexpr int numkps = 10000;
constexpr char name1[] = "1.png";
constexpr char name2[] = "2.png";
float reSizeFactor = 0.7;
constexpr int size = 10000;
constexpr int threshold = 5;
clock_t t;

cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

cudaEvent_t latchFinishedEvent;
cudaEventCreate(&latchFinishedEvent);

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

/************************* image loading **************************/
cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
if (!image1.data) {
    std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
    return EXIT_FAILURE;
}

cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
if (!image2.data) {
    std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
    return EXIT_FAILURE;
}
/************************* image loading **************************/


/************************* For image 1  **************************/
std::cout << std::endl << "Detecting 1..." << std::endl;
//cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
std::vector<cv::KeyPoint> keypoints1;
orb1->detect(image1, keypoints1);
keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

std::vector<KeyPoint> kps1;
for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

uint64_t* d_desc1;
cudaMalloc(&d_desc1, 64 * kps1.size());

KeyPoint* d_kps1;
cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

// allocating and transferring triplets and binding to texture object
uint32_t* d_triplets1;
cudaMalloc(&d_triplets1, 2048 * sizeof(uint16_t));
cudaMemcpy(d_triplets1, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
cudaChannelFormatDesc chandesc_trip1 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
cudaArray* d_trip_arr1;
cudaMallocArray(&d_trip_arr1, &chandesc_trip1, 512);
cudaMemcpyToArray(d_trip_arr1, 0, 0, d_triplets1, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_trip1;
memset(&resdesc_trip1, 0, sizeof(resdesc_trip1));
resdesc_trip1.resType = cudaResourceTypeArray;
resdesc_trip1.res.array.array = d_trip_arr1;
struct cudaTextureDesc texdesc_trip1;
memset(&texdesc_trip1, 0, sizeof(texdesc_trip1));
texdesc_trip1.addressMode[0] = cudaAddressModeClamp;
texdesc_trip1.filterMode = cudaFilterModePoint;
texdesc_trip1.readMode = cudaReadModeElementType;
texdesc_trip1.normalizedCoords = 0;
cudaTextureObject_t d_trip_tex1 = 0;
cudaCreateTextureObject(&d_trip_tex1, &resdesc_trip1, &texdesc_trip1, nullptr);

// allocating and transferring image and binding to texture object
cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* d_img_arr1;
cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_img1;
memset(&resdesc_img1, 0, sizeof(resdesc_img1));
resdesc_img1.resType = cudaResourceTypeArray;
resdesc_img1.res.array.array = d_img_arr1;
struct cudaTextureDesc texdesc_img1;
memset(&texdesc_img1, 0, sizeof(texdesc_img1));
texdesc_img1.addressMode[0] = cudaAddressModeClamp;
texdesc_img1.addressMode[1] = cudaAddressModeClamp;
texdesc_img1.filterMode = cudaFilterModePoint;
texdesc_img1.readMode = cudaReadModeElementType;
texdesc_img1.normalizedCoords = 0;
cudaTextureObject_t d_img_tex1 = 0;
cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

std::cout << "Warming up 1..." << std::endl;
for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
std::cout << "Testing 1..." << std::endl;
high_resolution_clock::time_point start1 = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
high_resolution_clock::time_point end1 = high_resolution_clock::now();
std::cout << std::endl << "CLATCH 1 took " << static_cast<double>((end1 - start1).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps1.size())) << " us per desc over " << kps1.size() << " desc" << (kps1.size() == 1 ? "." : "s.") << std::endl << std::endl;
uint64_t* h_GPUdesc1 = new uint64_t[8 * kps1.size()];
cudaMemcpy(h_GPUdesc1, d_desc1, 64 * kps1.size(), cudaMemcpyDeviceToHost);
//std::cout << "CUDA reports 1 " << cudaGetErrorString(cudaGetLastError()) << std::endl;

long long total1 = 0;
for (size_t i = 0; i < 8 * kps1.size(); ++i) total1 += h_GPUdesc1[i];
std::cout << "Checksum 1: " << std::hex << total1 << std::endl << std::endl;
std::cout << std::dec;
/************************* For image 1  **************************/


/************************* For image 2  **************************/
std::cout << std::endl << "Detecting 2..." << std::endl;

//cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
std::vector<cv::KeyPoint> keypoints2;
orb2->detect(image2, keypoints2);
keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

std::vector<KeyPoint> kps2;
for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

uint64_t* d_desc2;
cudaMalloc(&d_desc2, 64 * kps2.size());

KeyPoint* d_kps2;
cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

// allocating and transferring triplets and binding to texture object
uint32_t* d_triplets2;
cudaMalloc(&d_triplets2, 2048 * sizeof(uint16_t));
cudaMemcpy(d_triplets2, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
cudaChannelFormatDesc chandesc_trip2 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
cudaArray* d_trip_arr2;
cudaMallocArray(&d_trip_arr2, &chandesc_trip2, 512);
cudaMemcpyToArray(d_trip_arr2, 0, 0, d_triplets2, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_trip2;
memset(&resdesc_trip2, 0, sizeof(resdesc_trip2));
resdesc_trip2.resType = cudaResourceTypeArray;
resdesc_trip2.res.array.array = d_trip_arr2;
struct cudaTextureDesc texdesc_trip2;
memset(&texdesc_trip2, 0, sizeof(texdesc_trip2));
texdesc_trip2.addressMode[0] = cudaAddressModeClamp;
texdesc_trip2.filterMode = cudaFilterModePoint;
texdesc_trip2.readMode = cudaReadModeElementType;
texdesc_trip2.normalizedCoords = 0;
cudaTextureObject_t d_trip_tex2 = 0;
cudaCreateTextureObject(&d_trip_tex2, &resdesc_trip2, &texdesc_trip2, nullptr);

// allocating and transferring image and binding to texture object
cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* d_img_arr2;
cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_img2;
memset(&resdesc_img2, 0, sizeof(resdesc_img2));
resdesc_img2.resType = cudaResourceTypeArray;
resdesc_img2.res.array.array = d_img_arr2;
struct cudaTextureDesc texdesc_img2;
memset(&texdesc_img2, 0, sizeof(texdesc_img2));
texdesc_img2.addressMode[0] = cudaAddressModeClamp;
texdesc_img2.addressMode[1] = cudaAddressModeClamp;
texdesc_img2.filterMode = cudaFilterModePoint;
texdesc_img2.readMode = cudaReadModeElementType;
texdesc_img2.normalizedCoords = 0;
cudaTextureObject_t d_img_tex2 = 0;
cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

std::cout << "Warming up 2..." << std::endl;
for (int i = 0; i < warmups; ++i) CLATCH(d_trip_tex1, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
std::cout << "Testing 2..." << std::endl;
high_resolution_clock::time_point start2 = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CLATCH(d_trip_tex1, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
high_resolution_clock::time_point end2 = high_resolution_clock::now();
std::cout << std::endl << "CLATCH 2 took " << static_cast<double>((end2 - start2).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps2.size())) << " us per desc over " << kps2.size() << " desc" << (kps2.size() == 1 ? "." : "s.") << std::endl << std::endl;
uint64_t* h_GPUdesc2 = new uint64_t[8 * kps2.size()];
cudaMemcpy(h_GPUdesc2, d_desc2, 64 * kps2.size(), cudaMemcpyDeviceToHost);
//std::cout << "CUDA reports 2" << cudaGetErrorString(cudaGetLastError()) << std::endl;

long long total2 = 0;
for (size_t i = 0; i < 8 * kps2.size(); ++i) total2 += h_GPUdesc2[i];
std::cout << "Checksum 2: " << std::hex << total2 << std::endl << std::endl;
std::cout << std::dec;
/************************* For image 2  **************************/

/************************* Matching **************************/
int* d_matches;
cudaMalloc(&d_matches, 4 * size);

std::cout << std::endl << "Warming up..." << std::endl;
for (int i = 0; i < warmups; ++i) CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);
std::cout << "Testing..." << std::endl;
high_resolution_clock::time_point start = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);
high_resolution_clock::time_point end = high_resolution_clock::now();

int* h_matches = reinterpret_cast<int*>(malloc(4 * size));
cudaMemcpy(h_matches, d_matches, 4 * size, cudaMemcpyDeviceToHost);
cudaDeviceReset();

vector<cv::DMatch> matches;
std::vector<Match> hostMatches;
for (int i = 0; i < size; ++i)
{
    if (h_matches[i] != -1)
    {
        hostMatches.emplace_back(i, h_matches[i]);
        matches.push_back(cv::DMatch(i, h_matches[i], 0));
    }
}

double sec = static_cast<double>(duration_cast<nanoseconds>(end - start).count()) * 1e-9 / static_cast<double>(runs);
std::cout << "CUDAK2NN found " << hostMatches.size() << " matches in " << sec * 1e3 << " ms" << std::endl;
std::cout << "Throughput: " << static_cast<double>(size)*static_cast<double>(size) / sec * 1e-9 << " billion comparisons/second." << std::endl << std::endl;
/************************* Matching **************************/

}


My question: in CLATCH, each descriptor is stored as 8 uint64_t words per keypoint, which works out to 512 bits per keypoint.

In CUDAK2NN (main.cpp), "size" is taken as 10000, so the total number of descriptor words considered is 64000, which would limit the comparison to 8000 keypoints.

I believe I should be passing in the right size for the comparison.
CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);

Please correct me if I'm wrong, and kindly review to see whether there is any major mistake.
Thanks for the help.

@komrad36

komrad36 commented Nov 9, 2016

Hi,

That's great news that CUDA 8 has fixed the CUDAK2NN problem!

To start analyzing the custom code, can you please uncomment the "CUDA reports..." output lines? Those are there to show any CUDA errors that might have accumulated and are the best place to start.

Thanks,
Kareem

@buzzsuresh

buzzsuresh commented Nov 9, 2016

Hi,

No CUDA errors were reported.


Detecting 1...
Warming up 1...
Testing 1...

CLATCH 1 took 1.23391 us per desc over 9984 descs.

CUDA reports 1 no error
Checksum 1: a2dc73dbca6075a3

Detecting 2...
Warming up 2...
Testing 2...

CLATCH 2 took 0.846509 us per desc over 9984 descs.

CUDA reports 2no error
Checksum 2: d6d3d0936cab2618

Warming up...
Testing...
CUDAK2NN found 0 matches in 210.068 ms
Throughput: 0.476036 billion comparisons/second.


Also, with the size I passed, 8 * kps1.size(), the CUDAK2NN comparison took a very long time to return from the function. Please advise what size I should use for the comparison.

Do you see any results when running the code given in my previous post? Please let me know.

Thanks.

@komrad36

komrad36 commented Nov 9, 2016

Working on it.

@komrad36

komrad36 commented Nov 9, 2016

Okay, I found a lot of problems in the code, so I wrote my own to illustrate the combination of CLATCH and CUDAK2NN. It also plots its match results, as seen in the image below. The code will follow shortly. Note that this is just a quick demo; a proper implementation would use a template to complete all the steps automatically for frame pairs rather than duplicating everything as I've done here.

[image: CLATCH + CUDAK2NN match results]

@buzzsuresh

Thanks for staying with me and helping throughout.
I'll wait for your sample.

@komrad36

komrad36 commented Nov 9, 2016

/*******************************************************************
*   main.cpp
*   Pipeline Test
*
*   Author: Kareem Omar
*   kareem.omar@uah.edu
*   https://github.com/komrad36
*
*   Last updated Nov 8, 2016
*******************************************************************/

#include <chrono>
#include <iostream>
#include <iomanip>
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

#include "CLATCH.h"
#include "CUDAK2NN.h"

#define VC_EXTRALEAN
#define WIN32_LEAN_AND_MEAN

using namespace std::chrono;

struct Match {
    int q, t;
    Match() {}
    Match(const int _q, const int _t) : q(_q), t(_t) {}
};

int main() {
    // ------------- Configuration ------------
    constexpr int numkps = 5000;
    constexpr char name1[] = "1.jpg";
    constexpr char name2[] = "2.jpg";
    constexpr int threshold = 50;
    constexpr bool display_image = true;
    // --------------------------------


    // ------------- Image Read ------------
    cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image1.data) {
        std::cerr << "ERROR: failed to open image 1. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image2.data) {
        std::cerr << "ERROR: failed to open image 2. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    // --------------------------------


    // ------------- Detection ------------
    std::cout << std::endl << "Detecting..." << std::endl;
    cv::Ptr<cv::ORB> orb = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints1;
    orb->detect(image1, keypoints1);
    keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

    std::vector<cv::KeyPoint> keypoints2;
    orb->detect(image2, keypoints2);
    keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

    // --------------------------------


    // ------------- CLATCH ------------

    // arranging keypoints for PCI transfer
    std::vector<KeyPoint> kps1;
    for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    std::vector<KeyPoint> kps2;
    for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    // setting cache and shared modes
    cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

    // allocating space for descriptors. allocate extra 8 for training set.
    uint64_t* d_desc1;
    cudaMalloc(&d_desc1, 64 * (kps1.size() + 8));

    uint64_t* d_desc2;
    cudaMalloc(&d_desc2, 64 * kps2.size());

    // allocating and transferring keypoints and binding to texture object
    KeyPoint* d_kps1;
    cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    KeyPoint* d_kps2;
    cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets;
    cudaMalloc(&d_triplets, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr;
    cudaMallocArray(&d_trip_arr, &chandesc_trip, 512);
    cudaMemcpyToArray(d_trip_arr, 0, 0, d_triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip;
    memset(&resdesc_trip, 0, sizeof(resdesc_trip));
    resdesc_trip.resType = cudaResourceTypeArray;
    resdesc_trip.res.array.array = d_trip_arr;
    struct cudaTextureDesc texdesc_trip;
    memset(&texdesc_trip, 0, sizeof(texdesc_trip));
    texdesc_trip.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip.filterMode = cudaFilterModePoint;
    texdesc_trip.readMode = cudaReadModeElementType;
    texdesc_trip.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex = 0;
    cudaCreateTextureObject(&d_trip_tex, &resdesc_trip, &texdesc_trip, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr1;
    cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
    cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img1;
    memset(&resdesc_img1, 0, sizeof(resdesc_img1));
    resdesc_img1.resType = cudaResourceTypeArray;
    resdesc_img1.res.array.array = d_img_arr1;
    struct cudaTextureDesc texdesc_img1;
    memset(&texdesc_img1, 0, sizeof(texdesc_img1));
    texdesc_img1.addressMode[0] = cudaAddressModeClamp;
    texdesc_img1.addressMode[1] = cudaAddressModeClamp;
    texdesc_img1.filterMode = cudaFilterModePoint;
    texdesc_img1.readMode = cudaReadModeElementType;
    texdesc_img1.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex1 = 0;
    cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

    cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr2;
    cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
    cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img2;
    memset(&resdesc_img2, 0, sizeof(resdesc_img2));
    resdesc_img2.resType = cudaResourceTypeArray;
    resdesc_img2.res.array.array = d_img_arr2;
    struct cudaTextureDesc texdesc_img2;
    memset(&texdesc_img2, 0, sizeof(texdesc_img2));
    texdesc_img2.addressMode[0] = cudaAddressModeClamp;
    texdesc_img2.addressMode[1] = cudaAddressModeClamp;
    texdesc_img2.filterMode = cudaFilterModePoint;
    texdesc_img2.readMode = cudaReadModeElementType;
    texdesc_img2.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex2 = 0;
    cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

    // CLATCH 1
    CLATCH(d_img_tex1, d_trip_tex, d_kps1, static_cast<int>(kps1.size()), d_desc1);

    // CLATCH 2
    CLATCH(d_img_tex2, d_trip_tex, d_kps2, static_cast<int>(kps2.size()), d_desc2);

    // wrapping query vecs ("2") in tex obj
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_desc2;
    resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.sizeInBytes = 64 * kps2.size();
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    cudaTextureObject_t tex_q = 0;
    cudaCreateTextureObject(&tex_q, &resDesc, &texDesc, nullptr);

    // allocating space for match results
    int* d_matches;
    cudaMalloc(&d_matches, 4 * kps2.size());

    CUDAK2NN(d_desc1, static_cast<int>(kps1.size()), tex_q, static_cast<int>(kps2.size()), d_matches, threshold);

    // transferring matches back to host
    int* h_matches = reinterpret_cast<int*>(malloc(4 * kps2.size()));
    cudaMemcpy(h_matches, d_matches, 4 * kps2.size(), cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    std::cout << "CUDA reports " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    std::vector<Match> matches;
    std::vector<cv::DMatch> dmatches;
    for (int i = 0; i < kps2.size(); ++i) {
        if (h_matches[i] != -1) {
            matches.emplace_back(i, h_matches[i]);
            dmatches.emplace_back(h_matches[i], i, 0.0f);
        }
    }

    std::cout << "CUDAK2NN found " << matches.size() << " matches." << std::endl;

    // ------------- Output ------------
    if (display_image) {
        cv::Mat image_with_matches;
        cv::drawMatches(image1, keypoints1, image2, keypoints2, dmatches, image_with_matches, cv::Scalar::all(-1.0), cv::Scalar::all(-1.0), std::vector<char>(), cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);
        cv::namedWindow("Matches", CV_WINDOW_NORMAL | CV_WINDOW_KEEPRATIO);
        cv::imshow("Matches", image_with_matches);
        cv::waitKey(0);
    }
    // --------------------------------

}

@buzzsuresh

Thanks Kareem, Will review and change as recommended wherever required and update you with the results.

@komrad36

komrad36 commented Nov 9, 2016

No problem! Let me know if I can explain anything I do in the code sample. Also, check it out, rotation invariance is working perfectly:

[image: matches under 180-degree rotation]

Only 4% fewer matches even with 180 degree rotation.

@buzzsuresh

Thanks a lot, it works nicely. This recent sample is simpler and more to the point, without the warm-up and testing loops.


Detecting...
Image 1 - ORB Detect Time : 0.32 ms.
Image 2 - ORB Detect Time : 0.01 ms.
Image 1 - CUDA LATCH : 0.06 ms.
Image 2 - CUDA LATCH : 0 ms.
LATCH Comparison : 0.02 ms.
CUDAK2NN found 20 matches.


[image: CLATCH match comparison]

I have yet to verify rotation invariance. Will do that too.

Thanks.

@komrad36

komrad36 commented Nov 9, 2016

Nice example! Glad it's working for you!! By the way, OpenCV ORB is insanely slow; I'm working on KORB but I'm pretty busy right now so keep an eye out for it in the next few weeks :)

@buzzsuresh

Sure, will keep following. Best wishes for all your work.

@komrad36

Hi,

Just wanted to mention that, after lots of difficulties and delays, my detector-descriptor is finally ready, or at least ready enough to make public! I still have a lot of work and tweaking to do on it, but it's now available, so you can get an idea and hopefully benefit from the increased accuracy and speed that come from the way it shares scale spaces, giving superior rotation- and scale-invariance without any loss of performance. It's called KORAL and is available here: https://github.com/komrad36/KORAL.

Thanks!
-Kareem

@buzzsuresh

Nice work, Kareem. I'm trying it out to see whether I can use KORAL for my work.

I have a question related to the CUDAK2NN usage we discussed earlier. For the image-to-image comparison discussed above, the results came out well.

For a set of training images, I ran CLATCH on each and gathered all of the descriptors into one single big array (e.g. [134400 x 1]). This array holds the CLATCH data of 2100 keypoints (2100 * 64 = 134400 bytes).
I did this to enrich my training dataset so that I could use it while matching.

Using CUDAK2NN, I ran the comparison against the frame image to identify the training pattern. The matching does not seem accurate, and the number of matches returned is high even for frames that do not contain the training pattern.

My match function is as follows (frameCLATCHDesc: the frame's CLATCH descriptors; trainingDesc: the training CLATCH descriptors discussed above):


std::vector<cv::DMatch> CLATCHMatch(uint64_t* frameCLATCHDesc, std::vector<KeyPoint> framekps, cv::Mat trainingDesc)
{
    int numOfTrainingKPs = trainingDesc.cols / 64;
    //cout << "\tCLATCH Training Desc Keypoints : " << numOfTrainingKPs << ", Rows : " << trainingDesc.rows << ", Cols : " << trainingDesc.cols << endl;
    uint64_t* d_trainingLATCHDesc;
    cudaMalloc(&d_trainingLATCHDesc, trainingDesc.cols);
    cudaMemcpy(d_trainingLATCHDesc, trainingDesc.data, trainingDesc.cols, cudaMemcpyHostToDevice);

    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = frameCLATCHDesc;
    resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.sizeInBytes = 64 * framekps.size();
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    cudaTextureObject_t frame_tex_q = 0;
    cudaCreateTextureObject(&frame_tex_q, &resDesc, &texDesc, nullptr);

    int* d_matches;
    cudaMalloc(&d_matches, 4 * numOfTrainingKPs);
    CUDAK2NN(d_trainingLATCHDesc, numOfTrainingKPs, frame_tex_q, static_cast<int>(framekps.size()), d_matches, clatchThres);

    int* h_matches = reinterpret_cast<int*>(malloc(4 * numOfTrainingKPs));
    cudaMemcpy(h_matches, d_matches, 4 * numOfTrainingKPs, cudaMemcpyDeviceToHost);

    std::vector<cv::DMatch> matches;
    for (int i = 0; i < numOfTrainingKPs; ++i)
        if (h_matches[i] != -1)
            matches.emplace_back(h_matches[i], i, 0.0f);

    cudaFree(d_matches);
    return matches;
}


Kindly suggest any corrections I need to make to get accurate matches. Looking forward to your suggestions.

Thanks.
