
Converting uint64_t Host Descriptor to a compatible type for the Binary Comparator to work #1

Closed
buzzsuresh opened this issue Nov 5, 2016 · 24 comments


@buzzsuresh

Hi,

I'm trying to use the CUDA LATCH processing logic described in main.cpp to compare two different images for matching.

The objective is to use brute-force Hamming (any binary comparator) to identify matches between the two host descriptors obtained from images 1 and 2 respectively. For knnMatch to work, it expects the data type to be either CV_8UC1 or CV_32F.

To use the knnMatch matcher for comparison, I made a Mat object explicitly from the uint64_t host descriptor. Converting uint64_t to either 8U or 32F creates data loss, so the KNN matcher fails to identify matches across descriptors.

I would like your suggestion on how to convert the uint64_t host descriptor to 8U/32F without data loss, so that any binary comparator will work.

Kindly advise.


std::cout << "Desc 1 " << endl;
uint64_t** tempDesc1 = new uint64_t*[kps1.size()];
int nMark1 = 0;
for (int j=0; j<kps1.size(); j++)
{
    tempDesc1[j] = new uint64_t[8];  // 8 x uint64_t = 512 bits per keypoint
    for (int k = 0; k < 8; k++, nMark1++)
        tempDesc1[j][k] = h_GPUdesc1[nMark1];
}
cv::Mat matDesc1(kps1.size(), 8, CV_8UC1, tempDesc1);

std::cout << "Desc 2 " << endl;
uint64_t** tempDesc2 = new uint64_t*[kps2.size()];
int nMark2 = 0;
for (int j=0; j<kps2.size(); j++)
{
    tempDesc2[j] = new uint64_t[8];
    for ( int k = 0; k < 8; k++, nMark2++)
        tempDesc2[j][k] = h_GPUdesc2[nMark2];
}
cv::Mat matDesc2(kps2.size(), 8, CV_8UC1, tempDesc2);

vector<cv::DMatch> good_matches;
cv::BFMatcher matcher(cv::NORM_HAMMING);
vector< vector<cv::DMatch> > matches;
matcher.knnMatch(matDesc1, matDesc2, matches, 2);

Thanks.

@komrad36
Owner

komrad36 commented Nov 5, 2016

Hi,

The descriptors output from CLATCH are a contiguous block of 512-bit binary descriptors. You can shuffle the data around to make them work with OpenCV, but the whole point of my utilities is to improve on the very poor performance of OpenCV. My pipeline replaces OpenCV's, so you don't have to worry about trying to get the descriptors into a format suitable for OpenCV's matchers. Just use my K2NN matcher (for CPU matching) or my CUDAK2NN matcher (for GPU matching). They work naturally with the descriptors from CLATCH. As with all my computer vision projects, there's a test harness and demo.

My matchers are available here:
https://github.com/komrad36/K2NN
https://github.com/komrad36/CUDAK2NN
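If OpenCV interop is still wanted despite that, note that no lossy conversion is actually needed: a 512-bit binary descriptor is just 64 bytes, so the contiguous uint64_t buffer can be reinterpreted byte-for-byte as CV_8UC1 rows and matched with NORM_HAMMING. A minimal sketch of the reinterpretation (the `descriptorBytes` helper and the commented cv::Mat line are illustrative, not part of either project):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Each CLATCH descriptor is 512 bits: 8 x uint64_t = 64 bytes, stored
// contiguously for all keypoints. A Hamming matcher only needs the raw
// bits, so reinterpret the same bytes instead of converting values:
//   cv::Mat desc((int)numKps, 64, CV_8UC1, h_GPUdesc);  // zero-copy wrap
std::vector<uint8_t> descriptorBytes(const uint64_t* desc, size_t numKps) {
    std::vector<uint8_t> out(64 * numKps);      // 64 bytes per keypoint
    std::memcpy(out.data(), desc, out.size());  // no data loss: same bits
    return out;
}
```

The wrap is lossless because it never changes a bit, only the element type the matcher sees.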

Please let me know if I can help with anything else!

@komrad36 komrad36 closed this as completed Nov 5, 2016
@buzzsuresh
Author

Much appreciated, thank you.

To call it as described, using "CUDAK2NN(d_tvecs, size, tex_q, size, d_matches, threshold)":

I could pass in the first image's uint64 descriptor pointer for the first parameter and the second image's "cudaTextureObject_t" object in place of the fourth param. Please correct me if I'm wrong.

I just tried what I described above and am getting 0 matches (for this trial, images 1 and 2 are the same).

@komrad36
Owner

komrad36 commented Nov 6, 2016

Yeah, the main.cpp demo in CUDAK2NN contains an example showing the binding of a texture object and such. Furthermore, if you've just used CLATCH, the descriptors are already in GPU memory! So all you have to do is wrap the linear memory in a texture object and call CUDAK2NN.

If you do everything correctly, you should get a very high number of matches if you use the same image for both training and query; might even get 100% matches.
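For reference, a hedged sketch of that wrapping step (variable names follow this thread's conventions; the channel format and the descriptor-count size arguments are assumptions here, so take the exact values from the CUDAK2NN demo's main.cpp):

```cuda
// Hypothetical: wrap the query descriptors, already in GPU linear
// memory after CLATCH, in a texture object for CUDAK2NN.
cudaResourceDesc resdesc_q;
memset(&resdesc_q, 0, sizeof(resdesc_q));
resdesc_q.resType = cudaResourceTypeLinear;
resdesc_q.res.linear.devPtr = d_desc2;               // query descriptors
resdesc_q.res.linear.desc =
    cudaCreateChannelDesc(32, 32, 0, 0, cudaChannelFormatKindUnsigned);
resdesc_q.res.linear.sizeInBytes = 64 * kps2.size(); // 64 B per descriptor

cudaTextureDesc texdesc_q;
memset(&texdesc_q, 0, sizeof(texdesc_q));
texdesc_q.addressMode[0] = cudaAddressModeClamp;
texdesc_q.filterMode = cudaFilterModePoint;
texdesc_q.readMode = cudaReadModeElementType;
texdesc_q.normalizedCoords = 0;

cudaTextureObject_t tex_q = 0;
cudaCreateTextureObject(&tex_q, &resdesc_q, &texdesc_q, nullptr);

// Then match descriptors against descriptors, not the image texture:
// CUDAK2NN(d_desc1, (int)kps1.size(), tex_q, (int)kps2.size(),
//          d_matches, threshold);
```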

@buzzsuresh
Author

buzzsuresh commented Nov 7, 2016

Thanks for the suggestion.
To keep it simple, I used an independent set of variables for image 1 and image 2 respectively.
Code is below:


int main() {
    constexpr int runs = 500;
    constexpr int warmups = 100;
    constexpr int numkps = 10000;
    constexpr char name1[] = "1.png";
    constexpr char name2[] = "2.png";
    float reSizeFactor = 0.7;
    constexpr int size = 10000;
    constexpr int threshold = 5;
    clock_t t;

    cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

    cudaEvent_t latchFinishedEvent;
    cudaEventCreate(&latchFinishedEvent);

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    /************************* image loading **************************/
    cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image1.data) {
        std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
        return EXIT_FAILURE;
    }

    cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image2.data) {
        std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    /************************* image loading **************************/


    /************************* For image 1  **************************/
    std::cout << std::endl << "Detecting 1..." << std::endl;
    //cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
    cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints1;
    //double t = (double)getTickCount();
    orb1->detect(image1, keypoints1);
    //t = ((double)getTickCount() - t)/getTickFrequency();
    //cout << "ORB Detecting [ms] : " << (t/1.0)*1000 << endl;
    keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

    std::vector<KeyPoint> kps1;
    for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    uint64_t* d_desc1;
    cudaMalloc(&d_desc1, 64 * kps1.size());

    KeyPoint* d_kps1;
    cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets1;
    cudaMalloc(&d_triplets1, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets1, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip1 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr1;
    cudaMallocArray(&d_trip_arr1, &chandesc_trip1, 512);
    cudaMemcpyToArray(d_trip_arr1, 0, 0, d_triplets1, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip1;
    memset(&resdesc_trip1, 0, sizeof(resdesc_trip1));
    resdesc_trip1.resType = cudaResourceTypeArray;
    resdesc_trip1.res.array.array = d_trip_arr1;
    struct cudaTextureDesc texdesc_trip1;
    memset(&texdesc_trip1, 0, sizeof(texdesc_trip1));
    texdesc_trip1.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip1.filterMode = cudaFilterModePoint;
    texdesc_trip1.readMode = cudaReadModeElementType;
    texdesc_trip1.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex1 = 0;
    cudaCreateTextureObject(&d_trip_tex1, &resdesc_trip1, &texdesc_trip1, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr1;
    cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
    cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img1;
    memset(&resdesc_img1, 0, sizeof(resdesc_img1));
    resdesc_img1.resType = cudaResourceTypeArray;
    resdesc_img1.res.array.array = d_img_arr1;
    struct cudaTextureDesc texdesc_img1;
    memset(&texdesc_img1, 0, sizeof(texdesc_img1));
    texdesc_img1.addressMode[0] = cudaAddressModeClamp;
    texdesc_img1.addressMode[1] = cudaAddressModeClamp;
    texdesc_img1.filterMode = cudaFilterModePoint;
    texdesc_img1.readMode = cudaReadModeElementType;
    texdesc_img1.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex1 = 0;
    cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

    std::cout << "Warming up 1..." << std::endl;
    for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
    std::cout << "Testing 1..." << std::endl;
    high_resolution_clock::time_point start1 = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
    high_resolution_clock::time_point end1 = high_resolution_clock::now();
    std::cout << std::endl << "CLATCH 1 took " << static_cast<double>((end1 - start1).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps1.size())) << " us per desc over " << kps1.size() << " desc" << (kps1.size() == 1 ? "." : "s.") << std::endl << std::endl;
    uint64_t* h_GPUdesc1 = new uint64_t[8 * kps1.size()];
    cudaMemcpy(h_GPUdesc1, d_desc1, 64 * kps1.size(), cudaMemcpyDeviceToHost);
    //std::cout << "CUDA reports 1 " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    long long total1 = 0;
    for (size_t i = 0; i < 8 * kps1.size(); ++i) total1 += h_GPUdesc1[i];
    //std::cout << "Checksum 1: " << std::hex << total1 << std::endl << std::endl;
    std::cout << std::dec;
    /************************* For image 1  **************************/


    /************************* For image 2  **************************/
    std::cout << std::endl << "Detecting 2..." << std::endl;

    //cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
    cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints2;
    orb2->detect(image2, keypoints2);
    keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

    std::vector<KeyPoint> kps2;
    for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    uint64_t* d_desc2;
    cudaMalloc(&d_desc2, 64 * kps2.size());

    KeyPoint* d_kps2;
    cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets2;
    cudaMalloc(&d_triplets2, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets2, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip2 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr2;
    cudaMallocArray(&d_trip_arr2, &chandesc_trip2, 512);
    cudaMemcpyToArray(d_trip_arr2, 0, 0, d_triplets2, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip2;
    memset(&resdesc_trip2, 0, sizeof(resdesc_trip2));
    resdesc_trip2.resType = cudaResourceTypeArray;
    resdesc_trip2.res.array.array = d_trip_arr2;
    struct cudaTextureDesc texdesc_trip2;
    memset(&texdesc_trip2, 0, sizeof(texdesc_trip2));
    texdesc_trip2.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip2.filterMode = cudaFilterModePoint;
    texdesc_trip2.readMode = cudaReadModeElementType;
    texdesc_trip2.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex2 = 0;
    cudaCreateTextureObject(&d_trip_tex2, &resdesc_trip2, &texdesc_trip2, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr2;
    cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
    cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img2;
    memset(&resdesc_img2, 0, sizeof(resdesc_img2));
    resdesc_img2.resType = cudaResourceTypeArray;
    resdesc_img2.res.array.array = d_img_arr2;
    struct cudaTextureDesc texdesc_img2;
    memset(&texdesc_img2, 0, sizeof(texdesc_img2));
    texdesc_img2.addressMode[0] = cudaAddressModeClamp;
    texdesc_img2.addressMode[1] = cudaAddressModeClamp;
    texdesc_img2.filterMode = cudaFilterModePoint;
    texdesc_img2.readMode = cudaReadModeElementType;
    texdesc_img2.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex2 = 0;
    cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

    std::cout << "Warming up 2..." << std::endl;
    for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex2, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
    std::cout << "Testing 2..." << std::endl;
    high_resolution_clock::time_point start2 = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CLATCH(d_img_tex2, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
    high_resolution_clock::time_point end2 = high_resolution_clock::now();
    std::cout << std::endl << "CLATCH 2 took " << static_cast<double>((end2 - start2).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps2.size())) << " us per desc over " << kps2.size() << " desc" << (kps2.size() == 1 ? "." : "s.") << std::endl << std::endl;
    uint64_t* h_GPUdesc2 = new uint64_t[8 * kps2.size()];
    cudaMemcpy(h_GPUdesc2, d_desc2, 64 * kps2.size(), cudaMemcpyDeviceToHost);
    //std::cout << "CUDA reports 2" << cudaGetErrorString(cudaGetLastError()) << std::endl;

    long long total2 = 0;
    for (size_t i = 0; i < 8 * kps2.size(); ++i) total2 += h_GPUdesc2[i];
    //std::cout << "Checksum 2: " << std::hex << total2 << std::endl << std::endl;
    std::cout << std::dec;
    /************************* For image 2  **************************/

    /************************* Matching **************************/
    int* d_matches;
    cudaMalloc(&d_matches, 4 * size);

    std::cout << std::endl << "Warming up..." << std::endl;
    for (int i = 0; i < warmups; ++i) CUDAK2NN(d_desc1, size, d_img_tex2, size, d_matches, threshold);
    std::cout << "Testing..." << std::endl;
    high_resolution_clock::time_point start = high_resolution_clock::now();
    for (int i = 0; i < runs; ++i) CUDAK2NN(d_desc1, size, d_img_tex2, size, d_matches, threshold);
    high_resolution_clock::time_point end = high_resolution_clock::now();

    int* h_matches = reinterpret_cast<int*>(malloc(4 * size));
    cudaMemcpy(h_matches, d_matches, 4 * size, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    std::vector<Match> hostMatches;
    for (int i = 0; i < size; ++i) {
        if (h_matches[i] != -1) hostMatches.emplace_back(i, h_matches[i]);
    }

    double sec = static_cast<double>(duration_cast<nanoseconds>(end - start).count()) * 1e-9 / static_cast<double>(runs);
    std::cout << "CUDAK2NN found " << hostMatches.size() << " matches in " << sec * 1e3 << " ms" << std::endl;
    std::cout << "Throughput: " << static_cast<double>(size)*static_cast<double>(size) / sec * 1e-9 << " billion comparisons/second." << std::endl << std::endl;
    /************************* Matching **************************/

}


**Results:**
Detecting 1...
Warming up 1...
Testing 1...
CLATCH 1 took 1.1825 us per desc over 9984 descs.

Detecting 2...
Warming up 2...
Testing 2...
CLATCH 2 took 1.18514 us per desc over 9984 descs.

Warming up...
Testing...
CUDAK2NN found 0 matches in 4.37826 ms
Throughput: 22.8401 billion comparisons/second.


Please help me see if I've missed anything fundamental that is resulting in 0 matches.
Thanks a lot.

@buzzsuresh
Author

To add: I just ran your "CUDAK2NN" project as-is. The same result here too - 0 matches.

In the Makefile I just changed "-std=gnu++17" to "-std=c++0x". Does this impact the result by any chance? When I leave it as "-std=gnu++17", the Makefile doesn't compile for me.

Please advise.

@komrad36
Owner

komrad36 commented Nov 7, 2016

Hi,

I see a number of problems in the code involving allocations and CUDA
transfers and such. First, though, let's address the fact that you get 0
matches even with the default CUDAK2NN project, as that shouldn't happen. I
am aware of a particular combination of CUDA driver, CUDA version, and GPU
that actually has a nasty synchronization bug that causes things like that
to happen on kernels that push the envelope as much as mine does. There's
nothing I can do about it, it's an NVIDIA bug. We can insert a manual
synchronization that is not necessary - this fixes the problem but causes a
large slowdown. Updating drivers to the latest and CUDA version to 8.0 is a
good start.
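(For concreteness, a manual synchronization of the kind I mean would look roughly like the following; this is a diagnostic sketch, not a recommended permanent fix:)

```cuda
// Hypothetical diagnostic: serialize after the kernel launch and
// surface any pending CUDA error. Slow, but isolates async bugs.
CUDAK2NN(d_tvecs, size, tex_q, size, d_matches, threshold);
cudaError_t err = cudaDeviceSynchronize();  // manual synchronization
if (err != cudaSuccess)
    std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
```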

Are you running on a 970m, 980m, 960, or 970, by any chance? I've seen a
970m have that problem.

I'd recommend either -std=gnu++17 or -std=gnu++14. Use a recent g++
compiler that can support one of those. (But that's not related to the
problem here, just a general recommendation for my computer vision stuff.)

Thanks,
Kareem


@buzzsuresh
Author

Yes, as commented, the GPU config is a "GeForce GTX 970", 4095 MB, sm_52, driver/runtime version 7.50/7.50.

Let me go through them one by one and try my best to fix them. Thanks for the review and comments.

@komrad36
Owner

komrad36 commented Nov 7, 2016

Ah, sorry, I missed your GPU config. Thanks. Okay, I think the 9xx series
has this strange problem. I do not have very much information on it, since
it's either a driver or hardware bug, and I also do not have a 9xx for
testing. Please try upgrading to CUDA 8 if possible. I do not know whether
that will fix it, but it's the first thing to try, and it will be useful
to know if CUDA 8 fixes it. If not, we can try manual synchronization.

For now it's probably best to just test this with the canonical CUDAK2NN
sources until we address this problem, then do the custom code later.

Thanks,
Kareem


@buzzsuresh
Author

Thanks, Kareem.

I'm getting CUDA 8 set up on the same machine (970), running Ubuntu 16.04.
I'll see whether that fixes the issue and will get back with the result in a short while.

If I understood correctly, you are referring to the device-to-host synchronization that is suspected to be failing.

Can you please shed some light on how to perform the manual synchronization, so that I can have that block of code ready in the meantime?

Thanks for your continued support.

@komrad36
Owner

komrad36 commented Nov 8, 2016

No problem! Please let me know whether CUDA 8 and the latest drivers fix it. That would be nice...

Meanwhile, I'm actually talking about a CUDA __syncthreads(), but when I looked back through my notes, that's actually in CLATCH, not in CUDAK2NN, which does not use shared memory. There is something else going on causing it to find 0 matches. Let's see if we can figure out what it is. Can you try deleting the CUDAK2NN sources and downloading a fresh copy from the latest branch? If that doesn't help, we'll start adding debug output to try to see why it's failing. I've never actually seen any problems with CUDAK2NN on any card before.

Thanks!

@buzzsuresh
Author

buzzsuresh commented Nov 8, 2016

Hi Kareem,
Upgrading to CUDA 8 (on Ubuntu 16.04, GCC 5.4) gave better results. Running CUDAK2NN, I was able to get a match count of 1334.


Warming up...
Testing...
CUDA reports no error
CUDAK2NN found 1334 matches in 4.23604 ms
Throughput: 23.607 billion comparisons/second.


But I still fail to see matches with the custom code combining CLATCH + CUDAK2NN.
As commented, for this trial images 1 and 2 are the same.


Detecting 1...
Warming up 1...
Testing 1...
CLATCH 1 took 1.26648 us per desc over 9984 descs.
Checksum 1: a2dc73dbca6075a3

Detecting 2...
Warming up 2...
Testing 2...
CLATCH 2 took 0.85819 us per desc over 9984 descs.
Checksum 2: d6d3d0936cab2618

Warming up...
Testing...
CUDAK2NN found 0 matches in 4.2319 ms
Throughput: 23.6301 billion comparisons/second.


Below is the code used.


int main()
{
constexpr int runs = 500;
constexpr int warmups = 100;
constexpr int numkps = 10000;
constexpr char name1[] = "1.png";
constexpr char name2[] = "2.png";
float reSizeFactor = 0.7;
constexpr int size = 10000;
constexpr int threshold = 5;
clock_t t;

cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

cudaEvent_t latchFinishedEvent;
cudaEventCreate(&latchFinishedEvent);

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

/************************* image loading **************************/
cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
if (!image1.data) {
    std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
    return EXIT_FAILURE;
}

cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
if (!image2.data) {
    std::cerr << "ERROR: failed to open image. Aborting." << std::endl;
    return EXIT_FAILURE;
}
/************************* image loading **************************/


/************************* For image 1  **************************/
std::cout << std::endl << "Detecting 1..." << std::endl;
//cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
cv::Ptr<cv::ORB> orb1 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
std::vector<cv::KeyPoint> keypoints1;
orb1->detect(image1, keypoints1);
keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

std::vector<KeyPoint> kps1;
for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

uint64_t* d_desc1;
cudaMalloc(&d_desc1, 64 * kps1.size());

KeyPoint* d_kps1;
cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

// allocating and transferring triplets and binding to texture object
uint32_t* d_triplets1;
cudaMalloc(&d_triplets1, 2048 * sizeof(uint16_t));
cudaMemcpy(d_triplets1, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
cudaChannelFormatDesc chandesc_trip1 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
cudaArray* d_trip_arr1;
cudaMallocArray(&d_trip_arr1, &chandesc_trip1, 512);
cudaMemcpyToArray(d_trip_arr1, 0, 0, d_triplets1, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_trip1;
memset(&resdesc_trip1, 0, sizeof(resdesc_trip1));
resdesc_trip1.resType = cudaResourceTypeArray;
resdesc_trip1.res.array.array = d_trip_arr1;
struct cudaTextureDesc texdesc_trip1;
memset(&texdesc_trip1, 0, sizeof(texdesc_trip1));
texdesc_trip1.addressMode[0] = cudaAddressModeClamp;
texdesc_trip1.filterMode = cudaFilterModePoint;
texdesc_trip1.readMode = cudaReadModeElementType;
texdesc_trip1.normalizedCoords = 0;
cudaTextureObject_t d_trip_tex1 = 0;
cudaCreateTextureObject(&d_trip_tex1, &resdesc_trip1, &texdesc_trip1, nullptr);

// allocating and transferring image and binding to texture object
cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* d_img_arr1;
cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_img1;
memset(&resdesc_img1, 0, sizeof(resdesc_img1));
resdesc_img1.resType = cudaResourceTypeArray;
resdesc_img1.res.array.array = d_img_arr1;
struct cudaTextureDesc texdesc_img1;
memset(&texdesc_img1, 0, sizeof(texdesc_img1));
texdesc_img1.addressMode[0] = cudaAddressModeClamp;
texdesc_img1.addressMode[1] = cudaAddressModeClamp;
texdesc_img1.filterMode = cudaFilterModePoint;
texdesc_img1.readMode = cudaReadModeElementType;
texdesc_img1.normalizedCoords = 0;
cudaTextureObject_t d_img_tex1 = 0;
cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

std::cout << "Warming up 1..." << std::endl;
for (int i = 0; i < warmups; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
std::cout << "Testing 1..." << std::endl;
high_resolution_clock::time_point start1 = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CLATCH(d_img_tex1, d_trip_tex1, d_kps1, static_cast<int>(kps1.size()), d_desc1);
high_resolution_clock::time_point end1 = high_resolution_clock::now();
std::cout << std::endl << "CLATCH 1 took " << static_cast<double>((end1 - start1).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps1.size())) << " us per desc over " << kps1.size() << " desc" << (kps1.size() == 1 ? "." : "s.") << std::endl << std::endl;
uint64_t* h_GPUdesc1 = new uint64_t[8 * kps1.size()];
cudaMemcpy(h_GPUdesc1, d_desc1, 64 * kps1.size(), cudaMemcpyDeviceToHost);
//std::cout << "CUDA reports 1 " << cudaGetErrorString(cudaGetLastError()) << std::endl;

long long total1 = 0;
for (size_t i = 0; i < 8 * kps1.size(); ++i) total1 += h_GPUdesc1[i];
std::cout << "Checksum 1: " << std::hex << total1 << std::endl << std::endl;
std::cout << std::dec;
/************************* For image 1  **************************/


/************************* For image 2  **************************/
std::cout << std::endl << "Detecting 2..." << std::endl;

//cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.20000004768, 8, 31, 0, 3, 0, 31, 31);
cv::Ptr<cv::ORB> orb2 = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
std::vector<cv::KeyPoint> keypoints2;
orb2->detect(image2, keypoints2);
keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

std::vector<KeyPoint> kps2;
for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

uint64_t* d_desc2;
cudaMalloc(&d_desc2, 64 * kps2.size());

KeyPoint* d_kps2;
cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

// allocating and transferring triplets and binding to texture object
uint32_t* d_triplets2;
cudaMalloc(&d_triplets2, 2048 * sizeof(uint16_t));
cudaMemcpy(d_triplets2, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
cudaChannelFormatDesc chandesc_trip2 = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
cudaArray* d_trip_arr2;
cudaMallocArray(&d_trip_arr2, &chandesc_trip2, 512);
cudaMemcpyToArray(d_trip_arr2, 0, 0, d_triplets2, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_trip2;
memset(&resdesc_trip2, 0, sizeof(resdesc_trip2));
resdesc_trip2.resType = cudaResourceTypeArray;
resdesc_trip2.res.array.array = d_trip_arr2;
struct cudaTextureDesc texdesc_trip2;
memset(&texdesc_trip2, 0, sizeof(texdesc_trip2));
texdesc_trip2.addressMode[0] = cudaAddressModeClamp;
texdesc_trip2.filterMode = cudaFilterModePoint;
texdesc_trip2.readMode = cudaReadModeElementType;
texdesc_trip2.normalizedCoords = 0;
cudaTextureObject_t d_trip_tex2 = 0;
cudaCreateTextureObject(&d_trip_tex2, &resdesc_trip2, &texdesc_trip2, nullptr);

// allocating and transferring image and binding to texture object
cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* d_img_arr2;
cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
struct cudaResourceDesc resdesc_img2;
memset(&resdesc_img2, 0, sizeof(resdesc_img2));
resdesc_img2.resType = cudaResourceTypeArray;
resdesc_img2.res.array.array = d_img_arr2;
struct cudaTextureDesc texdesc_img2;
memset(&texdesc_img2, 0, sizeof(texdesc_img2));
texdesc_img2.addressMode[0] = cudaAddressModeClamp;
texdesc_img2.addressMode[1] = cudaAddressModeClamp;
texdesc_img2.filterMode = cudaFilterModePoint;
texdesc_img2.readMode = cudaReadModeElementType;
texdesc_img2.normalizedCoords = 0;
cudaTextureObject_t d_img_tex2 = 0;
cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

std::cout << "Warming up 2..." << std::endl;
for (int i = 0; i < warmups; ++i) CLATCH(d_trip_tex1, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
std::cout << "Testing 2..." << std::endl;
high_resolution_clock::time_point start2 = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CLATCH(d_trip_tex1, d_trip_tex2, d_kps2, static_cast<int>(kps2.size()), d_desc2);
high_resolution_clock::time_point end2 = high_resolution_clock::now();
std::cout << std::endl << "CLATCH 2 took " << static_cast<double>((end2 - start2).count()) * 1e-3 / (static_cast<double>(runs) * static_cast<double>(kps2.size())) << " us per desc over " << kps2.size() << " desc" << (kps2.size() == 1 ? "." : "s.") << std::endl << std::endl;
uint64_t* h_GPUdesc2 = new uint64_t[8 * kps2.size()];
cudaMemcpy(h_GPUdesc2, d_desc2, 64 * kps2.size(), cudaMemcpyDeviceToHost);
//std::cout << "CUDA reports 2" << cudaGetErrorString(cudaGetLastError()) << std::endl;

long long total2 = 0;
for (size_t i = 0; i < 8 * kps2.size(); ++i) total2 += h_GPUdesc2[i];
std::cout << "Checksum 2: " << std::hex << total2 << std::endl << std::endl;
std::cout << std::dec;
/************************* For image 2  **************************/

/************************* Matching **************************/
int* d_matches;
cudaMalloc(&d_matches, 4 * size);

std::cout << std::endl << "Warming up..." << std::endl;
for (int i = 0; i < warmups; ++i) CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);
std::cout << "Testing..." << std::endl;
high_resolution_clock::time_point start = high_resolution_clock::now();
for (int i = 0; i < runs; ++i) CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);
high_resolution_clock::time_point end = high_resolution_clock::now();

int* h_matches = reinterpret_cast<int*>(malloc(4 * size));
cudaMemcpy(h_matches, d_matches, 4 * size, cudaMemcpyDeviceToHost);
cudaDeviceReset();

vector<cv::DMatch> matches;
std::vector<Match> hostMatches;
for (int i = 0; i < size; ++i)
{
    if (h_matches[i] != -1)
    {
        hostMatches.emplace_back(i, h_matches[i]);
        matches.push_back(cv::DMatch(i, h_matches[i], 0));
    }
}

double sec = static_cast<double>(duration_cast<nanoseconds>(end - start).count()) * 1e-9 / static_cast<double>(runs);
std::cout << "CUDAK2NN found " << hostMatches.size() << " matches in " << sec * 1e3 << " ms" << std::endl;
std::cout << "Throughput: " << static_cast<double>(size)*static_cast<double>(size) / sec * 1e-9 << " billion comparisons/second." << std::endl << std::endl;
/************************* Matching **************************/

}


My question: in CLATCH, each descriptor is stored as 8 uint64_t words per keypoint, which works out to 512 bits per keypoint.

In CUDAK2NN (main.cpp), "size" is taken as 10000, so the total number of descriptor words considered is 64000, which would limit the comparison to 8000 keypoints.

I believe I should be passing in the right size for the comparison.
CUDAK2NN(d_desc1, 8 * kps1.size(), d_img_tex2, 8 * kps2.size(), d_matches, threshold);

Please correct me if I'm wrong, and kindly review to see whether there is any major mistake.
Thanks for the help.

@komrad36

komrad36 commented Nov 9, 2016

Hi,

That's great news that CUDA 8 has fixed the CUDAK2NN problem!

To start analyzing the custom code, can you please uncomment the "CUDA reports..." output lines? Those are there to show any CUDA errors that might have accumulated and are the best place to start.

Thanks,
Kareem

@buzzsuresh

buzzsuresh commented Nov 9, 2016

Hi,

No CUDA errors were reported.


Detecting 1...
Warming up 1...
Testing 1...

CLATCH 1 took 1.23391 us per desc over 9984 descs.

CUDA reports 1 no error
Checksum 1: a2dc73dbca6075a3

Detecting 2...
Warming up 2...
Testing 2...

CLATCH 2 took 0.846509 us per desc over 9984 descs.

CUDA reports 2no error
Checksum 2: d6d3d0936cab2618

Warming up...
Testing...
CUDAK2NN found 0 matches in 210.068 ms
Throughput: 0.476036 billion comparisons/second.


Also, with the size I passed, 8 * kps1.size(), the CUDAK2NN comparison took a very long time to return from the function. Please advise what size I should use for the comparison.

Do you see any results when running the code given in my previous post? Please let me know.

Thanks.

@komrad36

komrad36 commented Nov 9, 2016

Working on it.

@komrad36

komrad36 commented Nov 9, 2016

Okay, I found a lot of problems in the code, so I wrote my own to illustrate the combination of CLATCH and CUDAK2NN. It also plots its match results, as seen in the image below. The code will follow shortly. Note that this is just a quick demo; a proper implementation would use a template to complete all the steps automatically for frame pairs rather than duplicating everything as I've done here.

[image: CLATCH + CUDAK2NN match results]

@buzzsuresh

Thanks for staying with me and helping throughout.
I'll wait for your sample.

@komrad36

komrad36 commented Nov 9, 2016

/*******************************************************************
*   main.cpp
*   Pipeline Test
*
*   Author: Kareem Omar
*   kareem.omar@uah.edu
*   https://github.com/komrad36
*
*   Last updated Nov 8, 2016
*******************************************************************/

#include <chrono>
#include <iostream>
#include <iomanip>
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

#include "CLATCH.h"
#include "CUDAK2NN.h"

#define VC_EXTRALEAN
#define WIN32_LEAN_AND_MEAN

using namespace std::chrono;

struct Match {
    int q, t;
    Match() {}
    Match(const int _q, const int _t) : q(_q), t(_t) {}
};

int main() {
    // ------------- Configuration ------------
    constexpr int numkps = 5000;
    constexpr char name1[] = "1.jpg";
    constexpr char name2[] = "2.jpg";
    constexpr int threshold = 50;
    constexpr bool display_image = true;
    // --------------------------------


    // ------------- Image Read ------------
    cv::Mat image1 = cv::imread(name1, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image1.data) {
        std::cerr << "ERROR: failed to open image 1. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    cv::Mat image2 = cv::imread(name2, CV_LOAD_IMAGE_GRAYSCALE);
    if (!image2.data) {
        std::cerr << "ERROR: failed to open image 2. Aborting." << std::endl;
        return EXIT_FAILURE;
    }
    // --------------------------------


    // ------------- Detection ------------
    std::cout << std::endl << "Detecting..." << std::endl;
    cv::Ptr<cv::ORB> orb = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
    std::vector<cv::KeyPoint> keypoints1;
    orb->detect(image1, keypoints1);
    keypoints1.erase(std::remove_if(keypoints1.begin(), keypoints1.end(), [image1](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image1.cols - 36 || kp.pt.y >= image1.rows - 36; }), keypoints1.end());

    std::vector<cv::KeyPoint> keypoints2;
    orb->detect(image2, keypoints2);
    keypoints2.erase(std::remove_if(keypoints2.begin(), keypoints2.end(), [image2](const cv::KeyPoint& kp) {return kp.pt.x <= 36 || kp.pt.y <= 36 || kp.pt.x >= image2.cols - 36 || kp.pt.y >= image2.rows - 36; }), keypoints2.end());

    // --------------------------------


    // ------------- CLATCH ------------

    // arranging keypoints for PCI transfer
    std::vector<KeyPoint> kps1;
    for (const auto& kp : keypoints1) kps1.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    std::vector<KeyPoint> kps2;
    for (const auto& kp : keypoints2) kps2.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);

    // setting cache and shared modes
    cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);

    // allocating space for descriptors. allocate extra 8 for training set.
    uint64_t* d_desc1;
    cudaMalloc(&d_desc1, 64 * (kps1.size() + 8));

    uint64_t* d_desc2;
    cudaMalloc(&d_desc2, 64 * kps2.size());

    // allocating and transferring keypoints and binding to texture object
    KeyPoint* d_kps1;
    cudaMalloc(&d_kps1, kps1.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps1, &kps1[0], kps1.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    KeyPoint* d_kps2;
    cudaMalloc(&d_kps2, kps2.size() * sizeof(KeyPoint));
    cudaMemcpy(d_kps2, &kps2[0], kps2.size() * sizeof(KeyPoint), cudaMemcpyHostToDevice);

    // allocating and transferring triplets and binding to texture object
    uint32_t* d_triplets;
    cudaMalloc(&d_triplets, 2048 * sizeof(uint16_t));
    cudaMemcpy(d_triplets, triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    cudaChannelFormatDesc chandesc_trip = cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindUnsigned);
    cudaArray* d_trip_arr;
    cudaMallocArray(&d_trip_arr, &chandesc_trip, 512);
    cudaMemcpyToArray(d_trip_arr, 0, 0, d_triplets, 2048 * sizeof(uint16_t), cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_trip;
    memset(&resdesc_trip, 0, sizeof(resdesc_trip));
    resdesc_trip.resType = cudaResourceTypeArray;
    resdesc_trip.res.array.array = d_trip_arr;
    struct cudaTextureDesc texdesc_trip;
    memset(&texdesc_trip, 0, sizeof(texdesc_trip));
    texdesc_trip.addressMode[0] = cudaAddressModeClamp;
    texdesc_trip.filterMode = cudaFilterModePoint;
    texdesc_trip.readMode = cudaReadModeElementType;
    texdesc_trip.normalizedCoords = 0;
    cudaTextureObject_t d_trip_tex = 0;
    cudaCreateTextureObject(&d_trip_tex, &resdesc_trip, &texdesc_trip, nullptr);

    // allocating and transferring image and binding to texture object
    cudaChannelFormatDesc chandesc_img1 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr1;
    cudaMallocArray(&d_img_arr1, &chandesc_img1, image1.cols, image1.rows);
    cudaMemcpyToArray(d_img_arr1, 0, 0, image1.data, image1.rows * image1.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img1;
    memset(&resdesc_img1, 0, sizeof(resdesc_img1));
    resdesc_img1.resType = cudaResourceTypeArray;
    resdesc_img1.res.array.array = d_img_arr1;
    struct cudaTextureDesc texdesc_img1;
    memset(&texdesc_img1, 0, sizeof(texdesc_img1));
    texdesc_img1.addressMode[0] = cudaAddressModeClamp;
    texdesc_img1.addressMode[1] = cudaAddressModeClamp;
    texdesc_img1.filterMode = cudaFilterModePoint;
    texdesc_img1.readMode = cudaReadModeElementType;
    texdesc_img1.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex1 = 0;
    cudaCreateTextureObject(&d_img_tex1, &resdesc_img1, &texdesc_img1, nullptr);

    cudaChannelFormatDesc chandesc_img2 = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray* d_img_arr2;
    cudaMallocArray(&d_img_arr2, &chandesc_img2, image2.cols, image2.rows);
    cudaMemcpyToArray(d_img_arr2, 0, 0, image2.data, image2.rows * image2.cols, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resdesc_img2;
    memset(&resdesc_img2, 0, sizeof(resdesc_img2));
    resdesc_img2.resType = cudaResourceTypeArray;
    resdesc_img2.res.array.array = d_img_arr2;
    struct cudaTextureDesc texdesc_img2;
    memset(&texdesc_img2, 0, sizeof(texdesc_img2));
    texdesc_img2.addressMode[0] = cudaAddressModeClamp;
    texdesc_img2.addressMode[1] = cudaAddressModeClamp;
    texdesc_img2.filterMode = cudaFilterModePoint;
    texdesc_img2.readMode = cudaReadModeElementType;
    texdesc_img2.normalizedCoords = 0;
    cudaTextureObject_t d_img_tex2 = 0;
    cudaCreateTextureObject(&d_img_tex2, &resdesc_img2, &texdesc_img2, nullptr);

    // CLATCH 1
    CLATCH(d_img_tex1, d_trip_tex, d_kps1, static_cast<int>(kps1.size()), d_desc1);

    // CLATCH 2
    CLATCH(d_img_tex2, d_trip_tex, d_kps2, static_cast<int>(kps2.size()), d_desc2);

    // wrapping query vecs ("2") in tex obj
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_desc2;
    resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.sizeInBytes = 64 * kps2.size();
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    cudaTextureObject_t tex_q = 0;
    cudaCreateTextureObject(&tex_q, &resDesc, &texDesc, nullptr);

    // allocating space for match results
    int* d_matches;
    cudaMalloc(&d_matches, 4 * kps2.size());

    CUDAK2NN(d_desc1, static_cast<int>(kps1.size()), tex_q, static_cast<int>(kps2.size()), d_matches, threshold);

    // transferring matches back to host
    int* h_matches = reinterpret_cast<int*>(malloc(4 * kps2.size()));
    cudaMemcpy(h_matches, d_matches, 4 * kps2.size(), cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    std::cout << "CUDA reports " << cudaGetErrorString(cudaGetLastError()) << std::endl;

    std::vector<Match> matches;
    std::vector<cv::DMatch> dmatches;
    for (int i = 0; i < kps2.size(); ++i) {
        if (h_matches[i] != -1) {
            matches.emplace_back(i, h_matches[i]);
            dmatches.emplace_back(h_matches[i], i, 0.0f);
        }
    }

    std::cout << "CUDAK2NN found " << matches.size() << " matches." << std::endl;

    // ------------- Output ------------
    if (display_image) {
        cv::Mat image_with_matches;
        cv::drawMatches(image1, keypoints1, image2, keypoints2, dmatches, image_with_matches, cv::Scalar::all(-1.0), cv::Scalar::all(-1.0), std::vector<char>(), cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);
        cv::namedWindow("Matches", CV_WINDOW_NORMAL | CV_WINDOW_KEEPRATIO);
        cv::imshow("Matches", image_with_matches);
        cv::waitKey(0);
    }
    // --------------------------------

}

@buzzsuresh

Thanks Kareem, Will review and change as recommended wherever required and update you with the results.

@komrad36

komrad36 commented Nov 9, 2016

No problem! Let me know if I can explain anything I do in the code sample. Also, check it out, rotation invariance is working perfectly:

[image: matches under 180-degree rotation]

Only 4% fewer matches even with 180 degree rotation.

@buzzsuresh

Thanks a lot, it works nicely. This recent sample is simpler and more to the point, without the warm-up and testing loops.


Detecting...
Image 1 - ORB Detect Time : 0.32 ms.
Image 2 - ORB Detect Time : 0.01 ms.
Image 1 - CUDA LATCH : 0.06 ms.
Image 2 - CUDA LATCH : 0 ms.
LATCH Comparison : 0.02 ms.
CUDAK2NN found 20 matches.


[image: CLATCH match comparison]

I have yet to verify rotation invariance. Will do that too.

Thanks.

@komrad36

komrad36 commented Nov 9, 2016

Nice example! Glad it's working for you!! By the way, OpenCV ORB is insanely slow; I'm working on KORB but I'm pretty busy right now so keep an eye out for it in the next few weeks :)

@buzzsuresh

Sure, will keep following. Best wishes for all your work.

@komrad36

Hi,

Just wanted to mention that, after lots of difficulties and delays, my detector-descriptor is finally ready, or at least ready enough to make public! I still have a lot of work and tweaking to do on it, but it's now available, so you can get an idea and hopefully benefit from the increased accuracy and speed that come from the way it shares scale spaces, giving superior rotation- and scale-invariance without any loss of performance. It's called KORAL and is available here: https://github.com/komrad36/KORAL.

Thanks!
-Kareem

@buzzsuresh

Nice work, Kareem. I'm trying it out to see whether I can use KORAL for my work.

I have a question related to the CUDAK2NN usage we discussed earlier. For the image-to-image comparison discussed above, the results came out well.

For a set of training images, I ran CLATCH on each and gathered all of the descriptors into one single big array (e.g. [134400 x 1]). This array holds the CLATCH data of 2100 keypoints (2100 * 64 = 134400 bytes).
I did this to enrich my training dataset so that I could use it while matching.

Using CUDAK2NN, I ran the comparison against the frame image to identify the training pattern. The matching does not seem accurate, and the number of matches returned is high even for frames that do not contain the training pattern.

My match function is as follows (frameCLATCHDesc: the frame's CLATCH descriptors; trainingDesc: the training CLATCH descriptors discussed above):


std::vector<cv::DMatch> CLATCHMatch(uint64_t* frameCLATCHDesc, std::vector<KeyPoint> framekps, cv::Mat trainingDesc)
{
    int numOfTrainingKPs = trainingDesc.cols / 64;
    //cout << "\tCLATCH Training Desc Keypoints : " << numOfTrainingKPs << ", Rows : " << trainingDesc.rows << ", Cols : " << trainingDesc.cols << endl;
    uint64_t* d_trainingLATCHDesc;
    cudaMalloc(&d_trainingLATCHDesc, trainingDesc.cols);
    cudaMemcpy(d_trainingLATCHDesc, trainingDesc.data, trainingDesc.cols, cudaMemcpyHostToDevice);

    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = frameCLATCHDesc;
    resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.sizeInBytes = 64 * framekps.size();
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    cudaTextureObject_t frame_tex_q = 0;
    cudaCreateTextureObject(&frame_tex_q, &resDesc, &texDesc, nullptr);

    int* d_matches;
    cudaMalloc(&d_matches, 4 * numOfTrainingKPs);
    CUDAK2NN(d_trainingLATCHDesc, numOfTrainingKPs, frame_tex_q, static_cast<int>(framekps.size()), d_matches, clatchThres);

    int* h_matches = reinterpret_cast<int*>(malloc(4 * numOfTrainingKPs));
    cudaMemcpy(h_matches, d_matches, 4 * numOfTrainingKPs, cudaMemcpyDeviceToHost);

    std::vector<cv::DMatch> matches;
    for (int i = 0; i < numOfTrainingKPs; ++i)
        if (h_matches[i] != -1)
            matches.emplace_back(h_matches[i], i, 0.0f);

    cudaFree(d_matches);
    return matches;
}


Kindly suggest any corrections I need to make to get accurate matches. Looking forward to your suggestions.

Thanks.
