# Vector multiplication example
This is an example developed live in class.

We will use it in two different ways so you can see a nice way to measure the execution times of each part of the code.

The first example is written to be executed directly, so that you can play with it.

The second one is the same, but instead of using %%cu, we use %%cuda --name and save the code into a file. We also removed the printing of the results, so that you can try bigger examples. Then, by compiling the file manually, we will be able to use nvprof to measure the execution time of each part.

## First implementation to see results and understand

First we install the plugin and check which GPU we have.

In [None]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin

Collecting git+git://github.com/andreinechaev/nvcc4jupyter.git
  Cloning git://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-a4i97kjz
  Running command git clone -q git://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-a4i97kjz
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-cp36-none-any.whl size=4307 sha256=454422906add62d9318c38584a90889b0ad095ac0b2156fc1d0063122b86731b
  Stored in directory: /tmp/pip-ephem-wheel-cache-wa3hvj4f/wheels/10/c2/05/ca241da37bff77d60d31a9174f988109c61ba989e4d4650516
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2
created output directory at /content/src
Out bin /content/result.out


In [None]:
%%cu
#include <stdio.h>
#include <stdlib.h>

int main() {
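    // Count the CUDA devices and stop early if there is none to query.
    int numDevs=0;
    cudaGetDeviceCount(&numDevs);
    if (numDevs == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    // Read the properties of device 0.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);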
    printf("Device Number: %d\n", 0);
    printf("  Device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n",
          prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n",
          prop.memoryBusWidth);
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
          2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    printf("Num devices %d\n", numDevs);
    return 0;
}

Now the example. It is an element-wise multiplication of two vectors of any size.

We perform every step needed to get the data onto the GPU, run the computation there, and copy the results back to the CPU to read them.

Since we print the results, be careful not to use too large a vector size.

In [None]:
%%cu

#include <iostream>
#include <cmath>    // ceil
#include <cstdlib>  // malloc

#define VEC_SIZE 8
#define BLOCK_SIZE 256

__global__ void vectMult(const int* d_vectA, const int* d_vectB, int* d_vectC) {
    int x = threadIdx.x + (blockIdx.x * blockDim.x);
    if (x < VEC_SIZE) {
      d_vectC[x] = d_vectA[x] * d_vectB[x];
    }
}

int main() {
    int *h_vectA, *h_vectB, *d_vectA, *d_vectB, *h_vectC, *d_vectC;

    h_vectA = (int*)malloc(sizeof(int)*VEC_SIZE);
    h_vectB = (int*)malloc(sizeof(int)*VEC_SIZE);
    h_vectC = (int*)malloc(sizeof(int)*VEC_SIZE);

    cudaMalloc(&d_vectA, sizeof(int)*VEC_SIZE);
    cudaMalloc(&d_vectB, sizeof(int)*VEC_SIZE);
    cudaMalloc(&d_vectC, sizeof(int)*VEC_SIZE);

    for (int i=0; i<VEC_SIZE; ++i) {
        h_vectA[i] = i;
        h_vectB[i] = i;
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_vectA, h_vectA, sizeof(int)*VEC_SIZE, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_vectB, h_vectB, sizeof(int)*VEC_SIZE, cudaMemcpyHostToDevice, stream);

    dim3 block;
    dim3 grid;

    if (VEC_SIZE <= BLOCK_SIZE) {
        block.x = VEC_SIZE;
        grid.x = 1;
    } else {
        block.x = BLOCK_SIZE;
        grid.x = ceil((double)VEC_SIZE / (double)BLOCK_SIZE);
    }

    vectMult<<<grid, block, 0, stream>>>(d_vectA, d_vectB, d_vectC);

    cudaMemcpyAsync(h_vectC, d_vectC, sizeof(int)*VEC_SIZE, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);

    std::cout << "Results: ";
    for (int i=0; i<VEC_SIZE; ++i) {
        std::cout << h_vectC[i] << " ";
    }
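    std::cout << std::endl;

    // Release the stream, the device buffers and the host buffers.
    cudaStreamDestroy(stream);
    cudaFree(d_vectA);
    cudaFree(d_vectB);
    cudaFree(d_vectC);
    free(h_vectA);
    free(h_vectB);
    free(h_vectC);

    return 0;
}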

## Second implementation with manual compilation and profiling
Now we do the same, but we save the code into test.cu, which the plugin writes to the src directory.

This way we can compile it manually with nvcc, and then use nvprof to see detailed execution times.

In [None]:
%%cuda --name test.cu

#include <iostream>
#include <cmath>    // ceil
#include <cstdlib>  // malloc

#define VEC_SIZE 4000
#define BLOCK_SIZE 256

__global__ void vectMult(const int* d_vectA, const int* d_vectB, int* d_vectC) {
    int x = threadIdx.x + (blockIdx.x * blockDim.x);
    if (x < VEC_SIZE) {
      d_vectC[x] = d_vectA[x] * d_vectB[x];
    }
}

int main() {
    int *h_vectA, *h_vectB, *d_vectA, *d_vectB, *h_vectC, *d_vectC;

    h_vectA = (int*)malloc(sizeof(int)*VEC_SIZE);
    h_vectB = (int*)malloc(sizeof(int)*VEC_SIZE);
    h_vectC = (int*)malloc(sizeof(int)*VEC_SIZE);

    cudaMalloc(&d_vectA, sizeof(int)*VEC_SIZE);
    cudaMalloc(&d_vectB, sizeof(int)*VEC_SIZE);
    cudaMalloc(&d_vectC, sizeof(int)*VEC_SIZE);

    for (int i=0; i<VEC_SIZE; ++i) {
        h_vectA[i] = i;
        h_vectB[i] = i;
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d_vectA, h_vectA, sizeof(int)*VEC_SIZE, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_vectB, h_vectB, sizeof(int)*VEC_SIZE, cudaMemcpyHostToDevice, stream);

    dim3 block;
    dim3 grid;

    if (VEC_SIZE <= BLOCK_SIZE) {
        block.x = VEC_SIZE;
        grid.x = 1;
    } else {
        block.x = BLOCK_SIZE;
        grid.x = ceil((double)VEC_SIZE / (double)BLOCK_SIZE);
    }

    vectMult<<<grid, block, 0, stream>>>(d_vectA, d_vectB, d_vectC);

    cudaMemcpyAsync(h_vectC, d_vectC, sizeof(int)*VEC_SIZE, cudaMemcpyDeviceToHost, stream);

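    cudaStreamSynchronize(stream);

    // Release the stream, the device buffers and the host buffers.
    cudaStreamDestroy(stream);
    cudaFree(d_vectA);
    cudaFree(d_vectB);
    cudaFree(d_vectC);
    free(h_vectA);
    free(h_vectB);
    free(h_vectC);

    return 0;
}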

In [None]:
!nvcc src/test.cu -o test

In [None]:
!nvprof ./test