<a href="https://colab.research.google.com/github/ncamcl/EpiNow/blob/master/Practical_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CUDA Programming on NVIDIA GPUs, July 22-26, 2024**

# **Practical 8**

Again, make sure the correct runtime is in use: click the Runtime menu at the top, choose "Change runtime type", and select an appropriate GPU such as the T4.

Then check the details of the GPU available to you, and download the usual two header files.

In [None]:
!nvidia-smi


Sun Jun  9 06:31:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h


--2024-06-09 06:31:47--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34238 (33K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2024-06-09 06:31:48 (258 KB/s) - ‘helper_cuda.h’ saved [34238/34238]

--2024-06-09 06:31:48--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23960 (23K) [text/x-chdr]
Saving to: ‘helper_string.h’


2024-06-09 06:31:49 (181 KB/s) - ‘helper_string.h’ saved [23960/23960]





---

The next step is to create the file scan.cu, which contains the CUDA scan kernel together with a reference C++ routine, scan_gold, against which the CUDA results are compared.

In [None]:
%%writefile scan.cu

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>


///////////////////////////////////////////////////////////////////////
// CPU routine
///////////////////////////////////////////////////////////////////////

// exclusive scan: odata[i] = idata[0] + ... + idata[i-1], with odata[0] = 0
void scan_gold(float* odata, float* idata, const unsigned int len)
{
  odata[0] = 0;
  for(unsigned int i=1; i<len; i++) odata[i] = idata[i-1] + odata[i-1];
}

///////////////////////////////////////////////////////////////////////
// GPU routine
///////////////////////////////////////////////////////////////////////

__global__ void scan(float *g_odata, float *g_idata)
{
  // Dynamically allocated shared memory for scan kernels

  extern __shared__  float tmp[];

  float temp;
  int   tid = threadIdx.x;

  // read input into shared memory

  temp     = g_idata[tid];
  tmp[tid] = temp;

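  // perform scan: after the pass with offset d, tmp[tid] holds the sum
  // of the 2*d inputs ending at element tid (fewer near the start)

  for (int d=1; d<blockDim.x; d=2*d) {
    __syncthreads();                     // all writes to tmp are complete
    if (tid-d >= 0) temp += tmp[tid-d];
    __syncthreads();                     // all reads of tmp are complete
    tmp[tid] = temp;
  }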

  // shift right by one (inclusive -> exclusive scan)
  // and write results to global memory

  __syncthreads();

  temp = 0.0f;
  if (tid>0) temp = tmp[tid-1];

  g_odata[tid] = temp;
}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *reference;
  float *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_threads  = 512;
  num_elements = num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i=0; i<num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  reference = (float*) malloc(mem_size);
  scan_gold(reference, h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, mem_size) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice));

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  scan<<<1,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("scan kernel execution failed");

  // copy result from device to host

  checkCudaErrors( cudaMemcpy(h_data, d_odata, mem_size,
                              cudaMemcpyDeviceToHost) );

  // check results

  float err=0.0;
  for (int i=0; i<num_elements; i++)
    err += (h_data[i] - reference[i])*(h_data[i] - reference[i]);
  printf("rms scan error  = %f\n",sqrt(err/num_elements));

  // cleanup memory

  free(h_data);
  free(reference);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}

Writing scan.cu
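
The doubling loop in the kernel can be easier to follow when emulated on the host. The sketch below is illustrative only and is not part of scan.cu: the function names `exclusive_scan_serial` and `exclusive_scan_doubling` are hypothetical, and the per-pass copy `next` is an assumption standing in for the two `__syncthreads()` calls, which ensure that every thread reads the old values of `tmp` before any thread overwrites them.

```cpp
#include <cstddef>
#include <vector>

// Serial exclusive scan, mirroring scan_gold:
// out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<float> exclusive_scan_serial(const std::vector<float>& in) {
  std::vector<float> out(in.size(), 0.0f);
  for (std::size_t i = 1; i < in.size(); ++i) out[i] = in[i - 1] + out[i - 1];
  return out;
}

// Host-side emulation of the kernel's doubling loop, one "thread" per element.
std::vector<float> exclusive_scan_doubling(const std::vector<float>& in) {
  const std::size_t n = in.size();
  std::vector<float> tmp(in);            // plays the role of shared memory
  for (std::size_t d = 1; d < n; d *= 2) {
    std::vector<float> next(tmp);        // snapshot: all reads see old values
    for (std::size_t tid = d; tid < n; ++tid)   // same as "tid-d >= 0"
      next[tid] = tmp[tid] + tmp[tid - d];
    tmp.swap(next);
  }
  // tmp now holds the inclusive scan; shift right by one for the exclusive scan.
  std::vector<float> out(n, 0.0f);
  for (std::size_t tid = 1; tid < n; ++tid) out[tid] = tmp[tid - 1];
  return out;
}
```

Calling both functions on the same input, for example `{3, 1, 4, 1, 5, 9, 2, 6}`, should give identical vectors. The doubling version needs about log2(n) passes rather than n-1 dependent steps, which is what makes it suitable for one thread per element.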



---

We can now compile and run the executable.  Note that the compilation explicitly links the CUDA runtime library (-lcudart); cuRAND is not needed here because the input data is generated on the host with rand().


In [None]:
!nvcc scan.cu -o scan -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z4scanPfS_' for 'sm_70'
ptxas info    : Function properties for _Z4scanPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 368 bytes cmem[0]


In [None]:
!./scan

rms scan error  = 0.000000
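
For reference, the figure printed above is the root-mean-square difference between the GPU result and the CPU reference. Writing $h_i$ for the values copied back from the device, $r_i$ for the output of scan_gold, and $N$ for num_elements (512 here), with the symbols chosen only for this note, the check in main computes

$$
\text{rms error} = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\left(h_i - r_i\right)^2 } .
$$

An error of exactly zero is plausible rather than suspicious: the inputs are whole numbers between 0 and 10, so every partial sum is an integer no larger than $511 \times 10 = 5110$, far below $2^{24}$, the limit up to which single precision represents all integers exactly. The additions are therefore exact whatever order they are performed in, and the parallel and serial scans should agree bit for bit. With non-integer inputs, small rounding differences between the two orderings would be expected.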




---

You are now ready to carry out the exercises in Practical 8.

