<h1><div align="center"> Final Exercise: Iteratively Optimize an Accelerated SAXPY Application</div></h1>

A basic accelerated SAXPY (Single-Precision A·X Plus Y, i.e. a\*x+y) application is provided below. It currently contains a couple of bugs that you will need to find and fix before you can successfully compile, run, and then profile it with `nsys profile`. After fixing the bugs and profiling the application, record the runtime of the `saxpy` kernel and then work *iteratively* to optimize the application, using `nsys profile` after each iteration to observe the effects of the code changes on kernel performance and Unified Memory (UM) behavior. Your end goal is to profile an accurate `saxpy` kernel without modifying `N`, and to compare the performance of the prefetching options in the code.
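While hunting for the bugs, it can help to check the status returned by each CUDA runtime call, since a failed launch or allocation is otherwise silent. The following is a minimal sketch of that idea, not part of the original exercise: `checkCuda` and the `fill` kernel are hypothetical names introduced only for illustration, while `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are standard CUDA runtime calls.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper (not part of the exercise): abort with a readable
// message if a CUDA runtime call did not return cudaSuccess.
inline void checkCuda(cudaError_t result, const char *what)
{
  if (result != cudaSuccess)
  {
    fprintf(stderr, "CUDA error in %s: %s\n", what, cudaGetErrorString(result));
    exit(EXIT_FAILURE);
  }
}

// Illustrative grid-stride kernel that writes `value` into every element.
__global__ void fill(int *data, int n, int value)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = tid; i < n; i += stride)
    data[i] = value;
}

int main()
{
  const int n = 1024;
  int *data;

  checkCuda(cudaMallocManaged(&data, n * sizeof(int)), "cudaMallocManaged");

  fill<<<4, 256>>>(data, n, 7);
  checkCuda(cudaGetLastError(), "kernel launch");          // e.g. an invalid launch configuration
  checkCuda(cudaDeviceSynchronize(), "kernel execution");  // errors raised while the kernel runs

  printf("data[0] = %d, data[%d] = %d\n", data[0], n - 1, data[n - 1]);

  checkCuda(cudaFree(data), "cudaFree");
  return 0;
}
```

Wrapping the calls in `saxpy.cu` the same way is one possible approach to locating a runtime bug; compile-time bugs will still be reported by `nvcc` directly.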

In [1]:
%%writefile saxpy.cu
#include <stdio.h>

#define N (2048 * 2048) // Number of elements in each vector (parenthesized so the macro expands safely; value unchanged)

__global__ 
void saxpy(int * a, int * b, int * c)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  
  for (int i = tid; i < N; i += stride)
    c[i] = 2 * a[i] + b[i];
}

int main(int argc, char **argv)
{
  int *a, *b, *c;

  int size = N * sizeof (int);

  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  // Allocate memory
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  // Initialize memory
  for( int i = 0; i < N; ++i )
  {
    a[i] = 2000;
    b[i] = 1000;
    c[i] = 0;
  }

  cudaMemPrefetchAsync(a, size, deviceId);
  cudaMemPrefetchAsync(b, size, deviceId);
  cudaMemPrefetchAsync(c, size, deviceId);

  int threads_per_block = 256;
  int number_of_blocks = numberOfSMs * 32;

  saxpy<<<number_of_blocks, threads_per_block>>>(a, b, c);

  cudaDeviceSynchronize(); // Wait for the GPU to finish

  // Print out the first and last 5 values of c for a quality check
  for( int i = 0; i < 5; ++i )
    printf("c[%d] = %d, ", i, c[i]);
  printf ("\n");
  for( int i = N-5; i < N; ++i )
    printf("c[%d] = %d, ", i, c[i]);
  printf ("\n");

  // Free all our allocated memory
  cudaFree( a ); 
  cudaFree( b ); 
  cudaFree( c );
}

Writing saxpy.cu


In [2]:
!nvcc saxpy.cu -o saxpy

In [3]:
!nsys profile --stats=true -o saxpy-report ./saxpy

c[0] = 5000, c[1] = 5000, c[2] = 5000, c[3] = 5000, c[4] = 5000, 
c[4194299] = 5000, c[4194300] = 5000, c[4194301] = 5000, c[4194302] = 5000, c[4194303] = 5000, 
Generating '/tmp/nsys-report-17ad.qdstrm'
[3/8] Executing 'nvtxsum' stats report
SKIPPED: /home/murilo/profiling/saxpy-report.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrtsum' stats report

Operating System Runtime API Statistics:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)        Name     
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  --------------
     64,0      294.410.206         25  11.776.408,0  2.982.399,0     1.858  100.115.692  20.967.666,0  poll          
     19,0       86.435.481      1.017      84.990,0     20.619,0     1.020   28.961.871     930.472,0  ioctl         
     10,0       46.233.491         18   2.568.527,0    107.166,0    26.612   21.537.912   5.802.718,0 

### ☆ Questions:

- What is the difference between the timelines of the prefetching and non-prefetching versions?

- Can we further improve the performance of the best code?
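
To explore the first question without leaving the code, one option is to time the same kernel with and without prefetching using CUDA events. The sketch below is an illustration under stated assumptions rather than the exercise's reference solution: `timeSaxpy` is a hypothetical helper, the initial values mirror the cell above, and the `cudaMemPrefetchAsync(ptr, size, device)` form is the one used in the exercise (newer CUDA toolkits may expect a different signature).

```cuda
#include <stdio.h>

#define N (2048 * 2048) // Same element count as the exercise

__global__ void saxpy(int *a, int *b, int *c)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = tid; i < N; i += stride)
    c[i] = 2 * a[i] + b[i];
}

// Hypothetical helper: runs saxpy once on fresh managed allocations and
// returns the kernel time in milliseconds as measured by CUDA events.
// When `prefetch` is false, page-fault migration happens inside the kernel.
float timeSaxpy(bool prefetch, int deviceId, int numberOfBlocks, bool *correct)
{
  int *a, *b, *c;
  size_t size = N * sizeof(int);

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  // Initialize on the host so the pages start out CPU-resident
  for (int i = 0; i < N; ++i)
  {
    a[i] = 2000;
    b[i] = 1000;
    c[i] = 0;
  }

  if (prefetch)
  {
    cudaMemPrefetchAsync(a, size, deviceId);
    cudaMemPrefetchAsync(b, size, deviceId);
    cudaMemPrefetchAsync(c, size, deviceId);
  }

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Events are recorded in stream order, after any prefetches issued above
  cudaEventRecord(start);
  saxpy<<<numberOfBlocks, 256>>>(a, b, c);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Spot-check the result: 2 * 2000 + 1000 = 5000
  *correct = (c[0] == 5000 && c[N - 1] == 5000);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return ms;
}

int main()
{
  int deviceId, numberOfSMs;
  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);
  int blocks = numberOfSMs * 32;

  bool ok = false;
  timeSaxpy(true, deviceId, blocks, &ok); // Warm-up run, result discarded

  float withoutPrefetch = timeSaxpy(false, deviceId, blocks, &ok);
  printf("no prefetch: %.3f ms (correct: %d)\n", withoutPrefetch, ok);

  float withPrefetch = timeSaxpy(true, deviceId, blocks, &ok);
  printf("prefetch:    %.3f ms (correct: %d)\n", withPrefetch, ok);
  return 0;
}
```

The absolute numbers depend on the GPU and driver, so they are best read alongside the `nsys profile` timeline rather than in place of it. For the second question, one direction that may be worth trying is prefetching `c` back to the host with `cudaCpuDeviceId` before the quality check.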

## Clear the Temporary Files

Before moving on, please execute the following cell to clean up the directory. This step is required before you continue to the next notebook.

In [4]:
!rm -rf *saxpy* 