<h1><div align="center">Visual Profiling in a GPU Environment</div></h1>

The CUDA toolkit ships with **Nsight Systems**, a powerful GUI application that supports the development of accelerated CUDA applications. Nsight Systems generates a graphical timeline of an accelerated application, with detailed information about CUDA API calls, kernel execution, memory activity, and more. In this notebook, you will use the **Nsight Systems** timeline to guide you in optimizing accelerated applications.

---
## Running Nsight Systems

For this interactive lab environment, we have set up a remote desktop that you can access from your browser, where you will be able to launch and use **Nsight Systems**. You will begin by creating a report file for an existing vector addition program. You will then be walked through a series of steps to open this report file in **Nsight Systems** and to arrange the interface so that the timelines are easy to read.

### Generate Report File

In [None]:
%%writefile vector-add.cu
#include <stdio.h>
#include <stdlib.h>  // exit

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
    a[i] = num;
}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];
  
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main(int argc, char **argv)
{
  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

Run the cell above to write the source file, then use the code execution cell directly below to compile it. The program's success message is printed when it runs, which happens during the profiling step that follows.

In [None]:
!nvcc vector-add.cu -o vector-add-no-prefetch

Next, use `nsys profile --stats=true` to create a report file that you will be able to open in the Nsight Systems visual profiler. Here we use the `-o` flag to give the report file a memorable name:

In [None]:
!nsys profile --stats=true -o vector-add-no-prefetch-report ./vector-add-no-prefetch

### Open the Remote Desktop

Next, follow the steps below to connect to OGBON with NICE DCV:

#### 1) Connect to login8 on OGBON

> ~$ ssh -p 5001 user@ogbon-login8.fieb.org.br 

#### 2) Create the Aliases

> ~$ alias dcvCreate="dcv create-session profiling"

> ~$ alias dcvList="dcv list-sessions"

> ~$ alias dcvClose="dcv close-session profiling"

#### 3) Create a Session in NICE DCV

> ~$ dcvCreate

#### 4) Open a browser and connect to the address of the session you created

    https://ogbon-cgpu4.fieb.org.br:8443#profiling

#### 5) Enter your username and password
After clicking the _Connect_ button you will be asked for your password, which is the one registered with the NOC/CS2I.

### Open Nsight Systems

To open Nsight Systems, run `nsys-ui` on the command line.

### Open the Report File

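Open the report file by selecting _File_ -> _Open_ from the Nsight Systems menu and choosing `vector-add-no-prefetch-report.qdrep`. Newer Nsight Systems releases write reports with the `.nsys-rep` extension instead, so look for that file if no `.qdrep` file exists.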

### Ignore Warnings/Errors

You can close and ignore any warnings or errors you see, which are just a result of our particular remote desktop environment.

### Make More Room for the Timelines

To make more room for the timelines, switch the profiler to full screen, close the _Project Explorer_, and hide the *Events View*.

### Expand the CUDA Unified Memory Timelines

Next, expand the _CUDA_ -> _Unified memory_ and _Context_ timelines, and close the _Threads_ timelines.

### Observe Many Memory Transfers

At a glance you can see that your application takes about 1 second to run, and that there is a lot of UM memory activity while the `addVectorsInto` kernel is running.

Zoom into the memory timelines to see more clearly all the small memory transfers caused by on-demand memory page faults. A couple of tips:

1. You can zoom in and out at any point of the timeline by holding `CTRL` while scrolling your mouse/trackpad
2. You can zoom into any section by clicking and dragging a rectangle around it, and then selecting _Zoom in_
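
To put the volume of migrated data in perspective, the host-only sketch below works out how much managed memory the program allocates. It relies only on `N = 2<<24` and `sizeof(float)` from the program above. The helper names `vectorBytes`, `totalMiB`, and `printFootprint` are illustrative and not part of the lab code.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative helper (not part of the lab code): bytes needed for n floats.
constexpr std::size_t vectorBytes(std::size_t n)
{
  return n * sizeof(float);
}

// Illustrative helper: whole MiB occupied by `count` vectors of n floats each.
constexpr std::size_t totalMiB(std::size_t n, std::size_t count)
{
  return count * vectorBytes(n) / (1024 * 1024);
}

// Prints the footprint of the three managed vectors a, b and c.
inline void printFootprint(int n)
{
  printf("elements per vector: %d\n", n);
  printf("MiB per vector:      %zu\n", totalMiB(n, 1));
  printf("MiB for a, b and c:  %zu\n", totalMiB(n, 3));
}
```

With `N = 2<<24` (33,554,432 elements) and 4-byte floats, `printFootprint(2 << 24)` reports 128 MiB per vector and 384 MiB in total. All of it is first touched on the CPU by `initWith`, so without prefetching the data reaches the GPU through the many small on-demand migrations visible in the timeline.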

---
## Comparing Code Refactors Iteratively with Nsight Systems

Now that you have **Nsight Systems** up and running and are comfortable moving around the timelines, you will profile a series of programs that were iteratively improved using techniques already familiar to you. Each time you profile, the timeline will provide information that suggests how to modify your code next. Doing this will further increase your understanding of how various CUDA programming techniques affect application performance.

## Discussion: What Is Asynchronous Memory Prefetching?

First, it helps to understand what **asynchronous memory prefetching** means. It is a powerful technique for reducing the overhead of page faulting and on-demand memory migrations, in both host-to-device and device-to-host memory transfers. It allows programmers to asynchronously migrate unified memory (UM) to any CPU or GPU device in the system, in the background, before application code uses it. This can improve the performance of GPU kernels and CPU functions by reducing page fault and on-demand data migration overhead.

Prefetching also tends to migrate data in larger chunks, and therefore fewer trips, than on-demand migration. This makes it an excellent fit when data access needs are known before runtime, and when data access patterns are not sparse.

CUDA makes asynchronously prefetching managed memory to either a GPU device or the CPU easy with its `cudaMemPrefetchAsync` function. Here is an example of using it to both prefetch data to the currently active GPU device, and then, to the CPU:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                                         // The ID of the currently active GPU device.

cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);        // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host. `cudaCpuDeviceId` is a
                                                                  // built-in CUDA variable.
```
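
The sketch below combines both prefetch directions in one place: the vectors are moved to the GPU before the kernel runs, and the result is moved back to the CPU before the host verifies it. It assumes a CUDA toolkit (12.x or earlier) in which `cudaMemPrefetchAsync` takes an integer device ID, as in the lab's code, and a GPU that supports prefetching managed memory. The names `checkCuda` and `runPrefetchedVectorAdd` and the fixed `<<<256, 256>>>` launch configuration are illustrative choices, not part of the lab.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call; returns true on success.
inline bool checkCuda(cudaError_t result, const char *what)
{
  if (result != cudaSuccess)
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(result));
  return result == cudaSuccess;
}

__global__
void addVectorsInto(float *result, const float *a, const float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];
}

// Hypothetical helper: adds a vector of 3s to a vector of 4s with prefetching
// in both directions; returns true if every element of the result is 7.
bool runPrefetchedVectorAdd(int N)
{
  int deviceId = 0;
  size_t size = static_cast<size_t>(N) * sizeof(float);
  float *a = nullptr, *b = nullptr, *c = nullptr;

  bool ok = checkCuda(cudaGetDevice(&deviceId), "cudaGetDevice")
         && checkCuda(cudaMallocManaged(&a, size), "cudaMallocManaged(a)")
         && checkCuda(cudaMallocManaged(&b, size), "cudaMallocManaged(b)")
         && checkCuda(cudaMallocManaged(&c, size), "cudaMallocManaged(c)");

  if (ok)
  {
    // First touch happens on the CPU, so the pages start out on the host.
    for (int i = 0; i < N; ++i) { a[i] = 3; b[i] = 4; c[i] = 0; }

    // Move all three vectors to the GPU before the kernel needs them.
    ok = checkCuda(cudaMemPrefetchAsync(a, size, deviceId), "prefetch a to GPU")
      && checkCuda(cudaMemPrefetchAsync(b, size, deviceId), "prefetch b to GPU")
      && checkCuda(cudaMemPrefetchAsync(c, size, deviceId), "prefetch c to GPU");
  }

  if (ok)
  {
    addVectorsInto<<<256, 256>>>(c, a, b, N);

    // The prefetch back to the host is queued on the same (default) stream,
    // so it starts only after the kernel has finished.
    ok = checkCuda(cudaGetLastError(), "kernel launch")
      && checkCuda(cudaMemPrefetchAsync(c, size, cudaCpuDeviceId), "prefetch c to CPU")
      && checkCuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
  }

  // Host-side verification now reads pages that are already resident on the CPU.
  for (int i = 0; ok && i < N; ++i)
    ok = (c[i] == 7.0f);

  cudaFree(a);  // cudaFree on a null pointer performs no operation
  cudaFree(b);
  cudaFree(c);
  return ok;
}
```

Calling `runPrefetchedVectorAdd(2 << 24)` from `main` and profiling the resulting program should let you compare the device-to-host transfers in the timeline against a version without the second prefetch.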

### Exercise: Compare the Timelines of Prefetching vs. Non-Prefetching

The program below refactors the vector addition application from above so that the 3 vectors needed by its `addVectorsInto` kernel are asynchronously prefetched to the active GPU device prior to launching the kernel (using [`cudaMemPrefetchAsync`](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8dc9199943d421bc8bc7f473df12e42)). Read through the source code and identify where in the application these changes were made. After reviewing the changes, write the file and compile the refactored application using the code execution cells below. Its success message is printed when the application runs during the profiling step.

In [None]:
%%writefile vector-add-prefetch-solution.cu
#include <stdio.h>
#include <stdlib.h>  // exit

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
    a[i] = num;

}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];

}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main(int argc, char **argv)
{
  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  cudaMemPrefetchAsync(a, size, deviceId);
  cudaMemPrefetchAsync(b, size, deviceId);
  cudaMemPrefetchAsync(c, size, deviceId);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

In [None]:
!nvcc vector-add-prefetch-solution.cu -o vector-add-prefetch

Now create a report file for this version of the application:

In [None]:
!nsys profile --stats=true -o vector-add-prefetch-report ./vector-add-prefetch

Open the report in Nsight Systems, leaving the previous report open for comparison.

### ☆ Questions:

- How does the execution time of the `addVectorsInto` kernel compare to its execution time prior to adding asynchronous prefetching?
- Locate `cudaMemPrefetchAsync` in the *CUDA API* section of the timeline.
- How have the memory transfers changed?
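
If you would like to cross-check the kernel durations shown in the timeline, one option is to time the kernel directly with CUDA events. The sketch below is an illustration under the same assumptions as the lab code (a toolkit where `cudaMemPrefetchAsync` takes an integer device ID). The helper name `timeVectorAddMs` and the fixed `<<<256, 256>>>` launch configuration are hypothetical and not part of the lab.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__
void addVectorsInto(float *result, const float *a, const float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];
}

// Hypothetical helper: times one launch of addVectorsInto with CUDA events.
// Returns the elapsed milliseconds, or -1.0f if any CUDA call fails.
float timeVectorAddMs(int N, bool prefetch)
{
  int deviceId = 0;
  size_t size = static_cast<size_t>(N) * sizeof(float);
  float *a = nullptr, *b = nullptr, *c = nullptr;
  cudaEvent_t start = nullptr, stop = nullptr;
  float ms = -1.0f;

  bool ok = cudaGetDevice(&deviceId) == cudaSuccess
         && cudaMallocManaged(&a, size) == cudaSuccess
         && cudaMallocManaged(&b, size) == cudaSuccess
         && cudaMallocManaged(&c, size) == cudaSuccess
         && cudaEventCreate(&start) == cudaSuccess
         && cudaEventCreate(&stop) == cudaSuccess;

  if (ok)
  {
    // Initialize on the CPU so the pages start out on the host.
    for (int i = 0; i < N; ++i) { a[i] = 3; b[i] = 4; c[i] = 0; }

    if (prefetch)
      ok = cudaMemPrefetchAsync(a, size, deviceId) == cudaSuccess
        && cudaMemPrefetchAsync(b, size, deviceId) == cudaSuccess
        && cudaMemPrefetchAsync(c, size, deviceId) == cudaSuccess;
  }

  if (ok)
  {
    // Events are recorded on the default stream, after any queued prefetches,
    // so the measured interval covers the kernel but not the prefetches.
    cudaEventRecord(start);
    addVectorsInto<<<256, 256>>>(c, a, b, N);
    cudaEventRecord(stop);

    ok = cudaGetLastError() == cudaSuccess
      && cudaEventSynchronize(stop) == cudaSuccess
      && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
  }

  if (start) cudaEventDestroy(start);
  if (stop)  cudaEventDestroy(stop);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return ok ? ms : -1.0f;
}
```

Calling `timeVectorAddMs(2 << 24, false)` and `timeVectorAddMs(2 << 24, true)` from `main` gives two numbers to set against the kernel durations in the two reports. Without prefetching, the measured interval includes the time the kernel spends waiting on page faults.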

## Summary

At this point in the lab you are able to:

- Use **Nsight Systems** to visually profile the timeline of GPU-accelerated CUDA applications.
- Use **Nsight Systems** to identify, and exploit, optimization opportunities in GPU-accelerated CUDA applications.

You now have a wealth of fundamental tools and techniques for accelerating CPU-only applications, and for then optimizing those accelerated applications. In the final exercise, in the next notebook, you will have a chance to apply everything that you've learned.

## Clear the Temporary Files

Before moving on, please execute the following cell to clean up the directory. This is required to move on to the next notebook.

In [None]:
!rm -rf *vector-add* 

## Next

Please continue to the next notebook: [_Final-Exercise_](03-Final-Exercise.ipynb).