# An Even Easier Introduction to CUDA

This notebook is a Colab-friendly companion to NVIDIA’s free DLI course:

- **An Even Easier Introduction to CUDA**: https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+T-AC-01+V1

It follows the same learning path as Mark Harris’s blog post (the course companion reading) and guides you through progressively more parallel CUDA kernels for adding two vectors of **1,048,576** floats.

### What you’ll do here
By the end of the notebook, you will be able to:

- Compile and run **CUDA C++** code with `nvcc`
- Launch CUDA kernels with different execution configurations (threads and blocks)
- Use **Unified Memory** and understand why data migration matters
- Measure kernel runtime using **CUDA events** (and optionally collect a timeline with Nsight Systems, if available)

### Before you run
1. In Colab: **Runtime → Change runtime type → GPU**
2. Check the GPU model with `nvidia-smi`.
3. This notebook compiles with `-arch=sm_75` (T4). If Colab gives you a different GPU, change the `-arch` flag in the compile cells:
   - P100: `sm_60` • V100: `sm_70` • T4: `sm_75` • A100: `sm_80` • L4: `sm_89`


In [12]:
!nvidia-smi
!which nsys || echo "nsys not on PATH"
!find / -type f -name nsys 2>/dev/null | head -n 20


Sat Jan  3 23:13:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [14]:
# Download (CLI-only .deb)
!wget -O nsys_cli.deb "https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb"

# Install
!sudo dpkg -i nsys_cli.deb || sudo apt-get -f install -y

# Verify
!nsys --version


--2026-01-03 23:14:39--  https://developer.nvidia.com/downloads/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb
Resolving developer.nvidia.com (developer.nvidia.com)... 23.59.88.80, 23.59.88.69, 23.59.88.67, ...
Connecting to developer.nvidia.com (developer.nvidia.com)|23.59.88.80|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://developer.download.nvidia.com/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb?__token__=exp=1767483879~hmac=cf2ebad1ead6d659eca139b9e61dfba644bc784b61646f05f9ea3fd54774ad29 [following]
--2026-01-03 23:14:39--  https://developer.download.nvidia.com/assets/tools/secure/nsight-systems/2025_6/NsightSystems-linux-cli-public-2025.6.1.190-3689520.deb?__token__=exp=1767483879~hmac=cf2ebad1ead6d659eca139b9e61dfba644bc784b61646f05f9ea3fd54774ad29
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 23.

<img src="https://developer.download.nvidia.com/training/courses/T-AC-01-V1/CUDA_Cube_1K.jpeg" width="400">

This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous [“Easy Introduction”](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/) to CUDA in 2013 that has been very popular over the years. But CUDA programming has gotten easier, and GPUs have gotten much faster, so it’s time for an updated (and even easier) introduction.

CUDA C++ is just one of the ways you can create massively parallel applications with CUDA. It lets you use the powerful C++ programming language to develop high performance algorithms accelerated by thousands of parallel threads running on GPUs. Many developers have accelerated their computation- and bandwidth-hungry applications this way, including the libraries and frameworks that underpin the ongoing revolution in artificial intelligence known as [Deep Learning](https://developer.nvidia.com/deep-learning).

So, you’ve heard about CUDA and you are interested in learning how to use it in your own applications. If you are a C or C++ programmer, this blog post should give you a good start. To follow along, you’ll need a computer with a CUDA-capable GPU (Windows, Mac, or Linux, and any NVIDIA GPU should do), or a cloud instance with GPUs (AWS, Azure, IBM SoftLayer, and other cloud service providers have them). You’ll also need the free [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) installed.

Let's get started!

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/cuda_ai_cube-625x625.jpg" width="400">

## Starting Simple

### The three CUDA versions you’ll run
You’ll write and run three CUDA programmes (each saved to its own `.cu` file):

1. **`add.cu` — 1 thread**  
   A first CUDA kernel launch (`<<<1,1>>>`) to show the basic mechanics. Correct, but not parallel.

2. **`add_block.cu` — 256 threads (1 block)**  
   Uses `threadIdx.x` and a *block-stride* loop so the work is shared across threads in a single block.

3. **`add_grid.cu` — many blocks × 256 threads (grid-stride loop)**  
   Scales across the whole GPU using `blockIdx.x`, `gridDim.x`, and a *grid-stride* loop. This version also demonstrates **Unified Memory prefetching** and prints an **average kernel time** from multiple iterations.

You’ll keep the algorithm the same (vector add) so you can focus on how launch configuration and memory behaviour affect performance.


We’ll start with a simple C++ program that adds the elements of two arrays with a million elements each.

In [1]:
%%writefile add.cpp

#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

Writing add.cpp


Executing the above cell will save its contents to the file add.cpp.

The following cell compiles this C++ program.

In [2]:
%%shell
g++ add.cpp -o add



Then run it:

In [3]:
%%shell
./add

Max error: 0




As expected, it prints that there was no error in the summation and then exits. Now I want to get this computation running (in parallel) on the many cores of a GPU. It’s actually pretty easy to take the first steps.

First, I just have to turn our `add` function into a function that the GPU can run, called a *kernel* in CUDA. To do this, all I have to do is add the specifier `__global__` to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.

```cpp
// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}
```

These `__global__` functions are known as *kernels*, and code that runs on the GPU is often called *device code*, while code that runs on the CPU is *host code*.

## Memory Allocation in CUDA

To compute on the GPU, I need to allocate memory accessible by the GPU. [Unified Memory](https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call `cudaMallocManaged()`, which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to `cudaFree()`.

I just need to replace the calls to `new` in the code above with calls to `cudaMallocManaged()`, and replace calls to `delete []` with calls to `cudaFree`.

```cpp
  // Allocate Unified Memory -- accessible from CPU or GPU
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  ...

  // Free memory
  cudaFree(x);
  cudaFree(y);
```

Finally, I need to *launch* the `add()` kernel, which invokes it on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax `<<< >>>`. I just have to add it to the call to `add` before the parameter list.

```cpp
add<<<1, 1>>>(N, x, y);
```

Easy! I’ll get into the details of what goes inside the angle brackets soon; for now all you need to know is that this line launches one GPU thread to run `add()`.

Just one more thing: I need the CPU to wait until the kernel is done before it accesses the results (because CUDA kernel launches don’t block the calling CPU thread). To do this I just call `cudaDeviceSynchronize()` before doing the final error checking on the CPU.

Here’s the complete code:

In [22]:
%%writefile add.cu
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

__global__
void add(int n, const float *x, float *y)
{
  // single-thread version: one GPU thread loops over the whole array
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main()
{
  const int N = 1 << 20; // 1M

  float *x = nullptr, *y = nullptr;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // --- simple kernel timing using CUDA events ---
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  add<<<1, 1>>>(N, x, y);
  cudaEventRecord(stop);

  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::cout << "Kernel time (ms): " << ms << std::endl;

  cudaDeviceSynchronize(); // ensure results ready

  // check result
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  cudaFree(x);
  cudaFree(y);
  return 0;
}


Overwriting add.cu


In [23]:
%%shell
nvcc -O3 add.cu -o add -arch=sm_75
./add


Kernel time (ms): 107.89
Max error: 0




This is only a first step, because as written, this kernel is only correct for a single thread, since every thread that runs it will perform the add on the whole array. Moreover, there is a [race condition](https://en.wikipedia.org/wiki/Race_condition) since multiple parallel threads would both read and write the same locations.
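To make the redundant work concrete, here is a minimal CPU-only sketch in plain C++ (no GPU required). It mimics what would happen if several threads each ran the unmodified loop, one after another. The helper name `runUnmodifiedKernel` is hypothetical, and the emulation is sequential by assumption, so it shows only the repeated work. On a real GPU the threads race, and the final values are unpredictable.

```cpp
#include <cassert>
#include <vector>

// Same loop body as the single-thread kernel above.
void add(int n, const float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: emulates `numThreads` threads that each run the
// unmodified kernel, one after another (no real concurrency, so no race).
// Returns y after all emulated threads have finished.
std::vector<float> runUnmodifiedKernel(int n, int numThreads)
{
  assert(n > 0 && numThreads > 0);
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  for (int t = 0; t < numThreads; t++)
    add(n, x.data(), y.data());  // every thread repeats the whole array
  return y;
}
```

With one emulated thread every element ends at `3.0f`. With four serialized threads every element ends at `6.0f`, because `x` has been added four times.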

## Profile it!

For quick feedback, this notebook measures runtime **inside the programme** using **CUDA events**.  

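- `add.cu` and `add_block.cu` print a single **Kernel time (ms)**  
- `add_grid.cu` runs the kernel many times and prints an **Avg kernel time (ms)**

Because `add.cu` and `add_block.cu` do not prefetch their data, on GPUs that migrate Unified Memory on demand (Pascal and later) their single measurement also includes the time spent moving pages to the GPU on first touch. Keep that in mind when comparing it with the averaged, prefetched figure from `add_grid.cu`.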

For a deeper look (timelines, memory operations, concurrency), you can use profiling tools such as **Nsight Systems** (`nsys`) or **Nsight Compute** (`ncu`) when available in your environment.
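The same measure-inside-the-program idea also gives a CPU baseline to compare the GPU numbers against. The sketch below is plain C++ and uses `std::chrono` instead of CUDA events, since no GPU is involved. The helper name `averageCpuAddMs` is hypothetical, and the result depends on your CPU and compiler flags.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Same loop as add.cpp.
void add(int n, const float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: runs add() `iters` times on the CPU and returns the
// average wall-clock time per call in milliseconds. `y` receives the final
// values so the caller can validate them.
double averageCpuAddMs(int n, int iters, std::vector<float> &y)
{
  assert(n > 0 && iters > 0);
  std::vector<float> x(n, 1.0f);
  y.assign(n, 2.0f);

  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < iters; r++)
    add(n, x.data(), y.data());
  auto stop = std::chrono::steady_clock::now();

  std::chrono::duration<double, std::milli> elapsed = stop - start;
  return elapsed.count() / iters;
}
```

Each call adds `x[i]` to `y[i]` again, so after `iters` calls every element of `y` equals `2.0f + iters`, not `3.0f`. Any validation that follows a repeated run has to account for that.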


Kernel timings will vary depending on which GPU Colab assigns to you.  
Run the next cell to see your current GPU model (for example, *Tesla T4*).

In [7]:
%%shell

nvidia-smi

Sat Jan  3 22:15:49 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                



Let's make it faster with parallelism.

## Picking up the Threads

Now that you’ve run a kernel with one thread that does some computation, how do you make it parallel? The key is in CUDA’s `<<<1, 1>>>` syntax. This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use for the launch on the GPU. There are two parameters here, but let’s start by changing the second one: the number of threads in a thread block. CUDA GPUs run kernels using blocks of threads that are a multiple of 32 in size, so 256 threads is a reasonable size to choose.

```cpp
add<<<1, 256>>>(N, x, y);
```

If I run the code with only this change, it will do the computation once per thread, rather than spreading the computation across the parallel threads. To do it properly, I need to modify the kernel. CUDA C++ provides built-in variables that let kernels get the indices of the running threads. Specifically, `threadIdx.x` contains the index of the current thread within its block, and `blockDim.x` contains the number of threads in the block. I’ll just modify the loop to stride through the array with parallel threads.

```cpp
__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}
```

The `add` function hasn’t changed that much. In fact, setting `index = 0` and `stride = 1` makes it semantically identical to the single-thread version.
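To see which elements a given thread ends up with, the strided loop can be replayed on the CPU. This is a minimal sketch in plain C++, not CUDA code. The helper name `indicesForThread` and its parameters `threadIndex` and `blockSize` are hypothetical stand-ins for `threadIdx.x` and `blockDim.x`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: lists the array indices visited by the thread with
// index `threadIndex` in a block of `blockSize` threads, for an array of
// `n` elements. The loop has the same shape as the one in the kernel.
std::vector<int> indicesForThread(int n, int threadIndex, int blockSize)
{
  assert(blockSize > 0 && threadIndex >= 0 && threadIndex < blockSize);
  std::vector<int> visited;
  for (int i = threadIndex; i < n; i += blockSize)
    visited.push_back(i);
  return visited;
}
```

For `n = 1000` and a block of 256 threads, thread 3 visits indices 3, 259, 515, and 771, while thread 255 visits 255, 511, and 767 (its next index, 1023, is past the end). For the notebook's `N = 1 << 20`, each of the 256 threads handles 4096 elements.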

Now we’ll save the file as **`add_block.cu`**, compile it, and run it.  
When it runs, it will print **Kernel time (ms)** and **Max error** so you can compare performance and correctness with the previous version.


In [25]:
%%writefile add_block.cu
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

__global__
void add(int n, const float *x, float *y)
{
  int index  = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main()
{
  const int N = 1 << 20;

  float *x = nullptr, *y = nullptr;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // --- simple kernel timing using CUDA events ---
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  add<<<1, 256>>>(N, x, y);
  cudaEventRecord(stop);

  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::cout << "Kernel time (ms): " << ms << std::endl;

  cudaDeviceSynchronize(); // ensure results ready

  // check result
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  cudaFree(x);
  cudaFree(y);
  return 0;
}


Overwriting add_block.cu


In [26]:
%%shell
nvcc -O3 add_block.cu -o add_block -arch=sm_75
./add_block


Kernel time (ms): 3.60054
Max error: 0




That’s a big speed-up — compare the printed **Kernel time (ms)** from this run with the previous single-thread launch.  
That improvement is expected because we increased parallelism from **1 thread** to **256 threads**. Next, we’ll scale out to *many* blocks so the whole GPU can participate.

## Out of the Blocks

CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Tesla P100 GPU based on the [Pascal GPU Architecture](https://developer.nvidia.com/blog/inside-pascal/) has 56 SMs, each capable of supporting up to 2048 active threads. To take full advantage of all these threads, I should launch the kernel with multiple thread blocks.
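A quick calculation shows how little of the GPU a single 256-thread block can occupy. The sketch below is plain C++ arithmetic with hypothetical helper names. The P100 figures come from the paragraph above. The T4 figures used in the note that follows (40 SMs, up to 1024 resident threads per SM) are an assumption taken from NVIDIA's published specifications, not from this notebook.

```cpp
#include <cassert>

// Upper bound on the number of threads resident on the GPU at once.
long maxResidentThreads(int numSMs, int maxThreadsPerSM)
{
  assert(numSMs > 0 && maxThreadsPerSM > 0);
  return static_cast<long>(numSMs) * maxThreadsPerSM;
}

// Fraction of that capacity used by a launch of `launchedThreads` threads.
double capacityUsed(long launchedThreads, int numSMs, int maxThreadsPerSM)
{
  return static_cast<double>(launchedThreads) /
         maxResidentThreads(numSMs, maxThreadsPerSM);
}
```

On a P100, `maxResidentThreads(56, 2048)` is 114,688, so a `<<<1, 256>>>` launch uses about 0.22% of the available thread slots. With the assumed T4 figures the capacity is 40,960 threads and the same launch uses about 0.63%.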

By now you may have guessed that the first parameter of the execution configuration specifies the number of thread blocks. Together, the blocks of parallel threads make up what is known as the *grid*. Since I have `N` elements to process, and 256 threads per block, I just need to calculate the number of blocks to get at least `N` threads. I simply divide `N` by the block size (being careful to round up in case `N` is not a multiple of `blockSize`).

```cpp
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
```
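The rounding-up step is easy to check in isolation. Here is a small sketch in plain C++ that wraps the same expression in a function so a few values of `N` can be tried. The helper name `blocksFor` is hypothetical.

```cpp
#include <cassert>

// Smallest number of blocks of `blockSize` threads that gives at least `n`
// threads in total (integer ceiling division).
int blocksFor(int n, int blockSize)
{
  assert(n > 0 && blockSize > 0);
  return (n + blockSize - 1) / blockSize;
}
```

For `N = 1 << 20` and 256 threads per block this gives exactly 4096 blocks. For `N = 1000` it gives 4 blocks (1024 threads, 24 of which have nothing to do), and for `N = 1025` it gives 5, where plain integer division would have returned 4 and left one element unprocessed.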

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png" width="800">

I also need to update the kernel code to take into account the entire grid of thread blocks. CUDA provides `gridDim.x`, which contains the number of blocks in the grid, and `blockIdx.x`, which contains the index of the current thread block in the grid. Figure 1 illustrates the approach to indexing into a one-dimensional array in CUDA using `blockDim.x`, `gridDim.x`, and `threadIdx.x`. The idea is that each thread gets its index by computing the offset to the beginning of its block (the block index times the block size: `blockIdx.x * blockDim.x`) and adding the thread’s index within the block (`threadIdx.x`). The code `blockIdx.x * blockDim.x + threadIdx.x` is idiomatic CUDA.

```cpp
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
```

The updated kernel sets:

- `index = blockIdx.x * blockDim.x + threadIdx.x` (a unique starting index per thread across the whole grid)
- `stride = blockDim.x * gridDim.x` (the total number of threads in the grid)

This pattern is called a **grid-stride loop** and is a standard way to write kernels that scale to any `N`.
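One way to convince yourself that the pattern covers every element exactly once, whether the grid has fewer or more threads than `N`, is to replay the indexing on the CPU. The following is a sketch in plain C++ under that assumption, not CUDA code. The helper names `gridStrideVisits` and `coveredExactlyOnce` are hypothetical, and the loop variables stand in for `blockIdx.x` and `threadIdx.x`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: emulates a launch of `gridSize` blocks of `blockSize`
// threads on the CPU and counts how many times each of the `n` elements is
// visited by the grid-stride loop.
std::vector<int> gridStrideVisits(int n, int gridSize, int blockSize)
{
  assert(n > 0 && gridSize > 0 && blockSize > 0);
  std::vector<int> visits(n, 0);
  const int stride = blockSize * gridSize;  // blockDim.x * gridDim.x
  for (int block = 0; block < gridSize; block++) {        // blockIdx.x
    for (int thread = 0; thread < blockSize; thread++) {  // threadIdx.x
      const int index = block * blockSize + thread;
      for (int i = index; i < n; i += stride)
        visits[i]++;
    }
  }
  return visits;
}

// True when every element was visited exactly once.
bool coveredExactlyOnce(const std::vector<int> &visits)
{
  for (int v : visits)
    if (v != 1) return false;
  return true;
}
```

With `n = 1000`, a grid of 2 blocks of 256 threads (512 threads, fewer than `n`) and a grid of 4 blocks (1024 threads, more than `n`) both visit each element exactly once. In the first case some threads loop twice, and in the second case the last 24 threads never enter the loop.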

Now save the file as **`add_grid.cu`**, compile it, and run it.  
This version also demonstrates **Unified Memory prefetching** and prints an **Avg kernel time (ms)** over multiple iterations.


In [18]:
%%writefile add_grid.cu
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do {                                           \
  cudaError_t err = (call);                                             \
  if (err != cudaSuccess) {                                             \
    std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__        \
              << " : " << cudaGetErrorString(err) << std::endl;         \
    return 1;                                                           \
  }                                                                     \
} while (0)

__global__
void add(int n, const float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main()
{
  const int N = 1 << 20; // 1M
  float *x = nullptr, *y = nullptr;

  CUDA_CHECK(cudaMallocManaged(&x, N * sizeof(float)));
  CUDA_CHECK(cudaMallocManaged(&y, N * sizeof(float)));

  // init on CPU
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // launch config
  const int blockSize = 256;
  const int numBlocks = (N + blockSize - 1) / blockSize;

  // Prefetch Unified Memory to the active GPU
  int dev = 0;
  CUDA_CHECK(cudaGetDevice(&dev));
  CUDA_CHECK(cudaMemPrefetchAsync(x, N * sizeof(float), dev));
  CUDA_CHECK(cudaMemPrefetchAsync(y, N * sizeof(float), dev));
  CUDA_CHECK(cudaDeviceSynchronize());

  // Warm-up (also helps JIT / first-touch effects)
  add<<<numBlocks, blockSize>>>(N, x, y);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaDeviceSynchronize());

  // Timed loop using CUDA events
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));

  const int iters = 100;

  CUDA_CHECK(cudaEventRecord(start));
  for (int r = 0; r < iters; r++) {
    add<<<numBlocks, blockSize>>>(N, x, y);
  }
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));

  float ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

  std::cout << "Avg kernel time (ms): " << (ms / iters) << std::endl;

  // Validate: the kernel ran (1 warm-up + iters) times, and each run adds
  // x[i] = 1.0f to y[i], which started at 2.0f.
  const float expected = 2.0f + static_cast<float>(iters + 1);
  float maxError = 0.0f;
  for (int i = 0; i < N; i++) {
    maxError = fmax(maxError, fabs(y[i] - expected));
  }
  std::cout << "Max error: " << maxError << std::endl;

  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(x));
  CUDA_CHECK(cudaFree(y));
  return 0;
}


Writing add_grid.cu


In [21]:
%%shell
nvcc -O3 add_grid.cu -o add_grid -arch=sm_75
./add_grid


Avg kernel time (ms): 0.0535261
Max error: 0




### Optional: capture a GPU timeline with Nsight Systems
If `nsys` is available in your runtime, you can record a short CUDA trace for the **grid** version.  
This produces an `.nsys-rep` report file which you can download and open in the Nsight Systems GUI on your machine.

If the command is not found, you can skip this step — the printed CUDA-event timings above are enough for this notebook’s comparisons.


In [15]:
!nsys profile -t cuda -o add_profile ./add_grid
!ls -lh add_profile.nsys-rep


Collecting data...
Max error: 1
Generating '/tmp/nsys-report-b04d.qdstrm'
Generated:
	/content/add_profile.nsys-rep
-rw-rw-r-- 1 root root 75K Jan  3 23:15 add_profile.nsys-rep


That’s another big speed-up from running **multiple blocks** across the GPU.  
Compare the **Avg kernel time (ms)** printed by `add_grid.cu` with the earlier versions.

Your exact speed-ups will vary depending on the GPU Colab assigns. If your final speed-up is smaller than expected, the exercises below include ideas to investigate (especially memory behaviour and launch configuration).
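Vector add does almost no arithmetic per byte, so a useful way to judge a timing is to convert it into effective memory bandwidth. The sketch below is plain C++ arithmetic with hypothetical helper names. It assumes each element costs two 4-byte reads (`x[i]`, `y[i]`) and one 4-byte write (`y[i]`), and it counts 1 GB as 1e9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by one pass of y[i] = x[i] + y[i] over n floats:
// read x, read y, write y.
std::size_t bytesMoved(std::size_t n)
{
  return 3 * n * sizeof(float);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that took
// `ms` milliseconds to process n elements.
double effectiveBandwidthGBs(std::size_t n, double ms)
{
  assert(ms > 0.0);
  return static_cast<double>(bytesMoved(n)) / (ms * 1e-3) / 1e9;
}
```

For `N = 1 << 20` and the 0.0535261 ms average printed above, this works out to roughly 235 GB/s. That is a large share of the 320 GB/s peak that NVIDIA publishes for the T4 (a figure from the product specification, not from this notebook), which suggests the grid version is limited mainly by memory bandwidth.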

## Exercises

To keep you going, here are a few things to try on your own.

1. Browse the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/index.html). If you haven’t installed CUDA yet, check out the [Quick Start Guide](https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html) and the installation guides. Then browse the [Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) and the [Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html). There are also tuning guides for various architectures.
2. Experiment with `printf()` inside the kernel. Try printing out the values of `threadIdx.x` and `blockIdx.x` for some or all of the threads. Do they print in sequential order? Why or why not?
3. Print the value of `threadIdx.y` or `threadIdx.z` (or `blockIdx.y`) in the kernel. (Likewise for `blockDim` and `gridDim`). Why do these exist? How do you get them to take on values other than 0 (1 for the dims)?
4. If you have access to a [Pascal-based GPU](https://developer.nvidia.com/blog/inside-pascal/), try running `add_grid.cu` on it. Is performance better or worse than the results you measured above? Why? (Hint: read about [Pascal’s Page Migration Engine and the CUDA 8 Unified Memory API](https://developer.nvidia.com/blog/beyond-gpu-memory-limits-unified-memory-pascal/).) For a detailed answer to this question, see the post [Unified Memory for CUDA Beginners](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).

## Where to From Here

If you enjoyed this notebook and want to learn more, the [NVIDIA DLI](https://nvidia.com/dli) offers several in-depth CUDA programming courses.

For those of you just starting out, please consider [_Fundamentals of Accelerated Computing with CUDA C/C++_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about) which provides dedicated GPU resources, a more sophisticated programming environment, use of the [NVIDIA Nsight Systems™](https://developer.nvidia.com/nsight-systems) visual profiler, dozens of interactive exercises, detailed presentations, over 8 hours of material, and the ability to earn a DLI Certificate of Competency.

Similarly, for Python programmers, please consider [_Fundamentals of Accelerated Computing with CUDA Python_](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-02+V1/about).

For more intermediate and advanced CUDA programming materials, please check out the _Accelerated Computing_ section of the NVIDIA DLI [self-paced catalog](https://www.nvidia.com/en-us/training/online/).