# 📘 Task Explanation: Nsight Compute Profiling for Warp Stalls and Memory Stalls

## 🎯 Objective
The objective of this task is to use **Nsight Compute** to perform **kernel-level performance profiling** and identify the primary causes of performance loss, with a specific focus on **warp stalls** and **memory stalls**.

By completing this task, you will learn how to move beyond runtime measurements and understand **why a CUDA kernel is slow at the micro-architectural level**.

---

## 🧠 Background: What Is Nsight Compute?
**Nsight Compute** is NVIDIA’s **kernel-focused performance profiler**.  
Unlike Nsight Systems, which provides a timeline view of the entire application, Nsight Compute analyzes:
- Individual CUDA kernels
- Instruction throughput
- Memory behavior
- Warp execution efficiency

Nsight Compute answers the question:
> *“What is limiting the performance of this specific kernel?”*

---

## 🧩 Part A: Nsight Compute Profiling

### Task
Profile a CUDA kernel using Nsight Compute and collect detailed performance metrics.

You should:
- Run Nsight Compute on one or more CUDA kernels
- Collect metrics related to:
  - Warp execution
  - Memory access
  - Instruction scheduling

### Goal
Obtain a detailed breakdown of how warps are scheduled and where execution time is being lost.
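
One possible way to script the profiling run from the notebook is to assemble the `ncu` command line in Python. The sketch below is an assumption, not part of the task. The helper `build_ncu_command` is hypothetical. The section identifiers (`WarpStateStats`, `SchedulerStats`, `MemoryWorkloadAnalysis`) should be checked against `ncu --list-sections` for your installed version. The binary name and arguments mirror the skeleton compiled later in this notebook.

```python
import shlex

def build_ncu_command(app, app_args, kernel_name, report_name,
                      sections=("WarpStateStats", "SchedulerStats",
                                "MemoryWorkloadAnalysis"),
                      launch_skip=0, launch_count=1):
    """Return an argv list that profiles one kernel with Nsight Compute.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = ["ncu"]
    for section in sections:
        # Collect only the sections relevant to warp and memory stalls.
        cmd += ["--section", section]
    cmd += [
        "--kernel-name", kernel_name,        # profile only this kernel
        "--launch-skip", str(launch_skip),   # skip warmup launches
        "--launch-count", str(launch_count), # profile this many launches
        "-o", report_name,                   # write a report file
        app,
    ]
    cmd += [str(arg) for arg in app_args]
    return cmd

cmd = build_ncu_command("./ncu_stall_profile_skeleton", [1, 256, 0],
                        kernel_name="kernelToProfile",
                        report_name="ncu_report_stride1",
                        launch_skip=3)
print(shlex.join(cmd))
```

The printed line can be pasted into a `!` cell, or the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `ncu` is installed.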

---

## 🧩 Part B: Identify Warp Stalls

### What Is a Warp Stall?
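A **warp stall** occurs when a resident warp cannot issue its next instruction in a given cycle because of a hardware or dependency limitation. A stalled warp is not eligible for the scheduler to pick, so if too many warps are stalled at once, issue slots go unused.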

Common causes include:
- Instruction dependencies (e.g., waiting for a previous instruction to complete)
- Insufficient instruction-level parallelism (ILP)
- Execution pipeline contention
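
The link between dependency latency and ILP can be illustrated with a back-of-the-envelope latency-hiding estimate based on Little's law: to keep a scheduler issuing every cycle, roughly `latency × issue rate` independent instructions must be in flight, and these come from `active warps × ILP`. The sketch below is a deliberately simplified model. The function names are hypothetical and the latency values are illustrative, not measured. It ignores pipeline throttling, dual issue, and memory-level effects.

```python
import math

def warps_needed_to_hide_latency(latency_cycles, ilp, issue_per_cycle=1.0):
    """Warps required so that one instruction can issue every cycle.

    Little's law: in-flight instructions = latency * issue rate.
    Each warp contributes `ilp` independent instructions.
    """
    in_flight = latency_cycles * issue_per_cycle
    return math.ceil(in_flight / ilp)

def issue_utilization(active_warps, latency_cycles, ilp, issue_per_cycle=1.0):
    """Fraction of issue slots that can be filled (capped at 1.0)."""
    needed = latency_cycles * issue_per_cycle
    return min(1.0, active_warps * ilp / needed)

# Illustrative numbers only: a dependent chain with a 24-cycle latency.
for ilp in (1, 2, 4):
    need = warps_needed_to_hide_latency(24, ilp)
    util = issue_utilization(active_warps=8, latency_cycles=24, ilp=ilp)
    print(f"ILP={ilp}: warps needed={need}, utilization with 8 warps={util:.2f}")
```

In this model, doubling ILP (for example by unrolling a loop into independent accumulators) halves the number of warps needed, which is why low-ILP dependency chains show up as dependency stalls when occupancy is limited.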

### Task
Using Nsight Compute, identify:
- The dominant warp stall reasons
- The percentage of cycles spent stalled
- Whether stalls are due to compute or scheduling issues
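
Once per-reason warp state samples have been read from the report, ranking them is simple arithmetic. The sketch below assumes you have copied the counts into a dictionary by hand. The helper names and the sample counts are made up for illustration. The keys follow the warp state names shown in the Nsight Compute UI, where `Selected` is the state in which the warp actually issued an instruction, not a stall.

```python
def stall_breakdown(samples):
    """Return (reason, percent) pairs sorted from most to least frequent."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("no samples collected")
    ranked = sorted(samples.items(), key=lambda kv: kv[1], reverse=True)
    return [(reason, 100.0 * count / total) for reason, count in ranked]

def stalled_percent(samples, issued_key="Selected"):
    """Percentage of samples in which the warp did not issue."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("no samples collected")
    return 100.0 * (total - samples.get(issued_key, 0)) / total

# Hypothetical sample counts, not taken from a real report.
samples = {
    "Long Scoreboard": 6200,   # waiting on L1TEX (global/local/texture) data
    "Selected": 2200,          # warp issued an instruction
    "Wait": 900,               # fixed-latency dependency
    "Short Scoreboard": 400,   # waiting on shared memory or special math
    "Not Selected": 300,       # eligible, but another warp was picked
}

for reason, pct in stall_breakdown(samples):
    print(f"{reason:18s} {pct:5.1f}%")
print(f"Stalled: {stalled_percent(samples):.1f}%")
```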

---

## 🧩 Part C: Identify Memory Stalls

### What Is a Memory Stall?
A **memory stall** is a warp stall caused by the memory system: the warp is waiting for data to arrive from memory, or for a memory pipeline to accept its request.

Typical sources:
- Global memory latency
- Cache misses (L1 / L2)
- Uncoalesced memory accesses
- Register spills to local memory
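
To see why uncoalesced access is costly, it helps to count how many memory sectors a single warp-wide load touches. The sketch below is a host-side model, not a measurement. It assumes 32 threads per warp, 4-byte `float` elements, 32-byte sectors, and a warp whose first address is sector-aligned. The function names are hypothetical.

```python
def sectors_per_warp_request(stride_elems, elem_bytes=4, warp_size=32,
                             sector_bytes=32, base_addr=0):
    """Number of distinct sectors touched when lane i reads element i*stride."""
    sectors = {
        (base_addr + lane * stride_elems * elem_bytes) // sector_bytes
        for lane in range(warp_size)
    }
    return len(sectors)

def load_efficiency(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Useful bytes divided by bytes fetched for one warp-wide load."""
    addresses = {lane * stride_elems * elem_bytes for lane in range(warp_size)}
    useful = len(addresses) * elem_bytes
    fetched = sectors_per_warp_request(
        stride_elems, elem_bytes, warp_size, sector_bytes) * sector_bytes
    return useful / fetched

for stride in (1, 2, 4, 8):
    n = sectors_per_warp_request(stride)
    eff = load_efficiency(stride)
    print(f"stride={stride}: {n} sectors, efficiency={eff:.3f}")
# stride=1 touches 4 sectors; stride=4 touches 16; stride=8 touches 32.
```

Under these assumptions, a stride of 4 fetches four times as many bytes as the warp uses, which is the kind of difference a stride-1 versus stride-4 profiling comparison is meant to expose.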

### Task
Analyze memory-related stall metrics and determine:
- Whether the kernel is memory-bound
- Which memory level (global, L2, shared, local) is the bottleneck
- Whether access patterns are inefficient
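
A first-order check for whether a kernel is memory-bound is the roofline comparison between its arithmetic intensity (FLOP per byte moved) and the machine balance (peak FLOP/s divided by peak bytes/s). The sketch below is only an estimate. The function name is hypothetical, and the peak values of 8000 GFLOP/s and 320 GB/s are placeholders that should be replaced with the figures for your GPU. It identifies bandwidth limits, not latency limits, so a kernel can still stall on memory latency while using little bandwidth.

```python
def roofline_bound(flops, bytes_moved, peak_gflops, peak_gbps):
    """Classify a kernel as memory- or compute-bound with a roofline estimate."""
    intensity = flops / bytes_moved    # FLOP per byte for this kernel
    ridge = peak_gflops / peak_gbps    # intensity at which both limits meet
    attainable = min(peak_gflops, intensity * peak_gbps)
    return {
        "intensity": intensity,
        "ridge": ridge,
        "attainable_gflops": attainable,
        "bound": "memory" if intensity < ridge else "compute",
    }

# Per element: one 4-byte load and one 4-byte store, plus `iters` FMAs
# (2 FLOP each). Placeholder peaks: 8000 GFLOP/s and 320 GB/s.
for iters in (1, 256):
    result = roofline_bound(flops=2 * iters, bytes_moved=8,
                            peak_gflops=8000.0, peak_gbps=320.0)
    print(iters, result["bound"], f"{result['attainable_gflops']:.0f} GFLOP/s")
```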

---

## 📊 What to Look For in Nsight Compute

Key metrics and sections to examine:
- Warp stall breakdown (e.g., stalled on memory, stalled on dependencies)
- Memory throughput vs. theoretical peak
- Cache hit / miss rates
- Instruction issue efficiency
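
Two of these quantities are easy to sanity-check by hand: achieved memory bandwidth relative to peak, and cache hit rate. The sketch below uses hypothetical helper names and made-up inputs. The 1.0 ms kernel time, the 320 GB/s peak, and the hit and miss counts are illustrative placeholders, and the byte count assumes one read and one write of each of the skeleton's `1 << 24` floats.

```python
def achieved_bandwidth_gbps(bytes_moved, time_ms):
    """Achieved bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (time_ms * 1e-3) / 1e9

def percent_of_peak(achieved, peak):
    """Achieved value as a percentage of the theoretical peak."""
    return 100.0 * achieved / peak

def hit_rate_percent(hits, misses):
    """Cache hit rate in percent; 0.0 if there were no accesses."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

n = 1 << 24                    # number of floats, as in the skeleton
bytes_moved = 2 * n * 4        # assumed: one read and one write per element
bw = achieved_bandwidth_gbps(bytes_moved, time_ms=1.0)  # hypothetical time
print(f"achieved: {bw:.1f} GB/s, {percent_of_peak(bw, 320.0):.1f}% of peak")
print(f"hit rate: {hit_rate_percent(hits=750, misses=250):.1f}%")
```

A kernel that sits far below peak bandwidth while still showing mostly memory-related stalls is more likely limited by latency or by its access pattern than by raw bandwidth.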

---

## 🔍 Key Questions to Answer
- Are warps mostly stalled or actively executing?
- Is the kernel limited by memory latency or compute throughput?
- Are stalls caused by memory access patterns or instruction dependencies?
- Which optimizations (e.g., memory coalescing, unrolling, prefetching) could reduce stalls?

---

## 🧪 Deliverables
You should produce:
1. Nsight Compute profiling reports for selected kernels
2. A summary of dominant warp and memory stall reasons
3. A short analysis explaining:
   - Why these stalls occur
   - How they impact performance
   - What optimizations could mitigate them

---

## 🎓 What You Learn from This Task
By completing this task, you will understand:
- How to interpret Nsight Compute metrics
- The difference between memory-related stalls and other warp stalls (dependency, pipeline, and scheduling stalls)
- How low-level hardware behavior affects kernel performance
- How to connect profiling results to concrete optimization strategies

---

## 🚀 Relevance to ML Systems
Identifying warp and memory stalls is critical for optimizing:
- Matrix multiplication kernels
- Reduction and normalization kernels
- Attention and FlashAttention implementations
- Compiler-generated kernels (e.g., Triton)

This task trains you to reason about GPU performance at the **same level as professional ML systems engineers and GPU kernel engineers**.


In [None]:
!nvcc --version
!nvidia-smi

/bin/bash: line 1: nvcc: command not found
/bin/bash: line 1: nvidia-smi: command not found


In [None]:
!apt-get update
!apt-get install -y cuda-toolkit-12-4

In [None]:
%%writefile ncu_stall_profile_skeleton.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do {                                   \
  cudaError_t err = (call);                                     \
  if (err != cudaSuccess) {                                     \
    fprintf(stderr, "CUDA error %s:%d: %s\n",                   \
            __FILE__, __LINE__, cudaGetErrorString(err));       \
    std::exit(EXIT_FAILURE);                                    \
  }                                                             \
} while(0)

// ------------------------------------------------------------
// Task: Nsight Compute profiling + identify warp stalls & memory stalls
// NO SOLUTION: fill TODOs to create a kernel that you can profile.
//
// Options (pick one by implementing the kernel accordingly):
//  A) Memory-stall oriented kernel: strided global loads, cache-miss friendly
//  B) Warp-stall oriented kernel: dependency chain / low ILP
//  C) Compare two kernels with a flag to see stall breakdown differences
// ------------------------------------------------------------

__global__ void kernelToProfile(const float* __restrict__ in,
                                float* __restrict__ out,
                                int N,
                                int stride,
                                int iters) {
    // TODO:
    // - compute global thread index tid
    // - create your access pattern and/or dependency chain
    // - do repeated work controlled by iters
    // - write out[tid] to prevent the compiler from eliminating the loop
}

// ------------------------------------------------------------
// Optional: second variant to compare stalls (leave as TODO or unused)
// ------------------------------------------------------------
__global__ void kernelToProfileAlt(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int N,
                                   int stride,
                                   int iters) {
    // TODO
}

static void initHost(float* a, int N) {
    for (int i = 0; i < N; ++i) a[i] = 0.001f * (i % 1000);
}

int main(int argc, char** argv) {
    // Simple args: ./app <stride> <iters> <mode>
    // mode: 0 -> kernelToProfile, 1 -> kernelToProfileAlt
    int stride = (argc > 1) ? std::atoi(argv[1]) : 1;
    int iters  = (argc > 2) ? std::atoi(argv[2]) : 256;
    int mode   = (argc > 3) ? std::atoi(argv[3]) : 0;

    const int N = 1 << 24;
    const size_t bytes = size_t(N) * sizeof(float);

    float* hIn  = (float*)std::malloc(bytes);
    float* hOut = (float*)std::malloc(bytes);
    if (!hIn || !hOut) {
        fprintf(stderr, "Host malloc failed.\n");
        return 1;
    }
    initHost(hIn, N);

    float *dIn=nullptr, *dOut=nullptr;
    CUDA_CHECK(cudaMalloc(&dIn, bytes));
    CUDA_CHECK(cudaMalloc(&dOut, bytes));
    CUDA_CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(dOut, 0, bytes));

    // TODO: choose launch config
    int blockSize = 0; // TODO (e.g., 256)
    int gridSize  = 0; // TODO (e.g., (N + blockSize - 1) / blockSize)

    // Warmup (optional)
    for (int i = 0; i < 3; ++i) {
        if (mode == 0) kernelToProfile<<<gridSize, blockSize>>>(dIn, dOut, N, stride, iters);
        else           kernelToProfileAlt<<<gridSize, blockSize>>>(dIn, dOut, N, stride, iters);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Profile target launch (Nsight Compute will capture this kernel)
    if (mode == 0) kernelToProfile<<<gridSize, blockSize>>>(dIn, dOut, N, stride, iters);
    else           kernelToProfileAlt<<<gridSize, blockSize>>>(dIn, dOut, N, stride, iters);

    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));
    printf("Done. stride=%d iters=%d mode=%d out0=%f\n", stride, iters, mode, hOut[0]);

    CUDA_CHECK(cudaFree(dIn));
    CUDA_CHECK(cudaFree(dOut));
    std::free(hIn);
    std::free(hOut);
    return 0;
}




In [None]:
!nvcc -arch=sm_75 ncu_stall_profile_skeleton.cu -o ncu_stall_profile_skeleton
!./ncu_stall_profile_skeleton



In [None]:
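# Nsight Compute (the full set includes the stall-related sections)
# --launch-skip 3 skips the three warmup launches so that only the target launch is profiled.
!ncu --set full --kernel-name kernelToProfile --launch-skip 3 --launch-count 1 -o ncu_report_stride1 ./ncu_stall_profile_skeleton 1 256 0
!ncu --set full --kernel-name kernelToProfile --launch-skip 3 --launch-count 1 -o ncu_report_stride4 ./ncu_stall_profile_skeleton 4 256 0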