# 00 Nvtx Matmul

**FreeCodeCamp CUDA Course - Module 5**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)

Source File: `00 nvtx_matmul.cu`

---

## Overview

Implement matrix multiplication using CUDA.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand CUDA kernel syntax and execution
2. Learn GPU memory allocation and data transfer
3. Profile CUDA code using NVTX

---

## Setup

Make sure you've completed the setup from the first notebook (GPU enabled, nvcc4jupyter installed).

---

## Key Concepts

- **Kernel Function**: Uses `__global__` qualifier for GPU execution
- **Device Memory**: Allocated using `cudaMalloc`
- **Data Transfer**: Uses `cudaMemcpy` between host and device
- **Kernel Launch**: Syntax `kernel<<<blocks, threads>>>(...)`
- **Synchronization**: `cudaDeviceSynchronize()` waits for GPU completion

---

## CUDA Implementation
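The launch syntax `kernel<<<blocks, threads>>>(...)` needs a block count that covers every matrix element, and each thread then derives its own row and column from its block and thread indices. The host-only sketch below models that arithmetic in plain C++ (it also compiles under nvcc). The helper names `ceilDiv`, `globalIndex`, and `idleThreads` are hypothetical and not part of the course file.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements
// along one dimension (ceiling division).
inline int ceilDiv(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Host-side model of the index a thread computes along one dimension:
// blockIdx * blockDim + threadIdx.
inline int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    assert(threadIdx >= 0 && threadIdx < blockDim);
    return blockIdx * blockDim + threadIdx;
}

// Threads along one dimension whose index falls outside [0, n).
// A bounds check in the kernel has to mask these off.
inline int idleThreads(int n, int blockSize) {
    return ceilDiv(n, blockSize) * blockSize - n;
}
```

For example, with 1000 elements per dimension and 16 threads per block dimension, `ceilDiv(1000, 16)` gives 63 blocks and `idleThreads(1000, 16)` gives 8 threads past the edge, which is why a kernel needs a guard such as `if (row < N && col < N)`.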


    %%cu
    #include <cuda_runtime.h>
    #include <nvtx3/nvToolsExt.h>
    #include <iostream>

    #define BLOCK_SIZE 16

    // Each thread computes one element of C = A * B (row-major, N x N).
    __global__ void matrixMulKernel(float* A, float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;

        // Guard against threads that fall outside the matrix.
        if (row < N && col < N) {
            for (int i = 0; i < N; i++) {
                sum += A[row * N + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        }
    }

    void matrixMul(float* A, float* B, float* C, int N) {
        // Outer NVTX range; the ranges below nest inside it in the profiler timeline.
        nvtxRangePush("Matrix Multiplication");

        float *d_A, *d_B, *d_C;
        // Compute the byte count in size_t so it cannot overflow an int for large N.
        size_t size = static_cast<size_t>(N) * N * sizeof(float);

        nvtxRangePush("Memory Allocation");
        cudaMalloc(&d_A, size);
        cudaMalloc(&d_B, size);
        cudaMalloc(&d_C, size);
        nvtxRangePop();

        nvtxRangePush("Memory Copy H2D");
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        nvtxRangePop();

        // Round the grid up so partial tiles at the edges are still covered.
        dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
        dim3 numBlocks((N + BLOCK_SIZE - 1) / BLOCK_SIZE, (N + BLOCK_SIZE - 1) / BLOCK_SIZE);

        nvtxRangePush("Kernel Execution");
        matrixMulKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, N);
        // The launch is asynchronous; wait here so the range covers the kernel's run time.
        cudaDeviceSynchronize();
        nvtxRangePop();

        nvtxRangePush("Memory Copy D2H");
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
        nvtxRangePop();

        nvtxRangePush("Memory Deallocation");
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        nvtxRangePop();

        nvtxRangePop();  // End of Matrix Multiplication
    }

    int main() {
        const int N = 1024;
        float *A = new float[N*N];
        float *B = new float[N*N];
        float *C = new float[N*N];

        // Initialize matrices A and B here...

        matrixMul(A, B, C, N);

        // Use result in C...

        delete[] A;
        delete[] B;
        delete[] C;

        return 0;
    }
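`cudaMalloc` and `cudaMemcpy` take their byte counts as `size_t`, and for large matrices that count no longer fits in a 32-bit `int`. The host-only sketch below computes the sizes in `size_t` from the start. The helper names `matrixBytes` and `deviceBytesNeeded` are hypothetical, and the figures in the note assume a 4-byte `float` and a 64-bit `size_t`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes occupied by one n x n float matrix.
// Widening n before multiplying keeps the product out of int arithmetic.
inline std::size_t matrixBytes(int n) {
    assert(n >= 0);
    const std::size_t side = static_cast<std::size_t>(n);
    return side * side * sizeof(float);
}

// Hypothetical helper: device memory needed for A, B and C together.
inline std::size_t deviceBytesNeeded(int n) {
    return 3 * matrixBytes(n);
}
```

Under those assumptions, `matrixBytes(1024)` is 4 MiB, while `matrixBytes(32768)` is 4 GiB, which is larger than any 32-bit `int` can represent. Comparing `deviceBytesNeeded(n)` with the GPU's available memory before allocating is one way to pick sizes for large-matrix experiments.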

## Exercises

Experiment with these modifications:

1. **Matrix Sizes**: Try different M, N, K dimensions
   - What happens with non-square matrices? The kernel above handles only square N×N inputs, so it must first be generalized to multiply an M×K matrix by a K×N matrix.
   - Test with very large matrices
2. **Verify Correctness**: Add code to verify the result against CPU computation
3. **Block Size Impact**: Experiment with different BLOCK_SIZE values
   - Measure performance for each
4. **Measure Throughput**: Calculate FLOPS (floating-point operations per second)
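As a starting point for the correctness and throughput exercises, here is a minimal host-only sketch. It assumes square, row-major matrices like the kernel in this notebook and counts one multiply and one add per inner-loop step, giving 2N³ floating-point operations in total. The names `matMulCpu`, `maxAbsDiff`, and `gflops` are hypothetical, and the elapsed time is assumed to come from your own measurement.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: C = A * B for row-major n x n matrices.
inline void matMulCpu(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    assert(A.size() == N * N && B.size() == N * N);
    C.assign(N * N, 0.0f);
    for (std::size_t row = 0; row < N; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < N; ++i) {
                sum += A[row * N + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        }
    }
}

// Largest absolute element-wise difference between two equally sized results.
inline float maxAbsDiff(const std::vector<float>& X, const std::vector<float>& Y) {
    assert(X.size() == Y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const float diff = std::fabs(X[i] - Y[i]);
        if (diff > worst) worst = diff;
    }
    return worst;
}

// Throughput in GFLOPS for an n x n multiply that took `seconds`,
// counting 2 * n^3 floating-point operations.
inline double gflops(int n, double seconds) {
    assert(n >= 0 && seconds > 0.0);
    const double ops = 2.0 * n * n * n;
    return ops / seconds / 1e9;
}
```

To verify a GPU result, copy it into a `std::vector<float>`, run `matMulCpu` on the same inputs, and compare `maxAbsDiff` against a small tolerance rather than expecting exact equality, since the two implementations may round differently. The tolerance itself is a judgment call that depends on N and on the magnitude of the inputs.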

---

## Key Takeaways

- CUDA enables massive parallelism for compute-intensive tasks
- Proper memory management is crucial for performance
- Understanding the thread hierarchy helps write efficient kernels
- Kernel launches are asynchronous, so synchronize before timing a kernel or relying on its completion

---

## Next Steps

Continue to the next notebook in Module 5 to learn more CUDA concepts!

---

## Notes

*Use this space for your learning notes:*