# 01 Naive Matmul

**FreeCodeCamp CUDA Course - Module 5**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)

Source File: `01_naive_matmul.cu`

---

## Overview

Implement a naive matrix multiplication in CUDA, where each GPU thread computes one element of the output matrix.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand CUDA kernel syntax and execution
2. Learn GPU memory allocation and data transfer

---

## Setup

Make sure you've completed the setup from the first notebook (GPU enabled, nvcc4jupyter installed).

---

## Key Concepts

- **Kernel Function**: Uses the `__global__` qualifier for GPU execution
- **Device Memory**: Allocated using `cudaMalloc`
- **Kernel Launch**: Syntax `kernel<<<blocks, threads>>>(...)`
- **Synchronization**: `cudaDeviceSynchronize()` waits for GPU completion

---

## CUDA Implementation

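```cuda
%%cu
#include <cuda_runtime.h>
#include <iostream>

// Naive matrix multiply: each thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all stored row-major.
__global__ void matrixMultiply(float* A, float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads that fall outside the matrix do nothing
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += A[row * K + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

int main() {
    // Define matrix dimensions
    const int M = 1024; // Number of rows in A and C
    const int N = 1024; // Number of columns in B and C
    const int K = 1024; // Number of columns in A and rows in B

    // Calculate matrix sizes in bytes (cast first so large sizes do not overflow int)
    size_t size_A = static_cast<size_t>(M) * K * sizeof(float);
    size_t size_B = static_cast<size_t>(K) * N * sizeof(float);
    size_t size_C = static_cast<size_t>(M) * N * sizeof(float);

    // Declare device pointers
    float *d_A, *d_B, *d_C;

    // Allocate device memory
    // Note: d_A and d_B are never filled here, so the values written to C are
    // meaningless; this cell only demonstrates allocation, launch, and cleanup.
    cudaMalloc(&d_A, size_A);
    cudaMalloc(&d_B, size_B);
    cudaMalloc(&d_C, size_C);

    // Kernel launch code: round the grid up so every element of C is covered
    const int BLOCK_SIZE = 16;
    dim3 blockDim(BLOCK_SIZE, BLOCK_SIZE);
    dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (M + blockDim.y - 1) / blockDim.y);
    matrixMultiply<<<gridDim, blockDim>>>(d_A, d_B, d_C, M, N, K);

    // Check for launch errors, then synchronize and check for execution errors.
    // This must happen before cudaFree so the kernel's status is what gets reported.
    cudaError_t error = cudaGetLastError();
    if (error == cudaSuccess) {
        error = cudaDeviceSynchronize();
    }

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    if (error != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
        return -1;
    }

    return 0;
}
```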
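The grid size in `main` rounds up so that every element of `C` gets a thread, and the `row < M && col < N` guard in the kernel discards the surplus. The following host-side C++ sketch reproduces that arithmetic so it can be checked without a GPU. The names `ceilDiv`, `Grid2D`, `launchGrid`, and `idleThreads` are hypothetical helpers introduced for illustration; they are not part of the original source or of the CUDA API.

```cpp
// Hypothetical helper (not in the original source): ceiling division for positive ints.
inline int ceilDiv(int n, int d) { return (n + d - 1) / d; }

// Hypothetical stand-in for CUDA's dim3, so this sketch compiles without nvcc.
struct Grid2D {
    int x;  // number of blocks along the columns of C (N)
    int y;  // number of blocks along the rows of C (M)
};

// Mirrors the gridDim computation in main(): x spans columns, y spans rows.
inline Grid2D launchGrid(int M, int N, int blockX, int blockY) {
    return Grid2D{ceilDiv(N, blockX), ceilDiv(M, blockY)};
}

// Number of launched threads that fail the (row < M && col < N) guard.
inline long long idleThreads(int M, int N, int blockX, int blockY) {
    Grid2D g = launchGrid(M, N, blockX, blockY);
    long long launched = 1LL * g.x * blockX * g.y * blockY;
    return launched - 1LL * M * N;
}
```

For the 1024 x 1024 case with 16 x 16 blocks, `launchGrid(1024, 1024, 16, 16)` gives a 64 x 64 grid with no idle threads. When the dimensions are not multiples of the block size, some threads in the last row and column of blocks fail the guard, which is why the kernel needs the bounds check.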

## Exercises

Experiment with these modifications:

1. **Matrix Sizes**: Try different M, N, K dimensions
   - What happens with non-square matrices?
   - Test with very large matrices
2. **Verify Correctness**: Add code to verify the result against CPU computation
3. **Block Size Impact**: Experiment with different `BLOCK_SIZE` values
   - Measure performance for each
4. **Measure Throughput**: Calculate FLOPS (floating-point operations per second)
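As a possible starting point for exercises 2 and 4, here is a minimal host-side sketch in C++ (it also compiles inside a `.cu` file). The function names `matmulCpu`, `maxError`, and `gflops` are hypothetical and are not part of the original source. The sketch assumes row-major storage, matching the kernel, and counts 2·M·N·K floating-point operations per multiplication (one multiply and one add per inner-loop step).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference using the same row-major indexing and loop order as the kernel.
void matmulCpu(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C, int M, int N, int K) {
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < K; ++i) {
                sum += A[static_cast<std::size_t>(row) * K + i] *
                       B[static_cast<std::size_t>(i) * N + col];
            }
            C[static_cast<std::size_t>(row) * N + col] = sum;
        }
    }
}

// Largest elementwise error, scaled by max(|ref|, 1) so that large values
// are compared relatively and small values absolutely.
// Assumes ref and got have the same length.
float maxError(const std::vector<float>& ref, const std::vector<float>& got) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        float scale = std::max(std::fabs(ref[i]), 1.0f);
        worst = std::max(worst, std::fabs(ref[i] - got[i]) / scale);
    }
    return worst;
}

// Throughput in GFLOPS for one M x K by K x N multiplication
// that took `seconds` of wall-clock or GPU time.
double gflops(int M, int N, int K, double seconds) {
    return 2.0 * M * N * K / seconds / 1e9;
}
```

To use it, fill host vectors for `A` and `B`, copy them to the device with `cudaMemcpy`, run the kernel, copy `d_C` back, and compare it with the output of `matmulCpu` using `maxError`. GPU and CPU results may differ slightly because of floating-point rounding, so compare against a small tolerance (for example `1e-4`, an assumed value to tune) instead of testing for exact equality.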

---

## Key Takeaways

- CUDA enables massive parallelism for compute-intensive tasks
- Proper memory management is crucial for performance
- Understanding the thread hierarchy helps write efficient kernels
- Kernel launches are asynchronous, so synchronize before reading results or measuring time

---

## Next Steps

Continue to the next notebook in Module 5 to learn more CUDA concepts!

---

## Notes

*Use this space for your learning notes:*