# 02 Softmax

**FreeCodeCamp CUDA Course - Module 8**

Original Course: [https://www.youtube.com/watch?v=86FAWCzIe_4](https://www.youtube.com/watch?v=86FAWCzIe_4)

Source File: `02_softmax.cu`

---

## Overview

This notebook implements a batched softmax as a CUDA kernel and checks the first row of its output against a CPU reference implementation.

---

## Learning Objectives

By the end of this notebook, you will:

1. Understand CUDA kernel syntax and execution
2. Learn GPU memory allocation and data transfer

---

## Setup: Google Colab GPU

First, ensure you have enabled GPU in Colab:

1. Go to **Runtime** → **Change runtime type**
2. Select **T4 GPU** as Hardware accelerator
3. Click **Save**

Let's verify CUDA is available:

In [None]:
# Check GPU availability
!nvidia-smi

Now install the nvcc4jupyter plugin to compile CUDA code inline:

In [None]:
# Install nvcc4jupyter for inline CUDA compilation
!pip install nvcc4jupyter -q
%load_ext nvcc4jupyter

---

## Key Concepts

- **Kernel Function**: Uses `__global__` qualifier for GPU execution
- **Device Memory**: Allocated using `cudaMalloc`
- **Data Transfer**: Uses `cudaMemcpy` between host and device
- **Kernel Launch**: Syntax `kernel<<<blocks, threads>>>(...)`

---

## CUDA Implementation
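Before the full softmax program, here is a minimal sketch that touches each of the four key concepts once: a `__global__` kernel, `cudaMalloc`, `cudaMemcpy` in both directions, and a `<<<blocks, threads>>>` launch. The kernel `scale` and the host helper `scale_on_gpu` are hypothetical names chosen for illustration; they are not part of `02_softmax.cu`.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread scales one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard threads that fall past the end of the array
        data[i] *= factor;
    }
}

// Hypothetical host helper: scales host[0..n) in place using the GPU.
cudaError_t scale_on_gpu(float *host, float factor, int n) {
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&d_data, bytes);  // device memory
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_data, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up
        scale<<<blocks, threads>>>(d_data, factor, n);
        err = cudaGetLastError();  // reports an invalid launch configuration
    }
    if (err == cudaSuccess) {
        // A blocking cudaMemcpy on the default stream waits for the kernel.
        err = cudaMemcpy(host, d_data, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    return err;
}

int main() {
    float h[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaError_t err = scale_on_gpu(h, 2.0f, 4);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%.1f %.1f %.1f %.1f\n", h[0], h[1], h[2], h[3]);
    return 0;
}
```

To run this in the notebook, place it in its own cell that begins with the same cell magic as the softmax cell.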

In [None]:
%%cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Each thread produces one output element: row `bid`, column `tid`.
// Every thread rescans its whole row for the max and the sum, which is
// simple but redundant.
__global__ void softmax_cuda(float* input, float* output, int B, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int bid = blockIdx.y;                             // row (batch) index

    if (tid < N && bid < B) {
        int offset = bid * N;

        // Row maximum, subtracted below for numerical stability
        float max_val = input[offset];
        for (int i = 1; i < N; i++) {
            max_val = fmaxf(max_val, input[offset + i]);
        }

        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            sum += expf(input[offset + i] - max_val);
        }

        // Write only this thread's element
        output[offset + tid] = expf(input[offset + tid] - max_val) / sum;
    }
}

// CPU reference: in-place softmax over a single row of length N
void softmax(float *x, int N) {
    float max = x[0];
    for (int i = 1; i < N; i++) {
        if (x[i] > max) {
            max = x[i];
        }
    }
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        x[i] = expf(x[i] - max);
        sum += x[i];
    }
    for (int i = 0; i < N; i++) {
        x[i] /= sum;
    }
}

int main() {
    const int B = 32;    // Batch size
    const int N = 1024;  // Row length

    float *x_cpu = (float*)malloc(B * N * sizeof(float));
    float *x_gpu = (float*)malloc(B * N * sizeof(float));
    float *d_input, *d_output;

    // Initialize input vector
    for (int i = 0; i < B * N; i++) {
        x_cpu[i] = (float)rand() / RAND_MAX;  // Random values between 0 and 1
        x_gpu[i] = x_cpu[i];  // Host-side copy used as the GPU input
    }

    // Allocate device memory
    cudaMalloc((void**)&d_input, B * N * sizeof(float));
    cudaMalloc((void**)&d_output, B * N * sizeof(float));

    // Copy input data to device
    cudaMemcpy(d_input, x_gpu, B * N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel: grid x covers the N columns, grid y covers the B rows
    int threadsPerBlock = 256;
    int blocksPerGrid_x = (N + threadsPerBlock - 1) / threadsPerBlock;
    dim3 gridDim(blocksPerGrid_x, B);
    softmax_cuda<<<gridDim, threadsPerBlock>>>(d_input, d_output, B, N);

    // Copy result back to host
    cudaMemcpy(x_gpu, d_output, B * N * sizeof(float), cudaMemcpyDeviceToHost);

    // Compute softmax on CPU (for one batch as an example)
    softmax(x_cpu, N);

    // Compare results (for the first batch as an example)
    float max_diff = 0.0f;
    for (int i = 0; i < N; i++) {
        float diff = fabsf(x_cpu[i] - x_gpu[i]);
        if (diff > max_diff) {
            max_diff = diff;
        }
    }
    printf("Maximum difference between CPU and GPU results (first batch): %e\n", max_diff);

    // Clean up
    free(x_cpu);
    free(x_gpu);
    cudaFree(d_input);
    cudaFree(d_output);

    return 0;
}
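
Both the kernel and the CPU reference subtract a row maximum before calling the exponential. Written out for one row $x$ of length $N$ (the symbol $m$ is introduced here for the row maximum and does not appear in the code), the shifted form is equal to the textbook definition because the factor $e^{-m}$ cancels between numerator and denominator:

```latex
m = \max_{0 \le j < N} x_j, \qquad
\operatorname{softmax}(x)_i
  = \frac{e^{x_i}}{\sum_{j=0}^{N-1} e^{x_j}}
  = \frac{e^{x_i - m}}{\sum_{j=0}^{N-1} e^{x_j - m}}
```

Since every exponent $x_j - m$ is at most zero, each term lies in $(0, 1]$ and the exponential cannot overflow; the term for the maximum element is exactly 1, so the denominator is at least 1.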

## Exercises

Try these modifications:

1. **Modify Parameters**: Change kernel launch parameters and observe effects
2. **Add Error Checking**: Implement CUDA error checking for all API calls
3. **Performance Measurement**: Add timing code to measure execution time
4. **Extend Functionality**: Add new features building on this example
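
As a possible starting point for exercises 2 and 3, the sketch below wraps API calls in an error-checking macro and times a kernel with CUDA events. The macro `CUDA_CHECK`, the placeholder kernel `fill`, and the helper `time_fill_ms` are hypothetical names introduced here; swapping in the softmax kernel is left as part of the exercise.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line info if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel so the sketch compiles on its own.
__global__ void fill(float *out, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper: launches `fill` and returns the elapsed GPU time in ms.
float time_fill_ms(float *d_out, float value, int n) {
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    CUDA_CHECK(cudaEventRecord(start));
    fill<<<blocks, threads>>>(d_out, value, n);
    CUDA_CHECK(cudaGetLastError());          // catches launch-configuration errors
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));  // wait until the kernel has finished

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const int n = 1 << 20;
    float *d_out = NULL;
    CUDA_CHECK(cudaMalloc((void**)&d_out, n * sizeof(float)));

    float ms = time_fill_ms(d_out, 1.0f, n);
    printf("fill kernel took %.3f ms\n", ms);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Events are recorded on the GPU's own timeline, so the measurement excludes host-side overhead that a CPU timer around the launch would include. A single run is noisy; averaging several launches after a warm-up launch usually gives a steadier number.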

---

## Key Takeaways

- CUDA enables massive parallelism for compute-intensive tasks
- Proper memory management is crucial for performance
- Understanding the thread hierarchy helps write efficient kernels
- Always synchronize when needed to ensure correctness

---

## Next Steps

Continue to the next notebook in Module 8 to learn more CUDA concepts!

---

## Notes

*Use this space for your learning notes:*