## 1. Setup and Build Configuration

In [None]:
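# Create the output directory first; nvcc does not create it for -o build/...
# -arch=sm_75 targets compute capability 7.5 (Turing, e.g. Tesla T4); adjust for other GPUs.
!mkdir -p build

# Baseline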
!nvcc -arch=sm_75 -o build/train_gpu_baseline src/train_gpu_baseline.cu src/kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_baseline
%cd ..

# Memory Pool
!nvcc -arch=sm_75 -o build/train_gpu_optimize_memory_pool src/train_gpu_optimize_memory_pool.cu src/kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_memory_pool
%cd ..

# Double Buffering
!nvcc -arch=sm_75 -o build/train_gpu_optimize_double_buffer src/train_gpu_optimize_double_buffer.cu src/kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_double_buffer
%cd ..

# Im2col
!nvcc -arch=sm_75 -o build/train_gpu_optimize_im2col src/train_gpu_optimize_im2col.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_im2col
%cd ..

# GEMM
!nvcc -arch=sm_75 -o build/train_gpu_optimize_gemm src/train_gpu_optimize_gemm.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_gemm
%cd ..

# Kernel Fusion
!nvcc -arch=sm_75 -o build/train_gpu_optimize_fused_kernels src/train_gpu_optimize_fused_kernels.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_fused_kernels
%cd ..

# CUDA Streams
!nvcc -arch=sm_75 -o build/train_gpu_optimize_cuda_streams src/train_gpu_optimize_cuda_streams.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_cuda_streams
%cd ..

# Pinned Memory
!nvcc -arch=sm_75 -o build/train_gpu_optimize_pinned_memory src/train_gpu_optimize_pinned_memory.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_pinned_memory
%cd ..

# All Optimizations
!nvcc -arch=sm_75 -o build/train_gpu_optimize_all src/train_gpu_optimize_all.cu src/optimize_kernel.cu src/cifar10_dataset.cpp -I include/
%cd build
!./train_gpu_optimize_all
%cd ..
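
The nine build-and-run steps above all follow one pattern, differing only in the variant name and the kernel source file. As a minimal sketch (the helper names `nvcc_command` and `build_and_run` are hypothetical, not part of the project), the same commands could be generated from a table in Python:

```python
import subprocess
from pathlib import Path

# Variant name -> kernel source it is compiled with (copied from the commands above).
VARIANTS = {
    "train_gpu_baseline": "src/kernel.cu",
    "train_gpu_optimize_memory_pool": "src/kernel.cu",
    "train_gpu_optimize_double_buffer": "src/kernel.cu",
    "train_gpu_optimize_im2col": "src/optimize_kernel.cu",
    "train_gpu_optimize_gemm": "src/optimize_kernel.cu",
    "train_gpu_optimize_fused_kernels": "src/optimize_kernel.cu",
    "train_gpu_optimize_cuda_streams": "src/optimize_kernel.cu",
    "train_gpu_optimize_pinned_memory": "src/optimize_kernel.cu",
    "train_gpu_optimize_all": "src/optimize_kernel.cu",
}


def nvcc_command(variant, kernel_src, arch="sm_75", build_dir="build"):
    """Return the nvcc argument list for one variant (hypothetical helper)."""
    return [
        "nvcc", f"-arch={arch}",
        "-o", f"{build_dir}/{variant}",
        f"src/{variant}.cu", kernel_src, "src/cifar10_dataset.cpp",
        "-I", "include/",
    ]


def build_and_run(variant, kernel_src, build_dir="build"):
    """Compile one variant, then run it from inside the build directory."""
    Path(build_dir).mkdir(exist_ok=True)
    subprocess.run(nvcc_command(variant, kernel_src, build_dir=build_dir), check=True)
    # cwd mirrors the `%cd build` / `%cd ..` pair used in the cell above.
    subprocess.run([f"./{variant}"], cwd=build_dir, check=True)


# Usage (requires nvcc and a CUDA-capable GPU, so it is left commented out):
# for name, kernel in VARIANTS.items():
#     build_and_run(name, kernel)
```

Keeping the variants in one dictionary makes it harder for a single command to drift from the others, for example by pairing a variant with the wrong kernel file.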

## Summary

This benchmark compares nine CUDA training variants: the baseline, seven individual optimizations, and one build that combines them all.

| Technique | Key Benefit |
|-----------|-------------|
| **Baseline** | Original implementation (reference for all speedups) |
| **Memory Pool** | Reuses device allocations, reducing fragmentation and allocation cost |
| **Double Buffering** | Overlaps loading the next batch with computation on the current one |
| **Im2col Transform** | Rewrites convolution as a matrix multiplication so it can use GEMM |
| **GEMM Operations** | Optimized matrix multiplication kernels |
| **Kernel Fusion** | Fewer kernel launches and fewer intermediate global-memory round trips |
| **CUDA Streams** | Runs computation and data transfers concurrently |
| **Pinned Memory** | Page-locked host memory for faster host-device transfers over PCIe |
| **All Combined** | Cumulative effect of all techniques above |

**Interpretation:**
- Speedup > 1: faster than baseline ✓
- Speedup = 1: no improvement
- Speedup < 1: slower than baseline
- Improvement (%): percentage reduction in execution time relative to the baseline
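
To make the two metrics concrete, here is a minimal sketch that assumes the conventional definitions: speedup is baseline time divided by variant time, and improvement is the relative reduction in time. The function names and the timings are hypothetical and are not measured results from this benchmark.

```python
def speedup(baseline_s, variant_s):
    """Speedup of a variant over the baseline (> 1 means faster)."""
    return baseline_s / variant_s


def improvement_pct(baseline_s, variant_s):
    """Percentage reduction in execution time relative to the baseline."""
    return (baseline_s - variant_s) / baseline_s * 100.0


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
baseline_s, variant_s = 120.0, 80.0
print(f"{speedup(baseline_s, variant_s):.2f}x, "
      f"{improvement_pct(baseline_s, variant_s):.1f}%")  # prints "1.50x, 33.3%"
```

Under these definitions a variant slower than the baseline gets a speedup below 1 and a negative improvement.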