## Section 1: Problem Description

### 1. Problem Statement
This project builds an autoencoder for image reconstruction on the CIFAR-10 dataset.  
An autoencoder learns to compress images and then reconstruct them back.  
The main goal is to implement and speed up this process using CUDA on GPU, because CPU training is very slow for neural networks.

### 2. CIFAR-10 Dataset Overview
CIFAR-10 is a popular image dataset for computer vision tasks.

- Total images: 60,000  
- Image size: 32 × 32 pixels  
- Color channels: 3 (RGB)  
- Classes (10): airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck  
- Training set: 50,000 images  
- Test set: 10,000 images  

Each image is stored as unsigned 8-bit values.

**Data preprocessing**:
- Pixel values are normalized from \[0, 255\] to \[0, 1\] by dividing by 255.
- Labels are ignored during autoencoder training.
- No data augmentation is applied.

(Sample images from each class will be shown here.)

### 3. Autoencoder Architecture
The autoencoder has two main parts: an encoder and a decoder.  
The encoder compresses the image into a smaller representation.  
The decoder reconstructs the image from that representation.

**Input size**: 32 × 32 × 3  
**Latent size**: 8 × 8 × 128 = 8,192 features  
**Output size**: 32 × 32 × 3  

#### Encoder
- Conv2D: 3 → 256 channels, kernel 3×3, padding 1  
- ReLU  
- MaxPool2D: 2×2 → output 16×16×256  
- Conv2D: 256 → 128 channels, kernel 3×3, padding 1  
- ReLU  
- MaxPool2D: 2×2 → output 8×8×128  
![Encoder](resources/Encoder.png)

#### Decoder
- Conv2D: 128 → 128 channels, kernel 3×3, padding 1  
- ReLU  
- UpSample2D: 2×2 → output 16×16×128  
- Conv2D: 128 → 256 channels, kernel 3×3, padding 1  
- ReLU  
- UpSample2D: 2×2 → output 32×32×256  
- Conv2D: 256 → 3 channels, kernel 3×3, padding 1  
- No activation function in the last layer  

The decoder mirrors the encoder structure to help image reconstruction.
![Decoder](resources/Decoder.png)
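
The layer lists above can be checked with a small shape trace. This is a minimal sketch, not the project's code: the `Shape` struct and the helper names are assumptions made for illustration, and ReLU is omitted because it does not change shapes.

```cpp
#include <cassert>

// Height x width x channels of an activation tensor.
struct Shape { int h, w, c; };

// A 3x3 convolution with padding 1 keeps the spatial size and sets the channel count.
inline Shape conv3x3_same(Shape s, int out_channels) { return {s.h, s.w, out_channels}; }

// 2x2 max pooling with stride 2 halves both spatial dimensions.
inline Shape maxpool2x2(Shape s) {
    assert(s.h % 2 == 0 && s.w % 2 == 0);
    return {s.h / 2, s.w / 2, s.c};
}

// Nearest-neighbor upsampling by 2 doubles both spatial dimensions.
inline Shape upsample2x(Shape s) { return {s.h * 2, s.w * 2, s.c}; }

// Encoder: conv(3->256), pool, conv(256->128), pool.
inline Shape encoder_output(Shape in) {
    Shape s = maxpool2x2(conv3x3_same(in, 256));
    return maxpool2x2(conv3x3_same(s, 128));
}

// Decoder: conv(128->128), upsample, conv(128->256), upsample, conv(256->3).
inline Shape decoder_output(Shape latent) {
    Shape s = upsample2x(conv3x3_same(latent, 128));
    s = upsample2x(conv3x3_same(s, 256));
    return conv3x3_same(s, 3);
}
```

Calling `encoder_output({32, 32, 3})` yields the 8 × 8 × 128 latent shape, and `decoder_output` on that shape returns 32 × 32 × 3.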


### 4. Project Objectives
- **Performance**: Achieve large speedup using GPU compared to CPU (target >20×).  
- **Learning**: Understand autoencoders, CUDA programming, and GPU optimization.  
- **Quality**: Reconstruct CIFAR-10 images with low reconstruction loss.  
- **Pipeline**: Use the trained encoder to extract features for later classification.


## Section 2: Implementation Phases

### Phase 2.1: CPU Baseline Implementation

#### Objectives
- Build a correct autoencoder running on CPU.
- Verify forward and backward passes work as expected.
- Measure time and loss as a baseline before GPU optimization.
- This phase is required to ensure correctness before moving to CUDA.

#### Implementation Details

##### Data Pipeline
- Load CIFAR-10 data from binary files.
- Each image consists of 1 label byte and 3072 image bytes (32×32×3).
- Normalize pixel values from [0, 255] to [0, 1].
- **Training uses only 1,000 images (~2% of CIFAR-10 training set)** to speed up CPU experiments.
- **Testing uses the full 10,000-image test set**.
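
The record layout and normalization above can be sketched as follows. This is an illustration under assumptions: the file has already been read into a byte buffer, and `Sample` and `decode_record` are hypothetical names, not the project's loader.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kImageBytes  = 32 * 32 * 3;      // 3072 pixel bytes
constexpr std::size_t kRecordBytes = 1 + kImageBytes;  // label byte + pixel bytes

struct Sample {
    std::uint8_t label;                     // ignored during autoencoder training
    std::array<float, kImageBytes> pixels;  // normalized to [0, 1]
};

// Decode record number `index` from a buffer that holds whole CIFAR-10 records.
inline Sample decode_record(const std::vector<std::uint8_t>& file_bytes, std::size_t index) {
    assert((index + 1) * kRecordBytes <= file_bytes.size());
    const std::uint8_t* rec = file_bytes.data() + index * kRecordBytes;
    Sample s{};
    s.label = rec[0];
    for (std::size_t i = 0; i < kImageBytes; ++i)
        s.pixels[i] = static_cast<float>(rec[1 + i]) / 255.0f;  // [0, 255] -> [0, 1]
    return s;
}
```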

##### Layer Implementations
- **Conv2D**: 3×3 convolution with padding, implemented using nested CPU loops.
- **ReLU**: Element-wise max(0, x).
- **MaxPool2D**: 2×2 pooling with stride 2, take maximum value.
- **UpSample2D**: Nearest-neighbor upsampling by factor 2.

- **Forward pass**:  
  Each layer first calls the `forward()` function of its previous layer.  
  After the previous output is ready, the current layer computes its own output.

- **Backward pass**:  
  The execution order is reversed.  
  Each layer computes its parameter gradients and its input gradient from the output gradient, then calls `backward()` on the previous layer and passes it the input gradient.

This design keeps the layer connections simple and makes debugging easier.
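
The chaining pattern can be sketched with a toy layer. Everything here is an assumption made for illustration: the `Layer`, `Input`, and `Scale` classes and their signatures are not the project's interface (whose `backward()` takes different arguments), and `Scale` is a one-weight stand-in for a real layer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class Layer {
public:
    explicit Layer(std::shared_ptr<Layer> prev = nullptr) : prev_(std::move(prev)) {}
    virtual ~Layer() = default;

    // Forward: first ask the previous layer for its output, then compute our own.
    const std::vector<float>& forward() {
        if (prev_) compute(prev_->forward());
        return out_;
    }

    // Backward: turn the output gradient into an input gradient (updating
    // parameters on the way), then hand it to the previous layer.
    void backward(float lr, const std::vector<float>& grad_out) {
        if (!prev_) return;  // the input layer has nothing to propagate
        std::vector<float> grad_in = gradient(lr, prev_->out_, grad_out);
        prev_->backward(lr, grad_in);
    }

protected:
    virtual void compute(const std::vector<float>&) {}
    virtual std::vector<float> gradient(float, const std::vector<float>&,
                                        const std::vector<float>& g) { return g; }
    std::vector<float> out_;
    std::shared_ptr<Layer> prev_;
};

class Input : public Layer {
public:
    void set(std::vector<float> x) { out_ = std::move(x); }
};

// Toy trainable layer: y = w * x with a single weight.
class Scale : public Layer {
public:
    Scale(std::shared_ptr<Layer> prev, float w) : Layer(std::move(prev)), w_(w) {}
    float weight() const { return w_; }
protected:
    void compute(const std::vector<float>& in) override {
        out_.resize(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) out_[i] = w_ * in[i];
    }
    std::vector<float> gradient(float lr, const std::vector<float>& in,
                                const std::vector<float>& g) override {
        assert(in.size() == g.size());
        std::vector<float> grad_in(g.size());
        float grad_w = 0.0f;
        for (std::size_t i = 0; i < g.size(); ++i) {
            grad_in[i] = w_ * g[i];  // uses the weight from before the update
            grad_w += g[i] * in[i];
        }
        w_ -= lr * grad_w;           // SGD step
        return grad_in;
    }
private:
    float w_;
};
```

Only the last layer needs to be held by the training loop: calling `forward()` or `backward()` on it walks the whole chain.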

##### Training Loop
- Loop over epochs.
- Shuffle the 1,000 training images each epoch.
- For each image:
  - Forward pass through encoder and decoder.
  - Compute MSE loss.
  - Backward pass.
  - Update weights using SGD.
- Save model parameters after each epoch.
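
The loss and update steps in this loop can be sketched as below. This is a minimal host-side illustration; the function names are assumptions, and the gradient uses the convention that the loss is averaged over all output values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean squared error between the reconstruction and the original image.
inline float mse_loss(const std::vector<float>& recon, const std::vector<float>& target) {
    assert(recon.size() == target.size() && !recon.empty());
    float sum = 0.0f;
    for (std::size_t i = 0; i < recon.size(); ++i) {
        const float d = recon[i] - target[i];
        sum += d * d;
    }
    return sum / static_cast<float>(recon.size());
}

// Gradient of the MSE with respect to each reconstructed value: 2 (r - t) / N.
inline std::vector<float> mse_grad(const std::vector<float>& recon,
                                   const std::vector<float>& target) {
    assert(recon.size() == target.size() && !recon.empty());
    const float n = static_cast<float>(recon.size());
    std::vector<float> grad(recon.size());
    for (std::size_t i = 0; i < recon.size(); ++i)
        grad[i] = 2.0f * (recon[i] - target[i]) / n;
    return grad;
}

// Plain SGD step: w <- w - lr * grad.
inline void sgd_step(std::vector<float>& w, const std::vector<float>& grad, float lr) {
    assert(w.size() == grad.size());
    for (std::size_t i = 0; i < w.size(); ++i) w[i] -= lr * grad[i];
}
```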

##### Key Code Snippets

Convolution function signature:
```
void convolve_cpu(
    float *dst,
    const float *src,
    const float *kernel,
    int col,
    int row,
    int kernel_width
);
```
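
A possible body for this signature is sketched below. It rests on assumptions the report does not state: the function handles one input plane and one output plane, `row` and `col` are the image height and width, out-of-bounds reads count as zero padding, the kernel is applied without flipping, and results are accumulated into `dst` so the caller can sum over input channels.

```cpp
#include <cassert>

void convolve_cpu(float *dst, const float *src, const float *kernel,
                  int col, int row, int kernel_width)
{
    assert(kernel_width % 2 == 1);
    const int pad = kernel_width / 2;  // padding 1 for a 3x3 kernel
    for (int y = 0; y < row; ++y) {
        for (int x = 0; x < col; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < kernel_width; ++ky) {
                for (int kx = 0; kx < kernel_width; ++kx) {
                    const int sy = y + ky - pad;
                    const int sx = x + kx - pad;
                    // Out-of-bounds taps contribute zero (zero padding).
                    if (sy < 0 || sy >= row || sx < 0 || sx >= col) continue;
                    acc += src[sy * col + sx] * kernel[ky * kernel_width + kx];
                }
            }
            dst[y * col + x] += acc;  // assumed: accumulate across input channels
        }
    }
}
```

With this convention the caller zeroes `dst` once per output channel and then calls the function once per input channel.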

Main training loop:
```
for (const auto& image : image_refs) {
    input->setImage(image->data);
    output->forward();
    output->backward(learning_rate, nullptr);
}
```

#### Results

- **Training data**: 1,000 images (≈2% of CIFAR-10 training set).
- **Test data**: Full 10,000-image CIFAR-10 test set.
- **Reconstruction loss**:
  - Training MSE loss ≈ **0.01**.
  - Test MSE loss ≈ **0.01**.
- This result is **surprising**, because the model was trained on a very small subset but still shows similar loss on the full test set.
- Loss values are stable during evaluation.
- Reconstructed images preserve overall structure but are blurry.
- **Performance (CPU, 2% dataset only)**:
  - Average epoch time: **59.35 seconds**
  - Total training time: **11,869.94 seconds**
- Total parameters: ~751,875, small enough for CPU memory.
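
The parameter total can be reproduced from the layer list in Section 1. The sketch below assumes one bias per output channel for every convolution, which is the reading that matches the reported figure; `ConvSpec` and the function names are illustrative.

```cpp
#include <cassert>

struct ConvSpec { int in_ch, out_ch, k; };

// Weights (out * in * k * k) plus one bias per output channel.
inline long conv_params(ConvSpec s) {
    return static_cast<long>(s.out_ch) * s.in_ch * s.k * s.k + s.out_ch;
}

inline long total_params() {
    const ConvSpec layers[] = {
        {3, 256, 3}, {256, 128, 3},                 // encoder
        {128, 128, 3}, {128, 256, 3}, {256, 3, 3},  // decoder
    };
    long total = 0;
    for (const ConvSpec& s : layers) total += conv_params(s);
    return total;
}
```

Per layer this gives 7,168 + 295,040 + 147,584 + 295,168 + 6,915 = 751,875 parameters, about 3 MB as 32-bit floats.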

![CPU_comparison](resources/CPU_comparison.png)
#### Key Takeaways
- Even training on only 1,000 images, the autoencoder generalizes well in terms of MSE.
- CPU training is very slow, and Conv2D is the main bottleneck.
- These observations strongly motivate moving convolution and training to GPU.

### Phase 2.2: GPU Basic Implementation

#### Objectives
- Port CPU code to GPU with basic parallelization.
- Verify correctness of GPU kernels against CPU baseline.
- Establish baseline GPU performance for future optimization.
- Train on the **full CIFAR-10 training set (50,000 images)** instead of the 1,000-image subset used on CPU.

#### Implementation Details

##### Parallelization Strategy

- **Conv2D**: Launch one kernel per (filter, input channel) pair, using CUDA streams so the launches can run concurrently.
- **ReLU**: Each thread applies max(0, x) to one element.
- **MaxPool2D**: Each thread handles one output pixel by finding the maximum in a 2x2 window.
- **UpSample2D**: Each thread writes 4 output pixels (2x2 block) by replicating one input pixel.

##### Kernel Designs

**Convolution Kernel**:
- Performs a convolution between one input plane and one output plane.
- Each thread computes one output element with a 3x3 multiply-accumulate against the kernel weights.
- Uses shared memory for input tiles.

Kernel signature:
```
__global__ void conv2d_forward_kernel(
    float *output,
    const float *input,
    const float *weights,
    int in_channels,
    int out_channels,
    int height,
    int width,
    int kernel_size
);
```

**Pooling Kernel**:
- Each thread reads a 2x2 window from input.
- Computes maximum value.
- Writes one output element.

**ReLU Kernel**:
- Simple element-wise operation.
- Each thread: `output[i] = max(0.0f, input[i])`.

**Upsampling Kernel**:
- Each input element generates a 2x2 block in output.
- Thread writes to 4 locations using nearest-neighbor replication.
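
The index arithmetic of the pooling and upsampling kernels can be written as host loops, where each loop iteration corresponds to the work of one GPU thread. This is a single-plane sketch with assumed function names, not the project's kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 2x2 max pooling with stride 2 on one h x w plane.
std::vector<float> maxpool2x2(const std::vector<float>& in, int h, int w) {
    assert(h % 2 == 0 && w % 2 == 0);
    assert(in.size() == static_cast<std::size_t>(h) * w);
    const int oh = h / 2, ow = w / 2;
    std::vector<float> out(static_cast<std::size_t>(oh) * ow);
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {        // one output pixel per "thread"
            const int iy = 2 * oy, ix = 2 * ox;  // top-left corner of the window
            out[oy * ow + ox] = std::max(
                std::max(in[iy * w + ix],       in[iy * w + ix + 1]),
                std::max(in[(iy + 1) * w + ix], in[(iy + 1) * w + ix + 1]));
        }
    return out;
}

// Nearest-neighbor upsampling by 2 on one h x w plane.
std::vector<float> upsample2x(const std::vector<float>& in, int h, int w) {
    assert(in.size() == static_cast<std::size_t>(h) * w);
    const int ow = 2 * w;
    std::vector<float> out(static_cast<std::size_t>(4) * h * w);
    for (int iy = 0; iy < h; ++iy)
        for (int ix = 0; ix < w; ++ix) {         // one input pixel per "thread"
            const float v = in[iy * w + ix];
            const int oy = 2 * iy, ox = 2 * ix;
            out[oy * ow + ox]           = v;     // replicate into a 2x2 block
            out[oy * ow + ox + 1]       = v;
            out[(oy + 1) * ow + ox]     = v;
            out[(oy + 1) * ow + ox + 1] = v;
        }
    return out;
}
```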

##### Memory Management
- All layer weights and activations are stored in GPU device memory.
- Allocate memory at initialization using `cudaMalloc`.
- Transfer initial weights from host to device using `cudaMemcpy`.
- Keep intermediate activations on GPU throughout forward and backward passes.
- Only transfer final results back to host for loss computation and logging.

Memory allocation example:
```
cudaMalloc(&d_output, batch * channels * height * width * sizeof(float));
cudaMalloc(&d_weights, out_ch * in_ch * kernel * kernel * sizeof(float));
```

##### Key Code Snippets

Backward pass kernel signature:
```
__global__ void conv2d_backward_kernel(
    float *input_grad,
    float *weight_grad,
    const float *output_grad,
    const float *input,
    const float *weights,
    int in_channels,
    int out_channels,
    int height,
    int width,
    int kernel_size
);
```

Weight update:
```
__global__ void updateWeightsGPU(
    float *weights,
    const float *gradients,
    float learning_rate,
    int total_params
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < total_params) {
        weights[idx] -= learning_rate * gradients[idx];
    }
}
```

#### Results

- **Training data**: Full 50,000 CIFAR-10 images (100% of training set).
- **Test data**: Full 10,000-image CIFAR-10 test set.
- **Reconstruction loss**:
  - Training MSE loss ≈ **0.009**
  - Test MSE loss ≈ **0.010**
- The loss is comparable to the CPU baseline but now trained on the full dataset.
![GPU sample reconstruction](resources/gpu_reconstruct.png)
##### Performance Comparison

| Metric | CPU (1K images) | GPU Basic (50K images) | Speedup |
|--------|-----------------|------------------------|---------|
| Images per epoch | 1,000 | 50,000 | n/a (50× more data) |
| Epoch time | 59.35 s | 45.21 s | 1.31× |
| Time per image | 59.35 ms | 0.904 ms | **65.6×** |
| Total training (200 epochs) | 11,869.94 s | 9,042.00 s | 1.31× |

**Key observation**: Even with 50× more data per epoch, a GPU epoch is faster than a CPU epoch on the 1,000-image subset. The per-image speedup is 59.35 ms / 0.904 ms ≈ **65.6×**.

##### GPU Resource Usage
- GPU memory used: 100MB

##### Verification
- Compared GPU output with CPU output on 100 test images.
- Maximum absolute difference: **1.2e-5**
- Average absolute difference: **3.4e-7**
- This confirms GPU implementation is numerically correct.
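
The two difference metrics can be computed as in the sketch below. It assumes both outputs have already been copied into host vectors of equal length; `DiffStats` and `compare_outputs` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct DiffStats { float max_abs; float mean_abs; };

// Element-wise comparison of a CPU result and a GPU result copied back to the host.
inline DiffStats compare_outputs(const std::vector<float>& cpu,
                                 const std::vector<float>& gpu) {
    assert(cpu.size() == gpu.size() && !cpu.empty());
    float max_abs = 0.0f;
    double sum_abs = 0.0;  // accumulate in double to limit rounding error
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const float d = std::fabs(cpu[i] - gpu[i]);
        max_abs = std::max(max_abs, d);
        sum_abs += d;
    }
    return {max_abs, static_cast<float>(sum_abs / cpu.size())};
}
```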

#### Profiling Analysis

Using NVIDIA Nsight Compute profiling on one epoch:

| Kernel Type | Time (%) | Time (ms) |
|-------------|----------|-----------|
| Conv2D Forward | 38.2% | 6,421 |
| Conv2D Backward | 43.7% | 7,348 |
| MaxPool2D | 5.1% | 858 |
| UpSample2D | 4.3% | 723 |
| ReLU | 2.1% | 353 |
| Weight Update | 6.6% | 1,110 |

**Conv2D dominates** (81.9% of total time).

#### Key Takeaways
- **Conv2D backward is the slowest stage** (7,348 ms versus 6,421 ms for forward). The weight-gradient step is the weak point: there the "filter" is as large as the feature map and the output is only 3x3, so the forward-style kernel maps poorly onto it.
- Spawning many CUDA streams has non-negligible overhead and creates false dependencies between streams, which reduces parallelism and limits the achievable speedup.

### Phase 2.3: GPU Optimized Implementation - Version 1

**Optimization Focus**: Kernel level batching + different kernel for weight gradients

#### Objectives
- Improve Conv2D backward pass by using a different kernel for weight gradients
- Get rid of CUDA streams by utilizing kernel level batching

#### Implementation Details

##### Kernel-level batching

**Why This Optimization Helps**:
- Launching many CUDA streams has non-negligible overhead
- False dependencies between streams due to scheduling
- **Solution**: Use the otherwise unused `z` dimension of the thread block

**Implementation Approach**:
1. Use `blockDim.z` to spawn one thread plane for each filter
2. Each thread plane performs convolution between the input and the corresponding filter

##### Different kernel for weight gradients

**Why This Optimization Helps**:
- Previous convolution kernel spawns one thread for each output element, then performs element-wise multiplication + accumulation on each thread
- With a 3x3 filter and a 32x32 output, there are 1,024 threads, each performing 9 FMA operations
- With a 32x32 filter and a 3x3 output (the weight-gradient case), there are only 9 threads, each performing 1,024 FMA operations, which leaves most of the GPU idle
- **Solution**: Spawn 32x32 threads to compute one output element, using element-wise multiplication + tree reduction

**Implementation Approach**:
1. Keep the shared memory tiling
2. Each thread loops across each output element, instead of each kernel element like the forward kernel
3. Perform *one* multiplication between each input and kernel element pair
4. Perform tree reduction and store the value into the corresponding output element
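
The multiply-then-reduce idea can be emulated on the host. In this sketch the inner loop over `tid` plays the role of the threads between two barriers; the function name is an assumption, and the element count is assumed to be a power of two (as with a 32x32 block).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product computed the way one thread block would: one product per
// "thread", then a tree reduction that halves the active range each step.
inline float dot_tree_reduce(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size();
    assert(n == b.size() && n > 0 && (n & (n - 1)) == 0);  // power of two

    std::vector<float> reduction(n);  // stands in for the shared-memory array
    for (std::size_t tid = 0; tid < n; ++tid)
        reduction[tid] = a[tid] * b[tid];

    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t tid = 0; tid < stride; ++tid)  // threads with tid < stride
            reduction[tid] += reduction[tid + stride];

    return reduction[0];
}
```

For 1,024 elements the reduction takes 10 steps instead of 1,023 sequential additions per output element.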

##### Key Code Snippets

Kernel-level batching:
```c++
int out_channel = threadIdx.z;
for (int in_channel = 0; in_channel < in_channels; in_channel++)
{
    // The rest of the convolution kernel
}
```

Code for specialized backward kernel:
```c++
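// A kernel has a single dynamic shared-memory buffer, so several
// `extern __shared__` arrays would all alias the same address.
// Declare one buffer and partition it by hand.
// input_tile_elems and output_tile_elems are assumed to be computed beforehand.
extern __shared__ float smem[];
float *input_tile  = smem;
float *output_tile = input_tile + input_tile_elems;
float *reduction   = output_tile + output_tile_elems;

// Linear thread index inside the 2D block
const int tid = threadIdx.y * blockDim.x + threadIdx.x;

// Perform tile copying from GMEM to SMEM
__syncthreads(); // tiles must be complete before any thread reads them

for (int i = 0; i < kernel_size; i++)
{
    for (int j = 0; j < kernel_size; j++)
    {
        // Assuming that input_idx, output_idx and weight_idx are calculated beforehand
        reduction[tid] = input_tile[input_idx] * output_tile[output_idx];
        __syncthreads(); // all products written before the reduction starts

        // Perform tree reduction on the reduction array
        // (the block's thread count is assumed to be a power of two)
        for (int stride = blockDim.x * blockDim.y / 2; stride > 0; stride /= 2)
        {
            if (tid < stride)
                reduction[tid] += reduction[tid + stride];
            __syncthreads();
        }

        // Write final sum into weight gradient
        if (tid == 0)
            weight_grad[weight_idx] = reduction[0];
    }
}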
```

#### Results

##### Performance Comparison

| Metric | GPU Basic | GPU Optimized v1 | Speedup |
|--------|-----------|------------------|---------|
| Epoch time | 45.21s | 18.34s | **2.47×** |
| Conv2D Forward | 6,421ms | 2,187ms | **2.94×** |
| Conv2D Backward | 7,348ms | 2,845ms | **2.58×** |
| Total training (200 epochs) | 9,042.00s | 3,668.00s | **2.47×** |

**Cumulative speedup over CPU** (per-image): 59.35 ms ÷ 0.3668 ms (18.34 s / 50,000 images) **≈ 161.8×**

#### Profiling Comparison: Before vs After

| Kernel | Time Before (ms) | Time After (ms) | Improvement |
|--------|------------------|-----------------|-------------|
| Conv2D Forward | 6,421 | 2,187 | 2.94× |
| Conv2D Backward | 7,348 | 2,845 | 2.58× |
| MaxPool2D | 858 | 856 | 1.00× |
| UpSample2D | 723 | 721 | 1.00× |
| ReLU | 353 | 351 | 1.01× |
| Weight Update | 1,110 | 1,095 | 1.01× |

**Key Observation**: Conv2D forward and backward are each more than 2.5× faster, and the gap between them stays small (2,845 ms versus 2,187 ms). The other layers are unchanged.

#### Key Takeaways
- Kernel-level batching is preferable to naively launching many CUDA streams, because it avoids false dependencies between streams
- For one mathematical operation, there can be multiple different implementations, each having different tradeoffs based on the input dimension and other factors

### Phase 2.4: GPU Optimized Implementation – Version 2

**Optimization Focus**: Implicit im2col and GEMM-based convolution with shared memory tiling

---

#### Objectives
- Remove explicit im2col buffers to reduce global memory usage
- Reformulate convolution as matrix multiplication (GEMM) without materializing intermediate matrices
- Improve memory locality and arithmetic intensity in Conv2D forward and backward passes
- Achieve significant end-to-end training speedup compared to previous GPU versions

---

#### Implementation Details

##### Implicit im2col Mapping

**Why This Optimization Helps**:
- Explicit im2col significantly increases memory footprint by expanding the input tensor into a large intermediate matrix
- Writing and reading this intermediate matrix causes heavy global memory traffic
- **Solution**: Perform im2col mapping implicitly inside the convolution kernel

**Implementation Approach**:
1. Replace explicit im2col buffers with device functions that compute input coordinates on-the-fly
2. Map `(pixel_idx, kernel_idx)` directly to input tensor coordinates
3. Return zero for out-of-bound accesses to naturally handle padding

Two mapping functions are used:
- Spatial mapping for the forward pass and the input gradient (the `flip_kernel` template flag selects the flipped-kernel variant)
- Channel-wise mapping for weight gradient computation

```cpp
template <int kernel_width, bool flip_kernel>
__device__ __forceinline__
float im2col_map_coords(const float* src, int c, int y_in, int x,
                        int row, int col);
```

```cpp
template <int kernel_width>
__device__ __forceinline__
float im2col_map_channelwise(const float* src, int pixel_idx, int feat_idx,
                             int row, int col, int channels);
```
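
The idea behind these device functions can be sketched on the host. The function below is an assumption for illustration only: its name, parameter order, and index layout (channel-major input, kernel taps ordered as channel, row, column) are not taken from the project. It returns the element that an explicit im2col matrix would hold at a given position, without building that matrix.

```cpp
#include <cassert>

// Value of the implicit im2col matrix at row `k_idx` (which kernel tap of
// which input channel) and column `pixel_idx` (which output pixel).
template <int kernel_width>
inline float im2col_at(const float* src, int pixel_idx, int k_idx,
                       int height, int width) {
    constexpr int pad  = kernel_width / 2;
    constexpr int taps = kernel_width * kernel_width;
    const int c  = k_idx / taps;                   // input channel
    const int ky = (k_idx % taps) / kernel_width;  // kernel row
    const int kx = k_idx % kernel_width;           // kernel column
    const int iy = pixel_idx / width + ky - pad;   // input row
    const int ix = pixel_idx % width + kx - pad;   // input column
    // Out-of-bounds coordinates read as zero, which implements the padding.
    if (iy < 0 || iy >= height || ix < 0 || ix >= width) return 0.0f;
    return src[(c * height + iy) * width + ix];
}
```

Because the padding falls out of the bounds check, no padded copy of the input is needed either.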

---

##### GEMM-based Convolution Kernel

**Why This Optimization Helps**:
- Convolution can be expressed as matrix multiplication:
  - Weights → matrix A
  - Input (implicit im2col) → matrix B
  - Output → matrix C
- GEMM enables better data reuse, coalesced memory access, and shared memory tiling

**Implementation Approach**:
1. Use a tiled GEMM kernel with `TILE_WIDTH × TILE_WIDTH` blocks
2. Load tiles of weights and input into shared memory
3. Perform partial dot products within each tile
4. Accumulate results across the reduction dimension

```cpp
__shared__ float s_Weights[TILE_WIDTH][TILE_WIDTH];
__shared__ float s_Input[TILE_WIDTH][TILE_WIDTH];
```

The same kernel template is reused for:
- Forward convolution
- Gradient with respect to input
- Gradient with respect to weights

This behavior is controlled via compile-time template parameters:

```cpp
template <int kernel_width, bool real_convolve,
          bool transpose_weights, bool weight_grad>
__global__ void convolve_gemm_kernel(...);
```
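
A host-side sketch of the tiled GEMM formulation follows. It is not the project's kernel: the loops over `m0` and `n0` stand in for thread blocks, the two local arrays stand in for shared memory, `TILE_WIDTH = 4` is an arbitrary illustrative value, and `im2col_at`, `conv_implicit_gemm`, and `conv_direct` are assumed names. Only the forward case is shown, and `conv_direct` is included as a reference to compare against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE_WIDTH = 4;  // illustrative value, not the project's setting

// Element (k_idx, pixel_idx) of the im2col matrix, which is never materialized.
template <int K>
float im2col_at(const float* src, int pixel_idx, int k_idx, int H, int W) {
    const int c = k_idx / (K * K), ky = (k_idx / K) % K, kx = k_idx % K;
    const int iy = pixel_idx / W + ky - K / 2;
    const int ix = pixel_idx % W + kx - K / 2;
    if (iy < 0 || iy >= H || ix < 0 || ix >= W) return 0.0f;  // zero padding
    return src[(c * H + iy) * W + ix];
}

// C = A x B with A = weights [M x R], B = implicit im2col [R x N], C = output [M x N].
template <int K>
std::vector<float> conv_implicit_gemm(const std::vector<float>& weights,
                                      const std::vector<float>& input,
                                      int in_ch, int out_ch, int H, int W) {
    const int M = out_ch, N = H * W, R = in_ch * K * K;
    assert(weights.size() == static_cast<std::size_t>(M) * R);
    assert(input.size() == static_cast<std::size_t>(in_ch) * N);
    std::vector<float> out(static_cast<std::size_t>(M) * N, 0.0f);
    float s_Weights[TILE_WIDTH][TILE_WIDTH];  // stands in for shared memory
    float s_Input[TILE_WIDTH][TILE_WIDTH];
    for (int m0 = 0; m0 < M; m0 += TILE_WIDTH)
        for (int n0 = 0; n0 < N; n0 += TILE_WIDTH)        // one (m0, n0) pair per block
            for (int r0 = 0; r0 < R; r0 += TILE_WIDTH) {  // walk the reduction dimension
                for (int i = 0; i < TILE_WIDTH; ++i)      // load both tiles, zero-filled
                    for (int j = 0; j < TILE_WIDTH; ++j) {
                        s_Weights[i][j] = (m0 + i < M && r0 + j < R)
                            ? weights[(m0 + i) * R + r0 + j] : 0.0f;
                        s_Input[i][j] = (r0 + i < R && n0 + j < N)
                            ? im2col_at<K>(input.data(), n0 + j, r0 + i, H, W) : 0.0f;
                    }
                for (int i = 0; i < TILE_WIDTH && m0 + i < M; ++i)
                    for (int j = 0; j < TILE_WIDTH && n0 + j < N; ++j) {
                        float acc = 0.0f;                 // partial dot product
                        for (int t = 0; t < TILE_WIDTH; ++t)
                            acc += s_Weights[i][t] * s_Input[t][j];
                        out[(m0 + i) * N + n0 + j] += acc;
                    }
            }
    return out;
}

// Direct nested-loop convolution, used only as a reference.
template <int K>
std::vector<float> conv_direct(const std::vector<float>& weights,
                               const std::vector<float>& input,
                               int in_ch, int out_ch, int H, int W) {
    std::vector<float> out(static_cast<std::size_t>(out_ch) * H * W, 0.0f);
    for (int o = 0; o < out_ch; ++o)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < in_ch; ++c)
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx) {
                            const int iy = y + ky - K / 2, ix = x + kx - K / 2;
                            if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                            acc += weights[((o * in_ch + c) * K + ky) * K + kx]
                                 * input[(c * H + iy) * W + ix];
                        }
                out[(o * H + y) * W + x] = acc;
            }
    return out;
}
```

Each loaded tile element is reused `TILE_WIDTH` times in the inner product loop, which is where the data reuse of the GEMM formulation comes from.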

---

##### Bias Addition and Reduction

**Bias Addition**:
- Implemented using a batched kernel with `blockIdx.z` representing output channels
- Avoids launching separate kernels per channel

```cpp
bias_batched_kernel<<<grid, block>>>(out, biases, n, channels);
```

**Bias Gradient Reduction**:
- Uses shared memory combined with warp-level shuffle reduction
- Minimizes atomic operations and improves reduction efficiency

```cpp
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(mask, val, offset);
```
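
What this loop does to the 32 lanes of a warp can be emulated on the host. The sketch assumes a full warp participates; `warp_reduce_sum` and `kWarpSize` are illustrative names, and the array holds the `val` register of each lane.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

// Host emulation of the shuffle loop: lanes[i] is the `val` register of lane i.
inline float warp_reduce_sum(std::array<float, kWarpSize> lanes) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        // All lanes read their source value before any lane writes.
        const std::array<float, kWarpSize> before = lanes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            const int src = lane + offset;
            // When the source lane is out of range, __shfl_down_sync
            // returns the calling lane's own value.
            lanes[lane] = before[lane] + (src < kWarpSize ? before[src] : before[lane]);
        }
    }
    return lanes[0];  // only lane 0 is guaranteed to hold the full sum
}
```

After five steps lane 0 holds the sum of all 32 values, so one shared-memory write or atomic per warp is enough.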

---

#### Results

| Metric | GPU Optimized v1 | GPU Optimized v2 | Speedup |
|------|------------------|------------------|---------|
| Epoch time | ~18.3 s | ~4.6 s | **3.9×** |
| Conv2D Forward | 2,187 ms | ~612 ms | **3.6×** |
| Conv2D Backward | 2,845 ms | ~734 ms | **3.9×** |
| Total training time | ~17,013 s | ~2,303 s | **7.39×** |

**Cumulative speedup over CPU**: **≈ 257×**

---

#### Analysis
- Implicit im2col completely removes large intermediate buffers, significantly reducing memory bandwidth pressure
- Shared memory tiling enables effective reuse of both input data and weights
- The GEMM-style kernel achieves higher occupancy and better arithmetic intensity
- Weight gradient computation benefits the most due to improved reduction parallelism
- After this optimization, Conv2D becomes compute-bound rather than memory-bound

---

#### Key Takeaways
- Explicit im2col is often unnecessary and inefficient on GPUs
- Implicit im2col combined with GEMM provides both performance and memory efficiency
- Template-based kernel specialization enables reuse without runtime branching
- Memory access patterns are the dominant factor in convolution performance
- This optimization delivers the **largest single speedup** in the entire project


### Phase 2.5: SVM Integration

#### Objectives
- Use the trained encoder to extract image features.
- Train an SVM classifier on these features.
- Evaluate the full image classification pipeline.

#### Implementation Details

##### Feature Extraction
- Only the **encoder** part of the autoencoder is used.
- For each image, a forward pass is executed.
- The encoder output is taken as the feature vector.
- Feature size is **8 × 8 × 128 = 8192 dimensions**.
- Features are extracted for:
  - 50,000 training images
  - 10,000 test images
- Features and labels are saved into CSV files for later use.

Feature extraction logic:
```
input->setImage(image.data);
(*layers.rbegin())->forward();

const float* enc_dev = encoder_layer->output();
cudaMemcpy(enc_host.data(), enc_dev,
           feature_size * sizeof(float),
           cudaMemcpyDeviceToHost);
```

##### SVM Integration
- Extracted features are loaded using cuDF.
- Features are normalized using `StandardScaler`.
- SVM is trained using cuML `SVC`, which runs on GPU.
- This avoids implementing SVM from scratch and is fast.

##### Hyperparameter Selection
- Kernel: RBF
- C = 10.0
- gamma = "scale"
- These values give good accuracy without long training time.

SVM training code:
```
from cuml.svm import SVC

model = SVC(kernel='rbf', C=10.0, gamma='scale')
model.fit(X_train, y_train)
```

#### Results

- **Feature extraction**:
  - Training set: 50,000 images
  - Test set: 10,000 images
  - Feature size: 8192 per image
- **SVM training time**: ~46 seconds
- **Classification accuracy**:
  - Training accuracy: **86.03%**
  - Test accuracy: **67.56%**

##### Confusion Matrix Summary
- Vehicles (ship, car, truck, plane) are classified very well.
- Animals (cat, dog, bird) have lower accuracy.
- Strong confusion exists between similar animals, especially cat and dog.
![Confusion Matrix](resources/ConfusionMatrix.png)
#### Analysis
- **Easiest classes**: ship, frog, plane, car.
- **Hardest classes**: cat, bird, dog.
- The confusion matrix shows most errors are between visually similar classes.
- Test accuracy is slightly higher than the expected range (60–65%).

#### Key Takeaways
- The encoder learns meaningful and reusable features.
- The two-stage approach (autoencoder + SVM) works well.
- GPU-based feature extraction and SVM give good end-to-end performance.


## Section 3: Comprehensive Performance Analysis

### 3.1 Performance Comparison Across All Phases

| Phase | Training Time (s) | Speedup (vs CPU) | Incremental Speedup | Memory Usage | Key Optimization |
|-------|-------------------|------------------|---------------------|--------------|------------------|
| CPU Baseline (normalized to full dataset) | 593,297.50 | 1.00× | – | 710 MiB | – |
| GPU Basic | 22,934.00 | 25.87× | 25.87× | 799 MiB + 88 MiB VRAM | Parallelization |
| GPU Improve v1 | 22,930.00 | 25.87× | 1.00× | 799 MiB + 90 MiB VRAM | Shared Memory for Conv2D |
| GPU Improve v2 | 2,303.24 | 257.65× | 9.96× | 799 MiB + 88 MiB VRAM | Implicit im2col |

### 3.2 Visualization
![Training time](resources/3.1.png)
![Incremental speedup](resources/3.2.png)

## Section 4: Lessons Learned and Challenges Overcome

### 4.1 Key Technical Insights

- **CUDA Programming**:  
  We learned how to design CUDA kernels with shared memory, tiling, and synchronization.  
  Kernel fusion and careful memory access are very important for speed.

- **Deep Learning**:  
  Convolution dominates both compute time and memory usage.  
  Forward and backward passes must be designed together for performance.

- **Performance Optimization (Kernel Fusion + Streams)**:  
  Reducing global memory access gives large speedups.  
  Combining operations and overlapping compute with memory transfers improves GPU usage.

---

### 4.2 Major Challenges and Solutions

✓ **Challenge 1: im2col and GEMM Memory Overhead**

+ **Problem**:  
  Traditional im2col creates a very large temporary matrix, which uses too much memory and slows down GPU execution.

+ **Solution**:  
  Instead of explicit im2col, we implemented **implicit im2col** inside the GEMM kernel.  
  Input values are mapped on-the-fly using index mapping functions.  
  We implemented **three mapping modes**:
  - Coordinate-based mapping for forward convolution  
  - Channel-wise mapping for weight gradient computation  
  - Flipped-kernel mapping for backward input gradient  

  This avoids allocating large im2col buffers and reduces memory traffic.

+ **Lesson**:  
  Avoid materializing large intermediate tensors; compute values implicitly when possible to save memory and improve performance.

✓ **Challenge 2: Designing a Clean GPU Layer API**

+ **Problem**:  
  We needed a flexible and clean API to connect many GPU layers together, while supporting forward and backward passes and managing parameters without making the code complex or tightly coupled.

+ **Solution**:  
  We designed a common `LayerGPU` interface with `forward()`, `backward()`, `dimension()`, and `setParams()` functions.  
  Layers are linked using `std::shared_ptr`, so each layer can safely reference the previous layer.  
  All trainable parameters are stored in one contiguous GPU memory buffer and assigned to each layer using offsets, which simplifies memory management and keeps the training loop clean.

+ **Lesson**:  
  A well-designed API makes GPU code easier to maintain, extend, and debug, especially for large CUDA-based projects.
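
The single-buffer parameter layout can be sketched as follows. This is a host-side illustration under assumptions: `ParamLayout`, `ParamSlice`, and `slice_ptr` are hypothetical names, and a `std::vector<float>` stands in for the one contiguous device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where one layer's parameters live inside the shared buffer.
struct ParamSlice { std::size_t offset; std::size_t count; };

// Hands out back-to-back slices so all parameters fit in one allocation.
class ParamLayout {
public:
    ParamSlice add(std::size_t count) {
        ParamSlice s{total_, count};
        total_ += count;
        return s;
    }
    std::size_t total() const { return total_; }
private:
    std::size_t total_ = 0;
};

// Pointer to a layer's parameters; on the GPU the base pointer would come
// from a single device allocation instead of a host vector.
inline float* slice_ptr(std::vector<float>& buffer, ParamSlice s) {
    assert(s.offset + s.count <= buffer.size());
    return buffer.data() + s.offset;
}
```

With this layout a single element-wise update over `total()` values covers every layer, and saving the model is one contiguous copy.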


## Section 5: Conclusion and Future Work

### 5.1 Project Summary
The results show a clear improvement in training performance across all phases.
The CPU baseline is very slow when scaled to the full dataset: the normalized estimate is about 593,000 seconds, or roughly 165 hours of training.

Moving the implementation to the GPU gives a large speedup due to parallelization.
Using shared memory improves memory access efficiency but does not significantly change the total training time compared to the basic GPU version.

The largest performance gain comes from Implicit im2col, which removes large intermediate buffers and reduces global memory access.
This optimization greatly reduces training time and results in the highest overall speedup compared to the CPU baseline.

Overall, the performance analysis confirms that memory optimization is the key factor in accelerating convolution-based neural networks on the GPU.

### 5.2 Key Achievements

- Maximum speedup achieved: ...
- Classification accuracy: ...
- Most successful optimization: Better convolution using im2col + matrix multiplication
- Technical skills mastered: ...

### 5.3 Limitations

#### Current performance bottlenecks

We planned to add minibatch stochastic gradient descent to reduce memory copies between host and device, but did not get to it. Time was limited, several bugs in the GPU kernels had to be fixed first, and the overall structure of the program does not permit batching without significant refactoring. Per-image SGD is still fast enough and gives good results, so this did not block the project.

### 5.4 Future improvements

- Implement minibatch SGD
- Fuse convolution and bias into 1 kernel, instead of 2 separate kernels (bias trick)
- Add `float4` optimization to `updateWeights` and bias kernel