# CIFAR-10 Autoencoder + SVM with CUDA
**Final Project - Parallel Programming (CSC14120)**

---

## Table of Contents
1. Environment Setup
2. Compile Project
3. Phase 1: CPU Baseline
4. Phase 2: Naive GPU
5. Phase 3: Optimized GPU
6. Phase 4: SVM Classification
7. Results Comparison
8. Conclusion

## 1. Environment Setup

### Check GPU and CUDA

In [None]:
# Check GPU
!nvidia-smi

print("\n" + "="*60)
print("CUDA Version:")
print("="*60)
!nvcc --version
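The `!` shell commands above print an error when a tool is missing, but the notebook keeps running. A minimal sketch of a programmatic check, using only the standard library, is shown below. The helper names `find_cuda_tools` and `describe_cuda_tools` are hypothetical and not part of this project.

```python
import shutil


def find_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def describe_cuda_tools():
    """Return one status line per tool, without raising if a tool is absent."""
    lines = []
    for tool, path in find_cuda_tools().items():
        tag = "OK" if path else "MISSING"
        lines.append(f"[{tag}] {tool}: {path if path else 'not found on PATH'}")
    return lines


for line in describe_cuda_tools():
    print(line)
```

A missing `nvcc` here usually means the CUDA toolkit is not installed or not on `PATH`, which would make the compile step in section 2 fail.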

### Setup on Google Colab (uncomment if needed)

In [None]:
# === COLAB SETUP (uncomment) ===
# !git clone https://github.com/your-repo/project.git
# %cd project
# !wget https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
# !tar -xzf cifar-10-binary.tar.gz
# !mkdir -p third_party && cd third_party && git clone https://github.com/cjlin1/libsvm.git && cd libsvm && make

### Check project structure

In [None]:
import os
import sys

# Adjust for your environment
PROJECT_DIR = "/home/hahuy2004/LT_song_song/LT/Project"
# For Colab: PROJECT_DIR = "/content/project"

os.chdir(PROJECT_DIR)
print(f"Working directory: {os.getcwd()}")

print("\nChecking files...")
required = ["cifar-10-batches-bin", "src", "cuda", "include", "Makefile", "run_pipeline.py"]
for item in required:
    exists = os.path.exists(item)
    print(f"[{'OK' if exists else 'MISSING'}] {item}")
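The check above only confirms that `cifar-10-batches-bin` exists. As a rough sketch, the contents can also be sanity-checked against the published binary layout: each record is 1 label byte followed by 3072 pixel bytes (32x32 red plane, then green, then blue). The function names below are hypothetical; the project's own loader in `src` may work differently.

```python
import os

LABEL_BYTES = 1
IMAGE_BYTES = 3 * 32 * 32                 # 3072: red plane, then green, then blue
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES  # 3073 bytes per record


def parse_cifar10_records(raw):
    """Split raw bytes from a CIFAR-10 .bin file into (label, pixels) pairs."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError(f"size {len(raw)} is not a multiple of {RECORD_BYTES}")
    records = []
    for offset in range(0, len(raw), RECORD_BYTES):
        label = raw[offset]
        if not 0 <= label <= 9:
            raise ValueError(f"label {label} out of range at offset {offset}")
        pixels = raw[offset + LABEL_BYTES: offset + RECORD_BYTES]
        records.append((label, pixels))
    return records


def check_batch_file(path):
    """Return the number of records found in one batch file."""
    with open(path, "rb") as f:
        return len(parse_cifar10_records(f.read()))


# Guarded so the cell is harmless when the dataset has not been downloaded yet
batch_path = os.path.join("cifar-10-batches-bin", "data_batch_1.bin")
if os.path.exists(batch_path):
    print(f"{batch_path}: {check_batch_file(batch_path)} records")
```

Each official batch file holds 10,000 records, so a different count suggests a truncated download.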

### Import Python wrapper

In [None]:
sys.path.insert(0, PROJECT_DIR)
from run_pipeline import CIFARAutoencoderPipeline

# Initialize pipeline
pipeline = CIFARAutoencoderPipeline(PROJECT_DIR)

# Check setup
pipeline.check_setup()

---
## 2. Compile Project

In [None]:
# Compile all phases
print("="*60)
print("Compiling...")
print("="*60)

pipeline.compile_all()

# Check executables
!ls -lh build/phase*

---
## 3. Phase 1: CPU Baseline

In [None]:
import time

print("\n" + "="*60)
print("PHASE 1: CPU BASELINE")
print("="*60)

start = time.time()
result1 = pipeline.run_phase1_cpu()
phase1_time = time.time() - start

print(f"\n[RESULT] Phase 1 time: {phase1_time:.2f}s")
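The same start/stop timing pattern is repeated for every phase. One possible refactoring, sketched here with a hypothetical helper `timed`, wraps it in a function and uses `time.perf_counter()`, which is monotonic and meant for measuring intervals. Note that this still measures the wall-clock time of the whole wrapper call, not kernel time alone.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in this notebook fn would be e.g. pipeline.run_phase1_cpu
result, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.4f}s")
```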

---
## 4. Phase 2: Naive GPU

In [None]:
print("\n" + "="*60)
print("PHASE 2: NAIVE GPU")
print("="*60)

start = time.time()
result2 = pipeline.run_phase2_gpu()
phase2_time = time.time() - start

print(f"\n[RESULT] Phase 2 time: {phase2_time:.2f}s")
# Guard on the denominator to avoid ZeroDivisionError
if phase2_time > 0:
    print(f"[SPEEDUP] {phase1_time/phase2_time:.2f}x vs CPU")

---
## 5. Phase 3: Optimized GPU

### Optimizations Applied:
- Kernel fusion (Conv + ReLU)
- Pinned memory
- Async transfers
- Larger batch size (128 vs 64)
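To illustrate the idea behind the first item, here is a conceptual sketch of Conv + ReLU fusion in plain Python on a 1D signal. It is not the project's CUDA kernel, and all function names are hypothetical. It only shows that the fused version produces the same output while never storing the intermediate pre-activation values, which on a GPU saves one global-memory write and one read per element.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def conv1d_relu_unfused(signal, kernel):
    """Two passes: the pre-activation list is materialized, then re-read."""
    pre_activation = conv1d(signal, kernel)  # pass 1 writes an intermediate
    return relu(pre_activation)              # pass 2 reads it back


def conv1d_relu_fused(signal, kernel):
    """One pass: ReLU is applied to each output before it is stored."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out.append(acc if acc > 0.0 else 0.0)
    return out


signal = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
print(conv1d_relu_unfused(signal, kernel))
print(conv1d_relu_fused(signal, kernel))
```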

In [None]:
print("\n" + "="*60)
print("PHASE 3: OPTIMIZED GPU")
print("="*60)

start = time.time()
result3 = pipeline.run_phase3_optimized()
phase3_time = time.time() - start

print(f"\n[RESULT] Phase 3 time: {phase3_time:.2f}s")
# Guard on the denominator to avoid ZeroDivisionError
if phase3_time > 0:
    print(f"[SPEEDUP] {phase1_time/phase3_time:.2f}x vs CPU")
    print(f"[SPEEDUP] {phase2_time/phase3_time:.2f}x vs Naive GPU")

---
## 6. Phase 4: SVM Classification

In [None]:
print("\n" + "="*60)
print("PHASE 4: SVM CLASSIFICATION")
print("="*60)

start = time.time()
result4 = pipeline.run_phase4_svm(use_optimized=True)
phase4_time = time.time() - start

print(f"\n[RESULT] Phase 4 time: {phase4_time:.2f}s")

---
## 7. Results Comparison

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create comparison table
df = pd.DataFrame({
    "Phase": ["CPU Baseline", "Naive GPU", "Optimized GPU"],
    "Time (s)": [phase1_time, phase2_time, phase3_time],
    "Speedup vs CPU": [
        1.0,
        phase1_time/phase2_time if phase2_time > 0 else 0,
        phase1_time/phase3_time if phase3_time > 0 else 0
    ]
})

print("\n" + "="*60)
print("PERFORMANCE COMPARISON")
print("="*60)
print(df.to_string(index=False))

# Check target
final_speedup = phase1_time/phase3_time if phase3_time > 0 else 0
target_met = "PASS" if final_speedup >= 20 else "FAIL"
print(f"\n[TARGET] Speedup >=20x: {target_met} ({final_speedup:.2f}x)")
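The cell above falls back to a speedup of 0 when a timing is missing, which can be misread as a real measurement. A small alternative sketch, with hypothetical helpers `speedup` and `meets_target`, returns NaN for an invalid measurement and never reports it as meeting the target. The timings in the usage line are made up for illustration.

```python
import math


def speedup(baseline_s, candidate_s):
    """Return baseline/candidate, or NaN when the candidate time is not positive."""
    if candidate_s <= 0:
        return math.nan
    return baseline_s / candidate_s


def meets_target(baseline_s, candidate_s, target=20.0):
    """True only for a valid measurement whose speedup reaches the target."""
    s = speedup(baseline_s, candidate_s)
    return (not math.isnan(s)) and s >= target


# Made-up timings, for illustration only
print(speedup(100.0, 4.0), meets_target(100.0, 4.0))
```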

### Visualization

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Training time
colors = ["#3498db", "#e74c3c", "#2ecc71"]
ax1.bar(df["Phase"], df["Time (s)"], color=colors)
ax1.set_ylabel("Time (seconds)", fontsize=12)
ax1.set_title("Training Time Comparison", fontsize=14, fontweight="bold")
ax1.grid(axis="y", alpha=0.3)

# Speedup
ax2.bar(df["Phase"], df["Speedup vs CPU"], color=colors)
ax2.axhline(y=20, color="orange", linestyle="--", linewidth=2, label="Target: 20x")
ax2.set_ylabel("Speedup (x)", fontsize=12)
ax2.set_title("Speedup vs CPU Baseline", fontsize=14, fontweight="bold")
ax2.legend()
ax2.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

---
## 8. Conclusion

### Achievements
- Complete implementation of all 4 phases
- CPU baseline
- Naive GPU implementation
- Optimized GPU with kernel fusion, pinned memory, and async transfers
- Full pipeline with SVM

### Optimization Techniques (Phase 3)
1. **Kernel fusion:** Conv2D + ReLU merged into a single kernel → 15-20% faster
2. **Pinned memory:** `cudaMallocHost` → 2x faster host-device transfers
3. **Async transfers:** `cudaMemcpyAsync` → hides transfer latency
4. **Batch size:** 128 vs 64 → better GPU utilization
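The effect of the batch size on the amount of per-batch overhead can be estimated with simple arithmetic. The sketch below assumes training runs over the full 50,000-image CIFAR-10 training set, which this notebook does not state explicitly. The helper `batches_per_epoch` is hypothetical.

```python
import math

TRAIN_IMAGES = 50_000  # size of the CIFAR-10 training set (assumed fully used)


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to cover the dataset, counting a final partial batch."""
    return math.ceil(num_images / batch_size)


for batch_size in (64, 128):
    print(batch_size, batches_per_epoch(TRAIN_IMAGES, batch_size))
```

Doubling the batch size halves the number of batches per epoch (782 down to 391), so each layer's kernel is launched half as often and each launch has twice as much work to spread across the GPU.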

### Future Improvements
- Shared memory tiling
- Multi-stream execution
- Mixed precision (FP16)
- cuDNN integration
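As a rough illustration of the first item, the blocking pattern behind shared memory tiling can be sketched in plain Python with a tiled matrix multiply. This is only the loop structure, not a GPU implementation: in CUDA, each tile would be loaded once into `__shared__` memory by a thread block and reused by all of its threads. The function names and the tile size are hypothetical.

```python
def matmul_naive(a, b):
    """Reference product: every output element re-reads a full row and column."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, computed tile by tile so each small block is reused."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # On a GPU, these two tiles would be staged in shared memory
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_tiled(a, b) == matmul_naive(a, b))
```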

---

**Project completed!**