Course: PHYS 421 - Parallel Computing (Fall 2025)
Assignment: #5 - Linear Regression
Due: October 29, 2025 (with 24-hour extension)
This project implements full-batch gradient descent (FBGD) for linear regression, comparing CPU serial and GPU implementations. The assignment demonstrates:
- Vector microbenchmarks (SAXPY and dot product)
- FBGD linear regression from scratch with custom CUDA kernels
- Scaling studies on synthetic data
- Real-data application with performance analysis
Requirements:
- GPU: NVIDIA GPU with CUDA Compute Capability ≥6.0 (tested on V100/A100)
- CPU: Modern x86_64 processor (tested on Intel Xeon)
- RAM: Minimum 8 GB (16 GB recommended for large N)
- CUDA Toolkit: ≥11.0 (tested with CUDA 11.8)
- GCC/G++: ≥9.0 with C++17 support
- Python: ≥3.8 (for plotting and preprocessing)
- Libraries:
  - NumPy ≥1.20
  - Matplotlib ≥3.3
  - pandas ≥1.2
  - PyYAML ≥5.4
Install the Python dependencies:
pip install numpy matplotlib pandas pyyaml scikit-learn
Repository layout:
assignment-5-linear-regression/
├── README.md # This file
├── Makefile # Build system
├── .gitignore # Git ignore rules
├── include/ # Header files
├── src/ # Source code
│ ├── common/ # Shared utilities
│ ├── part0/ # Vector microbenchmarks
│ ├── part1/ # FBGD implementation
│ └── part2/ # Real-data application
├── scripts/ # Run and plotting scripts
│ ├── hpc/ # Slurm batch scripts
│ └── plot_*.py # Python plotting scripts
├── configs/ # YAML configuration files
├── data/ # Datasets and metadata
│ ├── real/ # Real dataset
│ └── meta/ # Seeds and subsample indices
├── results/ # CSV output logs (gitignored)
├── plots/ # Generated figures (gitignored)
├── logs/ # Job logs (gitignored)
└── report/ # Final PDF report
Build all targets:
make all
This compiles:
- bin/part0_microbench: Vector microbenchmarks
- bin/part1_fbgd: Linear regression scaling study
- bin/part2_real: Real-data application
Run the full pipeline on a local machine with a GPU:
bash scripts/run_all.sh
Or on an HPC cluster (Slurm):
# Load modules
source scripts/hpc/env.module
# Submit full pipeline
sbatch scripts/hpc/run_all.sbatch
Generate the plots:
# Part 0: Vector microbenchmarks
python scripts/plot_part0.py
# Part 1: FBGD scaling study
python scripts/plot_part1.py
# Part 2: Real-data results
python scripts/plot_part2.py
All plots will be saved to plots/part{0,1,2}/.
make part0 # Vector microbenchmarks only
make part1 # FBGD implementation only
make part2 # Real-data application only
make common   # Shared utilities only
make DEBUG=1  # Debug build with -g -G
make VERBOSE=1 # Verbose compilation output
make ARCH=sm_70 # Target specific GPU architecture
make clean      # Remove object files and executables
make cleanall   # Remove build artifacts and results
Part 0, single run:
./bin/part0_microbench --operation saxpy --size 1000000 --trials 10 --warmup 3
./bin/part0_microbench --operation dot --size 1000000 --trials 10 --warmup 3
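For orientation, the sketch below shows the kind of SAXPY kernel Part 0 benchmarks (y = a·x + y, one thread per element). It is a minimal illustrative version, not the repository's actual code; the kernel name, sizes, and launch configuration are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i], one thread per element.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                        // ~1M elements (placeholder size)
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;                        // typical block size (128-256 threads)
    const int grid  = (n + block - 1) / block;
    saxpy_kernel<<<grid, block>>>(n, 3.0f, dx, dy);
    cudaDeviceSynchronize();

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected 5.0)\n", hy[0]); // 3*1 + 2

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

The dot-product microbenchmark additionally needs a reduction; a generic shared-memory reduction pattern is sketched in the optimization notes near the end of this README.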
Full sweep:
bash scripts/run_part0.sh
Part 1, single configuration:
./bin/part1_fbgd --N 100000 --D 16 --eta 0.001 --tol 1e-4 --maxiter 10000 --seed 42
Full sweep over N:
bash scripts/run_part1.sh
On an HPC cluster:
sbatch scripts/hpc/part1_array.sbatch
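For context, each FBGD iteration minimizes the mean squared error (1/N)‖Xw − y‖² by stepping along its gradient, g = (2/N) Xᵀ(Xw − y), i.e. w ← w − η·g. The sketch below shows one way to express a single step in CUDA (row-major X, one thread per sample, atomic gradient accumulation). It is illustrative only, not the repository's kernel; the optimization notes later in this README recommend an SoA layout and shared-memory reductions instead of the atomics used here.

```cuda
#include <cuda_runtime.h>

// One FBGD step for linear regression: w <- w - eta * (2/N) * X^T (X w - y).
// X is N x D in row-major order; each thread handles one sample and atomically
// accumulates its contribution to the gradient. Illustrative layout and names.
__global__ void fbgd_gradient(const float* X, const float* y, const float* w,
                              float* grad, int N, int D) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    // Residual r_i = x_i . w - y_i
    float r = -y[i];
    for (int j = 0; j < D; ++j)
        r += X[(size_t)i * D + j] * w[j];

    // grad[j] += (2/N) * r * x_ij
    float scale = 2.0f * r / N;
    for (int j = 0; j < D; ++j)
        atomicAdd(&grad[j], scale * X[(size_t)i * D + j]);
}

// w[j] <- w[j] - eta * grad[j]; D is small (e.g. 16), so one block suffices.
__global__ void update_weights(float* w, const float* grad, float eta, int D) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < D) w[j] -= eta * grad[j];
}

// Host-side driver for a single iteration (all pointers are device pointers).
void fbgd_step(const float* dX, const float* dy, float* dw, float* dgrad,
               int N, int D, float eta) {
    cudaMemset(dgrad, 0, D * sizeof(float));
    const int block = 256, grid = (N + block - 1) / block;
    fbgd_gradient<<<grid, block>>>(dX, dy, dw, dgrad, N, D);
    update_weights<<<1, D>>>(dw, dgrad, eta, D);
}
```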
Part 2, prepare the dataset:
cd data/real
bash fetch_real_data.sh
python preprocessing.py
cd ../..
Run the experiment:
./bin/part2_real --config configs/part2.yaml
Configuration files in configs/ use YAML format:
Example: configs/sweep_part1.yaml
problem:
  D: 16
  N_values: [30000, 100000, 300000, 1000000]
  seed: 42
  noise_sigma: 0.5
solver:
  eta: 0.001
  tolerance: 1e-4
  max_iterations: 10000
timing:
  warmup_trials: 3
  timed_trials: 10
  timeout_seconds: 60
All experiments follow strict timing requirements:
- Warm-up runs: 3 iterations (discarded)
- Timed runs: 10 iterations
- Reporting: Mean ± standard deviation
- Kernel-Only: Compute only (data pre-loaded on device)
- End-to-End: Host-to-device + compute + device-to-host
- Serial: Single-threaded CPU baseline timed with std::chrono
- Timeout: Cap at 60 seconds or 10× GPU time (whichever is smaller)
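A minimal sketch of how the two GPU timing modes can be collected: CUDA events bracket the kernel for kernel-only numbers, while std::chrono brackets transfers plus compute for end-to-end numbers. The kernel and all names here are placeholders, not the repository's harness.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder compute kernel standing in for the real workload.
__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    const int n = 1 << 20, warmup = 3, trials = 10;
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    std::vector<double> kernel_ms, e2e_ms;

    for (int t = 0; t < warmup + trials; ++t) {
        auto t0 = std::chrono::steady_clock::now();            // end-to-end start
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cudaEventRecord(beg);                                   // kernel-only start
        dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);

        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        auto t1 = std::chrono::steady_clock::now();             // end-to-end stop

        if (t < warmup) continue;                               // discard warm-up runs
        float k = 0.0f;
        cudaEventElapsedTime(&k, beg, end);                     // milliseconds
        kernel_ms.push_back(k);
        e2e_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Report mean +/- standard deviation over the timed trials.
    auto mean_std = [](const std::vector<double>& v, double& m, double& s) {
        m = 0.0; for (double x : v) m += x; m /= v.size();
        s = 0.0; for (double x : v) s += (x - m) * (x - m); s = std::sqrt(s / v.size());
    };
    double km, ks, em, es;
    mean_std(kernel_ms, km, ks);
    mean_std(e2e_ms, em, es);
    printf("kernel-only: %.3f +/- %.3f ms   end-to-end: %.3f +/- %.3f ms\n", km, ks, em, es);

    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaFree(d);
    return 0;
}
```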
Random seeds are stored in data/meta/seeds.txt:
master_seed: 42
part0_seed: 12345
part1_seed: 67890
part2_seed: 24680
Pre-generated subsample indices for different N values are stored in data/meta/indices/ as .npy files. These ensure identical subsets across runs.
cd data/meta
python generate_indices.py --seed 42 --N_max 3000000 --D 16
Expected results:
- Crossover point: the GPU overtakes the serial CPU around N ≈ 10^5 to 10^6
- Speedup: 10-100× for large N (kernel-only)
- Overhead: End-to-end includes ~10-100 μs launch + transfer costs
- Convergence: ~100-1000 iterations to reach tolerance 10^-4
- GPU advantage: Grows with N; 5-50× speedup for N ≥ 10^6
- Bottleneck: Memory bandwidth for large matrices
- Runtime comparison on fixed-size real dataset
- Validation: residual norm and MSE on held-out data
Results are saved as CSV files in results/:
N,D,implementation,timing_type,mean_time,std_time,iterations,flops
100000,16,cpu_serial,compute,0.453,0.012,543,NA
100000,16,gpu,kernel_only,0.034,0.002,543,1.2e9
100000,16,gpu,end_to_end,0.067,0.003,543,1.2e9
Generated plots include:
- Runtime vs N (log-log scale)
- Speedup vs N (GPU/CPU ratio)
- Throughput vs N (GFLOP/s)
- Convergence curves (MSE or gradient norm vs iterations)
Troubleshooting:
If you run out of memory, reduce the problem size or use smaller data types:
# Edit configs/sweep_part1.yaml
N_values: [30000, 100000, 300000]  # Remove 1M, 3M
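If out-of-memory failures persist, it can help to compare the device's free memory against a rough estimate of the problem's footprint; cudaMemGetInfo is the standard CUDA runtime call for this. The sizes below are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    // Rough float32 footprint of one problem instance: X (N*D) plus y, w, grad.
    const size_t N = 1000000, D = 16;               // placeholder sizes
    const size_t need = (N * D + N + 2 * D) * sizeof(float);

    printf("GPU memory: %.2f GiB free of %.2f GiB; problem needs ~%.3f GiB\n",
           free_b / 1073741824.0, total_b / 1073741824.0, need / 1073741824.0);
    return 0;
}
```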
Check CUDA toolkit version:
nvcc --version
which nvcc
Ensure correct GPU architecture:
make ARCH=sm_75 # For RTX 2080
make ARCH=sm_80 # For A100
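If the right ARCH value is unclear, the device's compute capability can be queried with cudaGetDeviceProperties; the major/minor numbers map directly to the sm_XY suffix (e.g. 7.0 → sm_70). A minimal query:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0
    // e.g. major=7, minor=0 on a V100 -> make ARCH=sm_70
    printf("%s: compute capability %d.%d -> make ARCH=sm_%d%d\n",
           prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}
```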
Enable optimizations:
make CXXFLAGS="-O3 -march=native -funroll-loops"
Load required modules:
module load CUDA/11.8
module load GCC/11.2.0
module load Python/3.9.6
GPU optimization notes:
- Coalesced memory access: Structure-of-arrays (SoA) layout for matrix X
- Shared memory: Use for reductions (dot product, norms); see the sketch after this list
- Occupancy: Block size 128-256 threads
- Streams: (Optional) Overlap H2D/compute/D2H
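As an illustration of the shared-memory reduction mentioned above, here is a textbook block-level dot-product pattern (grid-stride accumulation, tree reduction in shared memory, one atomicAdd per block). It is a generic sketch, not the repository's kernel, and assumes a power-of-two block size.

```cuda
#include <cuda_runtime.h>

// Block-level shared-memory reduction for a dot product.
// Each block reduces its partial sum in shared memory, then thread 0
// atomically adds the block's result to the global accumulator.
__global__ void dot_kernel(const float* x, const float* y, float* result, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop so any n works with a fixed grid size.
    float sum = 0.0f;
    for (; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, sdata[0]);
}

// Launch example: the accumulator must be zeroed first.
// cudaMemset(d_result, 0, sizeof(float));
// dot_kernel<<<128, 256, 256 * sizeof(float)>>>(d_x, d_y, d_result, n);
```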
CPU optimization notes:
- Compiler flags: -O3 -march=native -funroll-loops
- SIMD: Vectorization with -ftree-vectorize
- Cache blocking: (Optional, for the OpenMP version)
If you use this code, please cite:
PHYS 421 Assignment #5: Linear Regression via GPU-Accelerated Gradient Descent
Fall 2025, Parallel Computing Course
This code is provided for educational purposes as part of PHYS 421 coursework.
For questions or issues, refer to the course forum or contact the instructor.
Last Updated: October 27, 2025