Course: PHYS 421 - Parallel Computing (Fall 2025)
Assignment: #5 - Linear Regression
Due: October 29, 2025 (with 24-hour extension)
This project implements full-batch gradient descent (FBGD) for linear regression, comparing CPU serial and GPU implementations. The assignment demonstrates:
- Vector microbenchmarks (SAXPY and dot product)
- FBGD linear regression from scratch with custom CUDA kernels
- Scaling studies on synthetic data
- Real-data application with performance analysis
Requirements:
- GPU: NVIDIA GPU with CUDA Compute Capability ≥6.0 (tested on V100/A100)
- CPU: Modern x86_64 processor (tested on Intel Xeon)
- RAM: Minimum 8 GB (16 GB recommended for large N)
- CUDA Toolkit: ≥11.0 (tested with CUDA 11.8)
- GCC/G++: ≥9.0 with C++17 support
- Python: ≥3.8 (for plotting and preprocessing)
- Libraries:
  - NumPy ≥1.20
  - Matplotlib ≥3.3
  - pandas ≥1.2
  - PyYAML ≥5.4
Install the Python dependencies:
pip install numpy matplotlib pandas pyyaml scikit-learn
Repository layout:
assignment-5-linear-regression/
├── README.md # This file
├── Makefile # Build system
├── .gitignore # Git ignore rules
├── include/ # Header files
├── src/ # Source code
│ ├── common/ # Shared utilities
│ ├── part0/ # Vector microbenchmarks
│ ├── part1/ # FBGD implementation
│ └── part2/ # Real-data application
├── scripts/ # Run and plotting scripts
│ ├── hpc/ # Slurm batch scripts
│ └── plot_*.py # Python plotting scripts
├── configs/ # YAML configuration files
├── data/ # Datasets and metadata
│ ├── real/ # Real dataset
│ └── meta/ # Seeds and subsample indices
├── results/ # CSV output logs (gitignored)
├── plots/ # Generated figures (gitignored)
├── logs/ # Job logs (gitignored)
└── report/ # Final PDF report
Build all targets:
make all
This compiles:
- bin/part0_microbench: Vector microbenchmarks
- bin/part1_fbgd: Linear regression scaling study
- bin/part2_real: Real-data application
Run the full pipeline on a local machine with a GPU:
bash scripts/run_all.sh
Or on an HPC cluster (Slurm):
# Load modules
source scripts/hpc/env.module
# Submit full pipeline
sbatch scripts/hpc/run_all.sbatch
Generate the plots:
# Part 0: Vector microbenchmarks
python scripts/plot_part0.py
# Part 1: FBGD scaling study
python scripts/plot_part1.py
# Part 2: Real-data results
python scripts/plot_part2.py
All plots will be saved to plots/part{0,1,2}/.
make part0 # Vector microbenchmarks only
make part1 # FBGD implementation only
make part2 # Real-data application only
make common   # Shared utilities only
make DEBUG=1  # Debug build with -g -G
make VERBOSE=1 # Verbose compilation output
make ARCH=sm_70 # Target specific GPU architecture
make clean      # Remove object files and executables
make cleanall   # Remove build artifacts and results
Part 0, single run:
./bin/part0_microbench --operation saxpy --size 1000000 --trials 10 --warmup 3
./bin/part0_microbench --operation dot --size 1000000 --trials 10 --warmup 3
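For orientation, the sketch below shows the kind of SAXPY kernel Part 0 benchmarks (y = a·x + y, one thread per element). It is a minimal illustrative version, not the repository's actual code; the kernel name, sizes, and launch configuration are placeholders.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i], one thread per element.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                        // ~1M elements (placeholder size)
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;                        // typical block size (128-256 threads)
    const int grid  = (n + block - 1) / block;
    saxpy_kernel<<<grid, block>>>(n, 3.0f, dx, dy);
    cudaDeviceSynchronize();

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected 5.0)\n", hy[0]); // 3*1 + 2

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

The dot-product microbenchmark additionally needs a reduction; a generic shared-memory reduction pattern is sketched in the optimization notes near the end of this README.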
Full sweep:
bash scripts/run_part0.sh
Part 1, single configuration:
./bin/part1_fbgd --N 100000 --D 16 --eta 0.001 --tol 1e-4 --maxiter 10000 --seed 42
Full sweep over N:
bash scripts/run_part1.sh
On an HPC cluster:
sbatch scripts/hpc/part1_array.sbatch
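For context, each FBGD iteration minimizes the mean squared error (1/N)‖Xw − y‖² by stepping along its gradient, g = (2/N) Xᵀ(Xw − y), i.e. w ← w − η·g. The sketch below shows one way to express a single step in CUDA (row-major X, one thread per sample, atomic gradient accumulation). It is illustrative only, not the repository's kernel; the optimization notes later in this README recommend an SoA layout and shared-memory reductions instead of the atomics used here.

```cuda
#include <cuda_runtime.h>

// One FBGD step for linear regression: w <- w - eta * (2/N) * X^T (X w - y).
// X is N x D in row-major order; each thread handles one sample and atomically
// accumulates its contribution to the gradient. Illustrative layout and names.
__global__ void fbgd_gradient(const float* X, const float* y, const float* w,
                              float* grad, int N, int D) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    // Residual r_i = x_i . w - y_i
    float r = -y[i];
    for (int j = 0; j < D; ++j)
        r += X[(size_t)i * D + j] * w[j];

    // grad[j] += (2/N) * r * x_ij
    float scale = 2.0f * r / N;
    for (int j = 0; j < D; ++j)
        atomicAdd(&grad[j], scale * X[(size_t)i * D + j]);
}

// w[j] <- w[j] - eta * grad[j]; D is small (e.g. 16), so one block suffices.
__global__ void update_weights(float* w, const float* grad, float eta, int D) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < D) w[j] -= eta * grad[j];
}

// Host-side driver for a single iteration (all pointers are device pointers).
void fbgd_step(const float* dX, const float* dy, float* dw, float* dgrad,
               int N, int D, float eta) {
    cudaMemset(dgrad, 0, D * sizeof(float));
    const int block = 256, grid = (N + block - 1) / block;
    fbgd_gradient<<<grid, block>>>(dX, dy, dw, dgrad, N, D);
    update_weights<<<1, D>>>(dw, dgrad, eta, D);
}
```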
Part 2, prepare the dataset:
cd data/real
bash fetch_real_data.sh
python preprocessing.py
cd ../..
Run the experiment:
./bin/part2_real --config configs/part2.yaml
Configuration files in configs/ use YAML format:
Example: configs/sweep_part1.yaml
problem:
  D: 16
  N_values: [30000, 100000, 300000, 1000000]
  seed: 42
  noise_sigma: 0.5
solver:
  eta: 0.001
  tolerance: 1e-4
  max_iterations: 10000
timing:
  warmup_trials: 3
  timed_trials: 10
  timeout_seconds: 60
All experiments follow strict timing requirements:
- Warm-up runs: 3 iterations (discarded)
- Timed runs: 10 iterations
- Reporting: Mean ± standard deviation
- Kernel-Only: Compute only (data pre-loaded on device)
- End-to-End: Host-to-device + compute + device-to-host
- Serial: Single-threaded CPU baseline timed with std::chrono
- Timeout: Cap at 60 seconds or 10× GPU time (whichever is smaller)
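A minimal sketch of how the two GPU timing modes can be collected: CUDA events bracket the kernel for kernel-only numbers, while std::chrono brackets transfers plus compute for end-to-end numbers. The kernel and all names here are placeholders, not the repository's harness.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Placeholder compute kernel standing in for the real workload.
__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    const int n = 1 << 20, warmup = 3, trials = 10;
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    std::vector<double> kernel_ms, e2e_ms;

    for (int t = 0; t < warmup + trials; ++t) {
        auto t0 = std::chrono::steady_clock::now();            // end-to-end start
        cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cudaEventRecord(beg);                                   // kernel-only start
        dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);

        cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
        auto t1 = std::chrono::steady_clock::now();             // end-to-end stop

        if (t < warmup) continue;                               // discard warm-up runs
        float k = 0.0f;
        cudaEventElapsedTime(&k, beg, end);                     // milliseconds
        kernel_ms.push_back(k);
        e2e_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Report mean +/- standard deviation over the timed trials.
    auto mean_std = [](const std::vector<double>& v, double& m, double& s) {
        m = 0.0; for (double x : v) m += x; m /= v.size();
        s = 0.0; for (double x : v) s += (x - m) * (x - m); s = std::sqrt(s / v.size());
    };
    double km, ks, em, es;
    mean_std(kernel_ms, km, ks);
    mean_std(e2e_ms, em, es);
    printf("kernel-only: %.3f +/- %.3f ms   end-to-end: %.3f +/- %.3f ms\n", km, ks, em, es);

    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    cudaFree(d);
    return 0;
}
```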
Random seeds are stored in data/meta/seeds.txt:
master_seed: 42
part0_seed: 12345
part1_seed: 67890
part2_seed: 24680
Pre-generated subsample indices for different N values are stored in data/meta/indices/ as .npy files. These ensure identical subsets across runs.
cd data/meta
python generate_indices.py --seed 42 --N_max 3000000 --D 16
Expected results:
- Crossover point: the GPU overtakes the serial CPU around N ≈ 10^5 to 10^6
- Speedup: 10-100× for large N (kernel-only)
- Overhead: End-to-end includes ~10-100 μs launch + transfer costs
- Convergence: ~100-1000 iterations to reach tolerance 10^-4
- GPU advantage: Grows with N; 5-50× speedup for N ≥ 10^6
- Bottleneck: Memory bandwidth for large matrices
- Runtime comparison on fixed-size real dataset
- Validation: residual norm and MSE on held-out data
Results are saved as CSV files in results/:
N,D,implementation,timing_type,mean_time,std_time,iterations,flops
100000,16,cpu_serial,compute,0.453,0.012,543,NA
100000,16,gpu,kernel_only,0.034,0.002,543,1.2e9
100000,16,gpu,end_to_end,0.067,0.003,543,1.2e9
Generated plots include:
- Runtime vs N (log-log scale)
- Speedup vs N (GPU/CPU ratio)
- Throughput vs N (GFLOP/s)
- Convergence curves (MSE or gradient norm vs iterations)
Troubleshooting:
If you run out of memory, reduce the problem size or use smaller data types:
# Edit configs/sweep_part1.yaml
N_values: [30000, 100000, 300000]  # Remove 1M, 3M
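If out-of-memory failures persist, it can help to compare the device's free memory against a rough estimate of the problem's footprint; cudaMemGetInfo is the standard CUDA runtime call for this. The sizes below are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    // Rough float32 footprint of one problem instance: X (N*D) plus y, w, grad.
    const size_t N = 1000000, D = 16;               // placeholder sizes
    const size_t need = (N * D + N + 2 * D) * sizeof(float);

    printf("GPU memory: %.2f GiB free of %.2f GiB; problem needs ~%.3f GiB\n",
           free_b / 1073741824.0, total_b / 1073741824.0, need / 1073741824.0);
    return 0;
}
```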
Check CUDA toolkit version:
nvcc --version
which nvcc
Ensure correct GPU architecture:
make ARCH=sm_75 # For RTX 2080
make ARCH=sm_80 # For A100
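If the right ARCH value is unclear, the device's compute capability can be queried with cudaGetDeviceProperties; the major/minor numbers map directly to the sm_XY suffix (e.g. 7.0 → sm_70). A minimal query:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // device 0
    // e.g. major=7, minor=0 on a V100 -> make ARCH=sm_70
    printf("%s: compute capability %d.%d -> make ARCH=sm_%d%d\n",
           prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}
```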
Enable optimizations:
make CXXFLAGS="-O3 -march=native -funroll-loops"
Load required modules:
module load CUDA/11.8
module load GCC/11.2.0
module load Python/3.9.6
GPU optimization notes:
- Coalesced memory access: Structure-of-arrays (SoA) layout for matrix X
- Shared memory: Use for reductions (dot product, norms); see the sketch after this list
- Occupancy: Block size 128-256 threads
- Streams: (Optional) Overlap H2D/compute/D2H
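As an illustration of the shared-memory reduction mentioned above, here is a textbook block-level dot-product pattern (grid-stride accumulation, tree reduction in shared memory, one atomicAdd per block). It is a generic sketch, not the repository's kernel, and assumes a power-of-two block size.

```cuda
#include <cuda_runtime.h>

// Block-level shared-memory reduction for a dot product.
// Each block reduces its partial sum in shared memory, then thread 0
// atomically adds the block's result to the global accumulator.
__global__ void dot_kernel(const float* x, const float* y, float* result, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop so any n works with a fixed grid size.
    float sum = 0.0f;
    for (; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, sdata[0]);
}

// Launch example: the accumulator must be zeroed first.
// cudaMemset(d_result, 0, sizeof(float));
// dot_kernel<<<128, 256, 256 * sizeof(float)>>>(d_x, d_y, d_result, n);
```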
CPU optimization notes:
- Compiler flags: -O3 -march=native -funroll-loops
- SIMD: Vectorization with -ftree-vectorize
- Cache blocking: (Optional, for the OpenMP version)
If you use this code, please cite:
PHYS 421 Assignment #5: Linear Regression via GPU-Accelerated Gradient Descent
Fall 2025, Parallel Computing Course
This code is provided for educational purposes as part of PHYS 421 coursework.
For questions or issues, refer to the course forum or contact the instructor.
Last Updated: October 27, 2025