PHYS 421 Assignment #5: Linear Regression via GPU-Accelerated Gradient Descent

Course: PHYS 421 - Parallel Computing (Fall 2025)
Assignment: #5 - Linear Regression
Due: October 29, 2025 (with 24-hour extension)

Overview

This project implements full-batch gradient descent (FBGD) for linear regression, comparing a serial CPU implementation against a GPU implementation written in CUDA. The assignment demonstrates:

  1. Vector microbenchmarks (SAXPY and dot product)
  2. FBGD linear regression from scratch with custom CUDA kernels
  3. Scaling studies on synthetic data
  4. Real-data application with performance analysis
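
For reference, FBGD follows the full gradient of the mean-squared-error loss at every iteration. With step size (learning rate) η, and writing the loss with a conventional 1/(2N) scaling (the code may differ by a constant factor), the update is

L(w) = \frac{1}{2N}\,\lVert Xw - y \rVert^2,
\qquad
\nabla L(w) = \frac{1}{N}\,X^{\top}(Xw - y),
\qquad
w_{t+1} = w_t - \eta\,\nabla L(w_t)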

Environment Requirements

Hardware

  • GPU: NVIDIA GPU with CUDA Compute Capability ≥6.0 (tested on V100/A100)
  • CPU: Modern x86_64 processor (tested on Intel Xeon)
  • RAM: Minimum 8 GB (16 GB recommended for large N)

Software

  • CUDA Toolkit: ≥11.0 (tested with CUDA 11.8)
  • GCC/G++: ≥9.0 with C++17 support
  • Python: ≥3.8 (for plotting and preprocessing)
  • Libraries:
    • NumPy ≥1.20
    • Matplotlib ≥3.3
    • pandas ≥1.2
    • PyYAML ≥5.4
    • scikit-learn (see the pip command below)

Installing Python Dependencies

pip install numpy matplotlib pandas pyyaml scikit-learn

Repository Structure

assignment-5-linear-regression/
├── README.md              # This file
├── Makefile               # Build system
├── .gitignore            # Git ignore rules
├── include/              # Header files
├── src/                  # Source code
│   ├── common/          # Shared utilities
│   ├── part0/           # Vector microbenchmarks
│   ├── part1/           # FBGD implementation
│   └── part2/           # Real-data application
├── scripts/             # Run and plotting scripts
│   ├── hpc/            # Slurm batch scripts
│   └── plot_*.py       # Python plotting scripts
├── configs/             # YAML configuration files
├── data/                # Datasets and metadata
│   ├── real/           # Real dataset
│   └── meta/           # Seeds and subsample indices
├── results/             # CSV output logs (gitignored)
├── plots/               # Generated figures (gitignored)
├── logs/                # Job logs (gitignored)
└── report/              # Final PDF report

Quick Start

1. Build All Executables

make all

This compiles:

  • bin/part0_microbench - Vector microbenchmarks
  • bin/part1_fbgd - Linear regression scaling study
  • bin/part2_real - Real-data application

2. Run All Experiments

On a local machine with GPU:

bash scripts/run_all.sh

On HPC cluster (Slurm):

# Load modules
source scripts/hpc/env.module

# Submit full pipeline
sbatch scripts/hpc/run_all.sbatch

3. Generate Plots

# Part 0: Vector microbenchmarks
python scripts/plot_part0.py

# Part 1: FBGD scaling study
python scripts/plot_part1.py

# Part 2: Real-data results
python scripts/plot_part2.py

All plots will be saved to plots/part{0,1,2}/.

Detailed Build Instructions

Build Individual Parts

make part0          # Vector microbenchmarks only
make part1          # FBGD implementation only
make part2          # Real-data application only
make common         # Shared utilities only

Build Options

make DEBUG=1        # Debug build with -g -G
make VERBOSE=1      # Verbose compilation output
make ARCH=sm_70     # Target specific GPU architecture

Clean Build Artifacts

make clean          # Remove object files and executables
make cleanall       # Remove build artifacts and results

Running Experiments

Part 0: Vector Microbenchmarks

Single run:

./bin/part0_microbench --operation saxpy --size 1000000 --trials 10 --warmup 3
./bin/part0_microbench --operation dot --size 1000000 --trials 10 --warmup 3

Full sweep:

bash scripts/run_part0.sh
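
For orientation, the SAXPY kernel exercised by this benchmark could look roughly like the sketch below; the actual kernel and variable names in src/part0 may differ.

// Sketch only: y <- a*x + y, one element per thread
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Typical launch with 256-thread blocks (d_x, d_y are device pointers)
int threads = 256;
int blocks  = (n + threads - 1) / threads;
saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);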

Part 1: FBGD Scaling Study

Single configuration:

./bin/part1_fbgd --N 100000 --D 16 --eta 0.001 --tol 1e-4 --maxiter 10000 --seed 42

Full sweep over N:

bash scripts/run_part1.sh

On HPC cluster:

sbatch scripts/hpc/part1_array.sbatch
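
The solver flags map onto the FBGD loop roughly as sketched below; compute_gradient_gpu, l2_norm_gpu, and axpy_gpu are stand-ins for the project's actual kernels and wrappers, not its real API.

// Host-side sketch of the FBGD driver (illustrative names)
for (int iter = 0; iter < maxiter; ++iter) {             // --maxiter
    // grad = (1/N) * X^T (X w - y), computed on the device
    compute_gradient_gpu(d_X, d_y, d_w, d_grad, N, D);
    double gnorm = l2_norm_gpu(d_grad, D);
    if (gnorm < tol) break;                               // --tol: converged
    axpy_gpu(D, -eta, d_grad, d_w);                       // --eta: w <- w - eta * grad
}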

Part 2: Real-Data Application

Prepare dataset:

cd data/real
bash fetch_real_data.sh
python preprocessing.py
cd ../..

Run experiment:

./bin/part2_real --config configs/part2.yaml

Configuration Files

Configuration files in configs/ use YAML format:

Example: configs/sweep_part1.yaml

problem:
  D: 16
  N_values: [30000, 100000, 300000, 1000000]
  seed: 42
  noise_sigma: 0.5

solver:
  eta: 0.001
  tolerance: 1e-4
  max_iterations: 10000

timing:
  warmup_trials: 3
  timed_trials: 10
  timeout_seconds: 60

Timing Protocol

All experiments follow strict timing requirements:

  1. Warm-up runs: 3 iterations (discarded)
  2. Timed runs: 10 iterations
  3. Reporting: Mean ± standard deviation
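
A host-side sketch of this protocol using std::chrono (the same clock used for the CPU baseline below); the helper name and exact structure are illustrative.

#include <chrono>
#include <cmath>
#include <utility>
#include <vector>

// Time `run_once` with discarded warm-up trials, then report
// mean and standard deviation of the timed trials, in seconds.
template <typename F>
std::pair<double, double> time_trials(F run_once, int warmup = 3, int trials = 10) {
    for (int i = 0; i < warmup; ++i) run_once();          // warm-up, discarded
    std::vector<double> t(trials);
    for (int i = 0; i < trials; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        run_once();
        auto t1 = std::chrono::steady_clock::now();
        t[i] = std::chrono::duration<double>(t1 - t0).count();
    }
    double mean = 0.0, var = 0.0;
    for (double x : t) mean += x;
    mean /= trials;
    for (double x : t) var += (x - mean) * (x - mean);
    return {mean, std::sqrt(var / trials)};
}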

GPU Timing (CUDA Events)

  • Kernel-Only: Compute only (data pre-loaded on device)
  • End-to-End: Host-to-device + compute + device-to-host
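
Both modes can be measured with the standard CUDA event API; the sketch below reuses the hypothetical saxpy_kernel from the Part 0 section and brackets either the kernel alone or the full transfer-compute-transfer sequence.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Kernel-only: data already resident in d_x, d_y
cudaEventRecord(start);
saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start, stop);   // elapsed time in milliseconds

// End-to-end: H2D copies + kernel + D2H copy
cudaEventRecord(start);
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
saxpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_end_to_end = 0.0f;
cudaEventElapsedTime(&ms_end_to_end, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);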

CPU Timing

  • Serial: Single-threaded baseline with std::chrono
  • Timeout: Cap at 60 seconds or 10× GPU time (whichever is smaller)

Reproducibility

Fixed Seeds

Random seeds are stored in data/meta/seeds.txt:

master_seed: 42
part0_seed: 12345
part1_seed: 67890
part2_seed: 24680

Subsample Indices

Pre-generated subsample indices for different N values are stored in data/meta/indices/ as .npy files. These ensure identical subsets across runs.

Regenerating Data

cd data/meta
python generate_indices.py --seed 42 --N_max 3000000 --D 16

Expected Results

Part 0: Vector Microbenchmarks

  • Crossover point: GPU overtakes CPU serial around N ≈ 10^5 - 10^6
  • Speedup: 10-100× for large N (kernel-only)
  • Overhead: End-to-end includes ~10-100 μs launch + transfer costs
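
One way to see why: SAXPY performs only 2 floating-point operations per element while moving roughly 12 bytes (read x, read y, write y in single precision), so both implementations are memory-bandwidth bound; the GPU pulls ahead only once N is large enough that its much higher bandwidth outweighs the fixed launch and transfer overhead.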

Part 1: FBGD Scaling Study

  • Convergence: ~100-1000 iterations to reach tolerance 10^-4
  • GPU advantage: Grows with N; 5-50× speedup for N ≥ 10^6
  • Bottleneck: Memory bandwidth for large matrices

Part 2: Real-Data Application

  • Runtime comparison on fixed-size real dataset
  • Validation: residual norm and MSE on held-out data

Output Format

CSV Logs

Results are saved as CSV files in results/:

N,D,implementation,timing_type,mean_time,std_time,iterations,flops
100000,16,cpu_serial,compute,0.453,0.012,543,NA
100000,16,gpu,kernel_only,0.034,0.002,543,1.2e9
100000,16,gpu,end_to_end,0.067,0.003,543,1.2e9

Plots

Generated plots include:

  • Runtime vs N (log-log scale)
  • Speedup vs N (CPU time ÷ GPU time)
  • Throughput vs N (GFLOP/s)
  • Convergence curves (MSE or gradient norm vs iterations)

Troubleshooting

CUDA Out of Memory

Reduce problem size or use smaller data types:

# Edit configs/sweep_part1.yaml
N_values: [30000, 100000, 300000]  # remove the largest N values (1M and up)
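
As a rough sizing check (assuming the design matrix is stored in double precision, 8 bytes per value): X alone occupies N × D × 8 bytes, so N = 1,000,000 with D = 16 is about 128 MB and N = 3,000,000 about 384 MB, before counting y, the residual vector, and any temporaries.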

Compilation Errors

Check CUDA toolkit version:

nvcc --version
which nvcc

Ensure correct GPU architecture:

make ARCH=sm_75  # For RTX 2080
make ARCH=sm_80  # For A100

Slow CPU Performance

Enable optimizations:

make CXXFLAGS="-O3 -march=native -funroll-loops"

Module Not Found (HPC)

Load required modules:

module load CUDA/11.8
module load GCC/11.2.0
module load Python/3.9.6

Performance Tuning

GPU Kernel Optimization

  1. Coalesced memory access: Structure-of-arrays (SoA) layout for matrix X
  2. Shared memory: Use for reductions (dot product, norms); a sketch follows this list
  3. Occupancy: Block size 128-256 threads
  4. Streams: (Optional) Overlap H2D/compute/D2H
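
As an illustration of item 2, a block-level dot product with a shared-memory tree reduction might look like the sketch below (one partial sum per block; a second pass or a host-side sum combines the partials). Names and launch parameters are illustrative.

__global__ void dot_partial(int n, const float* x, const float* y, float* partial) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Grid-stride loop so a fixed grid covers any n
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        sum += x[i] * y[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Example launch: 256-thread blocks, dynamic shared memory sized to the block
// dot_partial<<<blocks, 256, 256 * sizeof(float)>>>(n, d_x, d_y, d_partial);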

CPU Optimization

  1. Compiler flags: -O3 -march=native -funroll-loops
  2. SIMD: Vectorization with -ftree-vectorize
  3. Cache blocking: (Optional for OpenMP version)

Citation

If you use this code, please cite:

PHYS 421 Assignment #5: Linear Regression via GPU-Accelerated Gradient Descent
Fall 2025, Parallel Computing Course

License

This code is provided for educational purposes as part of PHYS 421 coursework.

Contact

For questions or issues, refer to the course forum or contact the instructor.


Last Updated: October 27, 2025
