
ichida-algo


This repository contains the winning submission for StartHack 2024 (June) 🎉🎊🥳. Our team chose Track 1 - High Performance Computing, which was hosted by QDX and involved writing a massively parallel implementation of neural network inference on a small demo model.

Development involved:

  • Optimising for the compiler (cache locality, throughput)
  • SIMD programming, vector intrinsics and alignment in C
  • Multithreading and task distribution
  • x86_64 assembly, in-depth profiling and tuning
  • Programming in CUDA
  • MPI (Message-Passing Interface) for multi-GPU utilisation

We decided to go down this path because it sounded like high-risk, high-reward excitement. Going in, we knew almost nothing about low-level optimisation or GPU programming, so it turned out to be a lot of active learning on the job! We would like to once again thank QDX for the opportunity to participate in this unique challenge and for offering us a chance to impress them.

Installation

To run the code correctly, please ensure that you have:

  • An x86_64 CPU (plus OpenMP for multithreading) to test the CPU implementation
  • A CUDA-compatible GPU (plus a matching version of the CUDA toolkit) to test the GPU implementation
  • An MPI implementation if you are using multiple GPUs (we have verified OpenMPI as working)

To compile and run on CPU (multithreaded):

  • make
  • Run with the provided script: ./speed_demo_cpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run on a single GPU (non-MPI setup):

  • make build_gpu
  • Run with ./speed_gpu ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run with an MPI (multi-GPU) setup:

  • make
  • Run with the provided script: ./speed_demo_gpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

Important

We've found that running MPI on a large number of devices incurs a significant ~6s startup overhead; for scale, that is already ~8.5% of our best 8-GPU run (70.388s). To minimise its effect on measurements, we recommend running a large number of inferences per input (500M to 1B per input on 8 GPUs). CUDA incurs a similar but less severe ~2s penalty in some cases.

Implementation details

  • The CPU matmul kernel is written using SIMD intrinsics, entirely in C! It makes heavy use of memory alignment and of cache locality via a transposition step, and is quite fast. In fact, as far as we're aware, it beats the inline asm version provided by cblas by a noticeable margin for this use case! (A minimal sketch follows this list.)
  • To skirt the problem of each matrix computation being quite small, we went with a monolithic kernel design, where each inference runs essentially per thread (also sketched below). It took some wrestling with the way CUDA works, but we managed to get it running at a satisfying speed. This was an interesting first CUDA experience, and there really wasn't much material about this approach, but we are happy with how it turned out in the end (especially since one of us only owns a MacBook).
  • We divide work evenly among the available GPUs using MPI for a further speedup (see the work-division sketch below). After some struggles with setup, we managed to get it working. We want to thank the QDX team for the convenient test machine that they set up for the competition, as it helped us get the multi-GPU aspect as correct as we could.
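
The sketch below shows the general shape of such a CPU kernel: an AVX2 matrix-vector product over pre-transposed, row-major weights. It assumes 32-byte-aligned buffers and a column count that is a multiple of 8; the function name and shapes are illustrative, not the repository's actual kernel.

    #include <immintrin.h>
    #include <stddef.h>

    /* Matrix-vector product over pre-transposed (row-major) weights.
       Assumes w and x are 32-byte aligned and cols % 8 == 0.
       Compile with -mavx2 -mfma. */
    static void matvec_f32(const float *w, const float *x, float *out,
                           int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            const float *row = w + (size_t)r * cols;
            __m256 acc = _mm256_setzero_ps();
            for (int c = 0; c < cols; c += 8) {
                __m256 wv = _mm256_load_ps(row + c); /* aligned loads */
                __m256 xv = _mm256_load_ps(x + c);
                acc = _mm256_fmadd_ps(wv, xv, acc);  /* fused multiply-add */
            }
            /* horizontal sum of the 8 lanes */
            __m128 lo = _mm256_castps256_ps128(acc);
            __m128 hi = _mm256_extractf128_ps(acc, 1);
            lo = _mm_add_ps(lo, hi);
            lo = _mm_hadd_ps(lo, lo);
            lo = _mm_hadd_ps(lo, lo);
            out[r] = _mm_cvtss_f32(lo);
        }
    }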
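
The monolithic kernel idea, sketched under assumed shapes: each CUDA thread runs one complete, small inference end to end, so a huge number of independent inferences keeps the GPU saturated even though each individual matrix is tiny. The layer width, depth and activation function here are illustrative assumptions, not the demo model's real architecture.

    #define N 8 /* assumed layer width (illustrative) */
    #define L 4 /* assumed layer count (illustrative) */

    /* One full inference per thread. */
    __global__ void infer_many(const float *w,    /* L x N x N weights */
                               const float *bias, /* L x N biases      */
                               const float *x,    /* N-element input   */
                               float *out, long long n_inferences) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i >= n_inferences) return;

        float a[N], b[N];
        for (int j = 0; j < N; j++) a[j] = x[j];

        for (int l = 0; l < L; l++) {
            for (int r = 0; r < N; r++) {
                float s = bias[l * N + r];
                for (int c = 0; c < N; c++)
                    s += w[(l * N + r) * N + c] * a[c];
                b[r] = s > 0.0f ? s : 0.0f; /* ReLU */
            }
            for (int j = 0; j < N; j++) a[j] = b[j];
        }
        out[i] = a[0]; /* one scalar result per inference */
    }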
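
And the work division itself, as a sketch assuming one MPI rank pinned to each GPU, with any remainder spread across the lowest ranks; run_inferences is a hypothetical stand-in for the per-GPU driver, not a function from this repository.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaSetDevice(rank); /* one rank per GPU (assumed mapping) */

        long long total = 500000000LL; /* iterations for this input */
        long long mine  = total / size + (rank < total % size ? 1 : 0);

        /* run_inferences(mine); -- hypothetical per-GPU driver */

        MPI_Barrier(MPI_COMM_WORLD); /* sync before timing/teardown */
        MPI_Finalize();
        return 0;
    }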

Results

These are the best runs we have achieved (all categories were tested on 52 inputs):

Hardware Used | Parallelism | Best Run / Nr. of Iterations | Throughput (time for 1B)
------------- | ----------- | ---------------------------- | ------------------------
Ryzen 5600x   | 1 thread    | 6.658s / 100k per input      | 21 minutes 20.34 seconds
Ryzen 5600x   | 12 threads  | 11.631s / 1M per input       | 3 minutes 43.62 seconds
EPYC 7J13*    | 240 threads | 112.124s / 100M per input    | 21.56 seconds
A100 80GB     | 1 GPU       | 103.833s / 100M per input    | 19.96 seconds
A100 80GB     | 8 GPUs      | 70.388s / 500M per input     | 2.71 seconds

* Dual-socket system with two CPUs, each with 64 cores / 128 threads

Team members

All team members are from RMIT.

Artemis Rosman - rozukke

  • Project management
  • CPU optimisation (AVX2/SIMD, kernel, memory, multithreading, testing/profiling & tuning)
  • GPU optimisation (monolithic kernel design & work division, small tweaks)
  • MPI optimisation (work division)
  • Code rewrites & cleanup, code review/maintenance
  • Communication & design
  • A lot of textbook reading

Dan Dang - nhatdongdang

  • Core implementation in C
  • Benchmark implementation
  • CPU optimisation (AVX2/SIMD, testing & tuning)
  • GPU optimisation (memory, kernel implementation, testing/profiling & tuning)
  • MPI optimisation
  • Code review & CI pipeline
  • A lot of textbook reading

Johnathan Chan - Jinxto

  • Core implementation in CUDA
  • Core MPI implementation & optimisation (GPU detection)
  • Builds & CMake setup
  • Teamwork :D
  • A lot of textbook reading