
ichida-algo


This repository contains the winning submission for StartHack 2024 (June) 🎉🎊🥳. Our team chose Track 1 - High Performance Computing, which was hosted by QDX and involved writing a massively parallel implementation of neural network inference on a small demo model.

Development involved:

  • Optimising for the compiler (cache locality, throughput)
  • SIMD programming, vector intrinsics and alignment in C
  • Multithreading and task distribution
  • x86_64 assembly, in-depth profiling and tuning
  • Programming in CUDA
  • MPI (Message-Passing Interface) for multi-GPU utilisation

We decided to go down this path because it sounded like high-risk, high-reward excitement. Going in, we knew almost nothing about low-level optimisation or GPU programming, so it turned out to be a lot of active learning on the job! We would like to once again thank QDX for the opportunity to participate in this unique challenge and for offering us a chance to impress them.

Installation

To run the code correctly, please ensure that you have:

  • An x86_64 CPU (plus OpenMP for multithreading) to test the CPU implementation
  • A CUDA-compatible GPU (plus a matching version of the CUDA toolkit) to test the GPU implementation
  • An MPI implementation if you are using multiple GPUs (we have verified OpenMPI as working)

To compile and run on CPU (multithreaded):

  • make
  • Run with the provided script: ./speed_demo_cpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run on a single GPU (non-MPI setup):

  • make build_gpu
  • Run with ./speed_gpu ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run with an MPI (multi-GPU) setup:

  • make
  • Run with the provided script: ./speed_demo_gpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

Important

We've found that running MPI on a large number of devices incurs a significant ~6s startup overhead; for scale, that is already ~8.5% of our best 8-GPU run (70.388s). To minimise its effect on measurements, we recommend running a large number of inferences per input (500M to 1B per input on 8 GPUs). CUDA incurs a similar but less severe ~2s penalty in some cases.

Implementation details

  • The CPU matmul kernel is written using SIMD intrinsics, entirely in C! It makes heavy use of memory alignment and of cache locality via a transposition step, and is quite fast. In fact, as far as we're aware, it beats the inline asm version provided by cblas by a noticeable margin for this use case! (A minimal sketch follows this list.)
  • To skirt the problem of each matrix computation being quite small, we went with a monolithic kernel design, where each inference runs essentially per thread (also sketched below). It took some wrestling with the way CUDA works, but we managed to get it running at a satisfying speed. This was an interesting first CUDA experience, and there really wasn't much material about this approach, but we are happy with how it turned out in the end (especially since one of us only owns a MacBook).
  • We divide work evenly among the available GPUs using MPI for a further speedup (see the work-division sketch below). After some struggles with setup, we managed to get it working. We want to thank the QDX team for the convenient test machine that they set up for the competition, as it helped us get the multi-GPU aspect as correct as we could.
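
The sketch below shows the general shape of such a CPU kernel: an AVX2 matrix-vector product over pre-transposed, row-major weights. It assumes 32-byte-aligned buffers and a column count that is a multiple of 8; the function name and shapes are illustrative, not the repository's actual kernel.

    #include <immintrin.h>
    #include <stddef.h>

    /* Matrix-vector product over pre-transposed (row-major) weights.
       Assumes w and x are 32-byte aligned and cols % 8 == 0.
       Compile with -mavx2 -mfma. */
    static void matvec_f32(const float *w, const float *x, float *out,
                           int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            const float *row = w + (size_t)r * cols;
            __m256 acc = _mm256_setzero_ps();
            for (int c = 0; c < cols; c += 8) {
                __m256 wv = _mm256_load_ps(row + c); /* aligned loads */
                __m256 xv = _mm256_load_ps(x + c);
                acc = _mm256_fmadd_ps(wv, xv, acc);  /* fused multiply-add */
            }
            /* horizontal sum of the 8 lanes */
            __m128 lo = _mm256_castps256_ps128(acc);
            __m128 hi = _mm256_extractf128_ps(acc, 1);
            lo = _mm_add_ps(lo, hi);
            lo = _mm_hadd_ps(lo, lo);
            lo = _mm_hadd_ps(lo, lo);
            out[r] = _mm_cvtss_f32(lo);
        }
    }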
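
The monolithic kernel idea, sketched under assumed shapes: each CUDA thread runs one complete, small inference end to end, so a huge number of independent inferences keeps the GPU saturated even though each individual matrix is tiny. The layer width, depth and activation function here are illustrative assumptions, not the demo model's real architecture.

    #define N 8 /* assumed layer width (illustrative) */
    #define L 4 /* assumed layer count (illustrative) */

    /* One full inference per thread. */
    __global__ void infer_many(const float *w,    /* L x N x N weights */
                               const float *bias, /* L x N biases      */
                               const float *x,    /* N-element input   */
                               float *out, long long n_inferences) {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i >= n_inferences) return;

        float a[N], b[N];
        for (int j = 0; j < N; j++) a[j] = x[j];

        for (int l = 0; l < L; l++) {
            for (int r = 0; r < N; r++) {
                float s = bias[l * N + r];
                for (int c = 0; c < N; c++)
                    s += w[(l * N + r) * N + c] * a[c];
                b[r] = s > 0.0f ? s : 0.0f; /* ReLU */
            }
            for (int j = 0; j < N; j++) a[j] = b[j];
        }
        out[i] = a[0]; /* one scalar result per inference */
    }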
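
And the work division itself, as a sketch assuming one MPI rank pinned to each GPU, with any remainder spread across the lowest ranks; run_inferences is a hypothetical stand-in for the per-GPU driver, not a function from this repository.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaSetDevice(rank); /* one rank per GPU (assumed mapping) */

        long long total = 500000000LL; /* iterations for this input */
        long long mine  = total / size + (rank < total % size ? 1 : 0);

        /* run_inferences(mine); -- hypothetical per-GPU driver */

        MPI_Barrier(MPI_COMM_WORLD); /* sync before timing/teardown */
        MPI_Finalize();
        return 0;
    }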

Results

These are the best runs we have achieved (all categories were tested on 52 inputs):

Hardware Used | Parallelism | Best Run / Nr. of Iterations | Throughput (time for 1B)
------------- | ----------- | ---------------------------- | ------------------------
Ryzen 5600x   | 1 thread    | 6.658s / 100k per input      | 21 minutes 20.34 seconds
Ryzen 5600x   | 12 threads  | 11.631s / 1M per input       | 3 minutes 43.62 seconds
EPYC 7J13*    | 240 threads | 112.124s / 100M per input    | 21.56 seconds
A100 80GB     | 1 GPU       | 103.833s / 100M per input    | 19.96 seconds
A100 80GB     | 8 GPUs      | 70.388s / 500M per input     | 2.71 seconds

* Dual-socket system with two CPUs, each with 64 cores / 128 threads

Team members

All team members are from RMIT.

Artemis Rosman - rozukke

  • Project management
  • CPU optimisation (AVX2/SIMD, kernel, memory, multithreading, testing/profiling & tuning)
  • GPU optimisation (monolithic kernel design & work division, small tweaks)
  • MPI optimisation (work division)
  • Code rewrites & cleanup, code review/maintenance
  • Communication & design
  • A lot of textbook reading

Dan Dang - nhatdongdang

  • Core implementation in C
  • Benchmark implementation
  • CPU optimisation (AVX2/SIMD, testing & tuning)
  • GPU optimisation (memory, kernel implementation, testing/profiling & tuning)
  • MPI optimisation
  • Code review & CI pipeline
  • A lot of textbook reading

Johnathan Chan - Jinxto

  • Core implementation in CUDA
  • Core MPI implementation & optimisation (GPU detection)
  • Builds & CMake setup
  • Teamwork :D
  • A lot of textbook reading