gpu_cluster

Boiler-plate framework for job scheduling on an HPC GPU cluster. Not every project is finished, but that does not mean the effort is wasted. This project began when TensorFlow and PyTorch were not as popular as they are now, so the plan was to use a custom boiler-plate framework to move data within the cluster and run the calculations on GPUs. This little project demonstrates that such a task is doable and can be coded in C++ and CUDA C.

Main challenges

  • Moving data between CPU and GPU is solved with Unified Memory, which physically resides on the GPU but is mapped into the CPU's virtual address space. This makes it possible to create objects (inheriting from Managed.cuh) that are constructed and exist only within this memory. As a result, a serialized object sent from node A can be deserialized and stored directly in the GPU memory of node B, saving many allocation steps and piece-by-piece data moves (see the sketch after this list).
  • Synchronizing multiple processes residing on unknown nodes. The PBS Pro scheduler and the OpenMPI communication interface co-exist very closely here. While a CPU is not conscious of where in the cluster it sits (and thus neither is its MPI process), the PBS Pro scheduler allocates resources in a predictable manner. It is then a matter of arranging the MPI processes into a hierarchical structure using the commRank MPI variable.
  • Compilation of code for different architectures. C++ and CUDA C are different languages executed on different processors, so they need different compilers. Although this is straightforward, making the process seamless and easy to work with (e.g. compiling on a remote machine) was a nice way to practice writing a Makefile.
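
A minimal sketch of the first point, assuming a hypothetical Payload type that inherits from the Managed base class (so its storage comes from unified memory) and is serializable with Cereal; the function names and the fixed message size are illustrative, not the repository's actual API:

    #include <sstream>
    #include <string>
    #include <vector>
    #include <mpi.h>
    #include <cereal/archives/binary.hpp>
    #include "Managed.cuh"  // base class providing unified-memory allocation

    // Hypothetical payload: lives in unified memory because it inherits Managed.
    struct Payload : public Managed {
        float data[256];
        template <class Archive>
        void serialize(Archive& ar) { ar(cereal::binary_data(data, sizeof(data))); }
    };

    // Node A: serialize the object into a byte buffer and ship it over MPI.
    void send_payload(const Payload& p, int dest) {
        std::ostringstream os;
        { cereal::BinaryOutputArchive ar(os); ar(p); }
        std::string buf = os.str();
        MPI_Send(buf.data(), (int)buf.size(), MPI_BYTE, dest, 0, MPI_COMM_WORLD);
    }

    // Node B: receive the bytes and deserialize straight into unified memory.
    // 'new Payload' goes through Managed::operator new (cudaMallocManaged),
    // so the reconstructed object is immediately visible to GPU kernels.
    Payload* recv_payload(int src, int nbytes) {
        std::vector<char> buf(nbytes);
        MPI_Recv(buf.data(), nbytes, MPI_BYTE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::istringstream is(std::string(buf.begin(), buf.end()));
        Payload* p = new Payload;  // allocated in unified memory
        cereal::BinaryInputArchive ar(is);
        ar(*p);
        return p;
    }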

About the project

The project is self-contained: all source files are provided (including the required parts of the Cereal library) to compile it. The operating system on the cluster is Linux, but the source code requires very little support beyond a Makefile.

This was fun to program, as I gained some experience with:

  • template programming in C++
  • CUDA programming on NVIDIA GPUs
  • object serialization with the Cereal library to prepare data for transmission
  • OpenMPI transmission and synchronization for sending/receiving data
  • PBS Pro scripting for job distribution on cluster
  • general Object Oriented Programming concepts to not get lost on the way :D

Code execution

The image below illustrates how the cluster is structured into nodes, which are divided into a main process and worker processes. The program proceeds by spreading the data from main to the workers. The workers process the data on the GPU and send the results back to main. This is repeated TIME_TESTING_ITERATIONS times to generate statistics (a minimal sketch of this loop follows the figure).

[Figure: program_execution_flow — execution flow of the program across the main and worker processes]
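
A minimal sketch of this loop, with illustrative stand-ins for the constants from const.h (the real logic lives in main.cpp and Process.cpp, and the real workers run a CUDA kernel instead of the placeholder loop):

    #include <mpi.h>

    static const int ROOT_PROCESS = 0;             // stand-in, defined in const.h
    static const int TIME_TESTING_ITERATIONS = 10; // stand-in, defined in const.h
    static const int CHUNK = 256;                  // illustrative message size

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int commRank, commSize;
        MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
        MPI_Comm_size(MPI_COMM_WORLD, &commSize);
        float buf[CHUNK] = {0};

        for (int it = 0; it < TIME_TESTING_ITERATIONS; ++it) {
            if (commRank == ROOT_PROCESS) {
                // main: spread data to every worker, then collect the results
                for (int w = 0; w < commSize; ++w) if (w != ROOT_PROCESS)
                    MPI_Send(buf, CHUNK, MPI_FLOAT, w, 0, MPI_COMM_WORLD);
                for (int w = 0; w < commSize; ++w) if (w != ROOT_PROCESS)
                    MPI_Recv(buf, CHUNK, MPI_FLOAT, w, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                // worker: receive, process (on the GPU in the real code), send back
                MPI_Recv(buf, CHUNK, MPI_FLOAT, ROOT_PROCESS, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (int i = 0; i < CHUNK; ++i) buf[i] += 1.0f; // placeholder for the kernel
                MPI_Send(buf, CHUNK, MPI_FLOAT, ROOT_PROCESS, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }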

File description

PBS scripts

pbs_script.scr: PBS Pro script to allocate resources within the cluster and run the executable (line 109)

CPU (C++)

main.cpp: Main file representing main and worker nodes

  • main node is differentiated by [const int] commRank == ROOT_PROCESS (defined in const.h)
  • worker nodes are all other nodes with commRank != ROOT_PROCESS

Process.cpp, Process.hpp: Object representing both main and worker nodes (NOTICE: the same kind of Process object represents different kinds of nodes)

  • NOTICE: The object has a regular OOP structure:
    • Process.cpp contains implementations of classes and functions
    • Process.hpp contains declarations, plus the implementations of template functions (template functions cannot be compiled separately from an implementation in the .cpp file; their bodies must be visible wherever they are instantiated, as illustrated below)
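
For illustration (this snippet is not taken from the repository), the split looks like this; the template body must sit in the header because the compiler needs it at every point of instantiation:

    // Process.hpp (illustrative)
    class Process {
    public:
        void barrier();            // regular function: declared here,
                                   // implemented in Process.cpp
        template <typename T>      // template function: declared AND
        void send(const T& value); // implemented in this header
    };

    template <typename T>
    void Process::send(const T& value) {
        // serialize 'value' and hand it to MPI ...
    }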

Unified memory (CUDA<->C++)

Managed.cuh: Inheriting from this class makes an object reside in unified memory (memory on the GPU that is visible from the CPU)
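
The well-known CUDA pattern for such a base class (a sketch; the repository's actual Managed.cuh may differ in details) overloads operator new/delete with cudaMallocManaged:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Any class inheriting from Managed gets its storage from unified
    // (managed) memory, which both the CPU and the GPU can address.
    class Managed {
    public:
        void* operator new(std::size_t len) {
            void* ptr = nullptr;
            cudaMallocManaged(&ptr, len);
            cudaDeviceSynchronize();
            return ptr;
        }
        void operator delete(void* ptr) {
            cudaDeviceSynchronize(); // make sure the GPU is done with it
            cudaFree(ptr);
        }
    };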

GPU (CUDA)

Cuda_GPU.cu: Class representing the GPU from the CPU's point of view. Therefore this object, while using CUDA functions and being compiled by the CUDA compiler, does not run directly on the GPU.
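
Host-side code of this kind simply drives the GPU through the CUDA runtime API, for example (illustrative, not the repository's class):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Runs on the CPU; queries the GPU through the CUDA runtime.
    void print_gpu_info(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        std::printf("GPU %d: %s, %zu MB of global memory\n",
                    device, prop.name, prop.totalGlobalMem / (1024 * 1024));
    }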

Cuda_kernel.cu: This file defines the kernel_execute() function, which is a wrapper for a function executed on the GPU. The actual kernel function executed on the GPU is implemented in kernel_by_ref.cu. See Cuda_kernel.cuh for a detailed explanation of why C and C++ do not mix: in short, C++ mangles names during compilation, while C does not.
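
The usual bridge (a sketch; the repository's actual declarations live in Cuda_kernel.cuh, and the kernel body here is a placeholder) gives the wrapper C linkage so that both compilers agree on the symbol name:

    #include <cuda_runtime.h>

    __global__ void kernel_by_ref(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;  // placeholder computation
    }

    // extern "C" disables C++ name mangling, so the symbol kernel_execute
    // can be linked from code compiled as C.
    extern "C" void kernel_execute(float* data, int n) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        kernel_by_ref<<<blocks, threads>>>(data, n);
        cudaDeviceSynchronize();
    }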

Other files

Makefile: Makefile to compile the source files locally (on the cluster) and generate the executable test_framework.
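
The core of such a two-compiler build (a sketch with illustrative rules; recipes in a real Makefile must be indented with tabs) is one rule per compiler, with nvcc doing the final link:

    CXX  = g++
    NVCC = nvcc

    main.o: main.cpp
            $(CXX) -std=c++11 -c $< -o $@

    Cuda_kernel.o: Cuda_kernel.cu
            $(NVCC) -c $< -o $@

    test_framework: main.o Cuda_kernel.o
            $(NVCC) $^ -o $@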

compile.sh: Shell script to compile the source code.

jobrun.sh: Shell script to schedule resource allocation using pbs_script.scr.

recompile_jobrun.sh: Recompile and schedule resource allocation (combines compile.sh and jobrun.sh).

return_values.h: Defines unified return values.

const.h: Constants for the CPU code.

config_GPU.h: Constants for the GPU code.

/cereal: Parts of the Cereal library required to compile the project.

300000_messages_1009.frontnode.OU: Output file produced by the main node after the test finishes. Notice line 110, which shows that the total amount of data transferred during the test was 15 [TB].
