Parallel programming coursework at NTHU CS, fall 2019
- MPI
- optimized the algorithm to minimize message size
- asynchronous communication (non-blocking send/recv)
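The non-blocking send/recv pattern above can be sketched as follows. This is a minimal illustration, not the assignment's actual exchange: the ring neighbors, tag, and payload are assumptions, and it needs an MPI environment (`mpicc` / `mpirun`) to build and run.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    int send_val = rank, recv_val = -1;
    MPI_Request reqs[2];

    /* Post the receive first, then the send; both calls return
       immediately so the transfer proceeds in the background. */
    MPI_Irecv(&recv_val, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Local computation would go here, overlapping the communication. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d from rank %d\n", rank, recv_val, left);
    MPI_Finalize();
    return 0;
}
```

Posting the receive before the send also avoids the deadlock that blocking `MPI_Send`/`MPI_Recv` pairs can hit when every rank sends first.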
hw2/hw2_hybrid_dynamic_p_v.c
- MPI + pthreads + OpenMP
- leader/follower architecture
- load balance with dynamic scheduling
- overlapped computation with file writing
- vectorization with Intel SSE3 (SIMD)
3. all-pairs shortest path (cpu)
- OpenMP
- implemented the blocked Floyd-Warshall algorithm to exploit cache locality
4. all-pairs shortest path (gpu)
- utilized NVIDIA Pascal GPU memory hierarchy: shared memory, registers
- fine-tuned block size and kernel size
- resolved bank conflicts
- minimized peer-to-peer communication
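A minimal sketch of how shared memory and bank-conflict padding come together in a phase-1 (pivot block) kernel, assuming a 32x32 thread block and a row-major distance matrix; the kernel name and launch parameters are illustrative, and it needs `nvcc` plus a host-side driver to run.

```cuda
#define B 32

/* Phase 1 of blocked Floyd-Warshall: the pivot block depends only on
   itself, so one thread block loads it into shared memory, runs all B
   relaxation steps there, and writes it back once. */
__global__ void fw_phase1(int *d, int n, int bk) {
    /* The +1 padding column shifts each row's bank alignment, so
       column-wise accesses s[k][j] hit different banks (no conflicts). */
    __shared__ int s[B][B + 1];
    int i = threadIdx.y, j = threadIdx.x;
    int gi = bk * B + i, gj = bk * B + j;

    s[i][j] = d[gi * n + gj];
    __syncthreads();

    for (int k = 0; k < B; k++) {
        int via = s[i][k] + s[k][j];
        if (via < s[i][j]) s[i][j] = via;
        __syncthreads();  /* step k+1 reads values written in step k */
    }

    d[gi * n + gj] = s[i][j];
}

/* Host side (one pivot block per round):
   fw_phase1<<<1, dim3(B, B)>>>(d_dist, n, bk); */
```

Phases 2 and 3 follow the same pattern but load two tiles (the pivot row/column plus the tile being updated), keeping reused operands in shared memory and per-thread temporaries in registers.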