## Proposed Research Plan to Tackle Memory and Communication Walls as a Margaret Butler Fellow

As we enter the exascale era of computing, large-scale supercomputing is addressing challenging problems that were once thought unsolvable. We are now racing to demonstrate technological capabilities for solving problems critical to the well-being of citizens of all countries, such as understanding climate change or simulating COVID-19 mechanisms at the molecular level. In fact, exascale computing may be regarded as largely a psychological milestone, because the LINPACK benchmark used to measure it is based on regular and dense computations. Only a handful of applications will enjoy exascale performance on early systems, because a plethora of today's relevant scientific, AI, and graph-analytics applications involve irregular and sparse computations and therefore suffer from memory-wall and communication-wall bottlenecks. As a result, these applications utilize only a tiny portion of the theoretical performance of large-scale computing systems in practice.

To obtain higher utilization on exascale computers, my research plan tackles the inefficiencies caused by irregular and sparse memory accesses and communications. The next generation of exascale systems (Frontier, Aurora, and El Capitan) will, without exception, consist of multi-GPU nodes connected by a hierarchical communication network. The proposed algorithms therefore target the multi-GPU node architecture and the accompanying interconnect topology. More specifically, I will seek applications that can benefit from the Tiled SpMM and hierarchical communication techniques that I developed in my doctoral dissertation research. Both techniques embrace an inspection/execution model that preprocesses the memory access and communication patterns and optimizes them to perform distributed matrix multiplication with high performance.
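The inspection/execution idea can be illustrated with a minimal single-process sketch. It is not the Tiled SpMM implementation from the dissertation: the function names `inspect_pattern` and `execute_spmm`, the tiling parameters `row_tile` and `buffer_rows`, and the coordinate-list matrix format are all assumptions chosen for illustration. The inspector runs once per sparsity pattern and splits each row tile into stages whose input footprint fits a small buffer (standing in for GPU shared memory); the executor then reuses that plan for every multiplication.

```python
from collections import defaultdict


def inspect_pattern(nonzeros, row_tile, buffer_rows):
    """Inspector (hypothetical): analyze the sparsity pattern once.

    nonzeros    -- list of (row, col, value) entries of the sparse matrix A
    row_tile    -- number of consecutive rows of A handled together
    buffer_rows -- how many rows of the dense matrix B fit in the fast buffer

    Returns a list of stages. Each stage holds the rows of B to gather and
    the nonzeros rewritten to index into that gathered buffer.
    """
    by_row_tile = defaultdict(list)
    for r, c, v in nonzeros:
        by_row_tile[r // row_tile].append((r, c, v))

    plan = []
    for tile in sorted(by_row_tile):
        entries = by_row_tile[tile]
        cols = sorted({c for _, c, _ in entries})
        # Split the tile's column footprint into buffer-sized stages.
        for start in range(0, len(cols), buffer_rows):
            stage_cols = cols[start:start + buffer_rows]
            slot = {c: i for i, c in enumerate(stage_cols)}
            stage_nnz = [(r, slot[c], v) for r, c, v in entries if c in slot]
            plan.append((stage_cols, stage_nnz))
    return plan


def execute_spmm(plan, B, num_rows):
    """Executor (hypothetical): compute C = A @ B using the precomputed plan."""
    n = len(B[0])
    C = [[0.0] * n for _ in range(num_rows)]
    for stage_cols, stage_nnz in plan:
        # Gather the needed rows of B once; all accesses below are local.
        buffer = [B[c] for c in stage_cols]
        for r, s, v in stage_nnz:
            row_c, row_b = C[r], buffer[s]
            for j in range(n):
                row_c[j] += v * row_b[j]
    return C


if __name__ == "__main__":
    A = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
         (2, 3, 4.0), (3, 0, 5.0), (3, 2, 6.0)]
    B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    plan = inspect_pattern(A, row_tile=2, buffer_rows=2)
    print(execute_spmm(plan, B, num_rows=4))
```

The cost of the inspector is amortized when the same sparsity pattern is multiplied many times, as in iterative solvers, which is the setting in which such a split tends to pay off.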

The proposed techniques have already been applied in several award-winning applications to solve problems at unprecedented scales. A good example is the X-ray imaging problem of reconstructing 3D images with sub-micron resolution from TB-scale scan data collected at the Advanced Photon Source of Argonne, where I worked with an interdisciplinary team from the MCS/DSL and XSD divisions. Our SC19 and SC20 papers demonstrated the reconstruction of a 3D mouse brain on 4,096 KNLs (256k cores) of ALCF Theta and on 24,576 GPUs of OLCF Summit. The Tiled SpMM throughput reaches 65 mixed-precision PFLOPS, and the hierarchical communication reduces the dominant communication time by 60%. To further enhance performance, we reorder the rows and columns of the sparse matrix with space-filling-curve-based data layout algorithms, which makes the irregular data-access patterns more local. This work won the best paper award at SC20. I have also applied the proposed techniques at the IBM-Illinois center to accelerate sparse deep neural network inference up to 180 TeraEdges/second on Summit. Our HPEC20 paper won the championship title at the MIT/Amazon/IEEE Sparse Challenge. I am currently collaborating with NVIDIA to contribute the SpMM tiling techniques to the cuSPARSE library.
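To make the reordering step concrete, the sketch below computes a space-filling-curve permutation for a 2D grid of unknowns. It uses Morton (Z-order) bit interleaving as a simple stand-in; the specific curve and layout algorithms used in the SC19 and SC20 work are not described here, and the names `morton_key` and `morton_permutation` are hypothetical.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_permutation(width, height):
    """Return perm such that perm[new_index] is the old row-major index.

    Applying perm to the rows (or columns) of a sparse matrix whose indices
    correspond to grid cells places spatially nearby cells close together
    in memory.
    """
    cells = [(morton_key(x, y), y * width + x)
             for y in range(height) for x in range(width)]
    cells.sort()
    return [old for _, old in cells]


if __name__ == "__main__":
    print(morton_permutation(4, 4))
```

For a 4x4 grid, the permutation visits the grid in 2x2 blocks, so cells that are neighbors in 2D tend to receive nearby indices, which is the property that improves locality of the sparse accesses.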

My research plan involves collaborating with domain scientists not only at Argonne but also at other national laboratories, industry partners, and academic institutions to apply the proposed techniques. I also plan to develop novel extensions and generalizations driven by application requirements and to release them as a performant library for the benefit of other exascale application developers. Throughout my graduate studies, I have applied similar techniques in various application domains involving nonlinear inverse problems, fast algorithms, n-body problems, multigrid methods, computational imaging, sparse deep neural networks, stencil computations, and graph neural networks, and I have optimized applications such as HPCG, SETSM, and ChaNGa on petascale systems. I believe this experience will help me communicate with domain scientists in the language of applied mathematics and computational science.

In summary, as a Margaret Butler Fellow at Argonne, I plan to develop novel algorithms for critical applications at scale, especially those involving irregular and sparse computations and communications. I will optimize these algorithms for the multi-GPU node architectures and communication topologies that the exascale systems will embrace. My research will alleviate the memory-wall bottleneck on individual GPUs and the communication-wall bottleneck at the system level by exploiting the underlying exascale supercomputer architecture. This will accelerate a plethora of large-scale sparse applications. As a result, my research will yield high-throughput science production on exascale systems and contribute to solving the challenging problems of the coming decade.