### Sudheer Kumar

## RESEARCH INTERESTS

High Performance Computing, Performance Optimization on Multi-cores and Accelerators

## EDUCATION

Ph.D., Computer Science, Sri Sathya Sai Institute of Higher Learning, April 2013

- Advisor: Prof. Ashok Srinivasan, Florida State University
- Thesis: Topology and Routing Aware Mapping on Parallel Processors

M.Tech., Computer Science, Sri Sathya Sai Institute of Higher Learning, March 2006

- Advisor: Mr. Shakti Kapoor, STSM, IBM Austin

B.Tech., Information Technology, RVR&JC College of Engineering, April 2004

## WORK EXPERIENCE

**Assistant Professor**, Department of Mathematics & Computer Science, Sri Sathya Sai Institute of Higher Learning, July 2011 - Present

## JOURNAL PUBLICATIONS

Dynamic Load Balancing for Petascale Quantum Monte Carlo Applications: The Alias Method, C.D. Sudheer, S. Krishnan, A. Srinivasan, and P. R. C. Kent. Computer Physics Communications, Feb 2013, Impact Factor: 3.268, (5-year Impact Factor: 2.812).

## CONFERENCE PUBLICATIONS

Optimization of the Hop-Byte Metric for Effective Topology Aware Mapping, C.D. Sudheer, Ashok Srinivasan, Proceedings of the 19th IEEE International Conference on High Performance Computing (HiPC), 2012, (Acceptance rate: 25%).

Optimizing Assignment of Threads to SPEs of the Cell BE Processor, C.D. Sudheer, T. Nagaraju, P.K. Baruah, Ashok Srinivasan, 10th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC), Proceedings of the 23rd International Parallel and Distributed Processing Symposium, IEEE, 2009, (Citations: SG - 5).

High Throughput Compression of Floating point numbers in GPUs, Ajith Padyana, C.D. Sudheer, P.K. Baruah, Ashok Srinivasan, 2nd IEEE International Conference on Parallel, Distributed and Grid Computing, Himachal Pradesh, December 2012.

Investigating Algorithmic Techniques for Enhancing Application Performance on Multicore Processors, C.D. Sudheer (Advisor: Ashok Srinivasan), PhD Forum at IEEE International Parallel and Distributed Processing Symposium, 2009.

A Communication Model for Determining Optimal Affinity on the Cell BE processor, C.D. Sudheer, S. Sriram, Proceedings of the 16th IEEE International Conference on High Performance Computing (HiPC), Student Research Symposium, Dec 2009.

## INVITED PRESENTATIONS

An Overview of the Global Arrays Toolkit, Five-days Technology Workshop on Heterogeneous Computing - Many Core/ Multi GPU - Performance of Algorithms, Application Kernels (HeMPa), 2011, at CMSD, UoHYD by C-DAC Pune & CMSD.

Programming for Performance on Cell BE processor, Performance Enhancement on Emerging Parallel Processing Platforms Workshop (PEEP), 2008, jointly organized by C-DAC and IUCAA.

## PROFESSIONAL SERVICE

Technical Program Committee member, International Conference on High Performance Computing and Communications (HPCC), 2011, 2012.

TPC member, The 11th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2013.

TPC member, The 12th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), 2012.

TPC member, Student Research Symposium, IEEE International Conference on High Performance Computing (HiPC), 2012.

Reviewer, Computing, Springer Journal.

## COMPUTER ACCESS TIME GRANTS

XSEDE Research Allocation: Scaling Communication Performance for Massively Parallel Applications, 800,000 SUs, PI: Prof. Ashok Srinivasan, Florida State University.

Teragrid Startup Allocation, PI: Prof. P. Sadayappan, Ohio State University.

XSEDE Education Allocation: Programming for Performance on multicore and many-core processor, PI: Prof. Ravi Mukkamala, Old Dominion University. Have used this allocation effectively for teaching graduate level courses. Had access to the following supercomputers: TACC systems Ranger, Lonestar and Longhorn, SDSC systems Gordon and Trestles, PSC Blacklight.

## TECHNICAL EXPERTISE

Experience in programming Cell BE, CUDA, OpenCL, Cilk, MPI, OpenMP and Global Arrays. Programming languages: C, C++, and PowerPC and x86 assembly. Conversant with the Cell BE, PowerPC 405 Embedded, ARM, MIPS and x86 architectures.

## KEY RESEARCH PROJECTS

#### Topology and routing aware mapping tool for massively parallel processors

- Developed general mapping techniques by posing the optimization of the hop-byte metric as a quadratic assignment problem (QAP). Rather than using the metric only to evaluate mapping quality, the approach optimizes the metric itself.
- A second metric, the maximum contention metric, is based on the idea of minimizing the number of bottleneck links and requires routing information along with the topology details. We showed that our heuristics for optimizing this metric are more effective in reducing communication costs.
- Tools/programming methods used: C, C++, Cray MPI, Cray Scheduler, Python, OSU MPI benchmarks and the large-scale code profiler HPCToolkit.
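As a rough illustration of the hop-byte metric named above, the following Python sketch sums, over all communicating task pairs, the message volume multiplied by the number of network hops between the nodes the tasks are mapped to. The torus topology, the names `torus_hops` and `hop_bytes`, and the example communication pattern are assumptions made for illustration. The sketch only evaluates a given mapping; it does not perform the QAP-based optimization described in the project.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between node coordinates a and b on a torus.

    Each dimension wraps around, so the distance along a dimension of
    size d is the smaller of the direct and the wraparound path.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm, mapping, dims):
    """Total hop-bytes of a task-to-node mapping.

    comm    -- dict {(src_task, dst_task): bytes exchanged} (hypothetical input)
    mapping -- dict {task: node coordinate tuple}
    dims    -- torus dimensions, e.g. (4, 4)
    """
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in comm.items()
    )


# Hypothetical example: four tasks communicating in a ring on a 4x4 torus.
dims = (4, 4)
comm = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}

# Neighbouring tasks placed on neighbouring nodes.
compact = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}
# Neighbouring tasks scattered across the torus.
scattered = {0: (0, 0), 1: (2, 2), 2: (0, 1), 3: (2, 3)}

compact_cost = hop_bytes(comm, compact, dims)
scattered_cost = hop_bytes(comm, scattered, dims)
print(compact_cost, scattered_cost)
```

A lower hop-byte total indicates that heavily communicating tasks sit closer together on the network, which is the quantity a mapping heuristic would try to reduce.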

#### Topology aware implementation of Global Arrays data management for QWalk

- An automated data management approach using the GA library enables existing QWalk applications to handle a significantly larger range of problem sizes.
- In this work, topology and routing aware grouping of GA groups was implemented so that the contention created by communication in GA calls is minimized.
- Tools/programming methods used: C, C++, Global Arrays and MPI.

#### Optimal dynamic load balancing algorithm for large scale codes involving near identical computational tasks

- Developed a new load balancing algorithm for Quantum Monte Carlo and showed that it scales well, especially when compared with existing implementations.
- Theoretically analyzed its performance characteristics under a variety of metrics to provide more insight into its strengths and limitations.
- An important feature of the new algorithm is that the load can be perfectly balanced with each process receiving at most one message. It is also optimal in the maximum size of messages received by any process.
- Tools/programming methods used: C, MPI, OSU MPI benchmarks and profilers such as TAU and IPM.
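The "at most one message received" property can be sketched with an alias-table style pairing, shown here sequentially in Python. This is a minimal sketch under stated assumptions: loads are integer task counts whose total divides evenly among processes, and the name `alias_balance` is hypothetical. It is not the published parallel implementation.

```python
def alias_balance(loads):
    """Plan transfers so every process ends with the average load.

    Returns a list of (sender, receiver, amount) tuples. Each process
    appears as a receiver at most once: an underloaded process gets its
    entire deficit from a single overloaded process and is then final.
    """
    n = len(loads)
    total = sum(loads)
    if total % n != 0:
        raise ValueError("sketch assumes the total load divides evenly")
    target = total // n

    cur = list(loads)
    small = [i for i in range(n) if cur[i] < target]  # underloaded
    large = [i for i in range(n) if cur[i] > target]  # overloaded
    transfers = []

    # While some process has a deficit, some other must have a surplus,
    # because the loads always sum to n * target.
    while small:
        s = small.pop()
        l = large.pop()
        amount = target - cur[s]
        transfers.append((l, s, amount))
        cur[s] = target
        cur[l] -= amount
        # The sender may drop below the target; it then becomes a
        # receiver itself, again served by exactly one sender.
        if cur[l] < target:
            small.append(l)
        elif cur[l] > target:
            large.append(l)
    return transfers


# Hypothetical example with six processes.
loads = [12, 0, 1, 9, 2, 6]
plan = alias_balance(loads)

final = list(loads)
for sender, receiver, amount in plan:
    final[sender] -= amount
    final[receiver] += amount
print(plan, final)
```

A sender only ever ships work out of its original load, since it stops sending as soon as it falls to or below the target, so the transfers can be issued without waiting on incoming messages.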

#### Optimizing assignment of threads to SPEs on the Cell BE Processor

- The actual bandwidth obtained for inter-SPE communication is strongly influenced by the assignment of threads to SPEs (thread-SPE affinity) in many realistic communication patterns. We identify the bottlenecks to optimal performance and use this information to determine good affinities for common communication patterns.
- Our solutions improve performance by up to a factor of two over the default assignment. We also discussed the optimization of affinity on a Cell blade consisting of two Cell processors, and provided a software tool to help with this.
- Tools/programming methods used: Cell BE programming involving signals, DMAs and mailboxes, C and MPI.

#### Reducing the disk IO bandwidth bottleneck through fast floating point compression using accelerators

- We proposed a compression technique based on time-series analysis and investigated its effectiveness on floating point data from a variety of applications.
- We show that significant reduction in IO time can be achieved, even accounting for the compression overhead, on a Cell BE processor. In our experiments, the typical improvement was around 30%, varying from a slight loss in performance to a factor of seven improvement, depending on the type of application.
- The contribution of this work lies in demonstrating the potential of floating point compression in reducing the IO bandwidth bottleneck for important classes of scientific applications.
- Tools/programming methods used: Cell BE programming, GPU CUDA programming and C.
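The project description does not spell out the predictor used, so the following Python sketch substitutes the simplest time-series predictor, the previous value, purely to illustrate why prediction helps lossless floating point compression. Each double is XORed with the bit pattern of its predecessor; when consecutive values are close, the residual has many leading zero bytes that an encoder could drop. The names `xor_delta_encode`, `xor_delta_decode` and `leading_zero_bytes` are hypothetical, and no actual byte packing is performed.

```python
import struct


def _bits(value):
    """Reinterpret a Python float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def _float(bits):
    """Reinterpret a 64-bit pattern as a Python float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def xor_delta_encode(values):
    """XOR each value's bit pattern with the previous one (assumed predictor)."""
    prev = 0
    residuals = []
    for v in values:
        b = _bits(v)
        residuals.append(b ^ prev)
        prev = b
    return residuals


def xor_delta_decode(residuals):
    """Invert xor_delta_encode, recovering the original floats exactly."""
    prev = 0
    values = []
    for r in residuals:
        b = r ^ prev
        values.append(_float(b))
        prev = b
    return values


def leading_zero_bytes(residual):
    """Number of all-zero high-order bytes in a 64-bit residual."""
    return (64 - residual.bit_length()) // 8


# Hypothetical slowly varying signal, as might come from a simulation.
signal = [1.5, 1.5, 1.5000001, 1.5000002, -3.0]
residuals = xor_delta_encode(signal)
savings = [leading_zero_bytes(r) for r in residuals]
restored = xor_delta_decode(residuals)
print(savings, restored == signal)
```

How much is saved depends entirely on how predictable the data is, which is consistent with the application-dependent results reported above.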

#### Design and implementation of an optimized Proc file system for a message passing operating system

- The aim was to design a process (proc) file system for the Minix OS by first understanding its implementation in Linux and then implementing it for Minix running on a PowerPC 405GP board.
- Implementation involved debugging Minix on PowerPC using the IBM RISCWatch debugger.
- Tools/programming methods used: MINIX OS, PowerPC ISA and C.

#### Mini Software Projects

- Implementation of TFTP (Trivial File Transfer Protocol) in Java, including a GUI.
- Implementation of a primitive web server / proxy / browser in UNIX.
- Tools/programming methods used: Java, C and UNIX programming internals.

## TEACHING

#### High Performance Computing with Accelerators, Summer 2012.

This course resulted in 9 student papers accepted for the Student Research Symposium at HiPC 2012, more than one third of the total papers. Students of this course also secured 3 of the 4 awards for Best Presentation and Best Poster. Course webpage: http://dmacssite.github.com

Programming for Performance, Winter 2011, 2012 and 2013. http://progforperf.github.com

Parallel Computing, Winter 2013. http://parallelcomp.github.com

Computer Organization and Design, 2010, 2011 and 2012.

Processor Architecture and its Applications, 2008.

Operating Systems Design and Implementation, 2007.

Systems Programming using MINIX Operating System, 2006.

### Course management tools

- Maintained course websites locally on an e-learning portal and publicly via Git.
- Incorporated Piazza, a web-based tool for managing class Q&A efficiently.

## ACADEMIC ACHIEVEMENTS

TCPP PhD Forum Travel Grant for attending the IPDPS 2009 conference in Italy.

Secured the 96.84 percentile in the Graduate Aptitude Test in Engineering (GATE-04), Computer Science stream.

Achieved University first rank in the C programming theory and laboratory exams in the 1st year of B.Tech.

## REFERENCES

Prof. Ashok Srinivasan, Florida State University (asriniva@cs.fsu.edu)

Prof. P. Sadayappan, Ohio State University (saday@cse.ohio-state.edu)

Mr. Shakti Kapoor, STSM, IBM Austin (skapoor@us.ibm.com)

Prof. Pallav Kumar Baruah, Sri Sathya Sai Institute of Higher Learning (pkbaruah@sssihl.edu.in)