mabdullahsoyturk/HPC-Paper-Notes

List of Papers

  1. Energy Efficient Architecture for Graph Analytics Accelerators
  2. A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators
  3. System Simulation with gem5 and SystemC
  4. GAIL: The Graph Algorithm Iron Law
  5. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
  6. Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics
  7. Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
  8. Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software Design Approach
  9. GNN Performance Optimization
  10. Dissecting the Graphcore IPU Architecture
  11. Using the Graphcore IPU for Traditional HPC Applications
  12. Roofline: An Insightful Visual Performance Model
  13. CUDA New Features and Beyond
  14. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
  15. BrainTorrent: A Peer-to-Peer Environment for Decentralized Federated Learning
  16. Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
  17. Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
  18. A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
  19. The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
  20. Softshell: Dynamic Scheduling on GPUs
  21. Gravel: Fine-Grain GPU-Initiated Network Messages
  22. SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
  23. Automatic Graph Partitioning for Very Large-scale Deep Learning
  24. Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
  25. Productivity, Portability, Performance: Data-Centric Python
  26. Interferences between Communications and Computations in Distributed HPC Systems
  27. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
  28. GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
  29. GPUnet: Networking Abstractions for GPU Programs
  30. GPUrdma: GPU-side library for high performance networking from GPU kernels
  31. Trends in Data Locality Abstractions for HPC Systems
  32. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
  33. Benchmarking GPUs to Tune Dense Linear Algebra
  34. Brook for GPUs: stream computing on graphics hardware
  35. IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
  36. Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
  37. Merrimac: Supercomputing with Streams
  38. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
  39. A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
  40. PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  41. An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
  42. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  43. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  44. XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
  45. Architecture and Performance of Devito, a System for Automated Stencil Computation
  46. Distributed Training of Deep Learning Models: A Taxonomic Perspective
  47. Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
  48. Assessment of NVSHMEM for High Performance Computing
  49. Sparse GPU Kernels for Deep Learning
  50. The State of Sparsity in Deep Neural Networks
  51. Pruning neural networks without any data by iteratively conserving synaptic flow
  52. SNIP: Single-shot Network Pruning based on Connection Sensitivity
  53. Comparing Rewinding and Fine-tuning in Neural Network Pruning
  54. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  55. Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
  56. An Asynchronous, Message-Driven Parallel Framework for Extreme-Scale Deep Learning
  57. Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
  58. Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
  59. Attention is All You Need
  60. Scaling Laws for Neural Language Models
  61. Language Models are Few-Shot Learners
  62. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  63. RoBERTa: A Robustly Optimized BERT Pretraining Approach
  64. Longformer: The Long-Document Transformer
  65. Linformer: Self-Attention with Linear Complexity
  66. The Efficiency Misnomer
  67. A Survey of Transformers
  68. PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  69. Training Compute-Optimal Large Language Models
  70. WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture
  71. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

Two Papers A Week Goal (Starting from 28.06.2021)

28.06.2021 - 04.07.2021

05.07.2021 - 11.07.2021

12.07.2021 - 18.07.2021

09.08.2021 - 15.08.2021

16.08.2021 - 22.08.2021

23.08.2021 - 29.08.2021

30.08.2021 - 05.09.2021

06.09.2021 - 12.09.2021

13.09.2021 - 19.09.2021

20.09.2021 - 26.09.2021

27.09.2021 - 03.10.2021

04.10.2021 - 10.10.2021

11.10.2021 - 17.10.2021

18.10.2021 - 24.10.2021

25.10.2021 - 31.10.2021

01.11.2021 - 07.11.2021

08.11.2021 - 14.11.2021

15.11.2021 - 21.11.2021

22.11.2021 - 28.11.2021

29.11.2021 - 05.12.2021

06.12.2021 - 12.12.2021

13.12.2021 - 19.12.2021

20.12.2021 - 26.12.2021

Essential Reading List in Parallel Computing (including suggestions from my advisor, Didem Unat)

Trends

Architectures

Performance Models and Tools

Applications

Programming Models

Compilers

Runtime Systems

My Scalable Deep Learning List (papers I have read recently, not recommendations)