Release SuperBench v0.3.0

abuccts released this 24 Sep 05:57

· 2 commits to release/0.3 since this release

SuperBench v0.3.0 Release Notes

SuperBench Framework

Runner

Implement MPI mode.

Benchmarks

Support Docker benchmark.

Single-node Validation

Micro Benchmarks

Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

Metrics Unit Description

H2D_Mem_BW_GPU GB/s host-to-GPU bandwidth for each GPU

D2H_Mem_BW_GPU GB/s GPU-to-host bandwidth for each GPU

IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

Metrics	Unit	Description
IB_Write	MB/s	The IB write loopback throughput with different message sizes
IB_Read	MB/s	The IB read loopback throughput with different message sizes
IB_Send	MB/s	The IB send loopback throughput with different message sizes

NCCL/RCCL (Tool: NCCL/RCCL Tests)

Metrics	Unit	Description
NCCL_AllReduce	GB/s	The NCCL AllReduce performance with different message sizes
NCCL_AllGather	GB/s	The NCCL AllGather performance with different message sizes
NCCL_broadcast	GB/s	The NCCL Broadcast performance with different message sizes
NCCL_reduce	GB/s	The NCCL Reduce performance with different message sizes
NCCL_reduce_scatter	GB/s	The NCCL ReduceScatter performance with different message sizes

Disk (Tool: FIO – Standard Disk Performance Tool)

Metrics	Unit	Description
Seq_Read	MB/s	Sequential read performance
Seq_Write	MB/s	Sequential write performance
Rand_Read	MB/s	Random read performance
Rand_Write	MB/s	Random write performance
Seq_R/W_Read	MB/s	Read performance in sequential read/write, fixed measurement (read:write = 4:1)
Seq_R/W_Write	MB/s	Write performance in sequential read/write (read:write = 4:1)
Rand_R/W_Read	MB/s	Read performance in random read/write (read:write = 4:1)
Rand_R/W_Write	MB/s	Write performance in random read/write (read:write = 4:1)

H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

Metrics Unit Description

H2D_SM_BW_GPU GB/s host-to-GPU bandwidth using GPU kernel for each GPU

D2H_SM_BW_GPU GB/s GPU-to-host bandwidth using GPU kernel for each GPU

AMD GPU Support

Docker Image Support

ROCm 4.2 PyTorch 1.7.0
ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks

Kernel Launch (Tool: MSR-A build)

Metrics	Unit	Description
Kernel_Launch_Event_Time	Time (ms)	Dispatch latency measured in GPU time using hipEventRecord()
Kernel_Launch_Wall_Time	Time (ms)	Dispatch latency measured in CPU time

GEMM FLOPS (Tool: AMD rocblas-bench Tool)

Metrics	Unit	Description
FP64	GFLOPS	FP64 FLOPS without MatrixCore
FP32(MC)	GFLOPS	TF32 FLOPS with MatrixCore
FP16(MC)	GFLOPS	FP16 FLOPS with MatrixCore
BF16(MC)	GFLOPS	BF16 FLOPS with MatrixCore
INT8(MC)	GOPS	INT8 FLOPS with MatrixCore

E2E Benchmarks

CNN models -- Use PyTorch torchvision models
- ResNet: ResNet-50, ResNet-101, ResNet-152
- DenseNet: DenseNet-169, DenseNet-201
- VGG: VGG-11, VGG-13, VGG-16, VGG-19
BERT -- Use huggingface Transformers
- BERT
- BERT Large
LSTM -- Use PyTorch
GPT-2 -- Use huggingface Transformers

Bug Fix

VGG models failed on A100 GPU with batch_size=128

Other Improvement

Contribution related
- Contribute rule
- System information collection
Document
- Add release process doc
- Add design documents
- Add developer guide doc for coding style
- Add contribution rules
- Add docker image list
- Add initial validation results

Assets 2