pytorch_nersc_wheel

This repo contains PyTorch wheel installation files for the NERSC Cori supercomputer, built from the Intel-PyTorch source code, icc_0.4 branch.

Performance has been optimized for the following features:

  • Conv2d optimization with MKL-DNN
  • Conv3d optimization with vol2col and col2vol parallelization
  • LSTM optimization with sigmoid and tanh parallelization
  • AVX512 support for Intel Xeon Skylake CPU and Xeon Phi (KNL, KNM)
  • icc support to accommodate the NERSC compilation environment

Installation

  • pytorch_original_gcc - compiled from the original PyTorch master with gcc
  • pytorch_intel_icc - compiled from the Intel-PyTorch icc branch with icc; supports both Haswell and KNL
  • pytorch_intel_icc512 - compiled from the Intel-PyTorch icc branch with icc and AVX512 support, specialized for KNL
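The wheels can be installed with pip inside the target Python environment. A minimal sketch; the wheel filename below is a placeholder, substitute the actual file shipped in this repo:

# install a local wheel into the user site-packages
# (pytorch_intel_icc512.whl is a placeholder name, not the real filename)
pip install --user pytorch_intel_icc512.whl

# quick sanity check
python -c "import torch; print(torch.__version__)"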

Benchmark

Batch script for Cori

BKM: in our experience, the KMP_AFFINITY setting has a great impact on PyTorch CPU performance; we will keep updating the best KMP_AFFINITY settings we know of. On Haswell compute nodes:

#!/bin/bash

#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -q regular
#SBATCH -L SCRATCH
#SBATCH -C haswell

export KMP_AFFINITY="granularity=fine,compact,1,0"
export OMP_NUM_THREADS=32
export KMP_BLOCKTIME=1
python benchmark.py

On KNL compute nodes:

#!/bin/bash

#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -q regular
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,cache

export KMP_AFFINITY="granularity=fine,compact,1,0"
export OMP_NUM_THREADS=68
export KMP_BLOCKTIME=1
python benchmark.py
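The scripts above invoke a benchmark.py from this repo, which is not reproduced here. As a rough illustration only, a standalone timing script in the same spirit might look like the following (the layer sizes and iteration counts are arbitrary assumptions, not the repo's actual benchmark):

import time
import torch
import torch.nn as nn

# toy Conv2d forward benchmark; shapes are arbitrary, for illustration only
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
x = torch.randn(32, 64, 56, 56)

# warm-up iterations so OpenMP/MKL-DNN threads are initialized
for _ in range(5):
    conv(x)

start = time.time()
for _ in range(20):
    conv(x)
elapsed = (time.time() - start) / 20
print("Conv2d forward: %.3f ms/iter" % (elapsed * 1000))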

Distributed Training

This section provides basic guidelines for implementing synchronous SGD with the PyTorch distributed package. torch.distributed provides an MPI-like interface for exchanging tensor data across multi-node networks. Note that torch.nn.parallel.DistributedDataParallel offers a friendly wrapper around torch.distributed for synchronous SGD, but it supports only the nccl and gloo backends. An example of distributed MNIST training with synchronous SGD is provided in mnist_dist.py. Generally, it takes only 4 steps to apply distributed training. Please note that this example serves as a fundamental prototype of distributed learning; torch.distributed provides all kinds of communication primitives with which you can implement any synchronization algorithm, such as DeepSpeech's ring-allreduce.

  1. Initialize with the mpi backend
import torch
import torch.distributed as dist
dist.init_process_group(backend='mpi')
  2. Partition your local dataset using dist.get_rank() and dist.get_world_size(). The dataset partition pattern is user defined; you may randomly select a batch for each rank or use a uniformly shuffled index as shown in the example (a minimal sketch follows below).
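A minimal sketch of the uniform-shuffled-index approach; the helper below is an illustration, not the exact partitioning used in mnist_dist.py:

import torch
import torch.distributed as dist

def partition_indices(dataset_size, seed=1234):
    """ give each rank an equal, disjoint slice of a shuffled index """
    torch.manual_seed(seed)  # same seed => identical shuffle on every rank
    indices = torch.randperm(dataset_size).tolist()
    chunk = dataset_size // dist.get_world_size()
    rank = dist.get_rank()
    return indices[rank * chunk:(rank + 1) * chunk]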
  3. After your network is initialized, synchronize weights across all ranks. Theoretically you only need to synchronize weights once at the beginning of training; in practice I tend to synchronize weights at the beginning of every epoch to eliminate accumulated numerical error.
def sync_params(model):
    """ broadcast rank 0 parameter to all ranks """
    for param in model.parameters():
        dist.broadcast(param.data, 0)
  4. After each backward step, synchronize gradients
def sync_grads(model):
    """ all_reduce grads from all ranks """
    for param in model.parameters():
        dist.all_reduce(param.grad.data)
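Putting the pieces together, a schematic epoch loop might look like this (model, optimizer, criterion, num_epochs and train_loader are assumed to be defined elsewhere, with train_loader built from this rank's partition):

# schematic training loop tying the steps together (illustrative only)
for epoch in range(num_epochs):
    sync_params(model)                 # step 3: re-sync weights each epoch
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        sync_grads(model)              # step 4: all_reduce gradients across ranks
        optimizer.step()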

I also rewrote print() with a little trick: with debug_print=True, every printed message gets a prefix of [rank/world_size]; with debug_print=False, only messages from the master rank are printed while the rest of the ranks are muted.
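The actual helper lives in mnist_dist.py; a minimal Python 3 sketch of such a rank-aware print() could look like this (the module-level debug_print flag is an assumption for illustration):

import builtins
import torch.distributed as dist

debug_print = False

def print(*args, **kwargs):
    """ rank-aware print: prefix with [rank/world_size] or mute non-master ranks """
    rank, world = dist.get_rank(), dist.get_world_size()
    if debug_print:
        builtins.print("[%d/%d]" % (rank, world), *args, **kwargs)
    elif rank == 0:
        builtins.print(*args, **kwargs)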

To launch this example on Cori:

cd examples/
salloc --qos=interactive -N 2 -t 00:30:00 -C haswell
# to silence the "MPI implementation doesn't support multithreading" warning:
export MPICH_MAX_THREAD_SAFETY=multiple
srun -N 2 python -u mnist_dist.py
