pytorch_nersc_wheel

This repo contains PyTorch wheel installation files for the NERSC Cori supercomputer, built from the Intel-PyTorch source code, icc_0.4 branch.

Performance has been optimized for the following features:

  • Conv2d optimization with MKL-DNN
  • Conv3d optimization with vol2col and col2vol parallelization
  • LSTM optimization with sigmoid and tanh parallelization
  • AVX512 support for Intel Xeon Skylake CPU and Xeon Phi (KNL, KNM)
  • icc support to accommodate the NERSC compilation environment

Installation

  • pytorch_original_gcc - compiled from the original PyTorch master with gcc
  • pytorch_intel_icc - compiled from the Intel-PyTorch icc branch with icc; supports both Haswell and KNL
  • pytorch_intel_icc512 - compiled from the Intel-PyTorch icc branch with icc and AVX512 support, specialized for KNL
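The wheels can be installed with pip inside the target Python environment. A minimal sketch; the wheel filename below is a placeholder, substitute the actual file shipped in this repo:

# install a local wheel into the user site-packages
# (pytorch_intel_icc512.whl is a placeholder name, not the real filename)
pip install --user pytorch_intel_icc512.whl

# quick sanity check
python -c "import torch; print(torch.__version__)"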

Benchmark

Batch script for Cori

BKM: in our experience, the KMP_AFFINITY setting has a great impact on PyTorch CPU performance; we will keep updating the best KMP_AFFINITY settings we know of. On Haswell compute nodes:

#!/bin/bash

#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -q regular
#SBATCH -L SCRATCH
#SBATCH -C haswell

export KMP_AFFINITY="granularity=fine,compact,1,0"
export OMP_NUM_THREADS=32
export KMP_BLOCKTIME=1
python benchmark.py

On KNL compute nodes:

#!/bin/bash

#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -q regular
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,cache

export KMP_AFFINITY="granularity=fine,compact,1,0"
export OMP_NUM_THREADS=68
export KMP_BLOCKTIME=1
python benchmark.py
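The scripts above invoke a benchmark.py from this repo, which is not reproduced here. As a rough illustration only, a standalone timing script in the same spirit might look like the following (the layer sizes and iteration counts are arbitrary assumptions, not the repo's actual benchmark):

import time
import torch
import torch.nn as nn

# toy Conv2d forward benchmark; shapes are arbitrary, for illustration only
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
x = torch.randn(32, 64, 56, 56)

# warm-up iterations so OpenMP/MKL-DNN threads are initialized
for _ in range(5):
    conv(x)

start = time.time()
for _ in range(20):
    conv(x)
elapsed = (time.time() - start) / 20
print("Conv2d forward: %.3f ms/iter" % (elapsed * 1000))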

Distributed Training

This section provides basic guidelines for implementing synchronous SGD with the PyTorch distributed package. torch.distributed provides an MPI-like interface for exchanging tensor data across multi-node networks. Note that torch.nn.parallel.DistributedDataParallel offers a friendly wrapper around torch.distributed for synchronous SGD, but it supports only the nccl and gloo backends. An example of distributed MNIST training with synchronous SGD is provided in mnist_dist.py. Generally, it takes only 4 steps to apply distributed training. Please note that this example serves as a fundamental prototype of distributed learning; torch.distributed provides all kinds of communication primitives with which you can implement any synchronization algorithm, such as DeepSpeech's ring-allreduce.

  1. Initialize with the mpi backend
import torch
import torch.distributed as dist
dist.init_process_group(backend='mpi')
  2. Partition your local dataset using dist.get_rank() and dist.get_world_size(). The dataset partition pattern is user defined; you may randomly select a batch for each rank or use a uniformly shuffled index as shown in the example (a minimal sketch follows below).
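A minimal sketch of the uniform-shuffled-index approach; the helper below is an illustration, not the exact partitioning used in mnist_dist.py:

import torch
import torch.distributed as dist

def partition_indices(dataset_size, seed=1234):
    """ give each rank an equal, disjoint slice of a shuffled index """
    torch.manual_seed(seed)  # same seed => identical shuffle on every rank
    indices = torch.randperm(dataset_size).tolist()
    chunk = dataset_size // dist.get_world_size()
    rank = dist.get_rank()
    return indices[rank * chunk:(rank + 1) * chunk]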
  3. After your network is initialized, synchronize weights across all ranks. Theoretically you only need to synchronize weights once at the beginning of training; in practice I tend to synchronize weights at the beginning of every epoch to eliminate accumulated numerical error.
def sync_params(model):
    """ broadcast rank 0 parameter to all ranks """
    for param in model.parameters():
        dist.broadcast(param.data, 0)
  4. After each backward step, synchronize gradients
def sync_grads(model):
    """ all_reduce grads from all ranks """
    for param in model.parameters():
        dist.all_reduce(param.grad.data)
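Putting the pieces together, a schematic epoch loop might look like this (model, optimizer, criterion, num_epochs and train_loader are assumed to be defined elsewhere, with train_loader built from this rank's partition):

# schematic training loop tying the steps together (illustrative only)
for epoch in range(num_epochs):
    sync_params(model)                 # step 3: re-sync weights each epoch
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        sync_grads(model)              # step 4: all_reduce gradients across ranks
        optimizer.step()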

I also rewrote print() with a little trick: with debug_print=True, every printed message gets a prefix of [rank/world_size]; with debug_print=False, only messages from the master rank are printed while the rest of the ranks are muted.
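The actual helper lives in mnist_dist.py; a minimal Python 3 sketch of such a rank-aware print() could look like this (the module-level debug_print flag is an assumption for illustration):

import builtins
import torch.distributed as dist

debug_print = False

def print(*args, **kwargs):
    """ rank-aware print: prefix with [rank/world_size] or mute non-master ranks """
    rank, world = dist.get_rank(), dist.get_world_size()
    if debug_print:
        builtins.print("[%d/%d]" % (rank, world), *args, **kwargs)
    elif rank == 0:
        builtins.print(*args, **kwargs)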

To launch this example on Cori:

cd examples/
salloc --qos=interactive -N 2 -t 00:30:00 -C haswell
# to silence the "MPI implementation doesn't support multithreading" warning:
export MPICH_MAX_THREAD_SAFETY=multiple
srun -N 2 python -u mnist_dist.py
