MPI tag can overflow #6

Closed
cdaley opened this issue Jun 26, 2017 · 5 comments

Comments

@cdaley

cdaley commented Jun 26, 2017

The MPI tag value can overflow when using Cray MPI:

Rank 65531 [Thu Jun 22 21:21:27 2017] [c4-5c0s13n0] Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(161): MPI_Isend(buf=0x2aad1623efc0, count=3840, MPI_DOUBLE_PRECISION, dest=256, tag=2097153, comm=0x84000006, request=0x2aad35ffe280) failed
PMPI_Isend(108): Invalid tag, value is 2097153
Rank 65273 [Thu Jun 22 21:21:28 2017] [c4-5c0s3n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(212): MPI_Recv(buf=0x2aad16c7a000, count=3840, MPI_DOUBLE_PRECISION, src=253, tag=2097153, comm=0xc4000000, status=0x2aad2dffc000) failed
MPI_Recv(118): Invalid tag, value is 2097153
Rank 65530 [Thu Jun 22 21:21:27 2017] [c4-5c0s13n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(212): MPI_Recv(buf=0x2aad15e03f80, count=3840, MPI_DOUBLE_PRECISION, src=257, tag=2097153, comm=0x84000006, status=0x2aad31ffc000) failed
MPI_Recv(118): Invalid tag, value is 2097153
forrtl: error (76): Abort trap signal

The maximum valid tag in cray-mpich/7.4.4 is 2097151 (2^21 - 1), so the failing tag value 2097153 exceeds it by 2. The MPI standard only guarantees that the tag upper bound (MPI_TAG_UB) is at least 32767. Ideally SNAP should keep its tag values at or below 32767 so that they are valid under any conforming MPI implementation.
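The comparison above can be checked with a few lines of arithmetic. This is only a sketch: the constants are copied from the report, `tag_is_valid` and `report` are hypothetical helper names, and a real application would query the limit at run time through the `MPI_TAG_UB` attribute (for example with `MPI_Comm_get_attr` in C) rather than hard-coding it.

```python
# Tag limits quoted in the report above.
MPI_STANDARD_MIN_TAG_UB = 32767   # smallest MPI_TAG_UB the MPI standard allows
CRAY_MPICH_TAG_UB = 2**21 - 1     # 2097151, as reported for cray-mpich/7.4.4
FAILING_TAG = 2097153             # tag value taken from the error stack


def tag_is_valid(tag, tag_ub):
    """Return True if tag lies in the valid MPI range [0, tag_ub]."""
    return 0 <= tag <= tag_ub


def report(tag):
    """Hypothetical helper: compare a tag with both limits."""
    return {
        "valid_on_cray_mpich": tag_is_valid(tag, CRAY_MPICH_TAG_UB),
        "portable": tag_is_valid(tag, MPI_STANDARD_MIN_TAG_UB),
        "excess_over_cray_limit": max(0, tag - CRAY_MPICH_TAG_UB),
    }


if __name__ == "__main__":
    print(report(FAILING_TAG))
```

A tag that passes the `portable` check is valid on every conforming MPI implementation, which is the stricter target suggested above.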

This error happened when running the APEX "Grand Challenge" SNAP problem on 8192 nodes of Cori-KNL at NERSC with 65532 MPI ranks (npey=258, npez=254) and 8 OpenMP threads per MPI rank.
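As a quick consistency check on the run configuration, the rank count follows directly from the process grid. The sketch below uses only the numbers quoted above; `layout` is a hypothetical helper, and the assumption that ranks are packed as evenly as possible across nodes is mine, not something stated in the report.

```python
def layout(npey, npez, nodes, threads_per_rank):
    """Derive rank and thread counts from the process grid (assumes even packing)."""
    ranks = npey * npez
    # Ceiling division: the most ranks any single node has to host.
    max_ranks_per_node = -(-ranks // nodes)
    return {
        "ranks": ranks,
        "max_ranks_per_node": max_ranks_per_node,
        "idle_rank_slots": max_ranks_per_node * nodes - ranks,
        "threads_per_node": max_ranks_per_node * threads_per_rank,
        "total_threads": ranks * threads_per_rank,
    }


if __name__ == "__main__":
    # npey=258, npez=254, 8192 nodes, 8 OpenMP threads per rank
    print(layout(258, 254, 8192, 8))
```

The product 258 x 254 reproduces the 65532 ranks quoted in the report.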

@zerr
Collaborator

zerr commented Jun 27, 2017 via email

@cdaley
Author

cdaley commented Jun 27, 2017

ng=144, nx=1000, ichunk=20

Here is the full input file:

! Input from namelist
&invar
  nthreads=8
  nnested=1
  npey=258
  npez=254
  ndimen=3
  nx=1000
  lx=100.0
  ny=1032
  ly=103.2
  nz=1016
  lz=101.6
  ichunk=20
  nmom=4
  nang=48
  ng=144
  mat_opt=1
  src_opt=1
  timedep=1
  it_det=0
  tf=1.0
  nsteps=10
  iitm=5
  oitm=100
  epsi=1.E-4
  fluxp=0
  scatp=0
  fixup=1
  soloutp=0
  popout=0
  swp_typ=0
  angcpy=1
/
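For readers relating this input to the error log, the per-rank subdomain can be worked out from the namelist values. The sketch below is an illustration only: `decomposition` is a hypothetical helper, and reading the message size as nang x ichunk x (local ny) is my assumption about how the buffer is sized. It happens to reproduce the `count=3840` seen in the error stack.

```python
# Values copied from the namelist above.
params = {
    "nx": 1000, "ny": 1032, "nz": 1016,
    "npey": 258, "npez": 254,
    "ichunk": 20, "nang": 48, "ng": 144,
}


def decomposition(p):
    """Split the global mesh over the npey x npez process grid."""
    ny_local, ny_rem = divmod(p["ny"], p["npey"])
    nz_local, nz_rem = divmod(p["nz"], p["npez"])
    nchunks, chunk_rem = divmod(p["nx"], p["ichunk"])
    return {
        "ny_local": ny_local,
        "nz_local": nz_local,
        "x_chunks": nchunks,
        "divides_evenly": ny_rem == 0 and nz_rem == 0 and chunk_rem == 0,
        # Assumption: one message carries nang * ichunk * ny_local doubles.
        "message_count": p["nang"] * p["ichunk"] * ny_local,
    }


if __name__ == "__main__":
    print(decomposition(params))
```

Each rank owns a 4 x 4 column of cells in y and z, swept in 50 chunks of 20 cells along x.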

@zerr
Collaborator

zerr commented Jun 27, 2017 via email

@cdaley
Author

cdaley commented Jun 27, 2017

Thanks. I am happy to make the source code change that you describe.

We also encountered the same issue when running the "medium" APEX problem with nang=48 and ng=144. Can you give us a formula so we can calculate how the maximum tag value changes depending on input parameters?

@zerr
Collaborator

zerr commented Jun 27, 2017 via email
