MPI tag can overflow #6
Please confirm for me:
Ng
Nx
ichunk
From: cdaley <notifications@github.com>
Reply-To: lanl/SNAP <reply@reply.github.com>
Date: Monday, June 26, 2017 at 12:23 PM
To: lanl/SNAP <SNAP@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Subject: [lanl/SNAP] MPI tag can overflow (#6)
The MPI tag value can overflow when using Cray MPI:
Rank 65531 [Thu Jun 22 21:21:27 2017] [c4-5c0s13n0] Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(161): MPI_Isend(buf=0x2aad1623efc0, count=3840, MPI_DOUBLE_PRECISION, dest=256, tag=2097153, comm=0x84000006, request=0x2aad35ffe280) failed
PMPI_Isend(108): Invalid tag, value is 2097153
Rank 65273 [Thu Jun 22 21:21:28 2017] [c4-5c0s3n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(212): MPI_Recv(buf=0x2aad16c7a000, count=3840, MPI_DOUBLE_PRECISION, src=253, tag=2097153, comm=0xc4000000, status=0x2aad2dffc000) failed
MPI_Recv(118): Invalid tag, value is 2097153
Rank 65530 [Thu Jun 22 21:21:27 2017] [c4-5c0s13n0] Fatal error in MPI_Recv: Invalid tag, error stack:
MPI_Recv(212): MPI_Recv(buf=0x2aad15e03f80, count=3840, MPI_DOUBLE_PRECISION, src=257, tag=2097153, comm=0x84000006, status=0x2aad31ffc000) failed
MPI_Recv(118): Invalid tag, value is 2097153
forrtl: error (76): Abort trap signal
The maximum valid tag in cray-mpich/7.4.4 is 2097151 (which is 2^21 - 1). The MPI standard specifies that the tag upper bound must be at least 32767. Ideally the tag value in SNAP should be kept below the value specified by the MPI standard.
This error happened when running the APEX "Grand Challenge" SNAP problem on 8192 nodes of Cori-KNL at NERSC with 65532 MPI ranks (npey=258, npez=254) and 8 OpenMP threads per MPI rank.
ng=144, nx=1000, ichunk=20
Here is the full input file
My apologies. I designed the GC problem as a 5-year goal and had not attempted to run it myself. I can see it will run afoul of even the more forgiving tag limits. Given the number of groups and spatial work chunks, I cannot reduce the maximum tag size below 32768 without coming up with a new formula and doing more extensive testing. I can more easily reset the formula to stay below the limits of the Cray (and other) MPI implementations, as this case does represent somewhat of an upper bound for expected SNAP runs.
Preference?
In fact, you can test the change yourself: change g_off in thrd_comm.f90 to 2**10. That provides ample offset for this number of spatial work chunks, while bringing the maximum tag value well below 2**21 - 1. (For this problem you could reduce it further to 2**9, or even to exactly 400, given the number of spatial work chunks.)
The challenge is that when a user tests with a lot of spatial work chunks (to test fine-grained communication), a larger g_off is necessary (given the current formula) to provide unique IDs to each chunk-octant-group message. I had hoped 2**14 was a sufficient balance, knowing that several MPI implementations permit larger message tags, but again, my mistake: the GC problem still breaks that. Sorry.
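The trade-off described above can be sketched as a range of workable g_off values. This is only an illustration, not code from SNAP: it assumes the largest tag is g_off*NG + 8*(NX/ICHUNK), as given later in this thread, and it assumes that g_off must be at least 8*(NX/ICHUNK) to keep tags unique, which is a reading of the "400" remark above. The function name `g_off_range` is hypothetical.

```python
def g_off_range(tag_ub, ng, nx, ichunk):
    """Return (lowest, highest) usable g_off, or None if no value fits.

    Assumptions (not taken from the SNAP source):
      * largest tag = g_off*ng + 8*(nx // ichunk)
      * g_off >= 8*(nx // ichunk) is needed for unique tags
    """
    nchunks = nx // ichunk              # spatial work chunks, integer division assumed
    lowest = 8 * nchunks                # room for eight octants per chunk
    highest = (tag_ub - 8 * nchunks) // ng  # largest g_off that keeps the tag <= tag_ub
    return (lowest, highest) if lowest <= highest else None


if __name__ == "__main__":
    # Grand Challenge inputs from this thread: ng=144, nx=1000, ichunk=20
    print(g_off_range(2**21 - 1, 144, 1000, 20))  # limit reported for cray-mpich/7.4.4
    print(g_off_range(32767, 144, 1000, 20))      # minimum bound guaranteed by the MPI standard
```

Under these assumptions the Cray limit leaves a wide range for this problem, while the standard's minimum bound leaves none, which matches the remark that staying below 32768 would need a new formula.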
Let me know.
Thanks.
From: cdaley <notifications@github.com>
Reply-To: lanl/SNAP <reply@reply.github.com>
Date: Tuesday, June 27, 2017 at 9:31 AM
To: lanl/SNAP <SNAP@noreply.github.com>
Cc: Robert Zerr <rzerr@lanl.gov>, Comment <comment@noreply.github.com>
Subject: Re: [lanl/SNAP] MPI tag can overflow (#6)
ng=144, nx=1000, ichunk=20
Here is the full input file
! Input from namelist
&invar
nthreads=8
nnested=1
npey=258
npez=254
ndimen=3
nx=1000
lx=100.0
ny=1032
ly=103.2
nz=1016
lz=101.6
ichunk=20
nmom=4
nang=48
ng=144
mat_opt=1
src_opt=1
timedep=1
it_det=0
tf=1.0
nsteps=10
iitm=5
oitm=100
epsi=1.E-4
fluxp=0
scatp=0
fixup=1
soloutp=0
popout=0
swp_typ=0
angcpy=1
/
Thanks. I am happy to make the source code change that you describe.
We also encountered the same issue when running the "medium" APEX problem with nang=48 and ng=144. Can you give us a formula so we can calculate how the maximum tag value changes depending on input parameters?
By increasing the medium problem's NG value to that of the GC problem, you'll see the same problem with the tag.
The tag formula appears in thrd_comm.f90 in two places (lines 213 and 298 in my version, which I hope is about the same). The g_off parameter is to provide enough offset for unique tags for the spatial work chunks with sweep solutions in eight octants, creating a new tag for each group as well. Using input keywords, that formula simplifies for the largest tag value to:
max_tag = g_off*NG + 8*(NX/ICHUNK)
Unless you're running NX/ICHUNK as some large number (thousands), the tag value is mostly determined by g_off*NG. So in both cases 16384*144 exceeds even the limit of your Cray implementation. Using g_off as low as 400 will be fine, because NX/ICHUNK in your case is 50.
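As a quick check of the formula above, here is a small sketch that evaluates it for the Grand Challenge inputs and the g_off values mentioned in this thread. It is not SNAP code; the function name `max_tag` is hypothetical, and NX/ICHUNK is assumed to be an integer division.

```python
CRAY_TAG_UB = 2**21 - 1   # limit reported for cray-mpich/7.4.4 in this thread
MPI_MIN_TAG_UB = 32767    # minimum tag upper bound guaranteed by the MPI standard


def max_tag(g_off, ng, nx, ichunk):
    """Largest tag per the thread's formula: g_off*NG + 8*(NX/ICHUNK)."""
    nchunks = nx // ichunk  # assumption: integer division
    return g_off * ng + 8 * nchunks


if __name__ == "__main__":
    # Inputs from the posted namelist: ng=144, nx=1000, ichunk=20
    for g_off in (2**14, 2**10, 400):
        tag = max_tag(g_off, ng=144, nx=1000, ichunk=20)
        print(f"g_off={g_off:6d}  max_tag={tag:8d}  "
              f"fits Cray limit: {tag <= CRAY_TAG_UB}  "
              f"fits standard minimum: {tag <= MPI_MIN_TAG_UB}")
```

With these inputs, only g_off=2**14 exceeds the Cray limit. None of the three values stays under 32767, since g_off*NG alone is already 57600 at g_off=400.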
Sorry for the confusion and departure from standard. Hope this helps.
From: cdaley <notifications@github.com>
Reply-To: lanl/SNAP <reply@reply.github.com>
Date: Tuesday, June 27, 2017 at 11:27 AM
To: lanl/SNAP <SNAP@noreply.github.com>
Cc: Robert Zerr <rzerr@lanl.gov>, Comment <comment@noreply.github.com>
Subject: Re: [lanl/SNAP] MPI tag can overflow (#6)
Thanks. I am happy to make the source code change that you describe.
We also encountered the same issue when running the "medium" APEX problem with nang=48 and ng=144. Can you give us a formula so we can calculate how the maximum tag value changes depending on input parameters?
This was referenced Feb 16, 2021