crash when sending very large messages #23

Closed
mattijsjanssens opened this Issue Jan 25, 2018 · 5 comments

mattijsjanssens commented Jan 25, 2018

We're occasionally seeing assertion failures of the form

ips_proto.c:1646: (scb->payload_size & 0x3) == 0

which seems to originate from somewhere in the network stack (e.g. https://github.com/intel/opa-psm2/blob/master/ptl_ips/ips_proto.c#L1957) when the size is not a multiple of 4.

  • this only happens occasionally
  • and only for extremely large messages (not sure of the exact size, but possibly 100 MB or even gigabytes)
  • and only on Omni-Path, with either Intel MPI + OFI or Open MPI + PSM2

Is this a known problem? We don't pad our MPI messages to a multiple of 4 bytes. Should we? If so, why does it not show up in ordinary usage (i.e. with smaller messages)?
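For reference, here is a minimal sketch of what the failing condition tests. The bitmask `(size & 0x3) == 0` is true only when the size is a multiple of 4 bytes. The helper names below (`is_dw_aligned`, `pad_to_dw`, `check_payload_size`) are hypothetical and are not part of PSM2 or MPI. The sketch only illustrates the arithmetic; it does not imply that applications are expected to pad their messages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero when len is a multiple of 4 bytes.
 * Mirrors the form of the failing assertion, (payload_size & 0x3) == 0. */
static inline int is_dw_aligned(size_t len)
{
    return (len & 0x3) == 0;
}

/* Hypothetical helper: round len up to the next multiple of 4.
 * Assumes len + 3 does not overflow size_t. */
static inline size_t pad_to_dw(size_t len)
{
    return (len + 3) & ~(size_t)0x3;
}

/* Hypothetical helper: aborts (in builds without NDEBUG) when len is
 * not a multiple of 4, in the same way the reported assertion does. */
static inline void check_payload_size(size_t len)
{
    assert(is_dw_aligned(len));
}
```

With these definitions, a 6-byte length fails the check and `pad_to_dw(6)` gives 8, while a length that is already a multiple of 4 is returned unchanged.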

Collaborator

aravindksg commented Jan 25, 2018

We have seen this problem before with message sizes that are not a multiple of a DW (doubleword, 4 bytes), but the issue was fixed as of PSM2 version PSM2_10.2-235.
Also, the line numbers you posted do not match: your execution is failing at ips_proto.c:1646, whereas the assert is currently at ips_proto.c:1957 in the latest PSM2 master. Could you clarify whether you are actually using the latest PSM2 version from GitHub or a different PSM2 version (either from a distro or from IFS)? If it is indeed an older version, could you please update to the latest GitHub master and retry?

mattijsjanssens commented Jan 26, 2018

Thanks for the answer. I will check.

Contributor

rwmcguir commented Apr 26, 2018

Can this issue be closed or is this still a problem?

mattijsjanssens commented May 3, 2018

Contributor

rwmcguir commented May 3, 2018

Thank you for confirming.

@rwmcguir rwmcguir closed this May 3, 2018
