Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes build error in MXNet on certain platforms #1141

Merged
merged 1 commit into from Jun 12, 2019

Conversation

4 participants
@apeforest
Copy link
Contributor

commented Jun 11, 2019

Remove the unnecessary MSHADOW_USE_F16C compiler flag when building Horovod with MXNet

Fix #1138

remove MSHADOW_USE_F16C
Signed-off-by: Lin Yuan <apeforest@gmail.com>

@apeforest apeforest force-pushed the apeforest:bugfix/mxnet_check_f16c branch from 7bfc975 to 4b8f3bb Jun 11, 2019

@apeforest

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

@yuxihu Please help to review.

@yuxihu
Copy link
Contributor

left a comment

So this affects installing Horovod on Amazon Linux based DLAMI? Ubuntu based is fine with/without this flag?

@@ -744,6 +744,7 @@ def build_mx_extension(build_ext, options):
else:
mxnet_mpi_lib.define_macros += [('MXNET_USE_MKLDNN', '0')]
mxnet_mpi_lib.define_macros += [('MSHADOW_USE_MKL', '0')]
mxnet_mpi_lib.define_macros += [('MSHADOW_USE_F16C', '0')]

This comment has been minimized.

Copy link
@yuxihu

yuxihu Jun 11, 2019

Contributor

We already set it to '0', which means disable. How did it cause installation failure?

This comment has been minimized.

Copy link
@apeforest

apeforest Jun 11, 2019

Author Contributor

It was not set to '0'. This flag is not used in horovod anyways. Even in MXNet, it should not be set to '1'. See https://github.com/dmlc/mshadow/blob/1d79ecfdb4c9234537e1bf5148f44a1af54501ec/make/mshadow.mk#L61

This comment has been minimized.

Copy link
@apeforest

This comment has been minimized.

Copy link
@eric-haibin-lin

eric-haibin-lin Jun 12, 2019

Contributor

This is set automatically by detecting if the platform supports f16c instructions, right? https://github.com/dmlc/mshadow/blob/1d79ecfdb4c9234537e1bf5148f44a1af54501ec/make/mshadow.mk#L36-L56

I did find f16c on DL AMI 18.0 (AL): cat /proc/cpuinfo | grep flags | grep f16c

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

This comment has been minimized.

Copy link
@eric-haibin-lin

eric-haibin-lin Jun 12, 2019

Contributor

Also if hvd-mxnet does automatic gradient compression in the future, f16c will help accelerate fp32<->fp16 conversion, I assume?

This comment has been minimized.

Copy link
@eric-haibin-lin

eric-haibin-lin Jun 12, 2019

Contributor

BTW - I find horovod 0.16.2 installation is fine on DLAMI. But 0.16.3 fails

This comment has been minimized.

Copy link
@apeforest

apeforest Jun 12, 2019

Author Contributor

This is set automatically by detecting if the platform supports f16c instructions, right? https://github.com/dmlc/mshadow/blob/1d79ecfdb4c9234537e1bf5148f44a1af54501ec/make/mshadow.mk#L36-L56

I did find f16c on DL AMI 18.0 (AL): cat /proc/cpuinfo | grep flags | grep f16c

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

The line you pasted is set when building MXNet with USE_F16C flag. In this case, the flag is to actually link a header file when building Horovod with MXNet as the dependency. It is set here: https://github.com/dmlc/mshadow/blob/6e94643bdf1d51a505b147f28c358fb71070b8fd/mshadow/base.h#L139

In either case, that flag is set automatically when building libmxnet.so. It seems we do not need to turn on the compiler flag when building Horovod.

@apeforest

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

Yes, ubuntu installation is fine without this flag.

@yuxihu

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Let me summarize my understanding. Currently AL based DLAMI is failing with this flag but will be fine after removing it. Ubuntu based DLAMI is fine either way, right?

Last month we tried to install Horovod on AL based DLAMI and it was fine. Not sure if something got changed by DLAMI team. I am fine with this change since it is redundant anyway.

Looks like this will block the DLAMI integration until next Horovod release is out.

@apeforest

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

Let me summarize my understanding. Currently AL based DLAMI is failing with this flag but will be fine after removing it. Ubuntu based DLAMI is fine either way, right?

Your understanding is correct.

@yuxihu

yuxihu approved these changes Jun 11, 2019

@apeforest

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

@alsrgv Please help to review/merge this PR. Thanks

@alsrgv

alsrgv approved these changes Jun 12, 2019

Copy link
Collaborator

left a comment

@apeforest, LGTM. Do you want me to release 0.16.5 with this fix to unblock DLAMI?

@alsrgv alsrgv merged commit 181cc1c into horovod:master Jun 12, 2019

2 checks passed

DCO DCO
Details
buildkite/horovod/pr Build #489 passed (42 minutes, 48 seconds)
Details
@yuxihu

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

LGTM. Do you want me to release 0.16.5 with this fix to unblock DLAMI?

@alsrgv That would be great. We are running more integration tests with DLAMI. Will let you know once everything looks fine and then we can release 0.16.5. Will let you know. Thanks for your help.

@apeforest apeforest deleted the apeforest:bugfix/mxnet_check_f16c branch Jun 12, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.