Not an Issue, but close: M1 compilation and speed #1498

Closed
GillesDuvert opened this issue Jan 26, 2023 · 24 comments

Comments

@GillesDuvert
Contributor

I have a login on a remote mac mini with M1.
The system prefers MacPorts, which is not a problem per se. The main defect is that, as with Homebrew, plplot is compiled without dynamic drivers, so unless we patch the plplot MacPort (something build_gdl.sh already does for Homebrew's plplot), there will be no 3D support and no wxWidgets PLOT windows. Otherwise, all the components for GDL are here.

The main issue was getting OpenMP. With the current state of CMakeLists.txt, it is impossible to get it with Apple clang. I succeeded only by:

  1. using a modified version of CMakeLists.txt:
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -463,8 +463,8 @@
 # -DOPENMP=ON|OFF
 if(OPENMP)
   find_package(OpenMP QUIET)
-  set(USE_OPENMP ${OPENMP_FOUND})
-  if(OPENMP_FOUND)
+  set(USE_OPENMP ${OpenMP_FOUND})
+  if(OpenMP_FOUND)
     if(MSVC)
       set(LIBRARIES ${LIBRARIES} vcomp)
     elseif(WIN32)
@@ -473,9 +473,10 @@
       set(LIBRARIES ${LIBRARIES} ${OpenMP_CXX_FLAGS})
     endif()
     if(APPLE)
+      set(LIBRARIES ${LIBRARIES} ${OpenMP_CXX_LIBRARIES})
       link_directories(/usr/local/lib)
     endif()
-  else(OPENMP_FOUND)
+  else(OpenMP_FOUND)
  2. installing clang-14:
    sudo port install clang-14
  3. calling cmake with this clang compiler:
    -DCMAKE_CXX_COMPILER="/opt/local/bin/clang++-mp-14" -DCMAKE_C_COMPILER="/opt/local/bin/clang-mp-14"

Results:
OpenMP is active, as the !CPU system variable shows:

GDL> !CPU
{
    "HW_VECTOR": 0,
    "VECTOR_ENABLE": 0,
    "HW_NCPU": 8,
    "TPOOL_NTHREADS": 6,
    "TPOOL_MIN_ELTS": 100000,
    "TPOOL_MAX_ELTS": 0
}

test_all_basic_functions,size=1000000 gives
% Time elapsed ALL TESTS: 3.9622629 seconds.
That is 2 times faster than on my (old) Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz.

BUT: time_test4 shows an incredible loss of time, in particular a bottleneck in Foreach (44.5 s where my computer does it in 0.13 s):

|       OS_FAMILY=unix, OS=darwin, ARCH=arm64
| Thu Jan 26 22:26:44 2023
       1    0.0723610 Empty for loop, 6000000 times
       2      44.5536 Foreach, 6000000 elements
       3   0.00927997 Call empty procedure (1 param) 300000 times
       4      4.45700 Add 600000 integer scalars and store
       5      5.57202 150000 scalar loops each of 5 ops, 2 =, 1 if)
       6   0.00158405 Mult 512 by 512 byte by constant and store, 90 times
       7    0.0392852 Shift 512 by 512 byte and store, 900 times
       8   0.00400996 Add constant to 512x512 byte array, 300 times
       9   0.00382113 Add two 512 by 512 byte arrays and store, 240 times
      10   0.00355911 Mult 512 by 512 floating by constant, 90 times
      11    0.0114350 Shift 512 x 512 array, 180 times
      12   0.00487995 Add two 512 by 512 floating images, 120 times
      13   0.00964713 Generate 3000000 random numbers
      14   0.00843716 Invert a 332^2 random matrix
      15   0.00488091 LU Decomposition of a 332^2 random matrix
      16      3.31779 Transpose 665^2 byte, FOR loops
      17     0.121225 Transpose 665^2 byte, row and column ops x 10
      18    0.0152781 Transpose 665^2 byte, TRANSPOSE function x 100
      19      2.23489 Log of 300000 numbers, FOR loop
      20   0.00312591 Log of 300000 numbers, vector ops 10 times
      21    0.0573280 2097152 point forward plus inverse FFT
      22    0.0414770 Smooth 512x512 byte array, 5x5 boxcar, 30 times
      23    0.0487540 Smooth 512x512 floating array, 5x5 boxcar, 30 times
      24      1.07070 Write and read 512x512 byte array, 120 times
      25      1.90312 Create 120000 empty lists
      63.5695=Total Time,      0.061679211=Geometric mean,      25 tests.
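
As a cross-check of the summary line, here is a minimal Python sketch (not part of GDL or time_test4) that re-derives the total and the geometric mean from the rows printed above and reports how much of the run is spent in the slowest test. The names `RAW`, `parse_rows`, `total` and `geo_mean` are illustrative, and the row format (index, seconds, description) is assumed from the listing.

```python
import math

# Rows copied from the time_test4 listing above: index, seconds, description.
RAW = """\
1    0.0723610 Empty for loop, 6000000 times
2      44.5536 Foreach, 6000000 elements
3   0.00927997 Call empty procedure (1 param) 300000 times
4      4.45700 Add 600000 integer scalars and store
5      5.57202 150000 scalar loops each of 5 ops, 2 =, 1 if)
6   0.00158405 Mult 512 by 512 byte by constant and store, 90 times
7    0.0392852 Shift 512 by 512 byte and store, 900 times
8   0.00400996 Add constant to 512x512 byte array, 300 times
9   0.00382113 Add two 512 by 512 byte arrays and store, 240 times
10   0.00355911 Mult 512 by 512 floating by constant, 90 times
11    0.0114350 Shift 512 x 512 array, 180 times
12   0.00487995 Add two 512 by 512 floating images, 120 times
13   0.00964713 Generate 3000000 random numbers
14   0.00843716 Invert a 332^2 random matrix
15   0.00488091 LU Decomposition of a 332^2 random matrix
16      3.31779 Transpose 665^2 byte, FOR loops
17     0.121225 Transpose 665^2 byte, row and column ops x 10
18    0.0152781 Transpose 665^2 byte, TRANSPOSE function x 100
19      2.23489 Log of 300000 numbers, FOR loop
20   0.00312591 Log of 300000 numbers, vector ops 10 times
21    0.0573280 2097152 point forward plus inverse FFT
22    0.0414770 Smooth 512x512 byte array, 5x5 boxcar, 30 times
23    0.0487540 Smooth 512x512 floating array, 5x5 boxcar, 30 times
24      1.07070 Write and read 512x512 byte array, 120 times
25      1.90312 Create 120000 empty lists
"""


def parse_rows(text):
    """Return (index, seconds, label) tuples from time_test4-style rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # index, seconds, rest of the line
        if len(parts) == 3:
            rows.append((int(parts[0]), float(parts[1]), parts[2]))
    return rows


rows = parse_rows(RAW)
times = [seconds for _, seconds, _ in rows]
total = sum(times)
# Geometric mean computed through logs to avoid underflow in the product.
geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
slowest = max(rows, key=lambda row: row[1])

print(f"{len(rows)} tests, total {total:.4f} s, geometric mean {geo_mean:.6f} s")
print(f"slowest: test {slowest[0]} ({slowest[2]}), {slowest[1] / total:.0%} of total")
```

The geometric mean is much less sensitive to the single Foreach outlier than the total is, which is why the two summary numbers tell such different stories here.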

Interestingly, compiling without Eigen3 roughly halves the time spent in the tests above (but, as expected, produces terrible results for test_all_basic_functions: 30 seconds).
As Eigen is related to the alignment of our variables in memory, there may be something to investigate here.

@GillesDuvert
Contributor Author

GillesDuvert commented Jan 27, 2023

Apparently we have no problem with alignment. I mean, no error is reported.

@GillesDuvert
Contributor Author

GillesDuvert commented Jan 27, 2023

We currently align on 16 bytes if Eigen is included in the build. This is the recommended alignment, on M1 as elsewhere.
This is apparently what gcc does anyway.
Nothing strange here.

@GillesDuvert
Contributor Author

Foreach on M1 takes exactly the same time whatever the size of the array: 600, 6000, 6000000...!

@GillesDuvert
Contributor Author

In fact it is the deletion of an object (here a loop variable) that seems to be the culprit:
line 1580 of prognode.cpp.
Not deleting gives, for time_test4:
2 0.423445 Foreach, 6000000 elements
Does this slow deletion point to a deficiency in GDL's code and a way to improve performance?

@GillesDuvert
Contributor Author

Eh eh eh, GDL's FOREACH code is too good. See #1500.
Doing as IDL does would of course remove the Delete and thus speed things up.
Nevertheless, this deletion problem on M1 will be found everywhere else in GDL. To be continued.

@GillesDuvert
Contributor Author

OK folks, the only reasonable way to get GDL compiled with OpenMP on an M1 (here a Mac mini with Darwin Kernel Version 22.3.0 RELEASE_ARM64_T8103 arm64) is to read this page and bypass the restraints put in place by Apple. Applying the recipe in the above page to the letter, we just use Apple clang, with OpenMP:

cmake -DINTERACTIVE_GRAPHICS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=" -I/usr/local/include" -DCMAKE_CPPFLAGS=" -Xclang -fopenmp" -DCMAKE_LDFLAGS=" -lomp " -DX11=ON -DLIBPROJDIR=/opt/local/lib/proj5/include -DPYTHON=OFF -DEIGEN3=ON ../gdl
Use fuzzy detection for PLplot lib. (e.g. in /usr/lib)
-- INFO: We prefer to use GraphicsMagick than ImageMagick
-- warning, you don't have NetCDF-4 versionsome new NetCDF capabilities in NetCDF-4 (related to Groups) will not be usable
-- Checking for module 'mpi-c'
--   No package 'mpi-c' found
-- Checking for module 'mpi-cxx'
--   No package 'mpi-cxx' found
-- Summary

GDL - GNU DATA LANGUAGE [Standalone]
System                 Darwin-22.3.0
Files generated        Unix Makefiles
Installation prefix    /usr/local
C++ compiler           /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -O3 -DNDEBUG

-- Options

Interactive plots: OFF
Widgets support: TRUE

OpenMP support         ON (flag: -Xclang -fopenmp /usr/local/lib/libomp.dylib)
WxWidgets              ON (libs:-L/opt/local/Library/Frameworks/wxWidgets.framework/Versions/wxWidgets/3.1/lib;;;-framework IOKit;-framework Carbon;-framework Cocoa;-framework QuartzCore;-framework AudioToolbox;-framework System;-framework OpenGL;-lwx_baseu-3.1;-lwx_osx_cocoau_core-3.1; headers:/opt/local/Library/Frameworks/wxWidgets.framework/Versions/wxWidgets/3.1/lib/wx/include/osx_cocoa-unicode-3.1;/opt/local/Library/Frameworks/wxWidgets.framework/Versions/wxWidgets/3.1/include/wx-3.1)
GRAPHICSMAGICK         ON (libs:/opt/local/lib/libGraphicsMagick.dylib;/opt/local/lib/libGraphicsMagick++.dylib; headers:/opt/local/include/GraphicsMagick)
TIFF                   ON (libs:/opt/local/lib/libtiff.dylib; headers:/opt/local/include)
GeoTIFF                ON (libs:/opt/local/lib/libgeotiff.dylib; headers:/opt/local/include)
NetCDF                 ON (libs:netcdf; headers:/opt/local/include)
HDF4                   ON (libs:/opt/local/lib/libmfhdf.dylib;/opt/local/lib/libdf.dylib;z;/opt/local/lib/libjpeg.dylib; headers:/opt/local/include)
HDF5                   ON (libs:/opt/local/lib/libhdf5.dylib;/opt/local/lib/libz.dylib;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/usr/lib/libdl.tbd;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/usr/lib/libm.tbd; headers:/opt/local/include)
FFTW                   ON (libs:/opt/local/lib/libfftw3.dylib;/opt/local/lib/libfftw3f.dylib; headers:/opt/local/include)
MPI                    OFF
PROJ                   ON (libs:/opt/local/lib/proj5/lib/libproj.dylib; headers:/opt/local/lib/proj5/include)
Python                 OFF
UDUNITS-2              ON (libs:/opt/local/lib/libudunits2.dylib; headers:/opt/local/include/udunits2)
EIGEN3                 ON (libs:; headers:/opt/local/include/eigen3)
GRIB                   ON (libs:/opt/local/lib/libeccodes.dylib; headers:/opt/local/include)
QHULL                  ON (libs:/opt/local/lib/libqhullcpp.a;/opt/local/lib/libqhullstatic_r.a; headers:/opt/local/include)
GLPK                   ON (libs:/opt/local/lib/libglpk.dylib; headers:/opt/local/include)
SHAPELIB               ON (libs:/opt/local/lib/libshp.dylib; headers:/opt/local/include)
EXPAT                  ON (libs:/opt/local/lib/libexpat.dylib; headers:/opt/local/include)
libpng                 ON (libs:/opt/local/lib/libpng.dylib;/opt/local/lib/libz.dylib; headers:/opt/local/include;/opt/local/include)

-- Mandatory modules
Plplot                 ON (libs:/opt/local/lib/libplplot.dylib;/opt/local/lib/libplplotcxx.dylib; headers:/opt/local/include)
GNU Readline           ON (libs:/opt/local/lib/libreadline.dylib;/opt/local/lib/libhistory.dylib; headers:/opt/local/include)
GSL                    ON (libs:/opt/local/lib/libgsl.dylib;/opt/local/lib/libgslcblas.dylib; headers:/opt/local/include)
Zlib                   ON (libs:/opt/local/lib/libz.dylib; headers:/opt/local/include)
(N)curses              ON (libs:/opt/local/lib/libncurses.dylib;/opt/local/lib/libform.dylib; headers:/opt/local/include)
RPC                    ON (libs:; headers:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/usr/include)

Note: INTERACTIVE_GRAPHICS=OFF because MacPorts' plplot is not compiled with DYNAMIC_DRIVERS; this will be solved easily.

With OpenMP on, test_all_basic_functions is down to 5 seconds, twice as fast as my old Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz.

BUT: time_test4 chokes on

     2      23.0193 Foreach, 6000000 elements
    16      1.73310 Transpose 665^2 byte, FOR loops
    19      1.16885 Log of 300000 numbers, FOR loop
    25      1.33631 Create 120000 empty lists

This is related to the strange slowness of deleting objects.

@alaingdl
Contributor

alaingdl commented Feb 1, 2023

I stopped looking for a way to use OpenMP on Mac three years ago: every release they change some internal details... A total mess and a waste of time.

One point: did you try a compiler other than Clang? It is quite easy to switch to a true GCC...

@GillesDuvert
Contributor Author

Indeed, I tried with

  1. MacPorts clang (which must be called as clang++-mp-14, since clang is Apple's clang). It works, but strangely OpenMP was not there.
  2. gcc, which would be the best candidate since (I did tests) it does not choke on Foreach etc., but there are a lot of link problems with all the GDL libraries (unresolved symbols), probably because they are not built with gcc.

Of course, it may well be that the simplest solution is to use Brew and not MacPorts; I'll do tests.

@GillesDuvert
Contributor Author

I was able to recompile on M1 using the version of build_gdl.sh as modified in #1510, i.e. only by having the script recompile the plplot library.
This is the simplest approach and should be followed.
Today this means using Homebrew, but adding support for MacPorts in build_gdl.sh is straightforward.
The executable is built with OpenMP, so the vector functions are very fast.
However, unlike my old Intel-based Catalina (10.15.7) machine, the test machine here (Darwin Kernel Version 22.3.0 RELEASE_ARM64_T8103 arm64) is painfully slow on object creation/destruction.
It remains to be seen whether this is due to the compiler, the architecture, or a defect in GDL's OO design.

@GillesDuvert
Contributor Author

Ah, and there's a problem when using the PROJ library (crash in test_map.pro)

@GillesDuvert
Contributor Author

The PROJ library problem may just arise from an incompatibility with the installed versions. Tricky, or does it just need a debugger?
The object creation/destruction slowness is interesting in that it shows real bottlenecks in our code, one of which can already be solved (for FOREACH) by not creating objects at all. But in general this is our main source of speed problems in loops, as we create and destroy many objects (and that's what C++ is for). Possibly this slowness points to the absence of a C++ qualifier that would force the compiler to optimize object creation. (Any C++ guru around?)

@GillesDuvert
Contributor Author

Proj library problem above was just a version problem

@GillesDuvert
Contributor Author

now on #1675

@alaingdl
Contributor

alaingdl commented Dec 8, 2023

I gained intermittent access to an M2 running OSX.

The good news: yes, the script is working fine and OpenMP is activated.

The very bad news: performance is horrible in some places, as already reported here by @GillesDuvert
(8 cores, M2)

time_test4
       2      38.2682 Foreach, 6000000 elements
       4      3.84859 Add 600000 integer scalars and store
       5      4.80080 150000 scalar loops each of 5 ops, 2 =, 1 if)
      16      2.87045 Transpose 665^2 byte, FOR loops
      19      1.92629 Log of 300000 numbers, FOR loop
...

On my old laptop (4 cores, i5)

       2    0.148648 Foreach, 6000000 elements
       4    0.0324070 Add 600000 integer scalars and store
       5    0.0323660 150000 scalar loops each of 5 ops, 2 =, 1 if)
      16    0.0399718 Transpose 665^2 byte, FOR loops
      19    0.0284669 Log of 300000 numbers, FOR loop
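
To put a number on the gap between the two excerpts above, here is a minimal Python sketch that divides the M2 timings by the i5 timings, test by test. The dictionary names and the `slowdown` helper are illustrative; the values are copied from the two listings, and nothing else is assumed about the machines.

```python
# Seconds per test, copied from the two time_test4 excerpts above.
m2_macos = {2: 38.2682, 4: 3.84859, 5: 4.80080, 16: 2.87045, 19: 1.92629}
i5_laptop = {2: 0.148648, 4: 0.0324070, 5: 0.0323660, 16: 0.0399718, 19: 0.0284669}


def slowdown(slow, fast):
    """Return {test index: slow time / fast time} for tests present in both runs."""
    return {k: slow[k] / fast[k] for k in sorted(slow) if k in fast}


ratios = slowdown(m2_macos, i5_laptop)
for test, ratio in ratios.items():
    print(f"test {test:2d}: {ratio:6.1f}x slower on the M2")
```

All five affected tests are interpreter-loop tests rather than vectorized array operations, which is consistent with the per-object creation/destruction cost suspected earlier in the thread.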

@alaingdl alaingdl reopened this Dec 8, 2023
@GillesDuvert
Contributor Author

@alaingdl Having a fast GDL on M1 and M2 seems an important goal. The slowness of creation/destruction of C++ objects is quite possibly the culprit. This in turn may be due to the compiler plus the platform, as the same code is fast everywhere else (including Apple x86). It would be interesting to test GDL on an M1 running Linux rather than OSX to confirm that. Perhaps simpler, or rather better, C++ code, such as adding 'const' or 'final' everywhere possible, will run faster on OSX+M1. Using the Apple compiler may also help. These are things I am not in a position to do, having no Apple machine or license.

@alaingdl
Contributor

alaingdl commented Dec 8, 2023

I have a meeting on Monday with a colleague who removed OSX from an Apple laptop and installed an ARM-based Linux! I hope it will clarify this serious question. A Clang problem?

Later I will try to play with the Clang options...

@GillesDuvert
Contributor Author

I wonder if @opoplawski would confirm GDL compiles on linux M1... If they have such a machine in their huge rebuild system.

The other way around is to check with Apple clang, as I did here, including the openmp bypass described in the same comment.

@opoplawski
Contributor

So, we (Fedora) don't necessarily build on M1/M2 - but we do build for Linux on aarch64 in general.

I started poking at time_test4 - which seems to be part of IDL. I tried running the version from IDL 8.7 with gdl 1.0.2 and got:

GDL> time_test4
% Compiled module: TIME_TEST4.
% Compiled module: TIME_TEST.
% TIME_TEST_INIT: Function not found: LMGR
% Error occurred at: TIME_TEST_INIT      42 /home/orion/idl/time_test.pro
%                    TIME_TEST4_INTERNAL   632 /home/orion/idl/time_test.pro
%                    TIME_TEST4          24 /home/orion/idl/time_test4.pro
%                    $MAIN$
% Execution halted at: TIME_TEST4          24 /home/orion/idl/time_test4.pro

I'm not finding any other source for time_test4. What am I missing?

@opoplawski
Contributor

I think I figured out how to run time_test4 from IDL 8.7 - here is the output from an aarch64 Fedora builder:

       1    0.0865450 Empty for loop, 6000000 times
       2     0.133599 Foreach, 6000000 elements
       3    0.0200760 Call empty procedure (1 param) 300000 times
       4    0.0408940 Add 600000 integer scalars and store
       5    0.0390480 150000 scalar loops each of 5 ops, 2 =, 1 if)
       6   0.00125408 Mult 512 by 512 byte by constant and store, 90 times
       7    0.0110099 Shift 512 by 512 byte and store, 900 times
       8   0.00262213 Add constant to 512x512 byte array, 300 times
       9   0.00358605 Add two 512 by 512 byte arrays and store, 240 times
      10   0.00737286 Mult 512 by 512 floating by constant, 90 times
      11    0.0105178 Shift 512 x 512 array, 180 times
      12    0.0118940 Add two 512 by 512 floating images, 120 times
      13   0.00635409 Generate 3000000 random numbers
      14    0.0149791 Invert a 332^2 random matrix
      15    0.0103650 LU Decomposition of a 332^2 random matrix
      16    0.0516040 Transpose 665^2 byte, FOR loops
      17    0.0348761 Transpose 665^2 byte, row and column ops x 10
      18    0.0306211 Transpose 665^2 byte, TRANSPOSE function x 100
      19    0.0332041 Log of 300000 numbers, FOR loop
      20   0.00239587 Log of 300000 numbers, vector ops 10 times
      21     0.162641 2097152 point forward plus inverse FFT
      22    0.0372660 Smooth 512x512 byte array, 5x5 boxcar, 30 times
      23    0.0511451 Smooth 512x512 floating array, 5x5 boxcar, 30 times

This appears to be an 8 core VM. cpuinfo reports:

processor	: 7
BogoMIPS	: 50.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1
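
A side note on the cpuinfo excerpt above: the implementer and part fields can be decoded to see which core the Fedora builder runs on. This is a minimal Python sketch, assuming (from Arm's public ID tables, not from this thread) that implementer 0x41 is Arm and part 0xd0c is the Neoverse N1. If that mapping holds, the builder is a generic aarch64 server core rather than Apple silicon. The lookup tables and function names are illustrative and cover only this one entry.

```python
# Fields copied from the cpuinfo excerpt above (the long Features line is omitted).
CPUINFO = """\
processor       : 7
BogoMIPS        : 50.00
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1
"""

# Assumption: a tiny excerpt of Arm's implementer/part ID tables, enough for this entry.
IMPLEMENTERS = {0x41: "Arm"}
PARTS = {(0x41, 0xD0C): "Neoverse N1"}


def parse_cpuinfo(text):
    """Split 'key : value' lines into a dict, tolerating tabs or spaces around ':'."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_core(fields):
    """Return a 'vendor core' string, or an 'unknown ...' placeholder for unlisted IDs."""
    impl = int(fields["CPU implementer"], 16)
    part = int(fields["CPU part"], 16)
    vendor = IMPLEMENTERS.get(impl, f"unknown implementer {impl:#x}")
    core = PARTS.get((impl, part), f"unknown part {part:#x}")
    return f"{vendor} {core}"


info = parse_cpuinfo(CPUINFO)
print(describe_core(info))
```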

@alaingdl
Contributor

Thank you so much!

Clear, good results :)

Since those results are in line with the ones on x86, I would say that Clang or OSX has a problem with the way we use it by default (do we have to look at the options and flags?)

@GillesDuvert
Contributor Author

Many thanks, @opoplawski !!
I suspect using the Apple compiler (for testing the speed of tests n°2,4,5,16,19 no need to have openmp!) will show better results.

@alaingdl
Contributor

Thanks to an M2 Apple running an Ubuntu Linux ARM VM with 8 cores, I was able to replicate the results from @opoplawski
Thanks again Orion! (the gcc part)

On the other hand, on my old laptop (x86) I installed clang-15 and the numbers are as good as with gcc
(CC=clang-15 CXX=clang++-15 cmake .. -DOPENMP=OFF )

Could someone remind me how to switch off most options when compiling with Clang? Thanks!
(I suspect what @dirteat mentioned here #1635 (comment) ) (or just a bad flag?)

@GillesDuvert
Contributor Author

@alaingdl nice results, too. In GDL's cmake or in build_gdl.sh I see no specific clang options/flags (???).
In fact in build_gdl.sh there is nothing about the version of clang on OSX to be used.

@GillesDuvert
Contributor Author

Solved, see #1755.
