Adding ETI for DeepCopy / ViewFill etc. #1578

Closed
crtrott opened this issue Apr 19, 2018 · 16 comments

@crtrott
Member

crtrott commented Apr 19, 2018

With the recent changes, which significantly improve the runtime of those functions for higher-rank views, we unfortunately also increased compile times significantly for some applications that make heavy use of those functions. On the plus side, if you make heavy use of, say, rank 4 or rank 5 views, these functions sped up by orders of magnitude. (See: #1270)

To mitigate compile times we need to introduce ETI (explicit template instantiation), since in most cases we cannot choose at compile time to simply drop to a simple binary operation, due to padding and the way subviews work.
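
For readers less familiar with ETI, here is a minimal sketch of the general C++ mechanism, not the actual Kokkos implementation. `FillKernel` is a hypothetical stand-in for a templated fill kernel. In a real library the `extern template` lines would live in a header and the explicit instantiation definitions in a single .cpp file; they share one file here only so the sketch is self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a templated fill kernel (not Kokkos code).
template <class Scalar, std::size_t Rank>
struct FillKernel {
  static void apply(Scalar* data,
                    const std::array<std::size_t, Rank>& extents,
                    Scalar value);
};

// Defined out of class so it is not implicitly inline; "extern template"
// does not suppress instantiation of inline functions.
template <class Scalar, std::size_t Rank>
void FillKernel<Scalar, Rank>::apply(Scalar* data,
                                     const std::array<std::size_t, Rank>& extents,
                                     Scalar value) {
  assert(data != nullptr);
  std::size_t total = 1;
  for (std::size_t e : extents) total *= e;  // number of elements
  for (std::size_t i = 0; i < total; ++i) data[i] = value;
}

// Header side: explicit instantiation declarations. Translation units that
// see these do not instantiate the listed specializations themselves.
extern template struct FillKernel<double, 4>;
extern template struct FillKernel<double, 5>;

// Library side (one .cpp file): explicit instantiation definitions. The code
// for these specializations is compiled once, here.
template struct FillKernel<double, 4>;
template struct FillKernel<double, 5>;
```

A call such as `FillKernel<double, 4>::apply(ptr, {{2, 3, 2, 2}}, 1.5)` then links against the precompiled instantiation instead of generating the kernel again in every translation unit that uses it.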

@crtrott crtrott self-assigned this Apr 20, 2018
@crtrott crtrott added the Enhancement Improve existing capability; will potentially require voting label Apr 20, 2018
@crtrott crtrott added this to the 2018 April milestone Apr 20, 2018
@ndellingwood
Contributor

@mhoemmen
Contributor

Thanks @crtrott ! If I can help please let me know; compile time is a big deal for all of us.

@crtrott
Member Author

crtrott commented Apr 20, 2018

OK the following is my test code:

#include <cstdlib>  // atoi
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc,argv);
  {
     int N = (argc>1) ? atoi(argv[1]) : 1000;
     int M = (argc>2) ? atoi(argv[2]) : 1000;
     int R = (argc>3) ? atoi(argv[3]) : 10;

     Kokkos::View<double****,Kokkos::LayoutLeft> al("Al",N,M,R,10);
     Kokkos::View<double****,Kokkos::LayoutRight> ar("Ar",N,M,R,10);
     Kokkos::View<double***[10],Kokkos::LayoutLeft> al1("Al1",N,M,R);
     Kokkos::View<double***[10],Kokkos::LayoutRight> ar1("Ar1",N,M,R);
     Kokkos::View<double**[10][10],Kokkos::LayoutLeft> al2("Al2",N,M);
     Kokkos::View<double**[10][10],Kokkos::LayoutRight> ar2("Ar2",N,M);


     Kokkos::deep_copy(al,1.0);
     Kokkos::deep_copy(ar,2.0);
     Kokkos::deep_copy(al1,3.0);
     Kokkos::deep_copy(al2,4.0);
     Kokkos::deep_copy(ar1,5.0);
     Kokkos::deep_copy(ar2,6.0);
     Kokkos::deep_copy(al,al1);
     Kokkos::deep_copy(al,al2);
     Kokkos::deep_copy(ar,ar1);
     Kokkos::deep_copy(ar,ar2);
     Kokkos::deep_copy(al,ar1);
     Kokkos::deep_copy(al,ar2);
     Kokkos::deep_copy(ar,al1);
     Kokkos::deep_copy(ar,al2);
  }
  Kokkos::finalize();
}

This is my compile line:

time icpc  -I./ -I/home/crtrott/Kokkos/kokkos/core/src -I/home/crtrott/Kokkos/kokkos/containers/src -I/home/crtrott/Kokkos/kokkos/algorithms/src --std=c++11 -xCORE-AVX2 -fopenmp -O3 -g  -c main.cpp

This is where I am right now in terms of Compile Times:

| Branch | 2.5 | 2.6 | DevelopETI |
|---|---|---|---|
| Time | 7.0 | 24.2 | 10.7 |
| Symbols | 367 | 1696 | 340 |

I will post performance later. I am also looking into whether I can cut the remaining time to get closer to the 7.0.

@crtrott
Member Author

crtrott commented Apr 23, 2018

OK made some more progress.

The following are file sizes in MB for this experiment if I instantiate all the way up to rank 8 for:
Scalar: int, int64_t, float, double
Index: int, int64_t
Layout: Left, Right, Stride and combinations thereof

| File | ETI | Deprecated Code | main.o | libkokkos.a | main.exe |
|---|---|---|---|---|---|
| Size | OFF | ON | 7.0 | 2.2 | 4.2 |
| Size | ON | ON | 2.3 | 433 | 8.5 |
| Size | OFF | OFF | 6.0 | 2.2 | 3.6 |
| Size | ON | OFF | 2.0 | 430 | 6.6 |

Now the question is how much of this we want to ETI. I will do some more experiments on how high in rank we should go, in particular how much difference it makes to instantiate only up to rank 5. The other question is whether ETI should be on or off by default.
Obviously compiling the 400+ MB takes quite a while, and most codes will only use a fraction of it. On the other hand, it will not affect Trilinos compile times significantly. This is still way faster than compiling even Tpetra.

@crtrott
Member Author

crtrott commented Apr 23, 2018

Pushed branch "issue-1578" which has most of the stuff in it for OpenMP. Note: I didn't yet commit the cpp files, since we first need to decide how much to make available via ETI.
But I did commit the scripts in scripts/eti. You have to run the primary script from within the core/eti directory after doing a mkdir for OpenMP, ROCm, Cuda, Serial and Threads.

@mhoemmen
Contributor

@crtrott yikes 400 MB :( How bad is it when you do up to rank 4 or 5?

@crtrott
Member Author

crtrott commented Apr 23, 2018

One thing is that the linker will throw out most of the stuff (you see that the executable in my case goes back to 6.6MB, vs 3.6 with no ETI). But there are some options to reduce that:

  • always use int64_t as index
  • only instantiate up to Rank 5 (not many apps use higher dimensional stuff)
  • Don't instantiate float
  • Don't instantiate for LayoutStride

@nmhamster
Contributor

Does this have debug turned on (when you get to 400MB)?

@crtrott
Member Author

crtrott commented Apr 23, 2018

Yeah this is with -g

@nmhamster
Contributor

Ok that’s good. Try it without, that will give you a better idea of the code size.

@crtrott
Member Author

crtrott commented Apr 23, 2018

Disabling float as Scalar, int as IndexType and LayoutStride as primary Layout gets me down to 109MB for the lib and 4.2 MB for the executable (with deprecated code off).
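
As a rough sanity check on why trimming the type list helps, the number of instantiations can be estimated with a simple product model. This is only a back-of-the-envelope sketch under assumptions not stated in the thread: one kernel per (scalar, index type, destination layout, source layout, rank) combination, ranks 1 through 8, and "LayoutStride as primary Layout" read as "LayoutStride as destination layout". The function names are made up for illustration.

```cpp
#include <cassert>

// Assumed model: every combination of the five dimensions is instantiated once.
constexpr long count_instantiations(int scalars, int index_types,
                                    int dst_layouts, int src_layouts,
                                    int max_rank) {
  return static_cast<long>(scalars) * index_types * dst_layouts *
         src_layouts * max_rank;
}

// Full set: 4 scalars, 2 index types, 3 layouts on each side, ranks 1..8.
constexpr long full_set = count_instantiations(4, 2, 3, 3, 8);
// Reduced set: no float, int64_t index only, no LayoutStride destination.
constexpr long reduced_set = count_instantiations(3, 1, 2, 3, 8);

static_assert(full_set == 576, "4 * 2 * 3 * 3 * 8");
static_assert(reduced_set == 144, "3 * 1 * 2 * 3 * 8");

// Ratio between two instantiation counts.
inline double reduction_factor(long full, long reduced) {
  assert(reduced > 0);
  return static_cast<double>(full) / static_cast<double>(reduced);
}
```

Under this model the reduced set is a factor of 4 smaller, which is in the same ballpark as the library shrinking from 430 MB to 109 MB. The real breakdown may differ.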

@crtrott
Member Author

crtrott commented Apr 23, 2018

Here is the comparison for the reduced set with and without "-g":

| File | ETI | Deprecated Code | CXXFLAGS | main.o | libkokkos.a | main.exe |
|---|---|---|---|---|---|---|
| Size | ON | OFF | -O3 -g | 2.0 | 109 | 4.2 |
| Size | ON | OFF | -O3 | 0.2 | 14 | 0.8 |

@crtrott
Member Author

crtrott commented Apr 23, 2018

One thing I am considering is that an app could provide its own ETI list. I.e. we give you the scripts to generate the headers and cpp files you need, which take the Scalar types and Layouts as input, and then you can redirect the build to use your own ETI directory.
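
To make the idea concrete, here is a minimal sketch of what such a generator could do: take the app's Scalar types and Layouts and emit one instantiation line per combination and rank. It is not the script from scripts/eti. The function `make_eti_lines` and the macro name `APP_ETI_INST` are hypothetical, and the real generated files may look quite different.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds one line per (scalar, layout, rank) combination.
// APP_ETI_INST is a made-up macro name used only for illustration.
std::vector<std::string> make_eti_lines(const std::vector<std::string>& scalars,
                                        const std::vector<std::string>& layouts,
                                        int max_rank) {
  assert(max_rank >= 1);
  std::vector<std::string> lines;
  for (const std::string& scalar : scalars) {
    for (const std::string& layout : layouts) {
      for (int rank = 1; rank <= max_rank; ++rank) {
        lines.push_back("APP_ETI_INST(" + scalar + ", " + layout + ", " +
                        std::to_string(rank) + ")");
      }
    }
  }
  return lines;
}
```

An app that only needs `double` and `int` with LayoutLeft and LayoutRight up to rank 5 would get 20 lines, instead of paying for the full default set.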

@ibaned
Contributor

ibaned commented Apr 23, 2018

@crtrott an ETI list sounds like a good idea. int is an important index type: for meshing codes where a lot of the data is indexing, using int64_t is a 2X penalty on memory consumption.

Edit: never mind, index type is different from scalar type. Always using int64_t for indexing sounds okay.

@nmhamster
Contributor

So one option here could be to use GCC diagnostics to drop the default debug generation down. We probably only need call info here.

@mhoemmen
Contributor

@crtrott We SPARC folks like the "app provides its own ETI list" option, at least if the simplest set of ETI enables doesn't help build times. If it's easy for Kokkos to distill a minimal set of ETI enables, then it would be nicer for Kokkos to handle it.
