[Feature Request] KOKKOS fix rigid #1624

Open · ag949 opened this issue Aug 6, 2019 · 37 comments
@ag949 commented Aug 6, 2019

Summary

Porting fix rigid to KOKKOS

Detailed Description

In the past I've benefited significantly from using LAMMPS accelerated with KOKKOS. I am trying to run large-scale simulations with rigid bodies on GPUs. Is porting fix rigid planned for the future?

@jaj52 commented Aug 6, 2019

This will be super useful!

@Pakketeretet2 (Collaborator)

I would like to take a look because I am interested in it myself but I would have to bother @stanmoore1 a lot with Kokkos-related questions, since porting a complicated fix like rigid seems to be a lot harder than porting a pair style.
Are you interested in fix rigid or fix rigid/small?

@ag949 (Author) commented Aug 29, 2019

This sounds great, thank you. As you say, it doesn't look trivial. We would be interested in fix rigid.

@stanmoore1 (Contributor)

I'm happy to answer questions about Kokkos.

@Pakketeretet2 (Collaborator)

OK, I will take a look at it. Unfortunately, I am more interested in rigid/small but I'll see what I can do for either. I think rigid/small would be more suitable for GPUs and the like because it seems more similar to what Hoomd-Blue does.
@stanmoore1, should a simple port be reasonably performant or are there special things I should worry about when porting fixes like this? Or should I just port and get it working first, and then worry about performance?

@stanmoore1 (Contributor)

@Pakketeretet2 yes, a simple port is fine, i.e. just thread over loops of atoms. After we have a port we'll worry about performance; I can do some tuning. I used C++11 lambdas extensively in #1650 and it made porting easier since I didn't have to mess with functors or tags.
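
For illustration, a minimal sketch of what "threading over loops of atoms" with a lambda can look like (not code from the actual port; the names d_v, d_f, dtf, and nlocal are placeholders for device views of velocities and forces, a timestep prefactor, and the local atom count):

  // sketch only: a serial loop over local atoms turned into a parallel_for
  Kokkos::parallel_for("FixRigidKokkos::final_integrate", nlocal,
    KOKKOS_LAMBDA(const int i) {
      d_v(i,0) += dtf * d_f(i,0);
      d_v(i,1) += dtf * d_f(i,1);
      d_v(i,2) += dtf * d_f(i,2);
    });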

@Pakketeretet2 (Collaborator)

OK, since fix propel/self is pretty much done from a programming point of view, I have started with this now.

@stanmoore1 (Contributor)

Great, feel free to ping me either here on Github or via email if you have questions.

@Pakketeretet2 (Collaborator)

Will do. For a second I couldn't find the lambdas anywhere but I found them. They look nicer than C++11 lambdas so they were hard to spot. :)

@Pakketeretet2 (Collaborator)

The latest changes with minimize seem to cause linking problems with clang on my laptop, but it might be something dumb on my part. I'll look into it more tomorrow.

@akohlmey (Member)

The latest changes with minimize seem to cause linking problems with clang on my laptop, but it might be something dumb on my part. I'll look into it more tomorrow.

You are building LAMMPS with cmake, right?
There is a small adjustment needed:

$ git diff
  diff --git a/cmake/Modules/Packages/KOKKOS.cmake b/cmake/Modules/Packages/KOKKOS.cmake
  index 0e5bf70a4..cc1e05162 100644
  --- a/cmake/Modules/Packages/KOKKOS.cmake
  +++ b/cmake/Modules/Packages/KOKKOS.cmake
  @@ -17,6 +17,8 @@ if(PKG_KOKKOS)
                            ${KOKKOS_PKG_SOURCES_DIR}/atom_vec_kokkos.cpp
                            ${KOKKOS_PKG_SOURCES_DIR}/comm_kokkos.cpp
                            ${KOKKOS_PKG_SOURCES_DIR}/comm_tiled_kokkos.cpp
  +                         ${KOKKOS_PKG_SOURCES_DIR}/min_kokkos.cpp
  +                         ${KOKKOS_PKG_SOURCES_DIR}/min_linesearch_kokkos.cpp
                            ${KOKKOS_PKG_SOURCES_DIR}/neighbor_kokkos.cpp
                            ${KOKKOS_PKG_SOURCES_DIR}/neigh_list_kokkos.cpp
                            ${KOKKOS_PKG_SOURCES_DIR}/neigh_bond_kokkos.cpp

@Pakketeretet2 (Collaborator) commented Sep 12, 2019 via email

@Pakketeretet2 (Collaborator)

@stanmoore1 In some places the rigid code calls functions in math_extra that expect arrays of (implicit) size three. I am sure Kokkos provides a way to slice an array (subview or something). Two related questions:

  1. These functions are inline. Will it work to just call them from within a Kokkos::parallel_for, or do the functions themselves need porting?
  2. If calling them works, how do I properly grab a sub-part of a Kokkos array as an array? Is that possible? I am looking for the equivalent of angmom[ibody], where angmom is a double**, or (angmom+3*ibody) if angmom is stored as a flat double *.

@stanmoore1 (Contributor)

Any function that is called in a Kokkos::parallel_for needs to be declared with KOKKOS_INLINE_FUNCTION so the GPU __device__ decorators are added. For example, see https://github.com/lammps/lammps/blob/master/src/KOKKOS/math_special_kokkos.h#L53.

If a Kokkos view is 2D LayoutRight, then you can get a 1D slice view using Kokkos::subview(d_view,i,Kokkos::ALL);. For a flat array, you can get a raw pointer using view.data()+offset but that circumvents Kokkos bounds checking.
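
As a sketch of both points (the helper function and the view names below are made up for illustration, not LAMMPS code):

  // any helper called inside a parallel_for needs the device decoration
  KOKKOS_INLINE_FUNCTION
  void scale3(const double s, double *v) {
    v[0] *= s; v[1] *= s; v[2] *= s;
  }

  // inside a kernel: slice one row of a 2D LayoutRight view d_angmom
  auto row = Kokkos::subview(d_angmom, ibody, Kokkos::ALL());
  scale3(0.5, row.data());   // row.data() gives a contiguous double* of length 3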

@stanmoore1 (Contributor)

For fix rigid/small that has the body struct, you can just make a Kokkos View of type body. Then the calls are simply MathExtra::transpose_matvec(rot,d_body[i].angmom,wbody);
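
A rough sketch of that idea (the struct members and names are illustrative, not the actual FixRigidSmall definitions, and DeviceType is assumed to be the usual template parameter):

  struct Body {                       // plain-old-data, so it can live in a View
    double mass;
    double xcm[3], vcm[3], angmom[3];
  };

  Kokkos::View<Body*, DeviceType> d_body("rigid/small/kk:body", nlocal_body);
  // inside a kernel, d_body(i).angmom then behaves like a double[3]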

@stanmoore1 (Contributor)

Actually I'm not 100% sure if MathExtra::transpose_matvec(rot,d_body[i].angmom,wbody); would work, would have to try it out.

@Pakketeretet2 (Collaborator)

OK, before I can actually test any of that, I stumbled upon another thing: rigid uses a bunch of additional storage for center-of-mass positions, velocities, and forces. It looks like there are no associated Kokkos containers or data masks for those yet; at least, I cannot see any masks for them in atom_masks.h nor any containers in atom_kokkos.h.

What is the preferred way to add such things? These arrays only have to be the size of nbody, which will probably be significantly smaller than nlocal.

@stanmoore1 (Contributor)

Just to be clear, are you working on fix rigid or fix rigid/small?

@Pakketeretet2 (Collaborator) commented Sep 13, 2019

fix rigid.
EDIT: I plan to port both, but started with rigid. If you think rigid/small will be significantly easier I can also try that one first.

@stanmoore1 (Contributor) commented Sep 13, 2019

I would add Kokkos dualviews inside fix rigid/kk and call sync/modify directly on them. No need to add them to atom_masks.h. In other words, any array that is declared in fix_rigid.h would have a corresponding Kokkos view in fix_rigid_kokkos.h.
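
Sketched out, that suggestion could look roughly like the following in fix_rigid_kokkos.h (names and the exact view types are illustrative only, not the eventual implementation):

  // one DualView per plain array from fix_rigid.h, sized to nbody
  Kokkos::DualView<double*[3], LMPDeviceType> k_xcm;   // mirrors double **xcm
  Kokkos::DualView<double*[3], LMPDeviceType> k_vcm;   // mirrors double **vcm
  Kokkos::DualView<double*[3], LMPDeviceType> k_fcm;   // mirrors double **fcm

  // usage pattern: sync to the space about to touch the data,
  // mark it modified afterwards
  k_xcm.sync<LMPDeviceType>();
  // ... device kernel reads/writes k_xcm.d_view ...
  k_xcm.modify<LMPDeviceType>();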

@stanmoore1 (Contributor)

@Pakketeretet2 (Collaborator)

Ah, that is brilliant! So, just to make sure I fundamentally understand this: whenever you do a memoryKK->create_array or memoryKK->grow_array, Kokkos wraps that memory in a dualview so that afterwards you have a handle to the arrays on both the host and the device? And then in device code you use d_view and in host code you use h_view, and you just have to make sure to sync it whenever required? Does Kokkos take care of memcpy'ing it to the device from host when you create an array? It seems to but I am not sure.

Do you think it makes sense for me to make a pull request to push my changes to, so you can review the code as we go? That might put more strain on your time than me just bugging you here, though.

Overall I'm pleasantly surprised by how straightforward it seems to port to Kokkos. I will try to take notes that might be useful for me and/or others later.

@Pakketeretet2 (Collaborator) commented Sep 15, 2019 via email

@stanmoore1 (Contributor) commented Sep 16, 2019

whenever you do a memoryKK->create_array or memoryKK->grow_array, Kokkos wraps that memory in a dualview so that afterwards you have a handle to the arrays on both the host and the device?

This is correct. Typically memoryKK is used when you need to access the host pointer in legacy code since it takes the pointer from the host Kokkos view and assigns it to the legacy host pointer. If the array is only going to be accessed on the device, then using Kokkos::DualView directly is preferred since it skips the aliasing part.

And then in device code you use d_view and in host code you use h_view, and you just have to make sure to sync it whenever required?

Correct

Does Kokkos take care of memcpy'ing it to the device from host when you create an array? It seems to but I am not sure.

It is a little complicated. If the host view is marked as modified, it will do the create/grow on the host, otherwise it will default to the device.

However, when I do memoryKK->create_kokkos(k_xcm, xcm, "rigid/kk:xcm") the data in xcm is now all zeros, but the data in the host and device views is still zeros. I assume that is because there has to be an explicit synchronization.

By default Kokkos will initialize a view to all zeros when it is created or the new part when it is resized.

For data that has an atom mask the atomKK->sync and atomKK->modified took care of that, but how does that work for my own custom arrays?

You just call one of these:

k_view.modify<DeviceType>();
k_view.sync<DeviceType>();
k_view.modify<LMPHostType>();
k_view.sync<LMPHostType>();

@stanmoore1 (Contributor)

I suggest creating a work-in-progress PR, and I can review it as I have time.

@Pakketeretet2 (Collaborator)

Am working on it in PR #1680.

The syncing is still not entirely clear to me. Do these calls do the following? If so, the function names make some sense to me; otherwise they don't. :)
k_view.modify<DeviceType>(); --> Mark that the data in k_view was modified on the device
k_view.sync<DeviceType>(); --> Sync the data in k_view on the device (from the host)
k_view.modify<LMPHostType>(); --> Mark that the data in k_view was modified on the host
k_view.sync<LMPHostType>(); --> Sync the data in k_view on the host (from the device)

@stanmoore1 (Contributor)

Yes:

k_view.modify<DeviceType>(); --> Mark that the data in k_view was modified on the device
k_view.sync<DeviceType>(); --> Sync the data in k_view on the device (from the host)
k_view.modify<LMPHostType>(); --> Mark that the data in k_view was modified on the host
k_view.sync<LMPHostType>(); --> Sync the data in k_view on the host (from the device)

You never want to concurrently mark modified on both host and device. Also if you sync but nothing is modified it is just a no-op.

@stanmoore1 (Contributor)

So basically before any kernel where the data is used (Kokkos or legacy), first sync to the execution space (host or device). Then after the data is changed by the kernel, mark it as modified on the corresponding execution space. This will ensure you always have synced data.
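
Put together, the pattern for a custom array could look like this (a sketch with a hypothetical DualView k_xcm over nbody rigid bodies, not code from the actual PR):

  k_xcm.sync<LMPDeviceType>();            // make sure the device copy is current
  auto d_xcm = k_xcm.d_view;              // capture the device view in the lambda
  Kokkos::parallel_for("update_xcm", nbody,
    KOKKOS_LAMBDA(const int ibody) {
      d_xcm(ibody,0) += 0.0;              // placeholder for the real COM update
    });
  k_xcm.modify<LMPDeviceType>();          // device copy is now the newest

  k_xcm.sync<LMPHostType>();              // before legacy host code reads xcm again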

@Pakketeretet2 (Collaborator) commented Sep 23, 2019

Sorry to keep bothering you @stanmoore1,
"It is a little complicated. If the host view is marked as modified, it will do the create/grow on the host, otherwise it will default to the device."
This is still not clear to me. Or rather, it is, I think, but does not answer my original question.
In FixRigid::FixRigid(), some arrays (let's take the body array as example, because that one has me stumped) are allocated and filled with data, on the host. I want to preserve these values but move them to the Kokkos array because I need them inside device kernels. However, because the host array is filled in the base constructor, the data is already in the host array by the time I would call memoryKK->create(). Is there a neat Kokkos function to move the data to the kokkos array at the creation of the Kokkos array? I can work around it using temporary arrays and memcpy's, but there are a lot of arrays and I would like to avoid all that boilerplate. If not, I suppose I can just write a simple function in FixRigid to use and then it can be moved to an appropriate location later.
EDIT: From memory_kokkos.h it seems like there is none. Did you work around this issue in another way or did you simply never encounter it?

The good news is that, when I manually fixed the body array using temporary arrays, I do get the correct numbers in the thermo output for the first step, and porting the initial_integrate and final_integrate functions should be a lot easier than figuring out all the memory stuff.

@stanmoore1 (Contributor)

@Pakketeretet2

In FixRigid::FixRigid(), some arrays (let's take the body array as example, because that one has me stumped) are allocated and filled with data, on the host. I want to preserve these values but move them to the Kokkos array because I need them inside device kernels. However, because the host array is filled in the base constructor, the data is already in the host array by the time I would call memoryKK->create().

Ah yes, this is tricky. Normally I would just override the grow_arrays method, but a virtual call made in the base class constructor will not dispatch to the derived class override. I think this is poor design in FixRigid, so I would suggest moving this memory allocation from the constructor to an init function; then you can override the grow_arrays method like normal.

@Pakketeretet2 (Collaborator)

I think moving initialization to init() makes sense, both conceptually and in that it would solve this specific problem. However, I would not like to touch it at this point because I have not verified yet whether or not the other FixRigid variants depend on some of these flags being set in their own constructors, so for now I will just memcpy the host data to a temporary array and copy it back into the kokkos array.

@stanmoore1 (Contributor) commented Sep 24, 2019

For now you could just temporarily wrap the pointers into a host Kokkos view then copy up to the device, for example see https://github.com/stanmoore1/lammps/blob/fft/src/KOKKOS/fft3d_kokkos.cpp#L91

@stanmoore1 (Contributor) commented Sep 24, 2019

In other words you can use something like

HAT::t_int_1d h_body(body,atom->nlocal);
k_body.h_view = h_body;
k_body.modify<LMPHostType>();
k_body.sync<DeviceType>();

@Pakketeretet2 (Collaborator)

In other words you can use something like

HAT::t_int_1d h_body(body,atom->nlocal);
k_body.h_view = h_body;
k_body.modify<LMPHostType>();
k_body.sync<DeviceType>();

Right. I suppose this works because the host view "takes ownership" of the original array and the view is copied by value. That looks cleaner than what I have now; I will copy it in later.

@Pakketeretet2 (Collaborator) commented Oct 1, 2019

In other words you can use something like

HAT::t_int_1d h_body(body,atom->nlocal);
k_body.h_view = h_body;
k_body.modify<LMPHostType>();
k_body.sync<DeviceType>();

This does not seem to work. Even after a call to modify() and sync(), k_body.d_view.size() is reported to be 0, and a call to memoryKK->grow_kokkos() leads to the data being overwritten with zeros again.

Turns out I need to do this:

  HAT::t_int_1d h_body(body,nmax);
  k_body.modify<LMPHostType>();
  k_body.h_view = h_body;
  k_body.d_view = create_mirror_view(DeviceType(), k_body.h_view);
  k_body.sync<DeviceType>();

Is this because my k_body has not been initialized at this point yet? As in, I cannot do memoryKK->create_kokkos(k_body, body, ...) because that overwrites body, but create_kokkos does seem to "couple" the d_view and h_view together, which I might otherwise have to do explicitly?
Even Kokkos::deep_copy didn't seem to do anything.

@Pakketeretet2 (Collaborator)

So I semi-abandoned the porting of the functions and instead kept everything on the host but with KOKKOS arrays just to make sure the output is identical to the original code. That seems to be the case now for the rigid example, so I will now selectively port the functions using your comments on my PR.
I hope that does not generate too much noise.

@stanmoore1 (Contributor)

I updated #1680 to latest master. This is still on my to-do list.
