
Performance of HYPRE_IJMatrixSetValues, HYPRE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble #226

Open
PaulMullowney opened this issue Oct 26, 2020 · 13 comments


@PaulMullowney
Contributor

PaulMullowney commented Oct 26, 2020

Hi,
I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in
HYPRE_IJMatrixAssemble.

First, on a given rank I call
HYPRE_IJMatrixSetValues()
once on the owned part (rows) of the matrix. The input to this routine is perfectly row- and column-sorted with no repeated entries, i.e. I give it a CSR matrix.

Second, on that same rank, I call
HYPRE_IJMatrixAddToValues()
once on the part that needs to be accumulated into the owned rows of other ranks. The input to this routine is also perfectly row- and column-sorted with no repeated entries, i.e. I give it a CSR matrix.

Then, I call HYPRE_IJMatrixAssemble.

I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i of n ranks: would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.

  • HYPRE_IJMatrixAddToValues() on the values to add to process 0
    ...
  • HYPRE_IJMatrixAddToValues() on the values to add to process i-1
  • HYPRE_IJMatrixAddToValues() on the values to add to process i+1
    ...
  • HYPRE_IJMatrixAddToValues() on the values to add to process n-1

Here, each input would be row/column sorted with no repeats.
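For reference, the call sequence described above can be sketched as follows. This is a minimal sketch assuming a ParCSR object type; the array names and sizes are placeholders, and error checking is omitted:

```c
HYPRE_IJMatrix A;
HYPRE_IJMatrixCreate(MPI_COMM_WORLD, ilower, iupper, ilower, iupper, &A);
HYPRE_IJMatrixSetObjectType(A, HYPRE_PARCSR);
HYPRE_IJMatrixInitialize(A);

/* One call for the owned rows: row/column-sorted, CSR-like input. */
HYPRE_IJMatrixSetValues(A, nrows_owned, ncols_owned, rows_owned,
                        cols_owned, vals_owned);

/* One call for entries accumulated into rows owned by other ranks. */
HYPRE_IJMatrixAddToValues(A, nrows_offproc, ncols_offproc, rows_offproc,
                          cols_offproc, vals_offproc);

/* All communication and compression happens here. */
HYPRE_IJMatrixAssemble(A);
```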

Thanks,
-Paul

@PaulMullowney
Contributor Author

One last point: ultimately I want this on device, so is there a good place to add a cuSPARSE matrix-matrix add to complete the assembly process? Perhaps this could give some performance benefit.

@liruipeng
Contributor

liruipeng commented Dec 3, 2020

> I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in HYPRE_IJMatrixAssemble. [...]

Hi @PaulMullowney, your observation that HYPRE_IJMatrixAssemble takes most of the time makes sense to me: HYPRE_IJMatrixSetValues/AddToValues just copy the input matrix entries into internal memory, on-process or off-process. All computation (compression) happens at the assemble stage, where off-process entries are first transferred via MPI and combined with on-process entries, and then all entries are compressed into CSR matrices. I don't think the splitting you mention is a good idea, since it would require many cudaMemcpy calls for no benefit. The details are described in this paper: https://github.com/hypre-space/hypre/wiki/pubs/porting-hypre-2020.pdf

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 14, 2020 via email

@ulrikeyang
Contributor

ulrikeyang commented Dec 14, 2020 via email

@liruipeng
Contributor

liruipeng commented Dec 14, 2020

Neither HYPRE_IJMatrixSetRowSizes nor HYPRE_IJMatrixSetDiagOffdSizes has any effect on GPU assembly; both are used by the CPU algorithm, as Ulrike explained. GPU assembly always uses an auxiliary matrix.

On the other hand, HYPRE_IJMatrixSetMaxOffProcElmts and hypre_IJMatrixSetMaxOnProcElmtsParCSR (I just realized we don't have a HYPRE_ interface for the latter) may have some effect, though not much: they try to pre-allocate adequate space in the auxiliary matrix. If neither is set, the algorithm starts with some initial allocation and reallocates by a growth factor whenever it runs out of space. Either way, this should not affect performance very much.

@PaulMullowney we currently have only one approach to assembling on the GPU, which tries to address the most general situation, so it may not be the most efficient for you, since it seems you pretty much have CSR matrices already. We can figure out whether we can take a shortcut, or even bypass the IJ interface and build the ParCSR matrix directly.

Thanks!

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 15, 2020 via email

@liruipeng
Contributor

> I'm gonna take a crack at this. I've been reading the assembly code in IJMatrix_parcsr_device.c and I have a pretty good idea of what's happening. I'm going to create a couple of new methods in that file and provide simplified versions for my use case. If successful, we can discuss what's needed to get into master.
>
> -Paul

Sounds good. Thank you @PaulMullowney

@jamtrott

> I'm gonna take a crack at this. I've been reading the assembly code in IJMatrix_parcsr_device.c [...] If successful, we can discuss what's needed to get into master.

I'm curious whether you had any luck with this approach. I'm in a very similar situation myself, where each MPI process already has a full CSR matrix in GPU device memory (stored either as a single CSR matrix or as two separate CSR matrices for the diagonal and off-diagonal parts). I want to provide the data to hypre, ideally without any unnecessary copying or processing.

Did you find a way to do this, @PaulMullowney ?

@PaulMullowney
Contributor Author

I have a working version of this code in a fork/branch; it's pretty much up to date with hypre master. I'd suggest reading my SC21 paper for details on the assembly. The key points are as follows.

  1. I have data in a COO matrix for all values that are owned by a particular rank (i.e. row_id is in the appropriate range for a given rank)
  2. I have data in a COO matrix for all values that need to be shared to other ranks (i.e. row_id is outside the range for a given rank)

The data from 2 is stacked on top of 1. There are a couple of subtle points and a few extra API calls. I'm happy to share if it helps.

-Paul

@jamtrott
Copy link

Thanks, Paul. I believe this must be the paper you are referring to: https://www.nrel.gov/docs/fy21osti/79645.pdf. I found it very interesting, and I would greatly appreciate anything you can share.

I suspect that my case may be simpler than yours, since I have arranged for the assembly to be a purely local operation on each MPI process. That is, there is no need to communicate between processes in my case.

I will first try calling HYPRE_IJMatrixSetValues with pointers to the CSR data in device memory, followed by HYPRE_IJMatrixAssemble. I'm assuming that HYPRE_IJMatrixSetValues will copy the data (as @liruipeng mentioned in a comment above), but I'm hoping that HYPRE_IJMatrixAssemble will be able to return almost immediately without any expensive processing. Ideally, I would like to avoid even the copy in HYPRE_IJMatrixSetValues. Do you know if it is possible to directly access device-side pointers to the underlying CSR data structure?

Perhaps I should open a separate issue, if needed, in case this is too off-topic.

@morse129

Hi, I'm wondering if there have been any developments on this, or if you could point me to the modified fork/branch and the details of the API calls.

The code I'm working on uses a numerical method similar to the one described in @PaulMullowney's SC21 paper, which I found quite insightful. I, too, found that the majority of the time is spent in HYPRE_IJMatrixAssemble when passing a pre-assembled CSR matrix to hypre, so I think a similar use of the modified HYPRE_IJMatrixAssemble would be valuable. Is there any plan to incorporate this into the master branch?

I would greatly appreciate any help you could provide with this.

Thanks,
Nick

@PaulMullowney
Contributor Author

PaulMullowney commented May 30, 2023

An implementation is at:

https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device

In the current implementation, you must allocate the COO buffers that are passed to Set/AddToValues as:
[values to set, values to add, extra room]
where:
values to set = on-process values
values to add = values sent to other processes
extra room = space needed to receive all potential values from other ranks

Usage is shown at the following lines:

https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L883
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L940
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1508
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1616

-Paul

@morse129

> An implementation is at: https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device [...]

Thank you very much @PaulMullowney
