-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance of HYPRE_IJMatrixSetValues, HYRPE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble #226
Comments
One last point. Ultimately I want this on device, so is there a good place to add a cusparse Matrix-Matrix add to complete the assembly process? Perhaps this could give some performance benefit. |
Hi, @PaulMullowney , your observation that |
Do you expect either of HYPRE_IJMatrixSetRowSizes or HYPRE_IJMatrixSegDiagOffdSizes to make a difference for GPU assembly?
-Paul
From: Ruipeng Li <notifications@github.com>
Reply-To: hypre-space/hypre <reply@reply.github.com>
Date: Wednesday, December 2, 2020 at 11:13 PM
To: hypre-space/hypre <hypre@noreply.github.com>
Cc: "Mullowney, Paul" <Paul.Mullowney@nrel.gov>, Mention <mention@noreply.github.com>
Subject: Re: [hypre-space/hypre] Performance of HYPRE_IJMatrixSetValues, HYRPE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble (#226)
Hi,
I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in
HYPRE_IJMatrixAssemble.
First, on a given rank I call
HYPRE_IJMatrixSetValues()
once on the owned part (rows) of the matrix. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Second, on that same rank, I call
HYPRE_IJMatrixAddToValues()
once on the part that needs to be accumulated into the owned rows of other ranks. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Then, I call HYPRE_IJMatrixAssemble.
I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i for n ranks, would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.
* HYPRE_IJMatrixAddToValues() on the values to add to process 0
...
* HYPRE_IJMatrixAddToValues() on the values to add to process i-1
* HYPRE_IJMatrixAddToValues() on the values to add to process i+1
...
* HYPRE_IJMatrixAddToValues() on the values to add to process n
Here each, the input would be row/column sorted with no repeats.
Thanks,
-Paul
Hi, @PaulMullowney<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPaulMullowney&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018919245%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=XKTtbRIoh5VMB3cEOpHbq74quk679QWB6XkJIOmikPQ%3D&reserved=0> , your observation that HYPRE_IJMatrixAssemble takes most of the time makes sense to me, since HYPRE_IJMatrixSet/AddToValues just copy the input matrix entries to the internal memory, on-process or off-process. All computations (compression) are done at the assemble stage, where off-process entries are first transferred via MPI and combined with on-process entries, and then all the entries are compressed to make CSR matrices. I don't think the splitting you mentioned is a good idea, since it would require a lot of cudaMemcpy without any benefit of doing it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fhypre-space%2Fhypre%2Fissues%2F226%23issuecomment-737690225&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018919245%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ybGQkZSQh0mw%2FtgCjxdzyJyK4YqwaQJmyt8RKLd7zPM%3D&reserved=0>, or unsubscribe<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAONG4MTSV5H2ML7LFOWVI6DSS4T7ZANCNFSM4S7NE7UA&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018929206%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=rAOjXC1AurAbWaRDurs5TgO4lyFOB1Io2XmwV%2FTZZ4o%3D&reserved=0>.
|
On the CPU the idea for HYPRE_IJMatrixSetDiagOffdSizes was to set or add to values directly into a parcsr matrix data structure and avoid using an auxiliary data structure.
I leave it up to Ruipeng to respond to the GPU question.
Ulrike
From: PaulMullowney <notifications@github.com>
Sent: Monday, December 14, 2020 8:34 AM
To: hypre-space/hypre <hypre@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Subject: Re: [hypre-space/hypre] Performance of HYPRE_IJMatrixSetValues, HYRPE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble (#226)
Do you expect either of HYPRE_IJMatrixSetRowSizes or HYPRE_IJMatrixSegDiagOffdSizes to make a difference for GPU assembly?
-Paul
From: Ruipeng Li <notifications@github.com<mailto:notifications@github.com>>
Reply-To: hypre-space/hypre <reply@reply.github.com<mailto:reply@reply.github.com>>
Date: Wednesday, December 2, 2020 at 11:13 PM
To: hypre-space/hypre <hypre@noreply.github.com<mailto:hypre@noreply.github.com>>
Cc: "Mullowney, Paul" <Paul.Mullowney@nrel.gov<mailto:Paul.Mullowney@nrel.gov>>, Mention <mention@noreply.github.com<mailto:mention@noreply.github.com>>
Subject: Re: [hypre-space/hypre] Performance of HYPRE_IJMatrixSetValues, HYRPE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble (#226)
Hi,
I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in
HYPRE_IJMatrixAssemble.
First, on a given rank I call
HYPRE_IJMatrixSetValues()
once on the owned part (rows) of the matrix. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Second, on that same rank, I call
HYPRE_IJMatrixAddToValues()
once on the part that needs to be accumulated into the owned rows of other ranks. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Then, I call HYPRE_IJMatrixAssemble.
I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i for n ranks, would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.
* HYPRE_IJMatrixAddToValues() on the values to add to process 0
...
* HYPRE_IJMatrixAddToValues() on the values to add to process i-1
* HYPRE_IJMatrixAddToValues() on the values to add to process i+1
...
* HYPRE_IJMatrixAddToValues() on the values to add to process n
Here each, the input would be row/column sorted with no repeats.
Thanks,
-Paul
Hi, @PaulMullowney<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPaulMullowney&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018919245%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=XKTtbRIoh5VMB3cEOpHbq74quk679QWB6XkJIOmikPQ%3D&reserved=0> , your observation that HYPRE_IJMatrixAssemble takes most of the time makes sense to me, since HYPRE_IJMatrixSet/AddToValues just copy the input matrix entries to the internal memory, on-process or off-process. All computations (compression) are done at the assemble stage, where off-process entries are first transferred via MPI and combined with on-process entries, and then all the entries are compressed to make CSR matrices. I don't think the splitting you mentioned is a good idea, since it would require a lot of cudaMemcpy without any benefit of doing it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fhypre-space%2Fhypre%2Fissues%2F226%23issuecomment-737690225&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018919245%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ybGQkZSQh0mw%2FtgCjxdzyJyK4YqwaQJmyt8RKLd7zPM%3D&reserved=0>, or unsubscribe<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAONG4MTSV5H2ML7LFOWVI6DSS4T7ZANCNFSM4S7NE7UA&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C8cee9512435e47560c7608d89752867a%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637425728018929206%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=rAOjXC1AurAbWaRDurs5TgO4lyFOB1Io2XmwV%2FTZZ4o%3D&reserved=0>.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#226 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AD4NLLIUF5WLXANUDDUYOMTSUY5AXANCNFSM4S7NE7UA>.
|
Basically, neither of On the other hand, @PaulMullowney we currently only have one approach of assembling on GPU, which tries to address the most generic situations. So, it may not be the most efficient for you, since it seems that you pretty much have CSR matrices already. We can figure out if we can take some shortcut or even bypass the IJ interface and get ParCSR directly. Thanks! |
I’m gonna take a crack at this. I’ve been reading the assembly code in IJMatrix_parcsr_device.c and I have a pretty good idea of what’s happening. I’m going to create a couple of new methods in that file, IJMatrix_parcsr_device.c, and provide simplified version for my use case.
If successful, we can discuss what’s needed to get into master.
…-Paul
From: Ruipeng Li <notifications@github.com>
Reply-To: hypre-space/hypre <reply@reply.github.com>
Date: Monday, December 14, 2020 at 11:09 AM
To: hypre-space/hypre <hypre@noreply.github.com>
Cc: "Mullowney, Paul" <Paul.Mullowney@nrel.gov>, Mention <mention@noreply.github.com>
Subject: Re: [hypre-space/hypre] Performance of HYPRE_IJMatrixSetValues, HYRPE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble (#226)
Neither of HYPRE_IJMatrixSetRowSizes or HYPRE_IJMatrixSetDiagOffdSizes basically has no effect on GPU assembly, which are used for the CPU algorithm as Ulrike explained. GPU assembly always use auxiliary matrix.
On the other hand, HYPRE_IJMatrixSetMaxOffProcElmts and hypre_IJMatrixSetMaxOnProcElmtsParCSR (just realized we don't have a HYPRE_ interface for it) may have some effect, not very much as expected, which try to pre-allocate adequate space in the auxiliary matrix. If either is not set, the algorithm starts with some initial allocation and reallocate if being out-of-space by a growing factor. But as you can see, it should not affect the performance too much.
@PaulMullowney<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPaulMullowney&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C1fee777e9e0840110f4208d8a05b6b84%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637435661807159878%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=K8dcUqEtKmYRtl%2BSdEf5Z795xJCpyp%2FpSWOe98Bka2Y%3D&reserved=0> we currently only have one approach of assembling on GPU, which tries to address the most generic situations. So, it may not be the most efficient for you, since it seems that you pretty much have CSR matrices already. We can figure out if we can take some shortcut or even bypass the IJ interface and get ParCSR directly.
Thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fhypre-space%2Fhypre%2Fissues%2F226%23issuecomment-744615314&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C1fee777e9e0840110f4208d8a05b6b84%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637435661807159878%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=xuoa0Vgz9P0CicGWlInuxj9jCFtRRs6ToNkcZ6cK3WA%3D&reserved=0>, or unsubscribe<https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAONG4MUEZHEEQ3BF3FLKHJLSUZIGDANCNFSM4S7NE7UA&data=04%7C01%7CPaul.Mullowney%40nrel.gov%7C1fee777e9e0840110f4208d8a05b6b84%7Ca0f29d7e28cd4f5484427885aee7c080%7C0%7C0%7C637435661807169833%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2c%2F9RRxqMawYxRSk2fZ0AbNGKoIrmeatx5xiErgJigs%3D&reserved=0>.
|
Sounds good. Thank you @PaulMullowney |
I'm curious if you had any luck with this approach. I'm in a very similar situation myself, where each MPI process already has a full CSR matrix in GPU device memory (stored either as a single CSR matrix or as two separate CSR matrices for the diagonal and offdiagonal parts). I want to provide the data to hypre, hopefully without having to do any unnecessary copying or processing. Did you find a way to do this, @PaulMullowney ? |
I have a working version of this code in a fork/branch. It's pretty much up to date with Hypre master. I would read my SC 21 paper to get details on the assembly. Key points are as follows.
The data from 2 is stacked on top of 1. There's a couple of subtle points and a few extra API calls. I'm happy to share if it helps. -Paul |
Thanks, Paul. I believe this must be the paper you are referring to: https://www.nrel.gov/docs/fy21osti/79645.pdf. I found it very interesting, and I would greatly appreciate anything you can share. I suspect that my case may be simpler than yours, since I have arranged for the assembly to be a purely local operation on each MPI process. That is, there is no need to communicate between processes in my case. Anyway, I will first try by calling Anyway, perhaps I should open a separate issue, if needed, in case this is too off-topic. |
Hi, I'm wondering if there have been any developments on this, or if you could point me to the modified fork/branch and the details of the API calls. The code I'm working on uses a similar numerical method to that described in @PaulMullowney's SC 21 paper, which I found quite insightful. I, too, found that the majority of time is being spent in I would greatly appreciate any help you could provide with this. Thanks, |
An implementation is at: https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device In the current implementation, you must allocation the COO buffers that are passed in to Set/AddToValues as: Usage of this is shown in in the following lines below. https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L883 -Paul |
Thank you very much @PaulMullowney |
Hi,
I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in
HYPRE_IJMatrixAssemble.
First, on a given rank I call
HYPRE_IJMatrixSetValues()
once on the owned part (rows) of the matrix. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Second, on that same rank, I call
HYPRE_IJMatrixAddToValues()
once on the part that needs to be accumulated into the owned rows of other ranks. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.
Then, I call HYPRE_IJMatrixAssemble.
I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i for n ranks, would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.
...
...
Here each, the input would be row/column sorted with no repeats.
Thanks,
-Paul
The text was updated successfully, but these errors were encountered: