
Performance of HYPRE_IJMatrixSetValues, HYPRE_IJMatrixAddToValues, HYPRE_IJMatrixAssemble #226

Open
PaulMullowney opened this issue Oct 26, 2020 · 13 comments


@PaulMullowney
Contributor

PaulMullowney commented Oct 26, 2020

Hi,
I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in
HYPRE_IJMatrixAssemble.

First, on a given rank I call
HYPRE_IJMatrixSetValues()
once on the owned part (rows) of the matrix. The input to this routine is perfectly row- and column-sorted with no repeated entries, i.e. I give it a CSR matrix.

Second, on that same rank, I call
HYPRE_IJMatrixAddToValues()
once on the part that needs to be accumulated into the owned rows of other ranks. The input to this routine is also perfectly row- and column-sorted with no repeated entries, i.e. I give it a CSR matrix.

Then, I call HYPRE_IJMatrixAssemble.

I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i of n ranks: would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.

  • HYPRE_IJMatrixAddToValues() on the values to add to process 0
    ...
  • HYPRE_IJMatrixAddToValues() on the values to add to process i-1
  • HYPRE_IJMatrixAddToValues() on the values to add to process i+1
    ...
  • HYPRE_IJMatrixAddToValues() on the values to add to process n-1

Here, each input would be row/column sorted with no repeats.
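For reference, the call sequence described above can be sketched as follows. This is a minimal sketch assuming a ParCSR object type; the array names and sizes are placeholders, and error checking is omitted:

```c
HYPRE_IJMatrix A;
HYPRE_IJMatrixCreate(MPI_COMM_WORLD, ilower, iupper, ilower, iupper, &A);
HYPRE_IJMatrixSetObjectType(A, HYPRE_PARCSR);
HYPRE_IJMatrixInitialize(A);

/* One call for the owned rows: row/column-sorted, CSR-like input. */
HYPRE_IJMatrixSetValues(A, nrows_owned, ncols_owned, rows_owned,
                        cols_owned, vals_owned);

/* One call for entries accumulated into rows owned by other ranks. */
HYPRE_IJMatrixAddToValues(A, nrows_offproc, ncols_offproc, rows_offproc,
                          cols_offproc, vals_offproc);

/* All communication and compression happens here. */
HYPRE_IJMatrixAssemble(A);
```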

Thanks,
-Paul

@PaulMullowney
Contributor Author

One last point: ultimately I want this on device, so is there a good place to add a cuSPARSE matrix-matrix add to complete the assembly process? Perhaps this could give some performance benefit.

@liruipeng
Contributor

liruipeng commented Dec 3, 2020

> I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that all of my time is being spent in HYPRE_IJMatrixAssemble. [...]

Hi @PaulMullowney, your observation that HYPRE_IJMatrixAssemble takes most of the time makes sense to me: HYPRE_IJMatrixSetValues/AddToValues just copy the input matrix entries into internal memory, on-process or off-process. All computation (compression) happens at the assemble stage, where off-process entries are first transferred via MPI and combined with on-process entries, and then all entries are compressed into CSR matrices. I don't think the splitting you mention is a good idea, since it would require many cudaMemcpy calls for no benefit. The details are described in this paper: https://github.com/hypre-space/hypre/wiki/pubs/porting-hypre-2020.pdf

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 14, 2020 via email

@ulrikeyang
Contributor

ulrikeyang commented Dec 14, 2020 via email

@liruipeng
Contributor

liruipeng commented Dec 14, 2020

Neither HYPRE_IJMatrixSetRowSizes nor HYPRE_IJMatrixSetDiagOffdSizes has any effect on GPU assembly; both are used by the CPU algorithm, as Ulrike explained. GPU assembly always uses an auxiliary matrix.

On the other hand, HYPRE_IJMatrixSetMaxOffProcElmts and hypre_IJMatrixSetMaxOnProcElmtsParCSR (I just realized we don't have a HYPRE_ interface for the latter) may have some effect, though not much: they try to pre-allocate adequate space in the auxiliary matrix. If neither is set, the algorithm starts with some initial allocation and reallocates by a growth factor whenever it runs out of space. Either way, this should not affect performance very much.

@PaulMullowney we currently have only one approach to assembling on the GPU, which tries to address the most general situation, so it may not be the most efficient for you, since it seems you pretty much have CSR matrices already. We can figure out whether we can take a shortcut, or even bypass the IJ interface and build the ParCSR matrix directly.

Thanks!

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 15, 2020 via email

@liruipeng
Contributor

> I'm gonna take a crack at this. I've been reading the assembly code in IJMatrix_parcsr_device.c and I have a pretty good idea of what's happening. I'm going to create a couple of new methods in that file and provide simplified versions for my use case. If successful, we can discuss what's needed to get into master.
>
> -Paul

Sounds good. Thank you @PaulMullowney

@jamtrott

> I'm gonna take a crack at this. I've been reading the assembly code in IJMatrix_parcsr_device.c [...] If successful, we can discuss what's needed to get into master.

I'm curious whether you had any luck with this approach. I'm in a very similar situation myself, where each MPI process already has a full CSR matrix in GPU device memory (stored either as a single CSR matrix or as two separate CSR matrices for the diagonal and off-diagonal parts). I want to provide the data to hypre, ideally without any unnecessary copying or processing.

Did you find a way to do this, @PaulMullowney ?

@PaulMullowney
Contributor Author

I have a working version of this code in a fork/branch; it's pretty much up to date with hypre master. I'd suggest reading my SC21 paper for details on the assembly. The key points are as follows.

  1. I have data in a COO matrix for all values that are owned by a particular rank (i.e. row_id is in the appropriate range for a given rank)
  2. I have data in a COO matrix for all values that need to be shared to other ranks (i.e. row_id is outside the range for a given rank)

The data from 2 is stacked on top of 1. There are a couple of subtle points and a few extra API calls. I'm happy to share if it helps.

-Paul

@jamtrott
Copy link

Thanks, Paul. I believe this must be the paper you are referring to: https://www.nrel.gov/docs/fy21osti/79645.pdf. I found it very interesting, and I would greatly appreciate anything you can share.

I suspect that my case may be simpler than yours, since I have arranged for the assembly to be a purely local operation on each MPI process. That is, there is no need to communicate between processes in my case.

I will first try calling HYPRE_IJMatrixSetValues with pointers to the CSR data in device memory, followed by HYPRE_IJMatrixAssemble. I'm assuming that HYPRE_IJMatrixSetValues will copy the data (as @liruipeng mentioned in a comment above), but I'm hoping that HYPRE_IJMatrixAssemble will be able to return almost immediately without any expensive processing. Ideally, I would like to avoid even the copy in HYPRE_IJMatrixSetValues. Do you know if it is possible to directly access device-side pointers to the underlying CSR data structure?

Perhaps I should open a separate issue, if needed, in case this is too off-topic.

@morse129

Hi, I'm wondering if there have been any developments on this, or if you could point me to the modified fork/branch and the details of the API calls.

The code I'm working on uses a numerical method similar to the one described in @PaulMullowney's SC21 paper, which I found quite insightful. I, too, found that the majority of the time is spent in HYPRE_IJMatrixAssemble when passing a pre-assembled CSR matrix to hypre, so I think a similar use of the modified HYPRE_IJMatrixAssemble would be valuable. Is there any plan to incorporate this into the master branch?

I would greatly appreciate any help you could provide with this.

Thanks,
Nick

@PaulMullowney
Contributor Author

PaulMullowney commented May 30, 2023

An implementation is at:

https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device

In the current implementation, you must allocate the COO buffers that are passed to Set/AddToValues as:
[values to set, values to add, extra room]
where:
values to set = on-process values
values to add = values sent to other processes
extra room = space needed to receive all potential values from other ranks

Usage is shown at the following lines:

https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L883
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L940
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1508
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1616

-Paul

@morse129

> An implementation is at: https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device [...]

Thank you very much @PaulMullowney
