
JOSS review #28

Closed · 7 tasks done

mengaldo opened this issue Jul 2, 2024 · 8 comments
mengaldo commented Jul 2, 2024

@mendezVKI, @philipcardiff, @AndreWeiner

  • The motivation and statement of need could be stated more clearly (not only for fluid mechanics).

  • There are some typos here and there that should be addressed.

  • I cannot find any tests; these should be part of the package.

  • It would be great to have instructions on how to install and run the package in the GitHub repository as well, not only on the external documentation website.

  • Competing works are missing:

    1. Rogowski, Marcin, Brandon CY Yeung, Oliver T. Schmidt, Romit Maulik, Lisandro Dalcin, Matteo Parsani, and Gianmarco Mengaldo. "Unlocking massively parallel spectral proper orthogonal decompositions in the PySPOD package." Computer Physics Communications 302 (2024): 109246.
    2. Lario, A., Maulik, R., Schmidt, O.T., Rozza, G. and Mengaldo, G., 2022. Neural-network learning of SPOD latent dynamics. Journal of Computational Physics, 468, p.111475.
  • Is your library parallel? If not, that should be stated, along with how this may prevent running large-scale analyses on big data.

  • The size of the repository does not seem to be within GitHub guidelines. Were some unintended files added?

mendezVKI (Owner) commented

Dear @mengaldo, @philipcardiff, @AndreWeiner,

Thank you very much for your thorough review. We believe we have addressed most of the points, though we encountered some difficulties with the one concerning the size of the repository.

Here is the summary of the updates:

We added unit tests.

The paper now includes citations to similar packages and one sentence stating the potential applications outside fluid mechanics.

The README has been updated with instructions for installation and usage, as well as more context and references.

Concerning the parallel implementation: currently MODULO does not offer additional parallel computing capabilities beyond those naturally provided by NumPy. This is now explicitly stated in the README. Further parallel implementations are planned for future developments. Note, however, that the memory saving feature does allow processing quite large datasets effectively.

A partially open point is the one concerning the size of the repository.

The repository's large size is mainly due to old files committed some time ago. This issue is specific to Git, meaning that users accessing the package from the PyPI repository are not affected. More specifically, these files are inside the .git folder, and the only way to delete them, as far as we know, is to rewrite the history of the repository. This means that we would have to delete the repository and create a new one with a cleaner history.

However, we are afraid that the people who forked the repository (currently 24) would not be able to push back their developments. We would appreciate any suggestions on how to clean this folder in another way.

Nevertheless, to circumvent the problem, at this stage we encourage potential developers to shallow-clone the repository with the command:

```
git clone --depth 1 <repository_url>
```

This is written both in the contribution guidelines and in the README, and it allows downloading all the relevant code in less than 15 MB. We highlight that shallow-cloning the repository does not limit users in the collaboration (e.g. pull requests, issues, merges), as the commits involving large files refer to the preliminary version of the software and are no longer relevant.

mengaldo (Author) commented

"Concerning the parallel implementation: currently MODULO does not offer additional parallel computing capabilities beyond those naturally provided by NumPy. This is now explicitly stated in the Readme. Further parallel implementations are planned for future developments. Note, however, that the memory saving feature does allow to process quite large datasets effectively."

@mendezVKI: "Quite large datasets" is relative to the application.
If you do not have distributed capabilities, it is likely that you cannot run, say, 100 TB on a single node.
Am I misunderstanding something here?


mendezVKI commented Jul 24, 2024

Sure, you're right: size is very much application-dependent, and 100 TB sounds off-scale for us.
But the main issue in decomposing such a large case is memory requirements more than parallel capabilities.

With the memory saving, you could in principle process a dataset of 100 TB in MODULO, provided that you partition it into chunks small enough to fit in your RAM. On a good desktop with 128 GB of RAM, you would need about 800 partitions. That would of course take quite a lot of time, most of which would be spent in I/O operations to save intermediate results.
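For reference, a quick back-of-the-envelope check of that partition count, assuming binary units (the figures are illustrative only):

```python
# Back-of-the-envelope check of the partition count quoted above.
# Assumes binary units (TiB/GiB); figures are illustrative only.
dataset_bytes = 100 * 1024**4   # ~100 TB of data
ram_bytes = 128 * 1024**3       # ~128 GB of usable RAM per chunk

print(dataset_bytes // ram_bytes)   # -> 800 partitions
```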

The potential gains from parallel computing are still uncertain for us. The most computationally intensive operations are inner products and eigenvalue computations, which NumPy handles in a quasi-parallel manner. On Windows, these operations typically utilize around 70% of the available processors, whereas on Linux, eigenvalue computations often use 100% of the processors by default, with other operations (such as inner products and I/O) utilizing around 60%. However, for the scale you're discussing, Python might not be the most suitable tool.
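This multi-threading comes from the BLAS/LAPACK library that NumPy delegates dense linear algebra to. A minimal sketch of how one can inspect or cap that thread pool, using the third-party threadpoolctl package (not part of MODULO):

```python
# Inspect and cap the BLAS/LAPACK thread pool NumPy uses internally.
# Requires the third-party package: pip install threadpoolctl
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

for pool in threadpool_info():                 # show the active thread pools
    print(pool["internal_api"], "threads:", pool["num_threads"])

A = np.random.rand(2000, 2000)
with threadpool_limits(limits=4):              # cap BLAS at 4 threads
    w = np.linalg.eigvalsh(A @ A.T)            # multi-threaded, but capped
```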

To give you an order of magnitude of our most common applications in image velocimetry: in planar cases, databases on the order of a GB are very large. In 3D velocimetry, we might reach something on the order of dozens of GB. These can all run quite easily. The largest we have worked with are on the order of 50-100 GB and could be easily digested on my 32 GB RAM laptop thanks to the memory saving.

mengaldo (Author) commented

Are you suggesting that this is feasible?
"With the memory saving, you could in principle process a dataset of 100 TB in MODULO, provided that you partition it into chunks small enough to fit in your RAM. On a good desktop with 128 GB of RAM, you would need about 800 partitions. That would of course take quite a lot of time, most of which would be spent in I/O operations to save intermediate results."

I would recommend taking the geophysical example from this paper (~100 TB) and demonstrating that this is the case: https://arxiv.org/abs/2309.11808

Also, "which NumPy handles in a quasi-parallel manner". As far as I know, this is shared-memory parallelism. To go to distributed parallelism you would need to use an mpi package.

mendezVKI (Owner) commented

Indeed, there is quite a lot of work to do to improve the parallelism.

Anyway, yes: 100 TB could in principle be possible, depending on the ratio between the number of snapshots and the size of each snapshot.

You can see all the details on how to compute the memory you need with the memory saving in this video.

Your data has relatively small snapshots, but too many of them for the current implementation in MODULO.
I understand you have n_s = 38,414,880 and n_t = 727,584.

With the memory saving, you do not need to store all the n_s x n_t points; you only need to store the temporal correlation matrix K and the temporal basis PSI (which is much smaller). As long as n_t^2 is much smaller than n_s x n_t, the memory saving helps a lot and can enable digesting huge datasets. Unfortunately, this is not your case.

In your case, K is of size 727,584 x 727,584, which would require about 1.92 TB of memory. Unless you have a computer with that much RAM, we cannot do it with the current MODULO.

Roughly speaking, if you have 128 GB of RAM, you could deal with datasets having about n_t = 185,363 snapshots.
Considering datasets on the order of 100 TB means that you have snapshots of about n_s = 148,289,745, that is, about 0.55 GB per snapshot. You could use chunks of about n_t' = 230 to stay below 128 GB per chunk and proceed chunk by chunk. It would take a lot of time, but it is not impossible.
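To make the arithmetic reproducible, here is a small sketch of these estimates; it assumes single-precision storage (4 bytes per value) and binary units, which is what matches the figures above:

```python
# Reproduce the memory estimates above (assumes float32, binary units).
import math

BYTES = 4                        # single precision
n_t = 727_584                    # number of snapshots in the dataset
n_s = 148_289_745                # spatial points per snapshot (approx.)

K_bytes = n_t**2 * BYTES                         # temporal correlation matrix
print(f"K needs {K_bytes / 1024**4:.2f} TB")     # -> ~1.9 TB, as quoted above

ram = 128 * 1024**3                              # 128 GB of RAM
print(f"max n_t: {math.isqrt(ram // BYTES):,}")  # -> 185,363

print("snapshots per chunk:", ram // (n_s * BYTES))  # -> 231, i.e. ~230
```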

So the point is not 100 TB or less. It's about how large n_s and n_t are.

While I appreciate you sharing your work with us, dealing with such datasets is beyond the scope of our current release. We don't have the time or resources to pursue those developments at the moment, and they are not aligned with our current focus, especially since we don't work with data at that scale. We've provided all the necessary information, tutorials, and videos to help assess whether the current version of MODULO is suitable for a given problem. I'm confident that it remains valuable for addressing issues relevant to a large community.

mengaldo (Author) commented

Thank you for the detailed response.

My point is that you seem to suggest that very large datasets can be run easily.
This is usually not achievable or scalable without a distributed parallel implementation.

For instance, how long would it take using chunking, if it were feasible?

I believe you have a few intermediate steps on the I/O side that would make the runtime on large datasets like the one above very long?

In fact, I/O becomes one of the critical bottlenecks that is not easy to handle if you go to datasets on the order of several terabytes or more.

I believe the library is of interest to the community, but the above limitations may be better highlighted.

mendezVKI (Owner) commented

We removed the sentence "quite large datasets effectively" and added a section with a table to compute the memory requirement for a given dataset, with and without the memory saving feature.

These estimates are based only on memory requirements. How much time it would take to process a given dataset while facing the I/O bottleneck depends on the hardware being used, and we do not know how to estimate it in general. In any case, here we are talking about "it is possible" vs. "it is not possible", not about how fast it would be.

mengaldo (Author) commented

Great, thank you.
