section on computational aspects #4
I do think we should have something about computational aspects, especially showing speedup with GPUs. I have Fortran code that applies a Gaussian filter to 0.1 degree POP data. The problem size is small enough that it doesn't really scale well with increasing number of processors, and I worked hard to make it as fast as it is. If we could compare that code to the GPU/Laplacian version, I think it might help underscore that this new method is actually much faster, not just a fun mathematical exercise. Nora is currently working on applying the filter to MITgcm LLC4320 data, so if anyone has code to apply a convolution filter to that kind of data we could use it as a comparison instead of my code for POP data.
LLC4320 and similar simulations are a really exciting application for this tool. In many ways, the computational challenges of working with LLC4320 when it first came out motivated the pivot in my career to become obsessed with software and tools. We definitely plan to use this tool on LLC4320, so it's good to know you are looking into that as well. As I noted in #1 (comment), there are some challenges with the LLC4320 data because of the multi-faceted grid. In order to make this diffusion approach work, we would need to implement exchanges between faces. One possible workaround would be to transform the "LL" part of the grid into a pseudo lat-lon grid (essentially neglecting the Arctic), using this function from xmitgcm. However, that would contaminate everything within filter-distance from the northern boundary.
Bottom line: I think LLC introduces sufficient additional complexities--beyond the other use cases we have considered--that it is probably out of scope for this current paper. However, longer term, there are very exciting opportunities with that type of model.
Due to the difficulties with LLC grids that Ryan mentions above, I am now using the POP data to illustrate the filter aspects mentioned in #1. @rabernat, do you have a Laplacian/kernel that communicates across the northern boundary of the POP tripolar grid (and "sews up" the grid along the line between the two northern grid poles)? The Laplacian I am currently using only works away from the Arctic.
Unfortunately no! All the POP code I have, I shared here: ocean-eddy-cpt/gcm-filters#14. It would be great to figure this out. You could try poking around the pop-tools repo and see if anyone has some code they can share for how to handle the tripolar grid boundary condition. Then we can insert it into the laplacian.
Here's an idea for a way to deal with the POP tripole grid. First, just treat the top boundary as a no-flux land boundary to get a fast roll through the full data. Then go back and correct the last column of data by adding the flux divergence across the tripole seam. The indexing is as follows: the point immediately across the seam from (i, end) is (3601-i, end).
Alternatively, you could append an extra column to the data that is equal to the last column with a reversed index, apply the standard Laplacian, and then delete the appended column.
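In numpy, that ghost-column version would look roughly like this (a sketch only: it assumes the seam lies along the last j index, leaves the other boundaries to whatever Laplacian you pass in, and ignores the wet mask):

```python
import numpy as np

def laplacian_with_seam(field, laplacian):
    """Ghost-column trick for the tripole seam (illustrative).

    field: 2-D array of shape (nx, ny), with the seam at j = ny - 1.
    The neighbor across the seam of point (i, ny-1) is (nx-1-i, ny-1),
    i.e. the seam column with the i index reversed.
    """
    ghost = field[::-1, -1:]                 # seam column, reversed in i
    padded = np.concatenate([field, ghost], axis=1)
    return laplacian(padded)[:, :-1]         # drop the ghost column again
```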
Just FYI, this is something that we discussed over at xgcm as well. I would be very curious if there are more general ways to do this that we could potentially port over there.
I'm imagining the computational section will basically be about timing, but correct me if I'm wrong. I have a Fortran fixed-scale and fixed-factor smoother for SST from the 0.1 degree POP data; it would be nice to compare it to the CPU and GPU versions of the Laplacian code. I will post my code to the repo and my timings here.
The two POP codes are on the repo. I ran this at CU on a clean node (no other processes) with 24 cores (the max on that node); the code was compiled with gfortran 6.1.0. For fixed factor, the factor is 10 and the kernel width is 19. For fixed scale, the length scale is 1 degree latitude (i.e. 111.7 km). For the fixed-scale filter I don't truncate the filter kernel: I use all points, even when the weights are infinitesimal. This is a sub-optimal implementation, but it makes it a lot easier to deal with the tripole grid. In both cases I use a wet mask, i.e. I ignore points on land instead of treating them as 0. I ran each version 10 times; here are the results using 24 cores (seconds): For fixed scale we have a speedup of about 13x when moving from 1 to 24 cores; for fixed factor the speedup is much more modest: 2.22x. I also ran fixed-factor over all 62 layers:
I implemented the fixed factor filter for the POP tripolar grid (see this PR), so we can get the timing comparison with the Fortran code started. I filtered the same data set as Ian (we both do only 1 time slice), but all 62 depth levels (whereas Ian does only 1 depth level). The results are documented in this notebook. To filter 62 depth levels, the timing was:
Assuming sequential composition, Ian's code would take approximately 24 s for 62 depth levels, so gcm-filters on a GPU beats the Fortran code (but is much slower on a CPU). I still have to implement the fixed scale filter. @rabernat, do you have any other ideas on what could go into the paper's section on computational aspects? Would you want to experiment with parallelizing across multiple GPUs? I have tried this a little bit, but only saw a slow-down (though that was a while ago). The GPU results above are using one NVIDIA Tesla V100 GPU.
I left some comments on this in ocean-eddy-cpt/gcm-filters#26 (comment), but I'll repeat some of them here. It's great that the GPU version is so fast. But I'm also a bit concerned by how slow the CPU version is. In comparing with the Fortran code, I think we should do a lower-level comparison. It's not quite fair to pit the low-level Fortran code against the xarray / dask code, since they do different things. The xarray / dask code is doing a lot of extra stuff that may influence the timing significantly: for example, it has to align the coordinates of all the input arrays, parse the dask graph, assemble metadata for the outputs, etc. A more direct comparison would be to call the low-level routine, which is just numpy, and get xarray and dask out of the way. This is not directly exposed in the package, but you can get at it with code like this:

```python
from gcm_filters.filter import _create_filter_func

filter_func = _create_filter_func(filter.filter_spec, filter.Laplacian)  # filter is a Filter object you already created

# load everything into memory as numpy arrays, bypass xarray
field_raw = TEMP.isel(k=0).values  # only do one level
wet_mask_raw = wet_mask.values
%time filter_func(field_raw, wet_mask_raw)
```

Even that is not really fair because, as we noted in ocean-eddy-cpt/gcm-filters#22 (comment), the initialization of the laplacian is still happening inside the filter function. This may not actually be any faster. But at least it gets down closer to the core of the problem and shows us where we could potentially be optimizing.
An even closer-to-the-metal comparison would be the laplacian itself, which you can time like this:

```python
laplacian = filter.Laplacian(wet_mask=wet_mask.data)
%time _ = laplacian(da_masked.data)
```

I'd be interested in getting the timing of the laplacian itself for the different implementations. Assuming this is the most expensive part, the total time for filtering should scale roughly like `n_steps` times the time for a single laplacian application.
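For a rough version of that estimate (a sketch using `laplacian` and `field_raw` from the snippets above; `n_steps` is a placeholder for however many laplacian applications the filter spec calls for):

```python
import timeit

n_steps = 16  # placeholder: number of laplacian applications in the filter spec

# average wall time of a single laplacian application
t_lap = timeit.timeit(lambda: laplacian(field_raw), number=10) / 10

# if the laplacian dominates, the full filter should take roughly this long
t_filter_estimate = n_steps * t_lap
```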
@rabernat, thanks for your feedback. I'm replying to your benchmarking-related questions in ocean-eddy-cpt/gcm-filters#26 (comment) in this issue here, to keep the discussion on computational aspects in one place. I apologize in advance if my answers miss the point of your questions, because I am quite new to the whole benchmarking business.
The numbers do include the time to load the data from disk into memory (both for CPU and GPU timings). I expect this to be a small fraction, because we are dealing with a quite small data array here (2.14 GB).
Not yet. But I will, and will also execute the lower-level tests that you suggested above. @iangrooms, eventually we should also do the Fortran vs. gcm-filters timing comparisons on a larger data array, say at least 100 time slices.
My $0.02 is that I think we want to focus on comparing the low-level in-memory performance on a relatively small array (like one vertical level). If we are going to compare parallelized implementations (e.g. MPI vs. Dask), then we are also comparing those qualitatively different frameworks, and the scope of this becomes very broad (e.g. we have to think about strong vs. weak scaling, etc.).
I agree that we should focus on in-memory comparisons, not multiple time slices. It's just so hard to get a 'fair' comparison for 100 time slices: I could write MPI code that just does all 100 slices in parallel and takes exactly the same amount of time as one slice.
My numbers do not include time loading the data or writing the result back to disk, so accounting for that will probably bring the CPU Python code a lot closer to the Fortran code. I can easily do all 62 layers if we want, but it seems easier and simpler to just do the top layer. This gets back to the fair-comparison idea: I could just do all 62 layers in parallel using tons of processors... My current implementation uses 24 CPUs for a single layer.
To be clear, I think it is very plausible that the Fortran code is 10x faster than the numpy CPU code, the way we have written it. I just want to boil it down as much as possible. Since numpy itself just wraps low-level C and Fortran, it should hypothetically be possible to get the numpy version closer to the Fortran one. This blog post talks about this in some detail. However, I don't think optimizing the numpy code should be a priority right now. We should just report the results of our "naive" implementation.
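For a flavor of where the overhead comes from, consider a generic 5-point Laplacian (illustrative only, not necessarily how gcm-filters writes its laplacians):

```python
import numpy as np

def laplacian_naive(f):
    # four np.roll calls -> four full-size temporary arrays per application;
    # an equivalent Fortran loop does the same stencil update with no allocations
    return (np.roll(f, 1, axis=0) + np.roll(f, -1, axis=0)
            + np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1) - 4.0 * f)
```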
@iangrooms, which filter specs did you use to get the times you reported above? You may have tested for fewer `n_steps` than I did, which would make the timings hard to compare.
😲 This is super important information that I had somehow missed. The numpy CPU version is 100% serial and uses only 1 CPU. The numpy version seems 10x slower, but the Fortran could hypothetically be getting a 24x speedup. So if you correct for the parallelism, the numpy is actually faster?

My impression is that parallelizing over the "filter dimensions" (i.e. splitting into lat-lon tiles) is not the most efficient strategy, because the algorithm requires lots of inter-tile communication. This would be reflected in less-than-ideal parallel scaling as you vary the number of MPI ranks. Specifically, how much faster is the 24-CPU version than the 1-CPU version? (I doubt 24x, but I could be wrong...) However, the GPU is essentially using parallel processing internally and doing something similar to the Fortran MPI code, so it's still an interesting comparison. In contrast, parallelizing over the non-filter dimensions (e.g. depth, time, etc.) is embarrassingly parallel (no communication required between processes) and should scale much better.

If we wanted to really nerd out on this, I would suggest the following experiments:
Ultimately we are interested in two things:
Good questions. I added the answers to my original comment, to keep all the info in one place. Short answer here:
The Fortran code doesn't scale perfectly, but I didn't do a careful analysis. I don't use tiles; instead, I found that the most efficient implementation I tried was to filter first over longitude and then over latitude. Within each loop the iterations are embarrassingly parallel. In between the loops I re-pack the data in memory so that it is contiguous for each loop (to avoid cache thrashing). I will go back and get timings for the serial Fortran code, to compare to serial Python. If we can't parallelize a single layer in Python, then I guess I could update my code to run over all 62 layers and then compare Fortran vs Python using the same number of cores but different parallelization strategies. I won't do this if we can parallelize a single-layer computation in Python, so let me know. PS: I'm not using MPI; I'm using OpenMP. It's a lot easier to code, though it's limited to a single node.
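In Python the two-pass strategy looks roughly like this (a sketch only: the actual code is OpenMP Fortran, and this ignores the land mask and the tripole seam):

```python
import numpy as np

def separable_smooth(field, weights):
    """Two-pass separable filter: longitude first, then latitude.

    field: 2-D array of shape (nlat, nlon), C-ordered so longitude
    is the contiguous axis; weights: 1-D filter kernel.
    """
    # pass 1: 1-D convolution along longitude (the contiguous axis)
    tmp = np.apply_along_axis(np.convolve, 1, field, weights, mode="same")
    # re-pack memory so latitude becomes the contiguous axis for pass 2
    tmp = np.ascontiguousarray(tmp.T)
    # pass 2: 1-D convolution along latitude
    out = np.apply_along_axis(np.convolve, 1, tmp, weights, mode="same")
    return out.T
```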
@iangrooms, what would be an appropriate choice for `n_steps`?
Good question, and sorry for the delay. I think using the default `n_steps` is fine.
Here are the results of my timing tests, which should cover several of the experiments from Ryan's list above. The tests are documented in this notebook, which I submitted as part of PR ocean-eddy-cpt/gcm-filters#29. I ran the tests on Ian's POP test data set, on a CPU on casper. @iangrooms, you should be able to rerun this notebook on the CU machines to get the execution times there; let me know if you have issues. I will report the GPU times in a separate table.

Summary of timing tests:
Maybe ignore the "serial numpy + dask for parallelism (62 depth levels)" timings for now - I haven't experimented with the number of data chunks and the number of dask workers yet.
I ran @NoraLoose's jupyter notebook on the same node (24 cores, no other processes) as my tests above. Times for the full-complexity solves (using the tripole correctly) for SST are:
@iangrooms, to take full advantage of parallel computing with dask on your 24 cores, we will have to set up scheduler and worker processes via Dask.distributed. I think I would try to do this via Dask-Jobqueue, i.e., running something like this in the jupyter notebook:
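(A sketch, assuming a PBS-style scheduler like casper's; the resource numbers are placeholders to adjust per cluster:)

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# placeholder resources; adjust queue/memory/walltime for the cluster
cluster = PBSCluster(cores=24, processes=24, memory="100GB",
                     walltime="01:00:00")
cluster.scale(jobs=1)  # launch one batch job's worth of workers
client = Client(cluster)
```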
and maybe chunk the data into more chunks (e.g. one chunk per depth level, something like `TEMP.chunk({'k': 1})`) rather than just 2 chunks. Getting the dask scheduler and worker processes set up via Dask-Jobqueue may take me some time (I don't have much experience with this yet), and it will also depend on the cluster; if I set this up for casper, it won't translate one-to-one to the CU cluster. I would therefore try to get this set up directly for the CU cluster, if that is where we want to do all the timing tests. On the other hand, @arthurBarthe & Co will do their tests on a different cluster from CU, so maybe it is okay to do this on casper after all? Maybe this is a good time to think more carefully about what tests exactly we want to include in the paper, and how to design them. Sorry, Ian, that I handed over the notebook to you before bringing this up; I hadn't fully appreciated that you would run it on all 24 cores.
I neglected to mention that I used a chunk size of 3, which resulted in 21 chunks & 22 processes; that was as close as I could get to 24. I also noticed that the "serial" numpy was using about 300% CPU, so basically 3 cores.

I don't think we need to run all tests on the same cluster; we just need to be sure to only compare timings on the same cluster. So gcm-filters-CPU vs Fortran CPU on my cluster at CU is a fair comparison, and gcm-filters-CPU vs the NYU convolution code on an NYU cluster (or NCAR) is a fair comparison. If we get an NYU GPU convolution code then we should compare it to gcm-filters-GPU on the same GPU. If we want to compare my Fortran code to the NYU convolution code then we should do it on the same cluster, but I don't know if that's necessary. In any case, I figured out how to fix my parallel Fortran problem at NCAR, so I can always run my Fortran code there if we want everything on the same platform.
Not quite true. Distributed is only strictly necessary if we want to use multiple nodes. The default threaded scheduler uses multithreading, similar to OpenMP.
I expect the "computational efficiency" section to be fairly short. I've already got timings of Nora's POP tripole code and my Fortran code, so here's what I think we need to wrap up:
The point of these tests is just to show that the method we're proposing is not worse than other methods on the CPU, and blazing-fast on the GPU. I just want to give the reader reassurance that this method is practical, and not just another cool idea that ends up being practically unusable because of the cost.
I wonder why it took about 2-3 times longer for you (on your 24 cores) than for me (serial numpy) to run the gcm-filters test.
My node at CU is getting old. I bought it 6 years ago and it wasn't the latest hardware at the time. The CPU and RAM speed on casper are probably faster, but I think what matters here is that the Fortran and Python codes are both running on the same hardware. We can list hardware specs in the timing section of the paper so people who are interested can see how old the CPU is.
Ok, follow-up question: Did you subtract the time it takes for dask to load the data? Or did you get your times via something like `%time filtered.compute()`, where `filtered` is the lazily filtered array?
From your notebook I ran cells like this one:
That's not quite how dask works. There is no caching here. If we want to explicitly load the data from disk, we call `.load()` first.
I see. My Fortran code loads the data, then uses threaded in-memory parallelism. So for a fair comparison I should call `.load()` first and then time only the in-memory filtering?
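i.e., presumably something like this (a sketch; `TEMP` and `filter` follow the earlier snippets, and the `dims` names are my guess for the POP grid):

```python
# eager version: pull the data into memory first, so the timed line
# measures only the in-memory filtering
field = TEMP.isel(k=0).load()
%time filtered = filter.apply(field, dims=['nlat', 'nlon'])  # dim names assumed
```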
CUDA GPU times for filtering POP data on the tripolar grid (including exchanges across the tripolar seam), for the fixed factor and fixed scale filters, at 1 depth level and at all 62 levels. I split each time into the sum of two numbers: the first number is the wall time for the (lazy) `filter.apply` call, and the second number is the wall time for the `.compute()` call. All times above include loading of data. Times for only loading the data were measured separately for 1 level and for 62 levels.
I think the second number, when you actually call `.compute()`, is the relevant one.
Yes, I have always been measuring the second number (e.g., in that monstrous table above). I just included the first number this time because Ian reported the sum of these 2 numbers for the CPU times.
When I use chunking, the fixed factor SST times are: 6.11, 6.38, 6.11, 6.12, 6.08, 6.18, 6.07, 6.09, 6.09, 6.14 (mean = 6.14 s).
That's a bit puzzling... 🤔 What is the shape of the SST array? What dimension did you chunk over?
Definitely not what I expected...
Yes, definitely puzzling. I'm on again this morning and tried chunking SST on both longitude and latitude (separately). This morning it threw an error and simply wouldn't apply the filter unless the data is arranged as one chunk. On the other hand, this morning I'm getting wall times orders of magnitude faster than last time: for fixed-factor SST my times are 10 times faster this morning, and 42 times faster for fixed-scale. On the other hand, the time for fixed-factor temperature over all layers is still 5m20s, which is much longer than the time I get when I don't load the data first.
If the SST data doesn't have another dimension, there is no speedup to be gained with dask. If you could share more details of the datasets you're passing to gcm_filters (e.g. their dimensions, shapes, and chunks), that would help.
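For example (generic xarray; `SST` here stands for whichever DataArray you are filtering):

```python
# quick ways to inspect what gcm-filters is being handed
print(SST.dims, SST.shape)  # dimension names and sizes
print(SST.chunks)           # None unless the array is dask-backed
```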
Agreed! Something could be weird with your Python environment on your cluster?
I might have figured out what I was doing wrong: I incorrectly assumed that I wouldn't need to do both the load and the compute steps. I don't understand what's happening when I filter all 62 levels, though. If I don't use `.load()` first, everything is much faster.
If the data are already in memory (i.e. you've called `.load()`), then dask shouldn't be reading from disk at all during the filtering. The real question is why loading first is so much slower than loading lazily. Two important questions:
FWIW, the distinction between eager and lazy computation is described in the gcm_filters docs: https://gcm-filters.readthedocs.io/en/latest/tutorial.html#use-dask. Reading that might clear up some confusion.
I use the same chunk size for the 90s version.
For context,
Does it make sense to include a section on the computational implementation of the filtering in this paper? If we are only doing one paper, I think so. It would be nice to have some simple benchmarking showing how much the GPU speeds things up.
If so, I would be happy to lead that section.
Depending on our level of ambition, we could also look at things like parallel scaling as a function of the number of processors / GPUs.