
section on computational aspects #4

Closed
rabernat opened this issue Jan 12, 2021 · 46 comments

@rabernat
Collaborator

Does it make sense to include a section on the computational implementation of the filtering in this paper? If we are only doing one paper, I think so. It would be nice to have some simple benchmarking showing how much the GPU speeds things up.

If so, I would be happy to lead that section.

Depending on our level of ambition, we could also look at things like parallel scaling as a function of the number of processors / GPUs.

@iangrooms
Member

I do think we should have something about computational aspects, especially showing speedup with GPUs. I have Fortran code that applies a Gaussian filter to 0.1 degree POP data. The problem size is small enough that it doesn't really scale well with increasing number of processors, and I worked hard to make it as fast as it is. If we could compare that code to the GPU/Laplacian version I think it might help underscore that this new method is actually much faster, not just a fun mathematical exercise. Nora is currently working on applying the filter to MITgcm LLC4320 data, so if anyone has code to apply a convolution filter to that kind of data we could use it as a comparison instead of my code for POP data.

@rabernat
Collaborator Author

Nora is currently working on applying the filter to MITgcm LLC4320 data

LLC4320 and similar simulations are a really exciting application for this tool. In many ways, the computational challenges of working with LLC4320 when it first came out really motivated the pivot in my career to become obsessed with software and tools. We definitely plan to use this tool on LLC4320, so it's good to know you are looking into that as well.

As I noted in #1 (comment), there are some challenges with the LLC4320 data because of the multi-faceted grid. In order to make this diffusion approach work, we would need to implement exchanges between faces. One possible workaround would be to transform the "LL" part of the grid into a pseudo lat-lon grid (essentially neglecting the Arctic), using this function from xmitgcm. However, that would contaminate everything within filter-distance of the northern boundary.

[Image: the "LL" part of the LLC grid (from https://medium.com/pangeo/petabytes-of-ocean-data-part-1-nasa-ecco-data-portal-81e3c5e077be)]

Bottom line: I think LLC introduces sufficient additional complexities--beyond the other use cases we have considered--that it is probably out of scope for this current paper. However, longer term, there are very exciting opportunities with that type of model.

@NoraLoose
Member

Due to the difficulties with LLC grids that Ryan mentions above, I am now using the POP data to illustrate the filter aspects mentioned in #1. @rabernat, do you have a Laplacian/kernel that communicates across the northern boundary of the POP tripolar grid (and “sews up” the grid along the line between the two northern grid poles)? The Laplacian I am currently using only works away from the Arctic.

@rabernat
Collaborator Author

@rabernat, do you have a Laplacian/kernel that communicates across the northern boundary of the POP tripolar grid (and “sews up” the grid along the line between the two northern grid poles)?

Unfortunately no! All the POP code I have, I shared here: ocean-eddy-cpt/gcm-filters#14

It would be great to figure this out. You could try poking around the pop-tools repo and see if anyone has some code they can share for how to handle the tripolar grid boundary condition. Then we can insert it into the laplacian.

@iangrooms
Member

iangrooms commented Jan 18, 2021

Here's an idea for a way to deal with the POP tripole grid. First just treat the top boundary as a no-flux land boundary to get a fast roll through the full data. Then go back and correct the last column of data by adding the flux-divergence across the tripole seam. The indexing is as follows: the point immediately across the seam from (i,end) is (3601-i,end). (edited to correct typo in the formula)

@iangrooms
Member

Alternatively you could add an extra column to the data that is equal to the last column with a reversed index. Then apply the standard Laplacian, then delete the last column.
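For concreteness, here is a minimal numpy sketch of that reflected ghost-row idea (my own illustration, not code from the repo; it assumes a (nlat, nlon) array with the tripole seam along the last row, and pad_tripole/unpad_tripole are made-up names):

import numpy as np

def pad_tripole(field):
    # append a ghost row that mirrors the seam row, so that (i, end) pairs with (n-1-i, end)
    ghost = field[-1, ::-1]
    return np.concatenate([field, ghost[None, :]], axis=0)

def unpad_tripole(padded):
    # drop the ghost row again after the standard Laplacian has been applied
    return padded[:-1, :]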

@jbusecke
Collaborator

Alternatively you could add an extra column to the data that is equal to the last column with a reversed index. Then apply the standard Laplacian, then delete the last column.

Just FYI, this is something that we discussed over at xgcm as well. Would be very curious if there are more general ways to do this that we could potentially port over there.

@iangrooms
Member

I'm imagining the computational section will basically be about timing, but correct me if I'm wrong. I have Fortran fixed-scale and fixed-factor smoothers for SST from the 0.1 degree POP data; it would be nice to compare to CPU and GPU versions of the Laplacian code. I will post my code to the repo and my timings here.

@iangrooms
Member

iangrooms commented Jan 20, 2021

The two POP codes are on the repo. I ran this at CU on a clean node (no other processes) with 24 cores (the max on that node); the code was compiled with gfortran 6.1.0.

For fixed factor, the factor is 10 and the kernel width is 19. For the fixed scale the length scale is 1 degree latitude (i.e. 111.7 km). For the fixed-scale filter I don't truncate the filter kernel: I use all points, even when the weights are infinitesimal. This is a sub-optimal implementation, but it makes it a lot easier to deal with the tripole grid. In both cases I use a wet mask, i.e. I ignore points on land instead of treating them as 0.

I ran each version 10 times, and here are the results using 24 cores (seconds):
fixed factor: .388, .400, .387, .387, .425, .382, .340, .396, .354, .409 (mean = 0.387s)
fixed scale: 275.66, 277.63, 279.71, 274.50, 277.89, 279.82, 279.63, 279.63, 277.32, 275.71 (mean = 277.75 s)
Results with 1 core:
fixed factor: .848, .887, .888, .883, .850, .854, .879, .849, .846, .847 (mean = .863s)
fixed scale: 3657, 3657, 3667, 3659, 3724, 3664, 3660, 3656, 3728, 3840 (mean = 3691s)

For fixed scale we get a speedup of about 13x when moving from 1 to 24 cores. For fixed factor the speedup is much more modest: 2.22x.

I also ran fixed-factor over all 62 layers:
Serial: 54.56, 56.25, 53.47, 54.59, 55.24, 52.12, 54.16, 53.85, 54.62, 53.58 (mean = 54.23s)
Parallelizing over layers: 26.94, 28.37, 27.87, 27.50, 27.81, 27.89, 28.62, 27.60, 27.29, 28.93 (mean = 27.88s)
There's only approximately a factor of 2 speedup, which suggests that my implementation is limited by memory rather than compute. It takes 9.21s on average to read the data, and reading time is not included in the numbers above.

@NoraLoose
Member

I implemented the fixed factor filter for the POP tripolar grid, see this PR, so we can get the timing comparison with the Fortran code started. I filtered the same data set as Ian (we both do only 1 time slice), but all 62 depth levels (whereas Ian does only 1 depth level). The results are documented in this notebook.

To filter 62 depth levels, the timing was

  • on CPU:
CPU times: user 8min 36s, sys: 3min 49s, total: 12min 25s
Wall time: 6min 16s
  • on GPU:
CPU times: user 5.31 s, sys: 1.49 s, total: 6.8 s
Wall time: 5.23 s

Assuming sequential composition, Ian's code would take approximately 24s for 62 depth levels (62 × 0.387 s ≈ 24 s), so gcm-filters on a GPU beats the Fortran code (but is much slower on a CPU).

I still have to implement the fixed scale filter.

@rabernat, do you have any other ideas on what could go into the paper's section on computational aspects? Would you want to experiment with parallelizing across multiple GPUs? I have only tried this briefly and saw a slowdown, though that was a while ago. The GPU results above are using one NVIDIA Tesla V100 GPU.

@rabernat
Collaborator Author

I left some comments on this in ocean-eddy-cpt/gcm-filters#26 (comment), but I'll repeat some of them here. It's great that the GPU version is so fast. But I'm also a bit concerned by how slow the CPU version is.

In comparing with the Fortran code, I think we should do a lower-level comparison. It's not quite fair to pit the low-level Fortran code against the xarray / dask code, since they do different things. The xarray / dask code is doing a lot of extra stuff that may influence the timing significantly. For example, it has to align the coordinates of all the input arrays, parse the dask graph, assemble metadata for the outputs, etc.

A more direct comparison would be to call the low-level routine, which is just numpy, and get xarray and dask out of the way. This is not directly exposed in the package, but you can get at it with code like this:

from gcm_filters.filter import _create_filter_func
filter_func = _create_filter_func(filter.filter_spec, filter.Laplacian)  # filter is a Filter object you already created
# load everything into memory as numpy arrays, bypass xarray
field_raw = TEMP.isel(k=0).values  # only do one level
wet_mask_raw = wet_mask.values
%time filter_func(field_raw, wet_mask_raw)

Even that is not really fair because, as we noted in ocean-eddy-cpt/gcm-filters#22 (comment), the initialization of the Laplacian still happens inside filter_func. (We should also compare the same-size problem, either 62 levels or 1, rather than assume perfect scaling between them.)

This may not actually be any faster. But at least it gets closer to the core of the problem and shows us where we could potentially optimize.

@rabernat
Collaborator Author

rabernat commented Feb 17, 2021

An even closer-to-the-metal comparison would be the Laplacian itself, which you can time like this:

laplacian = filter.Laplacian(wet_mask=wet_mask.data)
%time _ = laplacian(da_masked.data)

I'd be interested in getting the timing of the laplacian itself for the different implementations.

Assuming this is the most expensive part, the total time for filtering should scale roughly like (N_laplacian_steps + 2 * N_biharmonic_steps) * laplacian_time. We can't parallelize over filter steps, so parallelism is unlikely to affect this. In some rough tests I did, this scaling seems to hold well.
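As a quick worked example (using the laplacian_time measured above and the step counts quoted later in this thread, where n_steps=20 translates to 18 Laplacian steps and 1 biharmonic step):

n_laplacian_steps, n_biharmonic_steps = 18, 1  # from the n_steps=20 Gaussian filter spec discussed below
estimated_filter_time = (n_laplacian_steps + 2 * n_biharmonic_steps) * laplacian_time  # rough wall-time estimate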

@NoraLoose
Member

@rabernat, thanks for your feedback. I'm replying to your benchmarking-related questions in ocean-eddy-cpt/gcm-filters#26 (comment) in this issue here, to keep the discussion on computational aspects in one place. I apologize in advance if my answers miss the point of your questions, because I am quite new to the whole benchmarking business.

Were the data already in memory before doing this timing? Or are we including the time to load the data from disk in those numbers?

The numbers do include the time to load the data from disk into memory (both for CPU and GPU timings). I expect this to be a small fraction, because we are dealing with a quite small data array here (2.14 GB, dimensions (time: 1, z_t: 62, nlat: 2400, nlon: 3600)). But I will test this.

How much parallelism was used in each case?

  • First of all, a disclaimer: I did both the CPU and the GPU timings in an interactive session on casper where I asked for one NVIDIA Tesla V100 GPU (via execdav --gres=gpu:v100:1). I don't know enough about casper's hardware infrastructure (or really HPC hardware in general) to be able to tell what CPU resource I was actually using when I timed the CPU version on this reservation, or whether I was sharing resources with other users. Is there an easy way to find out? I should also repeat the CPU test on a reservation that is intended for CPU use only (and not shared among users).
  • The previous bullet point may answer part of your question. To get the numbers, I simply ran this notebook from top to bottom in a "fresh" interactive session that started up in response to my request execdav --gres=gpu:v100:1. As you can see, I did not experiment with the dask scheduler. To be fair, the data set I filtered was pretty small, and I chunked it only into 2 chunks, each of dimension (time: 1, z_t: 31, nlat: 2400, nlon: 3600), and size 1.07 GB. Same chunk decomposition for CPU and GPU experiments.

Have you compared the serial execution time on a single level?

Not yet. But I will, and will also execute the lower-level tests that you suggested above.

@iangrooms, eventually we should also do the Fortran vs. gcm-filters timing comparisons on a larger data array, say at least 100 time slices: (time: 100, z_t: 62, nlat: 2400, nlon: 3600).

@rabernat
Collaborator Author

eventually we should also do the Fortran vs. gcm-filters timing comparisons on a larger data array,

My $0.02 is that I think we want to focus on comparing the low-level in-memory performance on a relatively small array (like one vertical level). If we are going to compare parallelized implementations (e.g. MPI vs. Dask), then we are also comparing those qualitatively different frameworks, and the scope of this becomes very broad. (E.g. we have to think about strong vs. weak scaling, etc.)

@iangrooms
Member

iangrooms commented Feb 17, 2021

I agree that we should focus on in-memory, not multiple time slices. It's just so hard to get a 'fair' comparison for 100 time slices. I could write MPI code that just does all 100 slices in parallel and takes exactly the same amount of time as one slice.

The numbers do include the time to load the data from disk into memory (both for CPU and GPU timings)

My numbers do not include time loading the data or writing the result back to disk. So that will probably bring the CPU Python code a lot closer to the Fortran code. I can easily do all 62 layers if we want to do that, but it seems easier and simpler to just do the top layer. This gets back to the fair comparison idea: I could just do all 62 layers in parallel using tons of processors... My current one-layer implementation uses 24 CPUs for a single layer.

@rabernat
Collaborator Author

To be clear, I think it is very plausible that the fortran code is 10x faster than the numpy CPU code, the way we have written it. I just want to boil it down as much as possible. Since numpy just wraps low-level C and Fortran itself, it should hypothetically be possible to get the numpy version closer to the fortran one. This blog post talks about this in some detail.

However, I don't think optimizing the numpy code should be a priority right now. We should just report the results of our "naive" implementation.

@NoraLoose
Member

@iangrooms, which filter specs did you use to get the times you reported above? I tested for

filter_scale=10,
dx_min=1,
filter_shape=gcm_filters.FilterShape.GAUSSIAN,
n_steps=20

which translates to # Laplacian steps = 18, # Biharmonic steps = 1.

But you may have tested for fewer n_steps. And did you use a wet_mask?
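For reference, a Filter object with these specs would be constructed roughly like this (a sketch: the grid_type name follows the notebook and the table later in this thread and may differ in the released gcm-filters API; wet_mask is the KMT-based mask discussed below):

import gcm_filters
filter = gcm_filters.Filter(
    filter_scale=10,
    dx_min=1,
    filter_shape=gcm_filters.FilterShape.GAUSSIAN,
    n_steps=20,
    grid_type=gcm_filters.GridType.POP_SIMPLE_TRIPOLAR_T_GRID,  # name as used in the notebook; may differ in the released package
    grid_vars={'wet_mask': wet_mask},
)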

@rabernat
Collaborator Author

My current one-layer implementation uses 24 CPUs for a single layer.

😲 This is super important information that I had somehow missed. The numpy CPU version is 100% serial and uses only 1 CPU. The numpy version seems 10x slower, but the Fortran could hypothetically be getting a 24x speedup. So if you correct for the parallelism, the numpy is actually faster?

My impression is that parallelizing over the "filter dimensions" (i.e. splitting into lat-lon tiles) is not the most efficient strategy, because the algorithm requires lots of inter-tile communication. This would be reflected in less-than-ideal parallel scaling as you vary the number of MPI ranks. Specifically, how much faster is the 24 CPU version than the 1 CPU version? (I doubt 24x but I could be wrong...) However, the GPU is essentially using parallel processing internally and doing something similar to the fortran MPI code, so it's still an interesting comparison.

In contrast, parallelizing over the non-filter dimensions (e.g. depth, time, etc.) is embarrassingly parallel (no communication required between processes) and should scale much better.

If we wanted to really nerd out on this, I would suggest the following experiments:

  1. The reference timing is for the fortran serial code (one CPU). This is the baseline against which everything else is compared.
  2. MPI-fortran parallelizing over the filter dimensions
  3. MPI-fortran parallelizing over the depth dimension. I predict this is significantly faster than 2
  4. Serial numpy (compare to 1)
  5. Cuda GPU (compare to 2)
  6. Serial numpy + dask for parallelism (interested to see if this can match 3)

Ultimately we are interested in two things:

  • How fast can we filter a real dataset using a single node (either using all the CPUs or all the GPUs)?
  • Can we efficiently do distributed filtering to leverage many nodes (either CPU or GPU)?

@iangrooms
Member

@iangrooms, which filter specs did you use to get the times you reported above? I tested for

filter_scale=10,
dx_min=1,
filter_shape=gcm_filters.FilterShape.GAUSSIAN,
n_steps=20

which translates to # Laplacian steps = 18, # Biharmonic steps = 1.

But you may have tested for fewer n_steps. And did you use a wet_mask?

Good questions. I added the answers to my original comment, to keep all the info in one place. Short answer here:

  1. I used a factor of 10 and a kernel width of 19 (i.e. n_steps = 9 would be an appropriate comparison). The discrepancy in kernel width is also probably partly responsible for the Python code being slower.
  2. I did use a wet_mask: the KMT variable from the netcdf file. If KMT==0 then the grid point is on land (for the top layer).

@iangrooms
Member

My impression is that parallelizing over the "filter dimensions" (i.e. splitting into lat-lon tiles) is not the most efficient strategy, because the algorithm requires lots of inter-tile communication. This would be reflected in less-than-ideal parallel scaling as you vary the number of MPI ranks. Specifically, how much faster is the 24 CPU version than the 1 CPU version? (I doubt 24x but I could be wrong...) However, the GPU is essentially using parallel processing internally and doing something similar to the fortran MPI code, so it's still an interesting comparison.

The Fortran code doesn't scale perfectly, but I didn't do a careful analysis. I don't use tiles; instead, I found that the most efficient implementation I tried was to filter first over longitude and then over latitude. Within each loop the iterations are embarrassingly parallel. In between the loops I re-pack the data in memory so that it is contiguous for each loop (to avoid cache thrashing).

I will go back and get timings for the serial Fortran code, to compare to serial Python. If we can't parallelize a single layer in Python, then I guess I could update my code to run over all 62 layers and then compare Fortran vs Python using the same number of cores but different parallelization strategies. I won't do this if we can parallelize a single layer computation in Python, so let me know.

PS: I'm not using MPI, I'm using OpenMP. It's a lot easier to code, though it's limited to a single node.

@NoraLoose
Member

For the fixed scale the length scale is 1 degree latitude (i.e. 111.7 km). For the fixed-scale filter I don't truncate the filter kernel: I use all points, even when the weights are infinitesimal.

@iangrooms, what would be an appropriate choice for n_steps in the gcm-filters framework to do an apples-to-apples comparison with your fortran fixed length scale filter?

@iangrooms
Member

Good question, and sorry for the delay. I think using the default n_steps is appropriate. I could have used a fixed stencil width, but it would have been a pain to code up due to the periodic boundary and tripole seam. The comparison is not apples-to-apples, but you could still argue that it's fair because it was easier to code the iterated-Laplacian filter than to code the stencil one. (I guess "fair" in the sense that I didn't spend hours and hours on the Fortran code trying to give it an advantage.)

@NoraLoose
Member

NoraLoose commented Feb 24, 2021

Here are the results of my timing tests, which should cover

  • item 4: Serial numpy (compare to 1)
  • item 6: Serial numpy + dask for parallelism (interested to see if this can match 3)

from Ryan's list above. The tests are documented in this notebook, which I submitted as part of PR ocean-eddy-cpt/gcm-filters#29. I ran the tests for Ian's POP test data set, on a CPU on casper. @iangrooms, you should be able to rerun this notebook on the CU machines to get the execution times there. Let me know if you have issues. I will report the GPU times in a separate table.

Summary of timing tests

| | CARTESIAN | CARTESIAN_WITH_LAND | POP_SIMPLE_TRIPOLAR_T_GRID | IRREGULAR_CARTESIAN_WITH_LAND | POP_TRIPOLAR_T_GRID |
| --- | --- | --- | --- | --- | --- |
| For which filter type? | fixed factor | fixed factor | fixed factor | fixed scale | fixed scale |
| Laplacian complexity | Ignores continents & tripolar exchanges | Ignores tripolar exchanges | Handles everything | Ignores tripolar exchanges | Handles everything |
| One Laplacian step (serial numpy, 1 depth level) | user 52.1 ms, sys 52.7 ms, total 105 ms; wall 103 ms | user 179 ms, sys 120 ms, total 299 ms; wall 298 ms | user 194 ms, sys 137 ms, total 331 ms; wall 331 ms | user 196 ms, sys 173 ms, total 370 ms; wall 369 ms | user 218 ms, sys 171 ms, total 389 ms; wall 389 ms |
| Full filter (serial numpy, 1 depth level) | n_steps=9: user 538 ms, sys 339 ms, total 877 ms; wall 875 ms | n_steps=9: user 1.89 s, sys 1.27 s, total 3.16 s; wall 3.15 s | n_steps=9: user 2.04 s, sys 1.38 s, total 3.43 s; wall 3.43 s | n_steps=9: user 2.14 s, sys 1.65 s, total 3.79 s; wall 3.78 s. n_steps=90: user 20.5 s, sys 16.3 s, total 36.8 s; wall 36.8 s | n_steps=9: user 2.31 s, sys 1.74 s, total 4.05 s; wall 4.05 s. n_steps=90: user 22.1 s, sys 16.8 s, total 38.9 s; wall 38.9 s |
| Serial numpy + dask for parallelism (1 depth level); times are for triggering computation, exclude the preceding filter.apply() call, but include loading of data | n_steps=9: user 1.37 s, sys 697 ms, total 2.06 s; wall 1.21 s | n_steps=9: user 1.9 s, sys 1.3 s, total 3.21 s; wall 3.2 s | n_steps=9: user 2.04 s, sys 1.46 s, total 3.5 s; wall 3.5 s | n_steps=9: user 2.21 s, sys 1.77 s, total 3.97 s; wall 3.97 s. n_steps=90: user 20.3 s, sys 16.6 s, total 36.9 s; wall 36.8 s | n_steps=9: user 2.21 s, sys 1.77 s, total 3.97 s; wall 3.97 s. n_steps=90: user 21.1 s, sys 16.9 s, total 38 s; wall 38 s |
| Serial numpy + dask for parallelism (62 depth levels); same timing convention as above | n_steps=9: user 1min 4s, sys 32.8 s, total 1min 37s; wall 51.1 s | n_steps=9: user 3min 12s, sys 1min 31s, total 4min 43s; wall 2min 24s | n_steps=9: user 3min 23s, sys 1min 39s, total 5min 3s; wall 2min 35s | n_steps=9: user 3min 51s, sys 2min 10s, total 6min 1s; wall 3min 4s | n_steps=9: user 3min 59s, sys 2min 14s, total 6min 14s; wall 3min 10s |
| Time to load data, 1 depth level | SST: user 61.2 ms, sys 29 ms, total 90.1 ms; wall 90.4 ms | SST: see left. wet_mask: user 11 µs, sys 6 µs, total 17 µs; wall 19.8 µs | SST: see left. wet_mask: see left | SST: see left. wet_mask, dxw, dyw, dxs, dys, area: approximately 6x the wet_mask times (see left) | SST: see left. wet_mask, dxe, dye, dxn, dyn, tarea: approximately 6x the wet_mask times (see left) |
| Time to load data, 62 depth levels | TEMP: user 4.95 s, sys 1.71 s, total 6.66 s; wall 5.59 s | TEMP: see left. wet_mask: see above | TEMP: see left. wet_mask: see above | TEMP: see left. wet_mask, dxw, dyw, dxs, dys, area: see above | TEMP: see left. wet_mask, dxe, dye, dxn, dyn, tarea: see above |

Maybe ignore the Serial numpy + dask for parallelism (62 depth levels) timings for now - I haven't experimented with the number of data chunks and the number of dask workers yet.

@iangrooms
Member

iangrooms commented Feb 25, 2021

I ran @NoraLoose's jupyter notebook on the same node (24 cores, no other processes) as my tests above. Times for the full-complexity solves (using the tripole correctly) for SST are:
fixed factor: 9.48, 8.59, 10.2, 8.63, 9.68, 8.73, 8.94, 8.99, 8.61, 9.56 (mean = 9.14s)
fixed scale: 64, 63, 70, 66, 71, 65, 66, 66, 67, 63 (mean = 66.1s)
For all 62 layers I only ran fixed-factor:
86, 90, 86, 87, 88, 88, 87, 87, 88, 87 (mean = 87.4s)
For fixed factor SST the Fortran code is about 24x faster. The Python code only uses about 3 cores, but it is using threaded numpy.
For fixed scale SST the Python code is about 4x faster.
For fixed factor 62 layers the Fortran code is about 3x faster.
The algorithm for the Fortran fixed-scale filter is quite crude (the stencil size is the entire domain), so that's why it's so much slower.

@NoraLoose
Member

NoraLoose commented Feb 26, 2021

@iangrooms, to take full advantage of parallel computing with dask on your 24 cores, we will have to set up scheduler and worker processes via Dask.distributed. I think I would try to do this via the Dask-Jobqueue, i.e., running something like this in the jupyter notebook:

from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=24, memory='100GB', queue='regular', ...)
from dask.distributed import Client
client = Client(cluster)  # Connect this local process to remote workers

and maybe chunk the TEMP data into 62 chunks

TEMP = ds.TEMP.chunk({'z_t': 1})  # 62 chunks

rather than just 2 chunks.

Getting the dask scheduler and worker processes set up via the Dask-Jobqueue may take me some time (I don't have much experience with this yet), and will also depend on the cluster. So if I set this up for casper this won't translate one-to-one to the CU cluster. I would therefore try to get this set up directly for the CU cluster if this is where we want to do all the timing tests. On the other hand, @arthurBarthe & Co will do their tests on a different cluster from CU, so maybe okay to do this on casper after all? Maybe this is a good time to think more carefully about what tests exactly we want to include in the paper, and how to design these tests.

Sorry Ian that I handed over the notebook to you before bringing this up. I hadn't fully appreciated that you would run this on all 24 cores.

@iangrooms
Member

I neglected to mention that I used a chunk size of 3, which resulted in 21 chunks & 22 processes. That was as close as I could get to 24. I also noticed that the "serial" numpy was using about 300% CPU, so basically 3 cores.

I don't think we need to run all tests on the same cluster; we just need to be sure to only compare timings on the same cluster. So gcm-filters-CPU vs Fortran CPU on my cluster at CU is a fair comparison, and gcm-filters-CPU vs the NYU convolution code on an NYU cluster (or NCAR) is a fair comparison. If we get an NYU GPU convolution code then we should compare it to gcm-filters-GPU on the same GPU. If we want to compare my Fortran code to the NYU convolution code then we should do it on the same cluster, but I don't know if that's necessary. In any case, I figured out how to fix my parallel Fortran problem at NCAR, so I can always run my Fortran code there if we want everything on the same platform.

@rabernat
Collaborator Author

rabernat commented Mar 1, 2021

to take full advantage of parallel computing with dask on your 24 cores, we will have to set up scheduler and worker processes via Dask.distributed.

Not quite true. Distributed is only strictly necessary if we want to use multiple nodes. The default threaded scheduler uses multithreading, similar to OpenMP.
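For example, something like this already uses all local cores (a sketch; TEMP is assumed to be dask-backed, e.g. chunked over z_t, and the scheduler kwargs are just illustrative):

filtered = filter.apply(TEMP, dims=['nlat', 'nlon'])             # lazy
result = filtered.compute(scheduler='threads', num_workers=24)   # default threaded scheduler, no cluster setup needed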

iangrooms changed the title from "section on computational aspects?" to "section on computational aspects" on Mar 4, 2021
@iangrooms
Member

I expect the "computational efficiency" section to be fairly short. I've already got timings of Nora's POP tripole code and my Fortran code, so here's what I think we need to wrap up:

  1. Timings of @NoraLoose's POP tripole code on the GPU (both fixed-scale and fixed-factor, same configuration as this notebook from Nora's post above). This notebook should be added to the paper repo as well; I don't see why it should go in the gcm-filters repo.
  2. @arthurBarthe and @ElizabethYankovsky's timings of the fixed-factor filter and scipy Gaussian filter using regional data. I think we only need the CPU tests, since the GPU version of the fixed-factor code is nearly identical to the GPU tests on the POP data.

The point of these tests is just to show that the method we're proposing is not worse than other methods on the CPU, and blazing-fast on the GPU. I just want to give the reader reassurance that this method is practical, and not just another cool idea that ends up being practically unusable because of the cost.

@NoraLoose
Member

I ran @NoraLoose's jupyter notebook on the same node (24 cores, no other processes) as my tests above. Times for the full-complexity solves (using the tripole correctly) for SST are:
fixed factor: 9.48, 8.59, 10.2, 8.63, 9.68, 8.73, 8.94, 8.99, 8.61, 9.56 (mean = 9.14s)
fixed scale: 64, 63, 70, 66, 71, 65, 66, 66, 67, 63 (mean = 66.1s)
For all 62 layers I only ran fixed-factor:
86, 90, 86, 87, 88, 88, 87, 87, 88, 87 (mean = 87.4s)

I wonder why it took about 2-3 times longer for you (on your 24 cores) than for me (serial numpy) to run the gcm-filters test.
For example:

  • fixed factor, 1 level, took 3.43s for me, but 9.14s for you.
  • fixed scale, 1 level, took 38.9s for me, but 66.1s for you.

@iangrooms
Member

My node at CU is getting old. I bought it 6 years ago and it wasn't the latest hardware at the time. The CPU and RAM speed on casper are probably faster, but I think what matters here is that the Fortran and Python codes are both running on the same hardware. We can list hardware specs in the timing section of the paper so people who are interested can see how old the CPU is.

@NoraLoose
Member

Ok, follow-up question: Did you subtract the time it takes for dask to load the data? Or did you get your times via

filter_func = _create_filter_func(filter_simple_pop_tripolar_grid.filter_spec, filter_simple_pop_tripolar_grid.Laplacian)
%time filter_func(SST_raw, wet_mask_raw)

where SST_raw and wet_mask_raw are numpy arrays already loaded into memory?

@iangrooms
Member

From your notebook I ran cells like these:
%time SST_filtered_cartesian_with_land = filter_cartesian_with_land.apply(SST, dims=['nlat', 'nlon'])
%time SST_filtered_cartesian_with_land_computed = SST_filtered_cartesian_with_land.compute() # triggering computation
then added up the "Wall time" reported by the two cells. I ran each 11 times and dropped the first, assuming that after the first try the data would be already loaded from disk and the Wall times would not include the time to read from disk.

@rabernat
Collaborator Author

rabernat commented Mar 4, 2021

, assuming that after the first try the data would be already loaded from disk and the Wall times would not include the time to read from disk.

That's not quite how dask works. There is no caching here.

If we want to explicitly load the data from disk, we call .load() on the inputs and operate eagerly (no dask) from there on. Or if we want threaded parallelism with in-memory data, we call .load() on the inputs and then call .chunk() again.
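A sketch of the two patterns (variable and filter names follow the notebook; the chunk size is illustrative):

# eager: data in memory, no dask involved; apply() computes immediately
TEMP_eager = ds.TEMP.load()
filtered_eager = filter.apply(TEMP_eager, dims=['nlat', 'nlon'])

# threaded parallelism on in-memory data: load, then re-chunk over the depth dimension
TEMP_chunked = ds.TEMP.load().chunk({'z_t': 1})
filtered = filter.apply(TEMP_chunked, dims=['nlat', 'nlon']).compute()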

@iangrooms
Member

I see. My fortran code loads the data, then uses threaded in-memory parallelism. So for a fair comparison I should call .load() and then chunk again before running the timing tests. I'll work on that.

@NoraLoose
Member

CUDA GPU times for filtering POP data on the tripolar grid (including exchanges across the tripolar seam):

fixed factor, n_steps=9, 1 level:
1.84s + 303ms, 1.84 s + 277 ms, 1.9 s + 291 ms, 1.98 s + 298 ms, 1.92 s + 285 ms, 2.07 s + 300 ms, 1.92 s + 289 ms, 1.92 s + 289 ms, 1.9 s + 293 ms, 1.98 s + 304 ms; time mean: 1.93 s + 293 ms

fixed scale, n_steps=90, 1 level:
1.21 s + 606 ms, 1.19 s + 605 ms, 1.2 s + 552 ms, 1.3 s + 615 ms, 1.22 s + 610 ms, 1.31 s + 624 ms, 1.24 s + 551 ms, 1.24 s + 609 ms, 1.21 s + 557 ms, 1.22 s + 604 ms; time mean: 1.23 s + 593 ms

fixed factor, n_steps=9, 62 levels:
1.92 s + 5.25 s, 2.01 s + 5.46 s, 1.97 s + 5.4 s, 1.98 s + 5.35 s, 2.02 s + 5.5 s, 1.93 s + 5.34 s, 2.02 s + 5.48 s, 1.92 s + 5.25 s, 1.94 s + 5.35 s, 2.02 s + 5.45 s; time mean: 1.97 s + 5.38 s

fixed scale, n_steps=90, 62 levels:
1.21 s + 22.5 s, 1.22 s + 22.5 s, 1.21 s + 22.4 s, 1.31 s + 22.5 s, 1.21 s + 22.4 s, 1.22 s + 22.4 s, 1.44 s + 22.5 s, 1.31 s + 22.6 s, 1.21 s + 22.5 s, 1.32 s + 22.5 s; time mean: 1.27 s + 22.5 s

I split each time into a sum of two numbers. The first number is the wall time for

%time SST_filtered_pop_tripolar_grid = filter_pop_tripolar_grid.apply(SST, dims=['nlat', 'nlon'])

The second number is the wall time for

%time SST_filtered_pop_tripolar_grid_computed = SST_filtered_pop_tripolar_grid.compute()  # triggering computation

All times above include loading of data. Times for only loading data are:

load data, 1 level:
660 ms, 636 ms, 608 ms, 709 ms, 610 ms, 588 ms, 613 ms, 635 ms, 611 ms, 648 ms; time mean: 632 ms

load data, 62 levels:
5.71 s, 5.59 s, 5.96 s, 5.78 s, 5.46 s, 5.67 s, 5.48 s, 5.72 s, 6.16 s, 5.61 s; time mean: 5.71 s

@rabernat
Collaborator Author

rabernat commented Mar 5, 2021

I think the second number, when you actually call compute(), is the important one to measure. The first number is just measuring the setup time. As you can see, it's independent of the problem size.

@NoraLoose
Member

Yes, I have always been measuring the second number (e.g., in that monstrous table above). I just included the first number this time because Ian reported the sum of these 2 numbers for the CPU times.

@iangrooms
Member

When I use .load() and then .chunk() (with appropriate arguments), it seems to make it single threaded. There is no time difference regardless of how many chunks I use, and CPU load is 100%. For SST fixed-factor the new method does save about 3 seconds, which is a lot considering it goes from 9 to 6. But for SST fixed-scale and for all 62 layers fixed-factor the new method (i.e. loading data first) is actually about 4 times slower (321s is the new result vs 87.4s the old way). So I think we should use the new timing results for SST:

fixed factor SST: 6.11, 6.38, 6.11, 6.12, 6.08, 6.18, 6.07, 6.09, 6.09, 6.14 (mean = 6.14s)
fixed scale SST: 64, 64, 64, 64, 64, 64, 64, 65, 64, 65 (mean = 64s)

@rabernat
Collaborator Author

rabernat commented Mar 5, 2021

There is no time difference regardless of how many chunks I use, and CPU load is 100%

That's a bit puzzling... 🤔 What is the shape of the SST array? What dimension did you chunk over?

But for SST fixed-scale and for all 62 layers fixed-factor the new method (i.e. loading data first) is actually about 4 times slower

Definitely not what I expected...

@iangrooms
Member

Yes, definitely puzzling. I'm on again this morning and tried chunking SST on both longitude and latitude (separately). This morning it threw an error and simply wouldn't apply the filter unless it's arranged as one chunk. On the other hand, this morning I'm getting wall times orders of magnitude faster than last time. For fixed-factor SST my times are 10 times faster this morning, and 42 times faster for fixed-scale. However, the time for fixed-factor temperature over all layers is still 5m20s, which is much longer than the time I get when I don't .load(). I don't like how inconsistent and unpredictable this is.

@rabernat
Collaborator Author

rabernat commented Mar 5, 2021

I'm on again this morning and tried chunking SST on both longitude and latitude (separately)

The gcm_filters method will definitely not work if you chunk the "filter dimensions" (e.g. lat and lon). Dask just isn't smart enough to parallelize the algorithm in this difficult way. It can only do the embarrassingly parallel thing of using multiple threads to parallelize across a different dimension, e.g. time, depth, etc. That's what I tried to say in #4 (comment).

If the SST data doesn't have another dimension, there is no speedup to be gained with dask.
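In other words, chunk only over the non-filter dimensions, e.g. (chunk sizes illustrative):

# keep nlat/nlon as single chunks; parallelize over depth only
TEMP = ds.TEMP.chunk({'z_t': 1, 'nlat': -1, 'nlon': -1})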

If you could share more details of the datasets you're passing to gcm_filters (e.g. print(sst)), including the shape, chunk size, etc, that would be useful. (Or maybe this is in one of the linked notebooks?)

I don't like how inconsistent and unpredictable this is.

Agreed! Something could be weird with your python environment on your cluster?

@iangrooms
Member

I might have figured out what I was doing wrong: I incorrectly assumed that I wouldn't need to do both .apply() and .compute() if I had already loaded the data, so in my post above I only counted the time for .apply(). I've gone back and checked the times for both .apply() and .compute() after having used .load(). For SST my compute times are unchanged, so pre-loading the data evidently has no impact on performance. (SST is 2400x3600, one chunk.)

I don't understand what's happening when I filter all 62 levels, though. If I don't use .load(), then .apply() takes about 1 second and .compute() takes about 90 seconds (same as my original results, posted above; CPU usage goes to about 1600%). If I instead call TEMP.load() and then TEMP.chunk({'z_t': 3}), then .apply() takes 5 minutes and 20 seconds (and runs at only 100% CPU), and .compute() only takes 0.2 seconds.

@rabernat
Collaborator Author

rabernat commented Mar 5, 2021

If the data are already in memory (i.e. you've called .load() already), filter.apply operates "eagerly" (rather than "lazily"), and result.compute() does nothing. (Compute triggers a dask array to become an in-memory numpy array).

The real question is why loading first is so much slower than loading lazily. Two important questions:

  • How long does TEMP.load() take on its own?
  • What chunks are you using for the 90s version? Same TEMP.chunk({'z_t': 3})?

@rabernat
Collaborator Author

rabernat commented Mar 5, 2021

FWIW, the distinction between eager and lazy computation is described in the gcm_filters docs: https://gcm-filters.readthedocs.io/en/latest/tutorial.html#use-dask. Reading that might clear up some confusion.

@iangrooms
Member

TEMP.load() only takes 3.76 seconds, which is pretty fast for 4GB but still realistic. It takes my fortran code about 9 seconds to load the same data.

I use the same chunk size for the 90s version.

@NoraLoose
Member

TEMP.load() only takes 3.76 seconds, which is pretty fast for 4GB but still realistic.

For context, TEMP.load() took about 5.7 s on casper. So yes, more or less consistent.
