
Proposed Recipes for CESM2 Superparameterization Emulator #100

Open
mspritch opened this issue Jan 28, 2022 · 7 comments

Comments


mspritch commented Jan 28, 2022

Source Dataset

Several years of high-frequency (15 min) GCM-timestep-level output from the Superparameterized CESM2, isolating state variables "before" and "after" a key code region containing the computationally intensive superparameterization calculations. For use in making a superparameterization emulator of explicitly resolved clouds + their radiative influence + turbulence, that can be used in a real-geography CESM2 framework to sidestep the usual computational cost of SP. Similar in spirit to the proofs of concept in Rasp, Pritchard & Gentine (2018) and Mooers et al. (2021), but with new refinements by Mike Pritchard and Tom Beucler towards compatibility with operational, real-geography CESM2 (critically, isolating only tendencies up to surface coupling and including outputs relevant to the CLM land model's expectations; see the concluding discussion of Mooers et al.).

  • Link to the website / online documentation for the data
    N/A
  • The file format (e.g. netCDF, csv)
    Raw model output is CESM2-formatted NetCDF history files
  • How are the source files organized? (e.g. one file per day)
    One file per day across 8-10 simulated years, each forced with the same annual cycle of SSTs.
  • How are the source files accessed (e.g. FTP)
    Not publicly posted yet. Staged on a few XSEDE or NERSC clusters. Mike Pritchard's group at UC Irvine can help with access to these.
    • provide an example link if possible
  • Any special steps required to access the data (e.g. password required)

Transformation / Alignment / Merging

Apologies in advance if this is TMI. A starter core dump from Mike Pritchard on a busy afternoon:

There are multiple pre-processing steps. The raw model output contains many more variables than one would want to analyze, so there is trimming. But users may want to experiment with different inputs and outputs, so this trimming may be user-specific. Guidance on the specific variable names for the inputs/outputs of published emulators worth competing with is available on request.
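
To make the trimming step concrete, here is a minimal sketch (not an official recipe) of selecting a user-chosen variable subset from one history file with xarray; the variable names and filename below are placeholders, not the actual SP-CESM2 names:

```python
import xarray as xr

# Hypothetical selection of input/output variables; the actual names depend on
# which published emulator configuration a user wants to compete with.
keep_vars = ["TBP", "QBP", "PS", "SOLIN", "SHFLX", "LHFLX"]

ds = xr.open_dataset("SPCESM2.cam.h1.2003-01-01-00000.nc")  # placeholder filename
trimmed = ds[keep_vars]
trimmed.to_netcdf("trimmed.nc")
```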

Important: a key subset of variables (surface fluxes) that probably everyone would want in their input vector will need to be time-shifted backward by one time step relative to the rest, to avoid information leaks; this has to do with the phase of the integration cycle at which these fluxes were saved on the history files versus the emulated region. Some users may want to make emulators that include memory of state variables from previous time steps in the input vector (e.g. as in Han et al., JAMES, 2020), in which case the same backward time-shifting preprocessing should be made flexible enough to apply to additional variables (physical caveat: likely no more than a few hours, i.e. <= 10 temporal samples at most, so there is never any reason to include contiguous temporal adjacency beyond that limit).
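
A minimal sketch of that backward shift with xarray, continuing from the trimmed file above; the flux variable names are placeholders:

```python
import xarray as xr

ds = xr.open_dataset("trimmed.nc")

# Shift the surface-flux inputs so that the value paired with each target
# time step is the flux saved on the *previous* step (avoids the leak).
flux_vars = ["SHFLX", "LHFLX"]  # placeholder names
for v in flux_vars:
    ds[v] = ds[v].shift(time=1)

# The first time step now holds NaN fluxes and should be dropped.
ds = ds.isel(time=slice(1, None))
```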

Many users may want to subsample lon, lat, and time first, to reduce data volume and to promote independence of samples given the spatial and temporal autocorrelations riddled throughout the data. Other users may prefer to include all of these samples as fuel for ambitious architectures that need very data-rich limits to find good fits. This subsampling is user-specific.
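
For users who do want to thin the data, a sketch of stride-based subsampling, continuing from the dataset above; the stride values are arbitrary examples:

```python
# Keep every 8th time sample (2-hourly for 15-min output) and every 2nd
# column in lat/lon to reduce volume and weaken autocorrelation.
subsampled = ds.isel(
    time=slice(None, None, 8),
    lat=slice(None, None, 2),
    lon=slice(None, None, 2),
)
```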

Many users wanting to make "one-size-fits-all" emulators (i.e. the same NN for all grid cells) will want to flatten lon, lat, and time into a generic "sample" dimension (retaining level variability), shuffle it for ML, and split it into training/validation/test sets. Such users would also want to pre-normalize by means and ranges/standard deviations defined independently for separate levels but lumping together the flattened lon/lat/time statistics. Advanced users may want to train regional- or regime-specific emulators, which might then use regionally aware normalizations, so flexibility here would help.
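
A sketch of that flatten/shuffle/split/normalize pipeline, continuing from the subsampled dataset above; the split fractions and random seed are arbitrary choices:

```python
import numpy as np

# Flatten (time, lat, lon) into a generic "sample" dimension, keeping "lev".
flat = subsampled.stack(sample=("time", "lat", "lon")).transpose("sample", ...)

# Shuffle and split into training/validation/test sets.
rng = np.random.default_rng(0)
idx = rng.permutation(flat.sizes["sample"])
n_train = int(0.8 * idx.size)
n_val = int(0.1 * idx.size)
train = flat.isel(sample=idx[:n_train])
val = flat.isel(sample=idx[n_train:n_train + n_val])
test = flat.isel(sample=idx[n_train + n_val:])

# Per-level normalization: statistics are taken over the pooled sample
# dimension only, so each model level keeps its own mean/std. Fit on the
# training split and reuse for validation/test.
mean = train.mean("sample")
std = train.std("sample")
train_norm, val_norm, test_norm = [(x - mean) / std for x in (train, val, test)]
```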

Some users may want to convert the specific humidity and temperature input state variables into an equivalent relative humidity, as an alternate input that is less prone to out-of-sample extrapolation when the emulator is tested prognostically. The RH conversion should use a fixed set of assumptions consistent with an f90 module so that online testing is identical; Python and f90 code can be provided when the time comes.
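
Purely for illustration, a rough sketch of such a conversion using a generic Bolton-style saturation vapor pressure; the production version should instead reuse the exact assumptions of the f90 module mentioned above so that offline and online tests agree:

```python
import numpy as np

def specific_humidity_to_rh(q, T, p):
    """q: specific humidity [kg/kg], T: temperature [K], p: pressure [Pa]."""
    eps = 0.622  # ratio of gas constants, dry air / water vapor
    e = q * p / (eps + (1.0 - eps) * q)  # vapor pressure [Pa]
    # Bolton (1980) saturation vapor pressure over liquid water [Pa]
    es = 611.2 * np.exp(17.67 * (T - 273.15) / (T - 29.65))
    return e / es
```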

The surface pressure is vital information for making the vertical discretization physically relevant per CESM2's hybrid vertical eta coordinate, so it should always be made available. The pressure midpoints and pressure thickness of each vertical level can also be derived from this field, but they vary with lon, lat, and time. Mass-weighting vertically resolved outputs such as diabatic heating by the derived pressure thickness could be helpful to users wishing to prioritize the column influence of different samples, as in Beucler et al. (2021, PRL).
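
A sketch of deriving level pressures and thicknesses from PS and the hybrid coefficients carried on CESM history files (hyam/hybm at level midpoints, hyai/hybi at interfaces, reference pressure P0), plus an optional mass weighting; the output variable name is a placeholder:

```python
# Assumes ds is an xarray Dataset opened from a CESM history file that still
# carries PS, P0, and the hybrid coefficients.
p_mid = ds["hyam"] * ds["P0"] + ds["hybm"] * ds["PS"]  # mid-level pressure [Pa]
p_int = ds["hyai"] * ds["P0"] + ds["hybi"] * ds["PS"]  # interface pressure [Pa]

# Layer thickness: difference between adjacent interfaces, relabeled to "lev"
# so it aligns with mid-level fields.
dp = p_int.diff("ilev").rename({"ilev": "lev"}).assign_coords(lev=ds["lev"].values)

# Optional mass weighting of a vertically resolved output (placeholder name),
# in the spirit of Beucler et al. (2021).
g = 9.80665  # m s-2
heating_mass_weighted = ds["PTTEND"] * dp / g
```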

Output Dataset

I am not qualified to assess the trade-offs of the options listed here, but I am interested in learning.


rabernat commented Mar 3, 2022

Mike, thanks for getting this started!

In order to move forward here, we would need a concrete set of files to ingest, and the files would need to be accessible over the network somehow (http, ftp, scp, globus, etc.)

If you could post a link to one of the raw output files via any of these protocols, that would allow us to start experimenting with ingesting them.


mspritch commented Mar 8, 2022 via email


rabernat commented Mar 9, 2022

> Normally I would just fill in an external collaborator permission form to give outside collaborators access to it. Will that work here?

No. I don't think we will be able to pull files over the VPN (at least not without complicated workarounds). We need to get it so that a machine (rather than a human) can get the files. Globus is probably the best bet here.

Another option would be for us to create some sort of ingestion upload point, basically just a temporary bucket that can accept uploads, and then stage the recipe from there. So someone from your team would have to directly upload the files (push instead of pull). This might be a good solution for the many scenarios in which we just can't get access to the machine where the data live. @cisaacstern - what do you think about that idea?


mspritch commented Mar 9, 2022 via email

cisaacstern commented

> This might be a good solution for the many scenarios in which we just can't get access to the machine where the data live.

This is certainly a recurring scenario and this solution will definitely work from a technical perspective. In the interest of reproducibility and transparency, I suppose we'd just want to consider how to document provenance of pushed data.

rabernat commented

Via email we have been investigating the possibility of using Globus. Repeating some of that conversation here:

Ryan: A lot of the details are in this github thread: pangeo-forge/pangeo-forge-recipes#222 (comment). Here is the relevant piece:

> When using the latest Globus Connect version 5 endpoints, data access is via what we call "collections". Each collection has an associated DNS name, such that you can refer to collections directly. By default collections are assigned DNS names in the data.globus.org subdomain, but they can also be customized by deployments.
>
> v5 also supports HTTP/S access to data, while enforcing a common security policy across the access mechanism. For example, here is a publicly accessible file: https://a4969.36fe.dn.glob.us/public/read-only/logo.png. For non-public data, users must authenticate (and be authorized to access the data) before it can be downloaded via HTTPS.

So the ideal solution from our point of view would be for you to have a GCv5 endpoint and create a public collection for the associated data. Then we can pull it directly over HTTPS. Would this be possible?

Nate: I believe that Globus Collections functionality requires an upgraded site license. Our campus couldn't justify the pretty significant cost, and I don't think the effort to get a UC-wide license went anywhere.

Falling back to an SSH-based method might be best. Is the client system on a single IP that we can allow through our firewall?

rabernat commented

Following up on this, over in pangeo-forge/pangeo-forge-recipes#222 (comment) I managed to put together a proof of concept of pulling from a Globus collection via HTTPS. So from the Pangeo Forge POV, using Globus collections is definitely the easiest and quickest option here.

Alternatively, we could move the data to another system (maybe Cheyenne) that has the Globus collections feature enabled.

> Falling back to an SSH-based method might be best. Is the client system on a single IP that we can allow through our firewall?

I'm not sure if we can know a priori what IPs the requests will be coming from. That's a bit beyond our DevOps capability right now. I suppose "somewhere in Google Cloud" is too vague? And would there be a VPN involved?
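
For concreteness, here is a rough sketch of what the recipe could eventually look like once the daily files are reachable over plain HTTPS; the collection hostname, file naming, and date range below are placeholders, and the pangeo-forge-recipes ConcatDim/FilePattern/XarrayZarrRecipe interface is assumed:

```python
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Placeholder date range and URL layout for the daily SP-CESM2 history files.
dates = pd.date_range("2003-01-01", periods=365, freq="D")

def make_url(time):
    return (
        "https://EXAMPLE-COLLECTION.dn.glob.us/spcesm2/"
        f"SPCESM2.cam.h1.{time:%Y-%m-%d}-00000.nc"
    )

# 15-minute output means 96 samples per daily file.
pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=96))
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 96})
```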
