Proposed Recipes for CESM2 Superparameterization Emulator #100
Comments
Mike, thanks for getting this started! In order to move forward here, we would need a concrete set of files to ingest, and the files would need to be accessible over the network somehow (http, ftp, scp, globus, etc.). If you could post a link to one of the raw output files via any of these protocols, that would allow us to start experimenting with ingesting them.
Hi Ryan,
Here is a link to a google drive folder with a handful of files:
https://drive.google.com/drive/folders/1P68TfWcAc-KsCdvu1R9uGIaA9iAd8bku?usp=sharing
The full data are housed on an internal UCI server that can only be accessed with UCI VPN. Normally I would just fill in an external collaborator permission form to give outside collaborators access to it. Will that work here? If not I will see if our admins can help with the globus model, as I do believe we have a globus end point that should connect to it.
Mike.
No, I don't think we will be able to pull files over the VPN (at least not without complicated workarounds). We need to set things up so that a machine (rather than a human) can get the files. Globus is probably the best bet here. Another option would be for us to create some sort of ingestion upload point, basically just a temporary bucket that can accept uploads, and then stage the recipe from there. Someone from your team would have to directly upload the files (push instead of pull). This might be a good solution for the many scenarios in which we just can't get access to the machine where the data live. @cisaacstern - what do you think about that idea?
Hi Ryan,
Thanks for this! I have a query out to our sysadmins to see if globus will work here, and if not totally happy to upload to your ingestion point.
Mike.
This is certainly a recurring scenario and this solution will definitely work from a technical perspective. In the interest of reproducibility and transparency, I suppose we'd just want to consider how to document provenance of pushed data.
Via email we have been investigating the possibility of using Globus. Repeating some of that conversation here.
Following up on this, over in pangeo-forge/pangeo-forge-recipes#222 (comment), I managed to make a proof of concept of the Globus collection via HTTPS. So from the Pangeo Forge POV, using Globus collections is definitely the easiest and quickest option here. Alternatively, we could move the data to another system (maybe Cheyenne) that has the Globus collections feature enabled.
I'm not sure if we can know a priori what IPs the requests will be coming from. That's a bit beyond our DevOps capability right now. I suppose "somewhere in Google Cloud" is too vague? And would there be a VPN involved?
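For concreteness, here is a rough sketch of what a recipe could look like once the daily history files are reachable over plain HTTPS (whether via a Globus collection or an upload bucket). The URL template, date range, and chunking below are placeholders, and it assumes the 0.x `FilePattern`/`XarrayZarrRecipe` API in `pangeo-forge-recipes`:

```python
import pandas as pd
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical: one NetCDF history file per day, served over HTTPS.
dates = pd.date_range("2003-01-01", "2010-12-31", freq="D")

def make_url(time):
    # Placeholder URL template; the real Globus HTTPS endpoint would go here.
    return f"https://example-globus-endpoint.org/SPCESM2/history.{time:%Y-%m-%d}.nc"

pattern = FilePattern(
    make_url,
    ConcatDim("time", keys=dates, nitems_per_file=96),  # 96 x 15-min steps per daily file
)

recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 96})
```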
Source Dataset
Several years of high-frequency (15 min), GCM-timestep-level output from the Superparameterized CESM2, isolating state variables "before" and "after" a key code region containing computationally intensive superparameterization calculations. Intended for building a superparameterization emulator of explicitly resolved clouds, their radiative influence, and turbulence that can be used in a real-geography CESM2 framework to sidestep the usual computational cost of SP. Similar in spirit to the proofs of concept in Rasp, Pritchard & Gentine (2018) and Mooers et al. (2021), but with new refinements by Mike Pritchard and Tom Beucler towards compatibility with operational, real-geography CESM2 (critically, isolating only the tendencies up to surface coupling and including outputs relevant to the CLM land model's expectations; see the concluding discussion of Mooers et al.).
N/A
Raw model output is CESM2-formatted NetCDF history files
One file per day across 8-10 simulated years, each forced with the same annual cycle of SSTs.
Not publicly posted yet. Staged on a few XSEDE or NERSC clusters. Mike Pritchard's group at UC Irvine can help with access to these.
Transformation / Alignment / Merging
**Apologies in advance if this is TMI. A starter core dump from Mike Pritchard on a busy afternoon:**
There are multiple pre-processing steps. The raw model output contains many more variables than one would want to analyze, so there is trimming. But users may want to experiment with different inputs and outputs, so this trimming may be user-specific. Guidance on specific variable names for the inputs/outputs of published emulators worth competing with can be provided on request.
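For concreteness, a minimal sketch of the trimming step with xarray; the variable list below is purely illustrative (not the actual input/output set of any published emulator) and would be user-configurable:

```python
import xarray as xr

# Illustrative variable subset only; the real input/output lists are user-specific.
KEEP_VARS = ["TAP", "QAP", "PS", "SOLIN", "SHFLX", "LHFLX"]

ds = xr.open_mfdataset("history.*.nc", combine="by_coords")
ds[KEEP_VARS].to_netcdf("trimmed.nc")
```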
Important: a key subset of variables (surface fluxes) that probably everyone would want in their input vector will need to be time-shifted backward by one time step to avoid information leaks, owing to which phase of the integration cycle these fluxes were saved at in the history files relative to the emulated region. Some users may want to make emulators that include memory of state variables from previous time steps in the input vector (e.g. as in Han et al., JAMES, 2020), in which case the same backward time-shifting preprocessing applies and should be made flexible to additional variables (physical caveat: likely no more than a few hours, i.e. <= 10 temporal samples at most, so there is never any reason to include contiguous temporal adjacency beyond that limit).
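A minimal sketch of the backward time shift with xarray; the flux variable names are placeholders, and the exact set of shifted variables and the sign convention would need to be confirmed against the history-file phasing described above:

```python
import xarray as xr

ds = xr.open_dataset("trimmed.nc")  # e.g. output of the trimming step above

# Variables whose values should come from the previous time step to avoid leakage;
# placeholder names, and the exact set depends on the history-file phasing.
SHIFT_BACK = ["SHFLX", "LHFLX"]

for name in SHIFT_BACK:
    # shift(time=1): the value seen at time t is the one saved at t-1;
    # the first time step becomes NaN.
    ds[name] = ds[name].shift(time=1)

ds = ds.isel(time=slice(1, None))  # drop the now-incomplete first time step
```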
Many users may want to subsample lon, lat, and time first to reduce data volume and promote independence of samples, given the spatial and temporal autocorrelations riddled throughout the data. Other users may prefer to include all of these samples as fuel for ambitious architectures that require very data-rich regimes to find good fits. This subsampling is user-specific.
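A minimal sketch of one possible subsampling scheme using simple strides in space and time; the stride values are arbitrary placeholders and the whole step is optional and user-configurable:

```python
import xarray as xr

ds = xr.open_dataset("trimmed.nc")  # e.g. output of the steps above

# Keep every 2nd longitude, every 2nd latitude, and every 4th time step;
# the strides are arbitrary examples, not a recommendation.
subsampled = ds.isel(
    lon=slice(None, None, 2),
    lat=slice(None, None, 2),
    time=slice(None, None, 4),
)
```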
Many users wanting to make "one-size-fits-all" emulators (i.e. the same NN for all grid cells) will want to flatten lon, lat, and time into a generic "sample" dimension (retaining level variability), shuffle it for ML, and split it into training/validation/test sets. Such users would also want to pre-normalize by means and ranges/stds defined independently for separate levels, but lumping together the flattened lon/lat/time statistics. Advanced users may want to train regional- or regime-specific emulators, which might then use regionally aware normalizations, so flexibility here would help.
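A minimal sketch of the flatten/shuffle/normalize step, assuming dimensions named (time, lev, lat, lon); per-level statistics are pooled over the flattened sample dimension as described above, and the 80/10/10 split is just an example:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("preprocessed.nc")  # placeholder: output of the steps above

# Flatten (time, lat, lon) into a single "sample" dimension, keeping "lev".
stacked = ds.stack(sample=("time", "lat", "lon"))

# Per-level normalization statistics, pooled over all flattened samples.
mean = stacked.mean("sample")
std = stacked.std("sample")
normalized = (stacked - mean) / std

# Shuffle samples, then split 80/10/10 into train/validation/test.
rng = np.random.default_rng(seed=0)
idx = rng.permutation(normalized.sizes["sample"])
n_train = int(0.8 * idx.size)
n_val = int(0.1 * idx.size)
train = normalized.isel(sample=idx[:n_train])
val = normalized.isel(sample=idx[n_train:n_train + n_val])
test = normalized.isel(sample=idx[n_train + n_val:])
```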
Some users may want to convert the specific humidity and temperature input state variables into an equivalent relative humidity, an alternate input that is less prone to out-of-sample extrapolation when the emulator is tested prognostically. The RH conversion should use a fixed set of assumptions consistent with an f90 module, so that identical testing can be done online; Python and f90 code can be provided when the time comes.
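Purely as a placeholder (the real pipeline should use the f90-consistent module mentioned above), a rough RH conversion using a standard Bolton-type saturation vapor pressure over liquid water:

```python
import numpy as np

def relative_humidity(q, T, p):
    """Rough placeholder RH conversion; NOT the CESM-consistent formulation.

    q: specific humidity [kg/kg], T: temperature [K], p: pressure [Pa].
    Uses the Bolton (1980) saturation vapor pressure over liquid water; the
    real pipeline should swap in the f90-consistent module mentioned above.
    """
    es = 611.2 * np.exp(17.67 * (T - 273.15) / (T - 29.65))  # sat. vapor pressure [Pa]
    qs = 0.622 * es / (p - 0.378 * es)                       # sat. specific humidity
    return q / qs
```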
The surface pressure is vital information for making the vertical discretization physically relevant per CESM2's hybrid vertical eta coordinate, so it should always be made available. The pressure midpoints and pressure thickness of each vertical level can be derived from this field but vary with lon, lat, and time. Mass-weighting vertically resolved output variables like diabatic heating by the derived pressure thickness could be helpful to users wishing to prioritize the column influence of different samples, as in Beucler et al. (2021, PRL).
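A minimal sketch of deriving midpoint pressures and layer thicknesses from surface pressure using the hybrid coefficients stored in CESM history files (hyam, hybm, hyai, hybi, P0); the filename and heating variable name are illustrative:

```python
import xarray as xr

ds = xr.open_dataset("history.2003-01-01.nc")  # placeholder filename

# Standard CESM hybrid sigma-pressure reconstruction.
# Midpoint pressure: p_mid = hyam * P0 + hybm * PS   (dims: lev, time, lat, lon)
p_mid = ds["hyam"] * ds["P0"] + ds["hybm"] * ds["PS"]

# Interface pressures, then layer thickness dp between adjacent interfaces.
p_int = ds["hyai"] * ds["P0"] + ds["hybi"] * ds["PS"]
dp = p_int.diff("ilev").rename({"ilev": "lev"}).assign_coords(lev=ds["lev"].values)

# Example: normalized pressure-thickness (mass) weighting of a vertically
# resolved output; "SPDT" is used here only as an illustrative heating variable.
heating_weighted = ds["SPDT"] * dp / dp.sum("lev")
```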
Output Dataset
I am not qualified to assess the trade-offs of the options listed here but interested in learning.