Draft for HYCOM50 recipe #29
base: master
Conversation
@cisaacstern Do you have updates on pushing the data to the cloud? :)
@roxyboy, no updates as of today. Aiming to get back to this later this week. Of all the PRs labeled …
@cisaacstern This is for a model inter-comparison study, so no specific model has priority, but... if you could prioritize the surface data from each PR labeled …
@roxyboy, you got it! Will likely begin pushing changes to these PRs today, tomorrow morning at the latest!
@cisaacstern Thanks a lot!
@roxyboy, the HYCOM50 … : https://gist.github.com/cisaacstern/59b39cb86308fa62c9d0da17b1f8630b When you get a moment, please let me know if these data look complete or if any errors were introduced in writing. I will continue writing the surface data to OSN this evening and tomorrow, and will ping you again as new datasets come online.
Update: …
@roxyboy, all of the HYCOM50 surface data should now be on OSN. The following will produce a dictionary of all three datasets:

```python
import s3fs
import xarray as xr

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})
url = "s3://Pangeo/pangeo-forge/swot_adac/HYCOM50/"
hycom50_datasets = {
    ds: xr.open_zarr(fs_osn.get_mapper(f"{url}{ds}.zarr"), consolidated=True)
    for ds in ["surf_01", "surf_02", "surf_03"]
}
hycom50_datasets
```
@cisaacstern I started looking into the data but it seems that they're filled with NaNs...
There appear to be no actual data variables present in the dataset, only metadata. E.g.:

```python
fs_osn.ls('Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_01.zarr/sst')
```

(We expect to see a bunch of chunks here, e.g. …) This would happen if the …
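To make the symptom concrete: a zarr (v2) store is a flat key-value mapping in which each variable holds metadata keys (`.zarray`, `.zattrs`) plus one key per chunk (e.g. `sst/0.0.0`). A small helper, hypothetical and not part of the recipe code, can distinguish a metadata-only store from a healthy one given a key listing:

```python
# Sketch: detect whether a variable in a zarr v2 store has any chunk keys.
# `has_chunks` and the sample key lists are hypothetical, for illustration only.
def has_chunks(keys, var):
    prefix = f"{var}/"
    return any(
        k.startswith(prefix) and not k[len(prefix):].startswith(".")
        for k in keys
    )

# A store that was written metadata-only (the failure mode described above):
empty = ["sst/.zarray", "sst/.zattrs"]
# A healthy store with two chunks along the time dimension:
healthy = ["sst/.zarray", "sst/.zattrs", "sst/0.0.0", "sst/1.0.0"]

print(has_chunks(empty, "sst"))    # False
print(has_chunks(healthy, "sst"))  # True
```

The same check works on the output of `fs_osn.ls(...)` once the store prefix is stripped.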
On closer inspection, it does appear the execution pipeline was erroring out on the call to … Traceback: …
I'm now adding the missing data (per the solution identified by Ryan) and will ping this thread again when it is complete. A summary of the underlying issue follows. As evidenced by the Traceback in my previous comment, the error was raised on …

```python
import s3fs

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(
    key='<KEY>',
    secret='<SECRET>',
    client_kwargs={'endpoint_url': endpoint_url},
    default_cache_type='none',
    default_fill_cache=False,
    use_listings_cache=False,
)
target_base = 's3://Pangeo/pangeo-forge'
```

For this to work, however, the s3fs version must be one that includes fsspec/s3fs#471. The current default s3fs install on Pangeo Cloud predates this PR, therefore running …
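Since the caching fix only exists in newer s3fs releases, one way to fail fast in an environment like Pangeo Cloud is a simple version guard. The minimum version string below is a placeholder, not the verified release containing fsspec/s3fs#471; check that PR's milestone before relying on it:

```python
# Sketch of a fail-fast version guard; "2021.4.0" is a PLACEHOLDER minimum,
# not the confirmed release containing fsspec/s3fs#471.
def version_tuple(v):
    # Treat "0.4.2" and "2021.4.0" style versions as comparable int tuples.
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def check_s3fs(installed, minimum="2021.4.0"):
    if version_tuple(installed) < version_tuple(minimum):
        raise RuntimeError(
            f"s3fs {installed} predates the caching fix; need >= {minimum}"
        )

check_s3fs("2021.4.0")  # passes silently
```

In practice the installed version would come from `s3fs.__version__`.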
@roxyboy, as indicated by the output of the code below, I believe the surface data are now good to go. The paths have not changed, so the xarray code block in my earlier comment should still work to load them. Thanks for your patience, and as always do let me know if anything is amiss.

```python
import s3fs
import zarr

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})
url_base = "s3://Pangeo/pangeo-forge/swot_adac/HYCOM50/"
root_paths = [f"{url_base}{ds}.zarr" for ds in ("surf_01", "surf_02", "surf_03")]
variables = ["ssh", "sss", "sst", "tauewd", "taunwd", "uu1", "vv1"]
for r in root_paths:
    print(r)
    group = zarr.open_consolidated(fs_osn.get_mapper(r))
    for v in variables:
        group_info = group[v].info_items()
        print(f"""{group_info[0][1]}
{group_info[-2]}
{group_info[-1]}
""")
```
Yep, seems to be working now! Thanks @cisaacstern :)
@cisaacstern The time metadata for HYCOM is not based on any real calendar such as Gregorian, which makes it a bit hard to identify which time step is when, but is it ok to assume that the ordering of the zarr dataset is in the same order as the recipe: …
I would say that if the months are not contiguous, it would be best to separate into two distinct "seasons", as we did for eNATL60.
@roxyboy, the brief answer (and apologies that it's taken me a moment to get back around to this question) is yes: the HYCOM50 zarr stores are organized chronologically, following the same order of months as provided in the recipe. I figured this would be the case, but wanted to make sure I could prove it from the data itself, and found out one or two interesting things in the process. Here are some thoughts. Once we've defined the `hycom50_datasets` dictionary as above, we can look at its time indexes. CFTimeIndex example:

```python
hycom50_datasets["surf_01"].indexes
```

Each of these timepoints has a number of more easily understandable attributes, however, which we can introspect. Here are the relevant attributes, as pulled from the first timepoint. CFTimeIndex attributes:

```python
t_point = hycom50_datasets["surf_01"].indexes["time"][0]
attributes = [attr for attr in dir(t_point) if attr[0] != "_"]
print(attributes)
```

So, the month for this first timepoint can be discovered as follows:

```python
hycom50_datasets["surf_01"].indexes["time"][0].month
```

Combining this with a relatively simple loop, we can then print the following mapping of time indices to months. (Just showing the first of the three surface datasets here, as it appears all three time indexes are the same.) The interesting thing shown here is that it appears that there is also data in May and Nov. Is this expected? Loop:

```python
months = (
    "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
d = {i: m for i, m in zip(range(1, 13), months)}
ds = "surf_01"
t_index = hycom50_datasets[ds].indexes["time"]
print(
    f"The time index of '{ds}' is a {type(t_index)}, \n"
    f"with calendar='{t_index.calendar}' and length={len(t_index)}."
)
month = t_index[0].month
breakpoint = 0
for i in range(len(t_index)):
    if t_index[i].month != month:
        print(
            f"Month {month} ({d[month]}) spans {i - breakpoint} indices,"
            f" from {breakpoint} to {i - 1}."
        )
        month = t_index[i].month
        breakpoint = i
    elif i == len(t_index) - 1:
        print(
            f"Month {month} ({d[month]}) spans {i - breakpoint + 1} indices,"
            f" from {breakpoint} to {i}."
        )
```
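For what it's worth, the span-finding logic in the loop above can be expressed more compactly with `itertools.groupby`. The helper name and the synthetic month sequence below are my own, for illustration; against the real data, the sequence would be `[t.month for t in t_index]`:

```python
from itertools import groupby

def month_spans(month_seq):
    """Return (month, first_index, last_index) for each contiguous run of months."""
    spans, start = [], 0
    for m, run in groupby(month_seq):
        n = sum(1 for _ in run)  # length of this contiguous run
        spans.append((m, start, start + n - 1))
        start += n
    return spans

# Synthetic sequence standing in for a real hourly time index:
# three "Feb" steps, two "Mar" steps, two "May" steps.
print(month_spans([2, 2, 2, 3, 3, 5, 5]))
# [(2, 0, 2), (3, 3, 4), (5, 5, 6)]
```

Because `groupby` only merges adjacent equal values, a month that appears in two non-contiguous stretches would show up as two separate spans, which is exactly the property being checked here.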
@rabernat, your suggestion to separate non-contiguous seasons into separate zarr stores, as demonstrated in #24 (comment), is well taken. I will work on a revision to the HYCOM50 recipe this week that accomplishes that, but wanted to get these thoughts down first, in the event they're able to aid data exploration before I'm able to push the new version.
Should we also split up the GIGATL data into winter (Feb, Mar, Apr) and summer (Aug, Sep, Oct)?
@cisaacstern Thanks for checking. I was able to exactly match the length of the time index by accounting for the leap year (hourly data, with 29 days in Feb.) within the months of FMA and ASO... I agree it's weird that there would be data in May and Nov...
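As a sanity check on that arithmetic, the expected number of hourly time steps per season can be computed directly. The year 2016 below is only an illustrative leap year; the actual HYCOM50 model year isn't stated in this thread:

```python
import calendar

def hourly_steps(year, months):
    # Sum days-in-month over the season, times 24 hourly outputs per day.
    return sum(calendar.monthrange(year, m)[1] for m in months) * 24

# In a leap year, Feb has 29 days.
print(hourly_steps(2016, [2, 3, 4]))   # FMA: (29 + 31 + 30) * 24 = 2160
print(hourly_steps(2016, [8, 9, 10]))  # ASO: (31 + 30 + 31) * 24 = 2208
```

Comparing these totals against `len(t_index)` for each season is a quick way to confirm whether any extra May or Nov steps are present.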
Yes; and for the detailed response, see #27 (comment).
(☝️ from #26 (comment)) Sure thing! HYCOM50 grids are now on OSN and accessible with:

```python
import s3fs
import xarray as xr

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})
url = "s3://Pangeo/pangeo-forge/swot_adac/HYCOM50/"
hycom50_grids = {
    grid: xr.open_zarr(fs_osn.get_mapper(f"{url}{grid}.zarr"), consolidated=True)
    for grid in ["grid_01", "grid_02", "grid_03"]
}
hycom50_grids
```
@martindurant, any recommendations for how to resolve a … ? The first input url for the recipe in question (code here) is:

```python
from hycom_recipe import recipes

rec = recipes["HYCOM50/Region01_GS/int/fma"]
for k, v in rec.file_pattern.items():
    print(k, v)
    break
```

Which is just shy of 6 GB according to `curl`:

```shell
URL="ftp://ftp.hycom.org/pub/xbxu/ATLc0.02/SWOT_ADAC/HYCOM50_E043_Feb_GS_daily.nc"
curl -sI $URL
```

```
Last-Modified: Thu, 15 Apr 2021 17:31:14 GMT
Content-Length: 5937121613
Accept-ranges: bytes
```

When I run:

```python
for input_name in rec.iter_inputs():
    rec.cache_input(input_name)
```

I get: logs before `timeout` Traceback … `timeout` Traceback …

The default is `timeout=30`; setting `rec.fsspec_open_kwargs={"timeout": 300}` does not resolve the error. Any guidance will be very much appreciated.
That is the right kwarg for the timeout as far as I can tell. Do you know if it was waiting the new longer period, or still 30s (or something else)? The file object should have … FTPFile, like most of these, is geared to random access of the target file, but streaming operations like this could well do with additional, simpler helper methods.
Could you clarify what you mean here? What would an "additional simpler helper method" look like?
If you look at …
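Sketching what such a "simpler helper" might look like: a purely sequential, fixed-block copy loop that never seeks, which is all a cache-to-local-disk operation needs. The function name and block size here are my own illustration, not an fsspec API; in real use, `source` would be the remote file object and `dest` a local cache file:

```python
import io

BLOCK_SIZE = 16 * 1024 * 1024  # hypothetical 16 MiB blocks

def stream_copy(source, dest, block_size=BLOCK_SIZE):
    """Sequentially copy `source` to `dest` in fixed-size blocks; no seeking."""
    total = 0
    while True:
        data = source.read(block_size)
        if not data:  # empty read signals EOF
            break
        dest.write(data)
        total += len(data)
    return total

# Demo with in-memory buffers standing in for an FTP file and a local cache file:
src, dst = io.BytesIO(b"x" * 100), io.BytesIO()
print(stream_copy(src, dst, block_size=32))  # 100
```

Because each `read` is independent and forward-only, a helper like this sidesteps the random-access machinery that general-purpose file objects carry.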
Yes, I can confirm that the correct … :

```python
rec.fsspec_open_kwargs = {"timeout": 300}

for input_name in rec.iter_inputs():
    rec.cache_input(input_name)
```

This question is harder to answer, because I am not sure what specific block(s) of code to time. When I wrap … :

```python
# running here with default of `timeout=30`
for input_name in rec.iter_inputs():
    rec.cache_input(input_name)
```

logging of `data = source.read(BLOCK_SIZE)` execution times, followed by `timeout` Traceback (from above-linked timing branch): …

@martindurant, which specific block(s) in … ?
I don't know in detail what ftplib does with its timeout. You might be right that you need to get it down to the lower-level socket.
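For reference, the standard library's `ftplib.FTP` does take a `timeout` argument at construction time, and `FTP.connect()` forwards it to `socket.create_connection()`, so a timeout set there does reach the socket layer. A quick introspection confirms the parameter exists (this only inspects the signature; it makes no network connection):

```python
import ftplib
import inspect

# ftplib.FTP accepts `timeout` in its constructor; connect() passes it down
# to socket.create_connection(), making it a socket-level timeout.
params = inspect.signature(ftplib.FTP.__init__).parameters
print("timeout" in params)  # True
```

Whether fsspec's FTP filesystem forwards its own `timeout` kwarg all the way to this constructor is the open question in the exchange above.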
HYCOM50 int data is now on OSN. (@martindurant, I never did figure out a real solution for … ) @roxyboy, you can find them at the following paths:

```python
import s3fs

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})
path_fmt = "Pangeo/pangeo-forge/swot_adac/HYCOM50/{region}/int"
regions = ["Region01_GS", "Region02_GE", "Region03_MD"]
for r in regions:
    int_paths = fs_osn.ls(path_fmt.format(region=r))
    print("\n".join(int_paths))
```
@cisaacstern The HYCOM50 group has newly diagnosed the vertical velocities, so could we flux them to OSN? The file names are in the format of …
@roxyboy, the vertical velocities should be good to go. (Apologies for the lag time on this one!) I've added them to the project catalog and updated the example notebook there with loading details. Check out Cell No. 5 here for the available parameterizations (and keep following along below that for how to load them): https://github.com/pangeo-data/swot_adac_ogcms/blob/main/intake_demo.ipynb (As with any catalog example, of course, this notebook needs to be run from within the project repo, to have access to the catalog and … ) Let me know of any issues, per usual!
Seems to be working perfectly @cisaacstern :) Thanks!
pre-commit.ci autofix
for more information, see https://pre-commit.ci
HYCOM50 data for SWOT Xover regions.