Add quarterly glider metrics #4

Open · MathewBiddle opened this issue Jul 11, 2022 · 8 comments
Labels: metric (issue to start a new metric)

@MathewBiddle
Contributor

Similar to the GTS calculations, it would be useful to have gliderDAC quarterly metrics too.

https://github.com/MathewBiddle/ioos_by_the_numbers/tree/main/gts

@MathewBiddle
Contributor Author

Need to check my code against @kbailey-noaa's.

Mine:

import pandas as pd
df_glider = pd.read_csv('https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime%2CmaxTime%2CdatasetID')
df_glider.dropna(axis=0, inplace=True)

# drop delayed-mode datasets
df_glider = df_glider[~df_glider["datasetID"].str.contains("delayed")]

# parse the deployment start/end times into datetimes
df_glider[['minTime (UTC)', 'maxTime (UTC)']] = df_glider[
    ['minTime (UTC)', 'maxTime (UTC)']
].apply(pd.to_datetime)

start_date = '2000-01-01'
end_date = '2022-12-31'

# deployments that fall entirely within [start_date, end_date]
glider_day_within = df_glider.loc[
    (df_glider['minTime (UTC)'] > pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] < pd.to_datetime(end_date, utc=True))
]

# deployments that start before start_date and end after end_date,
# i.e. they span the whole window; .copy() so the clipping below acts
# on a real frame instead of a view
glider_day_outside = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(end_date, utc=True))
].copy()

glider_day_outside.loc[:, 'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)
glider_day_outside.loc[:, 'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)

# drop the ones from above as they will be duplicates in the next round of filtering
df_glider.drop(axis=0, index=glider_day_outside.index, inplace=True)

# deployments that start before start_date but are still in the water
# after start_date; clip their start time to start_date
glider_day_lower = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(start_date, utc=True))
].copy()

glider_day_lower.loc[:, 'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)

# deployments that start before end_date but end after end_date; clip
# their end time to end_date
glider_day_upper = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(end_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(end_date, utc=True))
].copy()

glider_day_upper.loc[:, 'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)

# Combine it all together into one DF.
glider_subset = pd.concat([glider_day_lower, 
                           glider_day_within, 
                           glider_day_upper, 
                           glider_day_outside], 
                          verify_integrity=True)

# Calculate the days between min time and max time.
glider_subset['glider_days'] = (glider_subset['maxTime (UTC)'] - glider_subset['minTime (UTC)']).dt.days

# Calculate total glider days.
glider_days = glider_subset['glider_days'].sum()

print("Glider days between %s and %s: %s" % (start_date,end_date,glider_days))

Kathy's:

# import ipydatetime
# from datetime import datetime
# from ipywidgets import VBox, HBox, Label, BoundedFloatText


# dt0 = ipydatetime.NaiveDatetimePicker(value=datetime(2000, 1, 1, 0, 0, 0), description='start time:')
# dt1 = ipydatetime.NaiveDatetimePicker(value=datetime(2022, 12, 31, 23, 59, 59), description='end time:')

# south = BoundedFloatText(
#     value=-90,
#     min=-90,
#     max=90,
#     description='min lat:',
# )

# north = BoundedFloatText(
#     value=90,
#     min=-90,
#     max=90,
#     description='max lat:',
# )


# west = BoundedFloatText(
#     value=-180,
#     min=-180,
#     max=180,
#     description='min lon:',
# )

# east = BoundedFloatText(
#     value=180,
#     min=-180,
#     max=180,
#     description='max lon:',
# )

# from ipywidgets import Layout, Button, Box


# box_layout = Layout(
#     display="flex",
#     flex_flow="column",
#     align_items="center"
# )

# items = [
#     VBox([dt0, dt1]),
#     north,
#     HBox([west, east]),
#     south,
# ]

# box = Box(children=items, layout=box_layout)
# box

# dt0 = dt0.value
# dt1 = dt1.value
# south = south.value
# north = north.value
# west = west.value
# east = east.value
        
# params = {
#     "min_time": dt0,
#     "max_time": dt1,
#     "min_lat": south,
#     "max_lat": north,
#     "min_lon": west,
#     "max_lon": east,
# }

# from gdutils import GdacClient


# client = GdacClient()

# client.search_datasets(params=params, include_delayed_mode=False)
# client.datasets

# # Count the total number of deployments within the dt0:dt1 time window
# num_deployments = client.datasets.shape[0]

# # Count the number of glider days within the dt0:dt1 time window
# glider_days = client.glider_days_per_yyyymmdd.loc[dt0:dt1].sum()

# # count the number of profiles per dataset
# profile_count = client.profiles_per_yyyymmdd.loc[dt0:dt1].sum()

# datasets = client.datasets.copy()

# import warnings


# sea_names = []
# funding_sources = []
# for dataset_id, row in datasets.iterrows():

#     # Fetch the dataset description from ERDDAP
#     info = client.get_dataset_metadata(dataset_id)

#     if info.empty:
#         continue

#     # Find all global NetCDF attributes
#     nc_globals = info.loc[info["Variable Name"] == "NC_GLOBAL"]

#     # Find the sea_name global attribute
#     sea_name_attr = nc_globals.loc[nc_globals["Attribute Name"] == "sea_name"]
#     sea_name = "unknown"
#     if not sea_name_attr.empty:
#         sea_name = sea_name_attr.Value.iloc[0]
#         sea_name = sea_name or "unknown"
#     else:
#         warnings.warn(f"{dataset_id}: sea_name NC_GLOBAL not found")

#     # Find all global attributes that begin with "acknowledg" as this attribute typically contains the funding sources
#     funding_attr = nc_globals.loc[nc_globals["Attribute Name"].str.startswith("acknowledg")]
#     funding = "unknown"
#     if not funding_attr.empty:
#         funding = funding_attr.Value.iloc[0]
#         funding = funding or "unknown"
#     else:
#         warnings.warn(f"{dataset_id}: acknowledgment NC_GLOBAL not found")

#     sea_names.append(sea_name)
#     funding_sources.append(funding)

#     # Count only the days and profiles that are dt0:dt1 inclusive
#     days = client.datasets_days[dataset_id].loc[dt0:dt1].sum()
#     profiles = client.datasets_profiles[dataset_id].loc[dt0:dt1].sum()
#     datasets.loc[dataset_id, ("days", "num_profiles")] = [days, profiles]


# print(f"time ranged from {dt0} to {dt1}")
# print(f"""
# {num_deployments = }
# {glider_days = }
# {profile_count = }
# """
# )

@MathewBiddle
Contributor Author

Should be

Glider Days Reported to the IOOS Glider DAC

@MathewBiddle added the metric label on May 3, 2023
@MathewBiddle changed the title from "Add glider metrics" to "Add quarterly glider metrics" on May 3, 2023
@MathewBiddle
Contributor Author

@ocefpaf can I get some help with this at some point? I could never get gdutils installed and working (mainly due to lack of time; a simple conda install gdutils would help), so I have not actually been able to compare the two approaches.

cc: @kbailey-noaa

@MathewBiddle self-assigned this on Oct 4, 2023
@ocefpaf
Member

ocefpaf commented Oct 4, 2023

@MathewBiddle gdutils was never packaged. See kerfoot/gdutils#2, kerfoot/gdutils#3, and kerfoot/gdutils#4.

However, I do have a lockfile and a local clone that could make it work. That said, we should probably build the functionality you all need into gliderpy. Let's chat a bit on Slack and I can try to get that going ASAP.

@ocefpaf
Member

ocefpaf commented Jan 16, 2024

Can you check your functions against https://nbviewer.org/gist/ocefpaf/2f5a844d556a41b54f894e17541ec4db? That one is "kind of" validated against gdutils, and we know what is causing the differences. Some of those differences we may want to fix; others, maybe not.

@MathewBiddle
Contributor Author

Using my script, which harvests dates from the global metadata on the Glider DAC ERDDAP (https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?minTime%2CmaxTime%2CdatasetID):
Glider days between 2021-06-01 and 2021-08-10: 2187

In your gist:

print(f"We computed {metadata['days'].sum().days} in gdutils we get 2288.")
We computed 2280 in gdutils we get 2288.
print(f"We computed {metadata['num_profiles'].sum()} in gdutils we get 58978.")
We computed 58187 in gdutils we get 58978.

So mine is off by roughly 100 days. That's a lot.


@ocefpaf
Member

ocefpaf commented Jan 18, 2024

The off-by-100 is in your script, right? My guess is that it is because you are rounding days with the pandas default instead of using floor/ceil on each start/end date. In my code that is days = df.index[-1].ceil("D") - df.index[0].floor("D"), and we have to run it for each dataset_id to get an accurate measure. Still, I don't think being off by 100 is bad, considering that most of those are probably minutes/hours within a single day that we are not counting. If we wanted to be "precise" and count days/hours/minutes, my rounded days would be closer to yours.

I don't think your estimate is bad if you consider that you requested the minimum amount of metadata possible from allDatasets in ERDDAP. It is an elegant and fast way to get an approximate answer. However, if I understood what John used to do and what Kathy needs, we should count a glider that was in the water even just a minute past midnight as a new glider day, right?

TL;DR: IMO both are correct answers, depending on the message we want to convey.
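
To make the difference concrete, here is a minimal sketch (with made-up timestamps) contrasting the pandas default, which only counts whole 24-hour spans, with the floor/ceil approach, which counts every calendar day the glider touched:

import pandas as pd

# Hypothetical deployment: 20 minutes in the water, spanning midnight UTC.
start = pd.Timestamp("2021-06-30 23:50", tz="UTC")
end = pd.Timestamp("2021-07-01 00:10", tz="UTC")

# Default subtraction truncates to whole 24-hour periods: 0 days.
rounded_days = (end - start).days

# floor/ceil counts every calendar day the glider was in the water: 2 days.
calendar_days = (end.ceil("D") - start.floor("D")).days

print(rounded_days, calendar_days)  # 0 2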
