naming convention for attributes #41
Hi Carola, using the same conventions certainly makes sense. This is definitely something on the roadmap, but we didn't have a ticket to track it, so thank you for opening one :) That being said, deciding how to move forward is not so straightforward. The problem is that each provider offers a different set of metadata, therefore some API-related decisions need to be made. I am thinking that perhaps it would make sense to have two different APIs: a "low level" one returning the full, provider-specific metadata, and a "high level" one returning a normalized subset.
The idea is that if you need the full info, you should use the "low level" API and handle normalization yourself. If you are OK with basic normalization, which should cover the majority of users, then you should be able to use the higher level API. Of course there are many more columns that could be added to the "high level" response, but I don't think it is so straightforward to figure out definitions for them that are comparable across all the providers. For example.
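To make the two-tier idea concrete, here is a minimal sketch of what the split could look like. The function names and the stand-in data are hypothetical, not part of searvey's actual API:

```python
import pandas as pd


def get_ioc_stations_raw() -> pd.DataFrame:
    # Hypothetical "low level" API: returns the raw, provider-specific
    # metadata untouched. Stand-in data instead of a real network call.
    return pd.DataFrame({"ioc_code": ["syro"], "delay": ["5'"], "location": ["Syros"]})


def get_stations_normalized() -> pd.DataFrame:
    # Hypothetical "high level" API: a small, normalized subset of columns
    # whose definitions are comparable across providers.
    raw = get_ioc_stations_raw()
    out = pd.DataFrame({"provider_id": raw["ioc_code"], "location": raw["location"]})
    out.insert(0, "provider", "IOC")
    return out
```

Users needing provider-specific columns (e.g. IOC's `delay`) would call the raw function and normalize themselves; everyone else gets the common subset.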
Hi @pmav99, thank you, this is very interesting. It would be interesting to know some of the methods you use to filter the attributes. E.g., for the IOC stations, I use the date from the "Last data received" info to indirectly get the "status" information. I am wondering if I should get back to you with requests like "can you please add the status functionality for the IOC stations with a function similar to what you provide for the COOPS stations?" Or should I do this on the user end (which would probably mean a lot of duplicated effort)?
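Deriving a status flag from "Last data received" can be sketched as follows. This is an illustration with toy data and a made-up threshold, not searvey code; as discussed below, the threshold is application specific:

```python
import datetime

import pandas as pd

# Toy "last data received" timestamps (UTC) for two IOC stations.
last_received = pd.Series(pd.to_datetime([
    "2022-08-18 11:55:00+00:00",  # reported a few minutes ago
    "2022-01-01 00:00:00+00:00",  # silent for months
]))

# Hypothetical rule: a station counts as "active" if it reported within
# the last N days. A fixed "now" keeps this example deterministic.
threshold = datetime.timedelta(days=3)
now = pd.Timestamp("2022-08-18 12:00:00+00:00")
is_active = last_received > (now - threshold)
print(is_active.tolist())  # → [True, False]
```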
At the moment we don't filter any attributes.
The question is: what is the definition of "active"? How do you know if a station is "discontinued" vs. temporarily "offline" due to some technical problem? AFAIK, IOC does not provide such details. Furthermore, what is an appropriate threshold for last_data_received to declare that a station is "inactive"? Isn't that threshold application specific?
@carolakaiser Nevertheless, some IOC stations are sending data in the future... Notice here the "Time in GMT" and the negative "delay". Given that IOC claims that it does not edit the data, I guess that it is the stations themselves that are misconfigured. E.g. Syros, which is in Greece, sends data 1 hour in the future, which probably means that the station does not apply Daylight Saving Time. The others are mostly stations in the Netherlands, and I don't know what is wrong with them. The problem with future timestamps is that we can't really know whether these stations are indeed active right now. At least not without making an extra query for each of them, and at this stage I would like to avoid complicating the implementation. How would you suggest handling these stations?
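The workaround IOC themselves suggested (oceanmodeling/searvey#40) is to treat a negative delay as zero, i.e. consider the station active. In pandas that is a one-liner:

```python
import pandas as pd

# Toy delays in minutes; negative values mean the station reports
# timestamps in the future (e.g. misconfigured clock, no DST applied).
delays = pd.Series([12, -60, 3, -5])

# Clamp negative delays to 0: "data just received", station is active.
clamped = delays.clip(lower=0)
print(clamped.tolist())  # → [12, 0, 3, 0]
```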
Hi @carolakaiser, when you have some time, please try the following snippet and let us know what you think. Note: before you run the code, you might need to apply this diff, too:

```diff
diff --git a/searvey/coops.py b/searvey/coops.py
index e591392..08ed227 100644
--- a/searvey/coops.py
+++ b/searvey/coops.py
@@ -785,7 +785,10 @@ def coops_stations_within_region(
     """
     stations = coops_stations(station_status=station_status)
-    return stations[stations.within(region)]
+    if region:
+        return stations[stations.within(region)]
+    else:
+        return stations
```

```python
from __future__ import annotations

import datetime
from enum import Enum

import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import MultiPolygon
from shapely.geometry import Polygon

from searvey import coops
from searvey import ioc


class Providers(Enum):
    COOPS: str = "COOPS"
    IOC: str = "IOC"


def get_stations(
    activity_threshold: datetime.timedelta,
    region: Polygon | MultiPolygon | None = None,
) -> gpd.GeoDataFrame:
    # Retrieve the metadata
    ioc_gdf = ioc.get_ioc_stations(region=region)
    coops_gdf = coops.coops_stations_within_region(region=region)

    # Convert activity threshold to a timezone-aware datetime object
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    activity_threshold_ts = now_utc - activity_threshold

    # Normalize IOC
    # Convert delay to minutes
    ioc_gdf = ioc_gdf.assign(
        delay=pd.concat(
            (
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("'")].str[:-1].astype(int),
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("h")].str[:-1].astype(int) * 60,
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("d")].str[:-1].astype(int) * 24 * 60,
            )
        )
    )
    # Some stations appear to have negative delay due to server
    # time drift (https://www.bluematador.com/docs/troubleshooting/time-drift-ntp)
    # or for other provider-specific reasons. IOC suggests ignoring the
    # negative delay and considering the stations active.
    # https://github.com/oceanmodeling/searvey/issues/40#issuecomment-1219509512
    ioc_gdf.loc[(ioc_gdf.delay < 0), "delay"] = 0

    # Calculate the timestamp of the last observation
    ioc_gdf = ioc_gdf.assign(
        last_observation=now_utc - ioc_gdf.delay.apply(lambda x: datetime.timedelta(minutes=x))
    )
    ioc_gdf = ioc_gdf.assign(
        provider=Providers.IOC.value,
        provider_id=ioc_gdf.ioc_code,
        start_date=ioc_gdf.added_to_system.apply(pd.to_datetime).dt.tz_localize("UTC"),
        is_active=ioc_gdf.last_observation > activity_threshold_ts,
    )
    ioc_gdf = ioc_gdf[
        ["provider", "provider_id", "country", "location", "lon", "lat", "is_active", "start_date", "last_observation", "geometry"]
    ]

    # Normalize COOPS
    coops_gdf = coops_gdf.assign(
        provider=Providers.COOPS.value,
        provider_id=coops_gdf.index,
        country=np.where(coops_gdf.state.str.len() > 0, "USA", None),
        location=coops_gdf[["name", "state"]].agg(", ".join, axis=1),
        lon=coops_gdf.geometry.x,
        lat=coops_gdf.geometry.y,
        is_active=coops_gdf.status == "active",
        start_date=pd.NaT,
        last_observation=coops_gdf[coops_gdf.status == "discontinued"].removed.str[:18].apply(pd.to_datetime).dt.tz_localize("UTC"),
    )[["provider", "provider_id", "country", "location", "lon", "lat", "is_active", "start_date", "last_observation", "geometry"]]

    return pd.concat((ioc_gdf, coops_gdf))
```

You can call it like:

```python
stations = get_stations(activity_threshold=datetime.timedelta(days=1))
```
Do keep in mind that there are duplicate stations in the US, since their data are provided by both COOPS and IOC.
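One heuristic for spotting such duplicates is to match stations on rounded coordinates and keep only the first occurrence. This is a sketch with toy data, not part of searvey; coordinate rounding is an assumption and may need tuning:

```python
import pandas as pd

# Toy metadata: the same physical US station appears under both providers.
stations = pd.DataFrame({
    "provider": ["COOPS", "IOC", "IOC"],
    "lon": [-74.0142, -74.0141, 25.3756],
    "lat": [40.7006, 40.7007, 37.4372],
})

# Round coordinates to ~1 km and drop later duplicates. COOPS wins here
# only because it appears first; sort by provider preference beforehand
# if you want a deterministic choice.
key = stations[["lon", "lat"]].round(2)
deduped = stations[~key.duplicated()]
print(deduped["provider"].tolist())  # → ['COOPS', 'IOC']
```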
I will definitely give this a try, thank you for working on a solution. How do you fetch the metadata for each IOC station? Does it come from the overview page? For the Syros station: adding the negative delay to the GMT observation time seems to correspond to the "Last data received" info on the station details website, so maybe you can go with that? It also matches the time of the last data value when downloading the actual data. Parsing the actual data would of course be the most reliable way, but I agree that keeping the number of queries limited makes a lot of sense. How many stations with negative offsets do you have? Thanks again, and I will let you know soon about your script snippet.
> The question is what is the definition of "active"? How do you know if a station is "discontinued" vs temporarily "offline" due to some technical problem? AFAIK IOC does not provide such details. Furthermore, what is an appropriate threshold for last_data_received to declare that a station is "inactive"? Isn't that threshold application specific?

This is very much dependent on the user and their specific needs, and I agree it is hard to answer in general.
Hello All, right now I filter for data provider (COOPS/IOC) and data status (active/inactive). It is very convenient that the code now delivers the attributes in the same way and structure! FYI, there is a minor typo in the code here: It might be interesting for you to know that I convert the GeoPandas object into a GeoJSON file for the website's Javascript. The GeoJSON conversion does not support any timestamp formats, so I converted the objects to strings like this: Thanks again,
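The commenter's conversion snippet is not shown above; a minimal sketch of one way to stringify the timestamp columns before GeoJSON export (toy data; the same `.dt` accessor works on a `geopandas.GeoDataFrame` column ahead of `.to_json()`):

```python
import pandas as pd

# Toy frame standing in for the GeoDataFrame returned by get_stations().
df = pd.DataFrame({
    "provider": ["IOC"],
    "last_observation": pd.to_datetime(["2022-08-18 12:00:00+00:00"]),
})

# GeoJSON has no native timestamp type, so serialize datetimes to plain
# strings before exporting for the website's Javascript.
df["last_observation"] = df["last_observation"].dt.strftime("%Y-%m-%d %H:%M:%S")
print(df["last_observation"].iloc[0])  # → 2022-08-18 12:00:00
```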
Addresses oceanmodeling#41
Hi Guys, I have been able to successfully read both the COOPS and IOC GeoDataFrames and display the stations on a map (yellow is COOPS).
I noticed that some variable names are different in the data sets. E.g., the station name in the COOPS dataset is named "name" and in the IOC dataset "location". Will this be unified at some point (naming convention)?
I will start with some filtering of attributes soon and I assume this could be an interesting topic to discuss.
Best!