
naming convention for attributes #41

Open
carolakaiser opened this issue Jul 30, 2022 · 9 comments

Comments

@carolakaiser (Contributor)

Hi guys, I have been able to successfully read both the COOPS and IOC GeoDataFrames and display the stations on a map (yellow is COOPS).
I noticed that some variable names differ between the data sets. E.g., the station name is called "name" in the COOPS dataset and "location" in the IOC dataset. Will this be unified at some point (naming convention)?
I will start with some filtering of attributes soon and I assume this could be an interesting topic to discuss.
Best!

[image: CERA_Searvey_v1 — map of COOPS and IOC stations]

@pmav99 (Member)

pmav99 commented Jul 31, 2022

Hi Carola,

Using the same conventions definitely makes sense. This is on the roadmap, but we didn't have a ticket to track it, so thank you for opening one :)

That being said, deciding how to move forward is not so straightforward. The problem is that each provider offers a different set of metadata, so some API-related decisions need to be made. I am thinking that perhaps it would make sense to have two different APIs:

  1. For each provider we will have a function call that returns the "raw" unaltered metadata, i.e. the function calls that we already have. Let's call this the "low level" API.

  2. On top of that we will also have a single function call (e.g. searvey.get_stations()) that will return metadata from all the available providers. The caveat will be that it will only return a specific subset of the available metadata. I am thinking something along the lines of: lon, lat, country, location, provider, provider-id.

The idea is that if you need the full info you should use the "low level" API and handle normalization yourself. If you are OK with basic normalization, which should cover the majority of users, then you should be able to use the higher level API.


Of course there are many more columns that could be added to the "high level" response, but I don't think it's so straightforward to figure out definitions for them that are comparable across all the providers. For example, status can only be inferred indirectly for IOC, while start-date is not available for COOPS.
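A minimal sketch of the "high level" idea using plain pandas — the column names and sample values here are made up for illustration and don't necessarily match the actual searvey metadata:

```python
import pandas as pd

# Made-up raw metadata from two providers; real column sets differ per provider
ioc_raw = pd.DataFrame(
    {"ioc_code": ["syro"], "location": ["Syros"], "lon": [24.94], "lat": [37.44]}
)
coops_raw = pd.DataFrame(
    {"nos_id": ["8518750"], "name": ["The Battery"], "lon": [-74.01], "lat": [40.70]}
)

# The "high level" API maps provider-specific columns onto a common subset
COMMON = ["provider", "provider_id", "location", "lon", "lat"]
ioc_norm = ioc_raw.assign(provider="IOC").rename(columns={"ioc_code": "provider_id"})[COMMON]
coops_norm = coops_raw.assign(provider="COOPS").rename(
    columns={"nos_id": "provider_id", "name": "location"}
)[COMMON]

# A single frame with uniform columns, regardless of provider
stations = pd.concat((ioc_norm, coops_norm), ignore_index=True)
```

Users who need provider-specific fields would drop down to the "low level" per-provider functions instead.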

@carolakaiser (Contributor, Author)

Hi @pmav99, thank you, this is very interesting.

It would be interesting to know some of the methods you use to filter the attributes. E.g., for the IOC stations, I use the date from the "Last data received" info to indirectly derive the "status" information.

I am wondering if I should get back to you with requests like "can you please add the status functionality for the IOC stations, similar to what you provide for COOPS stations":

class StationStatus(Enum):
    ACTIVE = "active"
    DISCONTINUED = "discontinued"

Or should I do this on the user-end side (which would probably mean a lot of duplicated effort)?
I understand that creating such an API is a long-term effort, so sharing information would be very valuable. Thanks!

@pmav99 (Member)

pmav99 commented Aug 3, 2022

It would be interesting to know some of the methods you use to filter the attributes

At the moment we don't filter any attributes.

for the IOC stations, I use the date from the "Last data received" info to indirectly get the "status" information.

The question is: what is the definition of "active"? How do you know if a station is "discontinued" vs temporarily "offline" due to some technical problem? AFAIK IOC does not provide such details. Furthermore, what is an appropriate threshold for last_data_received to declare a station "inactive"? Isn't that threshold application specific?
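For what it's worth, a user-configurable threshold could be expressed along these lines (`flag_active` is a hypothetical helper, not part of searvey):

```python
import datetime

import pandas as pd


def flag_active(last_data_received: pd.Series, threshold: datetime.timedelta) -> pd.Series:
    """Flag a station as "active" if it reported within `threshold` of now (UTC)."""
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    return last_data_received >= now_utc - threshold


now = datetime.datetime.now(datetime.timezone.utc)
last_seen = pd.Series([now - datetime.timedelta(hours=6), now - datetime.timedelta(days=30)])

# the threshold is application specific: here "active" means data within 3 days
active = flag_active(last_seen, datetime.timedelta(days=3))
```

The point being that the library would only provide the mechanism; the threshold itself stays a user decision.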

@pmav99 (Member)

pmav99 commented Aug 18, 2022

@carolakaiser
We had a discussion with @brey and we both agree that there is value in having a user-configurable "activity threshold" in get_stations().

Nevertheless, some IOC stations are sending data in the future... Notice here the "Time in GMT" and the negative "delay":
[image: IOC station list showing "Time in GMT" and negative "delay" values]

Given that IOC claims that it does not edit the data, I guess it is the stations that are misconfigured. E.g. Syros, which is in Greece, sends data 1 hour in the future, which probably means that the station does not apply Daylight Saving Time. The others are mostly stations in the Netherlands and I don't know what is wrong with them.

The problem with future timestamps is that we can't really know whether those stations are indeed active now or not. At least not without making an extra query for each of them, and at this stage I would like to avoid complicating the implementation. How would you suggest handling these stations?

@pmav99 (Member)

pmav99 commented Aug 18, 2022

Hi @carolakaiser when you have some time, please try the following snippet and let us know what you think.

Note: before you run the code, you might need to apply this diff, too:

diff --git a/searvey/coops.py b/searvey/coops.py
index e591392..08ed227 100644
--- a/searvey/coops.py
+++ b/searvey/coops.py
@@ -785,7 +785,10 @@ def coops_stations_within_region(
     """
 
     stations = coops_stations(station_status=station_status)
-    return stations[stations.within(region)]
+    if region:
+        return stations[stations.within(region)]
+    else:
+        return stations
And the snippet itself:

from __future__ import annotations

from enum import Enum
import datetime

import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import MultiPolygon
from shapely.geometry import Polygon

from searvey import ioc
from searvey import coops


class Providers(Enum):
    COOPS: str = "COOPS"
    IOC: str = "IOC"


def get_stations(
    activity_threshold: datetime.timedelta,
    region: Polygon | MultiPolygon | None = None,
) -> gpd.GeoDataFrame:

    # Retrieve the metadata
    ioc_gdf = ioc.get_ioc_stations(region=region)
    coops_gdf = coops.coops_stations_within_region(region=region)

    # Convert activity threshold to a Timezone Aware Datetime object
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    activity_threshold_ts = now_utc - activity_threshold

    # Normalize IOC
    # Convert delay to minutes
    ioc_gdf = ioc_gdf.assign(
        delay=pd.concat(
            (
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("'")].str[:-1].astype(int),
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("h")].str[:-1].astype(int) * 60,
                ioc_gdf.delay[ioc_gdf.delay.str.endswith("d")].str[:-1].astype(int) * 24 * 60,
            )
        )
    )

    # Some stations appear to have negative delay due to server
    # [time drift](https://www.bluematador.com/docs/troubleshooting/time-drift-ntp)
    # or for other provider-specific reasons. IOC suggests ignoring the negative
    # delay and considering the stations as active.
    # https://github.com/oceanmodeling/searvey/issues/40#issuecomment-1219509512
    ioc_gdf.loc[(ioc_gdf.delay < 0), "delay"] = 0

    # Calculate the timestamp of the last observation
    ioc_gdf = ioc_gdf.assign(
        last_observation=now_utc - ioc_gdf.delay.apply(lambda x: datetime.timedelta(minutes=x))
    )

    ioc_gdf = ioc_gdf.assign(
        provider=Providers.IOC.value,
        provider_id=ioc_gdf.ioc_code,
        start_date=ioc_gdf.added_to_system.apply(pd.to_datetime).dt.tz_localize('UTC'),
        is_active=ioc_gdf.last_observation > activity_threshold_ts,
    )

    ioc_gdf = ioc_gdf[["provider", "provider_id", "country", "location", "lon", "lat", "is_active", "start_date", "last_observation", "geometry"]]

    coops_gdf = coops_gdf.assign(
        provider=Provider.COOPS.value,
        provider_id=coops_gdf.index,
        country=np.where(coops_gdf.state.str.len() > 0, "USA", None),
        location=coops_gdf[["name", "state"]].agg(", ".join, axis=1),
        lon=coops_gdf.geometry.x,
        lat=coops_gdf.geometry.y,
        is_active=coops_gdf.status == "active",
        start_date=pd.NaT,
        last_observation=coops_gdf[coops_gdf.status == "discontinued"].removed.str[:18].apply(pd.to_datetime).dt.tz_localize('UTC'),
    )[["provider", "provider_id", "country", "location", "lon", "lat", "is_active", "start_date", "last_observation", "geometry"]]

    return pd.concat((ioc_gdf, coops_gdf))

You can call it like:

stations = get_stations(activity_threshold=datetime.timedelta(days=1))

You should get back a dataframe like this one:
[image: example dataframe returned by get_stations()]

@pmav99 (Member)

pmav99 commented Aug 18, 2022

Do keep in mind that there are duplicate stations in the US, since their data are provided by both COOPS and IOC.

@carolakaiser (Contributor, Author)

Hey @pmav99 and @brey,

I will definitely give this a try, thank you for working on a solution.

How do you fetch the metadata for each IOC station? Does it come from the overview page?

For the Syros station: adding the negative delay to the GMT observation time seems to correspond to the "Last Data received" info on the station details website, so maybe you can go with that? It also matches the time of the last data value when downloading the actual data.

Parsing the actual data would of course be the most reliable way, but I agree that keeping the number of queries limited makes a lot of sense. How many stations with negative offsets do you have?

Thanks again and I will let you know soon about your script snippet.
Best regards

@carolakaiser (Contributor, Author)

_for the IOC stations , I use the date from the "Last data received" info to indirectly get the "status" information._

_The question is: what is the definition of "active"? How do you know if a station is "discontinued" vs temporarily "offline" due to some technical problem? AFAIK IOC does not provide such details. Furthermore, what is an appropriate threshold for last_data_received to declare a station "inactive"? Isn't that threshold application specific?_


This very much depends on the user and the specific needs, and I agree it is hard to answer in general.
For our application (real-time storm surge), we are trying to get all stations that have any data within the past 2 days. So we are not looking at a specific timestamp for being "active" but rather at a time frame.
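Given a frame with a tz-aware last_observation column, like the one returned by the get_stations() snippet above, that time-frame filtering could look like this (the stations frame here is mocked for illustration):

```python
import datetime

import pandas as pd

now_utc = datetime.datetime.now(datetime.timezone.utc)

# mocked stand-in for the frame returned by get_stations()
stations = pd.DataFrame(
    {
        "provider_id": ["syro", "8518750"],
        "last_observation": [
            now_utc - datetime.timedelta(hours=12),
            now_utc - datetime.timedelta(days=10),
        ],
    }
)

# keep stations that reported anything within the past 2 days
recent = stations[stations.last_observation >= now_utc - datetime.timedelta(days=2)]
```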

@carolakaiser (Contributor, Author)

Hello All,
Thank you very much for the code snippet. This works really well and is a huge help in achieving the required filtering.

Right now, I filter for data provider (coops/ioc) and data status (active/inactive). It is very convenient that the code now delivers the attributes in the same way and structure!

FYI, there is a minor typo in the code here:

coops_gdf = coops_gdf.assign(
    provider=Provider.COOPS.value, ...

The class is named "Providers".

It might be interesting for you to know that I convert the GeoPandas object into a GeoJSON file for the website's JavaScript. The GeoJSON conversion does not support timestamp objects, so I converted them to strings like this:
start_date=ioc_gdf.added_to_system.apply(pd.to_datetime).dt.tz_localize('UTC').apply(str)
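A standalone illustration of that conversion, with made-up dates; once stringified, the values serialize to JSON without trouble:

```python
import json

import pandas as pd

# made-up "added_to_system" values, mirroring the conversion above
added = pd.Series(["2010-05-01", "2015-11-20"])
start_date = added.apply(pd.to_datetime).dt.tz_localize("UTC").apply(str)

# pd.Timestamp objects are not JSON serializable, but plain strings are
payload = json.dumps(start_date.tolist())
```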

Thanks again,

pmav99 added a commit to pmav99/searvey that referenced this issue Oct 8, 2022
pmav99 added a commit to pmav99/searvey that referenced this issue Oct 12, 2022
pmav99 added a commit to pmav99/searvey that referenced this issue Oct 14, 2022
pmav99 added a commit that referenced this issue Oct 26, 2022