# Citizen weather stations quality checks

In this notebook, we follow the methods of Napoly et al. (2018) [1] to perform quality checks on the Netatmo stations, controlling for common errors in citizen weather station (CWS) data.

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

from uhi_drivers_lausanne import cws_qc

figwidth, figheight = plt.rcParams["figure.figsize"]

In [None]:
cws_ts_df_filepath = "../data/raw/cws-ts-df.csv"
official_ts_df_filepath = "../data/processed/official-ts-df.csv"

# whether to adjust the temperatures for station elevation (the elevation data is
# read from the stations file below)
elev_adjust = True
cws_stations_gdf_filepath = "../data/raw/cws-stations.gpkg"

dst_ts_df_filepath = "../data/processed/cws-qc-ts-df.csv"
dst_stations_gdf_filepath = "../data/processed/cws-qc-stations.gpkg"

# minimum proportion of valid (non-nan) measurements for a station to be kept
unreliable_threshold = 0.8
high_alpha = 0.95

We start by reading the time series data for both the official stations and CWS:

In [None]:
# 1. read official stations data
official_ts_df = pd.read_csv(
    official_ts_df_filepath, index_col="time", parse_dates=True
)

# 2. read CWS data
# 2.1 stations locations
cws_stations_gdf = gpd.read_file(cws_stations_gdf_filepath)

# 2.2 time series data
# pivot it into the wide data frame format
# remove last row because it is all nan (TODO: fix query for one more netatmo timestamp)
cws_ts_df = (
    pd.read_csv(cws_ts_df_filepath, index_col="time", parse_dates=True)
    .pivot_table(index="time", columns="station_id", values="value")
    .iloc[:-1]
)

# correct for elevation (atmospheric lapse rate)
if elev_adjust:
    # backup the original so that we apply the elevation adjustment for QC but in the
    # end we save the actual measurements
    _cws_ts_df = cws_ts_df.copy()
    cws_ts_df = cws_qc.elevation_adjustment(
        cws_ts_df,
        # need to set the station ids as index to map station id to altitude
        cws_stations_gdf.set_index("id")["altitude"],
    )

# print the number of stations
print(
    f"N stations: {len(official_ts_df.columns)} official, {len(cws_ts_df.columns)} CWS."
)
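
The implementation of `cws_qc.elevation_adjustment` is not shown in this notebook. As a rough sketch of what such a correction could look like, the block below shifts every station to a common reference elevation using a constant lapse rate. The function name, the 6.5 °C/km lapse rate and the choice of the mean station elevation as reference are assumptions made for illustration only.

```python
import pandas as pd

# Assumed constant lapse rate in °C per metre (standard atmosphere value);
# the value used by `cws_qc.elevation_adjustment` is not shown in this notebook.
LAPSE_RATE = 0.0065


def adjust_to_reference_elevation(ts_df, station_elevation, lapse_rate=LAPSE_RATE):
    """Shift each station's temperatures to the mean station elevation."""
    # align the elevations with the columns (station ids) of the wide data frame
    elevation = station_elevation.reindex(ts_df.columns)
    # stations above the reference are colder, so they get a positive offset
    offset = lapse_rate * (elevation - elevation.mean())
    # adding a series to a data frame aligns the series index with the columns
    return ts_df + offset


# toy example: station "b" is 200 m higher and 1.3 °C colder than station "a"
ts_df = pd.DataFrame(
    {"a": [10.0, 11.0], "b": [8.7, 9.7]},
    index=pd.date_range("2022-08-01", periods=2, freq="60min"),
)
elevation = pd.Series({"a": 400.0, "b": 600.0})
adjusted_df = adjust_to_reference_elevation(ts_df, elevation)
```

In this toy example the whole difference between the two stations is explained by elevation, so both adjusted series coincide.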


In [None]:
cws_qc.comparison_lineplot(
    cws_ts_df,
    official_ts_df,
)


The Netatmo stations tend to be warmer than the official stations. As noted by Meier et al. (2017) [2], this is likely because some stations are located in non-shaded areas, which results in radiative errors.
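
To put a number on this visual impression, one option is to compare a spatial summary of each network at every time step. The sketch below is only an illustration: the `mean_bias` helper is hypothetical, and the choice of the median across CWS and the mean across official stations is an assumption, not what `cws_qc.comparison_lineplot` necessarily plots.

```python
import pandas as pd


def mean_bias(cws_ts_df, official_ts_df):
    """Average difference between the CWS median and the official mean."""
    # spatial summary of each network at every time step
    cws_median = cws_ts_df.median(axis="columns")
    official_mean = official_ts_df.mean(axis="columns")
    # positive values mean that the CWS are warmer on average
    return float((cws_median - official_mean).mean())


# toy data with three time steps
time = pd.date_range("2022-08-01", periods=3, freq="60min")
toy_cws_df = pd.DataFrame(
    {"a": [20.0, 21.0, 22.0], "b": [22.0, 23.0, 24.0], "c": [21.0, 22.0, 23.0]},
    index=time,
)
toy_official_df = pd.DataFrame(
    {"x": [19.0, 20.0, 21.0], "y": [21.0, 22.0, 23.0]}, index=time
)
bias = mean_bias(toy_cws_df, toy_official_df)
```

Here the toy CWS are 1 °C warmer than the toy official stations at every time step.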

In [None]:
outlier_stations = cws_qc.get_outlier_stations(cws_ts_df, high_alpha=high_alpha)
outlier_stations.sum()

A total of 31 Netatmo stations show a temperature pattern that qualifies as an outlier. We can filter them out and compare the CWS with the official stations again.
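
The code of `cws_qc.get_outlier_stations` is not part of this notebook. The sketch below illustrates one way a statistically-based outlier check in the spirit of Napoly et al. (2018) [1] could work: at each time step a robust z-score is computed across stations, and a station is flagged when too many of its measurements fall outside the chosen quantiles of the standard normal distribution. The function name, the `low_alpha` and `station_threshold` defaults, and the use of the median absolute deviation as scale estimate (a simpler substitute for other robust estimators) are assumptions.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def flag_outlier_stations(
    ts_df, low_alpha=0.01, high_alpha=0.95, station_threshold=0.2
):
    """Flag stations whose measurements are too often robust outliers."""
    # robust z-score at each time step: center on the cross-station median and
    # scale by the median absolute deviation (MAD)
    median = ts_df.median(axis="columns")
    deviation_df = ts_df.sub(median, axis="index")
    mad = 1.4826 * deviation_df.abs().median(axis="columns")
    z_df = deviation_df.div(mad, axis="index")
    # a measurement is flagged when its z-score lies outside the normal quantiles
    low_z = NormalDist().inv_cdf(low_alpha)
    high_z = NormalDist().inv_cdf(high_alpha)
    flagged_df = (z_df < low_z) | (z_df > high_z)
    # a station is an outlier when its proportion of flagged measurements is too
    # high (missing values count as not flagged in this sketch)
    return flagged_df.mean() > station_threshold


# toy data: nine stations around a common diurnal cycle plus one 5 °C warmer
time = pd.date_range("2022-08-01", periods=48, freq="60min")
base = 20 + 5 * np.sin(2 * np.pi * np.arange(48) / 24)
offsets = {f"s{i}": (i - 4) / 10 for i in range(9)}
offsets["hot"] = 5.0
toy_ts_df = pd.DataFrame(
    {station_id: base + offset for station_id, offset in offsets.items()},
    index=time,
)
toy_outlier_stations = flag_outlier_stations(toy_ts_df)
```

Like the `outlier_stations` series above, the result is a boolean series indexed by station id, so it can be used directly as a column mask with `.loc`.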

In [None]:
fig, axes = plt.subplots(
    1, 2, figsize=(figwidth * 2, figheight), sharex=True, sharey=True
)
for ts_df, label, ax in zip(
    [cws_ts_df.loc[:, ~outlier_stations], cws_ts_df.loc[:, outlier_stations]],
    ["QC CWS", "Outlier CWS"],
    axes,
):
    cws_qc.comparison_lineplot(
        ts_df,
        official_ts_df,
        cws_label=label,
        ax=ax,
    )
    ax.set_title(label)
    ax.set_ylabel("T (°C)")

Once the outlier stations are filtered out, the CWS agree better with the official stations, yet they still seem to be warmer. As suggested by Napoly et al. (2018) [1], another potential explanation is that some Netatmo stations are actually installed indoors:

In [None]:
indoor_stations = cws_qc.get_indoor_stations(cws_ts_df)
indoor_stations.sum()

A total of 44 Netatmo stations show a pattern that corresponds to an indoor station. We can again plot them separately:
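
How `cws_qc.get_indoor_stations` detects indoor stations is not shown here. A common heuristic, sketched below under that caveat, is that an indoor sensor barely follows the outdoor diurnal cycle, so its time series correlates poorly with the median of all stations. The function name and the 0.9 correlation threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def flag_indoor_stations(ts_df, min_correlation=0.9):
    """Flag stations that correlate poorly with the cross-station median."""
    median = ts_df.median(axis="columns")
    # Pearson correlation of each station (column) with the median series
    correlation = ts_df.corrwith(median)
    # stations with an undefined correlation (e.g., constant series) are not
    # flagged in this sketch, since comparisons with nan evaluate to False
    return correlation < min_correlation


# toy data: six outdoor stations and one damped, phase-shifted indoor station
time = pd.date_range("2022-08-01", periods=48, freq="60min")
phase = 2 * np.pi * np.arange(48) / 24
base = 20 + 5 * np.sin(phase)
toy_ts_df = pd.DataFrame(
    {f"s{i}": base + (i - 3) / 10 for i in range(6)}, index=time
)
toy_ts_df["indoor"] = 22 + 0.5 * np.cos(phase)
toy_indoor_stations = flag_indoor_stations(toy_ts_df)
```

Only the damped station is flagged in this toy example, since the outdoor stations follow the median almost perfectly.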

In [None]:
fig, axes = plt.subplots(
    1, 2, figsize=(figwidth * 2, figheight), sharex=True, sharey=True
)
for ts_df, label, ax in zip(
    [cws_ts_df.loc[:, ~indoor_stations], cws_ts_df.loc[:, indoor_stations]],
    ["QC CWS", "Indoor CWS"],
    axes,
):
    cws_qc.comparison_lineplot(
        ts_df,
        official_ts_df,
        cws_label=label,
        ax=ax,
    )
    ax.set_title(label)
    ax.set_ylabel("T (°C)")

Unlike with the outlier stations, filtering out the indoor stations does not necessarily improve the agreement between the CWS and the official stations.

Finally, to avoid problems when, e.g., averaging over the study period, we can also discard unreliable stations, i.e., those with less than 80% valid (non-nan) measurements.
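
This criterion can be expressed in a couple of lines of pandas. The sketch below is a minimal illustration of the idea and not necessarily how `cws_qc.get_unreliable_stations` is implemented; the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def flag_unreliable_stations(ts_df, unreliable_threshold=0.8):
    """Flag stations with too small a proportion of valid measurements."""
    # proportion of non-nan measurements of each station (column)
    valid_proportion = ts_df.notna().mean()
    return valid_proportion < unreliable_threshold


# toy data with ten time steps: "a" is complete, "b" misses three measurements
# (70% valid) and "c" misses one (90% valid)
time = pd.date_range("2022-08-01", periods=10, freq="60min")
toy_ts_df = pd.DataFrame(
    {
        "a": np.full(10, 20.0),
        "b": [np.nan, np.nan, np.nan] + [20.0] * 7,
        "c": [np.nan] + [20.0] * 9,
    },
    index=time,
)
toy_unreliable_stations = flag_unreliable_stations(toy_ts_df)
```

With the 0.8 threshold, only station "b" is flagged.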

In [None]:
unreliable_stations = cws_qc.get_unreliable_stations(
    cws_ts_df, unreliable_threshold=unreliable_threshold
)
unreliable_stations.sum()

This amounts to 4 unreliable stations. We can now combine all the discards:

In [None]:
discard_stations = outlier_stations | indoor_stations | unreliable_stations
discard_stations.sum(), (outlier_stations & indoor_stations).sum()

A total of 62 stations are discarded, 16 of which are both outliers and indoor stations. We can plot the remaining stations:
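
Because a station can fail more than one check, the number of discarded stations is smaller than the sum of the three individual counts. The toy example below, with made-up flags for five hypothetical stations, shows how the boolean masks combine and how the overlaps can be inspected.

```python
import pandas as pd

# made-up flags for five hypothetical stations
flags_df = pd.DataFrame(
    {
        "outlier": [True, True, False, False, False],
        "indoor": [False, True, True, False, False],
        "unreliable": [False, False, False, True, False],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# a station is discarded as soon as it fails any of the checks
discard = flags_df.any(axis="columns")
n_discarded = int(discard.sum())
# stations failing both the outlier and the indoor checks
n_both = int((flags_df["outlier"] & flags_df["indoor"]).sum())
# number of stations for each combination of flags
breakdown = flags_df.groupby(list(flags_df.columns)).size()
```

Here four stations are discarded even though the individual counts add up to five, because station "s2" is both an outlier and an indoor station.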

In [None]:
fig, axes = plt.subplots(
    1, 2, figsize=(figwidth * 2, figheight), sharex=True, sharey=True
)
for ts_df, label, ax in zip(
    [cws_ts_df.loc[:, ~discard_stations], cws_ts_df.loc[:, discard_stations]],
    ["QC CWS", "Discarded CWS"],
    axes,
):
    cws_qc.comparison_lineplot(
        ts_df,
        official_ts_df,
        cws_label=label,
        ax=ax,
    )
    ax.set_title(label)
    ax.set_ylabel("T (°C)")

Even when discarding the outlier and indoor stations, the CWS still seem to be warmer than the official stations. We can hypothesize that this is because the CWS tend to be located in more urbanized environments, but confirming this requires a separate analysis, to be conducted in another notebook.

Note that stations with misconfigured locations have already been discarded in the netatmo-processing notebook (since the procedure to discard them does not depend on the time series of observations).

In [None]:
# dump both to a file:
# 1. filtered CWS time series data frame
if elev_adjust:
    # restore the actual measurements (without the elevation adjustment)
    cws_ts_df = _cws_ts_df
cws_ts_df.loc[:, ~discard_stations].to_csv(dst_ts_df_filepath)

# 2. filtered CWS stations
cws_stations_gdf.set_index("id").loc[
    cws_ts_df.loc[:, ~discard_stations].columns
].to_file(dst_stations_gdf_filepath)

## References

1. Adrien Napoly, Tom Grassmann, Fred Meier, and Daniel Fenner. Development and application of a statistically-based quality control for crowdsourced air temperature data. *Frontiers in Earth Science*, 6:118, 2018.
2. Fred Meier, Daniel Fenner, Tom Grassmann, Marco Otto, and Dieter Scherer. Crowdsourcing air temperature from citizen weather stations for urban climate research. *Urban Climate*, 19:170–191, 2017.