In [1]:
%load_ext autoreload
%autoreload 2

# QC protocol for Private Weather Stations

This notebook shows how to use the Python package `pypwsqc`, a quality assurance protocol developed for automated private weather stations (PWS). The protocol consists of three filters: the Faulty Zero (FZ) filter, the High Influx (HI) filter and the Station Outlier (SO) filter.

The package is based on the original R code available at https://github.com/LottedeVos/PWSQC/.

Publication: de Vos, L. W., Leijnse, H., Overeem, A., & Uijlenhoet, R. (2019). Quality control for crowdsourced personal weather stations to enable operational rainfall monitoring. Geophysical Research Letters, 46(15), 8820-8829

`pypwsqc` depends on the `poligrain`, `xarray`, `pandas` and `numpy` packages. Make sure to install and import the required packages first.

In [2]:
import poligrain as plg
import xarray as xr
import numpy as np
import pandas as pd

import pypwsqc

## Download example data

In this example, we use an open PWS dataset from Amsterdam, called the "AMS PWS" dataset. Running the cell below downloads an example NetCDF file to your current working directory (if your machine is connected to the internet).

In [4]:
!curl -OL https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/notebooks/data/OpenSense_PWS_example_format_data.nc

curl: (35) schannel: next InitializeSecurityContext failed: CRYPT_E_NO_REVOCATION_CHECK (0x80092012) - The revocation function was unable to check revocation for the certificate.
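If `curl` is not available or fails on your machine, the file can also be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical and not part of `pypwsqc`.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = (
    "https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/"
    "notebooks/data/OpenSense_PWS_example_format_data.nc"
)


def download_if_missing(url, target_dir="."):
    """Download `url` into `target_dir` unless the file is already there.

    Hypothetical helper for illustration; returns the local file path.
    """
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Usage (requires internet access):
# path = download_if_missing(DATA_URL)
```

Skipping the download when the file already exists makes the cell safe to re-run offline once the data is in place.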


## Data preparations

This package handles rainfall data as `xarray` datasets. The data set must have `time` and `id` dimensions, `latitude` and `longitude` as coordinates, and `rainfall` as a data variable.

An example of how to convert .csv data to an `xarray` dataset is found [here](https://github.com/OpenSenseAction/OS_data_format_conventions/blob/main/notebooks/PWS_example_dataset.ipynb).
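To make the required structure concrete, the sketch below builds a tiny data set with made-up values and checks it for the required names. The station ids, coordinates and the helper `check_pws_dataset` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy data: two stations, four 5-minute time steps (values are made up)
time = pd.date_range("2020-01-01", periods=4, freq="5min")
rain = np.array([[0.0, 0.1, 0.0, np.nan], [0.0, 0.2, 0.1, 0.0]])

ds_example = xr.Dataset(
    data_vars={"rainfall": (("id", "time"), rain)},
    coords={
        "time": time,
        "id": ["pws_1", "pws_2"],
        "latitude": ("id", [52.37, 52.35]),
        "longitude": ("id", [4.90, 4.95]),
    },
)


def check_pws_dataset(ds):
    """Return the required names missing from `ds` (empty list if none)."""
    missing = [d for d in ("time", "id") if d not in ds.dims]
    missing += [c for c in ("latitude", "longitude") if c not in ds.coords]
    if "rainfall" not in ds.data_vars:
        missing.append("rainfall")
    return missing
```

Calling `check_pws_dataset` on your own data set before running the filters gives an early hint if a dimension, coordinate or variable is named differently.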

We now load the data set under the name `ds_pws`.

In [5]:
ds_pws = xr.open_dataset("OpenSense_PWS_example_format_data.nc")
ds_pws

### Reproject coordinates 

First, we reproject the coordinates to a local metric coordinate reference system to allow for distance calculations. In the Amsterdam example we use EPSG:25832. **Remember to use a local metric reference system for your use case!** We use the function `spatial.project_point_coordinates` in the `poligrain` package.

In [6]:
ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(
    x=ds_pws.longitude, y=ds_pws.latitude, target_projection="EPSG:25832"
)

### Create distance matrix

Then, we calculate the distances between all stations in our data set. If your data set has a large number of stations this can take some time.
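The cells further down expect a variable `distance_matrix` that can be indexed with `.sel(id=...)`. As a minimal sketch of how such a matrix could be built from the projected coordinates, the hypothetical helper below computes pairwise Euclidean distances with `numpy`. The dimension name `id_neighbor` is an assumption, and `poligrain` may provide its own helper for this step (for example `plg.spatial.calc_point_to_point_distances`; check the package documentation before relying on that name).

```python
import numpy as np
import xarray as xr


def calc_distance_matrix(x, y, ids):
    """Pairwise Euclidean distances between stations, in the unit of x and y.

    Hypothetical helper for illustration, not part of pypwsqc.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Broadcasting gives all coordinate differences in one step
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    return xr.DataArray(
        dist,
        dims=("id", "id_neighbor"),
        coords={"id": list(ids), "id_neighbor": list(ids)},
    )


# Toy example with three stations (coordinates in metres)
demo = calc_distance_matrix(
    [0.0, 3000.0, 0.0], [0.0, 4000.0, 12000.0], ["a", "b", "c"]
)
```

With the notebook data, this could be called as `distance_matrix = calc_distance_matrix(ds_pws.x.data, ds_pws.y.data, ds_pws.id.data)`. The full matrix grows with the square of the number of stations, which is why this step can be slow and memory-hungry for large networks.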

### Calculate data variables 

Next, we calculate the data variables `nbrs_not_nan` and `reference`, which are needed to perform the quality control. If you have already processed your data with the FZ and HI filters, your `xarray` data set already has these variables and you can proceed to the next section.

`nbrs_not_nan`:
Number of neighbours within a specified range `max_distance` around the station that report a valid (non-missing) rainfall value at each time step. The appropriate range depends on the use case and area of interest. In this example we use 10,000 meters.

`reference`:
Median rainfall of all stations within range `max_distance` of each station.

`max_distance` is called 'd' in the original publication.
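The two definitions can be illustrated on a toy example with three stations and three time steps. This is a sketch in plain `numpy` with made-up values, meant to show what the variables contain rather than how `pypwsqc` computes them.

```python
import numpy as np

# rainfall[station, time]; NaN marks a missing observation (toy values)
rainfall = np.array(
    [
        [0.0, 1.0, np.nan],
        [0.2, 2.0, 0.4],
        [0.0, 3.0, 0.6],
    ]
)
# Pairwise distances between the stations in metres (toy values)
distances = np.array(
    [
        [0.0, 5000.0, 12000.0],
        [5000.0, 0.0, 8000.0],
        [12000.0, 8000.0, 0.0],
    ]
)
max_distance = 10e3

nbrs_not_nan = np.zeros(rainfall.shape, dtype=int)
reference = np.full(rainfall.shape, np.nan)

for i in range(rainfall.shape[0]):
    # Neighbours are closer than max_distance, excluding the station itself
    is_neighbor = (distances[i] < max_distance) & (distances[i] > 0)
    neighbor_rain = rainfall[is_neighbor]
    # Count valid neighbour observations per time step
    nbrs_not_nan[i] = np.sum(~np.isnan(neighbor_rain), axis=0)
    # Median over neighbours, ignoring missing values
    reference[i] = np.nanmedian(neighbor_rain, axis=0)
```

Here the middle station has two neighbours, but only one of them reports at the last time step, so its `nbrs_not_nan` drops from 2 to 1 there. The two outer stations are 12 km apart and therefore only see the middle station.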

### Select considered range around each station

In [8]:
max_distance = 10e3

In [9]:
%%time
ds_pws = ds_pws.load()

nbrs_not_nan = []
reference = []

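for pws_id in ds_pws.id.data:
    # Neighbours are all stations closer than max_distance,
    # excluding the station itself (distance 0)
    distances = distance_matrix.sel(id=pws_id)
    neighbor_ids = distance_matrix.id.data[
        (distances < max_distance) & (distances > 0)
    ]

    # Number of neighbours with a valid observation at each time step
    N = ds_pws.rainfall.sel(id=neighbor_ids).notnull().sum(dim="id")
    nbrs_not_nan.append(N)

    # Median rainfall of the neighbours at each time step
    median = ds_pws.sel(id=neighbor_ids).rainfall.median(dim="id")
    reference.append(median)

# Stack the per-station results along the station dimension
ds_pws["nbrs_not_nan"] = xr.concat(nbrs_not_nan, dim="id")
ds_pws["reference"] = xr.concat(reference, dim="id")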

CPU times: total: 5min 15s
Wall time: 5min 24s


### Initialize data variables

We initialize the data variables for the resulting SO flags and for the median Pearson correlation with neighboring stations with the value -999. After running the SO-filter, each evaluated `so_flag` value is 0 (passed the test), 1 (did not pass the test) or -1 (not enough information). If a value is still -999, the data have not been processed and something went wrong.

We also save the threshold `gamma` as a variable. This makes it easy to visualize when the median correlation with neighbors drops below this threshold, which is the condition for raising an SO flag (see below).

In [11]:
gamma = 0.15

In [12]:
shape = (len(ds_pws.id), len(ds_pws.time))

# Sentinel value -999 marks entries that have not been processed yet
ds_pws["so_flag"] = xr.DataArray(np.full(shape, -999.0), dims=("id", "time"))
ds_pws["median_corr_nbrs"] = xr.DataArray(np.full(shape, -999.0), dims=("id", "time"))
# Constant threshold, stored as a variable for plotting
ds_pws["gamma"] = xr.DataArray(np.full(shape, gamma), dims=("id", "time"))

## Quality control

Now the data set is prepared to run the quality control.

### Apply Station Outlier filter

Conditions for raising the Station Outlier flag:

* The median of the rolling Pearson correlation with all neighboring stations within range `max_distance` is less than the threshold `gamma`.
* The filter cannot be applied if fewer than `nstat` neighbours are reporting data (the SO flag is set to -1).
* The filter cannot be applied if fewer than `nstat` neighbours have at least `mmatch` intervals overlapping with the evaluated station (the SO flag is set to -1).
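The conditions above can be sketched for a single evaluation window as follows. This is a simplified illustration under our own assumptions, not the implementation in `pypwsqc`: the function name `so_flag_for_window` is hypothetical, the two "cannot be applied" conditions are merged into one check, and neighbours with a constant series (for which the correlation is undefined) are skipped.

```python
import numpy as np


def so_flag_for_window(station, neighbors, mmatch, gamma, n_stat):
    """Return (flag, median correlation) for one evaluation window.

    station:   1-D array of rainfall for the evaluated station
    neighbors: 2-D array (n_neighbors, n_times) of neighbour rainfall
    Simplified sketch for illustration only.
    """
    correlations = []
    for nbr in neighbors:
        both_valid = ~np.isnan(station) & ~np.isnan(nbr)
        if both_valid.sum() < mmatch:
            continue  # too few overlapping intervals with this neighbour
        a, b = station[both_valid], nbr[both_valid]
        if a.std() == 0 or b.std() == 0:
            continue  # assumption: skip constant series (undefined correlation)
        correlations.append(np.corrcoef(a, b)[0, 1])

    if len(correlations) < n_stat:
        return -1, np.nan  # not enough usable neighbours
    median_corr = float(np.median(correlations))
    return (1 if median_corr < gamma else 0), median_corr


# Toy window: a repeating rainfall pattern seen by three well-behaved neighbours
pattern = (np.arange(50) % 5).astype(float)
nbrs = np.array([pattern * scale for scale in (0.8, 1.0, 1.2)])

flag_ok, corr_ok = so_flag_for_window(pattern, nbrs, 10, 0.15, 3)
flag_bad, corr_bad = so_flag_for_window(4 - pattern, nbrs, 10, 0.15, 3)
flag_few, _ = so_flag_for_window(pattern, nbrs, 10, 0.15, 5)
```

In the toy data, a station that follows its neighbours passes, a station with an inverted signal is flagged, and requiring five neighbours when only three exist gives -1.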

For recommended settings of the parameters `evaluation_period`, `mmatch`, `gamma` and `nstat` (`n_stat` in the code below), see Table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731

Note! The SO-filter differs from the original R code. In the original implementation, the evaluation period is variable: any period containing at least `mrain` intervals of nonzero rainfall measurements is evaluated. In this implementation, only a fixed rolling window of `evaluation_period` intervals is evaluated, so `mrain` from the original code is not needed.

In the original publication, the evaluation period is set to 4032 intervals, which for 5-minute data is equivalent to two weeks. Without the option of a variable evaluation period, two weeks is often too short, as there may not be enough wet intervals in the window to calculate the correlation. This results in many -1 flags (filter cannot be applied). We suggest using a longer evaluation period, for example four weeks (`evaluation_period` = 8064 for 5-minute data).
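The interval counts follow directly from the time step of the data. A small sketch of the arithmetic, where the helper name `n_intervals` is hypothetical:

```python
from datetime import timedelta


def n_intervals(period, timestep):
    """Number of whole time steps of length `timestep` that fit in `period`."""
    return period // timestep


timestep = timedelta(minutes=5)
two_weeks = n_intervals(timedelta(weeks=2), timestep)
four_weeks = n_intervals(timedelta(weeks=4), timestep)
```

For data with a different resolution, the same helper gives the matching `evaluation_period`, for example `n_intervals(timedelta(weeks=4), timedelta(minutes=10))` for 10-minute data.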

`max_distance` is called `d` and `evaluation_period` is called `mint` in the original publication.

In [22]:
evaluation_period = 8064  
mmatch = 200
gamma = 0.15 
n_stat = 5
max_distance = 10e3 

### Run SO-filter

In [32]:
%%time

ds_pws_filtered = pypwsqc.flagging.so_filter(
    ds_pws,
    distance_matrix,
    evaluation_period,
    mmatch,
    gamma,
    n_stat,
    max_distance,
)

ds_pws_filtered

CPU times: total: 93.8 ms
Wall time: 94.9 ms
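After the filter has run, it is worth checking how the flag values are distributed, and in particular whether any -999 values remain. The sketch below counts the flag values in a toy array; the helper `summarize_flags` is hypothetical, and applying it to `ds_pws_filtered.so_flag.data` assumes that the filter writes its result to the `so_flag` variable.

```python
import numpy as np

FLAG_MEANING = {
    0: "passed",
    1: "flagged as outlier",
    -1: "not enough information",
    -999: "not processed",
}


def summarize_flags(flags):
    """Count how often each flag value occurs (hypothetical helper)."""
    values, counts = np.unique(np.asarray(flags).astype(int), return_counts=True)
    return {
        FLAG_MEANING.get(int(v), f"unknown ({int(v)})"): int(c)
        for v, c in zip(values, counts)
    }


# Toy flags for two stations and four time steps
toy_flags = np.array([[0, 0, 1, -1], [0, -999, 0, 1]])
summary = summarize_flags(toy_flags)
```

A non-zero count for "not processed" indicates time steps that the filter never reached.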
