# pypwsqc

[insert package explanation]

The original R code is available at https://github.com/LottedeVos/PWSQC

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Import packages

import poligrain as plg
import pypwsqc.flagging  # needed for the HI_filter call further down
import xarray as xr

## Load example data
[short description of example data]

In [3]:
# Download the nc file to the current directory. A /tree/ URL returns GitHub's HTML page
# instead of the file (xarray then raises ValueError), so use the /raw/ URL.
# !curl -OL https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/notebooks/data/OpenSense_PWS_example_format_data.nc

In [4]:
# test = xr.open_dataset("OpenSense_PWS_example_format_data.nc")
# test

# old file:
# !curl -OL https://github.com/OpenSenseAction/training_school_opensene_2023/raw/main/data/pws/data_PWS_netCDF_AMS_float.nc

In [5]:
# read PWS data with xarray
ds_pws = xr.open_dataset(
    "C:/Users/a002461/OPENSENSE/data/OpenSense_PWS_example_format_data.nc"
)
# ds_pws = ds_pws.load() # do here already?

# rename to follow opensense naming convention
ds_pws = ds_pws.rename_vars({"longitude": "lon", "latitude": "lat"})

# slice to time of interest (remove)
# ds_pws = ds_pws.sel(time=slice("2017-09-13 00:00:00", "2017-09-13 03:00"))

ds_pws

In [6]:
# ds_pws.rainfall.cumsum(dim="time").plot.line(x="time", add_legend=False, linewidth=0.5);

## Create distance matrix

[Preparations below both apply to HI and FZ filter. Apply poligrain here]


### Reproject coordinates to metric projection to allow for distance calculations 

In [8]:
ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(
    x=ds_pws.lon, y=ds_pws.lat, target_projection="EPSG:25832"
)

### Calculate distance between all stations of the network in meters

In [9]:
# TODO: create a sparse matrix so that this scales to large datasets

distance_matrix = plg.spatial.calc_point_to_point_distances(ds_pws, ds_pws)

# distance_matrix.to_netcdf('C:/Users/a002461/OPENSENSE/data/distance_matrix.nc')
# distance_matrix = xr.open_dataset("C:/Users/a002461/OPENSENSE/data/distance_matrix.nc")
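
To make explicit what the distance matrix holds, here is a minimal sketch in plain Python. It assumes projected `x`/`y` coordinates in metres and plain Euclidean distance. The station ids and coordinates are made up, and `pairwise_distances` is a hypothetical helper, not the poligrain implementation.

```python
import math

# Hypothetical projected coordinates (metres) for three made-up stations.
stations = {
    "pws_a": (0.0, 0.0),
    "pws_b": (3000.0, 4000.0),
    "pws_c": (20000.0, 0.0),
}


def pairwise_distances(coords):
    """Return {id: {neighbor_id: distance_in_metres}} for all station pairs."""
    return {
        id_a: {
            id_b: math.hypot(xa - xb, ya - yb)
            for id_b, (xb, yb) in coords.items()
        }
        for id_a, (xa, ya) in coords.items()
    }


distances = pairwise_distances(stations)

# Stations closer than max_distance. The station itself (distance 0) is included.
max_distance = 10e3
neighbors_of_a = [k for k, d in distances["pws_a"].items() if d < max_distance]
print(neighbors_of_a)
```

Note that a plain `distance < max_distance` selection always contains the station itself, which matters when the selection is later used to count neighbours or to take a median.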

### Calculate number of neighbours reporting rainfall per timestep

In [10]:
%%time 

# select range maximum_distance in which to find neighbours
max_distance = 10e3  # range around each station, meters

ds_pws = ds_pws.load()
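nbrs_not_nan = []

for pws_id in ds_pws.id.data:
    # ids of all stations within max_distance (includes the station itself, at distance 0)
    neighbor_ids = distance_matrix.id.data[
        distance_matrix.sel(id=pws_id) < max_distance
    ]
    # count the neighbours that report a value (not NaN) at each timestep
    N = ds_pws.rainfall.sel(id=neighbor_ids).notnull().sum(dim="id")
    nbrs_not_nan.append(N)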

CPU times: total: 23.5 s
Wall time: 23.6 s


In [11]:
ds_pws["nbrs_not_nan"] = xr.concat(nbrs_not_nan, dim="id")
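
The quantity the filters need is the number of neighbours that report a value (not NaN) at each timestep. A minimal plain-Python sketch on made-up data shows the counting. `count_reporting`, the station ids and the rainfall values are hypothetical.

```python
import math

nan = float("nan")
# Made-up rainfall series (mm per timestep); NaN marks a missing report.
rainfall = {
    "pws_a": [0.0, 1.2, nan],
    "pws_b": [0.0, nan, nan],
    "pws_c": [0.1, 0.8, 0.4],
}
# Hypothetical neighbour list for one station, e.g. taken from the distance matrix.
neighbor_ids = ["pws_a", "pws_b", "pws_c"]


def count_reporting(rainfall, neighbor_ids):
    """Count, per timestep, how many neighbours report a non-NaN value."""
    n_steps = len(rainfall[neighbor_ids[0]])
    return [
        sum(not math.isnan(rainfall[i][t]) for i in neighbor_ids)
        for t in range(n_steps)
    ]


nbrs_reporting = count_reporting(rainfall, neighbor_ids)
print(nbrs_reporting)  # [3, 2, 1]
```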

In [13]:
# ds_pws.to_netcdf('C:/Users/a002461/OPENSENSE/data/dataset_with_flags_AMS2.nc')
# ds_pws2 = xr.open_dataset("C:/Users/a002461/OPENSENSE/data/dataset_with_flags_AMS2.nc")
# try to save output :)

In [15]:
# %matplotlib widget

# id = 'ams2'

# fig,ax = plt.subplots()

# ds_pws.HIflag.sel(id=id,time=slice("2017-01-03 00:00:00", "2017-03-04 00:00")).plot() #time=slice("2017-01-03 00:00:00", "2017-03-04 00:00"
# plt.title("Filter cannot be applied")

## Calculate reference

The filters compare the rainfall observed at a station with a reference. By default, the reference is the median rainfall of all stations within a range `d` of that station (`max_distance` in this notebook). If the median is below the threshold `HIthresA`, the station's HI flag is set to 1 (i.e. high influx) when its rainfall exceeds the threshold `HIthresB`. When the surrounding stations report moderate to heavy rainfall, the threshold becomes variable: for a median of `HIthresA` or higher, the HI flag is set to 1 when the station's rainfall exceeds the median times `HIthresB/HIthresA`.
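
Written as a formula, the rule above reads as follows. The symbols are shorthand introduced here only for this formula: $r$ is the station's rainfall, $m$ the neighbour median, $A$ stands for `HIthresA` and $B$ for `HIthresB`.

$$
\mathrm{HI} =
\begin{cases}
1 & \text{if } m < A \text{ and } r > B,\\
1 & \text{if } m \ge A \text{ and } r > m \, B / A,\\
0 & \text{otherwise.}
\end{cases}
$$

For example, with `HIthresA = 0.2` and `HIthresB = 10` the ratio $B/A$ is 50, so a median of 0.5 gives a threshold of $0.5 \times 50 = 25$, in the same units as the rainfall.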

_Open questions: allow for other metrics in addition to the median? Stochastic methods? Propose other metrics for the variable `reference`? Compare with secondary data?_

In [16]:
%%time

reference = []

for pws_id in ds_pws.id.data:
    neighbor_ids = distance_matrix.id.data[
        distance_matrix.sel(id=pws_id) < max_distance
    ]
    median = ds_pws.sel(id=neighbor_ids).rainfall.median(dim="id")
    reference.append(median)

CPU times: total: 4min 16s
Wall time: 4min 17s


In [17]:
ds_pws["reference"] = xr.concat(reference, dim="id")
ds_pws
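
To illustrate what the reference contains, the sketch below computes a per-timestep median over the neighbours of one station in plain Python, ignoring NaN values. As far as I am aware, xarray's `median` also skips NaN by default for float data, which is the behaviour mimicked here. `neighbor_median` and the toy data are hypothetical.

```python
import math
import statistics

nan = float("nan")
# Made-up rainfall (mm per timestep) for the neighbours of one station.
neighbor_rainfall = {
    "pws_a": [0.0, 1.2, nan],
    "pws_b": [0.0, nan, nan],
    "pws_c": [0.1, 0.8, nan],
}


def neighbor_median(neighbor_rainfall):
    """Median over neighbours per timestep, ignoring NaN; NaN if none report."""
    series = list(neighbor_rainfall.values())
    result = []
    for values in zip(*series):
        valid = [v for v in values if not math.isnan(v)]
        result.append(statistics.median(valid) if valid else nan)
    return result


toy_reference = neighbor_median(neighbor_rainfall)
print(toy_reference)
```

The last timestep has no reporting neighbour, so the reference there is NaN. This is the situation the `nstat` check in the filters is meant to catch.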

## Faulty Zeroes filter

Conditions for raising Faulty Zeroes flag:

* FZflag is not -1
* Median rainfall of neighbouring stations within range `max_distance` is larger than zero for at least `nint` time intervals while the station itself reports zero rainfall.

The FZ flag remains 1 until the station reports nonzero rainfall. For the settings of parameter `nint`, see Table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731
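
The conditions above can be read as a small state machine. The sketch below is one plain-Python reading of them for a single station, not the pypwsqc implementation. `fz_flags` and the toy series are hypothetical, and both the -1 case and NaN handling are left out.

```python
def fz_flags(station, reference, nint):
    """Flag faulty zeroes per timestep: 1 = flagged, 0 = not flagged."""
    flags = []
    run = 0  # consecutive steps with station == 0 while reference > 0
    flagged = False
    for obs, ref in zip(station, reference):
        if obs > 0:
            # Nonzero rainfall clears the flag and resets the run.
            run = 0
            flagged = False
        elif ref > 0:
            # Station reports zero while the neighbours report rain.
            run += 1
            if run >= nint:
                flagged = True
        elif not flagged:
            # Station and neighbours are both dry: the run is interrupted.
            run = 0
        flags.append(1 if flagged else 0)
    return flags


station = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]
reference = [0.0, 0.2, 0.4, 0.3, 0.0, 0.1, 0.2]
print(fz_flags(station, reference, nint=3))  # [0, 0, 0, 1, 1, 0, 0]
```

In the toy series the flag is raised at the third consecutive wet reference step, stays raised while the station keeps reporting zero, and is cleared by the first nonzero observation.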

In [23]:
# FZ_flag = pypwsqc.flagging.FZ_filter(
#    pws_data=ds_pws.rainfall,
#    reference=ds_pws.reference,
#    nint=3
# )

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [None]:
# ds_pws["FZflag"]= FZ_flag

## High Influx filter

Conditions for raising the High Influx flag (`ϕA` and `ϕB` correspond to `HIthresA` and `HIthresB` in the code):
* If the median is below threshold `ϕA`, the flag is raised when the station's rainfall is above threshold `ϕB`
* If the median is `ϕA` or higher, the flag is raised when the station's rainfall exceeds the median times `ϕB`/`ϕA`

The filter cannot be applied if fewer than `nstat` neighbours are reporting data (the HI flag is then set to -1).

For the settings of parameters `ϕA`, `ϕB` and `nstat`, see Table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731
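
The conditions above can be sketched for a single station and timestep as follows. This is a minimal illustration of the rule as described in this notebook, not the pypwsqc implementation. `hi_flag` and its argument names are hypothetical, and the threshold values are the ones used in the call below.

```python
def hi_flag(rain, median, n_reporting, thres_a, thres_b, nstat):
    """Return the HI flag for one station and one timestep: -1, 0 or 1."""
    if n_reporting < nstat:
        return -1  # too few neighbours report data: filter not applicable
    if median < thres_a:
        threshold = thres_b  # fixed threshold when the neighbours are (nearly) dry
    else:
        threshold = median * thres_b / thres_a  # threshold scales with the median
    return 1 if rain > threshold else 0


settings = {"thres_a": 0.2, "thres_b": 10, "nstat": 5}
print(hi_flag(12.0, 0.0, 8, **settings))  # 1: dry neighbours, rain above 10
print(hi_flag(12.0, 0.5, 8, **settings))  # 0: threshold is 0.5 * 50 = 25
print(hi_flag(30.0, 0.5, 8, **settings))  # 1: rain above 25
print(hi_flag(30.0, 0.5, 3, **settings))  # -1: only 3 neighbours report
```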

In [24]:
# can nstat be passed to both filters at the same time...?

HI_flag = pypwsqc.flagging.HI_filter(
    pws_data=ds_pws.rainfall,
    nbrs_not_nan=ds_pws.nbrs_not_nan,
    reference=ds_pws.reference,
    HIthresA=0.2,
    HIthresB=10,
    nstat=5,
)



In [26]:
ds_pws["HIflag"] = HI_flag

## Station Outlier filter and bias correction factor