# High Influx filter

This notebook demonstrates the “HI_filter”, which detects unrealistically high rainfall amounts reported by a personal weather station (PWS) compared with a reference.
The original R code stems from https://github.com/LottedeVos/PWSQ

In [1]:
#Import packages

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#import poligrain as plg
import xarray as xr
#import pypwsqc


## Get example data - move to "preparations" section?

[short description of example data]

In [2]:
#download nc file to current directory. The URL must point to the raw file (".../raw/main/..."); a ".../tree/main/..." URL returns the GitHub HTML page, which xarray cannot open (ValueError)
#!curl -OL https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/notebooks/data/OpenSense_PWS_example_format_data.nc

#old file: 
#!curl -OL https://github.com/OpenSenseAction/training_school_opensene_2023/raw/main/data/pws/data_PWS_netCDF_AMS_float.nc


In [2]:
#read PWS data with xarray
ds_pws = xr.open_dataset("OpenSense_PWS_example_format_data.nc")
ds_pws

## Move sections below to "preparations" section?

[Preparations below both apply to HI and FZ filter]

### Reproject coordinates to metric projection to allow for distance calculations 



In [None]:
#ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(x=ds_pws.lon, y=ds_pws.lat, target_projection="EPSG:25832")

### Calculate distance between stations in meters

In [None]:
#use poligrain
#distance_matrix =  #plg.spatial.calc_point_to_point_distances(ds_pws, ds_pws)
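
Until the poligrain call above is in place, pairwise distances can also be estimated directly from longitude and latitude. The block below is a minimal sketch using the haversine (great-circle) formula on a spherical Earth; it is a stand-in technique, not the method used by poligrain, and the function name `haversine_matrix` and the station coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_matrix(lon_deg, lat_deg):
    """Return pairwise great-circle distances in meters between all stations."""
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    # differences between every pair of stations via broadcasting
    dlon = lon[:, None] - lon[None, :]
    dlat = lat[:, None] - lat[None, :]
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2
    )
    # clip guards against rounding slightly outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# hypothetical station coordinates in degrees, for illustration only
lon = [4.90, 4.90, 5.00]
lat = [52.30, 53.30, 52.30]
dist = haversine_matrix(lon, lat)
```

The result is a symmetric matrix with zeros on the diagonal. Over the 10 km range used in this notebook, the spherical approximation should differ only slightly from distances computed in a metric projection.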

In [4]:
#use precalculated file for now
distance_matrix = xr.open_dataset("distance_matrix_pws_ams_dataset.nc")
distance_matrix = distance_matrix.rename_vars({'__xarray_dataarray_variable__':'dist'})
distance_matrix

## Create list of neighbouring stations 

In [5]:
#select stations within range max_distance
pws_id = "ams1" # for now - should be done for all stations?! ds_pws.id.data[:]
max_distance = 10e3 #meters

In [6]:
#neighbor_ids = distance_matrix.where(distance_matrix.dist.data < max_distance) #De Vos calls this "neighbourlist"
#select the data variable "dist" first: comparing the whole Dataset cannot be used as a boolean index (TypeError)
#assumes "dist" has the dimension "id" plus one neighbour dimension of the same length
neighbor_ids = distance_matrix.id.data[distance_matrix.dist.sel(id=pws_id).data < max_distance]
neighbor_ids
#reference = median rainfall amount of stations within range d around station i (same range d for FZ filter?) WHICH TOLERANCE RANGE / ACCEPTANCE RANGE?!
#N = number of non-NaN stations within range d around station i

#nstat = 5 # threshold for number of stations within range d reporting data
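
The neighbour selection can be illustrated without the example files. The following is a minimal NumPy sketch, assuming a square, symmetric distance matrix in meters; the station ids, the distances and the helper `neighbours_of` are hypothetical. Whether a station should count as its own neighbour is left as an option, since the notebook does not settle this.

```python
import numpy as np

# hypothetical station ids and symmetric distance matrix in meters
ids = np.array(["ams1", "ams2", "ams3", "ams4"])
dist = np.array([
    [0.0, 2_500.0, 12_000.0, 9_999.0],
    [2_500.0, 0.0, 8_000.0, 15_000.0],
    [12_000.0, 8_000.0, 0.0, 4_000.0],
    [9_999.0, 15_000.0, 4_000.0, 0.0],
])
max_distance = 10e3  # meters


def neighbours_of(station, ids, dist, max_distance, include_self=False):
    """Return the ids of all stations closer than max_distance to `station`."""
    i = int(np.flatnonzero(ids == station)[0])  # row index of the station
    mask = dist[i] < max_distance               # 1-D boolean mask over stations
    if not include_self:
        mask[i] = False                         # drop the station itself (distance 0)
    return ids[mask]


neighbours = neighbours_of("ams1", ids, dist, max_distance)
with_self = neighbours_of("ams1", ids, dist, max_distance, include_self=True)
```

Because the mask is one-dimensional and has the same length as `ids`, it can be used directly as a boolean index, which is the step that fails when a whole `xarray.Dataset` is compared instead of a single data variable.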

In [7]:
#filter cannot be applied if less than nstat stations are reporting data within range d (same for FZ filter)
#HIflag[xr.where(N < nstat)] = -1 

In [13]:
#Calculate median
#reference = ds_pws.sel(id=neighbor_ids).rainfall.median(dim="id")

reference = ds_pws.rainfall.median(dim="id")
reference

## Calculate reference - move to "preparations" section?

By default, the filter compares the rainfall observed at a station with the median rainfall of all stations within a range `max_distance`. If the median is below the threshold `HIthresA`, the station's HI flag is set to 1 (i.e. high influx) when its rainfall exceeds the threshold `HIthresB`. When the surrounding stations report moderate to heavy rainfall, the threshold becomes variable: for a median of `HIthresA` or higher, the HI flag is set to 1 when the station's rainfall exceeds the median times `HIthresB/HIthresA`. The HI flag is set to −1 if fewer than `nstat` stations report observations.
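
The rule described above can be written compactly. The notation is an assumption introduced here for illustration: $r_i(t)$ is the rainfall of station $i$ at time $t$, $m_i(t)$ the median rainfall of its neighbours, $N_i(t)$ the number of neighbours reporting data, and $A$, $B$, $n_\mathrm{stat}$ stand for `HIthresA`, `HIthresB` and `nstat`, with $A > 0$.

$$
\mathrm{HIflag}_i(t) =
\begin{cases}
-1 & \text{if } N_i(t) < n_\mathrm{stat} \\
1 & \text{if } N_i(t) \ge n_\mathrm{stat} \text{ and } r_i(t) > B \cdot \max\left(1, \dfrac{m_i(t)}{A}\right) \\
0 & \text{otherwise}
\end{cases}
$$

The factor $\max(1, m_i(t)/A)$ merges the two cases: for $m_i(t) < A$ the limit is the fixed value $B$, and for $m_i(t) \ge A$ it grows linearly as $(B/A)\, m_i(t)$.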

_Allow for other metrics in addition to median? Stochastic methods? Propose other metrics for variable `reference`? Compare with secondary data?_

In [None]:
#Calculate median
reference = ds_pws.sel(id=neighbor_ids).rainfall.median(dim="id")
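
Missing observations affect both the reference and the number of reporting stations. The block below is a small NumPy sketch of the two quantities on hypothetical data; `np.nanmedian` plays the role of the NaN-skipping median, and the array layout (station, time) is an assumption made for the example.

```python
import numpy as np

# hypothetical rainfall in mm, shape (station, time); NaN marks a missing observation
rain = np.array([
    [0.0, 0.2, np.nan],
    [0.1, np.nan, 1.0],
    [0.3, 0.4, 2.0],
])

# median over the station axis, ignoring NaN values
reference = np.nanmedian(rain, axis=0)

# number of stations reporting a valid value at each time step
n_valid = np.sum(~np.isnan(rain), axis=0)
```

Here the second and third time steps each have only two reporting stations, so with a minimum of, say, three stations the filter could not be applied at those times.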

## Apply the High Influx filter

[insert explanation about parameters and their selected values]

In [None]:
#set parameters
HIthresA = 0.4 # threshold for median rainfall of stations within range d, mm
HIthresB = 10 #upper rainfall limit, mm
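
To get a feeling for these values, it helps to look at the rainfall limit they imply for different neighbour medians. The helper `hi_limit` below is hypothetical and only restates the rule from the reference description: a fixed limit of `HIthresB` for medians below `HIthresA`, and a limit proportional to the median otherwise.

```python
def hi_limit(median, thres_a=0.4, thres_b=10.0):
    """Return the rainfall (mm) above which a station would be flagged as high influx."""
    if median < thres_a:
        return thres_b                   # dry or light rain nearby: fixed limit
    return thres_b / thres_a * median    # wetter surroundings: limit scales with median


# limits for a few hypothetical neighbour medians in mm
limits = {median: hi_limit(median) for median in (0.0, 0.2, 0.4, 1.0, 2.0)}
```

With the values above the ratio `HIthresB/HIthresA` is 25, so a neighbour median of 2 mm gives a limit of about 50 mm, while any median below 0.4 mm gives the fixed limit of 10 mm.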

In [5]:
#Initialize HIflag with zeros
HIflag = xr.zeros_like(ds_pws)
HIflag = HIflag.rename_vars({'rainfall':'HIflag'})

In [None]:
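#filter cannot be applied if fewer than nstat stations are reporting data within range d (same for FZ filter) MOVE?!?
nstat = 5 #threshold for number of stations within range d reporting data
N = ds_pws.rainfall.sel(id=neighbor_ids).notnull().sum(dim="id") #number of non-NaN stations per time step
HIflag["HIflag"] = HIflag.HIflag.where(N >= nstat, -1) #for now N is based on the neighbours of pws_id only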

In [None]:
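#flag all stations at once: reference (dim "time") broadcasts against the rainfall of every station
rain = ds_pws.rainfall
condition1 = (reference < HIthresA) & (rain > HIthresB)
condition2 = (reference >= HIthresA) & (rain > HIthresB / HIthresA * reference)
#set the flag to 1 where either condition holds, but keep existing -1 flags
HIflag["HIflag"] = HIflag.HIflag.where(~(condition1 | condition2) | (HIflag.HIflag == -1), 1)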
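
The complete flagging step can be checked on a small synthetic case. The following NumPy sketch is not the PWSQC implementation; the function `hi_flags`, the array layout (station, time) and all values are assumptions chosen so that each branch of the rule occurs at least once.

```python
import numpy as np


def hi_flags(rain, reference, n_valid, thres_a=0.4, thres_b=10.0, nstat=5):
    """Return HI flags: 1 = high influx, 0 = ok, -1 = too few reporting neighbours."""
    rain = np.asarray(rain, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n_valid = np.asarray(n_valid)
    # fixed limit for low medians, limit proportional to the median otherwise
    limit = np.where(reference < thres_a, thres_b, thres_b / thres_a * reference)
    flags = np.where(rain > limit, 1, 0)
    # -1 takes precedence: the filter cannot be evaluated with too few neighbours
    return np.where(n_valid < nstat, -1, flags)


# hypothetical data: two stations, five time steps, rainfall in mm
rain = np.array([
    [12.0, 8.0, 30.0, 40.0, 20.0],
    [0.0, 11.0, 26.0, 49.0, 0.0],
])
reference = np.array([0.1, 0.1, 1.0, 2.0, 0.2])  # neighbour median per time step
n_valid = np.array([6, 6, 6, 6, 3])              # reporting neighbours per time step

flags = hi_flags(rain, reference, n_valid)
```

The one-dimensional `reference` and `n_valid` broadcast along the time axis of `rain`, which mirrors how a per-time-step reference is compared with every station. In the last time step only three neighbours report data, so both stations receive −1 regardless of their rainfall.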