# Station Outlier filter

This notebook shows the "station outlier filter" to detect... 

The original R code stems from https://github.com/LottedeVos/PWSQC/. 

Publication:
de Vos, L. W., Leijnse, H., Overeem, A., & Uijlenhoet, R. (2019). Quality control for crowdsourced personal weather stations to enable operational rainfall monitoring. _Geophysical Research Letters_, 46(15), 8820-8829.

The idea of the filter is to... 

In [1]:
import numpy as np
import xarray as xr
import poligrain as plg
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
ds_pws = xr.open_dataset('OpenSense_PWS_example_format_data.nc')

# Slice to one month (July 2017)
ds_pws = ds_pws.sel(time = slice('2017-07-01','2017-07-31'))

## Calculate distance matrix

In [3]:
ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(
    x=ds_pws.longitude, y=ds_pws.latitude, target_projection="EPSG:25832"
)

In [4]:
distance_matrix = plg.spatial.calc_point_to_point_distances(ds_pws, ds_pws)
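
For intuition, the distance matrix holds the pairwise distances between the projected station coordinates. The sketch below reproduces that idea with plain NumPy broadcasting on three made-up stations. The coordinates and the name `toy_distances` are illustrative assumptions, and this is not meant to mirror how `poligrain` computes the result internally.

```python
import numpy as np

# Hypothetical projected coordinates in meters for three stations
x = np.array([0.0, 3000.0, 0.0])
y = np.array([0.0, 4000.0, 12000.0])

# Pairwise Euclidean distances via broadcasting, result has shape (3, 3)
dx = x[:, None] - x[None, :]
dy = y[:, None] - y[None, :]
toy_distances = np.sqrt(dx**2 + dy**2)
```

The diagonal is zero (distance of each station to itself), which is why the neighbor selection further down excludes distances equal to zero.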

## SO filter (fixed evaluation period only)

In [5]:
# Set parameters
mint = 4032
mrain = 100
mmatch = 200
gamma = 0.35 # 0.15 original (gives very few SO-flags)
beta = 0.2
n_stat = 5
max_distance = 10e3  # maximum neighbour distance in meters (EPSG:25832 coordinates are in meters)
dbc = 1
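
As a quick sanity check on `mint`, the window length in time steps can be converted to a duration. The sketch below assumes a 5-minute time step, which is an assumption about the example data and should be checked against `ds_pws.time`. The names `time_step`, `mint_steps` and `evaluation_period` are introduced here for illustration only.

```python
import pandas as pd

# Assumed temporal resolution of the PWS data (verify against ds_pws.time)
time_step = pd.Timedelta("5min")

# Window length in time steps, same value as `mint` above
mint_steps = 4032

# Duration covered by one evaluation window
evaluation_period = mint_steps * time_step
```

Under that assumption the window spans 14 days, so a one-month slice contains only a little more than two full windows.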

In [6]:
# initialize data variables
ds_pws['so_flag'] = xr.DataArray(np.ones((len(ds_pws.id), len(ds_pws.time)))*-999, dims=("id", "time"))
ds_pws['mean_corr_nbrs'] = xr.DataArray(np.ones((len(ds_pws.id), len(ds_pws.time)))*-999, dims=("id", "time"))

## Testing for one station, without loop

In [8]:
# initialize
i = 0
ds_station = ds_pws.isel(id=i) 
pws_id = ds_station.id.values

# pick stations within max_distance, excluding the station itself; the selection holds for the whole duration of the time series
neighbor_ids = distance_matrix.id.data[(distance_matrix.sel(id=pws_id) < max_distance) & (distance_matrix.sel(id=pws_id) > 0)]

# create dataset for the neighbors
ds_neighbors = ds_pws.sel(id=neighbor_ids)
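
The neighbor selection above can be tried out on a small, made-up distance matrix. This is a minimal sketch: the station names, the distances, and the dimension name `id_neighbor` are assumptions for illustration and may differ from what `poligrain` returns.

```python
import numpy as np
import xarray as xr

# Hypothetical station ids and a symmetric distance matrix in meters
ids = np.array(["pws_a", "pws_b", "pws_c", "pws_d"])
toy_matrix = xr.DataArray(
    np.array(
        [
            [0.0, 5e3, 12e3, 8e3],
            [5e3, 0.0, 9e3, 11e3],
            [12e3, 9e3, 0.0, 4e3],
            [8e3, 11e3, 4e3, 0.0],
        ]
    ),
    dims=("id", "id_neighbor"),
    coords={"id": ids, "id_neighbor": ids},
)

toy_max_distance = 10e3

# Distances from one station to all others
row = toy_matrix.sel(id="pws_a")

# Keep stations closer than the threshold, dropping the station itself (distance 0)
mask = (row > 0) & (row < toy_max_distance)
toy_neighbor_ids = row.id_neighbor.values[mask.values]
```

Here `pws_c` is dropped because it lies 12 km away, and `pws_a` is dropped because its distance to itself is zero.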

In [9]:
# This function calculates the number of overlapping rainy time intervals between the station in question and its neighbors
# only FOR THE WHOLE TIME SERIES (here cut to one month), not in a rolling window.
# It takes about 35 s to run for one station on my machine (for one month of data).

def so_filter_full_series(da_station, da_neighbors, window_length):
    
    s_station = da_station.to_series()
    s_neighbors = da_neighbors.to_series()

    corr = s_station.rolling(window_length, min_periods=1).corr(s_neighbors)
    rainy_timesteps = (s_neighbors > 0).rolling(window_length, min_periods=1).sum()
    
    # the line below is the slow step, the rest is very fast
    matches = s_neighbors.apply(lambda col: ((s_station > 0) & (col > 0)).sum())

    # if matches < mmatch --> filter cannot be applied
    # if matches > mmatch --> proceed with correlation calculation

    # if I try to do the same in a rolling window (overlapping intervals in the last mint time steps) it explodes
    # matches = s_neighbors.apply(lambda col: ((s_station > 0) & (col > 0)).rolling(window_length, min_periods=1).sum())

    ds = xr.Dataset.from_dataframe(pd.DataFrame({'corr': corr}))
    ds['rainy_timesteps'] = xr.DataArray.from_series(rainy_timesteps)
    ds['matches'] = xr.DataArray.from_series(matches)
    
    return ds

In [10]:
%%time

# this takes roughly 30 s for one station when the time series is one month long (27.6 s wall time in the run recorded below)

ds_so_filter = so_filter_full_series(ds_station.rainfall, ds_neighbors.rainfall, window_length=mint)
ds_so_filter

CPU times: total: 27.4 s
Wall time: 27.6 s
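
One possible way to get the rolling overlap count that is commented out in `so_filter_full_series` is to avoid the per-column `apply` and work on a wide table instead. The sketch below assumes the neighbors are arranged as a DataFrame with time as index and one column per station, which is not how `s_neighbors` is built above. All names and values (`s_toy_station`, `df_toy_neighbors`, `toy_window`) are made up for illustration, and it has not been timed on the real data.

```python
import pandas as pd

time = pd.date_range("2017-07-01", periods=12, freq="5min")

# Hypothetical rainfall series for the station and two neighbors
s_toy_station = pd.Series(
    [0, 1, 2, 0, 0, 3, 1, 0, 0, 0, 2, 1], index=time, dtype=float
)
df_toy_neighbors = pd.DataFrame(
    {
        "nbr_1": [0, 1, 0, 0, 1, 2, 1, 0, 0, 0, 1, 0],
        "nbr_2": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2],
    },
    index=time,
    dtype=float,
)

# 1 where it rains, 0 otherwise
wet_station = s_toy_station.gt(0).astype(int)
wet_neighbors = df_toy_neighbors.gt(0).astype(int)

# 1 where both the station and the neighbor are wet at the same time step
both_wet = wet_neighbors.mul(wet_station, axis=0)

# Number of matching wet time steps within the last `toy_window` steps
toy_window = 4
rolling_matches = both_wet.rolling(toy_window, min_periods=1).sum()
```

Summing `both_wet` over time gives the whole-series overlap per neighbor, while `rolling_matches` gives the same count restricted to the trailing window at every time step.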
