Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic adaptive masking to solve #29 #65

Merged
merged 21 commits into from
Jul 1, 2021

Conversation

stefraynaud
Copy link
Contributor

@stefraynaud stefraynaud commented Jan 13, 2021

As explained in #29, the regridding leads to unexpected results when the input dataset has transient values i.e. when the number of missing values varies along dimensions other than the horizontal ones.

This PR fixes #29 by normalising the results with the regridded transient mask.

@stefraynaud stefraynaud added the draft Work in progress label Feb 17, 2021
@huard huard added this to the v0.6 milestone Feb 19, 2021
@sol1105
Copy link
Contributor

sol1105 commented Mar 31, 2021

Salut Stephane,

the reason, why the tests are failing, is that in tests/test_frontend.py the test_build_regridder_with_masks adds a random mask to the xarray.Dataset ds_in that you use in your test_adaptative_masking (see code below). That is why the assertion of the number of missing values in your test always fails (with different values).

A solution is for example to prevent test_build_regridder_with_masks from altering the dataset by using a copy of ds_in rather than ds_in itself.

def test_build_regridder_with_masks():
    ds_in_test=ds_in.copy()
    ds_in_test['mask'] = xr.DataArray(np.random.randint(2, size=ds_in_test['data'].shape), dims=('y', 'x'))
    # 'patch' is too slow to test
    for method in [
        'bilinear',
        'conservative',
        'conservative_normed',
        'nearest_s2d',
        'nearest_d2s',
    ]:
        regridder = xe.Regridder(ds_in_test, ds_out, method)

        # check screen output
        assert repr(regridder) == str(regridder)
        assert 'xESMF Regridder' in str(regridder)
        assert method in str(regridder)


@pytest.mark.parametrize(
    "method, adaptative_masking, nvalid", [
        ("bilinear", False, 380),
        ("bilinear", True, 395),
        ("conservative", False, 385),
        ("conservative", True, 394)
        ])
def test_adaptative_masking(method, adaptative_masking, nvalid):
    dai = ds_in["data4D"].copy()
    dai[0, 0, 4:6, 4:6] = np.nan
    rg = xe.Regridder(ds_in, ds_out, method)
    dao = rg(dai, adaptative_masking=adaptative_masking)
    assert int(dao[0, 0, 1:-1, 1:-1].notnull().sum()) == nvalid
In the following a few comments:
  • Instead of having two function parameters adaptative_masking and mask_threshold you could have only one.
    A value of 0 for this one parameter is the same as adaptative_masking=True; mask_threshold=0..
    A value of 1 for this one parameter is the same as adaptative_masking=False.
    (This results from the definition of the mask_threshold, as the least unmasked relative source cell area that contributes to the target cell.)
    So in that sense, you could allow a boolean for that parameter (that you then invert and reformat into a float), and additionally allow arbitrary float values between 0 and 1.

  • For our usecase we would need to use adaptive masking also when there are no "non-permanent missing values":
    We generate the weights for a variable on a certain grid by only providing the horizontal input and output grid and not the data variable to xESMF.
    In this example, the variable could be a 3D ocean variable. Such variables have different land-sea-masks for all their vertical levels. But we would want to be able to use the weights for all levels by applying adaptive masking.
    Then, we would want to be able to reuse the weights incl. adaptive masking for any other variable on that grid (i.e. for ndim==2 in below code).

This is how these two suggestions could look in your function:

@staticmethod
def _regrid_array(
        indata, *,
        weights, shape_in, shape_out,
        sequence_in,
        adaptative_masking):
    
     [ ... ]

    # interpreting adaptative_masking
    if isinstance(adaptative_masking, bool):        
        mask_threshold=float(not adaptative_masking)
    elif adaptative_masking>=1 or adaptative_masking<0:
        adaptative_masking=False
        mask_threshold=1.
    else:
        adaptative_masking=True
        mask_threshold=float(adaptative_masking)

     [ ... ]

    # allow adaptive masking not only for non permanent masks
    #   and allow it for only 2 dimensions 
    # Is there any non-permament missing values?
    ndim = np.ndim(indata)
    if ndim > 2:
        inmask = np.isnan(indata)
        has_non_perm_mask = (
            np.apply_over_axes(
                np.sum, inmask, [-2, -1]).ravel().ptp() != 0)       
        if not adaptative_masking and has_non_perm_mask:
            warnings.warn(
                "Your data has transient missing values. "
                "You should set adaptative_masking to True, "
                "which will be the default in future versions.")

   [ ... ]
  
  • has_non_perm_mask = (np.apply_over_axes(np.sum, inmask, [-2, -1]).ravel().ptp() != 0) I think does not capture when the number of missing values does not change, but their location in the grid. If such a case is at all realistic.

@stefraynaud
Copy link
Contributor Author

Thanks alot Martin!
I didn't even have time solve the CI errors.
You did it for me and I agree with your remarks.
I'll update the PR very soon since I'm less busy.

@stefraynaud stefraynaud removed the draft Work in progress label Apr 16, 2021
@stefraynaud
Copy link
Contributor Author

@sol1105 can you test it before I rebase and squash?

@sol1105
Copy link
Contributor

sol1105 commented Apr 30, 2021

@stefraynaud I will test it soon and then get back to you

@sol1105
Copy link
Contributor

sol1105 commented May 3, 2021

@stefraynaud The adaptive masking works nicely, in the following some feedback:

  • For ndim==2 we would also need to use adaptive masking in cases when we want to reuse weights generated earlier for the same grid, but the variable is for example time invariant and has a different mask, or one just selected a single timestep (and a single vertical level) of a data array. Therefore it would be nice if adaptative_masking is not set to False in case of ndim==2.
  • I am not sure if the check for has_non_perm_mask works in the way it is implemented. I altered it a bit and it works in my testcase:
import numpy as np

indata = np.zeros((2,3,3,4))
ndim=indata.ndim

# 1 #  for one timestep and one level, a gridpoint is masked
# has_non_perm_mask should return True
indata[0,0,1,1]=np.nan

inmask = np.isnan(indata)
many = np.apply_over_axes(np.any, inmask, range(ndim-2))  # Applying over all axes but lat-lon
mall = np.apply_over_axes(np.all, inmask, range(ndim-2))  # assuming lat-lon are the rightmost axes
has_non_perm_mask = not (many == mall).all()           # If both are the same, there is either none or a permanent mask
print("#1 should be True:", has_non_perm_mask)

# 2 # for one timestep and all levels, a gridpoint is masked
# has_non_perm_mask should return True
indata[0,:,1,1]=np.nan

inmask = np.isnan(indata)
many = np.apply_over_axes(np.any, inmask, range(ndim-2))
mall = np.apply_over_axes(np.all, inmask, range(ndim-2))
has_non_perm_mask = not (many == mall).all()
print("#2 should be True:", has_non_perm_mask)

# 3 # for all timesteps and all levels, a gridpoint is masked
# has_non_perm_mask should return False
indata[:,:,1,1]=np.nan
inmask = np.isnan(indata)
many = np.apply_over_axes(np.any, inmask, range(ndim-2))
mall = np.apply_over_axes(np.all, inmask, range(ndim-2))
has_non_perm_mask = not (many == mall).all()
print("#3 should be False:", has_non_perm_mask)
  • In the doc string it says The destination grid point is masked if the interpolated mask is above this threshold., but I think it should be ... is below this threshold.
  • In the doc string it says .. note:: It will be set to True in futures versions.. True means, as it is implemented at the moment, that it will get interpreted as the value of tol due to mask_threshold = float(not adaptative_masking) and bad = outvalid < min(max(mask_threshold, tol), 1 - tol). Will the default value get updated? Should the default be dependent on the method (maybe including a warning that for eg. bilinear it is not recommended to use adaptive masking)?
  • adaptative_masking could be replaced by adaptive_masking. Less to write :)

Copy link
Contributor

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions:

  • Does it work with dask arrays ?
  • Does it significantly impact performance ?

Otherwise good for me.

xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
@huard huard requested a review from aulemahal May 17, 2021 18:46
Copy link
Collaborator

@aulemahal aulemahal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this!
I am also interested in knowing how it impacts performances.
This needs to be tested through, but I believe it's already dask-proof because _regrid_array is wrapped in apply_ufunc. Thus, it always receives numpy arrays, no matter what __call__ receives.

@aulemahal
Copy link
Collaborator

Well, just tried it and it works really well. Even more so, it (almost) fixes #77!

In fact, if we follow the comment of @sol1105 and allow it for ndim == 2, this is equivalent to xarray's skipna.

Proposition, rename it skipna and activate it no matter the shape of the input.

Personnally, I don't see the advantage of checking for transient nan values when it is not activated. In the new perspective that this is a skipna, it is the user's responsibility to know if the data has missing values or not. Moreover, it adds unused computations and surely affects the performance, at least minimally.

I am ok with passing skipna as Union[bool, float] (like adaptative_masking currently), but this changes the paradigm used in xarray (where skipna is Optional[bool], turning on only for floats). To fit this paradigm, it could be interesting to introduce an addtionnal argument na_thresh=1.0.

@stefraynaud
Copy link
Contributor Author

Thank you all for these reviews which culminate with @aulemahal 's review.
I agree with naming this option skip_na and the na_thres with a default value of 1.0.
I'll update the PR accordlinly and with all the other remarks.

@aulemahal
Copy link
Collaborator

Super! With the risk of seeming too nitpicky, I would lean towards skipna (no underscore). Even if this is not the most PEP8 name, it is the same as in xarray, on which we build!

@stefraynaud
Copy link
Contributor Author

You are not nitpicky! In fact, I wanted to say skipna.

@stefraynaud stefraynaud marked this pull request as draft June 17, 2021 07:35
@stefraynaud stefraynaud marked this pull request as ready for review June 17, 2021 09:20
Fix linting
@stefraynaud
Copy link
Contributor Author

Hi guys, you can have a look to this new version.

Copy link
Contributor

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be nice to have some docs accompanying this new functionality, but it could be in another PR.

xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
stefraynaud and others added 2 commits June 30, 2021 21:45
Co-authored-by: David Huard <huard.david@ouranos.ca>
Co-authored-by: David Huard <huard.david@ouranos.ca>
@stefraynaud
Copy link
Contributor Author

@huard yes I'm more in favor of creating another PR to add this feature to the notebook dedicated to masking.

@huard
Copy link
Contributor

huard commented Jun 30, 2021

@raphaeldussin I'll let you merge this one since you requested changes. Good on my end.

Copy link
Contributor

@raphaeldussin raphaeldussin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found 1 typo but very happy with the PR!

xesmf/frontend.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Properly handling of non permanent masks
5 participants