
Creating weights from multiple threads/processes fails (ESMC_GridCreateNoPeriDim) #141

Closed
jhamman opened this issue Dec 17, 2021 · 8 comments


@jhamman
Member

jhamman commented Dec 17, 2021

What happened:

When using xesmf inside a parallel framework, an opaque error is raised. I've observed this behavior using dask's threaded and distributed schedulers.

What you expected to happen:

I expected to be able to use xesmf from multiple threads/processes or, if this is not possible, to get a descriptive error and/or documentation on the subject.

Minimal Complete Verifiable Example:

This simple example is just a slightly modified version of the basic example from the xesmf docs.

import numpy as np
import dask
import xesmf as xe
import xarray as xr


@dask.delayed
def regrid(tslice):
    ds = xr.tutorial.open_dataset("air_temperature").isel(time=tslice)
    ds_out = xr.Dataset(
        {
            "lat": (["lat"], np.arange(16, 75, 1.0)),
            "lon": (["lon"], np.arange(200, 330, 1.5)),
        }
    )
    regridder = xe.Regridder(ds, ds_out, "bilinear")
    dr_out = regridder(ds)
    return dr_out


tasks = [regrid(slice(0, 10)), regrid(slice(10, 20))]

# this works
dask.compute(tasks, scheduler='single-threaded')

# this fails
dask.compute(tasks, scheduler='threads')

The traceback is here:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_290/220165509.py in <module>
      1 # this fails
----> 2 dask.compute(tasks, scheduler='threads')

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/base.py in compute(*args, **kwargs)
    568         postcomputes.append(x.__dask_postcompute__())
    569 
--> 570     results = schedule(dsk, keys, **kwargs)
    571     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    572 

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     77             pool = MultiprocessingPoolExecutor(pool)
     78 
---> 79     results = get_async(
     80         pool.submit,
     81         pool._max_workers,

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    505                             _execute_task(task, data)  # Re-execute locally
    506                         else:
--> 507                             raise_exception(exc, tb)
    508                     res, worker_id = loads(res_info)
    509                     state["cache"][key] = res

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py in reraise(exc, tb)
    313     if exc.__traceback__ is not tb:
    314         raise exc.with_traceback(tb)
--> 315     raise exc
    316 
    317 

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    218     try:
    219         task, data = loads(task_info)
--> 220         result = _execute_task(task, data)
    221         id = get_id()
    222         result = dumps((result, id))

/srv/conda/envs/notebook/lib/python3.9/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    117         # temporaries by their reference count and can execute certain
    118         # operations in-place.
--> 119         return func(*(_execute_task(a, cache) for a in args))
    120     elif not ishashable(arg):
    121         return arg

/tmp/ipykernel_290/2192809839.py in regrid(tslice)
      8         }
      9     )
---> 10     regridder = xe.Regridder(ds, ds_out, "bilinear")
     11     dr_out = regridder(ds)
     12     return dr_out

/srv/conda/envs/notebook/lib/python3.9/site-packages/xesmf/frontend.py in __init__(self, ds_in, ds_out, method, locstream_in, locstream_out, periodic, **kwargs)
    768             grid_in, shape_in, input_dims = ds_to_ESMFlocstream(ds_in)
    769         else:
--> 770             grid_in, shape_in, input_dims = ds_to_ESMFgrid(
    771                 ds_in, need_bounds=need_bounds, periodic=periodic
    772             )

/srv/conda/envs/notebook/lib/python3.9/site-packages/xesmf/frontend.py in ds_to_ESMFgrid(ds, need_bounds, periodic, append)
    130         grid = Grid.from_xarray(lon.T, lat.T, periodic=periodic, mask=mask.T)
    131     else:
--> 132         grid = Grid.from_xarray(lon.T, lat.T, periodic=periodic, mask=None)
    133 
    134     if need_bounds:

/srv/conda/envs/notebook/lib/python3.9/site-packages/xesmf/backend.py in from_xarray(cls, lon, lat, periodic, mask)
    109         # However, they actually need to be set explicitly,
    110         # otherwise grid._coord_sys and grid._staggerloc will still be None.
--> 111         grid = cls(
    112             np.array(lon.shape),
    113             staggerloc=staggerloc,

/srv/conda/envs/notebook/lib/python3.9/site-packages/ESMF/util/decorators.py in new_func(*args, **kwargs)
     79 
     80         esmp = esmpymanager.Manager(debug = False)
---> 81         return func(*args, **kwargs)
     82     return new_func
     83 

/srv/conda/envs/notebook/lib/python3.9/site-packages/ESMF/api/grid.py in __init__(self, max_index, num_peri_dims, periodic_dim, pole_dim, coord_sys, coord_typekind, staggerloc, pole_kind, filename, filetype, reg_decomp, decompflag, is_sphere, add_corner_stagger, add_user_area, add_mask, varname, coord_names, tilesize, regDecompPTile, name)
    408             self._struct = ESMP_GridStruct()
    409             if self.num_peri_dims == 0:
--> 410                 self._struct = ESMP_GridCreateNoPeriDim(self.max_index,
    411                                                        coordSys=coord_sys,
    412                                                        coordTypeKind=coord_typekind)

/srv/conda/envs/notebook/lib/python3.9/site-packages/ESMF/interface/cbindings.py in ESMP_GridCreateNoPeriDim(maxIndex, coordSys, coordTypeKind)
    579     rc = lrc.value
    580     if rc != constants._ESMP_SUCCESS:
--> 581         raise ValueError('ESMC_GridCreateNoPeriDim() failed with rc = '+str(rc)+
    582                         '.    '+constants._errmsg)
    583 

ValueError: ESMC_GridCreateNoPeriDim() failed with rc = 545.    Please check the log files (named "*ESMF_LogFile").

The ESMF_LogFile includes the following lines:

20211217 034835.110 ERROR            PET0 ESMCI_VM.C:2168 ESMCI::VM::getCurrent() Internal error: Bad condition  - - Could not determine current VM
20211217 034835.110 ERROR            PET0 ESMCI_VM_F.C:1105 c_esmc_vmgetcurrent() Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMF_VM.F90:5579 ESMF_VMGetCurrent() Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMF_Grid.F90:29430 ESMF_GridCreateDistgridReg Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMF_Grid.F90:10874 ESMF_GridCreateNoPeriDimR Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMF_Grid_C.F90:78 f_esmf_gridcreatenoperidim Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMCI_Grid.C:259 ESMCI::Grid::createnoperidim() Internal error: Bad condition  - Internal subroutine call returned Error
20211217 034835.110 ERROR            PET0 ESMC_Grid.C:83 ESMC_GridCreateNoPeriDim() Internal error: Bad condition  - Internal subroutine call returned Error

Anything else we need to know?:

xref: JiaweiZhuang/xESMF#88

Environment:

Output of xr.show_versions() + xesmf + esmf

INSTALLED VERSIONS

commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-1062-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.4
scipy: 1.7.3
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: 1.4.0
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.5.0
cartopy: 0.20.1
seaborn: 0.11.2
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: None
sparse: 0.13.0
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: 6.2.5
IPython: 7.30.1
sphinx: None
xesmf: 0.6.2
ESMF: 8.2.0

cc @rokuingh, @norlandrhagen, @theurich

@rsdunlapiv

@rokuingh can you please take a look?

@rokuingh

I just tried to reproduce this issue, but was unable to install one of the dependencies, pooch, due to a number of sub-dependency conflicts.

The rc code 545 generally indicates unmapped points. However, the log file has a message about a VM not being available, which is a much more serious issue and is most likely due to something outside of ESMPy.

@jhamman
Member Author

jhamman commented Dec 17, 2021

@rokuingh - if you have access to a conda (or mamba) installation, this command will get you an exact replica of the environment this test was made in:

# osx
conda create --name test-env --file https://raw.githubusercontent.com/carbonplan/envs/main/cmip6-downscaling/single-user/conda-osx-64.lock
# linux
conda create --name test-env --file https://raw.githubusercontent.com/carbonplan/envs/main/cmip6-downscaling/single-user/conda-linux-64.lock

@rsdunlapiv

@jhamman We are happy to provide support for ESMPy, but we don't currently have the resources (or expertise) to support the additional layers within xESMF. Could you please provide an ESMPy-only reproducer that exhibits the issue? We can definitely work from that.

@jhamman
Member Author

jhamman commented Dec 18, 2021

@rsdunlapiv and @rokuingh - here's a reproducer that does not use xesmf or xarray.

import ESMF
import numpy as np
import dask


@dask.delayed
def make_grid(shape):
    g = ESMF.Grid(
        np.array(shape),
        staggerloc=ESMF.StaggerLoc.CENTER,
        coord_sys=ESMF.CoordSys.SPH_DEG,
        num_peri_dims=None,  # without this, ESMF seems to segfault (clue?)
    )
    return g


tasks = [make_grid((59, 87)), make_grid((60, 88))]

# this works
dask.compute(tasks, scheduler='single-threaded')

# this fails
dask.compute(tasks, scheduler='threads')

@rsdunlapiv

@jhamman one question is whether we would even expect the second one to work. I am not sure of the semantics of Dask 'single-threaded' versus 'threads'. In general, ESMF is going to be multi-process, but not multi-threaded. So Dask would need to give ESMF/ESMPy multiple MPI processes in order to run in parallel. Can this be done through Dask?

@jhamman
Member Author

jhamman commented Dec 20, 2021

@rsdunlapiv - I'm not sure if this should work or not. If it's not supposed to work, it may be nice to put a thread lock in place or, alternatively, raise a more informative error.

Also, I think it would be good to restate my intended parallel behavior here. I want to generate regridding weights for two datasets in parallel. I do not want ESMF to do anything in parallel (or with MPI).
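
As a point of reference, the thread-lock idea could also be sketched on the user side. A minimal, untested sketch (the esmf_lock name is mine, and whether this actually avoids the VM error depends on whether ESMF objects can be created off the main thread at all) would serialize only the weight-generation step while keeping the rest of each task concurrent:

import threading

import dask
import numpy as np
import xarray as xr
import xesmf as xe

# One lock shared by every task: only the ESMF-backed Regridder construction
# is serialized; the rest of each task still runs concurrently.
esmf_lock = threading.Lock()


@dask.delayed
def regrid(tslice):
    ds = xr.tutorial.open_dataset("air_temperature").isel(time=tslice)
    ds_out = xr.Dataset(
        {
            "lat": (["lat"], np.arange(16, 75, 1.0)),
            "lon": (["lon"], np.arange(200, 330, 1.5)),
        }
    )
    with esmf_lock:
        # weight generation is the only step that calls into ESMF
        regridder = xe.Regridder(ds, ds_out, "bilinear")
    # weight application goes through scipy/sparse, so it stays outside the lock
    return regridder(ds)


tasks = [regrid(slice(0, 10)), regrid(slice(10, 20))]
dask.compute(tasks, scheduler="threads")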

I will say that my first example (the most important one) does work if called with the multiprocessing scheduler. However, the second does not:

...
/srv/conda/envs/notebook/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump()
    600     def dump(self, obj):
    601         try:
--> 602             return Pickler.dump(self, obj)
    603         except RuntimeError as e:
    604             if "recursion" in e.args[0]:

ValueError: ctypes objects containing pointers cannot be pickled
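
Presumably that is because the returned ESMF.Grid wraps ctypes pointers that cannot cross the process boundary. A hypothetical workaround sketch (make_grid_shape and the choice of returning only max_index are purely illustrative) would keep every ESMF object inside the worker process and return only plain Python/NumPy data, which would also be consistent with the first example working under the multiprocessing scheduler, since the Regridder output is ordinary xarray/NumPy data and pickles fine:

import dask
import numpy as np
import ESMF


@dask.delayed
def make_grid_shape(shape):
    # Build the grid inside the worker process, but return only picklable
    # data; the ESMF.Grid itself never crosses the process boundary.
    g = ESMF.Grid(
        np.array(shape),
        staggerloc=ESMF.StaggerLoc.CENTER,
        coord_sys=ESMF.CoordSys.SPH_DEG,
        num_peri_dims=None,
    )
    return tuple(int(n) for n in g.max_index)


tasks = [make_grid_shape((59, 87)), make_grid_shape((60, 88))]
dask.compute(tasks, scheduler='processes')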

@nwharton

nwharton commented Aug 2, 2023

In case others are having this same problem and find this thread as I did: I was getting the same error and log message (ValueError: ESMC_GridCreateNoPeriDim() failed with rc = 545. and Internal error: Bad condition - - Could not determine current VM). The problem went away after I realized I had forgotten to initialize a dask.distributed client:

from dask.distributed import Client

client = Client()

@pangeo-data pangeo-data locked and limited conversation to collaborators Sep 12, 2023
@huard huard converted this issue into discussion #307 Sep 12, 2023

