# Create masked data for countries or regions

---

This notebook creates masked data: regional averages along a 1D region dimension instead of the 2D lat/lon grid. You can run it on any of the files we have created so far. The country-level data has already been run and is available.

If you want other regions, you can download shapefiles and use those instead. Information on `regionmask` can be found here:

https://regionmask.readthedocs.io/en/stable/

In [1]:
!pip install intake-geopandas

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import cftime
import numpy as np
import xarray as xr
xr.set_options(keep_attrs=True)
import climpred
from tqdm import tqdm
import dask.array as da
import matplotlib.pyplot as plt
from matplotlib.ticker import FixedLocator
import xskillscore as xs
import regionmask
import intake
import intake_geopandas
import warnings
warnings.filterwarnings("ignore")

from dask.distributed import Client
import dask.config
dask.config.set({"array.slicing.split_large_chunks": False})

<dask.config.set at 0x2b0ec9e67b50>

In [3]:
client = Client("tcp://10.12.206.46:41051")

Choose your model, data type, and time resolution.

In [11]:
model = "ECMWF"  # OBS, ECMWF, NCEP, or ECCC
data = "climatology"  # raw, anom, or climatology
time = "biweekly"  # biweekly or daily

In [13]:
#hinda = xr.open_zarr("/glade/campaign/mmm/c3we/jaye/S2S_zarr/"+model+"."+data+".cat_edges."+time+".geospatial.zarr/", consolidated=True).astype('float32')
hinda = xr.open_zarr("/glade/campaign/mmm/c3we/jaye/S2S_zarr/"+model+"."+data+"."+time+".geospatial.zarr/", consolidated=True).astype('float32')
cat = intake.open_catalog('https://raw.githubusercontent.com/aaronspring/remote_climate_data/master/master.yaml')

In [6]:
pip install aiohttp 

Note: you may need to restart the kernel to use updated packages.


In [14]:
hinda

(xarray Dataset repr: each data variable is a float32 dask array of shape (3, 97, 121, 240), 32.24 MiB in total, stored as 97 chunks of shape (3, 1, 121, 240) and 340.31 kiB each, 98 tasks.)

I understand this is messy, but we need to rechunk differently for each type of data. Handling every case would take too many if statements, so just read my comments and uncomment the line that matches your data. Or just try multiple times until it works :)

In [15]:
#hinda = hinda.chunk({"member": "auto", "init": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #hindcast raw & anom
#hinda = hinda.chunk({"time": -1, "lat": 45, "lon": 60}).persist() #verif
hinda = hinda.chunk({"dayofyear": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #climatology for the models
#hinda = hinda.chunk({"dayofyear": -1, "lat": 45, "lon": 60}).persist() #climatology for verification
#hinda = hinda.chunk({"category_edge": -1, "dayofyear": -1, "lead": "auto", "lat": 45, "lon": 60}).persist() #cat_edges for the model
#hinda = hinda.chunk({"category_edge": -1, "dayofyear": -1, "lat": 45, "lon": 60}).persist() #cat_edges for verification
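
To see what these chunk specifications do, here is a minimal sketch on a hypothetical toy array (it needs `dask` installed, like the rest of this notebook). The array `toy` and its sizes are made up; the 121 x 240 grid is assumed to be lat x lon, matching the shape shown above. A value of `-1` keeps the whole dimension in a single chunk, an integer sets the chunk length, and `"auto"` leaves the choice to dask.

```python
import numpy as np
import xarray as xr

# Toy stand-in for a climatology: (dayofyear, lat, lon) on a 121 x 240 grid.
toy = xr.DataArray(
    np.zeros((10, 121, 240), dtype="float32"),
    dims=("dayofyear", "lat", "lon"),
    name="toy",
)

# -1 keeps "dayofyear" in one chunk; lat and lon are split into blocks.
rechunked = toy.chunk({"dayofyear": -1, "lat": 45, "lon": 60})

# Chunk lengths per dimension, in the order (dayofyear, lat, lon).
print(rechunked.chunks)
# ((10,), (45, 45, 31), (60, 60, 60, 60))
```

Naming a dimension that the data does not have (for example `lead` on verification data) raises an error, which is why the line has to match the data type.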

List the countries that are available for masking.

In [16]:
region = cat.regionmask.Countries.read()
region

<regionmask.Regions>
Name:     unnamed

Regions:
  0         Ind0                   Indonesia
  1         Mal0                    Malaysia
  2          Chi                       Chile
  3          Bol                     Bolivia
  4          Per                        Peru
..           ...                         ...
250          Mac                       Macau
251 AshandCarIsl Ashmore and Cartier Islands
252    BajNueBan             Bajo Nuevo Bank
253       SerBan             Serranilla Bank
254       ScaSho           Scarborough Shoal

[255 regions]

## Running the region mask over the data!

In [None]:
mask = region.mask(hinda, lon_name='lon',lat_name='lat')

In [None]:
var = hinda.groupby(mask).mean('stacked_lat_lon')
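
To make explicit what the masked average computes, here is a minimal sketch on a hypothetical toy grid. It uses `where` and a loop instead of `groupby` so each step is visible, and it should give the same unweighted per-region means. The names `toy` and `toy_mask` and all values are made up for illustration.

```python
import numpy as np
import xarray as xr

lat = [10.0, 20.0]
lon = [0.0, 10.0, 20.0, 30.0]

# Toy field with one extra dimension ("lead") on a 2 x 4 lat/lon grid.
toy = xr.DataArray(
    np.arange(16, dtype="float32").reshape(2, 2, 4),
    dims=("lead", "lat", "lon"),
    coords={"lead": [0, 1], "lat": lat, "lon": lon},
    name="toy",
)

# Toy mask: a region number per grid cell, NaN where no region applies.
toy_mask = xr.DataArray(
    [[0.0, 0.0, 1.0, np.nan], [0.0, 0.0, 1.0, np.nan]],
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
    name="region",
)

# Region numbers that actually occur in the mask.
region_numbers = np.unique(toy_mask.values[~np.isnan(toy_mask.values)])

# For each region, blank out all other cells and average over lat/lon.
regional_means = xr.concat(
    [toy.where(toy_mask == n).mean(("lat", "lon")) for n in region_numbers],
    dim="region",
).assign_coords(region=region_numbers)

print(regional_means)
```

The 2D lat/lon grid is replaced by a single `region` dimension, while other dimensions such as `lead` are kept.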

This function labels the `region` dimension with the region names and keeps the abbreviations and region numbers as extra coordinates.

In [None]:
def set_regionmask_labels(ds, region):
    """Set names as region label for region dimension from regionmask regions."""
    abbrevs = region[ds.region.values].abbrevs
    names = region[ds.region.values].names
    ds.coords["abbrevs"] = ("region", abbrevs)
    ds.coords["number"] = ("region", ds.region.values)
    ds["region"] = names
    return ds

var = set_regionmask_labels(var, region)
var.coords
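
The point of the relabelling is that regions can then be selected by name. Here is a small sketch on a hypothetical dataset; the variable `t2m` and its values are made up, and the two name/abbreviation pairs are taken from the region listing above.

```python
import numpy as np
import xarray as xr

# Hypothetical regional means indexed by region number.
toy = xr.Dataset(
    {"t2m": (("region", "lead"), np.array([[280.0, 281.0], [290.0, 291.0]]))},
    coords={"region": [2, 3], "lead": [14, 28]},
)

# Stand-ins for the names and abbrevs looked up from the regionmask object.
names = ["Chile", "Bolivia"]
abbrevs = ["Chi", "Bol"]

# Keep numbers and abbreviations as extra coordinates, then use names as labels.
labelled = toy.assign_coords(
    abbrevs=("region", abbrevs),
    number=("region", toy.region.values),
)
labelled = labelled.assign_coords(region=("region", names))

# Regions can now be selected by name.
chile = labelled.t2m.sel(region="Chile")
print(chile.values)
```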

In [None]:
var

Again, choose the chunking that matches your data.

In [None]:
#%time var = var.chunk({"member": -1, "init": -1, "lead": -1, "region": 1}).persist() #hindcast
#%time var = var.chunk({"member": -1, "init": -1, "lead": "auto", "region": 1}).persist() #hindcast
#%time var = var.chunk({"time": -1, "region": 1}).persist() #verif
%time var = var.chunk({"dayofyear": -1, "lead": -1, "region": 1}).persist() #climatology for the models
#%time var = var.chunk({"dayofyear": -1, "region": 1}).persist() #climatology for verification
#%time var = var.chunk({"category_edge": -1, "dayofyear": -1, "lead": -1, "region": 1}).persist() #cat_edges for the models
#%time var = var.chunk({"category_edge": -1, "dayofyear": -1, "region": 1}).persist() #cat_edges for verification

In [None]:
var

## Write out to zarr!

Or even NetCDF if you want; the data is small enough.

In [None]:
# %time var.to_zarr("/glade/campaign/mmm/c3we/jaye/S2S_zarr/"+model+"."+data+".cat_edges."+time+".country.zarr/",mode="w",consolidated=True)