# Process gridded observation data into timeseries

## 2025/03/21

The criterion for considering regions unobserved (>10% missing data) is reasonable, but the impact of this threshold on the results should be discussed.

The data availability threshold influences our results by determining the “start year” in which observations are considered complete running into the future. This influences both the trend at any given year (since it may start earlier or later with a different availability threshold) and the envelope of internal variability (since a longer and earlier beginning trend has less internal variability). The estimate the impact of our threshold on the results, we have recalculated the start date with more (5%) and less (30%) stringent thresholds. The change in record start years is now included as a supplementary figure (Figure S??). Overall, we see that the influence of the availability threshold on the start year is small (<X years) in most regions.


__1. Process the gridded temperature data into timeseries for each observational product.__

Output is a dataArray for each model with dimensions of time and IPCC region containing a time series of the TAS variable.


Use this tool:  

https://github.com/IPCC-WG1/Atlas/blob/main/notebooks/reference-regions_Python.ipynb

For now, I will create my code for the CESM1 and MPI models so that it can be generalized easily. I can pull some code from my climatetrend_uncertainty repository (climatetrend_uncertainty/initial_code/PIC_timeseries_preproc.ipynb).

## Code!

In [3]:
import numpy as np
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
import os
import glob
import time

import xagg as xa
import geopandas as gpd
import regionmask

from dask_jobqueue import PBSCluster
from dask.distributed import Client
import dask

regionmask.__version__

%matplotlib inline

__Observational Large Ensembles.__

General directory in Nathan's scratch: /glade/scratch/lenssen/data_for_jonah/

NASA GISTEMP: GISTEMP_2x2  GISTEMP_5x5

HadCRUT5: HadCRUT5

In [4]:
gistemp_2x2_dir = '/glade/derecho/scratch/lenssen/data4jonah/GISTEMP_Ensemble_Aug/' # I don't want to use this
gistemp_dir = "/glade/derecho/scratch/lenssen/data4jonah/GISTEMP_Ensemble_Aug_5x5/"
hadcrut5_dir = '/glade/derecho/scratch/lenssen/data4jonah/HadCRUT5_Ensemble_Aug/'
dcent_unfilled_dir = "/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS/DCENT/rawdata/DCENT_ensemble/"

### Collect file paths.

#### Collect GISTEMP file paths.

In [5]:
gistemp_files = glob.glob('%s/ensembleChunk_5x5_????.nc' % gistemp_dir)
gistemp_files.sort()

#### Collect HadCRUT5 file paths.

In [6]:
hadcrut5_files = glob.glob('%s/HadCRUT.5.0.2.0.analysis.anomalies.*.nc' % hadcrut5_dir)
hadcrut5_files.sort()

#### Collect DCENT (unfilled) file paths.

In [7]:
dcent_unfilled_files = glob.glob('%s/DCENT_ensemble_1850_2023_member_???.nc' % dcent_unfilled_dir)
dcent_unfilled_files.sort()

#### BEST file path:

In [8]:
best_files = "/glade/u/home/jonahshaw/w/obs/BEST/Land_and_Ocean_LatLong1.nc"

### Load and process timeseries according to IPCC Region designations.

Mask data based on availability.

### 2. Do masking for each dataset

Variable is "tempAnom". "record" coordinate will allow for easier concatenation.

In [9]:
gistemp_tas_var = 'tas'
hadcrut5_tas_var = 'tas'
dcent_unfiled_tas_var = "temperature"
best_tas_var = "temperature"

### Loop over observation files and compute the regional means.

#### GISTEMP

In [10]:
def create_ipccregion_timeseries_xagg(
    ds_filepath:str,
    ds_var:str,
    model_str:str,
    cesm=False,
    read_wm=True,
    write_wm=True,
    new_times=None,
):
    
    '''
    Compute timeseries for all IPCC AR6 regions when given a simple model output file.
    Now using xagg to appropriately weight gridcells that fall partly within a region!
    '''
    # Load data
    ds = xr.open_dataset(ds_filepath)
    
    try:
        ds = ds.rename({"latitude":"lat", "longitude":"lon"})
    except:
        pass
    
    # Correct time if CESM
    if cesm:
        ds  = fix_cesm_time(ds)
    
    if new_times is not None:
        ds["time"] = new_times

    da = ds[ds_var]

    xagg_dir = "/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/xagg_resources"
    xa.set_options(rgrd_alg='bilinear',nan_to_zero_regridding=False)

    if (read_wm and os.path.exists(os.path.join(xagg_dir, f'wm_{model_str}'))):
        # Load weightmap
        weightmap = xa.read_wm(os.path.join(xagg_dir, f'wm_{model_str}'))
    else:
        # Load IPCC region shp file:
        ipcc_wgi_regions_shp = "IPCC-WGI-reference-regions-v4.shp"
        gdf = gpd.read_file(os.path.join(xagg_dir, ipcc_wgi_regions_shp))
                
        # Compute weights for entire grid. Assuming lat, lon, time dimension on input
        area_weights = np.cos(np.deg2rad(da.lat)).broadcast_like(da.isel(time=0).squeeze())
        
        weightmap = xa.pixel_overlaps(da, gdf, weights=area_weights)
        # Save the weightmap for later:
        if write_wm:
            weightmap.to_file(os.path.join(xagg_dir, f'wm_{model_str}'))

    # Aggregate
    with xa.set_options(silent=True):
        aggregated = xa.aggregate(da, weightmap)
    # aggregated = xa.aggregate(da, weightmap)
    
    # Convert to an xarray dataset
    aggregated_ds = aggregated.to_dataset()
    # Change xarray formatting to match previous file organization.
    aggregated_ds = aggregated_ds.set_coords(("Continent", "Type", "Name", "Acronym")).rename({"poly_idx": "RegionIndex", "Name": "RegionName", "Acronym": "RegionAbbrev"})
        
    return aggregated_ds

In [11]:
def mask_IPCC_byavailability(
    filepath: str,
    var: str,
    regions: regionmask.Regions,
    masking_threshold: float=0.9,
    ufunc=None,
    new_times=None,
    verbose=False,
):
    """
    Function for masking observational record according to data availability.
    True on the output means that region exceeds the threshold.

    Args:
        filepath (str): Path to file to load
        var (str): String identifier of variable to mask by availabiliy
        masking_threshold (float, optional): Fraction of region needed to avoid masking. Defaults to 0.9.
        ufunc (function, optional): Function to apply to the loaded DataArray. Defaults to None.

    Returns:
        region_masks (xr.DataArray): Boolean mask for each IPCC region
    """

    ds = xr.open_dataset(filepath)

    # Added for BEST, hope it works for others.
    try:
        ds = ds.rename({"latitude":"lat", "longitude":"lon"})
    except:
        pass

    da = ds[var]
    if ufunc is not None:
        da = ufunc(da)

    if new_times is not None:
        da["time"] = new_times

    mask    = regions.mask(da.isel(time=0))

    # Get unique region indices
    reg     = np.unique(mask.values)
    reg     = reg[~np.isnan(reg)]

    # for metadata: find abbreviations of all regions that were selected
    abbrevs = regions[reg].abbrevs
    names   = regions[reg].names

    # Compute weights for entire grid
    weights_spatial = np.cos(np.deg2rad(da.lat)).broadcast_like(da.isel(time=0)) # assuming 'lat' used consistently
    weights_spatiotemporal = np.cos(np.deg2rad(da.lat)).broadcast_like(da) # assuming 'lat' used consistently

    # Iterate over regions and compute a weighted time series
    region_masks_list = []
    for i,_abbrev,_name in zip(reg, abbrevs, names):
        if verbose:
            print(f"Region {i} of {len(reg)}")
        _region_weight = weights_spatial.where(mask == i).sum(dim=["lat", "lon"]) # Total weight of the region
        _available_region_weights = weights_spatiotemporal.where((mask == i) & (~np.isnan(da))).sum(dim=["lat", "lon"]) # Weight of the unmasked portion of the region
        
        _region_mask = (_available_region_weights / _region_weight) > masking_threshold
        _region_mask = _region_mask.assign_coords(RegionIndex=i).expand_dims("RegionIndex")
        _region_mask = _region_mask.assign_coords(RegionName=_name)
        _region_mask = _region_mask.assign_coords(RegionAbbrev=_abbrev)
        
        region_masks_list.append(_region_mask)
    region_masks = xr.concat(region_masks_list, dim="RegionIndex")
    region_masks.name = "mask"
    
    return region_masks


def gistemp_5x5_preproc(ds):
    
    try:
        ds = ds.rename({"latitude":"lat", "longitude":"lon"})
    except:
        pass
    
    return ds
    # return ds.rename({"latitude":"lat", "longitude":"lon"})

In [12]:
def aggregateandmask_wrapper(
    ds_filepath:str,
    save_filepath:str,
    ds_var:str,
    model_str:str,
    realization:int,
    regions: regionmask.Regions=regionmask.defined_regions.ar6.all,
    masking_threshold: float=0.9,
    ufunc=None,
    new_times=None,
    verbose=False,
):
    aggregated_ds = create_ipccregion_timeseries_xagg(
        ds_filepath=ds_filepath,
        ds_var=ds_var,
        model_str=model_str,
        new_times=new_times,
    )

    region_masks = mask_IPCC_byavailability(
        filepath=ds_filepath,
        var=ds_var,
        regions=regions,
        masking_threshold=masking_threshold,
        ufunc=ufunc,
        new_times=new_times,
        verbose=verbose,
    )

    ipcc_regions_maskedavail = aggregated_ds.where(region_masks)
    
    # Add "realization" coordinate so concatenation is easier.
    ipcc_regions_maskedavail = ipcc_regions_maskedavail.assign_coords(realization=realization).expand_dims("realization")
    ipcc_regions_maskedavail.to_netcdf(path=save_filepath)


In [13]:
save_dir = '/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS/'

#### GISTEMP

Use 5% (95% availability) threshold.

In [14]:
%%time

# gistemp_new_times = (pd.date_range("1900-01-01", freq="1M", periods=12*61)-pd.offsets.MonthBegin(1)).shift(periods=14,freq='D')
model_subdir = 'GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# New time dimension to apply to correct drift from skipping leap years. JKS test.
new_times = pd.date_range('1880-01-16', '2020-12-31', freq='1ME') + pd.tseries.offsets.Day(-15)

# Variable to select and operate over.
_ds_var = gistemp_tas_var

tasks = []



for i,_ds_filepath in enumerate(gistemp_files):

    filename = _ds_filepath.split('/')[-1]
    _outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
    
    if os.path.exists(_outfilepath):
        print('Skipping %s' % _outfilepath)
        continue

    tasks.append(dask.delayed(aggregateandmask_wrapper)(
        ds_filepath=_ds_filepath,
        save_filepath=_outfilepath,
        ds_var=_ds_var,
        model_str="GISTEMP_5x5",
        realization=i+1,
        regions=regionmask.defined_regions.ar6.all,
        masking_threshold=0.95,
        ufunc=gistemp_5x5_preproc,
        new_times=new_times,
    ))

    # if i == 2: break
# dask.compute(*tasks)
    
    

Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0001.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0002.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0003.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0004.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0005.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0006.nc
Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xag

In [18]:
# Launch a Dask cluster using PBSCluster
cluster = PBSCluster(cores    = 1,
                    memory   = '8GB',
                    queue    = 'casper',
                    walltime = '00:15:00',
                    project  = 'UCUC0007',
                    )
cluster.scale(jobs=16)
client = Client(cluster)

dask.compute(*tasks)

client.shutdown()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39817 instead


In [None]:
%%time

# gistemp_new_times = (pd.date_range("1900-01-01", freq="1M", periods=12*61)-pd.offsets.MonthBegin(1)).shift(periods=14,freq='D')
model_subdir = 'GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# New time dimension to apply to correct drift from skipping leap years. JKS test.
new_times = pd.date_range('1880-01-16', '2020-12-31', freq='1ME') + pd.tseries.offsets.Day(-15)

# Variable to select and operate over.
_ds_var = gistemp_tas_var

xagg_tasks = []
mask_tasks = []

for i,_ds_filepath in enumerate(gistemp_files):

    filename = _ds_filepath.split('/')[-1]
    _outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
    
    if os.path.exists(_outfilepath):
        print('Skipping %s' % _outfilepath)
        continue

    # xagg_tasks.append(dask.delayed(create_ipccregion_timeseries_xagg)(
    #     ds_filepath=_ds_filepath,
    #     ds_var=_ds_var,
    #     model_str="GISTEMP_5x5",
    #     cesm=False,
    #     new_times=new_times,
    # ))
    # mask_tasks.append(dask.delayed(mask_IPCC_byavailability)(
    #     filepath=_ds_filepath,
    #     var=_ds_var,
    #     regions=regionmask.defined_regions.ar6.all,
    #     masking_threshold=0.95,
    #     ufunc=gistemp_5x5_preproc,
    #     new_times=new_times,
    # ))

    xagg_region_means = create_ipccregion_timeseries_xagg(
        ds_filepath=_ds_filepath,
        ds_var=_ds_var,
        model_str="GISTEMP_5x5",
        cesm=False,
        new_times=new_times,
    )

    avail_mask = mask_IPCC_byavailability(
        filepath=_ds_filepath,
        var=_ds_var,
        regions=regionmask.defined_regions.ar6.all,
        masking_threshold=0.95,
        ufunc=gistemp_5x5_preproc,
        new_times=new_times,
    )

    ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)
    
    # Add "realization" coordinate so concatenation is easier.
    ipcc_regions_maskedavail = ipcc_regions_maskedavail.assign_coords(realization=i+1).expand_dims("realization")
    
    
    print(_outfilepath)
    
    ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)
    # if i == 2: break
    
    

Skipping /glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0001.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0002.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0003.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0004.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0005.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0006.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0007.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0008.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0009.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0010.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0011.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0012.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0013.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0014.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0015.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0016.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0017.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0018.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0019.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0020.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0021.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0022.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0023.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0024.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0025.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0026.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0027.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0028.nc




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0029.nc




Use 30% (70% availability) threshold.

In [None]:
%%time

# gistemp_new_times = (pd.date_range("1900-01-01", freq="1M", periods=12*61)-pd.offsets.MonthBegin(1)).shift(periods=14,freq='D')
model_subdir = 'GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.70'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# New time dimension to apply to correct drift from skipping leap years. JKS test.
new_times = pd.date_range('1880-01-16', '2020-12-31', freq='1ME') + pd.tseries.offsets.Day(-15)

# Variable to select and operate over.
_ds_var = gistemp_tas_var

for i,_ds_filepath in enumerate(gistemp_files):

    filename = _ds_filepath.split('/')[-1]
    _outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
    
    if os.path.exists(_outfilepath):
        print('Skipping %s' % _outfilepath)
        continue

    xagg_region_means = create_ipccregion_timeseries_xagg(
        ds_filepath=_ds_filepath,
        ds_var=_ds_var,
        model_str="GISTEMP_5x5",
        cesm=False,
        new_times=new_times,
    )

    avail_mask = mask_IPCC_byavailability(
        filepath=_ds_filepath,
        var=_ds_var,
        regions=regionmask.defined_regions.ar6.all,
        masking_threshold=0.70,
        ufunc=gistemp_5x5_preproc,
        new_times=new_times,
    )

    ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)
    
    # Add "realization" coordinate so concatenation is easier.
    ipcc_regions_maskedavail = ipcc_regions_maskedavail.assign_coords(realization=i+1).expand_dims("realization")
    
    
    print(_outfilepath)
    
    ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)
    # if i == 2: break
    
    



/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//GISTEMP_5x5/20240820/xagg_correctedtime/threshold_0.95/ensembleChunk_5x5_0001.nc
CPU times: user 2.98 s, sys: 1min 2s, total: 1min 5s
Wall time: 1min 12s


#### HadCRUT5

In [20]:
# Arbitrary function to apply

def hadcrut5_preproc(da:xr.DataArray):
    
    try:
        da = da.rename({"latitude":"lat", "longitude":"lon"})
    except:
        pass
    
    return da

Apply the 95% (5%) threshold

In [None]:
%%time

model_subdir = 'HadCRUT5/20240820/xagg/threshold_0.95/'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# Variable to select and operate over.
_ds_var = hadcrut5_tas_var

for _ds_filepath in hadcrut5_files:

    xagg_region_means = create_ipccregion_timeseries_xagg(
        ds_filepath=_ds_filepath,
        ds_var=_ds_var,
        model_str="HadCRUT5",
        cesm=False,
    )

    avail_mask = mask_IPCC_byavailability(
        filepath=_ds_filepath,
        var=_ds_var,
        regions=regionmask.defined_regions.ar6.all,
        masking_threshold=0.95,
        ufunc=hadcrut5_preproc,
    )

    ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)
    
    filename = _ds_filepath.split('/')[-1]
    _realization = filename.split(".")[-2]

    ipcc_regions_maskedavail = ipcc_regions_maskedavail.assign_coords(realization=_realization).expand_dims("realization")
    
    _outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
    print(_outfilepath)
    break
    ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)
    if i == 2: break




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//HadCRUT5/20240820/xagg/HadCRUT.5.0.2.0.analysis.anomalies.1.nc
CPU times: user 3.26 s, sys: 2.44 s, total: 5.7 s
Wall time: 10.7 s


Apply the 70% (30%) threshold

In [None]:
%%time

model_subdir = 'HadCRUT5/20240820/xagg/threshold_0.70/'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# Variable to select and operate over.
_ds_var = hadcrut5_tas_var

for _ds_filepath in hadcrut5_files:

    xagg_region_means = create_ipccregion_timeseries_xagg(
        ds_filepath=_ds_filepath,
        ds_var=_ds_var,
        model_str="HadCRUT5",
        cesm=False,
    )

    avail_mask = mask_IPCC_byavailability(
        filepath=_ds_filepath,
        var=_ds_var,
        regions=regionmask.defined_regions.ar6.all,
        masking_threshold=0.70,
        ufunc=hadcrut5_preproc,
    )

    ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)
    
    filename = _ds_filepath.split('/')[-1]
    _realization = filename.split(".")[-2]

    ipcc_regions_maskedavail = ipcc_regions_maskedavail.assign_coords(realization=_realization).expand_dims("realization")
    
    _outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
    print(_outfilepath)
    break
    ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)
    if i == 2: break




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//HadCRUT5/20240820/xagg/HadCRUT.5.0.2.0.analysis.anomalies.1.nc
CPU times: user 3.26 s, sys: 2.44 s, total: 5.7 s
Wall time: 10.7 s


#### BEST

Why this take so long?
/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//BEST/20250320/xagg/Land_and_Ocean_LatLong1.nc

Apply the 5% (95%) threshold.

_Having issues here?_  
_available_region_weights = weights_spatiotemporal.where((mask == i) & (~np.isnan(da))).sum(dim=["lat", "lon"]) # Weight of the unmasked portion of the region

In [28]:
test_ds = xr.open_dataset(_ds_filepath, chunks={"time":1})

In [29]:
test_ds

Unnamed: 0,Array,Chunk
Bytes,506.25 kiB,506.25 kiB
Shape,"(180, 360)","(180, 360)"
Dask graph,1 chunks in 2 graph layers,1 chunks in 2 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
"Array Chunk Bytes 506.25 kiB 506.25 kiB Shape (180, 360) (180, 360) Dask graph 1 chunks in 2 graph layers Data type float64 numpy.ndarray",360  180,

Unnamed: 0,Array,Chunk
Bytes,506.25 kiB,506.25 kiB
Shape,"(180, 360)","(180, 360)"
Dask graph,1 chunks in 2 graph layers,1 chunks in 2 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,519.10 MiB,9.57 kiB
Shape,"(2100, 180, 360)","(1, 35, 70)"
Dask graph,75600 chunks in 2 graph layers,75600 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 519.10 MiB 9.57 kiB Shape (2100, 180, 360) (1, 35, 70) Dask graph 75600 chunks in 2 graph layers Data type float32 numpy.ndarray",360  180  2100,

Unnamed: 0,Array,Chunk
Bytes,519.10 MiB,9.57 kiB
Shape,"(2100, 180, 360)","(1, 35, 70)"
Dask graph,75600 chunks in 2 graph layers,75600 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,2.97 MiB,2.97 MiB
Shape,"(12, 180, 360)","(12, 180, 360)"
Dask graph,1 chunks in 2 graph layers,1 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 2.97 MiB 2.97 MiB Shape (12, 180, 360) (12, 180, 360) Dask graph 1 chunks in 2 graph layers Data type float32 numpy.ndarray",360  180  12,

Unnamed: 0,Array,Chunk
Bytes,2.97 MiB,2.97 MiB
Shape,"(12, 180, 360)","(12, 180, 360)"
Dask graph,1 chunks in 2 graph layers,1 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


In [None]:
%%time

model_subdir = 'BEST/20250320/xagg/threshold_0.95/'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# Variable to select and operate over.
_ds_var = best_tas_var
_ds_filepath = best_files

# for _ds_filepath in dcent_unfilled_files:

# Print cluster dashboard link
print(cluster.dashboard_link)

time_start = time.time()
xagg_region_means = create_ipccregion_timeseries_xagg(
    ds_filepath=_ds_filepath,
    ds_var=_ds_var,
    model_str="BEST",
    cesm=False,
)

print("Time to compute xagg: %s" % (time.time() - time_start))

avail_mask = mask_IPCC_byavailability(
    filepath=_ds_filepath,
    var=_ds_var,
    regions=regionmask.defined_regions.ar6.all,
    masking_threshold=0.95,
    verbose=True,
)

print("Time to compute mask: %s" % (time.time() - time_start))

ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)

print("Time to apply mask: %s" % (time.time() - time_start))

filename = _ds_filepath.split('/')[-1]

_outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
print(_outfilepath)

ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)

print("Total time: %s" % (time.time() - time_start))

Perhaps you already have a cluster running?
Hosting the HTTP server on port 42635 instead


http://128.117.211.221:42635/status




Time to compute xagg: 84.40060353279114
Region 0.0 of 58
Region 1.0 of 58
Region 2.0 of 58
Region 3.0 of 58
Region 4.0 of 58
Region 5.0 of 58
Region 6.0 of 58
Region 7.0 of 58
Region 8.0 of 58
Region 9.0 of 58
Region 10.0 of 58
Region 11.0 of 58
Region 12.0 of 58
Region 13.0 of 58
Region 14.0 of 58
Region 15.0 of 58
Region 16.0 of 58
Region 17.0 of 58
Region 18.0 of 58
Region 19.0 of 58
Region 20.0 of 58
Region 21.0 of 58
Region 22.0 of 58
Region 23.0 of 58
Region 24.0 of 58
Region 25.0 of 58
Region 26.0 of 58
Region 27.0 of 58
Region 28.0 of 58
Region 29.0 of 58
Region 30.0 of 58
Region 31.0 of 58
Region 32.0 of 58
Region 33.0 of 58
Region 34.0 of 58
Region 35.0 of 58
Region 36.0 of 58
Region 37.0 of 58
Region 38.0 of 58
Region 39.0 of 58
Region 40.0 of 58
Region 41.0 of 58
Region 42.0 of 58
Region 43.0 of 58
Region 44.0 of 58
Region 45.0 of 58
Region 46.0 of 58
Region 47.0 of 58
Region 48.0 of 58
Region 49.0 of 58
Region 50.0 of 58
Region 51.0 of 58
Region 52.0 of 58
Region 53.0 of 5

This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.
2025-03-21 13:39:10,718 - tornado.application - ERROR - Exception in callback functools.partial(<bound method Client._send_to_scheduler_safe of <Client: 'tcp://128.117.211.221:44451' processes=0 threads=0, memory=0 B>>, {'op': 'client-releases-keys', 'keys': [('store-map-f3f190065a6955445a18d32968f64206', 52, 788)], 'client': 'Client-a2aba6d8-0673-11f0-b52b-ac1f6bc7cc9a'})
Traceback (most recent call last):
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/tornado/ioloop.py", line 750, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py", line 1452, in _send_to_scheduler_safe
    self.sched

KeyboardInterrupt: 

Task exception was never retrieved
future: <Task finished name='Task-370789' coro=<Client._gather.<locals>.wait() done, defined at /glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py:2396> exception=AllExit()>
Traceback (most recent call last):
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py", line 2405, in wait
    raise AllExit()
distributed.client.AllExit
Task exception was never retrieved
future: <Task finished name='Task-370790' coro=<Client._gather.<locals>.wait() done, defined at /glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py:2396> exception=AllExit()>
Traceback (most recent call last):
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py", line 2405, in wait
    raise AllExit()
distributed.client.AllExit
Task exception was never retrieved
future: <Task finished name='Task-370791' coro=<Client._gat

Task exception was never retrieved
future: <Task finished name='Task-370793' coro=<Client._gather.<locals>.wait() done, defined at /glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py:2396> exception=AllExit()>
Traceback (most recent call last):
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py", line 2405, in wait
    raise AllExit()
distributed.client.AllExit
Task exception was never retrieved
future: <Task finished name='Task-370794' coro=<Client._gather.<locals>.wait() done, defined at /glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py:2396> exception=AllExit()>
Traceback (most recent call last):
  File "/glade/work/jonahshaw/conda-envs/py_xagg/lib/python3.12/site-packages/distributed/client.py", line 2405, in wait
    raise AllExit()
distributed.client.AllExit
Task exception was never retrieved
future: <Task finished name='Task-370795' coro=<Client._gat

In [None]:
%%time

model_subdir = 'BEST/20250320/xagg/threshold_0.70/'

if not os.path.exists(os.path.join(save_dir,model_subdir)):
    os.makedirs(os.path.join(save_dir,model_subdir))

# Variable to select and operate over.
_ds_var = best_tas_var
_ds_filepath = best_files

# for _ds_filepath in dcent_unfilled_files:

# Launch a Dask cluster using PBSCluster
cluster = PBSCluster(cores    = 1,
                    memory   = '8GB',
                    queue    = 'casper',
                    walltime = '00:15:00',
                    project  = 'UCUC0007',
                    )
cluster.scale(jobs=16)
client = Client(cluster)

# Print cluster dashboard link
print(cluster.dashboard_link)

time_start = time.time()
xagg_region_means = create_ipccregion_timeseries_xagg(
    ds_filepath=_ds_filepath,
    ds_var=_ds_var,
    model_str="BEST",
    cesm=False,
)

print("Time to compute xagg: %s" % (time.time() - time_start))

avail_mask = mask_IPCC_byavailability(
    filepath=_ds_filepath,
    var=_ds_var,
    regions=regionmask.defined_regions.ar6.all,
    masking_threshold=0.70,
)

print("Time to compute mask: %s" % (time.time() - time_start))

ipcc_regions_maskedavail = xagg_region_means.where(avail_mask)

print("Time to apply mask: %s" % (time.time() - time_start))

filename = _ds_filepath.split('/')[-1]

_outfilepath = '%s/%s/%s' % (save_dir,model_subdir,filename)
print(_outfilepath)

ipcc_regions_maskedavail.to_netcdf(path=_outfilepath)

client.shutdown()

print("Total time: %s" % (time.time() - time_start))
del ipcc_regions_maskedavail

Perhaps you already have a cluster running?
Hosting the HTTP server on port 41057 instead


http://128.117.211.221:41057/status




/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS//BEST/20250320/xagg/threshold_0.70//Land_and_Ocean_LatLong1.nc
CPU times: user 54.2 s, sys: 1min 23s, total: 2min 17s
Wall time: 2min 31s


