# Process gridded observation data into timeseries

## 2025/03/23

Include reanalysis in preprocessing for comparison with other data products.

The criterion for considering regions unobserved (>10% missing data) is reasonable, but the impact of this threshold on the results should be discussed.

The data availability threshold influences our results by determining the “start year” after which observations are considered complete through the end of the record. This affects both the trend at any given year (since the record may start earlier or later under a different availability threshold) and the envelope of internal variability (since a longer trend that begins earlier has less internal variability). To estimate the impact of our threshold on the results, we have recalculated the start year with more stringent (5%) and less stringent (30%) thresholds. The change in record start years is now included as a supplementary figure (Figure S??). Overall, we see that the influence of the availability threshold on the start year is small (<X years) in most regions.
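A minimal sketch of how the start year could be recomputed for each threshold, assuming a per-region series of annual missing-data fractions is already available. The function name `first_complete_year` and the toy `missing` values are hypothetical illustrations, not taken from the analysis code.

```python
import numpy as np

def first_complete_year(years, missing_frac, threshold):
    """Earliest year from which missing_frac stays <= threshold until the
    end of the record. Returns None if no such year exists."""
    ok = np.asarray(missing_frac) <= threshold
    # suffix_ok[i] is True only if year i and every later year pass.
    suffix_ok = np.logical_and.accumulate(ok[::-1])[::-1]
    if not suffix_ok.any():
        return None
    return int(np.asarray(years)[np.argmax(suffix_ok)])

# Toy example (made-up numbers): fraction of a region missing in each year.
years = np.arange(1900, 1910)
missing = np.array([0.50, 0.40, 0.25, 0.35, 0.20, 0.08, 0.12, 0.06, 0.04, 0.03])

# Compare the default 10% threshold against the 5% and 30% alternatives.
start_years = {thr: first_complete_year(years, missing, thr)
               for thr in (0.05, 0.10, 0.30)}
print(start_years)
```

Differencing the start years between thresholds, region by region, would give the kind of sensitivity summarized in the supplementary figure.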


__1. Process the gridded temperature data into timeseries for each observational product.__

Output is an xarray DataArray for each model, with dimensions of time and IPCC region, containing a time series of the TAS variable.
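To make the target output shape concrete, here is a simplified sketch of a cosine-latitude-weighted regional mean that turns a (time, lat, lon) field into a (time, region) array. It assumes plain boolean region masks, which is an assumption for illustration only: the function used later in this notebook relies on xagg to weight gridcells that fall only partly inside a region. The name `regional_means` and the toy inputs are hypothetical.

```python
import numpy as np

def regional_means(tas, lat, region_masks):
    """tas: (time, lat, lon) floats; region_masks: (region, lat, lon) bools.
    Returns a (time, region) array of cos(lat)-weighted means, skipping NaNs."""
    w = np.cos(np.deg2rad(lat))[None, :, None]          # shape (1, lat, 1)
    out = np.full((tas.shape[0], region_masks.shape[0]), np.nan)
    for r, mask in enumerate(region_masks):
        # Zero weight outside the region and wherever the data are missing.
        wr = np.where(mask[None] & np.isfinite(tas), w, 0.0)
        num = (wr * np.nan_to_num(tas)).sum(axis=(1, 2))
        den = wr.sum(axis=(1, 2))
        # Leave NaN where a region has no valid gridcells at a given time.
        out[:, r] = np.divide(num, den, out=np.full_like(num, np.nan), where=den > 0)
    return out

# Toy example: one time step, two latitudes, two longitudes, two regions.
lat = np.array([0.0, 60.0])
tas = np.array([[[1.0, 1.0], [3.0, 3.0]]])
masks = np.array([[[True, True], [True, True]],
                  [[True, True], [False, False]]])
means = regional_means(tas, lat, masks)
```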


Use this tool:  

https://github.com/IPCC-WG1/Atlas/blob/main/notebooks/reference-regions_Python.ipynb

For now, I will create my code for the CESM1 and MPI models so that it can be generalized easily. I can pull some code from my climatetrend_uncertainty repository (climatetrend_uncertainty/initial_code/PIC_timeseries_preproc.ipynb).

## Code!

In [15]:
import numpy as np
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
import os
import glob
import time

import xagg as xa
import geopandas as gpd
import regionmask

from dask_jobqueue import PBSCluster
from dask.distributed import Client
import dask

import subprocess

regionmask.__version__

%matplotlib inline

__From Adam Phillips via Nathan Lenssen:__  

Hi Nathan,  
Yes. A few months ago we started copying some of observational + reanalysis data over to /glade/campaign/cgd/cas/observations/.   
You can find monthly ERA5 t2m data here:  
/glade/campaign/cgd/cas/observations/ERA5/mon/t2m/era5.t2m.194001-202412.nc (at ~1/4 degree resolution)
and I just copied over what MERRA2 data we have to here:  
/glade/campaign/cgd/cas/observations/MERRA2/mon/T2M_MERRA2_asm_mon_198001_202012.nc (at 1/2 degree resolution)  

In [2]:
era5_datapath = "/glade/campaign/cgd/cas/observations/ERA5/mon/t2m/era5.t2m.194001-202412.nc"
merra2_datapath = "/glade/campaign/cgd/cas/observations/MERRA2/mon/T2M_MERRA2_asm_mon_198001_202012.nc"

In [3]:
era5_tas_var = 't2m'
merra2_tas_var = 'T2M'

### Load and process timeseries according to IPCC Region designations.

Mask data based on availability.

### 2. Do masking for each dataset

Variable is "tempAnom". "record" coordinate will allow for easier concatenation.

### Loop over observation files and compute the regional means.

In [24]:
def create_ipccregion_timeseries_xagg(
    ds_filepath:str,
    ds_var:str,
    model_str:str,
    cesm=False,
    read_wm=True,
    write_wm=True,
    new_times=None,
    ufunc=None,
):
    
    '''
    Compute timeseries for all IPCC AR6 regions when given a simple model output file.
    Now using xagg to appropriately weight gridcells that fall partly within a region!
    '''
    # Load data
    ds = xr.open_dataset(ds_filepath)
    
    if ufunc is not None:
        print(ufunc)
        ds = ufunc(ds)
    
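    # Standardize coordinate names, renaming only those actually present
    # (avoids a bare except that would hide unrelated errors).
    coord_renames = {"latitude": "lat", "longitude": "lon"}
    ds = ds.rename({old: new for old, new in coord_renames.items()
                    if old in ds.variables or old in ds.dims})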
    
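    # Correct time if CESM
    # NOTE: fix_cesm_time is not defined in the cells shown here; it must be
    # defined or imported before calling this function with cesm=True.
    if cesm:
        ds = fix_cesm_time(ds)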
    
    if new_times is not None:
        ds["time"] = new_times

    da = ds[ds_var]

    xagg_dir = "/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/xagg_resources"
    xa.set_options(rgrd_alg='bilinear',nan_to_zero_regridding=False)

    if (read_wm and os.path.exists(os.path.join(xagg_dir, f'wm_{model_str}'))):
        # Load weightmap
        weightmap = xa.read_wm(os.path.join(xagg_dir, f'wm_{model_str}'))
    else:
        # Load IPCC region shp file:
        ipcc_wgi_regions_shp = "IPCC-WGI-reference-regions-v4.shp"
        gdf = gpd.read_file(os.path.join(xagg_dir, ipcc_wgi_regions_shp))
                
        # Compute weights for entire grid. Assuming lat, lon, time dimension on input
        area_weights = np.cos(np.deg2rad(da.lat)).broadcast_like(da.isel(time=0).squeeze())
        
        weightmap = xa.pixel_overlaps(da, gdf, weights=area_weights)
        # Save the weightmap for later:
        if write_wm:
            weightmap.to_file(os.path.join(xagg_dir, f'wm_{model_str}'))

    # Aggregate
    with xa.set_options(silent=True):
        aggregated = xa.aggregate(da, weightmap)
    
    # Convert to an xarray dataset
    aggregated_ds = aggregated.to_dataset()
    # Change xarray formatting to match previous file organization.
    aggregated_ds = aggregated_ds.set_coords(("Continent", "Type", "Name", "Acronym")).rename({"poly_idx": "RegionIndex", "Name": "RegionName", "Acronym": "RegionAbbrev"})
        
    return aggregated_ds

In [25]:
def aggregate_wrapper(
    ds_filepath:str,
    save_filepath:str,
    ds_var:str,
    model_str:str,
    ufunc=None,
    new_times=None,
):
    aggregated_ds = create_ipccregion_timeseries_xagg(
        ds_filepath=ds_filepath,
        ds_var=ds_var,
        model_str=model_str,
        new_times=new_times,
        ufunc=ufunc,
    )

    aggregated_ds.to_netcdf(path=save_filepath)


In [None]:
# def rename_valid_time_to_time(ds: xr.Dataset) -> xr.Dataset:
#     """
#     Rename the 'valid_time' coordinate to 'time' in the input xarray Dataset.
#     """
#     return ds.rename({"valid_time": "time"})

# lambda x: x.rename({"valid_time": "time"})

In [26]:
save_dir = '/glade/u/home/jonahshaw/w/trend_uncertainty/nathan/OBS_LENS/'

### ERA5

In [27]:
model_subdir = 'ERA5/20250323/xagg/'

# Create the output directory if needed.
os.makedirs(os.path.join(save_dir, model_subdir), exist_ok=True)

_ds_var = era5_tas_var
_ds_filepath = era5_datapath

# Build the output path with os.path.join to avoid doubled slashes.
filename = os.path.basename(_ds_filepath)
_outfilepath = os.path.join(save_dir, model_subdir, filename)

tasks = []

if os.path.exists(_outfilepath):
    print('Skipping %s' % _outfilepath)

else:
    tasks.append(dask.delayed(aggregate_wrapper)(
        ds_filepath=_ds_filepath,
        save_filepath=_outfilepath,
        ds_var=_ds_var,
        model_str="ERA5",
        ufunc=lambda x: x.rename({"valid_time": "time"}),
        new_times=None,
    ))

In [28]:
# Launch a Dask cluster using PBSCluster
cluster = PBSCluster(cores    = 1,
                     memory   = '32GB',
                     queue    = 'casper',
                     walltime = '00:15:00',
                     project  = 'UCUC0007',
                     )
cluster.scale(jobs=1)
client = Client(cluster)

try:
    dask.compute(*tasks)
finally:
    # Always shut down the client, even if the computation fails.
    # (The original except branch could hit an undefined `client`.)
    client.shutdown()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39611 instead


### MERRA2

In [29]:
model_subdir = 'MERRA2/20250323/xagg/'

# Create the output directory if needed.
os.makedirs(os.path.join(save_dir, model_subdir), exist_ok=True)

_ds_var = merra2_tas_var
_ds_filepath = merra2_datapath

# Build the output path with os.path.join to avoid doubled slashes.
filename = os.path.basename(_ds_filepath)
_outfilepath = os.path.join(save_dir, model_subdir, filename)

tasks = []

if os.path.exists(_outfilepath):
    print('Skipping %s' % _outfilepath)

else:
    tasks.append(dask.delayed(aggregate_wrapper)(
        ds_filepath=_ds_filepath,
        save_filepath=_outfilepath,
        ds_var=_ds_var,
        model_str="MERRA2",
        ufunc=None,
        new_times=None,
    ))

In [None]:
# Launch a Dask cluster using PBSCluster
cluster = PBSCluster(cores    = 1,
                     memory   = '32GB',
                     queue    = 'casper',
                     walltime = '00:15:00',
                     project  = 'UCUC0007',
                     )
cluster.scale(jobs=1)
client = Client(cluster)

try:
    dask.compute(*tasks)
finally:
    # Always shut down the client, even if the computation fails.
    client.shutdown()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 46419 instead


Interpolate to 5x5 degree resolution for comparison.
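No code for this step appears in the notebook yet. The sketch below shows one possible approach under stated assumptions: because the reanalysis grids (1/4 and 1/2 degree) are much finer than 5 degrees, it uses cosine-latitude-weighted block averaging rather than point interpolation (xarray's `DataArray.interp` would be the point-interpolation alternative). The function name `coarsen_to_5deg`, the 0 to 360 longitude convention, and the bin edges are all assumptions, not the author's method.

```python
import numpy as np

def coarsen_to_5deg(field, lat, lon, target=5.0):
    """Cos(lat)-weighted block average of a regular (lat, lon) field onto
    target-degree boxes. NaN cells are skipped. Returns (coarse, lat_c, lon_c)."""
    lat, lon = np.asarray(lat), np.asarray(lon)
    lat_edges = np.arange(-90.0, 90.0 + target, target)
    lon_edges = np.arange(0.0, 360.0 + target, target)
    nlat, nlon = lat_edges.size - 1, lon_edges.size - 1

    # Assign each fine gridcell centre to a coarse box (clip handles the poles).
    ilat = np.clip(np.digitize(lat, lat_edges) - 1, 0, nlat - 1)
    ilon = np.clip(np.digitize(lon % 360.0, lon_edges) - 1, 0, nlon - 1)
    ii, jj = np.meshgrid(ilat, ilon, indexing="ij")

    valid = np.isfinite(field)
    w = np.where(valid, np.cos(np.deg2rad(lat))[:, None], 0.0)
    vals = np.where(valid, field, 0.0)

    # Accumulate weighted sums and total weights per coarse box.
    num = np.zeros((nlat, nlon))
    den = np.zeros((nlat, nlon))
    np.add.at(num, (ii, jj), w * vals)
    np.add.at(den, (ii, jj), w)

    with np.errstate(invalid="ignore", divide="ignore"):
        coarse = np.where(den > 0, num / den, np.nan)
    lat_c = 0.5 * (lat_edges[:-1] + lat_edges[1:])
    lon_c = 0.5 * (lon_edges[:-1] + lon_edges[1:])
    return coarse, lat_c, lon_c

# Toy check on a 0.5-degree grid with one missing cell.
lat = np.arange(-89.75, 90.0, 0.5)
lon = np.arange(0.25, 360.0, 0.5)
field = np.ones((lat.size, lon.size))
field[0, 0] = np.nan
coarse, lat_c, lon_c = coarsen_to_5deg(field, lat, lon)
```

The function handles a single 2D field; applying it to each time step (or vectorizing over time) would be needed for a full monthly record.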

Clean up dask worker files.

In [2]:
import subprocess
import glob

In [3]:
working_dir = "/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/"
daskworker_list = glob.glob(f"{working_dir}/dask-worker.????????")

In [4]:
for file_path in daskworker_list:
    print(file_path)
    try:
        subprocess.run(['rm', '-f', file_path], check=True)
        print(f"Removed: {file_path}")
    except subprocess.CalledProcessError as e:
        print(f"Error removing {file_path}: {e}")

/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.e4169091
Removed: /glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.e4169091
/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.o4169087
Removed: /glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.o4169087
/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.e4169103
Removed: /glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.e4169103
/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.o4169094
Removed: /glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.o4169094
/glade/u/home/jonahshaw/Scripts/git_repos/internalvar-vs-obsunc/preprocess_obs3/dask-worker.o4169097
Removed: /glade/u/home/jonahshaw/Scripts/git_repos/inte