# Python Resample/Regrid Examples

This notebook covers how to resample data onto a common grid using a few different Python packages.

----

You will need a custom jupyter kernel to run this notebook on HPC Orion. 

At the command line create a custom kernel called resample:

conda create -y --prefix /path/to/your/personal/work/dir/envs/resample -c conda-forge xarray rioxarray rasterio gdal matplotlib glob2 netcdf4 ipykernel

conda activate /path/to/your/personal/work/dir/envs/resample

python -m ipykernel install --prefix /path/to/your/personal/work/dir/envs/resample --name resample

Don't forget to add the kernel path in the kernel box when you launch Jupyter.

In [None]:
# all the packages we'll use

import rioxarray as rio
from osgeo import gdal
gdal.UseExceptions()
import numpy as np
import matplotlib.pyplot as plt
import glob
from rasterio.enums import Resampling
import xarray as xr
import re

In [None]:
# files and directories we'll use

# path to the root dir for our datasets
base_path='/work/hpc/datasets/unfao_sera/' # e.g. '/our/shared/datasets/dir/'

# dir to write output
outdir=base_path+'temporary/'

# file to use as the common grid
gridfile=base_path+'raw_data/WorldPop/ppp_2010_1km_Aggregated.tiff'
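The paths in this notebook are built by string concatenation, and the gdal section below builds the output name by slicing off the last four characters of the file name. That works for `.tif` but would split a `.tiff` name in the wrong place. Here is a minimal sketch of the same idea with `pathlib`. The helper `stdgrid_path` and the directory `/our/shared/datasets/dir` are made up for illustration and are not used elsewhere in the notebook.

```python
from pathlib import PurePosixPath

def stdgrid_path(raw_path, out_dir, tag="_stdgrid"):
    """Return out_dir/<stem><tag><suffix> for a raw data file path (hypothetical helper)."""
    raw = PurePosixPath(raw_path)
    # .stem and .suffix split on the last dot, so '.tif' and '.tiff' both work
    return str(PurePosixPath(out_dir) / f"{raw.stem}{tag}{raw.suffix}")

# made-up root directory standing in for base_path
base = PurePosixPath("/our/shared/datasets/dir")
gridfile_example = base / "raw_data" / "WorldPop" / "ppp_2010_1km_Aggregated.tiff"

print(gridfile_example)
print(stdgrid_path(base / "raw_data" / "URCA" / "URCA.tif", base / "temporary"))
```

`PurePosixPath` only manipulates path strings and never touches the file system, so this runs the same on any machine.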

## First up, gdal package

GDAL is a long-time standard for spatial resampling and more. It is often used as a command line tool, but most GDAL functions are also available in a Python package. I find GDAL confusing and its documentation hard to decipher. In the section after this one I'll show an example of a different package that uses GDAL under the hood but is, in my opinion, easier to use.

Here we use WorldPop/ppp_2010_1km_Aggregated.tiff as the common grid and resample the other datasets onto it.

First, use the rioxarray package just to look inside the data files.

In [None]:
# file to resample
rawdatafile=base_path+'raw_data/URCA/URCA.tif'

In [None]:
# get a look at what's inside the file with the common grid
pop = rio.open_rasterio(gridfile,mask_and_scale=True)
pop

In [None]:
# get a look at what's inside the file that we want to resample
urca=rio.open_rasterio(rawdatafile,mask_and_scale=True)
urca

Now use the gdal package to do the resampling.

In [None]:
# use gdal to resample
# the main functions we want are osgeo.gdal.Warp and osgeo.gdal.WarpOptions described at https://gdal.org/api/python/osgeo.gdal.html
# the other osgeo.gdal functions we use are also at that page (.Open, .GetProjection, .Info, .GetGeoTransform)

reference=gdal.Open(gridfile,0) # create a gdal object
ref_proj=reference.GetProjection() # get file projection info
minx, x_res, xskew, maxy, yskew, y_res =reference.GetGeoTransform() # get transform info
info=gdal.Info(reference,**{'format':'json'}) # get other file info into a big dictionary
maxx,miny=info['cornerCoordinates']['lowerRight'] # pull the lower right boundary extents from the info dictionary

filename=str.split(rawdatafile,'/')[-1] # pull just the filename out of the long path
outfile=outdir+filename[0:-4]+'_stdgrid'+filename[-4:] # new path/name for resampled data file
print('file to read',rawdatafile)
print('file to write',outfile)

# all available options to include in this dictionary will be described under osgeo.gdal.WarpOptions 
# this is where you include the resampling method. Default is nearest neighbor. Other options described under -r here https://gdal.org/programs/gdalwarp.html
kwargs={"format": "GTiff", 
        "xRes": x_res, 
        "yRes": y_res,
        "outputBounds":[minx,miny,maxx,maxy],
        "outputBoundsSRS":ref_proj,
        "resampleAlg":"bilinear"}

# see some of this info
print(x_res,y_res)
print(minx,miny,maxx,maxy)
print(ref_proj)
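The cell above pulls the lower right corner out of `gdal.Info`, but the same bounds can be worked out from the geotransform and the raster size. Below is a rough sketch of that arithmetic, assuming a north-up raster with no rotation. The function name `bounds_from_geotransform` and the half-degree grid are made up for illustration. With a real file I believe the column and row counts would come from `reference.RasterXSize` and `reference.RasterYSize`.

```python
def bounds_from_geotransform(gt, ncols, nrows):
    """Return (minx, miny, maxx, maxy) for a north-up raster without skew."""
    minx, x_res, xskew, maxy, yskew, y_res = gt
    if xskew != 0 or yskew != 0:
        raise ValueError("this sketch only handles rasters without rotation/skew")
    maxx = minx + ncols * x_res
    miny = maxy + nrows * y_res  # y_res is negative for north-up rasters
    return minx, miny, maxx, maxy

# made-up global half-degree grid, same tuple order as GetGeoTransform()
gt = (-180.0, 0.5, 0.0, 90.0, 0.0, -0.5)
print(bounds_from_geotransform(gt, ncols=720, nrows=360))
# (-180.0, -90.0, 180.0, 90.0)
```

This also makes it clear why `y_res` prints as a negative number: the origin is the upper left corner and rows count downward.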

In [None]:
%%time
# do the resampling and write the results to a tiff file
# pretty sure you could write to netcdf too but I've never done that with gdal
ds=gdal.Warp(outfile,rawdatafile,**kwargs)
ds=None # close the dataset so gdal flushes the output to disk before we reopen it

In [None]:
# open the new file and checkout what is inside
urca_r=rio.open_rasterio(outfile,mask_and_scale=True)
urca_r

In [None]:
# double check that the new file has the exact same grid as the reference file
# np.unique() returns the unique values of an array. If the grids are identical then the difference array should contain only a single value: 0
print(np.unique(pop.x.data-urca_r.x.data)) 
print(np.unique(pop.y.data-urca_r.y.data))

# another way to do this
# if arrays are equal this prints nothing, if arrays are not equal this throws an error
np.testing.assert_array_equal(urca_r.x.data,pop.x.data,verbose=True)
np.testing.assert_array_equal(urca_r.y.data,pop.y.data,verbose=True)
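Exact equality is a strict test. If two grids ever differ only by floating point noise, or have different lengths, it can help to check the shape first and then compare within a tolerance. Here is a small sketch of that idea. The helper `grids_match`, the tolerance, and the coordinate values are all made up for illustration.

```python
import numpy as np

def grids_match(a, b, atol=1e-9):
    """True if two 1-D coordinate arrays have the same shape and agree within atol."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # check the shape first so a length mismatch is reported as False, not an error
    return a.shape == b.shape and bool(np.allclose(a, b, rtol=0.0, atol=atol))

x_ref = np.array([-179.75, -179.25, -178.75])  # made-up coordinates
x_new = x_ref + 1e-12                          # tiny floating point offset

print(grids_match(x_ref, x_new))       # True
print(grids_match(x_ref, x_ref[:-1]))  # False
```

For the files in this notebook the exact check passes, so this is only a fallback for cases where it doesn't.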

In [None]:
# double check things visually to make sure the resample didn't do anything crazy

fig = plt.figure(figsize=(10,5))

# plot raw data
ax = fig.add_subplot(121) # 121 = 1 row of plots, 2 cols of plots, 1st plot
urca.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('URCA raw')

# plot resampled data
ax = fig.add_subplot(122) # 122 = 1 row of plots, 2 cols of plots, 2nd plot
urca_r.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('URCA resampled')

plt.tight_layout()

plt.show()

## Second, we could use rioxarray & rasterio packages

rioxarray's .reproject_match function is built on rasterio.warp.reproject(), which wraps GDAL's warping in a package with less confusing functions. With rioxarray, .reproject_match condenses all the gdal steps above into a single line of code, and the results of the resampling should be identical.

Also, it looks like there is a way to chunk the read/resample/write with rasterio's WarpedVRT. I'm not familiar with it, and these files process quickly anyway.

In [None]:
%%time
# this takes about the same amount of time as the gdal resampling in the previous section
# extra keyword arguments get passed through to rasterio.warp.reproject (not the gdal.Warp options above); we don't send any below
# reproject_match is from rioxarray package and is smart enough to get the info it needs from urca and pop
# Resampling.bilinear is from rasterio package
# notice that this doesn't write to a file though, we'll have to do that ourselves
urca_r_rio=urca.rio.reproject_match(pop,resampling=Resampling.bilinear)
urca_r_rio

In [None]:
# double check for grid differences
np.testing.assert_array_equal(urca_r_rio.x.data,pop.x.data,verbose=True)
np.testing.assert_array_equal(urca_r_rio.y.data,pop.y.data,verbose=True)

In [None]:
# double check things visually to make sure the resample didn't do anything crazy

fig = plt.figure(figsize=(10,5))

# plot raw data
ax = fig.add_subplot(121) # 121 = 1 row of plots, 2 cols of plots, 1st plot
urca.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('URCA raw')

# plot resampled data
ax = fig.add_subplot(122) # 122 = 1 row of plots, 2 cols of plots, 2nd plot
urca_r_rio.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('URCA resampled')

plt.tight_layout()

plt.show()

writing out files:

In [None]:
%%time
# how to write the resampled data to tiff
# .to_raster is from rioxarray
outfile=outdir+'urca_r_rio_stdgrid.tif'
urca_r_rio.rio.to_raster(outfile)

In [None]:
%%time
# tiled tif
outfile=outdir+'urca_r_rio_tiled_stdgrid.tif'
urca_r_rio.rio.to_raster(outfile,tiled=True,windowed=True)

In [None]:
%%time
# how to write resampled data to netcdf
# to_netcdf is from xarray

# first, name the variable and get rid of the "band" dim
varname='urca'
ds_out=urca_r_rio.squeeze().to_dataset(name=varname)
del ds_out.coords['band']

# assign variable metadata
var_attrs={'standard_name':varname,
           'long_name':'urban_rural_catchment_areas',
           'units':'none',
           'description':'URCA S0 category',
           'coordinates':'spatial_ref'}
global_attrs={'documentation':'/work/hpc/datasets/unfao_sera/raw_data/URCA/ReadMe_data_description.docx'}

ds_out[varname].attrs=var_attrs
ds_out=ds_out.assign_attrs(global_attrs)

# define encoding
y_encoding={'_FillValue':None}
x_encoding={'_FillValue':None}
var_encoding = {'dtype':'float32'}  

# write to file
outfile=outdir+'urca_r_rio_stdgrid.nc'
ds_out.to_netcdf(outfile,
                 mode='w',
                 encoding={'y':y_encoding,
                           'x':x_encoding,
                           varname:var_encoding})
ds_out


## Now let's try with the other datasets

## GAEZ

In [None]:
dataset='GAEZ'
rawdatafile=base_path+'raw_data/'+dataset+'/aez_v9v2_CRUTS32_Hist_8110_100_avg.tif'
resample_method=Resampling.nearest
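Note the switch from bilinear to nearest neighbor here. I'm assuming the GAEZ values are zone codes (categories) rather than a continuous quantity, and the land cover classes further down are too. Interpolating between category codes can invent codes that don't exist in the data. The sketch below shows this in 1-D with plain Python. It is not the GDAL implementation, just a toy linear interpolation standing in for bilinear, and the class codes are made up.

```python
def resample_1d(values, n_out, method):
    """Resample a 1-D list to n_out points over the same extent (endpoints aligned)."""
    n_in = len(values)
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        if method == "nearest":
            # pick whichever input cell is closer
            out.append(values[lo] if frac < 0.5 else values[hi])
        elif method == "linear":
            # weighted average of the two neighbours
            out.append(values[lo] * (1 - frac) + values[hi] * frac)
        else:
            raise ValueError(f"unknown method: {method}")
    return out

classes = [10, 10, 30, 30]  # made-up category codes
print(resample_1d(classes, 7, "nearest"))  # [10, 10, 10, 30, 30, 30, 30]
print(resample_1d(classes, 7, "linear"))   # [10.0, 10.0, 10.0, 20.0, 30.0, 30.0, 30.0]
```

The linear result contains 20.0, which is not one of the input codes, while nearest neighbor only ever returns codes that were already there.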

In [None]:
data=rio.open_rasterio(rawdatafile,mask_and_scale=True)
data

In [None]:
%%time
data_r=data.rio.reproject_match(pop,resampling=resample_method)
data_r


In [None]:
# double check for grid differences
np.testing.assert_array_equal(data_r.x.data,pop.x.data,verbose=True)
np.testing.assert_array_equal(data_r.y.data,pop.y.data,verbose=True)

In [None]:
# double check things visually to make sure the resample didn't do anything crazy

fig = plt.figure(figsize=(10,5))

# plot raw data
ax = fig.add_subplot(121) # 121 = 1 row of plots, 2 cols of plots, 1st plot
data.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('GAEZ raw')

# plot resampled data
ax = fig.add_subplot(122) # 122 = 1 row of plots, 2 cols of plots, 2nd plot
data_r.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)
plt.title('GAEZ resampled')

plt.tight_layout()

plt.show()

In [None]:
%%time
# how to write the resampled data to tiff
# .to_raster is from rioxarray
outfile=outdir+'gaez_stdgrid_8110.tif'
data_r.rio.to_raster(outfile)

## ESA landcover

In [None]:
dataset='ERA5'
data_dir=base_path+'raw_data/'+dataset+'/dataset-satellite-land-cover/'
resample_method=Resampling.nearest

# looks like some of these files may have gotten corrupted
# ncdump at command line indicates these files are ok: 2010,2013,2014,2015,2020
# ncdump at command line indicates these files are broken: 2011,2012,2016,2017,2018,2019
# this is the full list of files but we'll start with just a single file, the last one (2010)
# note: glob.glob returns files in arbitrary order, so check which file actually ends up last
rawdatafiles=glob.glob(data_dir+'*.nc')
rawdatafiles
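Since we already know from ncdump which years are readable, another option is to pick files by year instead of by position in the list. Here is a minimal sketch, assuming every file name contains `P1Y-` followed by a four digit year (the same pattern used to name the output files later on). The helpers `year_of` and `keep_good` and the example file names are made up.

```python
import re

# years that ncdump reported as ok in the cell above
GOOD_YEARS = {"2010", "2013", "2014", "2015", "2020"}

def year_of(path):
    """Return the 4-digit year following 'P1Y-' in a file name, or None."""
    match = re.search(r"P1Y-(\d{4})", path)
    return match.group(1) if match else None

def keep_good(paths, good_years=GOOD_YEARS):
    """Keep only paths whose year is in good_years, sorted by year."""
    return sorted((p for p in paths if year_of(p) in good_years), key=year_of)

# made-up file names that follow the assumed pattern
example = ["lc-P1Y-2016-v2.nc", "lc-P1Y-2010-v2.nc", "lc-P1Y-2013-v2.nc", "notes.txt"]
print(keep_good(example))
# ['lc-P1Y-2010-v2.nc', 'lc-P1Y-2013-v2.nc']
```

Sorting by year also makes the processing order repeatable, which the raw glob output is not.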

In [None]:
# for netcdf open with xarray instead of rioxarray
f=rawdatafiles[-1] # try one file first
ds=xr.open_dataset(f)
ds

In [None]:
# assign the crs and transform to the land cover class variable
lc=ds.lccs_class # pull variable from dataset
lc.rio.write_crs(ds.crs.attrs['wkt'],inplace=True) # write crs in dataset to variable
lc.rio.write_transform(inplace=True) # calculate the transform (double check it's the same and in the ds)
lc

In [None]:
%%time
# notice since .reproject_match is matching the grid in pop that
# it will change the labels of lc from (lat,lon) to (y,x)
# as well as take all the attributes (names, units, etc) of y,x from pop as well
lc_r=lc.rio.reproject_match(pop,resampling=resample_method)
lc_r

In [None]:
# double check for grid differences
np.testing.assert_array_equal(lc_r.x.data,pop.x.data,verbose=True)
np.testing.assert_array_equal(lc_r.y.data,pop.y.data,verbose=True)

In [None]:
# double check things visually to make sure the resample didn't do anything crazy

fig = plt.figure(figsize=(10,5))

# plot raw data
ax = fig.add_subplot(121) # 121 = 1 row of plots, 2 cols of plots, 1st plot
lc.sel(lat=slice(60,30),lon=slice(-90,-60)).plot(ax=ax) # use lat,lon here, that's what's in the raw data
plt.title('ESA landcover raw')

# plot resampled data
ax = fig.add_subplot(122) # 122 = 1 row of plots, 2 cols of plots, 2nd plot
lc_r.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)  # use y,x here, that's what is written to our resampled variable
plt.title('ESA landcover resampled')

plt.tight_layout()

plt.show()

In [None]:
%%time
# how to write the resampled data to tiff
# .to_raster is from rioxarray

# first get the 4-digit year string that follows 'P1Y-' in the filename
yyyy=re.search(r'P1Y-(\d{4})',f).group(1)

outfile=outdir+'land_cover_lccs_stdgrid_'+yyyy+'.tif'
lc_r.rio.to_raster(outfile)

Process the rest of the landcover files that aren't corrupted.

In [None]:
for fi,f in enumerate(rawdatafiles[0:-2]):
    print('processing file',fi+1,'of',len(rawdatafiles[0:-2]),':',f)
    try:
        with xr.open_dataset(f) as ds:
            print('...resampling...')
            lc=ds.lccs_class # pull variable from dataset
            lc.rio.write_crs(ds.crs.attrs['wkt'],inplace=True) # write crs in dataset to variable
            lc.rio.write_transform(inplace=True) # calculate the transform (double check it's the same and in the ds)
            lc_r=lc.rio.reproject_match(pop,resampling=resample_method) # resample

            # double check for grid differences
            np.testing.assert_array_equal(lc_r.x.data,pop.x.data,verbose=True)
            np.testing.assert_array_equal(lc_r.y.data,pop.y.data,verbose=True)
                        
            # double check things visually to make sure the resample didn't do anything crazy
            print('...plotting...')
            fig = plt.figure(figsize=(10,5))
            # plot raw data
            ax = fig.add_subplot(121) # 121 = 1 row of plots, 2 cols of plots, 1st plot
            lc.sel(lat=slice(60,30),lon=slice(-90,-60)).plot(ax=ax) # use lat,lon here, that's what's in the raw data
            plt.title('ESA landcover raw')
            # plot resampled data
            ax = fig.add_subplot(122) # 122 = 1 row of plots, 2 cols of plots, 2nd plot
            lc_r.sel(y=slice(60,30),x=slice(-90,-60)).plot(ax=ax)  # use y,x here, that's what is written to our resampled variable
            plt.title('ESA landcover resampled')
            plt.tight_layout()
            
            print('...writing tif...')
            yyyy=re.search(r'P1Y-(\d{4})',f).group(1)
            outfile=outdir+'land_cover_lccs_stdgrid_'+yyyy+'.tif'
            lc_r.rio.to_raster(outfile)
            
            plt.show()
    except Exception as e: # a bare except would also swallow KeyboardInterrupt
        print('problem with file:',e)
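With several files in a loop it is easy to lose track of which ones failed. One way to handle that, sketched below, is to collect the failures and look at them at the end. The names `process_all` and `fake_process` are made up, and `fake_process` is only a stand-in for the open/resample/write steps in the loop above.

```python
def process_all(paths, process_one):
    """Run process_one(path) on each path; return (succeeded, failed)."""
    succeeded, failed = [], []
    for i, path in enumerate(paths, start=1):
        print(f"processing file {i} of {len(paths)}: {path}")
        try:
            process_one(path)
        except Exception as err:  # broad on purpose: keep going after any per-file failure
            failed.append((path, repr(err)))
        else:
            succeeded.append(path)
    return succeeded, failed

def fake_process(path):
    """Stand-in for the real work; fails on a made-up 'broken' file name."""
    if "broken" in path:
        raise OSError("could not read file")

ok, bad = process_all(["a.nc", "broken.nc", "c.nc"], fake_process)
print(ok)   # ['a.nc', 'c.nc']
print(bad)  # [('broken.nc', "OSError('could not read file')")]
```

Keeping the error text next to the file name makes it easier to tell a corrupted file apart from, say, a failed grid check.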