# Accessing and using BARRA2 and BARPA data for research

## BARRA2

The Bureau of Meteorology Atmospheric high-resolution Regional Reanalysis for Australia - Version 2 (BARRA2) is a reanalysis from 1979 to the present day covering Australia, New Zealand and a portion of South-East Asia.

## BARPA

The Bureau of Meteorology Atmospheric Regional Projections for Australia (BARPA) delivers high-resolution dynamical downscaling of CMIP6 experiments over CORDEX-Australasia and Australian domains.

## NCI

Both of these datasets are hosted at NCI on Gadi. Additional information on these datasets can be found in NCI's documentation pages for [BARPA](https://opus.nci.org.au/pages/viewpage.action?pageId=264241161) and [BARRA2](https://opus.nci.org.au/pages/viewpage.action?pageId=264241166).


## Accessing BARRA2 & BARPA

The outputs for BARRA2 and BARPA are stored at NCI in the ob53 and py18 projects respectively.
Access to these projects can be requested in the usual way through the NCI account management pages (https://my.nci.org.au/mancini).
The files are also available for direct download from NCI's THREDDS server:
- [BARRA2](https://dap.nci.org.au/thredds/remoteCatalogService?catalog=https://dapds00.nci.org.au/thredds/catalogs/ob53/catalog.xml)
- [BARPA](https://dap.nci.org.au/thredds/remoteCatalogService?catalog=https://dapds00.nci.org.au/thredds/catalogs/py18/catalog.xml)

### List of variables

A list of the variables used by BARPA and BARRA2 can be found [here](https://github.com/joshuatorrance/barpa-barra2-amos2024/blob/main/BARRA2_BARPA_variable_list.csv).

***
This notebook covers some basic interactions with BARPA and BARRA2 data and some simple manipulations using *xarray*.

For more information on the multitude of tools available with *xarray* check out the xarray documentation:
https://docs.xarray.dev/en/stable/getting-started-guide/index.html
***

## Enter the notebook directory

To begin, navigate to the directory containing this notebook. The 'nci_ipynb' package, developed by NCI, automates this step.
For more details, see https://pypi.org/project/nci-ipynb

In [None]:
import os
import nci_ipynb
os.chdir(nci_ipynb.dir())
print(os.getcwd())

## Loading BARRA/BARPA data with Xarray & Dask

First we load the required python modules and start a dask client to speed up our computation.

In [None]:
# Imports for the notebook
import os, sys
from glob import glob
from datetime import datetime
import xarray as xr
import pandas as pd
from matplotlib import pyplot as plt
import cartopy.crs
import dask.distributed

In [None]:
# Let's explicitly start a dask client so we can check progress
# Copy and paste the dashboard link/path from this cell's output
# to the Dask tab on the left.
client = dask.distributed.Client()
client

There should now be a dask client running (click on "Launch dashboard in JupyterLab" to see the dashboard). The dask client allows for better parallelisation of xarray operations such as opening multiple files or processing large datasets.

You can see the progress of dask operations in the dask dashboard.

***
Next we will build a path to the BARPA or BARRA2 data.

In [None]:
## Data location
# Let's define the path to the files we're interested in
# BARRA2
barra_r2_root_path = "/g/data/ob53/BARRA2/output/reanalysis/AUS-11/BOM/ERA5/historical/hres/BARRA-R2/v1"

# BARPA
barpa_top_path = "/g/data/py18/BARPA/output/CMIP6/DD/AUS-15/BOM"
barpa_model = "ACCESS-CM2"
    # One BARPA model out of:
    # ACCESS-CM2, ACCESS-ESM1-5, CESM2, CMCC-ESM2, EC-Earth3, ERA5, MPI-ESM1-2-HR, NorESM2-MM
barpa_scenario = "historical"
barpa_root_path = f"{barpa_top_path}/{barpa_model}/{barpa_scenario}/*/BARPA-R/v1-r1"

### Pick the root path that interests you, either BARPA or BARRA2, comment out the other one. ###
root_path = barra_r2_root_path
#root_path = barpa_root_path

## Time resolution
# e.g. BARRA - mon, day, 3hr, 1hr
#      BARPA - mon, day, 6hr, 1hr
time_resolution = "1hr"

## Variable
# Choose the variable to look at, e.g. "ts" for surface temperature
var = "ts"

## Date (YYYYMM)
# With glob we can use wild cards to find the files we want
# e.g. "2014??" for all of 2014 or "20140[123]" for the first three months
# Note that BARRA2 data has one file per month, BARPA data has one file per year
date = "2014??"

# Build a string to use with glob
glob_str = os.path.join(root_path, time_resolution, var, "*", f"*{date}.nc")

# Passing the string to glob returns a list of matching file paths, which we sort
file_list = sorted(glob(glob_str))
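
To see how the date wildcards behave before running them against the real archive, here is a minimal sketch using Python's `fnmatch`, which follows the same pattern rules as `glob`. The file names are hypothetical and shortened for illustration; they are not real BARRA2 file names.

```python
from fnmatch import fnmatch

# Hypothetical, shortened file names (illustration only)
file_names = [
    "ts_example_201312-201312.nc",
    "ts_example_201401-201401.nc",
    "ts_example_201402-201402.nc",
    "ts_example_201403-201403.nc",
    "ts_example_201404-201404.nc",
    "ts_example_201501-201501.nc",
]


def match(date_pattern):
    """Return the file names whose trailing YYYYMM matches the date pattern."""
    return [name for name in file_names if fnmatch(name, f"*{date_pattern}.nc")]


# "?" matches exactly one character, "[123]" matches one character from the set
all_2014 = match("2014??")
first_quarter = match("20140[123]")

print(len(all_2014), "files match 2014??")
print(first_quarter)
```

With the real data it is worth checking that `file_list` is not empty before opening it; an empty list usually points to a typo in the pattern or missing access to the project.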

In [None]:
# Open the dataset

# If we have a single file we can use open_dataset
#ds = xr.open_dataset(file_list[0])

# If we have a list of files we use open_mfdataset
# Use parallel=True to take advantage of Dask's multiprocessing
ds = xr.open_mfdataset(file_list, parallel=True)

In [None]:
ds

In [None]:
ds[var].attrs

## Loading BARRA2/BARPA data with Intake-ESM

Alternatively, you can use the NCI Intake-ESM catalog files to work with the BARRA2/BARPA data collections.

For more details on the NCI indexing scheme, see https://opus.nci.org.au/display/DAE/Dataset+catalogue+Indexes+and+Intake


In [None]:
import intake

# Specify the Intake-esm catalog files.
catalog_files={"BARPA":"/g/data/dk92/catalog/v2/esm/barpa-py18/catalog.json",
              "BARRA2":"/g/data/dk92/catalog/v2/esm/barra2-ob53/catalog.json"}

dclt="BARRA2" # BARPA
data_catalog = intake.open_esm_datastore(catalog_files[dclt])

In [None]:
# The catalog keys are produced by following the naming convention in 
#/g/data/ob53/BARRA2/README.txt
#/g/data/py18/BARPA/README.txt

for col in data_catalog.df.columns:
    # Available keys for the dataset
    print("available key: ", col)
    # Available values for each key
    # print("values: ", data_catalog.df[col].unique())

In [None]:
# Now we can combine searches for a group of key-values.
query = dict(
    variable_id=["ts"],
    time_range=["2014/*"],
    freq=["1hr"],
)
catalog_subset = data_catalog.search(**query)
catalog_subset

In [None]:
# The search results can be viewed as a pandas table.
print(catalog_subset.df)

In [None]:
# You could also set keywords when loading the dataset.
dsets = catalog_subset.to_dataset_dict(
#    cdf_kwargs={'chunks':{'lat': 646, 'lon': 1082, 'time':100}}
)
dsets

In [None]:
# to_dataset_dict returns a dictionary of datasets keyed by catalog entry.
# Take the first dataset and look at the attributes of our variable.
esmds = next(iter(dsets.values()))
esmds[var].attrs

## esmloader

'esmloader' is a module included with these notebooks to simplify access to BARPA and BARRA2 datasets.

In [None]:
from esmloader import EsmCat

# Specify which data collection you want to load, i.e. BARRA2 or BARPA.
barra2=EsmCat("BARRA2")

In [None]:
# Load some BARRA data
ds = barra2.load_barra2_data("BARRA-R2", "1hr", "ts", tstart="2014010100", tend="2015010100")
ds

In [None]:
barpa=EsmCat("BARPA")
# Load some BARPA data
ds = barpa.load_barpa_data("BARPA-R", "ACCESS-CM2", "historical", "1hr", "ts", tstart="2014010100", tend="2015010100")
ds

### esmloader contains some other helpful functions
Here are a couple of them.

In [None]:
# Examine a particular variable
_ = barra2.whatis('1hr', 'pr')

In [None]:
# List the available variables
_ = barra2.list_barra2_variables('BARRA-R2', '1hr')

## Instantaneous vs. Accumulated variables
The variables used in BARPA and BARRA2 can be separated into two groups, *instantaneous* and *accumulated*.

*Instantaneous* variables give a snapshot of the underlying model state at the given time.

*Accumulated* variables give an aggregate view of a given time window (e.g. hourly mean, daily max). Accumulated variables will have an additional coordinate, 'time_bnds'.

It's important to note that time values differ between instantaneous and accumulated variables.
For instantaneous variables the time values match the time at which the snapshot was taken, e.g. 00:00, 01:00.
Accumulated variables use time values at the centre of their window, e.g. 00:30, 01:30.

It's important to keep these different time values in mind when combining variables in some way, e.g. performing arithmetic or plotting.
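
The mismatch can be illustrated without any model data. The following is a minimal sketch using only the standard library, with made-up hourly time stamps; relabelling accumulated values with the end of their window is just one possible choice for lining the two kinds of variable up, not a recommendation from the dataset documentation.

```python
from datetime import datetime, timedelta

hour = timedelta(hours=1)
start = datetime(2014, 1, 1, 0, 0)

# Instantaneous variables: snapshots on the hour (00:00, 01:00, 02:00)
inst_times = [start + i * hour for i in range(3)]

# Accumulated variables: each value covers a one-hour window (its time bounds)
# and is labelled with the centre of that window (00:30, 01:30, 02:30)
bounds = [(start + i * hour, start + (i + 1) * hour) for i in range(3)]
accum_times = [lower + (upper - lower) / 2 for lower, upper in bounds]

# The two sets of labels have nothing in common
shared = set(inst_times) & set(accum_times)
print("shared time stamps:", len(shared))

# One option: relabel each accumulated value with the end of its window,
# so it lines up with the snapshot taken at that moment
window_end = [upper for _, upper in bounds]
print(window_end[0], "matches", inst_times[1])
```

This matters in xarray because arithmetic between two arrays aligns them on their coordinate labels, and by default only labels present in both are kept.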

In [None]:
var = 'tas'
time_resolution = '1hr'

# What is this variable?
barra2.whatis(time_resolution, var)

# Take a look at the first time step
ds_inst = barra2.load_barra2_data("BARRA-R2", time_resolution, var, tstart="2014010100", tend="2015010100")
ds_inst['time'][0].data

In [None]:
var = 'tasmax'
time_resolution = '1hr'

# What is this variable?
barra2.whatis(time_resolution, var)

# Take a look at the first time step
ds_accum = barra2.load_barra2_data("BARRA-R2", time_resolution, var, tstart="2014010100", tend="2015010100")
ds_accum['time'][0].data

In [None]:
# Accumulated variables have time_bnds
ds_accum['time_bnds'][0:3].compute()

## Indexing and Plotting Data
Xarray has sophisticated indexing tools available.
There are many ways to index data with Xarray, below are a couple of examples.

See Xarray's [documentation](https://docs.xarray.dev/en/latest/user-guide/indexing.html) for more details.

Xarray datasets can be easily plotted with matplotlib.

In [None]:
var = 'ts'
ds = barra2.load_barra2_data("BARRA-R2", "1hr", var, tstart="2014010100", tend="2015010100")
ds

### First timestep

In [None]:
# Select the first timestep using the index
ds_first_timestep = ds.isel(time=0)
ds_first_timestep

In [None]:
# Alternatively select the first timestep by giving a string
# Look what happens if we don't specify said string precisely
ds.sel(time='2014-01-01')

In [None]:
# Giving the full timestamp selects only the first timestep
ds_first_timestep = ds.sel(time='2014-01-01T00:00')
ds_first_timestep

### Basic Plotting
xarray uses matplotlib to allow for quick and convenient plotting.

In [None]:
# Plot the first field

# Can only plot data arrays (not datasets)
da = ds_first_timestep[var]
da.plot()

In [None]:
# Plot the first field - with coastlines!

# Can only plot data arrays (not datasets)
da = ds_first_timestep[var]

# Build a cartopy projection so we can draw on the coastlines
centre_lon = da['lon'].mean().values
projection = cartopy.crs.PlateCarree(central_longitude=centre_lon)

# Now plot the field with the transform.
plot = da.plot(
    transform=cartopy.crs.PlateCarree(),
    subplot_kws={"projection": projection})

# Draw the coastlines using cartopy
plot.axes.coastlines()

### More indexing - Zoom in on Melbourne

In [None]:
# Select the Melbourne region using slice
melb_lat, melb_lon = -37.840935, 144.946457
width = 1.5

ds_melb = ds.sel(lat=slice(melb_lat - width/2, melb_lat + width/2),
                 lon=slice(melb_lon - width/2, melb_lon + width/2))

In [None]:
# Let's convert the temperature from Kelvin to degrees Celsius
ds_melb[var] = ds_melb[var] - 273.15

# The above arithmetic will not preserve the DataArray's attributes
# So let's copy them here and update the units.
ds_melb[var].attrs = ds[var].attrs
ds_melb[var].attrs['units'] = 'degC'

In [None]:
# Let's plot the resulting data array as we did before
# We can reuse the projection we defined earlier
da = ds_melb.isel(time=0)[var]

plot = da.plot(
    transform=cartopy.crs.PlateCarree(),
    subplot_kws={"projection": projection})

plot.axes.coastlines()

### Data Manipulation - Mean temperature in Melbourne during 2014

In [None]:
# Take our Melbourne dataset, calculate the spatial mean at each time step, then plot the result
ds_melb[var].mean(dim=['lat', 'lon']).plot()

# xarray uses matplotlib to handle the plotting
# Add a custom title to the plot using the standard matplotlib command
plt.title(f"Mean {da.attrs['long_name']} in Melbourne region ({da.attrs['units']})")

### Exercises
1. Plot the average temperature in Melbourne by time-of-day
2. Plot the min and max daily temperatures in Melbourne

In [None]:
# First get time in the local Melbourne timezone
# Use pandas to add the UTC timezone to 'time', convert it to Melbourne's timezone (AEST/AEDT), then remove the timezone again
time_melb = pd.to_datetime(ds_melb['time']).tz_localize('UTC').tz_convert('Australia/Melbourne').tz_localize(None)

# Replace time with time_melb
ds_melb_aedt = ds_melb.assign_coords(time_melb=("time", time_melb)).drop_vars('time').swap_dims({'time': 'time_melb'})

# Add the attributes for time_melb so our plots below are nicer
ds_melb_aedt['time_melb'].attrs = {'standard_name': 'time_melb', 'axis': 'T', 'long_name': 'Melbourne local time'}

ds_melb_aedt
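
The localize, convert, strip sequence used above can be tried on individual time stamps. The following is a minimal sketch with the standard library's `zoneinfo` (Python 3.9 or later, and it assumes the system time zone database is available); the helper name `to_melbourne_naive` is made up for this example. It shows that Melbourne's offset from UTC differs between summer and winter, so a full year of data mixes AEDT and AEST.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

melbourne = ZoneInfo("Australia/Melbourne")


def to_melbourne_naive(utc_naive):
    """Treat a naive datetime as UTC, convert to Melbourne time, drop the timezone."""
    aware = utc_naive.replace(tzinfo=timezone.utc)
    return aware.astimezone(melbourne).replace(tzinfo=None)


# Midnight UTC in summer (daylight saving, UTC+11) and in winter (UTC+10)
summer = to_melbourne_naive(datetime(2014, 1, 1, 0, 0))
winter = to_melbourne_naive(datetime(2014, 7, 1, 0, 0))

print(summer)  # 2014-01-01 11:00:00
print(winter)  # 2014-07-01 10:00:00
```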

#### Average temperature by time of day

In [None]:
# Now plot the mean spatial field value, averaged for each hour of the day
ds_melb_aedt[var].mean(dim=['lat', 'lon']).groupby("time_melb.hour").mean().plot()

plt.title(f'Mean {ds_melb_aedt[var].attrs["long_name"]} by time of day in Melbourne during 2014')

plt.xlabel('Hour of the day (Melbourne timezone)')
plt.ylabel(f'Mean {ds_melb_aedt[var].attrs["long_name"]} ({ds_melb_aedt[var].attrs["units"]})')

#### Min and Max Temperature

In [None]:
# Daily min and max are usually measured from 9am to 9am
# Resample our hourly data into daily data, offset to 9am
# (lowercase 'h' avoids the deprecation warning that recent pandas versions raise for 'H')
ds_melb_aedt.resample({'time_melb': '1D'}, offset='9h').max().max(dim=['lat', 'lon'])[var].plot(label='Daily Max')
ds_melb_aedt.resample({'time_melb': '1D'}, offset='9h').mean().mean(dim=['lat', 'lon'])[var].plot(label='Daily Mean')
ds_melb_aedt.resample({'time_melb': '1D'}, offset='9h').min().min(dim=['lat', 'lon'])[var].plot(label='Daily Min')

# Let's add a legend and title, etc.
plt.legend(bbox_to_anchor=(1.01, 1.0))

plt.title(f'{ds_melb_aedt[var].attrs["long_name"]} in Melbourne')

plt.xlabel(f'{ds_melb_aedt["time_melb"].attrs["long_name"]}')
plt.ylabel(f'{ds_melb_aedt[var].attrs["long_name"]} ({ds_melb_aedt[var].attrs["units"]})')
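
To make the 9am-to-9am windows concrete, here is a minimal sketch of the binning rule using only the standard library. It assumes left-closed daily windows labelled by their start, which is how the offset resampling above is expected to behave; the temperatures are made up and `window_start` is a hypothetical helper, not part of xarray.

```python
from collections import defaultdict
from datetime import datetime, timedelta

offset = timedelta(hours=9)


def window_start(t):
    """Return the start of the 9am-to-9am window that contains local time t."""
    shifted = t - offset
    midnight = shifted.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + offset


# Made-up hourly temperatures in degrees Celsius covering two days
start = datetime(2014, 1, 1, 0, 0)
temps = {start + timedelta(hours=h): 15.0 + (h % 24) * 0.5 for h in range(48)}

# Group the hourly values by window and keep the maximum of each
daily_max = defaultdict(lambda: float("-inf"))
for t, value in temps.items():
    key = window_start(t)
    daily_max[key] = max(daily_max[key], value)

for key in sorted(daily_max):
    print(key, daily_max[key])
```

Two days of hourly data fall into three windows here, and the first and last are incomplete. The same happens at either end of the year in the plot above, so the first and last daily values are based on fewer hours than the rest.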

In [None]:
client.close()