# Access CMIP data using Xarray

In this notebook:

* Launch Jupyter Notebook 
* What is Xarray
* Remote vs. direct filesystem access 
* File Variables and Attributes

    
---------


### Launch the Jupyter Notebook application


Launch the Jupyter Notebook application:
```
    $ jupyter notebook
``` 

<div class="alert alert-info">
<b>NOTE: </b> This will launch the <b>Notebook Dashboard</b> within a new web browser window. 
</div>

<br>



### Xarray

Xarray builds on and extends the strengths of pandas and NumPy. NumPy provides the core structure for working with multi-dimensional arrays, while pandas contributes its labelled indexing and DataFrame-style capabilities. Xarray is actively developed by the climate science community and is a useful tool for analysis. For more information on current development (along with other related projects), see the Pangeo community: https://pangeo.io/
 
We will use xarray to open the CMIP5 file defined below. Opening a file with xarray creates an `xarray.Dataset`, which is a collection of multiple variables. A `DataArray`, on the other hand, is a single multi-dimensional variable together with its coordinates.
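
To make the distinction concrete, here is a minimal sketch that builds a small in-memory Dataset rather than opening a file. The values, the grid, and the second variable `tasmin` are made up for illustration and are not taken from the CMIP5 file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a small monthly temperature file (values are made up).
time = pd.date_range("2000-01-01", periods=3, freq="MS")
lat = np.array([-45.0, 0.0, 45.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

ds = xr.Dataset(
    data_vars={
        "tasmax": (("time", "lat", "lon"), np.full((3, 3, 4), 290.0, dtype="float32")),
        "tasmin": (("time", "lat", "lon"), np.full((3, 3, 4), 280.0, dtype="float32")),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# A Dataset holds several variables that share coordinates.
print(list(ds.data_vars))

# Selecting one variable returns a DataArray that carries its own coordinates.
da = ds["tasmax"]
print(type(da).__name__, da.dims, da.shape)
```

The same pattern applies to a Dataset returned by `xr.open_dataset`: indexing by variable name gives a DataArray.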
 
xarray loads netCDF data 'lazily': data can be manipulated, sliced and subset without loading array values into memory. Values are read into memory only when `load()` is called or when a computation is performed on the data.
 
xarray is designed for use with multidimensional datasets and is particularly useful for climate data on multidimensional grids with dimensions such as lat, lon, depth and time.

#### Import the xarray and netCDF modules

In [1]:
import xarray as xr
import netCDF4 as nc
%matplotlib inline

### Remote vs. direct filesystem access

In this example, we will use a file from the CMIP5 Australian Published data collection, specifically the monthly historical tasmax data:

    /g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc
    

and we are going to compare direct vs. remote access. Timings (using the `%%time` magic function) will also be shown to help illustrate when it can be useful to conduct analysis on the filesystem.
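
The `%%time` magic only works inside IPython or Jupyter. As a rough sketch of how similar wall-clock timings could be collected in a plain Python script, the hypothetical helper `timed` below wraps `time.perf_counter`. It reports wall time only, not the CPU times that `%%time` also prints, and the summed range is just a stand-in for a call such as `xr.open_dataset(path)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the `with` block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example", timings):
    total = sum(range(100_000))  # stand-in for xr.open_dataset(path)

print(f"example: {timings['example']:.4f} s")
```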

#### Local path on /g/data

In [2]:
path = '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/\
Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### OPeNDAP Data URL

For more information on where to find OPeNDAP URLs, see:
<a href="https://nbviewer.jupyter.org/github/nci-training/readthedoc_NCI_data_training/blob/master/docs/_notebook/TDS/tds_OPeNDAP_cmip5.ipynb">THREDDS Data Server: Data Access</a>


In [3]:
url = 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/\
Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### Open the file, comparing the time taken on the local filesystem and via the remote URL

In [4]:
%%time
f1 = xr.open_dataset(path)

CPU times: user 765 ms, sys: 61 ms, total: 826 ms
Wall time: 7.67 s


In [5]:
%%time
f2 = xr.open_dataset(url)

CPU times: user 734 ms, sys: 17 ms, total: 751 ms
Wall time: 7.23 s


#### The times differ little because the data is loaded lazily. But if we force the data to load into memory:

In [6]:
%%time
f1 = xr.open_dataset(path)
tasmax = f1.tasmax
tasmax.load()

CPU times: user 1.11 s, sys: 305 ms, total: 1.41 s
Wall time: 3.15 s


<xarray.DataArray 'tasmax' (time: 1872, lat: 145, lon: 192)>
array([[[242.22649, 242.22649, ..., 242.22028, 242.22028],
        [243.91931, 243.87326, ..., 244.00957, 243.96425],
        ...,
        [241.69199, 241.74554, ..., 241.68051, 241.69809],
        [240.87485, 240.87485, ..., 240.87485, 240.87485]],

       [[235.21791, 235.21791, ..., 235.2146 , 235.2146 ],
        [237.052  , 237.029  , ..., 237.13855, 237.09853],
        ...,
        [238.93102, 238.94348, ..., 238.8793 , 238.91093],
        [236.43845, 236.43845, ..., 236.43845, 236.43845]],

       ...,

       [[240.10124, 240.10124, ..., 240.09166, 240.09166],
        [241.85622, 241.80452, ..., 241.93628, 241.88435],
        ...,
        [260.00974, 260.02213, ..., 259.99246, 260.0092 ],
        [258.00223, 258.00223, ..., 258.00223, 258.00223]],

       [[247.96088, 247.96088, ..., 247.93704, 247.93704],
        [248.94565, 248.89906, ..., 249.03673, 248.98915],
        ...,
        [249.8625 , 249.90404, ..., 249.77

In [7]:
%%time
f2 = xr.open_dataset(url)
tasmax = f2.tasmax
tasmax.load()

CPU times: user 2.23 s, sys: 621 ms, total: 2.85 s
Wall time: 4.59 s


<xarray.DataArray 'tasmax' (time: 1872, lat: 145, lon: 192)>
array([[[242.22649, 242.22649, ..., 242.22028, 242.22028],
        [243.91931, 243.87326, ..., 244.00957, 243.96425],
        ...,
        [241.69199, 241.74554, ..., 241.68051, 241.69809],
        [240.87485, 240.87485, ..., 240.87485, 240.87485]],

       [[235.21791, 235.21791, ..., 235.2146 , 235.2146 ],
        [237.052  , 237.029  , ..., 237.13855, 237.09853],
        ...,
        [238.93102, 238.94348, ..., 238.8793 , 238.91093],
        [236.43845, 236.43845, ..., 236.43845, 236.43845]],

       ...,

       [[240.10124, 240.10124, ..., 240.09166, 240.09166],
        [241.85622, 241.80452, ..., 241.93628, 241.88435],
        ...,
        [260.00974, 260.02213, ..., 259.99246, 260.0092 ],
        [258.00223, 258.00223, ..., 258.00223, 258.00223]],

       [[247.96088, 247.96088, ..., 247.93704, 247.93704],
        [248.94565, 248.89906, ..., 249.03673, 248.98915],
        ...,
        [249.8625 , 249.90404, ..., 249.77

<div class="alert alert-info">
One big advantage of working directly on the filesystem is that data access is faster. For modest subsets the difference is quite small, but as you work with larger data, remote access can become much slower or even exceed the memory limits of NCI's THREDDS Data Server. </div>
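
One way to keep remote requests modest is to select a subset before calling `load()`. The sketch below uses a synthetic array on the same 145 x 192 grid as the file above, with only 24 monthly steps so it stays small. The region bounds and the one-year time window are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic array on the same 145 x 192 grid as the file above, with only
# 24 monthly steps so the example stays small (values are placeholders).
time = pd.date_range("1850-01-01", periods=24, freq="MS")
lat = np.linspace(-90.0, 90.0, 145)
lon = np.arange(192) * 1.875
tasmax = xr.DataArray(
    np.zeros((24, 145, 192), dtype="float32"),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="tasmax",
)

# Select one year over an illustrative region before loading anything.
subset = tasmax.sel(
    time=slice("1850-01", "1850-12"),
    lat=slice(-45, -10),
    lon=slice(110, 155),
)

print(f"full:   {tasmax.nbytes / 1e6:.2f} MB")
print(f"subset: {subset.nbytes / 1e6:.2f} MB")

# With a file-backed or OPeNDAP-backed variable, calling .load() on the
# subset reads only the selected values.
subset = subset.load()
```

Applied to `f2.tasmax` from the OPeNDAP URL, the same `sel(...)` followed by `load()` would request only the selected region and period rather than the full array.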

### File variables and attributes

With xarray, you can easily view the variables and attributes contained in the file by printing the Dataset:

In [8]:
f1 = xr.open_dataset(path)
print(f1)

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 145, lon: 192, time: 1872)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tasmax     (time, lat, lon) float32 ...
Attributes:
    institution:            CSIRO (Commonwealth Scientific and Industrial Res...
    institute_id:           CSIRO-BOM
    experiment_id:          historical
    source:                 ACCESS1.3 2011. Atmosphere: AGCM v1.0 (N96 grid-p...
    model_id:               ACCESS1.3
    forcing:                GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4...
    parent_experiment_id:   piControl
    parent_exper

### Dataset and DataArray

Above, we opened the Dataset and can see the multiple variables included in the file. If we look at a specific variable, such as tasmax, we get an `xarray.DataArray` with its coordinates.

In [9]:
f1 = xr.open_dataset(path)
print(f1.tasmax)

<xarray.DataArray 'tasmax' (time: 1872, lat: 145, lon: 192)>
[52116480 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    comment:           monthly mean of the daily-maximum near-surface air tem...
    units:             K
    cell_methods:      time: maximum within days time: mean over days
    cell_measures:     area: areacella
    history:           2013-03-25T05:03:31Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...


#### Print an attribute
The attributes of a variable can be accessed using the `.<attribute>` syntax. For example, to print the units of tasmax:

In [10]:
f1.tasmax.units

'K'
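
Attributes are also available through the `.attrs` dictionary, which can be more convenient when an attribute name is not a valid Python identifier or might be missing. The sketch below uses a small synthetic variable carrying two of the attributes shown above. The array values are placeholders.

```python
import numpy as np
import xarray as xr

# Synthetic variable carrying two of the attributes shown above.
tasmax = xr.DataArray(
    np.zeros((2, 2), dtype="float32"),
    dims=("lat", "lon"),
    name="tasmax",
    attrs={"units": "K", "standard_name": "air_temperature"},
)

print(tasmax.units)           # attribute-style access
print(tasmax.attrs["units"])  # dictionary-style access, same value

# .attrs is an ordinary dictionary, so a missing key can be handled safely.
print(tasmax.attrs.get("cell_methods", "not set"))
```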