# 13 NetCDF and `xarray`

In this lesson, we will get acquainted with NetCDF, a popular format for working with multidimensional datasets, and the Python package `xarray`, which is based on the NetCDF data model.

`xarray` is a Python package that augments NumPy arrays by adding labeled dimensions, coordinates, and attributes.

Because `xarray` is based on the NetCDF data model, it is a great tool for opening, processing, and creating datasets in this format.

## xarray.DataArray

The `xarray.DataArray` is the primary data structure of the `xarray` package. It is an n-dimensional array with labeled dimensions.

An `xarray.DataArray` represents a single variable in the NetCDF data format: it holds that variable's values, dimensions, coordinates, and attributes.

We will create a small `xarray.DataArray` from scratch to illustrate this.


In [1]:
# Import packages
import os
import numpy as np
import pandas as pd
import xarray as xr

### Variable values

The underlying data in the `xarray.DataArray` is a NumPy array (`numpy.ndarray`) that holds the variable values.

In [4]:
# Values of a single variable at each point of the coords 
temp_data = np.array([np.zeros((5,5)),
                     np.ones((5,5)),
                     np.ones((5,5))*2]).astype(int)

temp_data

array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]],

       [[2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2]]])

### Dimensions and Coordinates

To specify the dimensions of our upcoming `xarray.DataArray`, we must examine how we've constructed the `numpy.array` holding the temperature data. 
The first dimension is time, the second is latitude, and the third is longitude.
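
As a quick check, the shape of the NumPy array shows how many entries each axis holds. The sketch below rebuilds the same array so that it runs by itself; the names `n_time`, `n_lat`, and `n_lon` are introduced here only for illustration.

```python
import numpy as np

# Same construction as above: three 5x5 layers filled with 0, 1, and 2
temp_data = np.array([np.zeros((5, 5)),
                      np.ones((5, 5)),
                      np.ones((5, 5)) * 2]).astype(int)

# Unpack the shape: axis 0 is time, axis 1 is latitude, axis 2 is longitude
n_time, n_lat, n_lon = temp_data.shape
print(n_time, n_lat, n_lon)  # 3 5 5
```

The number of coordinate values we provide for each dimension has to match these sizes.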

From our exercises, we can also see that the coordinates (values of each dimension) are:

- time coordinates are 2022-09-01, 2022-09-02, 2022-09-03
- latitude coordinates are 70, 60, 50, 40, 30 (notice decreasing order)
- longitude coordinates are 60, 70, 80, 90, 100 (notice increasing order)

We add the dimensions as a tuple of strings and coordinates as a dictionary:

In [9]:
# Names of the dimensions in the required order
dims = ('time', 'lat', 'lon')
# Create coordinates to use for indexing along each dimension 
coords = {'time':pd.date_range("2022-09-01", "2022-09-03"),
          'lat':np.arange(70, 20, -10),
          'lon':np.arange(60, 110, 10)}  

In [6]:
coords

{'time': DatetimeIndex(['2022-09-01', '2022-09-02', '2022-09-03'], dtype='datetime64[ns]', freq='D'),
 'lat': array([70, 60, 50, 40, 30]),
 'lon': array([ 60,  70,  80,  90, 100])}

### Attributes

Next, we add the attributes (metadata) for our temperature data as a dictionary:

In [7]:
# Attributes (metadata) of the data array 
attrs = { 'title':'temperature across weather stations',
          'standard_name':'air_temperature',
          'units':'degree_c'}

### Putting it all together

Finally, we put all these pieces together (data, dimensions, coordinates, and attributes) to create an `xarray.DataArray`:

In [10]:
# Initialize xarray.DataArray
temp = xr.DataArray(data = temp_data, 
                    dims = dims,
                    coords = coords,
                    attrs = attrs)
temp
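
Each of the pieces we passed in can be read back from the object. The following is a minimal sketch that rebuilds the same array so it can run independently, then inspects its `dims`, `shape`, `values`, `coords`, and `attrs`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the small temperature array from this section
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)},
    attrs={'title': 'temperature across weather stations',
           'standard_name': 'air_temperature',
           'units': 'degree_c'})

print(temp.dims)                  # names of the dimensions, in order
print(temp.shape)                 # number of entries along each dimension
print(type(temp.values))          # the underlying NumPy array
print(temp.coords['lat'].values)  # coordinate values along 'lat'
print(temp.attrs['units'])        # one entry of the metadata dictionary
```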

We can also update the variable’s attributes after creating the object. 
Notice that each of the coordinates is also an `xarray.DataArray`, so we can add attributes to them.

In [12]:
# Update attributes
temp.attrs['description'] = 'Simple example of an xarray.DataArray'

# Add attributes to coordinates 
temp.time.attrs = {'description':'date of measurement'}

temp.lat.attrs['standard_name']= 'grid_latitude'
temp.lat.attrs['units'] = 'degree_N'

temp.lon.attrs['standard_name']= 'grid_longitude'
temp.lon.attrs['units'] = 'degree_E'
temp
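
To see that a coordinate really is an `xarray.DataArray` carrying its own attributes, we can check its type and read the attributes back. This is a small sketch on a rebuilt copy of the array (without the variable-level attributes, to keep it short).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the temperature array with its coordinates
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)})

# Attach metadata to the latitude coordinate, as in the cell above
temp.lat.attrs['units'] = 'degree_N'

# The coordinate is itself a DataArray with its own attrs dictionary
print(isinstance(temp.lat, xr.DataArray))
print(temp.lat.attrs)
```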

At this point our data consists of a single variable, so the dataset attributes and the variable attributes are the same.

## Subsetting

To select data from an `xarray.DataArray`, we need to specify the subset we want along each dimension.

We can identify each dimension either by its position or by its name.

### Example

We want to know what temperature was recorded by the weather station located at 40°N 80°E on September 1st, 2022.

**Dimension lookup by position**

When we rely on the position of the dimensions in the `xarray.DataArray`, we need to remember that time is the first dimension, latitude is the second, and longitude is the third.

Then we can access the values along each axis:

- by integers: use the integer location of the data we need to retrieve

In [13]:
# Access dimensions by position, then use integers for indexing
temp[0,3,2]

- by labels: use `.loc` with the coordinate values of the data we need to retrieve

In [14]:
# Access dimensions by position, then use labels for indexing
temp.loc['2022-09-01', 40, 80]

For datasets with many dimensions, it can be hard to keep track of their order when indexing this way. Fortunately, there is an alternative.

**Dimension lookup by name**

We can use the dimension names to subset the data, without needing to remember the order of the dimensions.

In [15]:
# Access dimensions by name, then use integers for indexing
temp.isel(time=0, lon=2, lat=3)

In [16]:
# Access dimensions by name then use labels for indexing
temp.sel(time='2022-09-01', lat=40, lon=80)

Notice that the result is still an `xarray.DataArray`, one that holds a single value and has no dimensions left. Use the `item()` method to retrieve the actual number.
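
The following sketch rebuilds the array, selects the same point by position and by name, and extracts the plain number with `item()`. The variable names `by_position`, `by_label`, and `value` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the temperature array with its coordinates
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)})

# Same point selected in two ways
by_position = temp[0, 3, 2]
by_label = temp.sel(time='2022-09-01', lat=40, lon=80)

# The selection is a DataArray with no dimensions left
print(by_position.ndim)  # 0

# item() converts the single-element DataArray into a plain Python number
value = by_label.item()
print(value)  # 0
```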

## Reduction

`xarray` has several methods to reduce an `xarray.DataArray` along one or more dimensions.

For example, we can calculate the average temperature at each station over time and obtain a new `xarray.DataArray`.

In [17]:
avg_temp = temp.mean(dim='time')
avg_temp
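
A reduction can also run over several dimensions at once, or use a different statistic. As a sketch on the same example data (the names `daily_mean` and `station_max` are introduced here only for illustration), the code below averages over all stations for each day and takes the maximum over time at each station.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the temperature array with its coordinates
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)})

# Average over both spatial dimensions: one value per day remains
daily_mean = temp.mean(dim=['lat', 'lon'])
print(daily_mean.dims)    # ('time',)
print(daily_mean.values)  # [0. 1. 2.]

# Maximum over time: one value per (lat, lon) station remains
station_max = temp.max(dim='time')
print(station_max.dims)   # ('lat', 'lon')
```

The dimensions named in `dim` are the ones that disappear from the result; all other dimensions and their coordinates are kept.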

## xarray.Dataset

An `xarray.Dataset` is an in-memory representation of a NetCDF file with multiple variables.

In [19]:
avg_temp = temp.mean(dim='time')
avg_temp.attrs = {'title': 'avg temp'}

In [22]:
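# Dictionary with variables (variable names as keys, DataArrays as values)
data_vars = {'temp': temp,
             'avg_temp': avg_temp}

# Attributes (metadata) of the whole dataset
attrs = {'title': 'temp',
         'description': 'stuff'}

# Create the xarray.Dataset
temp_dataset = xr.Dataset(data_vars=data_vars,
                          attrs=attrs)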

temp_dataset
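
Once the dataset exists, each variable can be retrieved by name as an `xarray.DataArray`, and the subsetting methods shown earlier work on it. The sketch below rebuilds a reduced version of the dataset (fewer attributes than above) so that it runs independently.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the temperature array and its time average
temp = xr.DataArray(
    data=np.array([np.zeros((5, 5)),
                   np.ones((5, 5)),
                   np.ones((5, 5)) * 2]).astype(int),
    dims=('time', 'lat', 'lon'),
    coords={'time': pd.date_range("2022-09-01", "2022-09-03"),
            'lat': np.arange(70, 20, -10),
            'lon': np.arange(60, 110, 10)})
avg_temp = temp.mean(dim='time')

# Combine both variables into one dataset
temp_dataset = xr.Dataset(data_vars={'temp': temp, 'avg_temp': avg_temp},
                          attrs={'title': 'temp'})

print(list(temp_dataset.data_vars))   # names of the variables
print(temp_dataset['avg_temp'].dims)  # ('lat', 'lon')
print(temp_dataset.attrs['title'])    # dataset-level metadata

# Variables share coordinates, so label-based selection works on each of them
print(temp_dataset['avg_temp'].sel(lat=40, lon=80).item())  # 1.0
```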