# NetCDF files

NetCDF is a binary storage format for many kinds of array-oriented (gridded) data. Examples include atmosphere and ocean model output, satellite images, and time series data. NetCDF files are designed to be machine independent, and a dataset can be queried quickly by random access, without reading the whole file. More information about NetCDF files can be found [here](http://www.unidata.ucar.edu/software/netcdf/). The [CF conventions](http://cfconventions.org) are commonly used for storing NetCDF data from earth system models, so that programs can identify the coordinate axes of the data cubes.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import cartopy
import cmocean.cm as cmo

import netCDF4

### Sea surface temperature example

An example NetCDF file containing monthly means of sea surface temperature over about 160 years can be found [here](http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.ersst.v4.html). We'll use the `netCDF4` package to read this file, which has already been saved into the `data` directory.

In [None]:
nc = netCDF4.Dataset('../data/sst.mnmean.v4.nc')
print(nc)

The representation of the object shows some of the attributes of the NetCDF file. The final few lines show the dimensions and the variable names (with corresponding dimensions). A similar, more detailed view of the file comes from running the `ncdump` command with the `-h` (header only) flag at a command-line prompt, not within Python:

    $ ncdump -h ../data/sst.mnmean.v4.nc
     
    netcdf sst.mnmean.v4 {
    dimensions:
        lon = 180 ;
        lat = 89 ;
        nbnds = 2 ;
        time = UNLIMITED ; // (1946 currently)
    variables:
        float lat(lat) ;
            lat:units = "degrees_north" ;
            lat:long_name = "Latitude" ;
            lat:actual_range = 88.f, -88.f ;
            lat:standard_name = "latitude" ;
            lat:axis = "Y" ;
            lat:coordinate_defines = "center" ;
        float lon(lon) ;
            lon:units = "degrees_east" ;
            lon:long_name = "Longitude" ;
            lon:actual_range = 0.f, 358.f ;
            lon:standard_name = "longitude" ;
            lon:axis = "X" ;
            lon:coordinate_defines = "center" ;
        double time_bnds(time, nbnds) ;
            time_bnds:long_name = "Time Boundaries" ;
        double time(time) ;
            time:units = "days since 1800-1-1 00:00:00" ;
            time:long_name = "Time" ;
            time:delta_t = "0000-01-00 00:00:00" ;
            time:avg_period = "0000-01-00 00:00:00" ;
            time:prev_avg_period = "0000-00-07 00:00:00" ;
            time:standard_name = "time" ;
            time:axis = "T" ;
            time:actual_range = 19723., 78923. ;
        float sst(time, lat, lon) ;
            sst:long_name = "Monthly Means of Sea Surface Temperature" ;
            sst:units = "degC" ;
            sst:var_desc = "Sea Surface Temperature" ;
            sst:level_desc = "Surface" ;
            sst:statistic = "Mean" ;
            sst:missing_value = -9.96921e+36f ;
            sst:actual_range = -1.8f, 33.95f ;
            sst:valid_range = -5.f, 40.f ;
            sst:dataset = "NOAA Extended Reconstructed SST V4" ;
            sst:parent_stat = "Individual Values" ;

    // global attributes:
            :history = "created 10/2014 by CAS using NCDC\'s ERSST V4 ascii values" ;
    [....and so on....]

### Mapping the NetCDF file to the Python object

We can query the data within the NetCDF file using the NetCDF object. The structure of the object (the composition of the methods and attributes) is designed to mirror the data structure in the file. See how these queries give the same information as the textual representation above.

In [None]:
# `Global` attributes of the file
nc.history

In [None]:
# Variables are stored in a dictionary
nc.variables['lon']  # this is a variable object, just a pointer to the variable. NO DATA HAS BEEN LOADED!

In [None]:
# Variable objects also have attributes
nc.variables['lon'].units

In [None]:
# we can also query the dimensions
nc.dimensions['lon']

In [None]:
# to find the length of a dimension, do
len(nc.dimensions['lon'])

In [None]:
# A list of the dimensions can be found by looking at the keys in the dimensions dictionary
nc.dimensions.keys()

In [None]:
# Same for variables
nc.variables.keys()

In [None]:
# Let's take a look at the main 3D variable
nc['sst'] # A shorthand for nc.variables['sst']

---
### *Exercise*

> Inspect the NetCDF object. 

>  1. What are the units of the time variable?
>  1. What are the dimensions of the latitude variable?
>  1. What is the length of the latitude dimension?

---

In [None]:
# We can extract data from the file by indexing:
lon = nc['lon'][:]
lat = nc['lat'][:]
sst = nc['sst'][0]   # same as nc['sst'][0, :, :], gets the first 2D time slice in the series.
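
Land points in `sst` have no temperature, and the file flags them with the `missing_value` attribute shown in the `ncdump` listing. By default, `netCDF4` should return such values as a masked array, so that they are skipped in calculations and plots. The sketch below mimics that behavior with NumPy alone; the tiny `raw` array is made up for illustration and is not read from the file.

```python
import numpy as np

# Value taken from sst:missing_value in the ncdump listing
MISSING = -9.96921e+36

# Made-up 2x2 "temperature" field with two missing (land) points
raw = np.array([[MISSING, 10.0],
                [20.0, MISSING]], dtype='f4')

# Mask every element that matches the missing value
masked = np.ma.masked_values(raw, MISSING)

print(masked.count())        # number of unmasked (valid) points
print(float(masked.mean()))  # mean over valid points only
```

Masked points are ignored by reductions such as `mean`, which is why the ocean-only statistics come out sensible without any extra filtering.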

In [None]:
# Extract the time variable using the convenient num2date, which converts from time numbers to datetime objects
time = netCDF4.num2date(nc['time'][:], nc['time'].units)
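
For a standard calendar, the conversion that `num2date` performs amounts to adding an offset to the reference date named in the units string. Here is a minimal sketch with the standard library only, using the `time:units` and `time:actual_range` values from the `ncdump` listing. It assumes units of days and the ordinary Gregorian calendar; `num2date` also handles other units and calendars, which this sketch does not.

```python
from datetime import datetime, timedelta

# Reference date from time:units = "days since 1800-1-1 00:00:00"
origin = datetime(1800, 1, 1)

# time:actual_range = 19723., 78923. (days since the reference date)
first = origin + timedelta(days=19723)
last = origin + timedelta(days=78923)

print(first.isoformat())  # 1854-01-01T00:00:00
print(last.isoformat())   # 2016-02-01T00:00:00
```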

In [None]:
sst.shape

In [None]:
proj = cartopy.crs.Robinson(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.add_feature(cartopy.feature.LAND, facecolor='0.9')
mappable = ax.contourf(lon, lat, sst, cmap=cmo.thermal, transform=cartopy.crs.PlateCarree())
ax.set_title(time[0].isoformat())
fig.colorbar(mappable).set_label(r'Sea Surface Temperature [$^\circ$C]')

### THREDDS example. Loading data from a remote dataset.

The netCDF library can be compiled with OPeNDAP support (sometimes described as 'THREDDS enabled'), which means that you can pass a URL instead of a filename. This allows access to large remote datasets without having to download the entire file. You can find a large list of datasets served via an OPeNDAP/THREDDS server [here](http://apdrc.soest.hawaii.edu/data/data.php).

Let's look at the ESRL/NOAA 20th Century Reanalysis – Version 2. You can access the data with the following link (this is the URL of the `.dds` and `.das` files, without the extension):

In [None]:
nc_cprat = netCDF4.Dataset('http://apdrc.soest.hawaii.edu/dods/public_data/Reanalysis_Data/esrl/daily/monolevel/cprat')

In [None]:
nc_cprat['cprat'].long_name

In [None]:
time = netCDF4.num2date(nc_cprat['time'][:], nc_cprat['time'].units)

In [None]:
cprat = nc_cprat['cprat'][-1]   # get the last time, datetime.datetime(2012, 12, 31, 0, 0)
lon = nc_cprat['lon'][:]
lat = nc_cprat['lat'][:]

In [None]:
proj = cartopy.crs.Robinson(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.coastlines(linewidth=0.25)
mappable = ax.contourf(lon, lat, cprat, 20, cmap=cmo.tempo, transform=cartopy.crs.PlateCarree())
ax.set_title(time[-1].isoformat()[:10])
fig.colorbar(mappable).set_label('%s' % nc_cprat['cprat'].long_name)

---
### *Exercise*

> Pick another [variable](http://apdrc.soest.hawaii.edu/dods/public_data/Reanalysis_Data/esrl/daily/monolevel) from this dataset. Inspect and plot the variable in a similar manner to precipitation.

> Find another dataset on a THREDDS server at SOEST (or elsewhere), pick a variable, and plot it.

---

### Creating NetCDF files

We can also create a NetCDF file to store data.

In [None]:
from matplotlib import tri

Ndatapoints = 1000
Ntimes = 20
Nbad = 200

xdata = np.random.rand(Ndatapoints)
ydata = np.random.rand(Ndatapoints)
time = np.arange(Ntimes)

# create a progressive wave
fdata = np.sin((xdata+ydata)[np.newaxis, :]*5.0 + 
               time[:, np.newaxis]/3.0)

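# remove some random 'bad' data.
# np.random.permutation returns shuffled flat indices; a Python 3 range
# object cannot be shuffled in place by np.random.shuffle.
idx = np.random.permutation(fdata.size)
fdata.flat[idx[:Nbad]] = np.nan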

ygrid, xgrid = np.mgrid[0:1:60j, 0:1:50j]
fgrid = np.ma.empty((Ntimes, 60, 50), 'd')

# interpolate
for n in range(Ntimes):
    igood = ~np.isnan(fdata[n])
    t = tri.Triangulation(xdata[igood], ydata[igood])
    interp = tri.LinearTriInterpolator(t, fdata[n][igood])
    fgrid[n] = interp(xgrid, ygrid)

# create netCDF file

nc = netCDF4.Dataset('foo.nc', 'w')
nc.author = 'Rob Hetland'

nc.createDimension('x', 50)
nc.createDimension('y', 60)
nc.createDimension('time', None)    # An 'unlimited' dimension. 

nc.createVariable('f', 'd', ('time', 'y', 'x'))
nc.variables['f'][:] = fgrid
nc.variables['f'].units = 'meters sec-1'

nc.createVariable('x', 'd', ('x',))
nc.variables['x'][:] = xgrid[0, :]
nc.variables['x'].units = 'meters'

nc.createVariable('y', 'd', ('y',))
nc.variables['y'][:] = ygrid[:, 0]
nc.variables['y'].units = 'meters'

nc.createVariable('time', 'd', ('time',))
nc.variables['time'][:] = time
nc.variables['time'].units = 'seconds'

nc.close()


### GRIB files

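`netCDF4` can also read GRIB2 files served over THREDDS! The server presents the GRIB data through the same OPeNDAP interface, so the client code looks just like the NetCDF examples above. GRIB files are used by NOAA for weather forecast and climate model output. There are many, many, many datasets that are available over THREDDS in GRIB format.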

In [None]:
nc = netCDF4.Dataset('http://nomads.ncdc.noaa.gov/thredds/dodsC/modeldata/cmd_grblow/2011/201103/20110301/spllnl.gdas.2011030118.grb2')
sh = nc['Specific_humidity'][0, 0]
lon = nc['lon'][:]
lat = nc['lat'][:]
time = netCDF4.num2date(nc['time'][0], nc['time'].units)

proj = cartopy.crs.Robinson(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.coastlines(linewidth=0.25)
mappable = ax.contourf(lon, lat, sh, 20, cmap=cmo.matter, transform=cartopy.crs.PlateCarree())
ax.set_title(time.isoformat())
fig.colorbar(mappable).set_label('%s' % nc['Specific_humidity'].long_name)

---
### *Exercise*

> Find another dataset at [NOMADS](http://nomads.ncdc.noaa.gov/thredds) (or [here](http://nomads.ncdc.noaa.gov/data.php)), and plot it up!

> *Bonus*: Try to read in and plot regional model predictions: [NAM](http://nomads.ncdc.noaa.gov/thredds/catalog/nam218/catalog.html)


---

### See also

- [xarray](http://xarray.pydata.org/en/stable/): NetCDF + pandas + CF conventions. Awesome.
- [pygrib](https://github.com/jswhit/pygrib): Reading GRIB files.
- [ncview](http://meteora.ucsd.edu/~pierce/ncview_home_page.html): Not python, but a very useful NetCDF file viewer.

## `xarray`

`xarray` expands the utility of the time series analysis package `pandas` into more than one dimension. It is actively being developed, so some functionality isn't yet available, but for certain analyses it is very useful.

In [None]:
import xarray as xr

In the previous material, we used `netCDF4` directly to read in a data file, then access the data:

In [None]:
nc = netCDF4.Dataset('../data/sst.mnmean.v4.nc')

print(nc['sst'].shape)

However, as was pointed out in class, with this approach we first need to know which time index corresponds to a particular time before we can pull out the sea surface temperature data for it. How can we find this?

First we convert the time numbers from the file into datetimes, like before:

In [None]:
# Extract the time variable using the convenient num2date
time = netCDF4.num2date(nc['time'][:], nc['time'].units)

Say we want to search for the time index corresponding to May 1, 1954.

In [None]:
from datetime import datetime

date = datetime(1954, 5, 1, 0, 0)

Now we search for the time index:

In [None]:
tind = np.where(time==date)[0][0]
print(tind)

Great! So the time index we want is 1204. We can now make our sea surface temperature plot:

In [None]:
proj = cartopy.crs.Robinson(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.add_feature(cartopy.feature.LAND, facecolor='0.9')
mappable = ax.contourf(nc['lon'][:], nc['lat'][:], nc['sst'][tind], cmap=cmo.thermal, transform=cartopy.crs.PlateCarree())
ax.set_title(time[tind].isoformat())
fig.colorbar(mappable).set_label(r'Sea Surface Temperature [$^\circ$C]')

What if instead we want the index corresponding to May 23, 1954?

In [None]:
date = datetime(1954, 5, 23, 0, 0)
np.where(time==date)

What is the problem here? There is no data at that exact time.

So what should we do?

---
### *Exercise*

> Search for the time index corresponding to the time in the data file closest to May 23, 1954.

---

Now let's access this data using a different package called `xarray`:

In [None]:
ds = xr.open_dataset('../data/sst.mnmean.v4.nc')  # similar way to read in; also works for remote data addresses
ds

Now we can search for data in May 1954:

In [None]:
ds['sst'].sel(time=slice('1954-05','1954-05'))

Or we can search for the nearest output to May 23, 1954:

In [None]:
ds['sst'].sel(time='1954-05-23', method='nearest')
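
Conceptually, `method='nearest'` picks the coordinate value with the smallest absolute distance from the requested one. Here is a rough stand-alone sketch with plain `datetime` objects. The monthly `times` list is made up to resemble the file's time axis, and this is only an illustration of the idea, not how `xarray` performs the lookup internally.

```python
from datetime import datetime

# Made-up monthly time axis for 1954 (first day of each month)
times = [datetime(1954, month, 1) for month in range(1, 13)]
target = datetime(1954, 5, 23)

# Index of the entry with the smallest absolute time difference
nearest_index = min(range(len(times)), key=lambda i: abs(times[i] - target))

print(nearest_index, times[nearest_index].isoformat())  # 5 1954-06-01T00:00:00
```

May 23 is 22 days after May 1 but only 9 days before June 1, so the nearest entry is June rather than May.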

Let's plot it!

In [None]:
sst = ds['sst'].sel(time='1954-05-23', method='nearest')

proj = cartopy.crs.Robinson(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.add_feature(cartopy.feature.LAND, facecolor='0.9')
mappable = ax.contourf(nc['lon'][:], nc['lat'][:], sst, cmap=cmo.thermal, transform=cartopy.crs.PlateCarree())
ax.set_title(sst.time.data)
fig.colorbar(mappable).set_label(r'Sea Surface Temperature [$^\circ$C]')

Note that you can also just plot against the included coordinates with built-in convenience functions (this is analogous to the plotting methods in `pandas`, which we used for one-dimensional data):

In [None]:
sst.plot.contourf()

## GroupBy

Like in `pandas`, we can use the `groupby` method to do some neat things. Let's group by season and save a new file.

In [None]:
seasonal_mean = ds.groupby('time.season').mean('time')
seasonal_mean
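
`time.season` labels each time step with one of the three-month seasons `DJF`, `MAM`, `JJA`, or `SON`, and `groupby(...).mean('time')` then averages all steps that share a label. Here is a pure-Python sketch of the same idea on a made-up one-dimensional monthly series. Each value is simply its month number, so the result is easy to check by hand; the names `SEASON_OF_MONTH` and `season_means` are illustrative only.

```python
from collections import defaultdict

# Month number -> season label, matching the labels xarray uses
SEASON_OF_MONTH = {
    12: 'DJF', 1: 'DJF', 2: 'DJF',
    3: 'MAM', 4: 'MAM', 5: 'MAM',
    6: 'JJA', 7: 'JJA', 8: 'JJA',
    9: 'SON', 10: 'SON', 11: 'SON',
}

# Two made-up years of monthly data; each value equals its month number
months = list(range(1, 13)) * 2
values = [float(m) for m in months]

# Collect the values belonging to each season, then average them
groups = defaultdict(list)
for month, value in zip(months, values):
    groups[SEASON_OF_MONTH[month]].append(value)

season_means = {season: sum(v) / len(v) for season, v in groups.items()}
print(season_means)  # {'DJF': 5.0, 'MAM': 4.0, 'JJA': 7.0, 'SON': 10.0}
```

Like the `xarray` call, this is a plain average over time steps; it does not weight months by their number of days.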

Do you remember how many lines of code were required to save a NetCDF file from scratch? It is straightforward, but tedious. Once you are working with data using `xarray`, you can save new, derived files very easily from your dataset:

In [None]:
fname = 'output/test.nc'
# seasonal_mean.to_netcdf(fname)  # you can't run this in read-only, but I already did for you

In [None]:
d = netCDF4.Dataset(fname)
d

---
### *Exercise*

> Plot the difference between summer and winter mean sea surface temperature.

---