# Tutorial 1

This tutorial introduces the methods available in nctoolkit. It requires a sea surface temperature dataset from NOAA, which can be downloaded as follows:

In [1]:
! wget ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc

--2020-05-26 16:22:06--  ftp://ftp.cdc.noaa.gov/Datasets/COBE2/sst.mon.mean.nc
           => ‘sst.mon.mean.nc’
Resolving ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)... 140.172.38.117
Connecting to ftp.cdc.noaa.gov (ftp.cdc.noaa.gov)|140.172.38.117|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Datasets/COBE2 ... done.
==> SIZE sst.mon.mean.nc ... 529004179
==> PASV ... done.    ==> RETR sst.mon.mean.nc ... done.
Length: 529004179 (504M) (unauthoritative)


2020-05-26 16:23:29 (6.33 MB/s) - ‘sst.mon.mean.nc’ saved [529004179]



This provides global gridded monthly average sea surface temperature from 1850 to the present day.

nctoolkit should be imported as follows:

In [2]:
import nctoolkit as nc

To analyze data we need to create a dataset. This can either be made up of a single file or an ensemble of multiple files. Tutorial 2 will cover how to handle multiple files.

To open the sea surface temperature data, we do this:

In [5]:
sst = nc.open_data("sst.mon.mean.nc")

We can see the size of the dataset:

In [9]:
sst.size

'Number of files: 1\nFile size: 529.004179 MB'
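The size reported here (529 MB) and the size reported by wget above (504M) describe the same 529004179 bytes. A minimal sketch of the arithmetic, assuming nctoolkit divides by 10^6 and wget's "M" is a 1024-based unit (both are inferences from the numbers shown, not taken from either tool's documentation):

```python
# Byte count taken from the wget log above.
size_bytes = 529004179

# Decimal megabytes (1 MB = 10**6 bytes), matching the value in sst.size.
size_mb = size_bytes / 10**6

# Binary mebibytes (1 MiB = 1024**2 bytes), matching wget's "504M".
size_mib = size_bytes / 1024**2

print(f"{size_mb:.6f} MB")   # 529.004179 MB
print(f"{int(size_mib)} MiB")  # 504 MiB
```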

We can find the years available:

In [11]:
sst.years

[1850,
 1851,
 1852,
 1853,
 1854,
 1855,
 1856,
 1857,
 1858,
 1859,
 1860,
 1861,
 1862,
 1863,
 1864,
 1865,
 1866,
 1867,
 1868,
 1869,
 1870,
 1871,
 1872,
 1873,
 1874,
 1875,
 1876,
 1877,
 1878,
 1879,
 1880,
 1881,
 1882,
 1883,
 1884,
 1885,
 1886,
 1887,
 1888,
 1889,
 1890,
 1891,
 1892,
 1893,
 1894,
 1895,
 1896,
 1897,
 1898,
 1899,
 1900,
 1901,
 1902,
 1903,
 1904,
 1905,
 1906,
 1907,
 1908,
 1909,
 1910,
 1911,
 1912,
 1913,
 1914,
 1915,
 1916,
 1917,
 1918,
 1919,
 1920,
 1921,
 1922,
 1923,
 1924,
 1925,
 1926,
 1927,
 1928,
 1929,
 1930,
 1931,
 1932,
 1933,
 1934,
 1935,
 1936,
 1937,
 1938,
 1939,
 1940,
 1941,
 1942,
 1943,
 1944,
 1945,
 1946,
 1947,
 1948,
 1949,
 1950,
 1951,
 1952,
 1953,
 1954,
 1955,
 1956,
 1957,
 1958,
 1959,
 1960,
 1961,
 1962,
 1963,
 1964,
 1965,
 1966,
 1967,
 1968,
 1969,
 1970,
 1971,
 1972,
 1973,
 1974,
 1975,
 1976,
 1977,
 1978,
 1979,
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,

We can find the times available:

In [14]:
sst.times

['1850-01-01T00:00:00',
 '1850-02-01T00:00:00',
 '1850-03-01T00:00:00',
 '1850-04-01T00:00:00',
 '1850-05-01T00:00:00',
 '1850-06-01T00:00:00',
 '1850-07-01T00:00:00',
 '1850-08-01T00:00:00',
 '1850-09-01T00:00:00',
 '1850-10-01T00:00:00',
 '1850-11-01T00:00:00',
 '1850-12-01T00:00:00',
 '1851-01-01T00:00:00',
 '1851-02-01T00:00:00',
 '1851-03-01T00:00:00',
 '1851-04-01T00:00:00',
 '1851-05-01T00:00:00',
 '1851-06-01T00:00:00',
 '1851-07-01T00:00:00',
 '1851-08-01T00:00:00',
 '1851-09-01T00:00:00',
 '1851-10-01T00:00:00',
 '1851-11-01T00:00:00',
 '1851-12-01T00:00:00',
 '1852-01-01T00:00:00',
 '1852-02-01T00:00:00',
 '1852-03-01T00:00:00',
 '1852-04-01T00:00:00',
 '1852-05-01T00:00:00',
 '1852-06-01T00:00:00',
 '1852-07-01T00:00:00',
 '1852-08-01T00:00:00',
 '1852-09-01T00:00:00',
 '1852-10-01T00:00:00',
 '1852-11-01T00:00:00',
 '1852-12-01T00:00:00',
 '1853-01-01T00:00:00',
 '1853-02-01T00:00:00',
 '1853-03-01T00:00:00',
 '1853-04-01T00:00:00',
 '1853-05-01T00:00:00',
 '1853-06-01T00:

Usefully, we can create an interactive plot of the dataset:

In [20]:
sst.plot()

It is important at the outset to understand some key features of nctoolkit. First, all methods directly modify the dataset (though not the underlying NetCDF files).

Initially, the sst dataset is a single file "sst.mon.mean.nc". This is tracked in the dataset using the current attribute:

In [22]:
sst.current

'/tmp/nctoolkitlpbkqhkxnctoolkittmp_37e8q29.nc'

However, if we modify the dataset, we will find that sst.current has changed, and is now a temporary file:

In [23]:
sst.mean()
sst.current

'/tmp/nctoolkitlpbkqhkxnctoolkittmp2900g5nr.nc'

Each time nctoolkit carries out an operation, a new temporary file is generated. Behind the scenes, nctoolkit handles these temporary files automatically. For example, in the code below temporary files are generated twice, but the second time nctoolkit recognizes that the first temporary file is no longer needed and removes it from disk. Creating two temporary files when a single one at the end would do is clearly inefficient; the later discussion of lazy evaluation shows how nctoolkit avoids this.

In [25]:
sst = nc.open_data("sst.mon.mean.nc")
sst.spatial_mean()
sst.mean()
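The bookkeeping described above can be pictured with a toy model. The sketch below is not nctoolkit's implementation; `ToyDataset` and its `apply` method are hypothetical names, and the "operation" only creates an empty temporary file. It shows the idea of pointing `current` at a new temporary file after each step and deleting the previous temporary file, while never touching the source file.

```python
import os
import tempfile


class ToyDataset:
    """Toy model of temporary-file tracking (hypothetical, not nctoolkit code)."""

    def __init__(self, path):
        self.source = path   # original file, never modified or deleted
        self.current = path  # file the dataset currently points to

    def apply(self, label):
        """Pretend to run an operation: create a new temp file, drop the old one."""
        fd, new_path = tempfile.mkstemp(prefix=f"toy_{label}_", suffix=".nc")
        os.close(fd)
        old = self.current
        self.current = new_path
        # Only temporary files are removed; the source file is left alone.
        if old != self.source and os.path.exists(old):
            os.remove(old)
        return self


# Stand-in for the real NetCDF file, so the sketch runs anywhere.
fd, source = tempfile.mkstemp(suffix=".nc")
os.close(fd)

ds = ToyDataset(source)
ds.apply("spatial_mean")
first_tmp = ds.current
ds.apply("mean")

source_kept = os.path.exists(source)          # True: source untouched
first_tmp_kept = os.path.exists(first_tmp)    # False: first temp file removed
print(source_kept, first_tmp_kept)

# Clean up the files created by this sketch.
os.remove(ds.current)
os.remove(source)
```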

## Subsetting data

Subsetting datasets is very easy. If we wanted to select only years in the 20th Century, we would do the following:

In [28]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(range(1900, 2000))
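Note that Python's `range` excludes its end point, so `range(1900, 2000)` covers 1900 to 1999 inclusive. A quick check in plain Python, independent of nctoolkit:

```python
# range(start, stop) includes start and excludes stop.
years = list(range(1900, 2000))

print(years[0], years[-1], len(years))  # 1900 1999 100
```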

Selecting months is similar. If we only wanted data for September, we would do this:

In [29]:
sst.select_months(9)

We might want to select a spatial region. We can use clip to do this. For example, if we wanted to select the North Atlantic, we would do this:

In [30]:
sst.clip(lon = [-80, 20], lat = [30, 70])
sst.plot()

## Averaging etc.

In [32]:
sst = nc.open_data("sst.mon.mean.nc")

## Making things more efficient

By default nctoolkit carries out eager evaluation, i.e. it evaluates each method as soon as you call it. This becomes inefficient for long processing chains. Consider the following processing chain, which calculates the mean North Atlantic temperature for each year in the 20th Century.

In [41]:
sst = nc.open_data("sst.mon.mean.nc")
sst.select_years(range(1900, 2000))
sst.clip(lon = [-80, 20], lat = [20, 70])
sst.annual_mean()
sst.spatial_mean()
sst.plot()

We can use the history attribute to see which commands nctoolkit has actually run behind the scenes. These can be ignored in most cases, but they are instructive here:

In [42]:
sst.history

['cdo -L -selyear,1900,1901,1902,1903,1904,1905,1906,1907,1908,1909,1910,1911,1912,1913,1914,1915,1916,1917,1918,1919,1920,1921,1922,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940,1941,1942,1943,1944,1945,1946,1947,1948,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999 sst.mon.mean.nc /tmp/nctoolkitlpbkqhkxnctoolkittmpvgygsfys.nc',
 'cdo -L  -sellonlatbox,-80,20,20,70 /tmp/nctoolkitlpbkqhkxnctoolkittmpvgygsfys.nc /tmp/nctoolkitlpbkqhkxnctoolkittmp3ux2nneg.nc',
 'cdo -L -yearmean /tmp/nctoolkitlpbkqhkxnctoolkittmp3ux2nneg.nc /tmp/nctoolkitlpbkqhkxnctoolkittmp0jouybdj.nc',
 'cdo -L -fldmean /tmp/nctoolkitlpbkqhkxnctoolkittmp0jouybdj.nc /tmp/nctoolkitlpbkqhkxnctoolkittmpv0h4p4f5.nc']
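To make the file traffic in this history explicit, the sketch below rebuilds the four command strings shown above and counts the files they write. It assumes, as the commands suggest, that the last token of each command is its output file and the second-to-last token is its input file; the variable names are illustrative only.

```python
tmp = "/tmp/nctoolkitlpbkqhkxnctoolkittmp"

# The four commands from sst.history; the year list is rebuilt with range().
history = [
    "cdo -L -selyear," + ",".join(str(y) for y in range(1900, 2000))
    + f" sst.mon.mean.nc {tmp}vgygsfys.nc",
    f"cdo -L  -sellonlatbox,-80,20,20,70 {tmp}vgygsfys.nc {tmp}3ux2nneg.nc",
    f"cdo -L -yearmean {tmp}3ux2nneg.nc {tmp}0jouybdj.nc",
    f"cdo -L -fldmean {tmp}0jouybdj.nc {tmp}v0h4p4f5.nc",
]

inputs = [cmd.split()[-2] for cmd in history]   # file each command reads
outputs = [cmd.split()[-1] for cmd in history]  # file each command writes

# A file that is written by one command and read by another is only a stepping stone.
intermediate = [f for f in outputs if f in inputs]

print(len(outputs), len(intermediate))  # 4 3
```

Four temporary files are written in total, and three of them exist only to feed the next command.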

You can see that a total of 4 temporary files have been generated. This leads to a lot of unnecessary IO. Thankfully, CDO provides method chaining, which lets us get around this problem. Instead of writing four temporary files, we can write one. To do this we need to set evaluation to lazy:

In [37]:
nc.options(lazy = True)

Methods will now only be evaluated either when you force them to be or when they are required to be. For example, if we call plot, that will force all commands to be run.

In [38]:
sst = nc.open_data("sst.mon.mean.nc")
sst.clip(lon = [-80, 20], lat = [20, 70])
sst.spatial_mean()
sst.annual_anomaly(baseline = [1950, 1969])
sst.history

['cdo -L -fldmean  -sellonlatbox,-80,20,20,70 sst.mon.mean.nc /tmp/nctoolkitlpbkqhkxnctoolkittmpky_mpooa.nc',
 'cdo -L sub -runmean,1 -yearmean /tmp/nctoolkitlpbkqhkxnctoolkittmpky_mpooa.nc -timmean -selyear,1950/1969 /tmp/nctoolkitlpbkqhkxnctoolkittmpky_mpooa.nc /tmp/nctoolkitlpbkqhkxnctoolkittmpqeoukvtr.nc']
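The first command in this history shows the chaining at work: `-fldmean` and `-sellonlatbox` share a single CDO call, with the operator that is applied first written last, next to the input file. The sketch below builds command strings in both styles to show the difference. The helper functions and the file names `tmp1.nc`, `tmp2.nc` and `out.nc` are hypothetical; this is not how nctoolkit constructs its commands internally.

```python
def eager_commands(operators, source, tmp_files):
    """One cdo call per operator, each writing its own temporary file."""
    commands = []
    infile = source
    for op, outfile in zip(operators, tmp_files):
        commands.append(f"cdo -L -{op} {infile} {outfile}")
        infile = outfile  # the next operator reads what this one wrote
    return commands


def chained_command(operators, source, outfile):
    """A single cdo call: the operator applied first sits closest to the input file."""
    chain = " ".join(f"-{op}" for op in reversed(operators))
    return f"cdo -L {chain} {source} {outfile}"


# Operators in the order they are applied.
ops = ["sellonlatbox,-80,20,20,70", "fldmean"]

eager = eager_commands(ops, "sst.mon.mean.nc", ["tmp1.nc", "tmp2.nc"])
lazy = chained_command(ops, "sst.mon.mean.nc", "out.nc")

print(len(eager))  # 2
print(lazy)        # cdo -L -fldmean -sellonlatbox,-80,20,20,70 sst.mon.mean.nc out.nc
```

In the eager style every operator costs one write and one read of a temporary file; in the chained style only the final result is written.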