# NetCDF for Python
Meteorological data commonly come in a format called netCDF, or network Common Data Form. This format is designed to carry enough metadata to describe the data contained within the file. Additionally, this format is exceptionally easy to read into a Python program using the netCDF4 module.

This module was developed by an atmospheric scientist and is partially maintained and hosted by Unidata (the same folks who work on GEMPAK). But before we get started, we need to look a little bit more into this file format and how we can determine what is contained within a file.

## NetCDF File
NetCDF "is an interface to a library of data access functions for storing and retrieving data in the form of arrays" (Unidata NetCDF Users' Guide). Basically, a netCDF file is a structure for storing data of any type, with a common set of functions to store and retrieve those data. So can I just open the file like an ASCII (plain text) file? Unfortunately, no. NetCDF is a binary format; if you opened the file with gedit or vi, you wouldn't see much...kinda like opening a GEMPAK file with a similar program.

There are some easy tools that you can use to open, read, and manipulate netCDF files, including a collection called the NetCDF Operators (NCO). One of the most useful commands is <span style=font-family:Courier>**ncdump**</span>, which is distributed with the netCDF library itself. This command allows us to easily read the header (metadata) contained within the file. To do so, you would issue the following command from a terminal window:

<span style=font-family:Courier>ncdump -h *&lt;filename&gt;*</span>

The cell below does this; an exclamation point at the beginning of a line runs that line as if it were typed on the terminal command line. So go ahead and run the cell below and read through the output from ncdump.

In [1]:
!/usr/bin/ncdump -h /home/kgoebber/met330/python/air.mon.mean.plevs.nc

netcdf air.mon.mean.plevs {
dimensions:
	level = 17 ;
	lat = 73 ;
	lon = 144 ;
	time = UNLIMITED ; // (813 currently)
variables:
	float level(level) ;
		level:units = "millibar" ;
		level:long_name = "Level" ;
		level:positive = "down" ;
		level:GRIB_id = 100s ;
		level:GRIB_name = "hPa" ;
		level:actual_range = 1000.f, 10.f ;
		level:axis = "Z" ;
	float lat(lat) ;
		lat:units = "degrees_north" ;
		lat:actual_range = 90.f, -90.f ;
		lat:long_name = "Latitude" ;
		lat:standard_name = "latitude" ;
		lat:axis = "Y" ;
	float lon(lon) ;
		lon:units = "degrees_east" ;
		lon:long_name = "Longitude" ;
		lon:actual_range = 0.f, 357.5f ;
		lon:standard_name = "longitude" ;
		lon:axis = "X" ;
	double time(time) ;
		time:long_name = "Time" ;
		time:delta_t = "0000-01-00 00:00:00" ;
		time:avg_period = "0000-01-00 00:00:00" ;
		time:prev_avg_period = "0000-00-01 00:00:00" ;
		time:standard_name = "time" ;
		time:axis = "T" ;
		time:units = "hours since 1800-01-01 0

## Questions about the netCDF File
* What kind of data is in the file?
* What variables are contained in the file?
* What are the dimensions of the variables?
* What are the units of the variables?
* What other information is contained in the header? Why is any of it important?

## Getting netCDF data into a Python program
So we have these files, we can get information about them from <span style=font-family:Courier>**ncdump**</span>, but how do we go about getting these data into our program?

The netCDF4 Python module is our friend and makes it relatively easy, so in the cell below we will go ahead and import the module and make a call to link the dataset to the program.

In [2]:
from netCDF4 import Dataset
data = Dataset('/home/kgoebber/met330/python/air.mon.mean.plevs.nc', 'r')

So the first line above imports a single class called Dataset from the netCDF4 module. Since that is the only piece that we want to use right now, we can bring in just that name. We could always bring in more, but importing only what we need keeps our program's namespace uncluttered and makes it clear where each name comes from.

You also likely noticed that nothing apparent happened when executing the cell above (if everything worked properly!). What happened was that the file was linked to the handle **data** from which we can then read in data or metadata from this file.

We'll now explore some of the different things we can get from or about this file in the cells below. To get started, try the print function on the handle, data. The cell below pulls out one specific piece of metadata: the size of the level dimension.

In [9]:
print(data.dimensions['level'].size)

17


Printing the handle itself with print(data) shows a netCDF4 data object organized in the classic data format, along with some of the metadata that appeared at the end of the ncdump output. These are the global attributes of the file. These attributes can be accessed through the handle (data) with a dot and then the attribute name (see below).

In [11]:
print(data.title)

monthly mean air from the NCEP Reanalysis


In [12]:
# A really informative attribute is the VARIABLES!
print(data.variables)

OrderedDict([('level', <class 'netCDF4._netCDF4.Variable'>
float32 level(level)
    units: millibar
    long_name: Level
    positive: down
    GRIB_id: 100
    GRIB_name: hPa
    actual_range: [ 1000.    10.]
    axis: Z
unlimited dimensions: 
current shape = (17,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('lat', <class 'netCDF4._netCDF4.Variable'>
float32 lat(lat)
    units: degrees_north
    actual_range: [ 90. -90.]
    long_name: Latitude
    standard_name: latitude
    axis: Y
unlimited dimensions: 
current shape = (73,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('lon', <class 'netCDF4._netCDF4.Variable'>
float32 lon(lon)
    units: degrees_east
    long_name: Longitude
    actual_range: [   0.   357.5]
    standard_name: longitude
    axis: X
unlimited dimensions: 
current shape = (144,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('time', <class 'netCDF4._netCDF4.Variable'>
float64 time(time)
    long_name: Time
 

In [13]:
# An easier to read list
print(list(data.variables))

['level', 'lat', 'lon', 'time', 'air']


In [14]:
# and the dimensions of the data
print(data.dimensions)

OrderedDict([('level', <class 'netCDF4._netCDF4.Dimension'>: name = 'level', size = 17
), ('lat', <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 73
), ('lon', <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 144
), ('time', <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 813
)])


## Pulling the Air Temperature from the file
If we want to use the air temperature that is contained within the file, then we use the handle and attribute to gain access to the variable 'air'.

In [18]:
air = data.variables['air']
print(air.units)

degC


The above is still an object with a lot of information about the variable, including the data, but it is not just the data. The following cell will read in the air temperature data and store it in a variable within our program named **airtemp**. Note how long it takes for the cell to run. Up until this point we haven't actually read any data in, just metadata!

In [19]:
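# Read every value of 'air' into memory using the variable object from above
# (equivalent to data.variables['air'][:], so the file only needs to be read once)
airtemp = air[:]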

With the above command we bring in the air temperature data from the netCDF file, and it is brought into our program as a NumPy array (typically a NumPy masked array, so that missing values can be flagged), which means it will be easy to manipulate and we can readily get all types of information from or about the data with simple functions and methods.
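
As a small illustration of those "simple functions and methods", the sketch below uses a made-up four-value sample rather than the real file. It assumes the data arrive as a NumPy masked array and reuses the missing_value that the file's metadata reports for air; the sample temperatures themselves are hypothetical.

```python
import numpy as np

# missing_value as reported in the metadata for 'air'
fill = -9.96921e36

# Hypothetical temperatures (degC) with one missing entry
sample = np.ma.masked_values([12.5, fill, 14.0, 13.5], fill)

# Masked entries are skipped by count(), min(), max(), and mean()
print(sample.count())
print(sample.min(), sample.max(), sample.mean())
```

Because the missing entry is masked, the statistics are computed from the three valid values only, instead of being swamped by the huge fill number.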

From the metadata in one of the above cells, we see that there are attributes associated with the variable air (e.g., long_name, units, precision, shape, etc.). We can access these attributes just like we could the global attributes of the whole file. For example, if we wanted to know the shape of the array, we would issue the following command.

In [22]:
print(airtemp.shape)
print(air)

(813, 17, 73, 144)
<class 'netCDF4._netCDF4.Variable'>
float32 air(time, level, lat, lon)
    long_name: Monthly Mean of Air temperature
    units: degC
    precision: 2
    least_significant_digit: 1
    var_desc: Air Temperature
    level_desc: Multiple levels
    statistic: Mean
    parent_stat: Other
    missing_value: -9.96921e+36
    valid_range: [-200.  300.]
    dataset: NCEP Reanalysis Derived Products
    actual_range: [-108.64998627   43.24000549]
unlimited dimensions: time
current shape = (813, 17, 73, 144)
filling on, default _FillValue of 9.969209968386869e+36 used



* How do we interpret this shape?
* How many dimensions are there?
* What order are they in?
* How do we know?

In [21]:
# Note - We can also get the shape from using the numpy module
import numpy as np
np.shape(airtemp)

(813, 17, 73, 144)
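
One way to get a feel for the axis order is to slice a stand-in array with the same layout. The sketch below assumes the order (time, level, lat, lon) reported for air, and uses a small zero-filled array named demo with only 3 time steps (a hypothetical stand-in, not the real data) so that it runs quickly. The index values 2, 20, and 109 are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical stand-in with the same axis order as 'air': (time, level, lat, lon)
demo = np.zeros((3, 17, 73, 144), dtype=np.float32)

# Unpack the shape in the same order as the dimensions
n_time, n_level, n_lat, n_lon = demo.shape

# Fix the first two axes (one time, one level) to get a 2-D lat-lon map
one_map = demo[0, 2, :, :]

# Fix the last three axes to get a time series at one level and one grid point
one_series = demo[:, 2, 20, 109]

print(one_map.shape)
print(one_series.shape)
```

The same indexing pattern applies to airtemp; only the length of the time axis differs.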

Let's also bring in some data for all of these dimensions, so that we can figure out where and when this data is valid.

In [23]:
levs  = data.variables['level'][:]
lats  = data.variables['lat'][:]
lons  = data.variables['lon'][:]

In [24]:
# These are all of the pressure levels that we have air temperature data for...
print(levs)

[ 1000.   925.   850.   700.   600.   500.   400.   300.   250.   200.
   150.   100.    70.    50.    30.    20.    10.]
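
To pick out one level from an array like airtemp we need its index along the level axis. A minimal sketch, with the pressure levels typed in from the output above and 850 hPa chosen as an example target:

```python
import numpy as np

# Pressure levels copied from the printed output above (hPa)
levs = np.array([1000., 925., 850., 700., 600., 500., 400., 300., 250.,
                 200., 150., 100., 70., 50., 30., 20., 10.])

target = 850.0  # example level of interest

# Index of the level closest to the target value
ilev = int(np.argmin(np.abs(levs - target)))
print(ilev, levs[ilev])
```

Searching for the nearest value rather than an exact match is a design choice: it still returns something sensible if the requested level is not exactly one of the stored levels.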


In [25]:
# Latitudes are in degrees North; equator == 0
print(lats)

[ 90.   87.5  85.   82.5  80.   77.5  75.   72.5  70.   67.5  65.   62.5
  60.   57.5  55.   52.5  50.   47.5  45.   42.5  40.   37.5  35.   32.5
  30.   27.5  25.   22.5  20.   17.5  15.   12.5  10.    7.5   5.    2.5
   0.   -2.5  -5.   -7.5 -10.  -12.5 -15.  -17.5 -20.  -22.5 -25.  -27.5
 -30.  -32.5 -35.  -37.5 -40.  -42.5 -45.  -47.5 -50.  -52.5 -55.  -57.5
 -60.  -62.5 -65.  -67.5 -70.  -72.5 -75.  -77.5 -80.  -82.5 -85.  -87.5
 -90. ]


In [26]:
# Longitudes are in degrees East; Prime Meridian == 0, International Date Line == 180
print(lons)

[   0.     2.5    5.     7.5   10.    12.5   15.    17.5   20.    22.5
   25.    27.5   30.    32.5   35.    37.5   40.    42.5   45.    47.5
   50.    52.5   55.    57.5   60.    62.5   65.    67.5   70.    72.5
   75.    77.5   80.    82.5   85.    87.5   90.    92.5   95.    97.5
  100.   102.5  105.   107.5  110.   112.5  115.   117.5  120.   122.5
  125.   127.5  130.   132.5  135.   137.5  140.   142.5  145.   147.5
  150.   152.5  155.   157.5  160.   162.5  165.   167.5  170.   172.5
  175.   177.5  180.   182.5  185.   187.5  190.   192.5  195.   197.5
  200.   202.5  205.   207.5  210.   212.5  215.   217.5  220.   222.5
  225.   227.5  230.   232.5  235.   237.5  240.   242.5  245.   247.5
  250.   252.5  255.   257.5  260.   262.5  265.   267.5  270.   272.5
  275.   277.5  280.   282.5  285.   287.5  290.   292.5  295.   297.5
  300.   302.5  305.   307.5  310.   312.5  315.   317.5  320.   322.5
  325.   327.5  330.   332.5  335.   337.5  340.   342.5  345.   347.5
  350.
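
Since the longitudes run from 0 to 357.5 degrees east, a western-hemisphere longitude has to be converted before looking it up. The sketch below rebuilds the two coordinate arrays from the 2.5-degree spacing seen in the output; the helper nearest_index and the example point (40 N, 87.5 W) are hypothetical and only there for illustration.

```python
import numpy as np

# Rebuild the coordinate arrays from the printed values (2.5-degree spacing)
lats = np.linspace(90.0, -90.0, 73)   # north to south
lons = np.linspace(0.0, 357.5, 144)   # degrees east

def nearest_index(coords, value):
    """Return the index of the entry in coords closest to value."""
    return int(np.argmin(np.abs(coords - value)))

# Hypothetical point of interest: 40 N, 87.5 W
lat_pt, lon_pt = 40.0, -87.5

# Convert degrees west (negative) to degrees east: -87.5 becomes 272.5
lon_east = lon_pt % 360.0

ilat = nearest_index(lats, lat_pt)
ilon = nearest_index(lons, lon_east)
print(ilat, ilon)
```

Note that this simple nearest-value search does not wrap around the Prime Meridian, so a longitude just below 360 is matched to 357.5 rather than to 0.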

## The Time Variable
The time variable is often one of the more problematic variables because the units are not usually in an easily readable format. In the cell below we'll bring in the time variable metadata and have a look at what the units are, then we'll print out the time to see what the values look like.

In [27]:
time = data.variables['time']
print(time)

<class 'netCDF4._netCDF4.Variable'>
float64 time(time)
    long_name: Time
    delta_t: 0000-01-00 00:00:00
    avg_period: 0000-01-00 00:00:00
    prev_avg_period: 0000-00-01 00:00:00
    standard_name: time
    axis: T
    units: hours since 1800-01-01 00:00:0.0
    actual_range: [ 1297320.  1890480.]
unlimited dimensions: time
current shape = (813,)
filling on, default _FillValue of 9.969209968386869e+36 used



In [28]:
print(time[:])

[ 1297320.  1298064.  1298760.  1299504.  1300224.  1300968.  1301688.
  1302432.  1303176.  1303896.  1304640.  1305360.  1306104.  1306848.
  1307520.  1308264.  1308984.  1309728.  1310448.  1311192.  1311936.
  1312656.  1313400.  1314120.  1314864.  1315608.  1316280.  1317024.
  1317744.  1318488.  1319208.  1319952.  1320696.  1321416.  1322160.
  1322880.  1323624.  1324368.  1325040.  1325784.  1326504.  1327248.
  1327968.  1328712.  1329456.  1330176.  1330920.  1331640.  1332384.
  1333128.  1333824.  1334568.  1335288.  1336032.  1336752.  1337496.
  1338240.  1338960.  1339704.  1340424.  1341168.  1341912.  1342584.
  1343328.  1344048.  1344792.  1345512.  1346256.  1347000.  1347720.
  1348464.  1349184.  1349928.  1350672.  1351344.  1352088.  1352808.
  1353552.  1354272.  1355016.  1355760.  1356480.  1357224.  1357944.
  1358688.  1359432.  1360104.  1360848.  1361568.  1362312.  1363032.
  1363776.  1364520.  1365240.  1365984.  1366704.  1367448.  1368192.
  1368

YIKES! That doesn't look like a real time to me at all. That's likely because you don't know what a count of hours since 1 January 1800 looks like. Thankfully, we don't have to work out all of the details ourselves; we can use a function from netCDF4 called num2date to help us out. First we need to import the function, then we can use it!

In [35]:
from netCDF4 import num2date, date2num
vtimes = num2date(time[:],'hours since 1800-01-01 00:00:00.0')
print(vtimes[-1])

2015-09-01 00:00:00


Wow, now that looks like a value that I can recognize.

To go the other way, we use the datetime module to create an appropriate datetime object, which we can then pass to the date2num function from netCDF4 to get back the appropriate number of hours since 1/1/1800.

In [40]:
from datetime import datetime
print(date2num(datetime(1981,6,1),time.units))
print(time[-1])

1590240.0
1890480.0
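
To see what num2date and date2num are doing for this particular units string, the arithmetic can be sketched with the standard library alone. This is a cross-check and not a replacement: it assumes the ordinary Gregorian calendar and the reference time of 1 January 1800 from the units attribute, whereas the netCDF4 functions also handle other calendars. The names REFERENCE, hours_to_datetime, and datetime_to_hours are made up for this example.

```python
from datetime import datetime, timedelta

# Reference time taken from the units string "hours since 1800-01-01"
REFERENCE = datetime(1800, 1, 1)

def hours_to_datetime(hours):
    """Convert hours since the reference time into a datetime."""
    return REFERENCE + timedelta(hours=hours)

def datetime_to_hours(when):
    """Convert a datetime into hours since the reference time."""
    return (when - REFERENCE).total_seconds() / 3600.0

# The last time value in the file and the date used in the cell above
print(hours_to_datetime(1890480.0))
print(datetime_to_hours(datetime(1981, 6, 1)))
```

Both results agree with the num2date and date2num output shown in the cells above.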


## Fun Exercise

Find the appropriate time index for the month you were born and compute the global average 850-hPa temperature.