<p style="float:right">
<img src="images/logos/cu.png" style="display:inline" />
<img src="images/logos/cires.png" style="display:inline" />
<img src="images/logos/nasa.png" style="display:inline" />
</p>

# Python, Jupyter & pandas tutorial: Module 2

## Obtaining data and basic inspection

### Basic data access

It is, of course, possible to obtain data externally, or via the `%%script` magic, which saves the trouble of opening a separate terminal, command, or browser window. ("Data" is roughly construed here: we'll look at images because they're simple to view.) We can fetch an image to the local filesystem, then display it with Markdown:

In [None]:
%%script bash
wget ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/Feb/N_197902_extn.png -O N_197902_extn.png

<img src='N_197902_extn.png' style='float:left'/>

We can also obtain an image directly from the internet and display it with Python code:

In [None]:
from IPython.display import Image
Image(url='ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/Feb/N_201602_extn.png')

### OpenDAP data access

The `netCDF4` package provides OpenDAP client capabilities. Here we use it to obtain data via an OpenDAP server at NSIDC:

In [None]:
import netCDF4
url = ('http://opendap.apps.nsidc.org:80/opendap/DATASETS/'
       'nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc'
)
dataset = netCDF4.Dataset(url)

### Data inspection

We can inspect the `dataset` object to see its class. In this case, it's exactly what we'd expect given that we created it with `netCDF4.Dataset`. However, it's sometimes the case, especially when working with a new library, that we do not anticipate the type of an object returned from some method / function call, so it's handy to be able to find out what it is.

In [None]:
type(dataset)

Given that we have a `Dataset` object from the `netCDF4` library, we could of course go consult that library's documentation to learn what kinds of attributes and methods such an object has. Or, we can bravely plunge in and have a look for ourselves. Here, we use Python's built-in `dir` function to get a list of object members. We filter out those whose names begin with `_`, as these are generally not meant to be used directly.

In [None]:
import re
list(filter(lambda x: not re.match('^_.*', x), dir(dataset)))

Looking at this list, some items appear to be metadata, e.g. those starting with `geospatial_`, or `institution` or `platform`. We can look at the name of the dataset via the `title` attribute:

In [None]:
dataset.title

We can refer to [this dataset's documentation](http://nsidc.org/data/docs/measures/nsidc-0530/index.html) for more information about the meaning of these attributes.

The actual data is available under `variables`: 

In [None]:
for variable in dataset.variables:
    print(variable)

Note that these variables correspond to those listed in Table 3 of the documentation page linked to above.

Let's extract the `latitude` variable and look at its properties. In a Jupyter notebook, as in a Python REPL -- but unlike in non-interactive Python code -- simply giving the name of an object will cause its textual representation to be printed. (In a Jupyter notebook, this only works on the last line of a cell.)

In [None]:
latitude = dataset.variables['latitude']
latitude

Lots of useful information! `latitude` is a 720x720 array, with valid values ranging from -90 to 90 degrees north, and invalid ("fill") values marked as -999.

Since we pulled `latitude` out of another object, rather than creating it explicitly as we did with the `Dataset`, what kind of object do we have?

In [None]:
type(latitude)

No surprise there. And how are the data in this variable represented?

In [None]:
latitude.datatype

As 32-bit floating-point numbers. We also saw this, though we might not have noticed it, when we printed `latitude` above: note the `float32` designation in the second line. Similarly, all the other data shown above can be extracted with more targeted queries:

In [None]:
latitude.long_name

In [None]:
latitude.valid_range

In [None]:
latitude.shape

We can also extract metadata that we could otherwise compute:

In [None]:
latitude.ndim

In [None]:
latitude.size

These are shorthand for

In [None]:
len(latitude.shape)

In [None]:
latitude.shape[0] * latitude.shape[1]
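The product above is written out for two dimensions. As a minimal sketch of the same relationships for any number of dimensions, the cell below uses a plain NumPy array as a stand-in for `latitude` (an assumption made so that it runs without a network connection; the name `grid` is hypothetical):

```python
import numpy as np

# Hypothetical stand-in with the same shape as the latitude variable.
grid = np.zeros((720, 720), dtype=np.float32)

# ndim is the number of entries in shape.
print(grid.ndim == len(grid.shape))

# size is the product of all entries in shape, however many there are.
print(grid.size == int(np.prod(grid.shape)))
```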

Just as with our `Dataset`, we can look at all the public attributes and methods of our `Variable`:

In [None]:
list(filter(lambda x: not re.match('^_.*', x), dir(latitude)))
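The same filter can also be written without a regular expression. The sketch below uses a list comprehension with `str.startswith`. The `Example` class and the `public_members` helper are hypothetical names introduced here for illustration; `Example` merely stands in for a `netCDF4` object so that the cell runs on its own.

```python
class Example:
    """Hypothetical stand-in for a netCDF4 object (illustration only)."""
    title = 'demo'

    def close(self):
        pass

    def _private_helper(self):
        pass


def public_members(obj):
    """Return the names of obj's members that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


print(public_members(Example()))
```

Calling `public_members(latitude)` should give the same list as the `filter` expression above.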

Instances of the `Variable` class from `netCDF4` (like our `latitude` object) behave like multidimensional arrays, similar to NumPy's `ndarray`, so we can access elements with the familiar `[]` bracket notation. Since we know that `latitude` is 720x720, its length (the size of its first dimension) should be 720:

In [None]:
len(latitude)

And if we look at the first row in the `latitude` array, its length is similarly what we'd expect:

In [None]:
len(latitude[0])

Note that Python is zero-indexed like C, and unlike Fortran, so valid indices range from 0 to 719.

Let's extract the `time` variable from our dataset and examine it. (Note that we can view output from commands other than the last one in a cell by explicitly using Python's `print` function.)

In [None]:
time = dataset.variables['time']
print(time)
print(time[0])

So, this dataset's data starts 4749 days after 1998-12-31 on the Gregorian calendar. Handy! (Not really.)
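The offset can be turned into a calendar date with Python's standard library. This is a minimal sketch that hard-codes the reference date and the offset quoted above rather than reading them from the dataset; the variable names are hypothetical. (The `netCDF4` library also provides a `num2date` function for this kind of conversion. It is not used here so that the cell runs on its own.)

```python
from datetime import date, timedelta

# Values taken from the text above: the reference date from the units
# of the time variable, and the offset from time[0].
reference = date(1998, 12, 31)
offset_days = 4749

observation_date = reference + timedelta(days=offset_days)
print(observation_date.isoformat())
```

The result agrees with the `20120101` date stamp in the data file's name.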

Out of curiosity, let's check that the `longitude` variable's shape conforms to that of `latitude`, as we'd hope.

In [None]:
longitude = dataset.variables['longitude']
longitude

Good, it does.

Now let's look at one of the actual snow-cover variables which, presumably, is why we're bothering with this dataset in the first place:

In [None]:
msce = dataset.variables['merged_snow_cover_extent']
msce

Inspecting the Merged Snow Cover Extent variable, we see that the data is in a 1x720x720 array (the last two dimensions match those of latitude and longitude, so that's good!) whose values are integers specifying snow cover information from various sources, as well as snow-free and ice-covered land, and ocean.

We can pick a "random" array element and see its value:

In [None]:
msce[0][360][360]

40 = Ocean. Does it make sense?

In [None]:
print(latitude[360][360])
print(longitude[360][360])

That's pretty close to the North Pole, in the Arctic Ocean, so it seems reasonable.

Let's use NumPy to convert our `msce` variable into an `ndarray` object, and get rid of that useless first dimension:

In [None]:
import numpy as np
msce = np.array(msce)[0, :, :]
msce.shape

That's better. Now `msce`'s dimension matches that of `latitude` and `longitude`.
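As an aside, a length-1 dimension can also be dropped with `np.squeeze`. The sketch below compares the two approaches on a made-up array (`cube` is a hypothetical stand-in with the same shape as the snow-cover variable, not the real data):

```python
import numpy as np

# Hypothetical stand-in with the same shape as the snow-cover variable.
cube = np.zeros((1, 720, 720), dtype=np.int8)

# Select the single time step explicitly, as in the cell above.
by_index = cube[0, :, :]

# Drop axis 0; NumPy raises an error if that axis is not of length 1.
by_squeeze = np.squeeze(cube, axis=0)

print(by_index.shape, by_squeeze.shape)
```

Passing `axis=0` makes the intent explicit: if a file ever held more than one time step, `np.squeeze` would fail loudly instead of silently keeping only the first one.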

How much _good_ data do we have in `msce`? That is, how many data elements are there in total, and how many are set to the bad-data fill value?

In [None]:
print(msce.size)
print(msce[msce != -99].size)

So, over 20% of the data elements are set to the fill value. This sometimes happens with satellite (and other) data: Quality Control (QC) algorithms determine that some observations are suspect, so they are marked as such so that further analysis can avoid depending on them.
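The fraction of fill values can be computed directly rather than read off two printed numbers. Here is a minimal sketch on a small made-up array; `sample` and its values are invented for illustration, and `FILL_VALUE` is assumed to be the -99 used in the cell above:

```python
import numpy as np

FILL_VALUE = -99  # assumed to match the fill value used in the cell above

# Small made-up array, for illustration only.
sample = np.array([[40, -99, 10],
                   [-99, 40, 40]])

# Count the elements that are not the fill value.
good = np.count_nonzero(sample != FILL_VALUE)

# Fraction of elements that are set to the fill value.
fill_fraction = 1 - good / sample.size
print(good, sample.size, round(fill_fraction, 3))
```

Replacing `sample` with `msce` should give the corresponding fraction for the real data.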

### Subsetting data with OpenDAP

One benefit of using OpenDAP for data access is that data can be subsetted prior to download, to avoid the transfer and storage of data one is not interested in.

NSIDC's [OPeNDAP Server Dataset Access Form](http://opendap.apps.nsidc.org/opendap/DATASETS/nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc.html) for this data gives some guidance on subsetting the data. For starters, let's restrict our query to the three variables -- Latitude, Longitude, and Merged Snow Cover Extent -- that we are interested in. When we tick the checkboxes for _latitude_, _longitude_, and _merged_snow_cover_extent_, the URL shown in the _Data URL_ field is updated to:

`http://opendap.apps.nsidc.org:80/opendap/DATASETS/nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc?latitude[0:1:719][0:1:719],longitude[0:1:719][0:1:719],merged_snow_cover_extent[0:1:0][0:1:719][0:1:719]`

Let's perform our query again with this URL and check the variables we have now:

In [None]:
url = ('http://opendap.apps.nsidc.org:80/opendap/DATASETS/'
       'nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc?'
       'latitude[0:1:719][0:1:719],'
       'longitude[0:1:719][0:1:719],'
       'merged_snow_cover_extent[0:1:0][0:1:719][0:1:719]'
)
dataset = netCDF4.Dataset(url)
for variable in dataset.variables:
    print(variable)

Previously, we had ten variables; now we have only three. Nice.

Let's say we're only interested in snow cover in Iceland. Let's subset the data geographically as well. The [OPeNDAP Server Dataset Access Form](http://opendap.apps.nsidc.org/opendap/DATASETS/nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc.html) gives us options for constraining the variables, but expects us to do so by row and column. Iceland lies between about 12 and 25 degrees west, and 63 and 67 degrees north. Let's see which rows and columns in our `Dataset` fall within those bounds:

In [None]:
latitude = np.array(dataset.variables['latitude'])
longitude = np.array(dataset.variables['longitude'])
minrow = 720
maxrow = -1
mincol = 720
maxcol = -1
for row in range(0, 720):
    for col in range(0, 720):
        a = latitude[row][col]
        b = longitude[row][col]
        if a >= 63 and a <= 67 and b >= -25 and b <= -12:
            minrow = min(minrow, row)
            maxrow = max(maxrow, row)
            mincol = min(mincol, col)
            maxcol = max(maxcol, col)
print('rows %d:%d' % (minrow, maxrow))
print('cols %d:%d' % (mincol, maxcol))
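The nested loops above can be replaced by NumPy boolean masks. The following is a sketch rather than a drop-in replacement: `bounding_rows_cols` is a hypothetical helper, and it is demonstrated on a made-up regular grid so that the cell runs without the dataset. The same call should work on the `latitude` and `longitude` arrays loaded above.

```python
import numpy as np


def bounding_rows_cols(lat, lon, lat_min, lat_max, lon_min, lon_max):
    """Return (minrow, maxrow, mincol, maxcol) of cells inside the bounds, or None."""
    inside = ((lat >= lat_min) & (lat <= lat_max) &
              (lon >= lon_min) & (lon <= lon_max))
    rows, cols = np.nonzero(inside)
    if rows.size == 0:
        return None  # no grid cell falls inside the requested box
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())


# Made-up regular grid for illustration: rows are 60..69 degrees north,
# columns are -30..-11 degrees east. The real dataset's grid is different.
lat, lon = np.meshgrid(np.arange(60, 70), np.arange(-30, -10), indexing='ij')
result = bounding_rows_cols(lat, lon, 63, 67, -25, -12)
print(result)
```

The mask compares every grid cell at once, so no explicit Python loop over the 720x720 grid is needed.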

Now let's add constraints to our OpenDAP URL to select just the rows and columns that we think correspond to Iceland. The OpenDAP constraints are given in `lower_bound:stride:upper_bound` form:

In [None]:
url = ('http://opendap.apps.nsidc.org:80/opendap/DATASETS/'
       'nsidc0530_MEASURES_nhsnow_daily25/2012/nhtsd25e2_20120101_v01r01.nc?'
       'latitude[453:1:476][310:1:338],'
       'longitude[453:1:476][310:1:338],'
       'merged_snow_cover_extent[0:1:0][453:1:476][310:1:338]'
)
dataset = netCDF4.Dataset(url)
dataset

Note the line `dimensions(sizes): time(1), cols(29), rows(24)`: 24 rows and 29 columns, which corresponds to our request. This is a lot less data than we were retrieving before!
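The constraint expression used above can also be assembled from the row and column bounds instead of being typed by hand. This is a minimal sketch: `opendap_constraint` is a hypothetical helper, not part of `netCDF4`, and it always uses a stride of 1.

```python
def opendap_constraint(name, *bounds):
    """Build 'name[lo:1:hi]...' from (lo, hi) pairs, one pair per dimension."""
    return name + ''.join('[%d:1:%d]' % (lo, hi) for lo, hi in bounds)


# Row and column bounds used in the URL above.
rows = (453, 476)
cols = (310, 338)

query = ','.join([
    opendap_constraint('latitude', rows, cols),
    opendap_constraint('longitude', rows, cols),
    opendap_constraint('merged_snow_cover_extent', (0, 0), rows, cols),
])
print(query)
```

Appending `'?' + query` to the dataset's base URL should reproduce the URL used above. Both bounds are inclusive, which is why rows 453 to 476 yield 24 rows.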

In Module 3, we'll display this geolocated data and see if we really got what we asked for.