<p style="float:right">
<img src="images/logos/cu.png" style="display:inline" />
<img src="images/logos/cires.png" style="display:inline" />
<img src="images/logos/nasa.png" style="display:inline" />
<img src="images/logos/nsidc_daac.png" style="display:inline" />
</p>

# Python, Jupyter & pandas: Module 5

## Inference / visualization, and working with xarray

### Simple inference and more visualization

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Let's create other kinds of graphs from the data we looked at in the previous module. We'll read in the .csv file we saved at the end of Module 4.

_**Note**: Please be sure to run the Module 4 notebook before proceeding. Otherwise, the .csv file will not exist._

In [None]:
monthly = pd.read_csv('monthly-extents.csv', index_col='date', parse_dates=True)
print(type(monthly))
monthly.head()

We're going to look for a trend in the northern-hemisphere June snow cover.

First, as we did in Module 4, we'll create a new `DataFrame` indexed by years, with a column for each month.

In [None]:
year_by_month = monthly.set_index([monthly.index.year, monthly.index.month]).unstack(1)
year_by_month.head()
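
To see what the `set_index(...).unstack(1)` step does in isolation, here is a minimal sketch on synthetic data. The `toy` frame and its values are made up for illustration; only the column name `snowcover` mirrors the real file.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: two full years of made-up values (not real snow-cover data).
dates = pd.date_range('2000-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'snowcover': np.arange(24, dtype=float)}, index=dates)

# Replace the date index with a (year, month) MultiIndex,
# then pivot the month level (level 1) into columns.
toy_by_month = toy.set_index([toy.index.year, toy.index.month]).unstack(1)

print(toy_by_month.shape)            # one row per year, one column per month
print(toy_by_month['snowcover'][6])  # the June column, indexed by year
```

Selecting `['snowcover'][6]` first picks the variable and then the month, which is the same pattern used for the June anomalies.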

Now let's create and then graph a new series representing snow-cover anomalies from the mean for all Junes:

In [None]:
mean = year_by_month['snowcover'][6].mean()
june_anomalies = year_by_month['snowcover'][6] - mean
june_anomalies = june_anomalies.dropna()
print(type(june_anomalies))

plt.figure(figsize=(15, 4))
june_anomalies.plot(title='Northern Hemisphere Snow Cover Anomalies: June', kind='bar', color='r')
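
As a small worked example of the anomaly calculation, the sketch below subtracts the mean from a handful of made-up June values (these numbers are illustrative, not taken from the dataset). By construction, anomalies from the mean sum to zero.

```python
import pandas as pd

# Made-up June extents for five years (illustrative values only).
june = pd.Series([10.0, 12.0, 11.0, 9.0, 8.0],
                 index=[2001, 2002, 2003, 2004, 2005])

# Anomaly = value minus the mean over all years (the mean here is 10.0).
anomalies = june - june.mean()

print(anomalies.tolist())  # [0.0, 2.0, 1.0, -1.0, -2.0]
print(anomalies.sum())     # 0.0
```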

Now we'll:
- Use NumPy's [`polyfit()`](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.polyfit.html) to compute slope and intercept for a least-squares fit line
- Use NumPy's [`poly1d`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.poly1d.html) to create a function representing this line
- Apply the linear function to our anomalies series' index to produce an array of points on the best-fit line
- Plot the anomalies together with the best-fit line

In [None]:
# polyfit arguments are: x-values, y-values, polynomial-degree
slope, intercept = np.polyfit(june_anomalies.index.values, june_anomalies.values, 1)
fit_function = np.poly1d([slope, intercept])
best_fit = fit_function(june_anomalies.index)

plt.figure(figsize=(15, 4))
june_anomalies.plot(title='Northern Hemisphere Snow Cover Anomalies: June', kind='bar', color='r')
plt.plot(best_fit, color='b', linestyle='--')
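
To check our understanding of `polyfit()` and `poly1d`, the following sketch fits points that lie exactly on a known line, so the recovered slope should match the one we chose. The `years` and `values` arrays are synthetic and chosen only for illustration.

```python
import numpy as np

# Synthetic points lying exactly on a line with slope -0.5 (illustrative only).
years = np.array([2000, 2001, 2002, 2003, 2004])
values = -0.5 * (years - 2002) + 1.0

# Degree-1 fit returns coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(years, values, 1)

# poly1d turns the coefficients into a callable: fit(x) = slope * x + intercept.
fit = np.poly1d([slope, intercept])

print(np.isclose(slope, -0.5))          # True
print(np.allclose(fit(years), values))  # True
```

One detail worth knowing about the matplotlib plot above: pandas draws bar charts at integer positions 0, 1, 2, ... rather than at the year values, which is why plotting `best_fit` without explicit x-values lines up with the bars.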

We can use [Plotly](https://plot.ly/) to create an interactive graph to more closely examine the anomaly values. Let's do some imports and initial setup:

In [None]:
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()

Plotly's API is very declarative. Here, we'll use some of the basic settings to create a graph similar to the previous one. This may seem like a lot of declaration, but we'll see that the results are worth it.

First, we define the bar-chart bars representing anomaly magnitude:

In [None]:
snow_cover_anomalies = go.Bar(
    # data
    x=june_anomalies.index,
    y=june_anomalies,
    # style
    name='Anomaly',
    marker={'color': 'red'},
    hoverinfo='y'
)

Next, we define the best-fit line:

In [None]:
snow_cover_trend = go.Scatter(
    # data
    x=june_anomalies.index,
    y=best_fit,
    # style
    name='Best Fit',
    # 'dash' takes a named style ('solid', 'dot', 'dash', ...) or a dash-length string
    line={'dash': 'dash', 'color': 'blue'},
    hoverinfo='y'
)

And the general layout for the plot:

In [None]:
layout = go.Layout(
    title='Northern Hemisphere Snow Cover Anomalies: June',
    xaxis={
        'tickmode': 'linear',
        'dtick': 5,
        'showline': True,
        'showgrid': True
    },
    yaxis={
        'showline': True
    }
)

Finally, we gather the bar and line traces into a list, pair them with the layout, and produce a figure that we can plot:

In [None]:
# A plain list of traces works across Plotly versions (go.Data is deprecated)
data = [snow_cover_anomalies, snow_cover_trend]
figure = go.Figure(data=data, layout=layout)

When we plot, note that we can hover on the graph to see the actual data values at that point in time, click on the legend to show/hide graph objects, click and drag to zoom various areas, etc.

In [None]:
plotly.offline.iplot(figure)

### Working with xarray

From the [xarray](http://xarray.pydata.org/en/stable/) documentation:

     "xarray (formerly xray) is an open source project and Python package
     that aims to bring the labeled data power of pandas to the physical
     sciences, by providing N-dimensional variants of the core pandas data
     structures."

With xarray, we can open a NetCDF file as an `xarray.Dataset` and avoid much of the grunt work of setting up dimensions and converting axes:

In [None]:
import xarray as xr

We can attach a `Dataset` variable to a NetCDF endpoint or local file with `xarray.open_dataset()`. In this case, we have a local, zipped data file we will use -- the same northern-hemisphere snow-cover data we used in Module 4.

In [None]:
%%bash
rm -f nhsce_v01r01_19661004_20160201.nc
unzip $PWD/data/nhsce_v01r01_19661004_20160201.nc.zip

In [None]:
# To read from the DODS/OPeNDAP NetCDF endpoint instead, use the URL
#   'http://www.ncdc.noaa.gov/thredds/dodsC/cdr/snowcover/nhsce_v01r01_19661004_latest.nc'

snowcover_file = 'nhsce_v01r01_19661004_20160201.nc'
dataset = xr.open_dataset(snowcover_file)
dataset

Let's look at the dataset's dimensions attribute:

In [None]:
dataset.dims

And its indexes attribute:

In [None]:
dataset.indexes

Notice that xarray has already taken care of converting the time coordinate into a `DatetimeIndex` (as opposed to how we manually converted it in Module 4).

List the dataset's variables:

In [None]:
dataset.data_vars

We can access the variables as attributes (e.g. `dataset.land`) or dictionary keys (e.g. `dataset['land']`).

Accessing a variable on a `Dataset`, either way, yields a `DataArray`.

In [None]:
dataset.land
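
The two access styles can be compared on a tiny in-memory `Dataset`. This is a sketch with made-up values; the variable name `land` mirrors the notebook, while the dimension names `rows` and `cols` are assumptions for illustration.

```python
import numpy as np
import xarray as xr

# Tiny in-memory Dataset with one 2x3 variable (values are made up).
toy_ds = xr.Dataset({'land': (('rows', 'cols'), np.zeros((2, 3)))})

by_attr = toy_ds.land      # attribute-style access
by_key = toy_ds['land']    # dictionary-style access

print(type(by_attr).__name__)     # DataArray
print(by_attr.identical(by_key))  # True
```

Dictionary-style access is the safer choice when a variable name could collide with an existing `Dataset` attribute or method, such as a variable named `dims`.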

So as in Module 4, we have access to all of the data and indexes from the endpoint/file.

Let's look at the `DataArray` holding the snow cover extent variable.

In [None]:
snow_cover_extent = dataset.snow_cover_extent
snow_cover_extent

Note the second line of output.

> `[19933056 values with dtype=float64]`


This indicates that the operation of loading the data has been deferred; that is, we have not retrieved all the values from the file or endpoint -- just the metadata, which is what we're seeing. When we access the `.values` or `.data` attributes, we force the loading of the actual data and see the above summary replaced by a printed representation of the actual `numpy.ndarray`.

This deferred loading allows us to examine a data source and load only what we really need, which is helpful when examining a large dataset.

### We can access data in `DataArray`s a number of ways.

By indexing positionally by integer:

In [None]:
# grabbing the 2402nd time slice in the file
snow_cover_extent[2401]

We can see the order of the dimensions, and subset accordingly.

In [None]:
snow_cover_extent.dims

In [None]:
a_slice = snow_cover_extent[2400:2403, 30:35, 35:41]
a_slice

Take a quick look back at the full `snow_cover_extent` data array:

In [None]:
snow_cover_extent

Note that loading of the values remains deferred.

So, to create `a_slice`, this operation retrieved only the data necessary from the data source. (This is less important when we're working with a local file, but potentially crucial when working with remote data sources over the network.)

We can grab a slice by integer position along named dimensions with [`DataArray.isel`](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.isel.html) (_isel_ for **i**nteger **sel**ect). Because the dimensions are named, the keyword arguments can be given in any order:

In [None]:
snow_cover_extent.isel(rows=slice(30, 40, 2), time=slice(970, 972), cols=slice(40, 45))
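
To confirm that `isel` is just positional slicing with names attached, this sketch compares the two on a small synthetic array. The dimension order `('time', 'rows', 'cols')` is an assumption made for the example; check `snow_cover_extent.dims` for the real order.

```python
import numpy as np
import xarray as xr

# Synthetic 4x5x6 array; the dimension order here is assumed for illustration.
toy = xr.DataArray(
    np.arange(4 * 5 * 6).reshape(4, 5, 6),
    dims=('time', 'rows', 'cols'),
)

# Positional slicing must follow the dimension order ...
positional = toy[1:3, 0:4:2, 2:5]
# ... while isel takes the same slices by name, in any order.
named = toy.isel(rows=slice(0, 4, 2), time=slice(1, 3), cols=slice(2, 5))

print(named.shape)                  # (2, 2, 3)
print(positional.identical(named))  # True
```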

Or we can take slices using an index's native type and [`DataArray.sel`](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.sel.html):

In [None]:
snow_cover_extent.sel(time=slice('2010-01-01', '2011-01-02'))
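
One thing to keep in mind with label-based slices is that, as in pandas, both endpoints are included. A minimal sketch on a synthetic daily series (the dates and values are made up) shows this:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Ten consecutive days with values 0..9 (synthetic data).
times = pd.date_range('2010-01-01', periods=10, freq='D')
toy = xr.DataArray(np.arange(10), dims=('time',), coords={'time': times})

# Label-based slice: both the start and the end date are included.
subset = toy.sel(time=slice('2010-01-03', '2010-01-05'))

print(subset.values)  # [2 3 4]
```

This differs from integer slicing with `isel`, where the stop position is excluded.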

Note that, despite taking these slices, loading the full variable data remains deferred. We can force the loading of `snow_cover_extent`'s full data by accessing its `values` attribute:

In [None]:
downloaded_data = snow_cover_extent.values

Now that the values have been loaded, the one-line data summary is replaced by a printout of the array itself:

In [None]:
snow_cover_extent

The `module-5-extra` notebook in the `extra` folder goes further with xarray, creating a beautiful interactive plot using this data and a number of new techniques. Check it out!

In Module 6, we'll look at some ways of sharing Jupyter notebooks online.