<p style="float:right">
<img src="images/logos/cu.png" style="display:inline" />
<img src="images/logos/cires.png" style="display:inline" />
<img src="images/logos/nasa.png" style="display:inline" />
</p>

# Python, Jupyter & pandas: Module 5

## Inference and Visualization

In [None]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd


Let's create some other graphs from the data we looked at in the previous module.

_**Note**: The .csv read by the next cell should have been created by the Module 4 notebook. Please be sure you have evaluated that notebook before proceeding._

In [None]:
monthly = pd.read_csv('monthly-extents.csv', index_col='date', parse_dates=True)

In [None]:
monthly.head()

We're going to look for a trend in the Northern Hemisphere June snowcover.

First, we'll create a new DataFrame indexed by years, with a column for each month.

In [None]:
year_by_month = monthly.set_index([monthly.index.year, monthly.index.month]).unstack(1)
year_by_month.head()
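
To make the reshaping step easier to follow, here is a minimal sketch of the same `set_index(...).unstack(1)` pattern on a tiny made-up DataFrame. The name `toy` and its values are assumptions for illustration only; they are not the real monthly extents.

```python
import pandas as pd

# Two years of made-up monthly values (hypothetical data, not the real extents).
dates = pd.date_range('2000-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'snowcover': range(24)}, index=dates)

# Replace the date index with a (year, month) MultiIndex, then pivot the
# month level (level 1) into columns.
toy_by_month = toy.set_index([toy.index.year, toy.index.month]).unstack(1)

print(toy_by_month.shape)             # one row per year, one column per month
print(toy_by_month['snowcover'][6])   # the June column, indexed by year
```

Selecting `['snowcover'][6]` on the result returns a Series of June values indexed by year, which is the shape the anomaly calculation relies on.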

Find the overall mean for June snow cover, and graph each year's difference from the mean.

In [None]:
june_anomalies = year_by_month['snowcover'][6] - year_by_month['snowcover'][6].mean()
june_anomalies = june_anomalies.dropna()

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15, 4)}):
    june_anomalies.plot(title='Northern Hemisphere Snow Cover Anomalies: June',kind='bar', color='r')
    

Compute a least squares linear fit.

In [None]:
slope, intercept = np.polyfit(june_anomalies.index.values, june_anomalies.values, 1)
fit_function = np.poly1d([ slope, intercept])
best_fit = fit_function(june_anomalies.index)
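
As a quick sanity check of the fitting step, the sketch below runs `np.polyfit` and `np.poly1d` on made-up anomalies that lie exactly on a line, so the recovered slope should match the one used to build the data. The `years` and `anomalies` arrays are assumptions for illustration.

```python
import numpy as np

years = np.arange(1970, 1980)
# Made-up anomalies that fall by exactly 0.5 per year around the mean year.
anomalies = -0.5 * (years - years.mean())

# Degree-1 polyfit returns the coefficients highest power first.
slope, intercept = np.polyfit(years, anomalies, 1)
fit = np.poly1d([slope, intercept])

print(round(slope, 6))
print(np.allclose(fit(years), anomalies))
```

With real data the points will not lie on the line, and the slope gives the average change in anomaly per year.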

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15, 4)}):
    june_anomalies.plot(title='Northern Hemisphere Snow Cover Anomalies: June', kind='bar', color='r')
    plt.plot(best_fit, color='b', linestyle='--')


We can use [Plotly](https://plot.ly/) to create an interactive graph to more closely examine the anomaly values.

In [None]:
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()

Plotly's API is very declarative. Here, we'll use some of the basic settings to create a graph similar to the previous one.

In [None]:
snow_cover_anomalies = go.Bar(
    # data
    x=june_anomalies.index,
    y=june_anomalies,
    
    # style
    name='Anomaly',
    marker={
        'color': 'red'
    },
    hoverinfo='y'
)

In [None]:
snow_cover_trend = go.Scatter(
    # data
    x=june_anomalies.index,
    y=best_fit,
    
    # style
    name='Best Fit',
    line ={
        'dash': 'dash',
        'color': 'blue'
    },
    hoverinfo='y'
)

In [None]:
layout = go.Layout(
    title='Northern Hemisphere Snow Cover Anomalies: June',
    xaxis={
        'tickmode': 'linear',
        'dtick': 5,
        'showline': True,
        'showgrid': True
    },
    yaxis={
        'showline': True
    }
)

In [None]:
data = [snow_cover_anomalies, snow_cover_trend]  # a plain list of traces; go.Data is deprecated in newer Plotly releases

In [None]:
figure = go.Figure({'data': data, 'layout': layout})

We can hover to see the precise data values, click on the legend to show/hide graph objects, click and drag to zoom in various areas, etc.

In [None]:
plotly.offline.iplot(figure)

# [xarray](http://xarray.pydata.org/en/stable/)

     "xarray (formerly xray) is an open source project and Python package
     that aims to bring the labeled data power of pandas to the physical
     sciences, by providing N-dimensional variants of the core pandas data
     structures."

With xarray you can open a NetCDF file as an `xarray.Dataset` and a lot of the grunt work of setting up dimensions and converting axes is done for you.

In [None]:
import xarray as xr  # import as xr by convention
import pandas as pd
import numpy as np

Attach a dataset variable to the NetCDF endpoint or file with `xarray.open_dataset()`.

The argument can be either the path to a local NetCDF file (`snowcover_file` below) or the URL of a NetCDF endpoint (`snowcover_url`).

In [None]:
# unzip packaged data file 
!cd data; unzip -o nhsce_v01r01_19661004_20160201.nc.zip 

In [None]:
# To read from the DODS/OPeNDAP NetCDF endpoint instead, pass this URL to xr.open_dataset():
# snowcover_url = 'http://www.ncdc.noaa.gov/thredds/dodsC/cdr/snowcover/nhsce_v01r01_19661004_latest.nc'

snowcover_file = 'data/nhsce_v01r01_19661004_20160201.nc'
dataset = xr.open_dataset(snowcover_file)


In [None]:
print(dataset)

You can see the dataset's dimensions attribute:

In [None]:
dataset.dims

and the indexes attribute:

In [None]:
dataset.indexes

Notice that xarray has already converted the time coordinate into a `DatetimeIndex`, which we had to do by hand in Module 4.

List the dataset's variables:

In [None]:
dataset.data_vars

We can access the variables as attributes or dictionary keys.

Accessing a `Dataset` variable yields a `DataArray`.

In [None]:
dataset['land']

So, just like in Module 4, we have access to all of the data and indexes from the endpoint/file.

Let's look at the `DataArray` for Snow Cover Extent.

In [None]:
snow_cover_extent = dataset['snow_cover_extent']
print(snow_cover_extent)

Note the second line of output.

> `[19933056 values with dtype=float64]`


This indicates that loading the data has been deferred; that is, we have not
fetched all of the values from the file or endpoint, just the metadata, which
is what is being displayed. Once you access the `.values` or `.data`
attribute, the data are loaded and you will instead see a printed
representation of the `numpy.ndarray`.

This deferred loading allows you to work with just the data you are
interested in, without having to download an entire file.

## You can access data in `DataArray`s a number of ways.

By indexing positionally by integer:

In [None]:
snow_cover_extent[2401]

You can see the order of the dimensions, and subset accordingly.

In [None]:
print(snow_cover_extent.dims)

In [None]:
a_slice = snow_cover_extent[2400:2403, 30:35, 35:41]
print(a_slice.shape)
print(a_slice)

And you can see again, this operation to retrieve `a_slice` has retrieved only the data necessary from the remote file or endpoint.

In [None]:
print(snow_cover_extent)

You can also grab a slice by integer position along a named dimension with `DataArray.isel` (*isel* stands for integer select).

In [None]:
snow_cover_extent.isel(rows=slice(30, 40, 2), time=slice(970, 972), cols=slice(40, 45))

Or you can slice using the index's native type with `DataArray.sel` (in this case, the date strings are coerced to `numpy.datetime64` objects).

In [None]:
snow_cover_extent.sel(time=slice('2010-01-01', '2011-01-02'))
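
The difference between the two selection styles can be seen on a small in-memory array. This is a sketch on made-up data: the name `toy`, its 4 x 5 grid, and its values are assumptions, chosen only to mirror the `time`, `rows`, `cols` dimension names used above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic weekly data on a tiny 4 x 5 grid (all values are made up).
times = pd.date_range('2010-01-04', periods=60, freq='7D')
values = np.arange(60 * 4 * 5, dtype=float).reshape(60, 4, 5)
toy = xr.DataArray(values, coords={'time': times}, dims=('time', 'rows', 'cols'))

# Positional selection: the end of an integer slice is exclusive.
by_position = toy.isel(time=slice(0, 2), rows=slice(0, 4, 2))

# Label selection: both ends of a label slice are inclusive.
by_label = toy.sel(time=slice('2010-01-04', '2010-01-18'))

print(by_position.shape)
print(by_label.shape)
```

Note that the label slice includes the sample dated 2010-01-18, whereas the integer slice stops before position 2.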

And fetching all of the data is still deferred.

In [None]:
print(snow_cover_extent)

Finally, we can use what we know about the data file to write a couple of routines
to compute anomalies and display them on a map.

We can use the `Dataset.groupby` method to group the time values, as we did
in Module 4. Here, we will group all of the time index values by month
number. Accessing the `groups` attribute returns a dictionary where the keys are
months, and the values are lists of indices into the dataset time index for that month.

In [None]:
month_indices = dataset.groupby('time.month').groups

In [None]:
print("Keys:", month_indices.keys(), "  One for each month")
print("February Indices:", month_indices[2][0:5], "...")
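
To show what such a dictionary contains, the sketch below builds an equivalent mapping by hand with pandas and NumPy on twelve made-up weekly dates. It only illustrates the structure (month number to list of integer positions); it is not how xarray computes `groups` internally, and the names `times` and `groups` are assumptions.

```python
import numpy as np
import pandas as pd

# Twelve made-up weekly timestamps starting in January 2015.
times = pd.date_range('2015-01-05', periods=12, freq='7D')

# Map each month number to the integer positions of its timestamps.
groups = {
    int(month): np.flatnonzero(times.month == month).tolist()
    for month in np.unique(times.month)
}

print(groups)
print(times[groups[2]])   # the February timestamps, selected by position
```

Because the values are integer positions rather than dates, they are suited to positional selection such as `isel`.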

We can use a list of indices to select data with `isel`.

In [None]:
month_num = 6

weeks = dataset['time'].isel(time=month_indices[month_num])

print("First 10 dataset['time'] values:\n ", weeks.values[0:10])
print("\nTotal number of samples in the selected month:", len(weeks))

We will use every available measurement in the `Dataset` to compute a median snow cover for a given month.

We do this by computing, for each grid cell, the mean across time of the month's samples. This gives the fraction of measurements in which that cell had snow cover. The cells where that fraction is greater than .5 make up the median snow cover for the month.

Choose a month with some snow.

In [None]:
month_number = 2
average_snowcover = dataset['snow_cover_extent'].isel(time=month_indices[month_number]).mean(dim='time')
median_snowcover = average_snowcover > .5
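
The mean-then-threshold idea can be checked on a tiny stack of made-up binary grids. The arrays below are assumptions for illustration: three snapshots of a 2 x 2 grid in which 1 means snow and 0 means no snow.

```python
import numpy as np

# Three made-up weekly snapshots of a 2 x 2 grid (1 = snow, 0 = no snow).
stack = np.array([
    [[1, 1], [0, 0]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
])

# Fraction of snapshots in which each cell had snow.
frequency = stack.mean(axis=0)

# A cell belongs to the "median" cover if it had snow more than half the time.
typical = frequency > 0.5

print(frequency)
print(typical)
```

For binary data, thresholding the mean at one half marks exactly the cells that were snow covered in a majority of samples.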

In [None]:
lats = dataset.latitude.values
lons = dataset.longitude.values
land = dataset.land.values

Create a function that will return a categorical grid of the snow cover anomaly. For each cell, we want to identify whether it is part of the selected month's extent and/or median extent, or if it is land or ocean.

In [None]:
def anomaly_snowcover(selected_snow_cover, median_snowcover, land):
    sel = selected_snow_cover.values.astype(bool)
    med = median_snowcover.values.astype(bool)
    land = land.astype(bool)

    # Do logical intersections of the data
    both     =  sel &  med
    only_med = ~sel &  med
    only_sel =  sel & ~med

    # Assign a category value to each intersection.
    out = np.zeros_like(land.astype(int))
    out[~land] = 0
    out[land] = 1
    out[both] = 2
    out[only_med] = 3
    out[only_sel] = 4

    return out
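
The category codes are easier to verify on a small example. The sketch below is a NumPy-only variant of the same logic, written so it runs without the dataset; the helper name `categorize` and the 3 x 2 input grids are hypothetical and are not part of the notebook's data.

```python
import numpy as np

def categorize(selected, median, land):
    """NumPy-only sketch of the same categories (hypothetical helper)."""
    sel, med, land = (np.asarray(a, dtype=bool) for a in (selected, median, land))
    out = np.where(land, 1, 0)   # 0 = ocean, 1 = land
    out[sel & med] = 2           # snow in both the selection and the median
    out[~sel & med] = 3          # snow in the median only
    out[sel & ~med] = 4          # snow in the selection only
    return out

# Made-up 3 x 2 grids covering each of the five categories.
selected = [[1, 0], [1, 0], [0, 0]]
median = [[1, 1], [0, 0], [0, 0]]
land = [[1, 1], [1, 1], [1, 0]]

result = categorize(selected, median, land)
print(result)
```

The snow categories are assigned after the land/ocean base values, so any cell with snow in either input overrides its land or ocean code.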



Create a [colormap](http://matplotlib.org/api/colors_api.html?highlight=listedcolormap#matplotlib.colors.ListedColormap) and [normalizer](http://matplotlib.org/users/colormapnorms.html) for plotting the `anomaly_snowcover` output.


We know our data will only have the values 0 through 4.

In [None]:
# Choose some nice colors
categorical_cmap = mpl.colors.ListedColormap(colors=['#D4EFFA', '#A3BAA5','#FEFEFE','#BC80BC', '#ACD665' ])

# Center each pair of bounds on one of the integer category values.
bounds = [-.5, .5, 1.5, 2.5, 3.5, 4.5]
norm = mpl.colors.BoundaryNorm(bounds, categorical_cmap.N)
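
To confirm that these bounds send each category to its own color, the sketch below rebuilds the same colormap and normalizer and applies the normalizer to the five category values. The variable names `cmap` and `color_indices` are assumptions for illustration.

```python
import numpy as np
from matplotlib.colors import BoundaryNorm, ListedColormap

cmap = ListedColormap(['#D4EFFA', '#A3BAA5', '#FEFEFE', '#BC80BC', '#ACD665'])

# Six boundaries define five bins, one per category value 0..4.
bounds = [-.5, .5, 1.5, 2.5, 3.5, 4.5]
norm = BoundaryNorm(bounds, cmap.N)

# Each category value falls in its own bin, so it maps to its own color index.
color_indices = norm(np.array([0, 1, 2, 3, 4]))
print([int(i) for i in color_indices])
```

Centering the bounds on the integers means small floating point noise in the data would not move a value into a neighboring bin.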

In [None]:
from mpl_toolkits.basemap import Basemap
from ipywidgets import interact
import ipywidgets as widgets

Now use a widget to plot anomalies of snow cover

In [None]:
import datetime as dt
import calendar as cal

def title_function(datetime64ns):
    date = datetime64ns.astype('M8[ms]').astype(dt.date)
    return "Snow Cover: {0}-{1} compared to median".format(cal.month_abbr[date.month], date.year)

def selected_month_label(datetime64ns):
    date = datetime64ns.astype('M8[ms]').astype(dt.date)
    return '{0} {1} Only'.format(cal.month_abbr[date.month], date.year)

In [None]:
@interact(index_in=widgets.IntSlider(min=0,max=len(month_indices[month_number])-1,step=1,value=0, continuous_update=False))
def plot_anomaly(index_in=0):
    index = month_indices[month_number][index_in]
    plt.figure(figsize=(10, 10))
    m = Basemap(projection='npstere', boundinglat=30, lon_0=-45)
    m.drawcoastlines()

    parallels = np.arange(0, 90, 20)
    m.drawparallels(parallels, labels=[True])
    meridians = np.arange(-180, 180, 45)
    m.drawmeridians(meridians, labels=[True,True,False,True,True,True,True,True])
    
    the_data = anomaly_snowcover(dataset['snow_cover_extent'].isel(time=index), median_snowcover, land)
    m.pcolor(lons, lats, the_data, latlon=True, cmap=categorical_cmap, norm=norm)
    
    times = dataset['time'].isel(time=index).values
    
    cbar = plt.colorbar(ticks=[0, 1, 2, 3, 4], norm=norm)
    cbar.set_ticklabels(['Ocean', 'Land', 'Both', 'Median Only', selected_month_label(times)])
    plt.title(title_function(times))
    plt.draw()

