<p style="float:right">
<img src="images/logos/cu.png" style="display:inline" />
<img src="images/logos/cires.png" style="display:inline" />
<img src="images/logos/nasa.png" style="display:inline" />
</p>

# Python, Jupyter & pandas: Module 4

## Using [xarray](http://xarray.pydata.org/en/stable/) and [pandas](http://pandas.pydata.org/) for analysis

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd  # needed below, before the xarray section
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
mpl.style.use('default')

We need a good time series to examine.

David Robinson's Rutgers Northern Hemisphere snow cover dataset is on a coarse
(88 x 88) Northern Hemisphere grid, with data going back to 1966.

http://climate.rutgers.edu/snowcover/docs.php?target=datareq

_Robinson, David A., Estilow, Thomas W., and NOAA CDR Program (2012): NOAA
Climate Data Record (CDR) of Northern Hemisphere (NH) Snow Cover Extent
(SCE), Version 1. [indicate subset used]. NOAA National Climatic Data
Center. doi:10.7289/V5N014G9 [access date]._

Following the link above, we can find the OPeNDAP (DODS) endpoint and
access it with the netCDF4 Python package.

In [None]:
import netCDF4
snowcover_url = 'http://www.ncdc.noaa.gov/thredds/dodsC/cdr/snowcover/nhsce_v01r01_19661004_latest.nc'

Open a connection to the OPeNDAP endpoint.

In [None]:
%%time 
ds = netCDF4.Dataset(snowcover_url)

Examine the netcdf attributes.

In [None]:
ds.ncattrs()

In [None]:
ds.cdr_variable

In [None]:
ds.title

Look at what variables are provided in the file.

In [None]:
ds.variables.keys()

Attach some metadata variables.

In [None]:
latitude = ds.variables['latitude']
longitude = ds.variables['longitude']
land = ds.variables['land']
area = ds.variables['area']

In [None]:
latitude

So we see it's an 88 x 88 grid of floats.

What area does the grid cover?

In [None]:
from mpl_toolkits.basemap import Basemap
from ipywidgets import interact
import ipywidgets as widgets

@interact(longitude_0=widgets.IntSlider(min=-165,max=-15,step=30,value=-105))
def plot_land(longitude_0=-80):
    plt.figure(figsize=(10, 10))
    m = Basemap(projection='npstere', boundinglat=30, lon_0=longitude_0)
    m.drawcoastlines()
    m.pcolor(longitude[:], latitude[:], land[:], latlon=True, cmap='Accent')
    plt.draw()



In [None]:
%%time
snowcover = ds.variables['snow_cover_extent']

In [None]:
snowcover

We have attached to a dataset with 2574 grids of 88 x 88 cells, where `1 = snow_covered` and `0 = no_snow`.

This step copies all of the data from the URL into your variable. It can take a long time (about 5 minutes).

In [None]:
%%time
all_data = snowcover[:,:,:]

Read the time data and convert it into datetime objects.

In [None]:
time = ds.variables['time']
times = netCDF4.num2date(time[:], time.units)
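
To see what this kind of conversion does without a network connection, here is a minimal sketch using pandas alone. The reference date and the day offsets are made-up stand-ins, not values read from the file; `netCDF4.num2date` parses the real `time.units` string for you.

```python
import numpy as np
import pandas as pd

# Hypothetical offsets, in days, from a hypothetical reference date.
offsets = np.array([0, 7, 14, 21])
reference = pd.Timestamp('1966-10-03')

# Adding a timedelta to the reference date turns each offset into a timestamp.
example_times = reference + pd.to_timedelta(offsets, unit='D')
print(example_times)
```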

In [None]:
sTimes = pd.Series(times)

In [None]:
single_week = all_data[1000, : , :]  # just choose a snowy index

In [None]:
plt.imshow(single_week)

In [None]:
plt.imshow(area[:])  # slice to read the variable into an array, as in the next cell

Use numpy's elementwise multiplication to get the snow-covered area per cell.

In [None]:
plt.imshow(single_week * area[:])

Define a quick routine to compute the total snow-covered area for a grid.

In [None]:
def snowcover_area_km2(grid, area):
    return np.sum(grid * area)
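
A tiny worked example may make the arithmetic concrete. The 2 x 2 grid and the cell areas below are invented for illustration; the real grids are 88 x 88.

```python
import numpy as np

def snowcover_area_km2(grid, area):
    return np.sum(grid * area)

# Hypothetical 2 x 2 grid: 1 = snow covered, 0 = no snow.
toy_grid = np.array([[1, 0],
                     [1, 1]])
# Hypothetical cell areas in km^2.
toy_area = np.array([[10.0, 20.0],
                     [30.0, 40.0]])

# Only the snow-covered cells contribute: 10 + 30 + 40.
print(snowcover_area_km2(toy_grid, toy_area))  # 80.0
```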

In [None]:
all_data.shape

Compute each week's total snow-covered area in km^2.

In [None]:
grid_area = area[:]
n_weeks = all_data.shape[0]  # number of weekly grids, taken from the data itself
total_area = np.ma.zeros(n_weeks)

for i in range(n_weeks):
    total_area[i] = snowcover_area_km2(all_data[i, :, :], grid_area)
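
The same weekly totals can also be computed without a Python loop by letting numpy broadcast the area grid across the time axis. This is a sketch on small random stand-in arrays (`toy_data` and `toy_area` are hypothetical), not a run against the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 5 weeks of a 4 x 4 snow mask and a 4 x 4 area grid.
toy_data = rng.integers(0, 2, size=(5, 4, 4))
toy_area = rng.uniform(1.0, 2.0, size=(4, 4))

# Loop version, one week at a time.
looped = np.zeros(toy_data.shape[0])
for i in range(toy_data.shape[0]):
    looped[i] = np.sum(toy_data[i] * toy_area)

# Broadcasting version: the (4, 4) area grid is applied to every week,
# then the two spatial axes are summed away.
vectorized = (toy_data * toy_area).sum(axis=(1, 2))

print(np.allclose(looped, vectorized))
```

For the full array this trades the loop for one temporary array the size of `all_data`, which is usually acceptable at this grid size.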
    

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    plt.plot(times, total_area)

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    plt.plot(times[100:120], total_area[100:120], marker='.')

We have weekly data at 7-day resolution, but we're interested in monthly averages. First, check the spacing between two consecutive samples.

In [None]:
xrandom = 501
times[xrandom+1] - times[xrandom]

This works, but pandas provides many routines for working with time series data.
Let's create a time series `ts` from our data and times.

In [None]:
ts = pd.Series(data=total_area, index=times)
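
As a sketch of what the resulting Series looks like, the example below builds one from invented weekly dates and values (`toy_times` and `toy_area` are hypothetical) and checks the spacing of the whole index at once rather than for a single pair of samples.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly dates and areas standing in for `times` and `total_area`.
toy_times = pd.date_range('2000-01-03', periods=6, freq='7D')
toy_area = np.array([45.0, 46.5, 47.0, 46.0, 44.5, 43.0])

toy_ts = pd.Series(data=toy_area, index=toy_times)

# Subtracting the index from a shifted copy of itself gives every gap.
spacing = toy_ts.index[1:] - toy_ts.index[:-1]
print(spacing)
print((spacing == pd.Timedelta(days=7)).all())
```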

The Series has a built-in plot method that gives us something like our original plot.

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    ts.plot()

So what's special about our Series index?

In [None]:
print(ts.index)

Pandas tells us it's a [DatetimeIndex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html). The linked documentation is pretty overwhelming, so let's just look at a few things it can do.

Create a small subindex.

In [None]:
subindex = ts.index[50:70]

In [None]:
subindex

In [None]:
subindex.year

In [None]:
subindex.month

In [None]:
subindex.day

In [None]:
subindex.dayofyear

In [None]:
subindex.dayofweek

You can select data based on the index.

Select all data that fell in the month of December during any year.

In [None]:
ts[ts.index.month == 12].head(10)

Select data from the year 2000.

In [None]:
ts[ts.index.year == 2000].head()
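
A DatetimeIndex also accepts date strings as labels, which can be shorter than a boolean mask. The sketch below uses an invented weekly series (`toy_ts` is hypothetical) to show that both routes select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series spanning late 1999 to early 2001.
toy_times = pd.date_range('1999-11-01', periods=70, freq='7D')
toy_ts = pd.Series(np.arange(70.0), index=toy_times)

# Boolean mask on the year, as in the cell above.
by_mask = toy_ts[toy_ts.index.year == 2000]
# Partial-string label: every row whose timestamp falls in 2000.
by_label = toy_ts.loc['2000']
print(by_mask.equals(by_label))

# Label slices also work, and include both end points.
winter = toy_ts.loc['1999-12':'2000-02']
print(winter.index.min(), winter.index.max())
```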


What was the maximum value?

In [None]:
ts.max()

On what day did that maximum occur?

In [None]:
ts[ts == ts.max()]
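
The mask `ts == ts.max()` returns every row that ties for the maximum. When only the first occurrence is needed, `Series.idxmax` is a shorter route. A small sketch with made-up values (`toy_ts` is hypothetical):

```python
import pandas as pd

toy_ts = pd.Series([3.0, 9.0, 4.0, 9.0],
                   index=pd.date_range('2001-01-01', periods=4, freq='7D'))

# idxmax returns the index label of the first maximum only.
print(toy_ts.idxmax())

# The boolean mask keeps every row that ties for the maximum.
print(toy_ts[toy_ts == toy_ts.max()])
```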

When was the snow cover more than 95% of the all-time maximum?

In [None]:
ts[ts > ts.max() * .95]

In [None]:
help(ts.resample('M'))

In [None]:
ts.groupby([ts.index.year, ts.index.month]).mean()

In [None]:
ts.resample('MS').mean()
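
The `groupby` and `resample` cells above compute the same monthly means but label them differently: `groupby` yields a (year, month) MultiIndex, while `resample('MS')` yields a DatetimeIndex stamped at each month start. The sketch below compares them on an invented weekly series (`toy_ts` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series covering January to March 2000.
toy_times = pd.date_range('2000-01-03', periods=12, freq='7D')
toy_ts = pd.Series(np.arange(12.0), index=toy_times)

grouped = toy_ts.groupby([toy_ts.index.year, toy_ts.index.month]).mean()
resampled = toy_ts.resample('MS').mean()

print(grouped)
print(resampled)
print(np.allclose(grouped.values, resampled.values))
```

One difference to keep in mind: `resample` also emits a row (holding NaN) for any month with no observations, whereas `groupby` simply has no group for it.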

In [None]:
ts.resample('D').ffill()  # upsample the weekly series to daily, carrying each value forward
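
Resampling to a finer frequency creates new rows, and `ffill` fills each one with the most recent earlier observation. A minimal sketch with two invented weekly values (`toy_ts` is hypothetical):

```python
import pandas as pd

# Two hypothetical observations one week apart.
toy_ts = pd.Series([1.0, 2.0],
                   index=pd.to_datetime(['2000-01-03', '2000-01-10']))

# Daily rows from the first to the last timestamp, forward filled.
daily = toy_ts.resample('D').ffill()
print(daily)
```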

In [None]:
import pandas as pd
import xarray as xr

In [None]:
all_data.shape

In [None]:
dset = xr.Dataset({'snowcover': (('time', 'row', 'col'), all_data)},
                  {'time': pd.DatetimeIndex(times)})
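
To experiment with this structure without downloading anything, the sketch below builds the same kind of Dataset from invented data. The `toy_*` names, the 3 x 3 grid and the all-ones values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-ins: 8 weekly 3 x 3 grids, every cell snow covered.
toy_times = pd.date_range('2000-01-03', periods=8, freq='7D')
toy_data = np.ones((8, 3, 3))

toy_dset = xr.Dataset({'snowcover': (('time', 'row', 'col'), toy_data)},
                      {'time': toy_times})

# Label-based selection along the time coordinate.
january = toy_dset.sel(time=slice('2000-01-01', '2000-01-31'))
print(january.sizes['time'])

# Collapse the two spatial dimensions to get one value per week.
weekly_count = toy_dset['snowcover'].sum(dim=('row', 'col'))
print(weekly_count.to_series())
```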

In [None]:
dset.dims

In [None]:
dset.time[0]

In [None]:
dset.sel(time=['1966-10-10'])

In [None]:
xr.DataArray(np.random.randn(2, 3))


In [None]:
data = xr.DataArray(np.arange(6.).reshape(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])


In [None]:
data

In [None]:
xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))


In [None]:
data.attrs

In [None]:
data[:, [0, 1, 2]]

In [None]:
data.loc[:,:]

In [None]:
data.loc['b':'a':-1]

In [None]:
data.isel(x=slice(0,2,1))

In [None]:
data.sel(x='a')

In [None]:
a = xr.DataArray(np.random.randn(3), [data.coords['y']])

In [None]:
b = xr.DataArray(np.random.randn(4), dims='z')


In [None]:
a

In [None]:
b

In [None]:
a + b
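
xarray broadcasts by dimension name rather than by axis position, so adding arrays with different dimensions produces their outer combination. A minimal sketch with deterministic values (`toy_a` and `toy_b` are hypothetical stand-ins for `a` and `b`):

```python
import numpy as np
import xarray as xr

# One array along 'y' (with coordinate labels), one along an unlabeled 'z'.
toy_a = xr.DataArray(np.arange(3.0), dims='y', coords={'y': [-2, 0, 2]})
toy_b = xr.DataArray(np.arange(4.0), dims='z')

# The result has both dimensions: every y value paired with every z value.
total = toy_a + toy_b
print(total.dims, total.shape)
```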

In [None]:
data.T - data

In [None]:
data[:-1] - data[:1]

In [None]:
labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')


In [None]:
labels

In [None]:
data.groupby(labels).groups

In [None]:
data

In [None]:
data.groupby(labels).min('y').to_series()

In [None]:
data

In [None]:
data.to_series()

In [None]:
dsex = data.to_dataset(name='foo')

In [None]:
dsex

In [None]:
dsex.to_netcdf('example.nc')

In [None]:
ds2 = xr.open_dataset(snowcover_url)

In [None]:
ds2.time
