<p style="float:right">
<img src="images/logos/cu.png" style="display:inline" />
<img src="images/logos/cires.png" style="display:inline" />
<img src="images/logos/nasa.png" style="display:inline" />
</p>

# Python, Jupyter & pandas: Module 4

## Using [xarray](http://xarray.pydata.org/en/stable/) and [pandas](http://pandas.pydata.org/) for analysis

In [None]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

We can find a nice timeseries to examine.

David Robinson's Rutgers Northern Hemisphere snow cover dataset uses a coarse (88 x 88)
Northern Hemisphere grid, with data going back to 1966.

http://climate.rutgers.edu/snowcover/docs.php?target=datareq

_Robinson, David A., Estilow, Thomas W., and NOAA CDR Program (2012): NOAA
Climate Data Record (CDR) of Northern Hemisphere (NH) Snow Cover Extent
(SCE), Version 1. [indicate subset used]. NOAA National Climatic Data
Center. doi:10.7289/V5N014G9 [access date]._

Following the link above, we can find the OPeNDAP (DODS)
endpoint and access it via the netCDF4 Python package.

In [None]:
import netCDF4
snowcover_url = 'http://www.ncdc.noaa.gov/thredds/dodsC/cdr/snowcover/nhsce_v01r01_19661004_latest.nc'

Open a connection to the OPeNDAP endpoint.

In [None]:
%%time
ds = netCDF4.Dataset(snowcover_url)

Examine the netCDF attributes.

In [None]:
ds.ncattrs()

In [None]:
ds.cdr_variable

In [None]:
ds.title

Look at what variables are provided in the file.

In [None]:
ds.variables.keys()

Attach variables to some of the metadata variables in the file.

In [None]:
latitude = ds.variables['latitude']
longitude = ds.variables['longitude']
land = ds.variables['land']
area = ds.variables['area']

In [None]:
print(latitude)

So we see it's an 88 x 88 grid of floats.

We know it's Northern Hemisphere data, but what does the grid really look like?

In [None]:
with mpl.rc_context(rc={'figure.figsize': (10,10)}):
    plt.imshow(land[:], cmap='Accent', interpolation='nearest')

So if you squint and are accustomed to looking at polar projections, you can probably see North America on the lower part of the grid.

But we can use the Basemap toolkit for matplotlib to add some graticules and coastlines.


In [None]:
from mpl_toolkits.basemap import Basemap
from ipywidgets import interact
import ipywidgets as widgets

@interact(longitude_0=widgets.IntSlider(min=-165,max=-15,step=30,value=-105))
def plot_land(longitude_0=-80):
    plt.figure(figsize=(10, 10))
    m = Basemap(projection='npstere', boundinglat=30, lon_0=longitude_0)
    m.drawcoastlines()

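    parallels = np.arange(0, 90, 20)
    # labels is [left, right, top, bottom]: label parallels on the left edge only
    m.drawparallels(parallels, labels=[True, False, False, False])
    meridians = np.arange(-180, 180, 45)
    # label meridians on all four edges
    m.drawmeridians(meridians, labels=[True, True, True, True])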

    m.pcolor(longitude[:], latitude[:], land[:], latlon=True, cmap='Accent')
    plt.draw()



We can attach to the main variable in the file and get an idea of what's in it.

In [None]:
%%time
snowcover = ds.variables['snow_cover_extent']

In [None]:
print(snowcover)

We have attached to a dataset with 2574 grids of 88 x 88 cells, where `1 = snow_covered` and `0 = no_snow`.

This step copies all of the data from the URL into your data variable. It can take a long time (about 5 minutes).

In [None]:
%%time
all_data = snowcover[:,:,:]

We can just plot the data and take a look at a few of the grids.

In [None]:
@interact(index=widgets.IntSlider(min=0,max=2573,step=4,value=0))
def show_it(index=0):
    with mpl.rc_context(rc={"figure.figsize": (10, 10)}):
        plt.imshow(all_data[index,:,:], interpolation='nearest', cmap='Blues')
        

So we have a binary snow/no-snow grid, and we saw there was also an area grid.

In [None]:
with mpl.rc_context(rc={'figure.figsize': (10,10)}):
    plt.imshow(area[:], interpolation='nearest', cmap="plasma")
    cb = plt.colorbar()
    cb.set_label('Grid Cell Area: $km^2$')

Use numpy's elementwise multiplication to multiply each snow cell by its cell area, giving a snow-covered area per cell.

In [None]:
@interact(index=widgets.IntSlider(min=0,max=2573,step=4,value=0))
def show_it(index=0):
    with mpl.rc_context(rc={'figure.figsize': (10,10)}):
        plt.imshow(all_data[index,:,:] * area[:], interpolation='nearest', cmap='plasma')

Define a quick routine to compute the total snow-covered area for a grid.

In [None]:
def snowcover_area_km2(grid, area):
    return np.sum(grid * area)

Compute each weekly total snow-covered area in km^2.

In [None]:
weeks = all_data.shape[0]
grid_area = area[:]
total_area = np.ma.zeros(weeks)
for i in np.arange(weeks):
    total_area[i] = snowcover_area_km2(all_data[i, :, :], grid_area)
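
The loop above can also be written as a single broadcast operation. The following is a minimal sketch on small synthetic arrays (`demo_data` and `demo_area` are made-up stand-ins, not the real snowcover data), showing that both approaches give the same weekly totals.

```python
import numpy as np

# Synthetic stand-ins (assumptions): 5 weekly 4 x 4 binary snow grids
# and a 4 x 4 grid of cell areas in km^2.
rng = np.random.default_rng(0)
demo_data = rng.integers(0, 2, size=(5, 4, 4))
demo_area = rng.uniform(1.0e4, 4.0e4, size=(4, 4))

# Loop version, mirroring the cell above.
loop_totals = np.zeros(demo_data.shape[0])
for i in range(demo_data.shape[0]):
    loop_totals[i] = np.sum(demo_data[i, :, :] * demo_area)

# Broadcast version: the (4, 4) area grid is multiplied against every
# weekly grid, then the two spatial axes are summed away.
vector_totals = (demo_data * demo_area).sum(axis=(1, 2))

print(np.allclose(loop_totals, vector_totals))
```

For a dataset of this size either form is fine; the broadcast form mainly saves typing.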


Read and convert the time data into datetime objects.

In [None]:
ds.variables['time']

Use the netCDF4 helper function `num2date` to convert `days since <X>` into datetime objects.

In [None]:
file_time = ds.variables['time']
times = netCDF4.num2date(file_time[:], file_time.units)

In [None]:
print(file_time[3:7])
print(times[3:7])
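
To make the `days since <X>` idea concrete, here is a small sketch that does the same kind of conversion with pandas. The day offsets and the reference date are hypothetical; in the real file both come from the `time` variable and its `units` attribute.

```python
import pandas as pd

# Hypothetical values: day offsets one week apart and an assumed
# reference date. The real file supplies both through its `time` variable.
demo_days = [0, 7, 14, 21]
demo_origin = pd.Timestamp('1966-10-03')

# Interpret each number as a count of days after the reference date.
demo_times = pd.to_datetime(demo_days, unit='D', origin=demo_origin)
print(demo_times)
```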

Now we can plot the area vs time.

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    plt.plot(times, total_area)
    plt.title('Northern Hemisphere Weekly Snow Covered Area')

Or just look at a subset.

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    plt.plot(times[100:120], total_area[100:120], marker='.')
    plt.title('subset of weekly NH snowcover data')

We can see that we have weekly data at 7-day resolution, but we're going to need to compare this timeseries to a monthly dataset.

In [None]:
offset = 500
times[offset+1] - times[offset]

The difference between consecutive timestamps is 7 days, as expected.
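
Checking a single pair of timestamps is only a spot check. A sketch of how one might check every step at once, using a hypothetical weekly index in place of the real `times` array:

```python
import pandas as pd

# Hypothetical weekly timestamps standing in for `times`.
demo_times = pd.date_range('1966-10-10', periods=10, freq='7D')

# Difference between each timestamp and the one before it
# (the first entry has no predecessor, so it is dropped).
steps = pd.Series(demo_times).diff().dropna()
print(steps.unique())

# True only if every step is exactly 7 days.
all_weekly = (steps == pd.Timedelta(days=7)).all()
print(all_weekly)
```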

Pandas provides lots of routines for working with timeseries data. Let's create a [pandas.Series](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) timeseries `ts` from the data and times.

"[Series](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:"

      s = pd.Series(data, index=index)



In [None]:
import pandas as pd    # by convention import pandas as pd

In [None]:
ts = pd.Series(total_area, index=times)
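
As a minimal illustration of the same constructor on made-up numbers (the values and dates below are hypothetical), a Series pairs each value with a label, and the label can then be used for lookup:

```python
import pandas as pd

# Hypothetical weekly dates and values.
demo_index = pd.date_range('2000-01-03', periods=4, freq='7D')
demo_ts = pd.Series([1.0, 2.5, 2.0, 3.5], index=demo_index)

# Look up a value by its timestamp label rather than by position.
value = demo_ts.loc[pd.Timestamp('2000-01-10')]
print(value)
```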

The `pd.Series.head()` method lets you examine the first few items of a Series object.

In [None]:
ts.head()

The `pd.Series.describe()` method gives you a statistical overview of your series.

In [None]:
ts.describe()

The `pandas.Series` has a built-in plot method that will give us something like our original plot.

It plots the data values against the index.

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    ts.plot(title='Northern Hemisphere Snow Covered Area: $km^2$')

Examine the index of our `pd.Series`.

In [None]:
print(ts.index)

Pandas tells us it's a [DatetimeIndex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html). That link goes to the documentation, which is pretty overwhelming, so let's just look at a few things one can do with a DatetimeIndex.

Create a little subindex to play with. (subset by offset)
Choose 20 values that overlap a year boundary.

In [None]:
subindex = ts.index[50:70]

In [None]:
print(subindex)

You can access built-in attributes that know about years, months, etc.

In [None]:
subindex.year

In [None]:
subindex.month

In [None]:
subindex.day

In [None]:
subindex.dayofyear

You can select data based on the index.

Select all data that fell in the month of December during any year.

In [None]:
ts[ts.index.month == 12].head(10)

In [None]:
ts[ts.index.month == 12].plot(title='NH December Snowcover Extent')
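
The selection above works in two steps: the comparison builds a boolean mask, and indexing with the mask keeps the matching rows. A small sketch on a hypothetical weekly series makes the two steps explicit:

```python
import pandas as pd

# Hypothetical weekly series spanning late 1999 into early 2000.
demo_index = pd.date_range('1999-11-15', periods=10, freq='7D')
demo_ts = pd.Series(range(10), index=demo_index, dtype=float)

# The comparison yields one True/False per timestamp...
is_december = demo_ts.index.month == 12
# ...and indexing with that mask keeps only the True rows.
december = demo_ts[is_december]
print(december)
```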

Select data from the year 2000.

In [None]:
ts[ts.index.year == 2000].head()


In [None]:
ts[ts.index.year == 2000].plot(title='NH snowcover: 2000')

What was the maximum value of all data points?

In [None]:
ts.max()

On what day did that maximum occur?

In [None]:
ts[ts == ts.max()]
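
An alternative worth knowing is `Series.idxmax()`, which returns the index label of the maximum directly. The sketch below uses hypothetical values with a deliberate tie to show how the two approaches differ:

```python
import pandas as pd

# Hypothetical values; the maximum (9.0) appears twice on purpose.
demo_ts = pd.Series(
    [3.0, 9.0, 4.0, 9.0],
    index=pd.date_range('2000-01-03', periods=4, freq='7D'),
)

# Boolean selection returns every row equal to the maximum, ties included.
all_max = demo_ts[demo_ts == demo_ts.max()]
# idxmax returns only the label of the first occurrence.
first_max = demo_ts.idxmax()
print(all_max)
print(first_max)
```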

When did snow covered area reach at least 95% of the max ever?

In [None]:
ts[ts >= ts.max() * .95]

## Resample timeseries

Subsetting and selecting is useful. But in this case let's say we need to compare our total snowcovered area with some other monthly derived geophysical constant. We are going to have to turn our weekly data into monthly data.

We can use the `pd.Series.resample()` method to "align" our data with months.

In [None]:
ts.resample('MS').mean().head()

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    ts.resample('MS').mean().plot(title='NH Monthly Average Snow Covered Area')


In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    ts[200:250].resample('MS').mean().plot(marker='x')
    ts[200:250].plot(marker='.')
    plt.title('50 Weeks of NH snowcover data')


In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,2)}):
    ts.resample('MS').mean()[100:150].plot(title='50 months of snowcover data')


When we call `resample('MS').mean()`, pandas takes the mean of all values whose timestamps fall within each month. We can show that explicitly.

Original values around Nov 1966:

In [None]:
ts.iloc[3:9]

Mean of Nov 1966 values

In [None]:
print(ts.iloc[4:8])
print("Mean = ", ts.iloc[4:8].mean())


And the head of the resampled timeseries:

In [None]:
ts.resample('MS').mean().head(3)

You can see that the '1966-11-01' value is the same as the mean of the data from
[1966-11-07, 1966-11-14, 1966-11-21, 1966-11-28].
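
The same check can be reproduced on made-up numbers. In this sketch the dates match the weeks discussed above but the values are hypothetical, chosen so the arithmetic is easy to follow:

```python
import pandas as pd

# Hypothetical values on weekly dates from 1966-10-31 to 1966-12-05.
demo_ts = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    index=pd.date_range('1966-10-31', periods=6, freq='7D'),
)

# 'MS' labels each bin with the first day of its month.
monthly_simple = demo_ts.resample('MS').mean()

# Same result by hand: average only the rows stamped in November.
november_by_hand = demo_ts[demo_ts.index.month == 11].mean()

print(monthly_simple)
print(november_by_hand)
```

Note that the value stamped 1966-10-31 lands entirely in the October bin here.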


But really, in this case, each indexed snowcover grid represents the snow for a week of data. Therefore, the data indexed at '1966-10-31' represents the week beginning on that day and running through 1966-11-06, and we should include those values when we compute the mean for November 1966.

You could compute weights for each weekly grid based on how many of its days fall in the target month and then take weighted means, but with pandas there's an easier way.

We can resample the data to a daily period before resampling to a monthly period, filling the days between index values with `ffill()`.

Here's how pandas sees the current dataset when it is resampled to days: a single extent value on each day that matches an original timestamp, and NaN everywhere else.

In [None]:
ts.resample('D').mean().head(15)

But if you fill forward you can assign a value to each resampled location in the index.

In [None]:
ts.resample('D').ffill(limit=7).head(15)

So every day now carries the value of the weekly grid that covers it.

Now when we resample to monthly we will have correctly weighted all of the data for a particular month.

In [None]:
ts.resample('D').ffill(limit=7).resample('MS').mean().head()
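
To see that the daily forward fill really does weight each week by the number of its days in the month, here is a sketch on hypothetical values (the dates match the weeks discussed above; the numbers are made up). The hand-computed weighted mean for November is compared with the resampled result.

```python
import pandas as pd

# Hypothetical values on weekly dates from 1966-10-31 to 1966-12-05.
demo_ts = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    index=pd.date_range('1966-10-31', periods=6, freq='7D'),
)

# Spread each weekly value over the days it covers, then average by month.
daily = demo_ts.resample('D').ffill(limit=7)
weighted = daily.resample('MS').mean()

# Hand check for November 1966 (30 days): the 31 Oct value covers 1-6 Nov,
# the next three values cover 7 days each, and the 28 Nov value covers
# 28-30 Nov.
by_hand = (6 * 10.0 + 7 * 20.0 + 7 * 30.0 + 7 * 40.0 + 3 * 50.0) / 30

print(weighted.loc[pd.Timestamp('1966-11-01')], by_hand)
```

With these numbers the unweighted November mean would be 35.0, so the choice of method is visible in the result.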

Remember the result without the daily sampling first:

In [None]:
ts.resample('MS').mean().head()

So now we can create a monthly timeseries of Total Northern Hemisphere Snow Cover.  What should we do with it?

In [None]:
monthly = ts.resample('D').ffill(limit=7).resample('MS').mean()

Plot a couple of months?

In [None]:
with mpl.rc_context(rc={'figure.figsize': (15,3)}):
    monthly[monthly.index.month == 2].plot(linestyle='-', label='February', legend=True)
    monthly[monthly.index.month == 5].plot(marker='.', label='May', legend=True, title='Compare months?') 


We could answer the question:
which February has the lowest snowcover?

In [None]:
monthly[monthly.index.month == 2].idxmin()

What are the rankings of snowcover for March from greatest to least?

In [None]:
march = monthly[monthly.index.month == 3]
rank = march.rank(ascending=False)

In [None]:
rank.head()

So the March with the highest snowcover would be the one with rank=1.

In [None]:
rank[rank == 1.]

But then, to determine the actual value of snowcover extent for that March, we need to go back to the `march` Series.


In [None]:
march[march.index == '1978-03-01']
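
The rank-then-look-up pattern can be tried on a tiny example. The years and values below are hypothetical and only illustrate the mechanics:

```python
import pandas as pd

# Hypothetical March values for four made-up years.
demo_march = pd.Series(
    [41.0, 45.0, 39.0, 43.0],
    index=pd.to_datetime(['2001-03-01', '2002-03-01', '2003-03-01', '2004-03-01']),
)

# ascending=False gives rank 1 to the largest value.
demo_rank = demo_march.rank(ascending=False)

# Find the label holding rank 1, then read the value at that label.
top_label = demo_rank.idxmin()
top_value = demo_march.loc[top_label]
print(top_label, top_value)
```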

Here would be a good point to introduce a [`pandas.DataFrame`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)  

     Two-dimensional size-mutable, potentially heterogeneous tabular data
     structure with labeled axes (rows and columns). Arithmetic operations
     align on both row and column labels. Can be thought of as a dict-like
     container for Series objects. The primary pandas data structure.

A DataFrame will allow us to align march rank and values into a single object.

Create a simple pd.DataFrame from our monthly data using March and its Rank.
We see that the indexes are the same.  One has total snowcover area and the other has the rank.

In [None]:
march.head()

In [None]:
rank.head()

Now create a DataFrame from these series.

In [None]:
d = {'march': march, 'rank': rank}
df = pd.DataFrame(data=d)


In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.sort_values('rank', ascending=True).head()
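
One detail worth seeing in isolation is that a DataFrame aligns its columns on index labels, not on position. In this sketch (hypothetical years and values) the rank Series is deliberately reversed before building the DataFrame, and the rows still line up:

```python
import pandas as pd

# Hypothetical March values for three made-up years.
idx = pd.to_datetime(['2001-03-01', '2002-03-01', '2003-03-01'])
demo_march = pd.Series([41.0, 45.0, 39.0], index=idx)
demo_rank = demo_march.rank(ascending=False)

# Reverse the order of the rank Series; its labels travel with the values.
shuffled_rank = demo_rank.iloc[::-1]

# Columns are matched by index label when the DataFrame is built.
demo_df = pd.DataFrame({'march': demo_march, 'rank': shuffled_rank})

# The top-ranked row after sorting by rank.
best = demo_df.sort_values('rank').iloc[0]
print(demo_df)
print(best)
```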

So now we can just look at the sorted rank and see that the maximum March snowcover extent occurred in 1978 at 4.598040e+07 km^2.

Say we now want to know the anomaly of each March relative to the mean of all Marches (an anomaly computation).

With a DataFrame we just add another column where we have subtracted the mean of all Marches from each value.

In [None]:
df['march_anomaly'] = (df['march'] - df['march'].mean())

In [None]:
df.head()
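
A quick sanity check on an anomaly column is that it sums to zero, since the mean has been removed. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical March values for four made-up years; their mean is 42.0.
demo_df = pd.DataFrame(
    {'march': [41.0, 45.0, 39.0, 43.0]},
    index=pd.to_datetime(['2001-03-01', '2002-03-01', '2003-03-01', '2004-03-01']),
)

# Subtract the mean of the column from every row.
demo_df['march_anomaly'] = demo_df['march'] - demo_df['march'].mean()
print(demo_df)
print(demo_df['march_anomaly'].sum())
```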

You can keep adding columns at will to your DataFrame.

Now that we've seen how a basic DataFrame works, we can create a DataFrame from our original series and shape it how we like.

Create a new DataFrame from the Northern Hemisphere `monthly` snowcovered extent series

In [None]:
monthly.head()

In [None]:
df = pd.DataFrame(monthly, columns=['snowcover'])
df.head()

It doesn't look that different from the series, but we can now add as many columns alongside the index as we need.


Part of this flexibility is that we can reshape the existing DataFrame however we like. If we want each month in its own column, we can set the index and then unstack the months.

First, set the index to year and month, creating a multi-index.

In [None]:
df = df.set_index([df.index.year, df.index.month])
df.head()

In [None]:
type(df.index)

In [None]:
print(df.index.levels[0])
print(df.index.levels[1])

You can see that the index is now by [year, month] and is a [`pandas.MultiIndex`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.html).

Index level 0 holds the years and level 1 holds the months.

You can select data from this index.

In [None]:
df.loc[1979:1980]

Or use a cross section to grab a specific month (and then years)

In [None]:
df.xs(5, level=1).loc[1980:1985]
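
The same selections can be tried on a small synthetic frame. The sketch below builds two years of hypothetical monthly values and, as an optional extra not used in the cells above, names the index levels so they can be referred to as `'year'` and `'month'` instead of 0 and 1:

```python
import pandas as pd

# Hypothetical monthly values 0.0 .. 23.0 for 1979 and 1980.
demo_monthly = pd.Series(
    range(24),
    index=pd.date_range('1979-01-01', periods=24, freq='MS'),
    dtype=float,
)
demo_df = pd.DataFrame({'snowcover': demo_monthly})

# Replace the DatetimeIndex with a (year, month) MultiIndex.
demo_df = demo_df.set_index([demo_df.index.year, demo_df.index.month])
demo_df.index.names = ['year', 'month']  # naming the levels is optional

# A single cell, addressed by a (year, month) tuple.
one_value = demo_df.loc[(1980, 5), 'snowcover']
# Every May, as a cross section on the month level.
mays = demo_df.xs(5, level='month')
print(one_value)
print(mays)
```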

You can use `unstack` on the month level to get an index of years, with columns of months.

In [None]:
year_by_month = df.unstack(level=1)
year_by_month.head()


If instead you wanted rows of months and columns of years, you would have unstacked level=0


In [None]:
month_by_year = df.unstack(level=0)
month_by_year.head(3)
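
On a frame small enough to read at a glance, the effect of the two `unstack` calls is easier to see. This sketch uses a hypothetical two-year, three-month frame with named index levels (the names are an assumption for readability):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 years x 3 months, values 1.0 .. 6.0.
index = pd.MultiIndex.from_product(
    [[1979, 1980], [1, 2, 3]], names=['year', 'month']
)
demo_df = pd.DataFrame(
    {'snowcover': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index
)

# Moving the month level to the columns leaves one row per year.
year_by_month = demo_df.unstack(level='month')
# Moving the year level to the columns leaves one row per month.
month_by_year = demo_df.unstack(level='year')

print(year_by_month.shape, month_by_year.shape)
# The two tables hold the same numbers, transposed.
print(np.array_equal(year_by_month.values.T, month_by_year.values))
```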

You can still select from the DataFrame.

In [None]:
month_by_year['snowcover'][[1970, 1980, 1990, 2000, 2010]]

In [None]:
month_by_year['snowcover'][[1970, 1980, 1990, 2000, 2010]].plot()

We can save our work by writing the monthly data out to a CSV file.

In [None]:
monthly.head()

In [None]:
monthly.name = 'snowcover'
monthly.to_csv('monthly-extents.csv', index_label='date', header=True)

In [None]:
!ls monthly-extents.csv

In [None]:
!head monthly-extents.csv
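
To pick the work up again later, the CSV can be read back into a Series. The sketch below round-trips a small hypothetical series through an in-memory buffer instead of a file on disk; with the real data you would pass `'monthly-extents.csv'` to `pd.read_csv` instead of the buffer.

```python
import io
import pandas as pd

# Hypothetical monthly values standing in for `monthly`.
demo_monthly = pd.Series(
    [1.5, 2.5, 3.5],
    index=pd.date_range('2000-01-01', periods=3, freq='MS'),
    name='snowcover',
)

# Write to an in-memory text buffer using the same arguments as above.
buffer = io.StringIO()
demo_monthly.to_csv(buffer, index_label='date', header=True)
buffer.seek(0)

# Read it back, parsing the 'date' column into a DatetimeIndex.
restored = pd.read_csv(buffer, index_col='date', parse_dates=True)['snowcover']
print(restored)
```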