<p style="float:right">
<img src="../images/logos/cu.png" style="display:inline" />
<img src="../images/logos/cires.png" style="display:inline" />
<img src="../images/logos/nasa.png" style="display:inline" />
<img src="../images/logos/nsidc_daac.png" style="display:inline" />
</p>

# Python, Jupyter & pandas: Exercises for Module 4

Run the following cell as-is to do some initial setup. It repeats some steps from the setup for Exercise 3, as well as some pieces of Module 4. Since pandas excels at working with time series data rather than gridded data, we are ultimately interested in the total sea ice area for each day in the dataset. Here, we'll save that to a variable called `total_area` before plugging it into pandas.

In [None]:
%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import netCDF4
import numpy as np
import pandas as pd

data_file = '../data/september-concentration.nc'
dataset = netCDF4.Dataset(data_file)
variables = dataset.variables

area = variables['area']
sic = variables['sea_ice_concentration']

# the time variable in the netCDF file is days since some epoch,
# let's just work with datetime objects
time = netCDF4.num2date(variables['time'][:], variables['time'].units)

def seaice_area_km2(grid, area):
    # get rid of flagged values and convert 0-100% to 0.0 to 1.0
    decimal = (np.ma.masked_outside(grid, 0, 100) / 100)
    
    return np.sum(area * decimal)

days = sic.shape[0]
grid_area = area[:]
total_area = np.ma.zeros(days)
for i in np.arange(days):
    total_area[i] = seaice_area_km2(sic[i, :, :], grid_area)
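
To see what the masking step in `seaice_area_km2` does, here is a minimal sketch on a made-up 2x2 grid. The names `toy_grid` and `toy_cell_area` are hypothetical, and the value 255 only stands in for a flag value; the real flag values in the data file may differ.

```python
import numpy as np

# Hypothetical concentration grid in percent; 255 stands in for a flag value
toy_grid = np.array([[100, 50], [255, 0]])
# Hypothetical area of each grid cell in km^2
toy_cell_area = np.array([[10.0, 10.0], [10.0, 10.0]])

# Mask anything outside 0-100, then convert percent to a 0.0-1.0 fraction
fraction = np.ma.masked_outside(toy_grid, 0, 100) / 100

# Masked cells are skipped by the sum
toy_total = np.sum(toy_cell_area * fraction)
print(toy_total)
```

The flagged cell contributes nothing to the total, which is why the function masks before summing.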

## Big list to a DataFrame

`total_area` is a NumPy masked array whose values are the total area of sea ice on each day in the dataset. Print out the value of `total_area`.

Construct a pandas DataFrame using `total_area` as the data, `time` as the index, and `['area']` as the columns. Assign it to the variable `df`.

`DataFrame` has a method `min()` that returns the minimum value of each column, and a method `idxmin()` that returns the index label where that minimum occurs. What is the lowest sea ice area found in this dataset, and on which date did it occur?
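
As a rough illustration of the pattern (not the answer for the real data), here is a sketch that builds a small DataFrame from made-up values and queries its minimum. `toy_area`, `toy_time`, and `toy_df` are hypothetical stand-ins for `total_area`, `time`, and `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for total_area and time
toy_area = np.array([5.2, 4.8, 5.0])
toy_time = pd.to_datetime(['2012-09-01', '2012-09-02', '2012-09-03'])

# Data, index, and column names are passed to the DataFrame constructor
toy_df = pd.DataFrame(toy_area, index=toy_time, columns=['area'])

lowest = toy_df.min()    # Series holding the minimum of each column
when = toy_df.idxmin()   # Series holding the index label of each minimum
print(lowest['area'], when['area'])
```

Both methods return a Series with one entry per column, so the single value is picked out with the column name.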

## DataFrame with a column for each year

Since we're looking at data for just one month across different years, it would be more useful to have the data from each year in a separate column, rather than having a single column for all of the data.

First, we need to create a hierarchical index, or a `MultiIndex` of years and days (since the month is constant across all the data).

The first step in creating a MultiIndex is to get a list of (year, day) tuples. We can access the year and the day in the `DatetimeIndex` with `df.index.year` and `df.index.day`. Evaluate each of those in one of the cells below.

The built-in Python function `zip()` pairs up the elements of two sequences to form tuples. Run the following cell for an example.

In [None]:
zip([1,2,3], ['a', 'b', 'c'])

In Python 3, `zip()` returns a lazy iterator; its values aren't evaluated until they're explicitly asked for. We can ask for all the values by converting it to a list with a call to `list()`.

In [None]:
list(zip([1,2,3], ['a', 'b', 'c']))

It works on `numpy` arrays as well.

In [None]:
list(zip(np.array([1,2,3]), np.array(['a', 'b', 'c'])))

Now, how can we use `df.index.year`, `df.index.day`, `zip()`, and `list()` to get a list of tuples of year-day pairs? Save your result to a variable called `year_days`, and then print out the result.
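
If you get stuck, the following sketch shows the same idea on a made-up index. `toy_index` and `toy_pairs` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical three-date index standing in for df.index
toy_index = pd.to_datetime(['2002-09-01', '2002-09-02', '2003-09-01'])

# .year and .day each give one value per date; zip pairs them up
toy_pairs = list(zip(toy_index.year, toy_index.day))
print(toy_pairs)
```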

Now that we have a list of tuples corresponding to our DataFrame's current index, we can use the function `pd.MultiIndex.from_tuples()` to get a hierarchical index. Just pass the `year_days` list of tuples to that function, and save the result to a variable called `new_index`.

The type of `new_index` should be a pandas `MultiIndex`. Verify that it is.
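
A minimal sketch of this step on made-up tuples is shown below. The names `toy_pairs` and `toy_multi` are hypothetical, and passing `names` is optional; it is included here only to label the two levels.

```python
import pandas as pd

# Hypothetical (year, day) tuples
toy_pairs = [(2002, 1), (2002, 2), (2003, 1), (2003, 2)]

# Build a two-level index; names is an optional label for each level
toy_multi = pd.MultiIndex.from_tuples(toy_pairs, names=['year', 'day'])

print(type(toy_multi))
print(toy_multi.nlevels)
```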

We are ready to update our DataFrame with the new index. This can be done by assigning a new value to `df.index`. However, this modifies the DataFrame in place, so let's first make a copy called `df2` and work with that.

In [None]:
df2 = df.copy()

Now update `df2.index` and evaluate `df2` to see the changes.

Pandas DataFrames have an `unstack()` method that creates columns from one level of a hierarchical index. Call it on `df2` and look at the result.

Here we have a DataFrame with a row for each year and columns for each day in September. Create a DataFrame with a column for each year and a row for each day of the month by adjusting the value of the `level` parameter, which describes the level of the index that becomes the column headers. Save this result to `df3`.
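
To illustrate how `level` changes the result, here is a sketch on a tiny made-up DataFrame with a (year, day) index. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical two-level (year, day) index with one 'area' column
toy_multi = pd.MultiIndex.from_tuples([(2002, 1), (2002, 2), (2003, 1), (2003, 2)])
toy_df = pd.DataFrame({'area': [5.0, 4.9, 4.5, 4.4]}, index=toy_multi)

# Default level=-1: the innermost level (day) becomes the columns
by_day = toy_df.unstack()

# level=0: the outermost level (year) becomes the columns
by_year = toy_df.unstack(level=0)

print(by_day)
print(by_year)
```

Note that the resulting columns are themselves a two-level index, with the original column name `area` on top.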

## Plots

Normally we can select columns as `pd.Series` objects with `df[colname]`. However, when our column names are numbers rather than strings, it is simpler to select by position with the DataFrame `iloc` indexer (the first colon here selects all the rows):

In [None]:
df3.iloc[:, [0]]

In [None]:
df3.iloc[:, [0,3]]

Let's plot the 2002 and 2012 data on the same graph. Tell `plt` to produce a 10-inch by 10-inch figure, use `DataFrame.iloc[]` to subset `df3` to the years we're interested in, and use `DataFrame.plot()` to render the graph.

We can see the more recent year is lower, but comparing values from just 2 years is not terribly informative. Let's plot how the September mean changes over the years.

First, how do we get the mean values we want?

In [None]:
df3.mean(axis=1)

The above expression returns a pandas Series with the mean for each day of the month. Call `mean()` with a different value for `axis` so that the index of the new Series is the year, not the day of the month. Save your result to the variable `mean`.
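
The effect of `axis` can be seen on a small made-up DataFrame with years as columns and days as rows. The names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical frame: one column per year, one row per day
toy = pd.DataFrame({2002: [5.0, 4.8], 2003: [4.4, 4.2]}, index=[1, 2])

per_row = toy.mean(axis=1)     # one mean per row, indexed by day
per_column = toy.mean(axis=0)  # one mean per column, indexed by year

print(per_row)
print(per_column)
```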

Note the `area` in the printed output of `mean`. It appears because the columns of `df3` are themselves a MultiIndex whose outer level is the original column name, `area`. Having that label around will just clutter things up; run the cell below to remove it.

In [None]:
mean = mean.area
mean

What is the type of `mean` now?

Since we've got a simple Series now, we can just call `plot()` on it to get a sense of how the mean sea ice area is changing over time. Do this now:

Let's add a trend line to this graph. First, let's get this `mean` Series into a DataFrame (since it's easy to plot multiple lines when they're just columns in a DataFrame).

In [None]:
df4 = pd.DataFrame(mean, columns=['mean'])
df4

NumPy has some powerful and convenient functions for calculating least-squares polynomial fits. To get the coefficients for a linear fit to the mean data:

In [None]:
coefficients = np.polyfit(x=mean.index, y=mean.values, deg=1)
coefficients

We can plug these coefficients into `np.poly1d()`, which returns a callable object representing the best fit line.

In [None]:
best_fit_fn = np.poly1d(coefficients)

Then we can call that function with whatever x-values we want to get the corresponding values on the trend line. Here, we want the values for the same years for which we plotted the mean, but we could also plug in different years to, for example, project the trend into the future.

In [None]:
best_fit_fn(mean.index)

Columns can be added to a DataFrame with the syntax `df['colname'] = values`.

Add a column to `df4` for the best fit line, and then call `plot()` on `df4` to see mean and best fit on one graph.
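
For reference, the whole fit-and-add-a-column pattern can be sketched on made-up yearly means as follows. The names `toy_mean`, `toy_df`, `trend`, and the column name `best_fit` are hypothetical, and the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly means indexed by year
toy_mean = pd.Series([5.0, 4.6, 4.4, 3.9], index=[2002, 2003, 2004, 2005])
toy_df = pd.DataFrame({'mean': toy_mean})

# Use the years as plain floats for the x-values of the fit
years = toy_mean.index.to_numpy(dtype=float)

# Degree-1 fit returns [slope, intercept]
slope, intercept = np.polyfit(x=years, y=toy_mean.values, deg=1)
trend = np.poly1d([slope, intercept])

# Assigning to a new column name adds the column
toy_df['best_fit'] = trend(years)
print(toy_df)
```

Calling `toy_df.plot()` afterwards would draw both columns as lines on the same axes.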