# Tutorial for MUR SST on AWS  

- Funding: Interagency Implementation and Advanced Concepts Team [IMPACT](https://earthdata.nasa.gov/esds/impact) for the Earth Science Data Systems (ESDS) program and AWS Public Dataset Program

Credits: Tutorial development
* [Dr. Chelle Gentemann](mailto:gentemann@faralloninstitute.org)    - Farallon Institute, USA
* [Dr. Rich Signell](mailto:rsignell@usgs.gov) - USGS
* [Dr. Ryan Abernathey](mailto:rpa@ldeo.columbia.edu)


Credits: Creation of the Zarr MUR SST dataset
* [Aimee Barciauskas](mailto:aimee@developmentseed.org) - Development Seed
* [Dr. Rich Signell](mailto:rsignell@usgs.gov) - USGS
* [Dr. Chelle Gentemann](mailto:gentemann@faralloninstitute.org)    - Farallon Institute, USA
* [Joseph Flasher](mailto:jflasher@amazon.com) - AWS
-------------

## Please note that this is global, 1 km, daily data.  This is a very large dataset, and the analyses below can take 5-10 minutes

## [MUR SST](https://podaac.jpl.nasa.gov/Multi-scale_Ultra-high_Resolution_MUR-SST) [AWS Public dataset program](https://registry.opendata.aws/mur/) 

### Access the MUR SST which is in an s3 bucket.  
### This Pangeo binder is faster when run on AWS.  

![image](https://podaac.jpl.nasa.gov/Podaac/thumbnails/MUR-JPL-L4-GLOB-v4.1.jpg)

This code is an example of how to read from an S3 bucket.  

Right now (2/16/2020) this takes ~1 min on AWS and ~2 min on Google Cloud. There are two issues here, and we are working to solve both.  
1. In our Zarr datastore the time coordinate is chunked.  We should have this fixed by 3/1/2020.
1. Some shortcomings in the s3fs and zarr libraries have been identified.  To address them, GitHub issues were raised with the developers [here](https://github.com/dask/s3fs/issues/285) and [here](https://github.com/zarr-developers/zarr-python/issues/536)
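
A little arithmetic shows why chunking matters for access speed. The sketch below counts how many chunks (and so, roughly, how many S3 requests) a read touches. The array sizes and both chunk layouts are illustrative assumptions, not the actual layout of the `mur-sst` store.

```python
import math

def chunks_touched(selection, chunks):
    """Count the chunks a contiguous selection overlaps.

    selection: number of elements read along each axis,
               assumed to start on a chunk boundary
    chunks:    chunk length along each axis
    """
    return math.prod(math.ceil(n / c) for n, c in zip(selection, chunks))

# Assumed sizes: ~18 years of daily fields on a 0.01 degree grid
n_time, n_lat, n_lon = 6500, 18000, 36000

# Reading the whole 1-D time coordinate
coord_chunked = chunks_touched((n_time,), (5,))       # many small reads
coord_single = chunks_touched((n_time,), (n_time,))   # one read
print("time coordinate:", coord_chunked, "vs", coord_single)

# Two hypothetical layouts for the 3-D data, ordered (time, lat, lon)
map_friendly = (5, 1800, 3600)      # few time steps, large spatial tiles
series_friendly = (6500, 100, 100)  # all time steps, small spatial tiles

point_series = (n_time, 1, 1)   # full time series at one grid point
global_map = (1, n_lat, n_lon)  # one daily global field

for name, layout in [("map-friendly", map_friendly),
                     ("series-friendly", series_friendly)]:
    print(name,
          "| point series:", chunks_touched(point_series, layout),
          "| global map:", chunks_touched(global_map, layout))
```

No single layout is best for every access pattern, which is why the chunking of a store is worth checking before planning an analysis.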


# Structure of this tutorial

1. Opening data
2. Data exploration
3. Data plotting

# 1. Opening data
-------------------

## Import python packages

It is convenient to turn off warnings and set the xarray display options.

In [None]:
import warnings
import numpy as np
import pandas as pd
import xarray as xr
import fsspec

warnings.simplefilter('ignore') # filter some warning messages
xr.set_options(display_style="html")  #display dataset nicely 

### Start a cluster, a group of computers that will work together.

(A cluster is the key to big data analysis in the cloud.)

- This will set up a [dask kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html) cluster for your analysis and give you a path that you can paste into the top of the Dask dashboard to visualize parts of your cluster.  
- You don't need to paste the link below into the Dask dashboard for this to work, but it will help you visualize progress.
- The cell below lets the cluster scale adaptively between 1 and 20 workers, which is enough to start with during the tutorial; you can raise `maximum` later to speed things up

In [None]:
from dask_kubernetes import KubeCluster
from dask.distributed import Client

In [None]:
cluster = KubeCluster()
cluster.adapt(minimum=1, maximum=20, interval='2s', wait_count=3)
client = Client(cluster)
cluster

**☝️ Don’t forget to click the link above or copy it to the Dask dashboard ![images.png](attachment:images.png) on the left to view the scheduler dashboard!**
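
`KubeCluster` needs a Kubernetes deployment, as provided by the Pangeo binder. When trying the notebook elsewhere, a local Dask cluster is one possible stand-in. The sketch below is an assumption about such a setup, not part of the original workflow; the names `local_cluster` and `local_client` and the worker counts are arbitrary.

```python
from dask.distributed import Client, LocalCluster

# Small in-process cluster; the dashboard is disabled to keep the example minimal
local_cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                             processes=False, dashboard_address=None)
local_client = Client(local_cluster)

# Submit a trivial task to confirm that the workers respond
future = local_client.submit(sum, [1, 2, 3])
result = future.result()
print(result)

# Release the resources when done
local_client.close()
local_cluster.close()
```

A local cluster will be far slower than 20 cloud workers on the full dataset, so it is best suited to small spatial or temporal subsets.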

### Initialize Dataset

Here we load the dataset from the zarr store. Note that this very large dataset initializes nearly instantly, and we can see the full list of variables and coordinates.

### Examine Metadata

For those unfamiliar with this dataset, the variable metadata is very helpful for understanding what the variables actually represent.
Printing the dataset shows the dimensions, coordinates, and data variables, with clickable icons at the end of each row that reveal more metadata and the size.


In [None]:
%%time

file_location = 's3://mur-sst/zarr'

ds_sst = xr.open_zarr(fsspec.get_mapper(file_location, anon=True),consolidated=True)

ds_sst
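
The same metadata can also be read programmatically. The sketch below uses a tiny synthetic stand-in (the name `demo`, its values, and its attributes are made up for illustration) so that it runs without network access; the same attribute names apply to `ds_sst`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the MUR dataset (values and attributes are made up)
demo = xr.Dataset(
    {"analysed_sst": (("time", "lat", "lon"),
                      np.full((2, 3, 4), 290.0, dtype="float32"),
                      {"units": "kelvin",
                       "long_name": "analysed sea surface temperature"})},
    coords={"time": pd.date_range("2015-10-01", periods=2),
            "lat": [-1.0, 0.0, 1.0],
            "lon": [10.0, 11.0, 12.0, 13.0]},
)

var = demo["analysed_sst"]
print(var.dims)            # dimension names, in storage order
print(dict(demo.sizes))    # length of each dimension
print(var.attrs["units"])  # variable-level metadata
print(var.nbytes)          # size in bytes if loaded into memory
```

Checking `nbytes` before calling `.load()` is a cheap way to avoid pulling more data into memory than intended.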

# 2.  Explore the data

#### Let's explore the data

- look at all the SST data
- look at the SST data masked to only ocean and ice-free data

- With all data, it is important to explore it and understand what it contains before doing an analysis.
- The ice mask used by MUR SST is from NSIDC and is based on satellite passive microwave estimates of sea ice concentration
- The satellite data isn't available near land, so there is no estimate of sea ice concentration near land
- For this data, it means that there are some erroneous SSTs near land that are likely ice, and this is something to be aware of.

In [None]:
sst = ds_sst['analysed_sst']

cond = (ds_sst.mask==1) & ((ds_sst.sea_ice_fraction<.15) | np.isnan(ds_sst.sea_ice_fraction))

sst_masked = ds_sst['analysed_sst'].where(cond)

sst_masked
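
To see what the condition keeps and drops, here is a sketch on five made-up points, one per case. The variable names and values are hypothetical; the logic mirrors the cell above, where a mask value of 1 is treated as ocean.

```python
import numpy as np
import xarray as xr

# Made-up values along a single dimension, one case per point
mask = xr.DataArray([1, 1, 1, 2, 1], dims="x")                # 1 = ocean, anything else is excluded
ice = xr.DataArray([0.0, 0.5, np.nan, 0.0, 0.10], dims="x")   # sea ice fraction
sst_demo = xr.DataArray([290.0, 272.0, 285.0, 280.0, 275.0], dims="x")

# Keep ocean points whose ice fraction is below 15% or not reported at all
cond_demo = (mask == 1) & ((ice < 0.15) | np.isnan(ice))
masked_demo = sst_demo.where(cond_demo)

print(masked_demo.values)
```

The third point is kept even though its ice fraction is missing. That is exactly the near-land situation described above, where icy water can slip through the mask.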

### Using ``.groupby`` and ``.resample``

Xarray has many useful built-in methods, such as [.resample](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.resample.html#xarray-dataset-resample), which can upsample or downsample data, and [.mean](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html#xarray-dataarray-mean). Here we use these to calculate a climatology and anomaly.

#### Create a daily SST anomaly dataset
- Calculate the daily climatology using ``.groupby``
- Calculate the anomaly 

#### Create a monthly SST anomaly dataset
- First create a monthly version of the dataset using ``.resample``.  Two useful arguments to the ``.mean`` that follows it: ``keep_attrs=True`` keeps the metadata, and ``skipna=False`` propagates missing values, so only points with data at every time step get a mean
- Calculate the monthly climatology using ``.groupby``
- Calculate the anomaly 
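
The effect of ``skipna`` and ``keep_attrs`` is easiest to see on a small example. The sketch below uses made-up values for two grid points, one of which has a missing day.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=4)
# Two grid points: "a" has a gap on day 2, "b" is complete (made-up values)
da = xr.DataArray(
    [[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0], [5.0, 40.0]],
    dims=("time", "point"),
    coords={"time": times, "point": ["a", "b"]},
    attrs={"units": "kelvin"},
)

strict = da.mean("time", skipna=False, keep_attrs=True)  # NaN if any day is missing
lenient = da.mean("time", skipna=True)                   # averages the days that exist

print(strict.values, strict.attrs)
print(lenient.values)
```

With ``skipna=False`` point "a" has no mean at all, while ``skipna=True`` averages its three available days. The strict form avoids means built from different numbers of samples at different locations.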


In [None]:
%%time
#create a daily climatology and anomaly
climatology_mean = sst_masked.groupby('time.dayofyear').mean('time',keep_attrs=True,skipna=False)

sst_anomaly = sst_masked.groupby('time.dayofyear')-climatology_mean  #subtract the day-of-year climatology to remove the seasonal cycle

#create a monthly dataset, climatology, and anomaly
sst_monthly = sst_masked.resample(time='1MS').mean('time',keep_attrs=True,skipna=False)

climatology_mean_monthly = sst_monthly.groupby('time.month').mean('time',keep_attrs=True,skipna=False)

sst_anomaly_monthly = sst_monthly.groupby('time.month')-climatology_mean_monthly  #subtract the monthly climatology to remove the seasonal cycle

sst_anomaly
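
What the climatology subtraction removes, and what it leaves behind, can be checked on a synthetic series. The sketch below builds three years of made-up daily data from a seasonal cycle plus a different offset each year; all names and numbers are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three non-leap years of made-up daily data
times = pd.date_range("2017-01-01", "2019-12-31", freq="D")
doy = times.dayofyear.to_numpy()
year = times.year.to_numpy()

seasonal = 5.0 * np.sin(2 * np.pi * doy / 365.0)  # repeats every year
offset = (year - 2018).astype(float)              # -1, 0, +1 for the three years
series = xr.DataArray(seasonal + offset, dims="time", coords={"time": times})

# Same pattern as above: day-of-year climatology, then anomaly
clim = series.groupby("time.dayofyear").mean("time")
anom = series.groupby("time.dayofyear") - clim

anom_2017 = anom.sel(time="2017-07-01").item()
anom_2019 = anom.sel(time="2019-07-01").item()
print(anom_2017, anom_2019)
```

The seasonal cycle cancels, and the anomaly retains the year-to-year differences (about -1 in 2017 and +1 in 2019 here), which is the signal of interest in the plots that follow.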

# 3. Data plotting

``xarray`` plotting functions rely on matplotlib internally, but they make use of all available metadata to make the plotting operations more intuitive and interpretable. More plotting examples are given [here](http://xarray.pydata.org/en/stable/plotting.html)

### Here we use ``holoviews`` and ``hvplot`` for interactive graphics

In [None]:
import hvplot.xarray
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')

### Plot the SST timeseries in the Pacific Blob Region
Plot both the daily and monthly data

In [None]:
daily = sst.sel(lon=-140, lat=53).load()

monthly = sst_monthly.sel(lon=-140, lat=53).load()

daily.hvplot(grid=True) * monthly.hvplot(grid=True)
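
Note that `.sel` with plain values looks for an exact coordinate label. The sketch below, on a coarse made-up grid, shows the difference between an exact lookup and `method="nearest"`, which may be useful for locations that do not fall exactly on a grid point.

```python
import numpy as np
import xarray as xr

# Coarse made-up grid; MUR itself is far finer
lat = np.arange(50.0, 56.0, 1.0)      # 50, 51, ..., 55
lon = np.arange(-142.0, -137.0, 1.0)  # -142, -141, ..., -138
values = np.arange(lat.size * lon.size, dtype=float).reshape(lat.size, lon.size)
grid = xr.DataArray(values, dims=("lat", "lon"), coords={"lat": lat, "lon": lon})

exact = grid.sel(lon=-140, lat=53)                          # label exists on the grid
nearest = grid.sel(lon=-140.3, lat=53.2, method="nearest")  # snaps to the closest cell

try:
    grid.sel(lon=-140.3, lat=53.2)  # no such label on the grid
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(exact.item(), nearest.item(), missing_label_raises)
```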

### Plot the SST anomaly timeseries in the Pacific Blob Region


In [None]:
daily = sst_anomaly.sel(lon=-140, lat=53).drop('dayofyear').load()

monthly = sst_anomaly_monthly.sel(lon=-140, lat=53).drop('month').load()

daily.hvplot(grid=True) * monthly.hvplot(grid=True)

### Plot a global image of SST on 10/1/2015

In [None]:
%%time
sst_dy = sst.sel(time='2015-10-01').load()

sst_dy.hvplot.quadmesh(x='lon', y='lat', 
                       geo=True, 
                       rasterize=True, 
                       cmap='rainbow', 
                       tiles='EsriImagery')

#### Check how the masked data looks

In [None]:
%%time
subset = sst_masked.sel(time='2015-10-01',lat=slice(-70,0)).load()

subset.hvplot.quadmesh(x='lon', y='lat', 
                       geo=True, 
                       rasterize=True, 
                       cmap='rainbow', 
                       tiles='EsriImagery')

### Plot the anomaly on 10/1/2015


In [None]:
%%time
anom_2015 = sst_anomaly.sel(time='2015-10-01',lat=slice(-60,70)).drop('dayofyear').load()

anom_2015.hvplot.quadmesh(x='lon', y='lat', 
                          geo=True, 
                          rasterize=True, 
                          clim=(-2,2), 
                          cmap='seismic', 
                          tiles='EsriImagery')

### Subset the El Niño/La Niña Region:

In [None]:
%%time
sst_elnino = sst.sel(lon=slice(-180,-70), lat=slice(-25,25))

### Difference the monthly mean temperature fields for Jan 2016 (El Niño) and Jan 2014 (normal)

In [None]:
%%time
# label-based slices include both end points, so stop on Jan 31 to keep only January
sst_jan2016 = sst_elnino.sel(time=slice('2016-01-01','2016-01-31')).mean(dim='time')

sst_jan2014 = sst_elnino.sel(time=slice('2014-01-01','2014-01-31')).mean(dim='time')
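
Label-based time slices in xarray include both end points, which matters when averaging over a calendar month. The sketch below counts the days returned by three selections on a made-up daily series; the 09:00 time stamp is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up daily series stamped at 09:00, covering January and February 2016
times = pd.date_range("2016-01-01 09:00", "2016-02-29 09:00", freq="D")
daily_demo = xr.DataArray(np.ones(times.size), dims="time", coords={"time": times})

n_through_feb1 = daily_demo.sel(time=slice("2016-01-01", "2016-02-01")).sizes["time"]
n_january = daily_demo.sel(time=slice("2016-01-01", "2016-01-31")).sizes["time"]
n_month_string = daily_demo.sel(time="2016-01").sizes["time"]

print(n_through_feb1, n_january, n_month_string)
```

A slice that ends on the first of the next month returns 32 days, while ending on the 31st or selecting with the string `"2016-01"` returns the 31 days of January.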

In [None]:
%%time
sst_diff = (sst_jan2016 - sst_jan2014).load()

sst_diff.hvplot.quadmesh(x='lon', y='lat', 
                         geo=True,                 
                         rasterize=True, 
                         cmap='rainbow', 
                         tiles='EsriImagery')

### Plotting on maps

For plotting on maps, we rely on the excellent [cartopy](http://scitools.org.uk/cartopy/docs/latest/index.html) library.

In [None]:
import cartopy.crs as ccrs

### In cartopy you need to define the map projection you want to plot.  

- Common ones are Orthographic and PlateCarree.
- You can add coastlines and gridlines to the axes as well.

In [None]:
sst_dy.hvplot.quadmesh(x='lon', y='lat', 
                       geo=True,         
                       rasterize=True, 
                       cmap='rainbow', 
                       projection=ccrs.Orthographic(-80, 35),
                       coastline='110m')

## Please close the cluster

In [None]:
client.close()
cluster.close()

## A nice cartopy tutorial is [here](http://earthpy.org/tag/visualization.html)

# xarray can do more!

* concatenation
* open network located files with OPeNDAP
* import and export Pandas DataFrames
* .nc dump to 
* groupby_bins
* resampling and reduction

For more details, read this blog post: http://continuum.io/blog/xray-dask


## Where can I find more info?

### For more information about xarray

- Read the [online documentation](http://xarray.pydata.org/)
- Ask questions on [StackOverflow](http://stackoverflow.com/questions/tagged/python-xarray)
- View the source code and file bug reports on [GitHub](http://github.com/pydata/xarray/)

### For more on doing data analysis with Python:

- Thomas Wiecki, [A modern guide to getting started with Data Science and Python](http://twiecki.github.io/blog/2014/11/18/python-for-data-science/)
- Wes McKinney, [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (book)

### Packages building on xarray for the geophysical sciences


- [eofs](https://github.com/ajdawson/eofs): empirical orthogonal functions by Andrew Dawson
- [infinite-diff](https://github.com/spencerahill/infinite-diff) by Spencer Hill 
- [aospy](https://github.com/spencerahill/aospy) by Spencer Hill and Spencer Clark
- [regionmask](https://github.com/mathause/regionmask) by Mathias Hauser
- [salem](https://github.com/fmaussion/salem) by Fabien Maussion

Resources for teaching and learning xarray in geosciences:
- [Fabien's teaching repo](https://github.com/fmaussion/teaching): courses that combine teaching climatology and xarray
