# Alignment and Preprocessing
Once the data is available via `intake`, as detailed in the [Data_Ingestion_with_Intake](./02_Data_Ingestion_with_Intake.ipynb) user guide, the next step is to reshape and align the data across sources so that it can be consumed by the machine learning pipeline, which is covered in the next user guide, [Machine_Learning](./04_Machine_Learning.ipynb).

We'll be aggregating data across several years using the same Landsat images used in [Walker_Lake](../Walker_Lake.ipynb). See that notebook for more work on calculating the difference between the water levels over time.

In [None]:
import intake
import numpy as np
import xarray as xr

import hvplot.xarray
import holoviews as hv
from holoviews.operation.datashader import rasterize

import cartopy.crs as ccrs
import geoviews as gv

hv.extension('bokeh', width=80)

## Recap: Loading data

In [None]:
cat = intake.open_catalog('../catalog.yml')
l5_da = cat.l5().read_chunked()
l5_da

In [None]:
l8_da = cat.l8().read_chunked()
l8_da

We can use the EPSG value shown above under the ``crs`` key to create a cartopy coordinate reference system that we will use later in this notebook:

In [None]:
crs=ccrs.epsg(32611)

## Preprocessing
The first step in processing the data is to remove the missing values. In this case the `xarray` objects self-report the value used for missing data in their `nodatavals` attribute. We can use this information to set the missing values to `NaN`.

In [None]:
l5_da = l5_da.where(l5_da > l5_da.nodatavals[0])
l8_da = l8_da.where(l8_da > l8_da.nodatavals[0])
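As a minimal sketch of how `where` masks missing values, the toy example below builds a small array with an assumed nodata value of -9999. The array, its values, and the variable names are made up for illustration and are not part of the Landsat data. It masks with `!=`, whereas the cell above uses `>`, which gives the same result when the nodata value is lower than every valid value.

```python
import numpy as np
import xarray as xr

# Hypothetical 2x3 "band" in which -9999 marks missing pixels (assumed nodata value).
nodata = -9999
band = xr.DataArray(
    np.array([[120.0, nodata, 98.0], [nodata, 110.0, 105.0]]),
    dims=("y", "x"),
)

# where() keeps values for which the condition holds and fills the rest with NaN.
masked = band.where(band != nodata)

n_missing = int(masked.isnull().sum())  # number of pixels that became NaN
lowest = float(masked.min())            # min() skips NaN by default
print(n_missing, lowest)
```

Because reductions such as `min` skip `NaN` by default, the nodata value no longer influences the result.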

We can make sure that no more -9999s show up in the data by calculating the minimum value in each `DataArray`:

In [None]:
l5_da.min().compute()

In [None]:
l8_da.min().compute()

**NOTE:** These operations take a non-trivial amount of time because they require that the data actually be loaded. 

## Compute NDVI

Now we will calculate NDVI for each of these image sets and persist the output in memory for speedy calculations later.

In [None]:
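# Parenthesize the whole ratio so that persist() applies to the NDVI result, not only the denominator
NDVI_1988 = ((l5_da.sel(band=5) - l5_da.sel(band=4)) / (l5_da.sel(band=5) + l5_da.sel(band=4))).persist()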
NDVI_1988

In [None]:
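# Parenthesize the whole ratio so that persist() applies to the NDVI result, not only the denominator
NDVI_2017 = ((l8_da.sel(band=5) - l8_da.sel(band=4)) / (l8_da.sel(band=5) + l8_da.sel(band=4))).persist()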
NDVI_2017
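The normalized-difference calculation itself can be sketched with plain NumPy. The helper `normalized_difference` and the toy `nir` and `red` arrays below are hypothetical names and values used only for illustration. Which band numbers hold near-infrared and red depends on the sensor, so this sketch makes no claim about band selection.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN wherever the denominator is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.full(denom.shape, np.nan)
    # Divide only where the denominator is non-zero; other cells stay NaN.
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

# Hypothetical reflectance values for three pixels.
nir = np.array([0.5, 0.4, 0.0])
red = np.array([0.1, 0.4, 0.0])
ndvi = normalized_difference(nir, red)
print(ndvi)
```

For non-negative inputs the result always falls between -1 and 1, which is a handy sanity check on real data.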

**NOTE:** Since we did these operations on individual bands, we lost some of the helpful metadata on our `xarray` objects. We can get that metadata back by copying the structure of one of the original array bands and creating a new array with that structure that holds the NDVI data.

In [None]:
NDVI_1988 = l5_da.sel(band=4).drop('band').copy(data=NDVI_1988)
NDVI_2017 = l8_da.sel(band=4).drop('band').copy(data=NDVI_2017)
NDVI_2017

## Aligning the data

These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. While they have the same resolution (30m), they have different numbers of grid cells and different x and y offsets (transform).

In [None]:
# Compare by value: `is not` would test object identity, which is True for any two distinct objects
NDVI_1988.transform != NDVI_2017.transform

#### Treat year as name

When these data are combined into one dataset, the shape of the dataset grows to be the union of the coordinates on each array.

In [None]:
ds = xr.Dataset({'1988': NDVI_1988, '2017': NDVI_2017})
ds
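To see the union behavior on something small, here is a sketch with two hypothetical 1-D arrays (`a` and `b`, not part of the Landsat data) whose `x` coordinates only partly overlap. Cells that exist in only one input are filled with `NaN` in the other.

```python
import numpy as np
import xarray as xr

# Two toy arrays on 30 m grids that are offset from each other by one cell.
a = xr.DataArray([1.0, 2.0, 3.0], coords={"x": [0, 30, 60]}, dims="x")
b = xr.DataArray([10.0, 20.0, 30.0], coords={"x": [30, 60, 90]}, dims="x")

# The Dataset constructor aligns its variables on the union of their coordinates.
toy_ds = xr.Dataset({"a": a, "b": b})

x_union = list(toy_ds.x.values)           # all x values from both inputs
diff = (toy_ds["a"] - toy_ds["b"]).values  # NaN wherever either input is missing
print(x_union, diff)
```

Arithmetic between the two variables is only defined where both have data, which is why the difference is `NaN` at the two ends.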

We can quickly subset to one point, such as the center of Walker Lake: 38.6942° N, 118.7081° W. First convert the lat/lon (PlateCarree) to the coordinate reference system of the data (we assigned this to the `crs` variable above).

In [None]:
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())

Then we'll select the data point nearest to this point. 

In [None]:
ds.sel(x=x_center, y=y_center, method='nearest')

Or we can subset the area around this central point by getting a slice of the data with a certain buffer.

In [None]:
buffer = 1.5e4

In [None]:
subset = ds.sel(x=slice(x_center-buffer, x_center+buffer), y=slice(y_center-buffer, y_center+buffer))
subset

**NOTE:** If the coordinates are in decreasing order, the slice needs to run from center+buffer to center-buffer. Here the coordinates are increasing, so the slice runs from center-buffer to center+buffer.
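The ordering rule in the note can be checked on a toy array. The arrays `y_desc` and `y_asc` below are hypothetical and exist only to show how label-based slicing follows the order of the index.

```python
import numpy as np
import xarray as xr

# Toy array whose y coordinate decreases, as is common for north-up rasters.
y_desc = xr.DataArray(np.arange(5), coords={"y": [40, 30, 20, 10, 0]}, dims="y")
# The same data sorted so that y increases.
y_asc = y_desc.sortby("y")

# Slice bounds must follow the order of the index.
desc_sel = y_desc.sel(y=slice(30, 10))  # high -> low for a decreasing index
asc_sel = y_asc.sel(y=slice(10, 30))    # low -> high for an increasing index

# Bounds given in the wrong order select nothing rather than raising an error.
wrong = y_desc.sel(y=slice(10, 30))
print(desc_sel.y.values, asc_sel.y.values, wrong.size)
```

An empty result after slicing is therefore a useful hint that the bounds are reversed relative to the coordinate order.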

Now that the data are on the same coordinate system, when these data are visualized, the plots can be linked. 

In [None]:
%%time
%%opts Image [width=500 height=500] (cmap='viridis')

NDVI_1988_p = subset['1988'].hvplot(clim=(-3, 1), crs=subset['1988'].crs).relabel('1988')
NDVI_2017_p = subset['2017'].hvplot(clim=(-3, 1), crs=subset['2017'].crs).relabel('2017')

display(hv.Layout(NDVI_1988_p + NDVI_2017_p).options(shared_axes=True))

In [None]:
%%opts Image [width=600 height=500] (cmap='coolwarm')

(subset['1988'] - subset['2017']).hvplot(crs=crs, clim=(-2, 2)).relabel('Difference in NDVI')

#### Treat year as coords
Another way to join the data from the two different years is to treat the years as coordinates. This approach is more logically sound, but some global attribute data can be lost.

In [None]:
NDVI_by_year = xr.concat([NDVI_1988, NDVI_2017], dim=xr.DataArray([1988, 2017], dims=('year'), name='year'))
NDVI_by_year

**NOTE:** The `transform` is no longer present on the data because it didn't match for both years. In many cases this information isn't used, so it doesn't really matter. If it *does* matter, then you can add another coordinate on the year dimension to store this data.

In [None]:
set(NDVI_1988.attrs) - set(NDVI_by_year.attrs)

The transform is a special case since it is a tuple, and in `xarray`, tuples can't be used as items in arrays. So we will wrap it in a string for use later. 

In [None]:
transform = xr.DataArray([str(NDVI_1988.attrs['transform']), str(NDVI_2017.attrs['transform'])], dims=('year'))

Now with this extra set of coordinates, we can make a more complete year dimension to use in our concatenation.

In [None]:
year = xr.DataArray([1988, 2017], dims=('year'), coords={'transform': transform}, name='year')

In [None]:
NDVI_by_year = xr.concat([NDVI_1988, NDVI_2017], dim=year)
NDVI_by_year

**NOTE:** Now the transform information is persisted and indexed by year. 
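A related sketch, on toy data, attaches the stringified transforms explicitly with `assign_coords` after concatenating. This is a swapped-in alternative to passing the coordinates through the `dim` argument, chosen here because it does not rely on `concat` propagating extra coordinates. The arrays `a` and `b` and their three-element "transform" tuples are hypothetical stand-ins for the real metadata.

```python
import ast
import numpy as np
import xarray as xr

# Toy arrays standing in for the two years; the transform tuples are made up.
a = xr.DataArray(np.zeros((2, 2)), dims=("y", "x"), attrs={"transform": (30.0, 0.0, 100.0)})
b = xr.DataArray(np.ones((2, 2)), dims=("y", "x"), attrs={"transform": (30.0, 0.0, 115.0)})

# Concatenate along a new "year" dimension.
years = xr.DataArray([1988, 2017], dims="year", name="year")
stacked = xr.concat([a, b], dim=years)

# Store each transform as a string coordinate indexed by year.
stacked = stacked.assign_coords(
    transform=("year", [str(a.attrs["transform"]), str(b.attrs["transform"])])
)

# Look up the transform for one year and turn the string back into a tuple.
t_2017 = stacked.sel(year=2017).transform.item()
recovered = ast.literal_eval(t_2017)
print(recovered)
```

`ast.literal_eval` only works here because the string is the `repr` of a plain tuple of numbers; other transform types would need their own parsing.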

We'll do the same subsetting at a point and over an area to demonstrate how years as coords differ from years as names.

In [None]:
NDVI_by_year.sel(x=x_center, y=y_center, method='nearest')

In [None]:
subset = NDVI_by_year.sel(x=slice(x_center-buffer, x_center+buffer), y=slice(y_center-buffer, y_center+buffer))

**NOTE:** It is simpler to define a series of subplots where the variable that is being iterated over is a coordinate.


In [None]:
%%opts Image [width=500 height=500] (cmap='viridis')

p = subset.hvplot('x','y', col='year', crs=subset.crs, shared_axes=True)
print(p)

Since the output is a gridspace, we can select a subplot from the output and alter it in place. For instance, we only really need the colorbar on the second subplot, so we can turn it off for the first one.

In [None]:
p[1988] = p[1988].options(colorbar=False)

In [None]:
p

## Regridding

In the case of the images that we have loaded so far, all the data have the same resolution (30m). In the section above we saw that it is straightforward to align these datasets even though they cover slightly different areas. In some cases, though, the resolution of the images is different. This is the case for band 8 (the panchromatic band), which has a resolution of 15m. This section will demonstrate *aggregation (down-sampling)* and *interpolation (up-sampling)*. In practice, aggregation is much more common.

We'll be using `datashader` operations `rasterize` and `regrid` to handle our multidimensional regridding.

In [None]:
from holoviews.operation.datashader import regrid, rasterize
from datashader import transfer_functions as tf, reductions as rd

### Aggregation

We'll define a new resolution that is visibly different from 30m.

In [None]:
res = 1e3

Just to make things pretty and as a sanity check, let's turn the colorbar back on for both plots and set the width of the first plot slightly higher to account for the extra axis that is being portrayed.

In [None]:
p[1988] = p[1988].options(colorbar=True, width=370, height=300)
p[2017] = p[2017].options(colorbar=True, width=310, height=300)

In [None]:
p_1000 = regrid(p, x_sampling=res, y_sampling=res)
p_1000

Notice how fast it was to generate these plots. The default aggregator is the mean, but there are other ways to aggregate. Here are some:

In [None]:
hv.Layout([
    regrid(p, x_sampling=res, y_sampling=res, aggregator=agg).relabel(f'Aggregated by {label}')
    for label, agg in ({
        'std': rd.std(), 
        'maximum': rd.max(), 
        'minimum': rd.min(), 
        'mode': rd.mode()
    }.items())
]).cols(2)
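To make concrete what these aggregators compute, here is a NumPy analogue on a toy 4x4 grid. It is only a sketch of block-wise down-sampling and is not how `datashader` implements regridding. The helper `block_reduce` and the grid values are hypothetical.

```python
import numpy as np

def block_reduce(grid, factor, func=np.nanmean):
    """Aggregate non-overlapping factor x factor blocks of a 2-D grid with func."""
    rows, cols = grid.shape
    # Trim edges that do not fill a complete block.
    rows -= rows % factor
    cols -= cols % factor
    blocks = grid[:rows, :cols].reshape(rows // factor, factor, cols // factor, factor)
    # Reduce over the two within-block axes.
    return func(blocks, axis=(1, 3))

grid = np.arange(16, dtype=float).reshape(4, 4)

mean_2x = block_reduce(grid, 2)             # mean of each 2x2 block
max_2x = block_reduce(grid, 2, np.nanmax)   # maximum of each 2x2 block
std_2x = block_reduce(grid, 2, np.nanstd)   # standard deviation of each 2x2 block
print(mean_2x, max_2x, std_2x, sep="\n")
```

Each output cell summarizes the four input cells it covers, so a high standard deviation marks blocks where the underlying values change sharply.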

Let's use a finer sampling (100m instead of 1000m) and look at that regrid again.

In [None]:
regrid(p, x_sampling=res/10, y_sampling=res/10, aggregator=rd.std())

This view could certainly help us pick out the bounds of the lake at least in 2017.

### Similar workflow in `xarray`

We can accomplish a similar thing in `xarray` by grouping the values into bins based on the desired resolution and taking the mean of each of those bins.

In [None]:
res = 1e3

In [None]:
x = np.arange(subset.x.min(), subset.x.max(), res)
y = np.arange(subset.y.min(), subset.y.max(), res)

We'll use the left edge of each bin as its label for now.

In [None]:
da_1000 = (subset
    .groupby_bins('x', x, labels=x[:-1]).mean(dim='x')
    .groupby_bins('y', y, labels=y[:-1]).mean(dim='y')
    .rename(x_bins='x',y_bins='y')
)
da_1000
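The same binning pattern is easier to follow on a 1-D toy array. In the sketch below, `da`, `bins`, and the values are hypothetical. Six 10 m cells are grouped into three 20 m bins labeled by their left edges.

```python
import numpy as np
import xarray as xr

# Six toy cells with centers at x = 5, 15, ..., 55.
da = xr.DataArray(np.arange(6.0), coords={"x": np.arange(6.0) * 10 + 5}, dims="x")

# Bin edges at the coarser resolution; bins are (0, 20], (20, 40], (40, 60].
bins = np.array([0, 20, 40, 60])

coarse = (
    da.groupby_bins("x", bins, labels=bins[:-1])  # label each bin by its left edge
    .mean(dim="x")                                # average the cells in each bin
    .rename(x_bins="x")                           # restore the original dimension name
)
print(coarse.values)
```

Each bin holds two of the original cells, so every output value is the mean of a neighboring pair.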

We can compare this to the results from using datashader regridding by getting the data from p_1000 and subtracting the nearest data from da_1000.

In [None]:
def get_data(p):
    df = p.dframe()
    pivotted = df.pivot(index='y', columns='x', values='value')
    stacked = pivotted.stack()
    return xr.DataArray.from_series(stacked)

In [None]:
(da_1000.sel(year=2017).reindex(get_data(p_1000[2017]).indexes, method='nearest') - get_data(p_1000[2017])).hvplot('x','y')

### Handling band with different resolution

First we need to load the band 8 data. We'll grab it straight from Google Cloud Storage:

In [None]:
da_8 = cat.google_landsat_band(pid='LC08_L1TP_042033_20171022_20171107_01_T1', path=42, row=33, band=8).to_dask()

In [None]:
subset_8 = da_8.sel(x=slice(x_center-buffer, x_center+buffer), y=slice(y_center+buffer, y_center-buffer))
subset_8 = subset_8.drop('band').squeeze().persist()
subset_8

In [None]:
p_8 = subset_8.hvplot('x', 'y', width=500, height=400)
p_8

Let's define a little helper function to determine the resolution of plots:

In [None]:
def get_res(p, x='x', y='y'):
    df = p.dframe()
    pivotted = df.pivot(index=y, columns=x, values='value')
    stacked = pivotted.stack()
    da = xr.DataArray.from_series(stacked)
    print(f'{x} res:', np.unique(np.around(da[x].diff(x), 2)))
    print(f'{y} res:', np.unique(np.around(da[y].diff(y), 2)))

In [None]:
get_res(p_8)

We can use `xarray` to merge this band with the rest of our data and we will get a union of all the coordinates. In this case the shape expands from (1000, 1000) to (2000, 2000).

In [None]:
ds = xr.merge([{'NDVI': subset, '2017_band_8': subset_8}])
ds

All of our data are properly represented, but methods like selecting the nearest value to a certain x, y might yield NaNs:

In [None]:
ds.sel(x=x_center, y=y_center, method='nearest').compute()
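The reason for the NaNs can be reproduced with two toy 1-D grids of different resolution. In this sketch `coarse` and `fine` are hypothetical arrays on grids that share no coordinate values, and `join="outer"` is passed explicitly to request the union of coordinates.

```python
import numpy as np
import xarray as xr

# A toy 30 m grid and a toy 15 m grid whose cell centers never coincide.
coarse = xr.DataArray([1.0, 2.0], coords={"x": [0.0, 30.0]}, dims="x", name="coarse")
fine = xr.DataArray(
    [10.0, 11.0, 12.0, 13.0], coords={"x": [7.5, 22.5, 37.5, 52.5]}, dims="x", name="fine"
)

# The merged dataset lives on the union of both sets of x values.
merged = xr.merge([coarse, fine], join="outer")

# The nearest x in the union grid is 7.5, where only `fine` has data.
point = merged.sel(x=8.0, method="nearest")

# Selecting on each original array finds that array's own nearest cell instead.
coarse_nearest = float(coarse.sel(x=8.0, method="nearest"))
print(point["coarse"].item(), point["fine"].item(), coarse_nearest)
```

Nearest-neighbor selection on the merged dataset picks the closest coordinate in the union grid, not the closest valid value of each variable, which is why putting both variables on a common grid is useful.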

We can regrid the band 8 to a 30m resolution or we can regrid the NDVI to a 15m resolution.

In [None]:
res = 30
p_8_30 = regrid(p_8, x_sampling=res, y_sampling=res, width=500, height=400, 
                x_range=(x_center-1e3, x_center+1e3), y_range=(y_center-1e3, y_center+1e3))
p_8_30 

In [None]:
get_res(p_8_30)

**NOTE:** `x_sampling` and `y_sampling` set the minimum allowable resolution, so the resolution of a given plot might not be exactly `x_sampling` and `y_sampling` unless it is sufficiently zoomed in.

### Interpolation
Now let's quickly take a look at up-sampling. For this we will use `regrid`, since up-sampling is not allowed in `rasterize`.

In [None]:
p_ndvi_15 = regrid(p, upsample=True, 
                   x_sampling=15, y_sampling=15, 
                   x_range=(x_center-1e3, x_center+1e3), y_range=(y_center-1e3, y_center+1e3))
p_ndvi_15

In [None]:
get_res(p_ndvi_15[1988])

This doesn't look any more resolved than 30m, but that is because nearest-neighbor interpolation is used by default, so the grid cells look the same size. The resolution becomes more apparent when using linear interpolation.

In [None]:
p_ndvi_15 = regrid(p, interpolation='linear', upsample=True, 
                   x_sampling=15, y_sampling=15, 
                   x_range=(x_center-1e3, x_center+1e3), y_range=(y_center-1e3, y_center+1e3))
p_ndvi_15.relabel('Using linear interpolation')

In [None]:
get_res(p_ndvi_15[1988])
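The difference between nearest-neighbor and linear up-sampling can be sketched in one dimension with NumPy. The cell centers and values below are hypothetical. Three 30 m cells are resampled onto six 15 m cells.

```python
import numpy as np

# Centers and values of three toy 30 m cells.
x30 = np.array([15.0, 45.0, 75.0])
v30 = np.array([0.0, 0.4, 0.8])

# Centers of the six 15 m cells covering the same extent.
x15 = np.array([7.5, 22.5, 37.5, 52.5, 67.5, 82.5])

# Nearest neighbor: each fine cell copies the value of the closest coarse cell.
nearest_idx = np.abs(x15[:, None] - x30[None, :]).argmin(axis=1)
nearest = v30[nearest_idx]

# Linear: values vary smoothly between coarse cell centers.
# Note that np.interp clamps to the end values outside the range of x30.
linear = np.interp(x15, x30, v30)
print(nearest, linear)
```

With nearest-neighbor, each coarse value is simply repeated, so the result looks no sharper than the input. Linear interpolation produces intermediate values between the coarse cell centers.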

### Similar workflow in `xarray`

`xarray` supports a number of interpolation methods for up-sampling data. Here is what it takes to rescale the NDVI images from res=30 to res=15 to match the panchromatic band. The options are `nearest` and `linear`, with `linear` being the default.

In [None]:
ndvi_15 = subset.interp_like(subset_8)
ndvi_15

In [None]:
ndvi_15.hvplot('x', 'y', col='year', 
               crs=crs, cmap='viridis', width=300,
               xlim=(x_center-1e3, x_center+1e3), ylim=(y_center-1e3, y_center+1e3))