# TS-3: Spatial/temporal averaging, Resampling, Variability and Grouping

This notebook explains four types of analysis:

* Using functions like `mean()` to remove the <a href="#basic_time">temporal</a> or <a href="#basic_space">spatial</a> dimension.
* <a href="#resampling">Resampling</a> from one time frequency to another, e.g. to calculate annual means.
* Thinking about <a href="#variability">variability</a>: using the standard deviation and deviation from the mean.
* Using <a href="#repeating">group-by</a> to analyze features that repeat over time.

All of these analyses rely on a set of operations to combine the data. Below we mainly use `mean()`. Depending on the type of data and your research question, there are plenty of alternatives which you can use instead of `mean()`:

* `median()`
* `std()` (standard deviation)
* `min()` / `max()` (minimum or maximum value)
* `sum()` (the total of all values)
* `first()` / `last()` (the first or last value in the group of values under consideration)
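
As a rough illustration of these alternatives, the following sketch applies them to a tiny synthetic series. The values and the names `demo`, `stats`, `monthly_first` and `monthly_last` are made up for this example and are not part of the Fribourg data set. Note that `first()` and `last()` only make sense once the data has been split into groups, so the sketch groups by calendar month first.

```python
import pandas as pd
import xarray as xr

# Synthetic stand-in for an NDVI time series: three values in January, three in February
time = pd.to_datetime(["2020-01-05", "2020-01-15", "2020-01-25",
                       "2020-02-04", "2020-02-14", "2020-02-24"])
demo = xr.DataArray([0.2, 0.4, 0.3, 0.5, 0.7, 0.6], coords={"time": time}, dims="time")

# Reductions that can be called directly on a DataArray
stats = {
    "mean": float(demo.mean()),
    "median": float(demo.median()),
    "min": float(demo.min()),
    "max": float(demo.max()),
    "sum": float(demo.sum()),
}
print(stats)

# first() / last() operate on groups, here the calendar months
monthly_first = demo.groupby("time.month").first()
monthly_last = demo.groupby("time.month").last()
print(monthly_first.values, monthly_last.values)
```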

See here for more background information: https://docs.xarray.dev/en/stable/user-guide/time-series.html#resampling-and-grouped-operations

Xarray's time-series functionality is based on the Pandas package. See also the Pandas documentation for lots of information on working with time-series data: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

***

In [None]:
%%html
<style>
    .dothis{
    font-weight: bold;
    color: #ff7f0e;
    font-size:large
    }
</style>

In [None]:
# Import modules

# reload module before executing code
%load_ext autoreload
%autoreload 2

# define modules locations (you might have to adapt define_mod_locs.py)
# %run ../sdc-notebooks/Tools/define_mod_locs.py

import os
import shutil
import xarray as xr
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
from matplotlib.patches import Polygon, Rectangle
from sdc_utilities import *

# silence warning (not recommended during development)
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Especially for the projector, we're going to use seaborn to make the figure text bigger.
import seaborn as sns
sns.set_context('talk')
plt.rcParams['figure.figsize'] = (16,8)       # this line changes the size of the figures displayed in the notebooks

<hr style="border-top:8px solid black" />

## *Preparing/downloading our data*

We will use a pre-prepared small data subset around Fribourg which we extracted from the Swiss Data Cube for you earlier. It is located in the `data/` folder.

<span style="color:gray; font-style:italic">We made this data subset using `ts1_data_preparation.ipynb`. You will find this approach useful when doing your project work.</span>

In [None]:
nc_filename = "data/landsat_ot_c2_l2_fribourg_example.nc"

In [None]:
# Open the prepared Landsat 8 subset for the Fribourg region 
ds = xr.open_dataset(nc_filename, engine='netcdf4')

In [None]:
# ds - the dataset
ds

In [None]:
# Create a 'shortcut' variable so that we can work with NDVI directly.
ndvi = ds.ndvi
ndvi

<a name="basic_time"></a>
<hr style="border-top:8px solid black" />

## Mean of an area through time

Let's start with a short reminder of how to reduce over space or time, which we covered in **Time Series 2: Selecting and Saving**.

Here, we take the average (mean) of all pixels in our cube. This removes the spatial coordinates, leaving us with just the temporal coordinate.
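
To make the effect on the dimensions visible without the real data, here is a minimal sketch on a synthetic cube. The cube, its coordinates and the masked pixel are assumptions made for illustration. It also shows that `mean()` skips missing values (`NaN`) by default for floating-point data, which matters when pixels have been masked out.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube with dimensions (time, y, x); the values are made up
time = pd.date_range("2021-01-01", periods=3, freq="D")
cube = xr.DataArray(
    np.arange(3 * 2 * 2, dtype=float).reshape(3, 2, 2),
    coords={"time": time, "y": [1.0, 0.0], "x": [0.0, 1.0]},
    dims=("time", "y", "x"),
)
# Pretend that one pixel was masked out (e.g. by clouds) at the first time step
cube[0, 0, 0] = np.nan

# Averaging over both spatial dimensions leaves only the time dimension
spatial_mean = cube.mean(dim=("x", "y"))
print(spatial_mean.dims)
print(spatial_mean.values)
```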

In [None]:
mean_thru_time = ndvi.mean(dim=('x', 'y'))
mean_thru_time

In [None]:
mean_thru_time.plot()

As an example of another operation, we could take the standard deviation instead:

In [None]:
ndvi.std(dim=('x', 'y')).plot()

<a name="basic_space"></a>
<hr style="border-top:8px solid black" />

## Mean of each pixel in a cube

We can remove the time coordinate by applying an operation like `mean()` over it. This leaves us with a single map of our spatial area.

In [None]:
mean_each_px = ndvi.mean(dim='time')
mean_each_px

In [None]:
mean_each_px.plot(vmin=0, vmax=1, cmap='Greens')

Just like with time, we could also compute a different statistic such as the median:

In [None]:
ndvi.median(dim='time').plot(vmin=0, vmax=1, cmap='Greens')

<a name="resampling"></a>
<hr style="border-top:8px solid black" />

## Resampling

We use `.resample()` to change the frequency of the time axis to e.g. monthly or annual.

The `resample()` operation takes the argument/keyword `time=(Frequency)`. Replace `(Frequency)` with your desired frequency. Popular examples include:

* `A` or `Y` - annual (i.e. yearly) frequency.
* `Q` - quarterly frequency.
* `M` - monthly frequency.
* `D` - daily frequency.
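
Which alias spellings are accepted depends on the installed pandas version: newer releases prefer `YE`, `QE` and `ME` for year, quarter and month end, and `YS` instead of `AS` for year start. The following sketch therefore uses the start-anchored aliases `MS`, `QS` and `YS`, which, to our knowledge, are accepted by both older and newer versions. The daily series is synthetic and only serves to count how many days fall into each period.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series covering two full years; every value is 1
time = pd.date_range("2019-01-01", "2020-12-31", freq="D")
daily = xr.DataArray(np.ones(time.size), coords={"time": time}, dims="time")

# Summing the ones counts the number of days in each period
per_month = daily.resample(time="MS").sum()    # labelled at month start
per_quarter = daily.resample(time="QS").sum()  # labelled at quarter start
per_year = daily.resample(time="YS").sum()     # labelled at year start

print(per_month.sizes["time"], per_quarter.sizes["time"], per_year.sizes["time"])
print(per_year.values)
```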

Background information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

In [None]:
# Again, we first need to make a spatial average
mean_thru_time = ndvi.mean(dim=('x', 'y'))

In [None]:
mean_thru_time.plot()

Now we make the annual time series. `A` stands for *Annual* and `S` stands for *start*. The `S` means that the time coordinates which are created correspond to the start of each period, e.g. 2019-01-01. Without the `S`, the coordinates would correspond to the end of each period, e.g. 2019-12-31, which sometimes makes interpreting graphs a bit tricky. There's an example of this in a moment.
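
Independently of the NDVI data, the difference between start and end labels can be sketched with a small synthetic pandas series. The series and the names `labels_start` and `labels_end` are assumptions for illustration. The sketch passes pandas offset objects instead of alias strings, which avoids the version-dependent spelling of the aliases.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for 2019 and 2020 (values are made up)
index = pd.date_range("2019-01-01", "2020-12-31", freq="D")
series = pd.Series(np.linspace(0.2, 0.8, index.size), index=index)

# The annual means are identical; only the time labels differ
labels_start = series.resample(pd.offsets.YearBegin()).mean().index
labels_end = series.resample(pd.offsets.YearEnd()).mean().index

print(labels_start[0].date(), labels_end[0].date())
```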

In [None]:
annual_ndvi = mean_thru_time.resample(time='AS').mean()

In [None]:
annual_ndvi

In [None]:
annual_ndvi.plot(marker='o')

If we look at this graph, we can see that the value for 2019 is about 0.56. Let's look at the resampled series again to confirm this:

In [None]:
annual_ndvi.sel(time='2019')

In contrast, if we didn't use the `S` letter in our `resample()` command then the time for 2019 would be 2019-12-31:

In [None]:
annual_ndvi_endyear = mean_thru_time.resample(time='A').mean()
annual_ndvi_endyear.sel(time='2019')

Looks OK so far. But when we plot a graph of this, we can see that the 2019 value appears to have been plotted in 2020 - this is because it has been placed at 2019-12-31:

In [None]:
annual_ndvi_endyear.plot(marker='o')

<a name="variability"></a>
<hr style="border-top: 8px solid black" />

## Variability

Variability refers to how much a variable like NDVI changes in general, as opposed to how much its values change systematically over time (trends or tendencies). The monthly example from before shows no statistically significant trend ($p > \alpha$), but the values still change a lot. Examples of sources of high variability are different crops on the fields, which result in different NDVI values, and different precipitation patterns in combination with temperature, which lead to variable snow cover.

A common statistic to describe variability is the sample **standard deviation**. 


$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$

The standard deviation has the same unit as the data in the time series, which makes it more intuitive to use than the ***variance***.
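
The formula above can be checked by hand with NumPy. The values below are made up. One detail worth knowing: `std()` in NumPy and xarray divides by $n$ by default (`ddof=0`), so `ddof=1` has to be passed to obtain the sample standard deviation with $n-1$ in the denominator.

```python
import numpy as np

# Made-up NDVI values for one pixel
x = np.array([0.42, 0.55, 0.61, 0.48, 0.39])
n = x.size

# Sample standard deviation, written out as in the formula
s_manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))

# NumPy equivalents: ddof=1 divides by n-1, the default ddof=0 divides by n
s_sample = np.std(x, ddof=1)
s_population = np.std(x)

print(s_manual, s_sample, s_population)
```

For long time series the difference between the two variants is small, but for a handful of annual values it can be noticeable.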


Another useful way to investigate variability is to look at the **deviation from the mean**, sometimes called anomalies. Instead of calculating a single statistic over all time steps, we derive one value for each time step.


### Standard deviation

One can directly calculate the standard deviation for each pixel by calling the function `.std('time')`, indicating that it should be applied over the **time** dimension.

The following example directly shows the difference between the urban and the rural area in terms of NDVI variability. Crop fields can easily be identified because their variability is especially high.


In [None]:
ndvi.std('time').plot.imshow()

In [None]:
# The same example but only for the month of August
ndvi.sel(time=ndvi.time.dt.month==8).std('time').plot.imshow()

### Deviation from the mean
As the name says, we have to calculate the mean first and subtract this value from each individual NDVI value. If there is a strong seasonality, we have to think of which mean we calculate (monthly, annual, ...), and of which data we subtract this mean (also monthly, annual, ...).
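
The choice of mean can be illustrated with a synthetic monthly series that has a strong seasonal cycle. This is a sketch under assumed values and is not the Fribourg data. Subtracting the overall mean leaves the seasonal cycle in the anomalies, whereas subtracting the mean of the same calendar month removes it.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series: a seasonal cycle plus a small offset that differs per year
time = pd.date_range("2018-01-01", periods=36, freq="MS")
months = time.month.values
years = time.year.values
values = 0.5 + 0.3 * np.sin(2 * np.pi * (months - 4) / 12) + 0.02 * (years - 2019)
series = xr.DataArray(values, coords={"time": time}, dims="time")

# Option 1: deviation from the overall mean (still contains the seasonal cycle)
dev_overall = series - series.mean("time")

# Option 2: deviation from the mean of the same calendar month (seasonality removed)
monthly_mean = series.groupby("time.month").mean()
dev_monthly = series.groupby("time.month") - monthly_mean

print(float(dev_overall.max()), float(dev_monthly.max()))
```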



In [None]:
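# long-term mean of each pixel over the whole period
da_annual_mean = ndvi.mean('time')
# annual mean of each pixel (one value per year, labelled at the start of the year)
da_annual = ndvi.resample(time='AS').mean()
# deviation of each annual mean from the long-term mean
da_dev_from_mean = da_annual - da_annual_mean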

# plot the time series for a pixel:
da_dev_from_mean_pixel = da_dev_from_mean.sel(x=2580000, y=1181500, method='nearest')

da_dev_from_mean_pixel.plot.line('ko-')
plt.hlines(y= da_dev_from_mean_pixel.mean(), 
           xmin=da_dev_from_mean_pixel.time[0], 
           xmax=da_dev_from_mean_pixel.time[-1])


In [None]:
# plot the deviation from the mean for the year 2018 - as a map
# da_dev_from_mean.sel(time=da_dev_from_mean.time.dt.year==2018)[0].plot.imshow()
da_dev_from_mean.sel(time=da_dev_from_mean.time.dt.year==2018).mean(dim='time').plot.imshow()


### Showcasing variability in line plots

When plotting mean values, e.g. extracted for a point or area, the line graph does not show how variable the mean is. This might be important information because the values around the mean might vary strongly.

A good way to include such information in a plot is by adding the standard deviation or percentiles as `error bars` or `boundary polygon` to the plot.
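
The boundary-polygon variant could look roughly like the following sketch, which shades the range of the mean plus and minus one standard deviation using matplotlib's `fill_between`. The annual values and the names `years`, `mean`, `std`, `lower` and `upper` are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up annual means and standard deviations
years = np.arange(2014, 2021)
mean = np.array([0.41, 0.44, 0.39, 0.47, 0.36, 0.45, 0.43])
std = np.array([0.05, 0.04, 0.06, 0.05, 0.08, 0.04, 0.05])

# Lower and upper boundary of the shaded band
lower = mean - std
upper = mean + std

fig, ax = plt.subplots()
ax.plot(years, mean, "o-", label="mean")
ax.fill_between(years, lower, upper, alpha=0.3, label="mean ± 1 std")
ax.set_xlabel("Year")
ax.set_ylabel("NDVI [-]")
ax.legend()
```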

The next cell shows the mean value again, this time with error bars (note that "error bars" is just what this plot feature is called; they do not necessarily represent a real **error**).
 

In [None]:
# Full example plotting mean annual values of a specified bounding box, with the standard deviation of the annual values as error bars

# a point in the middle of the study area
point_x = 2580000
point_y = 1181500

# bounding box 2 x 2 km around the chosen pixel (point_x, point_y)
x_coords = slice(point_x-1000, point_x+1000)
y_coords = slice(point_y+1000, point_y-1000)

# making the spatial subset (according to bounding box)
subset_spatial = ndvi.sel(y=y_coords, x=x_coords)

# averaging over time (resample) and then taking mean and STD (over all pixels for each annual time step)
ts_mean = subset_spatial.resample(time='AS').mean(dim=('time','x','y'))
ts_std1 = subset_spatial.resample(time='AS').std(dim=('time','x','y'))

# to pandas:
ts_mean_pd = ts_mean.to_pandas()
ts_std1_pd = ts_std1.to_pandas()



In [None]:
# The actual plot
ts_mean_pd.plot(yerr=ts_std1_pd,
                     fmt='o',
                     linestyle='--',
                     capsize=5,
                     ylim=(0, 0.5),
                     xlim=('2012','2022'),
                     xlabel='Time',
                     ylabel='NDVI [-]',
                     legend=False)


<a name="repeating"></a>
<hr style="border-top:8px solid black" />

## Features which repeat (e.g. annual cycles)

We can use the `.groupby()` function to group our data by a repeating feature. Here, we're often particularly interested in calculating statistics for each month of the calendar over a period of several years.

The following figure (also available in your `data/` folder as `groupby_example2_cropped.png`) shows an overview of how `.groupby()` works:


- <span style="color:darkblue">**Data**</span>
    - an example of an xarray or DataFrame with different columns
- <span style="color:darkgreen">**Selection**</span>
    - a pre-selection example to select two years
- <span style="color:darkred">**Aggregation (the actual `.groupby()` part)**</span>
    - groupby based on the different categories (columns) that allow grouping in different ways
- <span style="color:purple">**Process**</span>
    - examples of functions that can be applied at the end




![groupby](https://www.dropbox.com/scl/fi/9bi57pa5ak3zfkgb04u0k/groupby_example2_cropped.png?rlkey=1j3ua00n1n813z64igh2lvj4r&dl=1)
*Figure 1: Pandas/Xarray selection, grouping, and processing chaining examples.*
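
The chain of selection, grouping and processing shown in the figure can be sketched on a synthetic monthly series. The values are constructed so that the result is easy to check by eye, and all variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series for 2018-2021: calendar month plus 100 per year since 2018
time = pd.date_range("2018-01-01", "2021-12-01", freq="MS")
values = time.month.values + 100.0 * (time.year.values - 2018)
series = xr.DataArray(values, coords={"time": time}, dims="time")

# Selection: keep two years only
selected = series.sel(time=slice("2019", "2020"))

# Aggregation and process: group by calendar month, then apply a function
monthly_max = selected.groupby("time.month").max()

# Grouping by year instead gives one value per year
yearly_mean = selected.groupby("time.year").mean()

print(monthly_max.values)
print(yearly_mean.values)
```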

***
The first step is to reduce the spatial coordinates, leaving us with a time series:

In [None]:
# Let's reduce on the spatial coordinates. We do this by calculating the average value over the whole area at each time point
# (We already did this earlier in the notebook, we're just doing it again here for completeness)
mean_thru_time = ndvi.mean(dim=('x', 'y'))

In [None]:
mean_thru_time

In [None]:
mean_thru_time.plot()

Now we are going to calculate what the average annual cycle looks like, by taking the mean of all observations in each calendar month.

In [None]:
# Now let's calculate an annual cycle by taking the mean of every calendar month
cycle = mean_thru_time.groupby('time.month').mean()

In [None]:
cycle

In [None]:
cycle.plot()

In [None]:
# If we just want seasons then we can also do this
seasonal_cycle = mean_thru_time.groupby('time.season').mean()

In [None]:
seasonal_cycle

In [None]:
seasonal_cycle.plot()
# However, this will fail because seasons are 'categorical' (i.e. not numerical) so xarray doesn't understand how to plot it.

In [None]:
seasonal_cycle['season'].values

In [None]:
# A workaround is to explicitly transform the xarray DataArray into a pandas DataFrame

import pandas as pd

df = pd.DataFrame(seasonal_cycle)
# Overwrite the Index values with the Season abbreviations ("DJF", "MAM", ...)
df.set_index(seasonal_cycle['season'].values, inplace=True)


# The ordering is not as we want it (Summer before Spring):
df

In [None]:
# a rather complicated workaround:
df.index = pd.CategoricalIndex(df.index, categories=['DJF', 'MAM', 'JJA', 'SON'], ordered=True)
df = df.sort_index()

# Plotting
df.plot(kind='line', legend=False)
plt.xlabel("Season")
plt.ylabel("Value")
plt.title("Seasonal Values")
plt.show()

In [None]:
df

Long story short: it is easier to work with numerical values on the x-axis (like months [1,2,3,...]).
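
If the season labels are still wanted, one possible shortcut that stays in xarray is to select the labels in the desired order with `.sel()`. The sketch below uses a synthetic monthly series, and the names `seasonal` and `ordered` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series for one year; each value equals its calendar month
time = pd.date_range("2020-01-01", periods=12, freq="MS")
series = xr.DataArray(np.arange(1.0, 13.0), coords={"time": time}, dims="time")

# Grouping by season returns the labels in alphabetical order
seasonal = series.groupby("time.season").mean()
print(seasonal["season"].values)

# Reorder by selecting the labels in the order we want
ordered = seasonal.sel(season=["DJF", "MAM", "JJA", "SON"])
print(ordered["season"].values)
print(ordered.values)
```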