# TS-2: Getting to know your data

This Notebook demonstrates:

* What the temporal and spatial dimensions are
* How you can access the data in these dimensions
* How to account for spatial variability
* How to mitigate missing data
* How to export a time series as a CSV file which you can open in Excel for further analysis.

In [None]:
%%html
<style>
    .dothis{
    font-weight: bold;
    color: #ff7f0e;
    font-size:large
    }
</style>

In [None]:
# Import modules

# reload module before executing code
%load_ext autoreload
%autoreload 2

# define modules locations (you might have to adapt define_mod_locs.py)
# %run ../sdc-notebooks/Tools/define_mod_locs.py

import os
import shutil
import xarray as xr
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
from matplotlib.patches import Polygon, Rectangle
from sdc_utilities import *

# silence warning (not recommended during development)
import warnings
warnings.filterwarnings("ignore")

In [None]:
## Changes to make figures easier to read

# Especially for the projector, we use seaborn to make the figure text bigger.
import seaborn as sns
sns.set_context('talk')

# this line changes the size of the figures displayed in the notebooks
plt.rcParams['figure.figsize'] = (16,8)       

<hr style="border-top:8px solid black" />

## *Preparing our data*

We will use a pre-prepared small data subset around Fribourg which we extracted from the Swiss Data Cube for you earlier. It is located in the `data/` folder.

<span style="color:gray; font-style:italic">We made this data subset using `ts1_data_preparation.ipynb`. You will find this approach useful when doing your project work.</span>

In [None]:
nc_filename = "data/landsat_ot_c2_l2_fribourg_example.nc"


In [None]:
# Open the prepared Landsat 8 subset for the Fribourg region 
ds = xr.open_dataset(nc_filename, engine='netcdf4')

In [None]:
# ds - the dataset
ds

In [None]:
# Create a 'shortcut' variable so that we can work with NDVI directly.
ndvi = ds.ndvi
ndvi

<hr style="border-top:8px solid black">

## Temporal Data
### Time components

Of special interest for us is the `time` dimension. `time` has multiple attributes that allow you to select data of interest. We can look at all the time steps in the dataset by calling `<xarrayDataArray>.time`. In the cell below you will see that the time of each scene is stored in a very detailed format:

- 2013-04-18T10:18:18.000000000

with:
- 2013 - year
- 04 - month
- 18 - day
- 10:18:18.000000000 - Hour:Minute:Second

*More information can be found at https://docs.xarray.dev/en/stable/user-guide/time-series.html#datetime-components.*

In [None]:
ds.time
# ds["time"]  # will yield the same output / different way of writing it

We can access the individual components using the same syntax, with an additional `.dt` followed by the attribute of interest:

In [None]:
# examples
ds.time.dt.month
# ds.time.dt.day
# ds.time.dt.year
# ds.time.dt.season
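
The `.dt` accessor can also be tried out without the Fribourg file. The following is a minimal sketch on three made-up acquisition times (they are assumptions for illustration, not values read from the dataset):

```python
import pandas as pd
import xarray as xr

# Three made-up acquisition times (NOT taken from the Fribourg dataset)
times = pd.to_datetime(
    ["2013-04-18T10:18:18", "2013-07-07T10:18:20", "2014-01-15T10:17:59"]
)
time = xr.DataArray(times, dims="time", coords={"time": times}, name="time")

# Each component comes back as a DataArray with one entry per time step
years = time.dt.year.values.tolist()
months = time.dt.month.values.tolist()
days_of_year = time.dt.dayofyear.values.tolist()

print(years)         # [2013, 2013, 2014]
print(months)        # [4, 7, 1]
print(days_of_year)  # [108, 188, 15]
```

Each component has the same length as the `time` dimension, which is what makes it usable as a selection condition later on.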

***
> **Note** The date/time string is in a format that we understand (years, months, days, etc.). Inside a computer, the date/time is represented as a numeric value. A standard way is to represent any date as the number of days since "1970-01-01". This makes it possible to convert the date/time string into something meaningful for the computer.
***

In [None]:
from matplotlib import dates

print(dates.date2num(np.datetime64('1850-11-17 13:12:11')))
print(dates.date2num(np.datetime64('1970-01-01 00:00:00')))  # this is the standard time starting point
print(dates.date2num(np.datetime64('2022-11-17')))

# The output unit is [days since start]
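
The same idea can be checked with Python's standard library alone. This is a minimal sketch, independent of matplotlib, that counts whole days since 1970-01-01 and converts the number back into a date. The helper names `days_since_epoch` and `date_from_days` are illustrative and not part of any library.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # the reference date used in the note above


def days_since_epoch(d):
    """Return the number of whole days between EPOCH and the date d."""
    return (d - EPOCH).days


def date_from_days(n):
    """Inverse operation: turn a day count back into a calendar date."""
    return EPOCH + timedelta(days=n)


n = days_since_epoch(date(2022, 11, 17))
print(n)                  # 19313
print(date_from_days(n))  # 2022-11-17
```

Dates before the reference date simply produce negative numbers.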

### Reducing over *time* produces a map

Each pixel (x, y / lon, lat) of the DataArray `ndvi` holds the evolution of the Normalized Difference Vegetation Index (NDVI) along the time axis.

First, let's look at the 3-D data array:

In [None]:
ndvi = ds.ndvi
ndvi

Now we can take a mean of data from only September 2013:

In [None]:
sept13 = ds.ndvi.sel(time='2013-09').mean(dim="time")
sept13

# In comparison, the mean over all September months:
# september_mean = ds.ndvi.sel(time=ds.time.dt.month == 9).mean(dim="time")

Finally we can plot the map:

In [None]:
sept13.plot.imshow(vmin=0, vmax=1, cmap=cm.Greens)

The map above shows the average value (`.mean()`) over the time axis (`dim="time"`) for all scenes (images) available in September for the year 2013.

This example reduces the dimensions of the DataArray (`ndvi`) from 3-D:
- time
- y (latitude)
- x (longitude)

to 2-D:
- y
- x


***
We might instead want to look at all values from April, combining all the Aprils across all the years of our data. We can do this as follows.

`.sel()` allows us to select certain months, seasons, or years by asking where the **time components** match a condition. In the example below the expression `ndvi.time.dt.month==4` asks where the `month` component matches the value `4` (April). 

In [None]:
# Only the time dimension values are shown because of the additional ".time" at the end. To get the whole DataArray, remove this ending.
ndvi.sel(time=ndvi.time.dt.month==4).time
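
To see what the condition actually contains, it helps to look at the boolean mask on its own. The sketch below uses a made-up monthly series (`toy`) rather than the real `ndvi` data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# One made-up value per month over two years (2013-01 to 2014-12)
times = pd.date_range("2013-01-01", periods=24, freq="MS")
toy = xr.DataArray(np.arange(24.0), dims="time", coords={"time": times})

# The comparison yields a boolean DataArray with one True/False per time step
is_april = toy.time.dt.month == 4

# .sel() keeps only the time steps where the mask is True
aprils = toy.sel(time=is_april)

print(int(is_april.sum()))         # 2
print(aprils.time.dt.year.values)  # [2013 2014]
print(aprils.values)               # [ 3. 15.]
```

Because the mask is an ordinary boolean array, several conditions can be combined with `&` (and) or `|` (or), as long as each condition is wrapped in parentheses.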

Like previously, we can produce a map, but this time only using the April values from 2013 to 2021:

In [None]:
ndvi.sel(time=ndvi.time.dt.month == 4).mean(dim='time').plot(vmin=0, vmax=1, cmap='Greens')

<span class='dothis'>Try to select all time steps from the `ndvi` DataArray that correspond to summer `JJA` (June, July, August) using the **time component**: `.time.dt.season`.</span>

In [None]:
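# Put your code here. Don't be shy to copy+paste.
# The lines below are only a starting point: they average every 20 April.
# Replace the condition with one based on .time.dt.season.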
ndvi.sel(time=((ndvi.time.dt.month == 4) & (ndvi.time.dt.day == 20))).mean(dim='time')\
    .plot(vmin=0, vmax=1, cmap='Greens')

<hr style="border-top: 8px solid black">

## Spatial data

### Reducing over *space* produces a time series

What about selecting only a certain spatial location from a Dataset or our DataArray?

We set the total spatial extent when we extracted this Dataset using `ts1_data_preparation`. We can remind ourselves of the total extent now:

In [None]:
ndvi.coords

In [None]:
# a point in the middle of the study area
point_x = 2580000
point_y = 1181500

In [None]:
# Let's draw another map to show where this point is located
ndvi.sel(time=ndvi.time.dt.month == 4).mean(dim='time').plot(vmin=0, vmax=1, cmap='Greens')
plt.plot(point_x, point_y, 'o', color='yellow', markersize=10)
plt.show()

In [None]:
# in which "dimensions" is the information stored?
ndvi.dims

In [None]:
# With the .sel() method you select certain data. You name the dimension
# in which the value should be looked up. In this example these are "x" (longitude) and "y" (latitude).

da = ndvi.sel(
    x=point_x,      
    y=point_y,     
    method="nearest"               # the nearest method finds the 1 closest pixel
             )
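
The difference between an exact lookup and `method="nearest"` can be sketched on a tiny made-up grid. The coordinates and values below are assumptions for illustration only:

```python
import numpy as np
import xarray as xr

# Made-up 3 x 3 grid with 30 m spacing; y descends, as in a north-up image
toy = xr.DataArray(
    np.arange(9.0).reshape(3, 3),
    dims=("y", "x"),
    coords={"y": [60.0, 30.0, 0.0], "x": [0.0, 30.0, 60.0]},
)

# An exact lookup fails because no pixel centre sits at (40, 10)
try:
    toy.sel(x=40.0, y=10.0)
    exact_match_found = True
except KeyError:
    exact_match_found = False

# "nearest" snaps to the closest pixel centre in each dimension
pixel = toy.sel(x=40.0, y=10.0, method="nearest")

print(exact_match_found)               # False
print(float(pixel.x), float(pixel.y))  # 30.0 0.0
print(float(pixel))                    # 7.0
```

The returned pixel carries its own `x` and `y` coordinates, so it is easy to check how far the snapped location is from the requested one.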

In [None]:
# Look at the output: the dimensions have been reduced. x and y are now single values and are no longer dimensions.
da.dims

In [None]:
da

In [None]:
da.plot.line('o')
# da[da<0.6].plot.line('o')  # Example to skip plotting values >= 0.6

The plot above shows the NDVI value of one pixel at each time step as a blue point.

### Reducing over an area of space

Above, we looked at just a single pixel. Now let's look at how we can extract small or large spatial areas using the `slice()` command together with `.sel()`.

`slice()` allows us to literally 'slice' out a smaller area from within a larger one. To use it we provide the minimum and maximum coordinates which define our box of interest.

In [None]:
x_coords = slice(point_x-1000, point_x+1000)
y_coords = slice(point_y+1000, point_y-1000)
# NOTE: the order   ^        ,   ^   is the HIGHER and THEN the LOWER y value (latitude).
# That is because image rows run from top to bottom, whereas y coordinates
# increase from south to north (bottom to top), so the y axis is stored in descending order.
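
The effect of the slice order can be checked on a made-up descending coordinate. This is a sketch with invented values, not the real `y` axis of the dataset:

```python
import numpy as np
import xarray as xr

# Made-up y axis stored from north to south (descending), as in most rasters
toy = xr.DataArray(
    np.arange(5.0),
    dims="y",
    coords={"y": [400.0, 300.0, 200.0, 100.0, 0.0]},
)

high_to_low = toy.sel(y=slice(300.0, 100.0))  # matches the storage order
low_to_high = toy.sel(y=slice(100.0, 300.0))  # wrong order: selects nothing

print(high_to_low.y.values)    # [300. 200. 100.]
print(low_to_high.sizes["y"])  # 0

# Alternative: sort the coordinate first, then slice from low to high
sorted_slice = toy.sortby("y").sel(y=slice(100.0, 300.0))
print(sorted_slice.y.values)   # [100. 200. 300.]
```

An empty result after slicing is therefore a good hint that the start and stop values are in the wrong order for that axis.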

In [None]:
da = ndvi.sel(y=y_coords, x=x_coords)

In [None]:
da

To illustrate what we have just done, let's make a map which shows the area that we are extracting a time series from:

In [None]:
# Overview map with positions indicated by circles
fig,ax = plt.subplots(1)

# First plot the mean NDVI of the whole time series as a map
ndvi.mean(dim='time').plot.imshow(vmin=0,
                   vmax=1,
                   cmap=cm.Greens)

area = Rectangle((x_coords.start, y_coords.stop),            # Corner
                 x_coords.stop-x_coords.start,              # Width
                 y_coords.start-y_coords.stop,              # Height
                facecolor="#FF000022", edgecolor='r'   # Formatting
                )
# Draw a box of the area we have extracted
ax.add_patch(area)

ax.plot(point_x, point_y, 'o', mfc='yellow')

Plotting a time series of the rectangle in the map above:

In [None]:
da.mean(dim=('x', 'y')).plot.line(x='time', marker='o', linestyle='none')
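
When averaging over an area, pixels without data (for example those masked because of clouds) are skipped by default. The sketch below uses a made-up 2 x 2 area to show this, and counts how many valid pixels contribute to each mean. The array `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-06-01", periods=3, freq="16D")
data = np.array([
    [[0.2, 0.4], [0.6, 0.8]],              # all four pixels valid
    [[0.5, np.nan], [np.nan, 0.7]],        # two pixels missing
    [[np.nan, np.nan], [np.nan, np.nan]],  # scene without any valid pixel
])
toy = xr.DataArray(
    data,
    dims=("time", "y", "x"),
    coords={"time": times, "y": [10.0, 0.0], "x": [0.0, 10.0]},
)

# NaN pixels are ignored by default; a scene with no valid pixel stays NaN
series = toy.mean(dim=("x", "y"))

# Number of valid pixels behind each value of the time series
valid = toy.notnull().sum(dim=("x", "y"))

print(series.values)
print(valid.values)  # [4 2 0]
```

Keeping the pixel count next to the mean makes it possible to discard time steps that rest on only a few pixels.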

<hr style="border-top: 8px solid black">

## A complete summary of `.sel()`

Once we have loaded some data from the Swiss Data Cube, we might be interested in looking only at specific parts of it in either time or space. To do this we use the `.sel()` method. The keywords/arguments that we supply are any, some, or all of the dimensions:

- 1st dimension: `time`
- 2nd dimension: `y` (latitude)
- 3rd dimension: `x` (longitude)

To tell the method which selection we want, we define a **single value** to look for (e.g. `time='2019-10-30'`), or a **range** (e.g. `x=slice(2579000, 2581000)`). The `slice()` object is interpreted directly by `.sel()`: all the values between the first (2579000) and the last value (2581000) are returned.


**Examples**

Specific dates and date ranges:
- `mydata.sel(time='2019-10-12')` - one date - will find a time step and its values only if there is data on that day! If there is no data available then no data will be returned.
- `mydata.sel(time='2019-10-13', method="nearest")` - one date, and method='nearest' because the exact time entry is: `2019-10-12T10:17:17`. This will return the value from the day before.
- `mydata.sel(time=slice('2019-10-11', '2019-10-13'))` - this one will return all the entries in the time `slice`
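
The three date selections can be compared side by side on a small made-up series. The times and values below are assumptions that mirror the example in the list:

```python
import pandas as pd
import xarray as xr

# Three made-up scenes; the middle one matches the example time in the list above
times = pd.to_datetime(
    ["2019-09-26T10:17:10", "2019-10-12T10:17:17", "2019-10-28T10:17:20"]
)
toy = xr.DataArray([0.61, 0.55, 0.48], dims="time", coords={"time": times})

one_day = toy.sel(time="2019-10-12")                      # all scenes on that day
nearest = toy.sel(time="2019-10-13", method="nearest")    # closest single scene
window = toy.sel(time=slice("2019-10-11", "2019-10-13"))  # everything in the range

print(one_day.size)                  # 1
print(str(nearest.time.values)[:10]) # 2019-10-12
print(float(nearest))                # 0.55
print(window.size)                   # 1
```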


All dates of the same month:
- `mydata.sel(time=mydata.time.dt.month==4)` - select all time steps where the month is April
- `mydata.sel(time=mydata.time.dt.month.isin([1, 2, 3]))` - select all time steps where the month is January, February, or March
***
Spatial:
- `mydata.sel(y=1182500)` - the point at Swiss Grid y 1182500 m. This requires an exact match: if no pixel centre has this coordinate, xarray raises a `KeyError`.
- `mydata.sel(y=1182500, method="nearest")` - the measurement at the point closest to Swiss Grid y 1182500 m.
- `mydata.sel(y=slice(1182500, 1180500))` - all measurements between the two coordinates.
- You can also do the same thing with x (longitude), just change the keyword accordingly.

***
Combining selections:

You can combine exact and "nearest" selections by using two `.sel()` operations:

In [None]:
ndvi.sel(time='2019-10-13', method='nearest').sel(y=slice(1182500, 1180500))

You can combine multiple keywords in one statement:

In [None]:
ndvi.sel(x=2578000, time='2019-08-12', method='nearest')

<span class='dothis'>Now try out some different dates and x/y coordinates, with and without the `method='nearest'`, and a `slice(<date-start>, <date-end>)` operation.</span>

In [None]:
# Put your code here:


<hr style="border-top:8px solid black" />

## Saving a time series to a CSV file

CSV files are 'comma-separated-value' files that can be opened in Excel and many other analysis packages.

Let's export a time series of NDVI from a small part of Fribourg that we have loaded.

In [None]:
ndvi_thru_time = ndvi.sel(x=point_x, y=point_y, method='nearest')

Next we check that we have only a single dimension, in this case time. (Otherwise the export won't work!)

In [None]:
ndvi_thru_time

We convert the data to a Pandas DataFrame:

In [None]:
ndvi_thru_time_pd = ndvi_thru_time.to_dataframe()

Let's take a look at it. Note that it looks very similar to an xarray DataArray - it just does not have spatial coordinates any more.

In [None]:
ndvi_thru_time_pd

Now we save the data to a Comma Separated Values (CSV) file. There are further options for reducing the time series, e.g. to daily values (see the next notebook for details). Before saving the file to CSV, we calculate the mean daily values and remove rows with no data. This makes plotting easier later on in your preferred software (Python, R, Excel).

In [None]:
# Resample and calculate daily mean for all variables
ndvi_thru_time_daily_pd = ndvi_thru_time_pd.resample('1D').mean()

# remove all rows without values
ndvi_thru_time_daily_clean_pd = ndvi_thru_time_daily_pd.dropna()

# Save the original (not resampled) time series, including rows without values
ndvi_thru_time_pd.to_csv('ndvi_over_fribourg.csv')
# Save after removing NA values
ndvi_thru_time_daily_clean_pd.to_csv('ndvi_daily_clean_over_fribourg.csv')
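
The resample, clean and export steps can be tested on a handful of made-up values before running them on real data. In this sketch the DataFrame `frame` is invented, and an in-memory buffer stands in for the CSV file so that nothing is written to disk:

```python
import io

import numpy as np
import pandas as pd

# Made-up NDVI samples: two scenes on the same day, one scene without data, one later scene
index = pd.to_datetime([
    "2020-06-01T10:17:00",
    "2020-06-01T10:18:00",
    "2020-06-03T10:17:00",
    "2020-06-05T10:17:00",
])
frame = pd.DataFrame({"ndvi": [0.60, 0.70, np.nan, 0.80]}, index=index)
frame.index.name = "time"

daily = frame.resample("1D").mean()  # one row per calendar day, NaN where nothing valid
clean = daily.dropna()               # keep only days that have a value

buffer = io.StringIO()               # stands in for a file on disk
clean.to_csv(buffer)
buffer.seek(0)

# Reading the CSV back confirms that dates and values survive the round trip
restored = pd.read_csv(buffer, index_col="time", parse_dates=True)

print(len(daily), len(clean))  # 5 2
print(restored)
```

Resampling to daily values first creates a row for every calendar day in the period, which is why `dropna()` is applied before saving.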

You can now download the file that was made and open it in Excel or some other analysis software.