# TS-1: Data preparation

*****

This notebook lets you load and pre-process an SDC dataset and save it to a NetCDF (.nc) file, so that it can be reloaded quickly in the other notebooks where you do your analysis.

Things you should change:

* The config_cell variables
* The output filename of the NetCDF file (see the last cell).

Then, note that the Notebook has two different options depending on the dataset that you want to pre-process:

* Landsat
* Land use statistics

Only execute the section which corresponds to the product that you specified in the config_cell!

*****
<span style="color:red">
    
Key processing steps in this notebook:
- Definition of *bad* pixels (clouds, fill values, ...)
- Setting bad pixels to `NA`
- Scaling with scaling factors (original values are stored as `Integer` but need to be transformed to `Float`)
- Cleaning empty timesteps (scenes with no valid data)

</span>
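
The scaling step listed above can be sketched as follows. The scale and offset values are the ones USGS publishes for Landsat Collection 2 Level-2 surface reflectance and surface temperature. Whether `load_product_ts` already applies them internally is not visible in this notebook, so treat this as an illustration rather than a step to run on top of the loaded data. The function name `scale_landsat_c2` and the sample digital numbers are made up for the example.

```python
import numpy as np

# USGS Landsat Collection 2 Level-2 scaling factors
SR_SCALE, SR_OFFSET = 2.75e-05, -0.2      # surface reflectance (unitless)
ST_SCALE, ST_OFFSET = 0.00341802, 149.0   # surface temperature (Kelvin)
FILL_VALUE = 0                            # fill value of the integer SR/ST bands


def scale_landsat_c2(dn, scale, offset, fill=FILL_VALUE):
    """Convert integer digital numbers to float values; fill pixels become NaN."""
    dn = np.asarray(dn)
    scaled = dn.astype("float64") * scale + offset
    return np.where(dn == fill, np.nan, scaled)


# Hypothetical raw values of a red band (0 is the fill value)
raw_red = np.array([0, 10000, 20000], dtype="uint16")
red = scale_landsat_c2(raw_red, SR_SCALE, SR_OFFSET)
print(red)
```

The same helper would convert a surface temperature band to Kelvin when called with `ST_SCALE` and `ST_OFFSET`.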

In [None]:
# Import modules

# reload module before executing code
%load_ext autoreload
%autoreload 2

# Load packages
import ast
import time  # note: the config cell below rebinds the name `time` to a date tuple

import dask.distributed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import psutil
import rioxarray
import xarray as xr
from odc.stac import stac_load
from pystac_client import Client

from sdc_utilities import *
# silence warning (not recommended during development)
import warnings
warnings.filterwarnings("ignore")

ds_clean = None
ds_astat = None

The next cell contains the dataset configuration information:
- product
- geographical extent
- time period
- bands
- ...

You can generate it in three ways:
1. manually from scratch,
2. by manually copy/pasting the final cell content of the [config_tool](config_tool.ipynb) notebook,
3. by loading the final cell content of the [config_tool](config_tool.ipynb) notebook using the magic `%load config_cell.txt`. Run the cell twice: the first run loads the content, the second executes it.

In [None]:
# %load "config_cell.txt"
# Configuration

product = 'landsat_ot_c2_l2'
measurements = ['QA_PIXEL', 'SR_B2', 'SR_B3', 'SR_B4', 'SR_B5', 'SR_B6', 'SR_B7', 'ST_B10']
aliases = ['QA_PIXEL', 'blue', 'green', 'red', 'nir', 'swir_1', 'swir_2', 'surface_temperature']  # you can also provide only the aliases and get the measurements with:
# measurements, aliases = get_alias_band(aliases)
# To make your life easier, you can manually replace each entry in `measurements`
# with one of its aliases.

longitude = (7.127, 7.199)
latitude = (46.773, 46.816)
crs = 'epsg:4326'

time = ('2016-04-01', '2016-07-01')
# the following date formats are also valid:
# time = ('2000-01-01', '2001-12-31')
# time=('2000-01', '2001-12')
# time=('2000', '2001')

# You could use a UTM zone as the output CRS, following the DataCube system.
# We prefer the Swiss grid (epsg:2056) instead.

output_crs = 'epsg:2056'
resolution = -30.0, 30.0

# These are the pixel classifications for Sentinel (SCL) and Landsat (QA_PIXEL); 
# you can use values to mask out values that belong to certain classes

###################################
# SCL categories:                 #
#   0 - no data                   #
#   1 - saturated or defective    #
#   2 - dark area pixels          #
#   3 - cloud_shadows             #
#   4 * vegetation                #
#   5 * not vegetated             #
#   6 * water                     #
#   7 * unclassified              #
#   8 - cloud medium probability  #
#   9 - cloud high probability    #
#  10 - thin cirrus               #
#  11 * snow                      #
###################################

# Check for more detailed information: 
# - Landsat 8/9 (OLI/TIRS), Page 19:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/media/files/LSDS-1619_Landsat8-9-Collection2-Level2-Science-Product-Guide-v6.pdf
# - Landsat 7 (ETM+), Page 15:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/media/files/LSDS-1337_Landsat7ETM-C2-L2-DFCB-v6.pdf
# - Landsat 4,5 (TM), Page 18:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/atoms/files/LSDS-1415_Landsat4-5-TM-C2-L1-DFCB-v3.pdf

#############################################
# QA_PIXEL BITS : CATEGORIES (Collection 2) #
#    0 : Fill                               #
#    1 : Dilated cloud                      #
#    2 : Cirrus (Landsat 8/9 only)          #
#    3 : Cloud                              #
#    4 : Cloud shadow                       #
#    5 : Snow                               #
#    6 : Clear                              #
#    7 : Water                              #
#############################################

chunks = {"x": 2048, "y": 2048, "time": 1}  # chunks of 2048 x 2048 pixels are OK with ~21 GB of memory available



In [None]:
# %load "config_cell.txt"
# Configuration

product = 'landsat_ot_c2_l2'
measurements = ['QA_PIXEL', 'SR_B2', 'SR_B3', 'SR_B4', 'SR_B5', 'SR_B6', 'SR_B7']
aliases = ['QA_PIXEL', 'blue', 'green', 'red', 'nir', 'swir_1', 'swir_2']  # you can also provide only the aliases and get the measurements with:
# measurements, aliases = get_alias_band(aliases)
# To make your life easier, you can manually replace each entry in `measurements`
# with one of its aliases.

longitude = (7.127, 7.199)
latitude = (46.773, 46.816)
crs = 'epsg:4326'

time = ('2016-04-01', '2016-07-01')
# the following date formats are also valid:
# time = ('2000-01-01', '2001-12-31')
# time=('2000-01', '2001-12')
# time=('2000', '2001')

# You could use a UTM zone as the output CRS, following the DataCube system.
# We prefer the Swiss grid (epsg:2056) instead.

output_crs = 'epsg:2056'
resolution = -30.0, 30.0

# These are the pixel classifications for Sentinel (SCL) and Landsat (QA_PIXEL); 
# you can use values to mask out values that belong to certain classes

###################################
# SCL categories:                 #
#   0 - no data                   #
#   1 - saturated or defective    #
#   2 - dark area pixels          #
#   3 - cloud_shadows             #
#   4 * vegetation                #
#   5 * not vegetated             #
#   6 * water                     #
#   7 * unclassified              #
#   8 - cloud medium probability  #
#   9 - cloud high probability    #
#  10 - thin cirrus               #
#  11 * snow                      #
###################################

# Check for more detailed information: 
# - Landsat 8/9 (OLI/TIRS), Page 19:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/media/files/LSDS-1619_Landsat8-9-Collection2-Level2-Science-Product-Guide-v6.pdf
# - Landsat 7 (ETM+), Page 15:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/media/files/LSDS-1337_Landsat7ETM-C2-L2-DFCB-v6.pdf
# - Landsat 4,5 (TM), Page 18:
# https://d9-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/s3fs-public/atoms/files/LSDS-1415_Landsat4-5-TM-C2-L1-DFCB-v3.pdf

#############################################
# QA_PIXEL BITS : CATEGORIES (Collection 2) #
#    0 : Fill                               #
#    1 : Dilated cloud                      #
#    2 : Cirrus (Landsat 8/9 only)          #
#    3 : Cloud                              #
#    4 : Cloud shadow                       #
#    5 : Snow                               #
#    6 : Clear                              #
#    7 : Water                              #
#############################################

chunks = {"x": 2048, "y": 2048, "time": 1}  # chunks of 2048 x 2048 pixels are OK with ~21 GB of memory available
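
To get a feeling for the `chunks` setting, the size of a single chunk can be estimated by hand. This is only a rough sketch: the helper `chunk_mebibytes` is made up for this example, and the total memory actually needed also depends on the number of bands, the number of Dask workers and any intermediate copies, none of which are modelled here.

```python
import math

chunks = {"x": 2048, "y": 2048, "time": 1}


def chunk_mebibytes(chunks, bytes_per_value):
    """Size of one chunk of one band in MiB."""
    n_values = math.prod(chunks.values())
    return n_values * bytes_per_value / 2**20


# Integer bands as stored, and float bands after masking or scaling
for dtype_name, nbytes in [("uint16", 2), ("float32", 4), ("float64", 8)]:
    size = chunk_mebibytes(chunks, nbytes)
    print(f"{dtype_name}: {size:.0f} MiB per chunk and band")
```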



In [None]:
# In case you loaded the cell above from the config_cell.txt:
# Did you run the cell again above after loading the config?

## Load Landsat satellite data

In [None]:
# Start a local Dask cluster and open the STAC catalog of the Swiss Data Cube
client = dask.distributed.Client()
catalog = Client.open("https://explorer.swissdatacube.org/stac")



In [None]:
# Load dataset with parameters defined in the config cell
dataset_in = load_product_ts(catalog=catalog,
                        product=product,
                        longitude=longitude,
                        latitude=latitude,
                        output_crs=output_crs,
                        measurements=measurements,
                        resolution = resolution,
                        time=time,
                        chunks=chunks,
                        rename=True,
                        alias_names = aliases
                        )

In [None]:
dataset_in

In [None]:
# list the acquisition times of the loaded scenes
dataset_in.time

In [None]:
# plot first 5 scenes
dataset_in.blue[0:5,:,:].plot(col='time', vmin=0)

In [None]:
# dataset_in.to_netcdf('dataset_in.nc')

## Pre-defined cloud identification and classification

Check the notebook `help_cloudcover.ipynb` for more information.
It makes sense to find out first which pixel classification (clouds, water, land, etc.) works best for your study region. Some classes might not work well in mountains and on snow. In that case a manual selection (plotting several scenes and identifying the dates you want) might be your best choice.

An overview of the pixel classifications and quality flags can be found online. The links are provided in the `config_cell.txt` file, and screenshots of these tables are in the folder `data/` with the names `Landsat8_QA_PIXEL.png`, `Landsat4-7_QA_PIXEL.png`, and `Sentinel2_SCL.png`.

<img src="https://www.dropbox.com/scl/fi/zslaub479bjrpmqnlrkb3/Landsat8_QA_PIXEL.png?rlkey=jglvektr4mw7vax5bcm97jd1p&dl=1" width="600" />

*Figure 1: Landsat 8 QA_PIXEL Bit flags example.*
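
`QA_PIXEL` packs several yes/no flags into one integer, one flag per bit. The sketch below shows the general technique of testing individual bits with NumPy. It is not the implementation of `create_mask_from_bits` from `sdc_utilities`, which is not shown in this notebook; the helper name `flag_any_bit` is hypothetical. The sample values assume the USGS Collection 2 layout, where bit 0 is fill and bit 3 is cloud.

```python
import numpy as np


def flag_any_bit(qa, bit_positions):
    """Return True where at least one of the given bits is set."""
    qa = np.asarray(qa, dtype="uint16")
    flagged = np.zeros(qa.shape, dtype=bool)
    for bit in bit_positions:
        # shift the bit of interest to position 0 and isolate it
        flagged |= ((qa >> bit) & 1).astype(bool)
    return flagged


# Sample values: 1 has only bit 0 (fill) set, 21824 has neither bit 0 nor
# bit 3 set, and 22280 has bit 3 (cloud) set.
qa_pixel = np.array([1, 21824, 22280], dtype="uint16")
bad = flag_any_bit(qa_pixel, [0, 3])
print(bad)
```

Printing `bin(22280)` is a quick way to check by hand which bits are set in a given value.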

In [None]:
# Mask out 'fill' (bit 0, no measurement) and 'cloud' (bit 3) pixels
bit_positions = [0, 3]
ds_tmp = create_mask_from_bits(dataset_in, bit_positions)

# # For Sentinel 2 see the "help_cloudcover" notebook
# invalid_values = [0,1,9,10]  
# ds_tmp = create_mask_from_values(ds_tmp, invalid_values)

In [None]:
# ds_tmp.to_netcdf('ds_tmp.nc')

In [None]:
# set all flagged pixels (mask == 1) to NaN
ds_na = ds_tmp.where(ds_tmp['mask'] != 1, other=np.nan)

# # to drop all empty timesteps and the QA_PIXEL/SCL layer
# # (`ds_na_reduced` stands for `ds_na` without the quality layer, e.g. ds_na.drop_vars('QA_PIXEL')):
# ds_clean = ds_na_reduced.dropna(dim='time', how='all')

# to keep the QA_PIXEL/SCL layer, you need to select one band that is NOT the quality layer
_band_check = 'blue'
ds_clean = ds_na.dropna(dim='time', how='all', subset=[_band_check])

# there are some issues with the CRS. This one-liner makes sure the CRS is working as intended.
ds_clean = fix_crs(ds_clean)
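
What "cleaning empty timesteps" means can be illustrated with plain NumPy. This is a minimal sketch of the idea behind `dropna(dim='time', how='all', subset=[...])`, using a tiny made-up `(time, y, x)` array in place of the real band. It is not a replacement for the xarray call above.

```python
import numpy as np

# Hypothetical (time, y, x) stack of one band; the second scene is fully masked
cube = np.array([
    [[0.10, 0.12], [np.nan, 0.11]],
    [[np.nan, np.nan], [np.nan, np.nan]],
    [[0.09, np.nan], [0.08, 0.10]],
])

# A timestep is "empty" when every pixel of the checked band is NaN
is_empty = np.isnan(cube).all(axis=(1, 2))

# Keep only the timesteps that contain at least one valid pixel
cube_clean = cube[~is_empty]
print(is_empty, cube_clean.shape)
```

Scenes that are only partly masked, like the first and third one here, are kept.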

In [None]:
# ds_clean.to_netcdf('ds_na_test.nc')

In [None]:
# plot first 5 scenes:
ds_clean.blue[0:5,:,:].plot(col='time', vmin=0, cmap='nipy_spectral')

### Optional: add normalised difference index

In [None]:
ds_clean

In [None]:
# OPTIONAL CELL TO CALCULATE NDIs
# You can already calculate normalised difference indexes here to be saved with the measurements.
# To do this, use the relevant line(s) below and/or add your own.

ds_clean['ndvi'] = (ds_clean.nir - ds_clean.red) / (ds_clean.nir + ds_clean.red)
ds_clean['ndwi'] = (ds_clean.green - ds_clean.nir) / (ds_clean.green + ds_clean.nir)

# Further indices follow the same pattern, for example NDBI:
# ds_clean['ndbi'] = (ds_clean.swir_1 - ds_clean.nir) / (ds_clean.swir_1 + ds_clean.nir)

# As always, fix the CRS:
ds_clean = fix_crs(ds_clean)
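
All normalised difference indices share the form `(a - b) / (a + b)`. The sketch below wraps that pattern in a small NumPy helper that returns NaN where the denominator is zero, instead of relying on the default division behaviour. The name `normalised_difference` and the sample reflectances are made up for this example.

```python
import numpy as np


def normalised_difference(a, b):
    """Compute (a - b) / (a + b); NaN where the denominator is zero."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    denom = a + b
    out = np.full(denom.shape, np.nan)
    # divide only where the denominator is non-zero, keep NaN elsewhere
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


# Hypothetical reflectances for three pixels
nir = np.array([0.40, 0.30, 0.0])
red = np.array([0.10, 0.30, 0.0])
ndvi = normalised_difference(nir, red)
print(ndvi)
```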

### Take a quick look at the summary of the data

In [None]:
ds_clean.ndvi[0:5,:,:].plot(col='time', vmin=0, cmap='nipy_spectral')

In [None]:
# plot first 5 scenes:
ds_clean.blue[0:5,:,:].plot(col='time', vmin=0, cmap='nipy_spectral')

In [None]:
ds_clean.ndwi[0,:,:].plot()

In [None]:
ds_clean.to_netcdf('after_ndvi.nc', engine="netcdf4")

## Add land use statistics

In [None]:
# Here, we manually change the variables `product` and `measurements` to specify what we want to load from arealstatistik.
# We leave longitude, latitude, resolution, output_crs exactly as they were for Landsat. 
# This ensures that the data from arealstatistik will match the spatial coordinates of Landsat perfectly.

# Specify the arealstatistik product (a plain string; it is wrapped in a list in the search below)
product = 'arealstatistik'

# Here, the measurements are not individual colour bands, 
# but instead are the different surveys with the desired number of classes.
# By default we are loading the surveys for the most recent time period: 2013-2018.
# To see all the available surveys, refer to the arealstatistik PDF document and explore_datacube.ipynb.
measurements = ['AS18_4', 'AS18_17', 'AS18_27', 'AS18_72']

In [None]:
query = catalog.search(
    collections=[product],
    limit=100,
    bbox=(longitude[0], latitude[0],
          longitude[1], latitude[1])
)
items = list(query.items())

# load identified items
ds_astat = stac_load(
    items,
    lon=longitude,
    lat=latitude,
    bands=measurements,
    crs=output_crs,
    resolution=resolution[1],
    chunks=chunks,
)

# Squeeze to remove the defunct time dimension [otherwise we retain a default timestamp of 1970-01-01, which is not helpful].
ds_astat = ds_astat.squeeze()

# and as usual apply the CRS fix:
ds_astat = fix_crs(ds_astat)

### Take a quick look at the summary of these data

In [None]:
ds_astat

In [None]:
# ds_astat.to_netcdf('ds_astat.nc')

## Saving the data

In [None]:
## First, figure out if we need to combine Landsat data with arealstatistik.

if (ds_clean is not None) and (ds_astat is not None):
    # In this case, you have loaded both Landsat and arealstatistik.
    # So, let's combine them into a single Dataset, allowing them to be saved together.
    ds_save = xr.merge([ds_clean, ds_astat])
elif (ds_clean is not None):
    # We are saving only the Landsat dataset
    ds_save = ds_clean
elif (ds_astat is not None):
    # We are saving only the arealstatistik dataset
    ds_save = ds_astat
else:
    raise ValueError('Neither ds_clean nor ds_astat is defined. Run at least one loading section first, or ask a teacher for help.')

# you guessed correctly, we make sure to apply the fix (there is not always an issue, but this fix is very fast)
ds_save = fix_crs(ds_save)
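
Merging only gives a tidy result when both datasets sit on the same grid. By default `xr.merge` joins coordinates with an outer join, so mismatched `x`/`y` values are combined and the gaps are filled with NaN rather than raising an error. A quick check before merging can be sketched as follows. The helper `same_grid` and the sample coordinates are hypothetical.

```python
import numpy as np


def same_grid(x_a, y_a, x_b, y_b, tol=1e-6):
    """True when both grids have identical x and y coordinate vectors."""
    x_a, y_a, x_b, y_b = (
        np.asarray(v, dtype="float64") for v in (x_a, y_a, x_b, y_b)
    )
    if x_a.shape != x_b.shape or y_a.shape != y_b.shape:
        return False
    # absolute tolerance only: a relative one would be too loose for
    # large projected coordinates
    return bool(
        np.allclose(x_a, x_b, rtol=0.0, atol=tol)
        and np.allclose(y_a, y_b, rtol=0.0, atol=tol)
    )


# Made-up pixel-centre coordinates on a 30 m grid
x = [2570015.0, 2570045.0, 2570075.0]
y = [1181985.0, 1181955.0]
x_shifted = [value + 15.0 for value in x]

print(same_grid(x, y, x, y))
print(same_grid(x, y, x_shifted, y))
```

With the datasets from this notebook, the call would look like `same_grid(ds_clean.x, ds_clean.y, ds_astat.x, ds_astat.y)`.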

### This is what will be saved...

In [None]:
ds_save

### Save the file.

In [None]:
# Save the file. Change the output filename to something useful!
output_filename = 'mydata.nc'
ds_save.to_netcdf(output_filename, engine="netcdf4")
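
One way to get a "useful" filename is to build it from the configuration values. This is only a suggestion: the helper `build_output_filename` and its naming scheme are made up for this example. The product name is passed explicitly because the variable `product` is overwritten in the land use section.

```python
def build_output_filename(product, time, suffix="clean"):
    """Compose a descriptive NetCDF filename from config values."""
    start, end = time
    parts = [product, start.replace("-", ""), end.replace("-", ""), suffix]
    return "_".join(parts) + ".nc"


output_filename = build_output_filename(
    "landsat_ot_c2_l2", ("2016-04-01", "2016-07-01")
)
print(output_filename)
```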
