# 3.1 Data Processing
In this exercise we will build a complete EO workflow on a cloud platform, from data access to obtaining the result.
In this example we will analyse snow cover in the Alps. 
**MORE DETAILS HERE**: This exercise should mostly be repetition; the goal is that everybody arrives at the result without having to write much code themselves. The transfer to a new application will then be done in the sharing exercise.

We are going to follow these steps in our analysis:
- Load relevant data sources
- Specify the spatial and temporal extents and the features we are interested in
- Process the satellite data to retrieve snow cover information
- Aggregate information in data cubes
- Track the resources we use for our computation
- Visualize and analyse the results


## Login

In [1]:
# platform libraries
import openeo
from sentinelhub import (SHConfig, SentinelHubRequest, DataCollection, MimeType, CRS, BBox, bbox_to_dimensions, geometry)

# utility libraries
from datetime import date
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import folium

import getpass



In [8]:
config = SHConfig()
config.sh_client_id = getpass.getpass('Client ID: ')
config.sh_client_secret = getpass.getpass('Client Secret: ')

In [9]:
#conn = openeo.connect('https://jjdxlu8vwl.execute-api.eu-central-1.amazonaws.com/production')
conn = openeo.connect('https://openeo.dataspace.copernicus.eu/openeo/1.2')

In [10]:
conn.authenticate_basic(username=config.sh_client_id, password=config.sh_client_secret)

OpenEoApiError: [403] CredentialsInvalid: Credentials are not correct. (ref: r-f08287794e44450ba1b74ba7a4c09999)

In [5]:
# See these notebooks for more examples:
# https://github.com/openEOPlatform/sample-notebooks/blob/main/openEO%20Platfrom%20-%20Basics.ipynb
# https://github.com/Open-EO/openeo-community-examples/tree/main/python

## Select a region of interest
- Select a fixed region for all students -> easier evaluation, easier catchment analysis, easier validation
- Everybody chooses a region within a predefined area -> more fun, reusable for the next exercise (ideally you can see which regions have already been computed via a STAC catalog with all entries of the course participants). Size limit: X x X pixels

--> We will start with a fixed region and recalculate the result in the sharing lesson.

Load the catchment area.
**Possible Question: What is the city at the outlet of the catchment? a) Meran, b) Innsbruck, c) Grenoble**

In [6]:
catchment_outline = gpd.read_file('data/catchment_outline.geojson')

In [7]:
# Centre the map on the centroid of the catchment outline
centroid = catchment_outline.centroid.iloc[0]
m = folium.Map(location=[centroid.y, centroid.x])
folium.GeoJson(data=catchment_outline.to_json(), name='catchment').add_to(m)
m




## Inspect Metadata
We need to set the following configurations to define the content of the data cube we want to access:
- dataset name
- band names
- time range
- the area of interest specified via bounding box coordinates
- spatial resolution

To select the correct dataset we can first list all the available datasets.

In [8]:
print(conn.list_collection_ids())

['SENTINEL2_L2A_MOSAIC_120', 'COPERNICUS_30', 'MAPZEN_DEM', 'SENTINEL1_GRD', 'CDS_2M_TEMP_2020', 'ALOS_PALSAR2_RICE_PADDY_FIELD_MAP', 'ALOS_PALSAR2_AGRICULTURE', 'ALOS_PALSAR2_L2_1_3M', 'ALOS_PALSAR2_L2_1_10M', 'CAMS_GLC', 'CNR_CHL', 'CNES_LAND_COVER_MAP', 'SENTINEL_5P_CO_T3D_AVERAGE', 'CORINE_LAND_COVER', 'CORINE_LAND_COVER_ACCOUNTING_LAYERS', 'E12C_MOTORWAY', 'E12D_PRIMARY', 'ESA_WORLDCOVER_10M_2020_V1', 'GHS_BUILT_S2', 'GLOBAL_LAND_COVER', 'GLOBAL_SURFACE_WATER', 'NASA_HARMONIZED_LANDSAT_SENTINEL', 'ICEYE_GRD_E11', 'ICEYE_GRD_E11A', 'ICEYE_GRD_E13B', 'ICEYE_GRD_E3', 'JAXA_WQ_CHLA_ANOMALY', 'JAXA_WQ_CHLA_AVERAGE', 'JAXA_WQ_TSM_ANOMALY', 'JAXA_WQ_TSM_AVERAGE', 'LANDSAT1-5_MSS_L1', 'LANDSAT4-5_TM_L1', 'LANDSAT4-5_TM_L2', 'LANDSAT7_ETM_L1', 'LANDSAT7_ETM_L2', 'LANDSAT8-9_L1', 'LANDSAT8-9_L2', 'MODIS', 'LTK_NATIONAL_HIGH_RESOLUTION_LAYER', 'POPULATION_DENSITY', 'SENTINEL_5P_CH4_T7D_AVERAGE', 'SENTINEL_5P_NO2_T14D_AVERAGE', 'SEA_ICE_INDEX', 'SEASONAL_TRAJECTORIES', 'SENTINEL1_CARD4L', 'SE

We want to use the Sentinel-2 L2A product. Its name is `'SENTINEL2_L2A_SENTINELHUB'`.

We get the metadata for this collection as follows.

In [9]:
conn.describe_collection("SENTINEL2_L2A_SENTINELHUB")

As a time range we will focus on the snow melting season 2018, in particular from February to June 2018.

**How many images are available in the time range?**

**How many pixels are in the data cube?** (`time * x * y * bands`)


## Define a workflow
We will now define our workflow and chain together all the processes we need to analyse the snow cover in the catchment.

### Define the data cube
We define the spatial extent, the temporal extent and the bands of our data cube.

In [10]:
bbox = catchment_outline.bounds.iloc[0]
bbox

minx    11.020833
miny    46.653599
maxx    11.366667
maxy    46.954167
Name: 0, dtype: float64

In [11]:
collection      = 'SENTINEL2_L2A_SENTINELHUB'
spatial_extent  = {'west': bbox['minx'], 'east': bbox['maxx'], 'south': bbox['miny'], 'north': bbox['maxy'], 'crs': 4326}
temporal_extent = ["2018-02-01", "2018-06-30"]
bands           = ['B03', 'B11', 'CLM']  # green, SWIR and cloud mask
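
To get a feeling for the size of the data cube defined by these extents, we can make a rough estimate locally. This is only a sketch: the 20 m pixel size, the spherical-Earth conversion from degrees to metres and the number of time steps (`n_timesteps`) are assumptions for illustration, not values reported by the backend.

```python
import math

# Bounding box of the catchment in degrees (EPSG:4326), taken from the bounds printed above
west, south, east, north = 11.020833, 46.653599, 11.366667, 46.954167

# Assumptions for illustration only
resolution_m = 20.0   # assumed pixel size in metres
n_bands = 3           # B03, B11, CLM
n_timesteps = 30      # hypothetical; check the collection metadata for the real count

# Approximate metres per degree on a sphere with a mean radius of 6371 km
m_per_deg_lat = math.pi * 6_371_000 / 180
mid_lat = math.radians((south + north) / 2)
m_per_deg_lon = m_per_deg_lat * math.cos(mid_lat)

width_m = (east - west) * m_per_deg_lon
height_m = (north - south) * m_per_deg_lat

# Number of pixels along x and y, and total number of values in the cube
n_x = math.ceil(width_m / resolution_m)
n_y = math.ceil(height_m / resolution_m)
n_pixels = n_timesteps * n_x * n_y * n_bands

print(f"extent: {width_m / 1000:.1f} km x {height_m / 1000:.1f} km")
print(f"grid:   {n_x} x {n_y} pixels per band and time step")
print(f"cube:   {n_pixels:,} values in total")
```

Replacing `n_timesteps` with the actual number of acquisitions in the time range gives the estimate for the real cube.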

### Load the data cube
We have defined the extents we are interested in. Now we use these definitions to load the data cube.

In [12]:
s2 = conn.load_collection(collection,
                          spatial_extent=spatial_extent, # bounding box of the catchment
                          bands=bands,
                          temporal_extent=temporal_extent)

In [13]:
s2

### NDSI - Normalized Difference Snow Index
The Normalized Difference Snow Index (NDSI) is computed as:

$$ NDSI = \frac {GREEN - SWIR} {GREEN +SWIR} $$

We have created a Sentinel-2 data cube with bands B03 (green), B11 (SWIR) and the cloud mask (CLM). We will use the green and SWIR bands to calculate the NDSI. This process reduces the band dimension of the data cube to generate new information, the NDSI.

In [16]:
# Select the green and SWIR bands; band() picks a single band from the band dimension
green = s2.band("B03")
swir = s2.band("B11")

# Band math on the two bands gives the NDSI for every pixel and time step
ndsi = (green - swir) / (green + swir)
ndsi
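
To see how the index behaves, here is a small local sketch of the NDSI formula for single pixels. The reflectance values are hypothetical and chosen only for illustration; they are not taken from the data cube.

```python
def ndsi(green: float, swir: float) -> float:
    """Normalized Difference Snow Index for a single pixel."""
    return (green - swir) / (green + swir)

# Hypothetical (green, SWIR) reflectance pairs, for illustration only
pixels = {
    "snow": (0.80, 0.10),        # bright in green, dark in SWIR
    "bare rock": (0.20, 0.30),
    "vegetation": (0.08, 0.20),
}

for name, (green, swir) in pixels.items():
    print(f"{name:<10} NDSI = {ndsi(green, swir):+.2f}")
```

Snow is bright in the green band and dark in the SWIR band, so it ends up with a high positive NDSI, while the other surfaces in this example end up below zero.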

### Cloud masking
We use the "CLM" band to create a cloud mask and then apply it to the NDSI.

In [17]:
# Select the cloud mask band in the same way as the green and SWIR bands
cloud_band = s2.band("CLM")

# True where a pixel is flagged as cloud (CLM == 1)
cloud_mask = cloud_band == 1
# Set cloudy pixels in the NDSI cube to no data
ndsi_cloudfree = ndsi.mask(cloud_mask)
ndsi_cloudfree
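
What masking does to the values can be sketched locally with plain Python lists. This is not the backend implementation; `apply_mask` is a hypothetical helper, the NDSI values are made up, and `None` stands in for the no-data value.

```python
from typing import Optional

def apply_mask(values: list[float], mask: list[int]) -> list[Optional[float]]:
    """Replace values with None (no data) wherever the mask is 1 (cloud)."""
    return [None if m == 1 else v for v, m in zip(values, mask)]

ndsi_row = [0.78, 0.55, -0.20, 0.61]  # hypothetical NDSI values
clm_row = [0, 1, 0, 1]                # 1 = cloud, 0 = clear

masked_row = apply_mask(ndsi_row, clm_row)
print(masked_row)
```

Cloudy pixels are not removed from the cube; they keep their position and only lose their value.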

### Creating the Snow Map
So far we have a cloud-free time series of NDSI values. However, we are interested in the presence of snow, ideally as a binary classification: snow and no snow.
To achieve this we are setting a threshold of 0.4 on the NDSI. This gives us a binary snow map.

In [18]:
# Pixels with an NDSI above 0.4 are classified as snow
snowmap = ndsi_cloudfree > 0.4
snowmap
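
The thresholding step can be sketched locally as follows. `classify_snow` is a hypothetical helper for illustration; it assumes that no-data pixels (here `None`) stay no data, so that clouds are not counted as "no snow".

```python
from typing import Optional

def classify_snow(ndsi: Optional[float], threshold: float = 0.4) -> Optional[int]:
    """Return 1 for snow, 0 for no snow and None where the input is no data."""
    if ndsi is None:
        return None
    return 1 if ndsi > threshold else 0

ndsi_cloudfree_row = [0.78, None, -0.20, 0.40]  # hypothetical values, None = cloud
snow_row = [classify_snow(v) for v in ndsi_cloudfree_row]
print(snow_row)
```

Note that a value of exactly 0.4 is classified as no snow, because the comparison is strictly greater than the threshold.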

### Cloud Percentage
We are looking at a region over time. We need to make sure that the information content meets our expected quality. Therefore, we calculate the cloud percentage for the catchment for each timestep. We use this information to filter the timeseries. All timesteps that have a cloud coverage of over 20% will be discarded.

In [19]:
# reduce_spatial, aggregate_spatial, 
n_cloud = cube_s2snowmap_masked.CLM.sum(dim=['lat', 'lon'])
n_cloud_valid = cube_s2snowmap_masked.CLM.count(dim=['lat', 'lon'])

cube_s2snowmap_masked['cloud_percent'] = n_cloud / n_cloud_valid * 100
cube_s2snowmap_masked

NameError: name 'cube_s2snowmap_masked' is not defined
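
The calculation described in this section can be sketched locally, independent of the backend. The cloud masks and dates below are hypothetical, and each mask is assumed to contain only the pixels inside the catchment.

```python
# Hypothetical cloud masks for three time steps (1 = cloud, 0 = clear),
# already restricted to the pixels inside the catchment
cloud_masks = {
    "2018-02-03": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "2018-02-08": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "2018-02-13": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
}
max_cloud_percent = 20.0

# Cloud percentage = cloudy pixels / valid pixels * 100, per time step
cloud_percent = {
    day: 100.0 * sum(mask) / len(mask) for day, mask in cloud_masks.items()
}

# Discard every time step with more than 20 % cloud cover
kept_days = [day for day, pct in cloud_percent.items() if pct <= max_cloud_percent]

print(cloud_percent)
print(kept_days)
```

A time step with exactly 20 % cloud cover is kept here, since only time steps over the threshold are discarded.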

### Snow Covered Area in the Catchment
We are interested in the snow covered area (SCA) within the catchment. We count all snow covered pixels within the catchment for each time step. After our snow classification our data cube has the values: 0 = no snow, 1 = snow, NA = cloud. This means we can sum up all pixels within the catchment and the sum will give us the count of the snow covered pixels. Later we can use this number to translate pixel count into area.

In [None]:
# aggregate_spatial accepts a shapely geometry, so we pass the catchment polygon directly
catchment_geom = catchment_outline.geometry.iloc[0]

In [82]:
snowarea = snowmap.aggregate_spatial(geometries=catchment_geom, reducer="sum")
snowarea
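
Translating the pixel count into an area is a simple multiplication with the pixel area. The sketch below assumes 20 m pixels, which is an assumption for illustration and has to match the resolution of the actual result; `snow_covered_area_km2` is a hypothetical helper.

```python
from typing import Optional

PIXEL_SIZE_M = 20.0  # assumption: 20 m x 20 m pixels

def snow_covered_area_km2(snow_pixels: list[Optional[int]],
                          pixel_size_m: float = PIXEL_SIZE_M) -> float:
    """Convert snow pixels (1 = snow, 0 = no snow, None = cloud) into km²."""
    # Summing the 0/1 values counts the snow pixels; clouds (None) are skipped
    n_snow = sum(v for v in snow_pixels if v is not None)
    return n_snow * pixel_size_m ** 2 / 1e6

snowmap_values = [1, 1, 0, None, 1, 0, None, 1]  # hypothetical values
sca_km2 = snow_covered_area_km2(snowmap_values)
print(f"SCA = {sca_km2:.4f} km²")
```

Cloudy pixels contribute nothing to the sum, so the SCA of a partly cloudy time step is a lower bound.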

## Download the results
So far no processing has happened! We have only created the workflow instructions. Now we are moving to the step where the actual processing will take place.
Before downloading, please run the connection step again to make sure your connection is still active. Then we check which file formats the cloud backend supports. Finally, we will download the results in two different ways.

### Reconnect before downloading
Run the connection cell at the beginning of the notebook again to make sure your connection is still valid.

### Available File Formats
We check the available file formats that the cloud backend supports. This is very important to know, since not all file formats are suitable for all types of information.

In [28]:
conn.list_file_formats()

### Synchronous Download
One way of receiving the data from the cloud platform is via direct download. We tell the platform to execute our workflow, wait until it is done, and the result is then downloaded directly. This blocks our development environment while the request is running. It is suitable for quickly receiving small amounts of data.

In [37]:
snowarea.download("snowarea.tiff")

OpenEoApiError: [403] TokenInvalid: Authorization token has expired or is invalid. Please authenticate again.

### Batch Job
A second way to receive the results from the platform is to use a batch job, also called asynchronous processing. In this case a job is first registered on the backend. The job is persistently available via its ID for a given amount of time and can be started whenever wanted. It is then executed in the background, and its status can be checked. When it is done, the results can be downloaded. The development environment is not blocked, so other things can be done in the meantime. This is suitable for larger analyses.

In [84]:
snowmap_fin = snowmap.save_result(format="GTiff")
#snowmap_fin.execute_batch()
snowmap_fin_job = snowmap_fin.create_job(title = "snowmap")
snowmap_fin_job.start_job()

OpenEoApiError: [400] 400: Unable to convert process graph to evalscript: list index out of range
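
The "check the status until the job is done" pattern described above can be sketched as a small polling loop. This is a sketch under assumptions: `wait_for_job` is a hypothetical helper that only expects an object with a `status()` method returning a status string, and `FakeJob` is a stand-in so that the example runs without a backend.

```python
import time

# States in which a job will not change any more
FINAL_STATES = {"finished", "error", "canceled"}

def wait_for_job(job, poll_interval: float = 10.0, max_polls: int = 360) -> str:
    """Poll job.status() until the job reaches a final state and return it."""
    for _ in range(max_polls):
        status = job.status()
        if status in FINAL_STATES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a final state within the polling budget")

class FakeJob:
    """Hypothetical stand-in for a batch job, for demonstration only."""

    def __init__(self, states):
        self._states = iter(states)

    def status(self) -> str:
        return next(self._states)

demo_job = FakeJob(["created", "queued", "running", "finished"])
final_status = wait_for_job(demo_job, poll_interval=0.0)
print(final_status)
```

Results should only be downloaded when the final status is `finished`; for `error`, the job logs are the place to look first.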

In [40]:
# Remove locally stored refresh tokens so that the next login starts from scratch
from openeo.rest.auth.config import RefreshTokenStore
RefreshTokenStore().remove()

## Analysis of the results
**we should move this complete step to validation (and rename it to data analysis and validation)**
In the next step we will analyse and validate the results. We are going to compare the SCA time series derived from satellite observations to runoff measurements at the outlet of the catchment and to data from snow measurement stations. For now we will have a look at the time series and at a map (one time step or the mean over the winter).