# Retrieve and Concatenate Copernicus Data

### 🎯 Request Data from Copernicus (GloFAS; ERA5) and Save Concatenated Data to Cloud Object Storage (COS)

We use the [Copernicus Climate Data Store](https://cds.climate.copernicus.eu/#!/home) to retrieve historical and current climate data. We collect the following variables from the ERA5-Land and GloFAS datasets:

- 🌍 [ERA5-Land hourly data from 1950 to present](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=form)
    - ```stl1``` (Soil Temperature Level 1)
    - ```swvl1``` (Volumetric Soil Water Layer 1)
    - ```total_precipitation``` (Total Precipitation)

- 🌊 [River discharge and related historical data from the Global Flood Awareness System (GloFAS)](https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=form)
    - ```dis24``` (averaged daily river discharge in m^3/s)

We want to merge the variables of the two datasets on their common spatio-temporal columns (```latitude```, ```longitude```, ```time```), which requires interpolating both onto the same grid first.

**This notebook assumes that no data has been persisted yet when it runs. It is therefore the *initial* notebook of the pipeline and should ideally run only once.** On later pipeline runs, historical data is already in place, so this notebook no longer needs to run.

#### Steps covered in this notebook:
1. Retrieve parameters & Set-up Cloud Object Storage connection
2. Set-up Copernicus credentials (w/ Configuration File)
3. **Retrieve ERA5 and GloFAS for given timeframe**
4. Handle both netcdf files (open, interpolate, reset_index, to_pandas)
5. **Concatenate both datasets on Latitude, Longitude, Time**
6. Serialize result and persist with Cloud Object Storage

In [None]:
# TODO: Create a software configuration in Watson Studio so that packages are not reinstalled on every run
!pip install cdsapi netCDF4 xarray ibm_watson_studio_pipelines scikit-learn==1.1 ibm-cos-sdk botocore

In [None]:
# data sources
import cdsapi

# data manipulation
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# serialization
import pickle
import json

# remotes
from botocore.client import Config
from ibm_watson_studio_pipelines import WSPipelines
import ibm_boto3

# misc
import logging
import os, types
import warnings
warnings.filterwarnings("ignore")




### 1. Retrieve parameters & Set-up Cloud Object Storage connection

**Note**: If you run this notebook outside of a Watson Studio Pipeline execution, make sure to set the environment variables that the Pipeline environment would otherwise pass to the notebook.
Refer to ```credentials.py```.
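
The notebook does not show ```credentials.py``` itself. The following is a minimal sketch of what ```set_env_variables_for_credentials``` might look like, assuming only the variable names and JSON keys that the cells in this notebook read. Every value is a placeholder, and the real file may be organized differently.

```python
import json
import os


def set_env_variables_for_credentials():
    """Populate the environment variables the pipeline would normally pass in.

    Hypothetical sketch: all values are placeholders to be replaced locally.
    """
    project_cos = {
        "AUTH_ENDPOINT": "<iam-auth-endpoint>",
        "ENDPOINT_URL": "<cos-endpoint-url>",
        "API_KEY": "<project-cos-api-key>",
        "BUCKET": "<project-bucket>",
    }
    mlops_cos = {
        "ENDPOINT_URL": "<cos-endpoint-url>",
        "API_KEY": "<mlops-cos-api-key>",
        "CRN": "<mlops-cos-crn>",
        "BUCKET": "<mlops-bucket>",
    }
    # The notebook parses these two variables with json.loads, so store JSON strings.
    os.environ["PROJECT_COS_CREDENTIALS"] = json.dumps(project_cos)
    os.environ["MLOPS_COS_CREDENTIALS"] = json.dumps(mlops_cos)
    os.environ["CLOUD_API_KEY"] = "<ibm-cloud-api-key>"
    os.environ["CDS_USER_ID"] = "<copernicus-user-id>"
    os.environ["CDS_API_KEY"] = "<copernicus-api-key>"
```

A file like this holds secrets, so it should stay out of version control.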

In [None]:
# For local runs only: put your credentials in credentials.py and run this cell.
# Keep it commented out when the notebook runs inside a Watson Studio Pipeline.
from credentials import set_env_variables_for_credentials
set_env_variables_for_credentials()

In [None]:
## Retrieve COS credentials from global pipeline parameters

# Read the JSON strings from the environment and parse them into dictionaries
project_cos_credentials = json.loads(os.getenv('PROJECT_COS_CREDENTIALS'))
mlops_cos_credentials = json.loads(os.getenv('MLOPS_COS_CREDENTIALS'))

## PROJECT COS 
AUTH_ENDPOINT = project_cos_credentials['AUTH_ENDPOINT']
ENDPOINT_URL = project_cos_credentials['ENDPOINT_URL']
API_KEY_COS = project_cos_credentials['API_KEY']
BUCKET_PROJECT_COS = project_cos_credentials['BUCKET']

## MLOPS COS
ENDPOINT_URL_MLOPS = mlops_cos_credentials['ENDPOINT_URL']
API_KEY_MLOPS = mlops_cos_credentials['API_KEY']
CRN_MLOPS = mlops_cos_credentials['CRN']
BUCKET_MLOPS  = mlops_cos_credentials['BUCKET']
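
If one of the variables is unset, ```json.loads(None)``` fails with a ```TypeError``` that does not name the missing variable. A small helper can make this failure easier to diagnose. The sketch below is one possible approach; ```load_json_env``` is a hypothetical name and not part of the notebook.

```python
import json
import os


def load_json_env(name, required_keys):
    """Parse a JSON environment variable and check that the expected keys exist."""
    raw = os.getenv(name)
    if raw is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    try:
        values = json.loads(raw)
    except json.JSONDecodeError as err:
        raise RuntimeError(f"Environment variable {name} does not contain valid JSON") from err
    # Report every missing key at once instead of failing on the first lookup.
    missing = [key for key in required_keys if key not in values]
    if missing:
        raise RuntimeError(f"{name} is missing keys: {', '.join(missing)}")
    return values
```

It could be used as ```load_json_env('MLOPS_COS_CREDENTIALS', ['ENDPOINT_URL', 'API_KEY', 'CRN', 'BUCKET'])``` in place of the bare ```json.loads``` call.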

In [None]:
CLOUD_API_KEY = os.getenv('CLOUD_API_KEY')

In [None]:
def save_df_to_cos(df,filename,key):
    """
    Pickle `df` to the local file `filename` and upload that file
    to the MLOps Cloud Object Storage bucket under the object name `key`.
    """

    try:
        #df.to_csv(filename,index=False)
        with open(filename, 'wb') as file:
            pickle.dump(df, file)
        mlops_res = ibm_boto3.resource(
            service_name='s3',
            ibm_api_key_id=API_KEY_MLOPS,
            ibm_service_instance_id=CRN_MLOPS,
            ibm_auth_endpoint=AUTH_ENDPOINT,
            config=Config(signature_version='oauth'),
            endpoint_url=ENDPOINT_URL_MLOPS)

        mlops_res.Bucket(BUCKET_MLOPS).upload_file(filename,key)
        print(f"Dataframe {filename} uploaded successfully")
    except Exception as e:
        print(e)
        print(f"Dataframe upload for {filename} failed")

### Set-up Copernicus credentials (w/ Configuration File)

The Python library for the Copernicus API (```cdsapi```) handles service authentication via a configuration file in the user's home directory. <br>Set the ```CDS_USER_ID``` and ```CDS_API_KEY``` environment variables in your ```credentials.py```, or preferably pass them as Pipeline Parameters within Watson Studio.

The code below takes these environment variables and writes the configuration file to your home directory.

In [None]:
# Use your Copernicus API_KEY
# @hidden_cell
import os
CDS_USER_ID = os.getenv("CDS_USER_ID")
CDS_API_KEY = os.getenv("CDS_API_KEY")

In [None]:
# Setup copernicus credentials file for cdsapi
import os
with open(os.path.join(os.path.expanduser('~'), '.cdsapirc'), 'w') as f:
    f.write('url: https://cds.climate.copernicus.eu/api/v2\n')
    f.write(f'key: {CDS_USER_ID}:{CDS_API_KEY}')

In [None]:
# Ensure COPERNICUS config is setup at the right place
!cat ~/.cdsapirc
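
Printing the file with ```cat``` also writes the API key into the notebook output. As an alternative, the sketch below reads the file and masks the secret part of the ```key``` line. ```describe_cdsapirc``` is a hypothetical helper, and it assumes the two-line ```url``` / ```key``` layout written above.

```python
from pathlib import Path


def describe_cdsapirc(path):
    """Return the config file's lines with the secret part of the key masked."""
    masked = []
    for line in Path(path).read_text().splitlines():
        name, sep, value = line.partition(":")
        if name.strip() == "key" and sep:
            # The value has the form '<user id>:<api key>'; hide only the api key.
            user_id, _, secret = value.strip().partition(":")
            value = f" {user_id}:{'*' * len(secret)}"
        masked.append(f"{name}{sep}{value}")
    return masked
```

Calling ```describe_cdsapirc(os.path.expanduser('~/.cdsapirc'))``` would confirm that the file is in place without echoing the full key.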

The ```cdsapi.Client``` initialized below checks that the configuration file created above exists and that the credentials it contains are correct. If either check fails, the cell below raises an exception.

In [None]:
copernicus = cdsapi.Client()

### Retrieve ERA5 and GloFAS for given timeframe

The amount of data requested from either dataset can be limited by selecting specific data variables and by setting spatial and/or temporal bounds for the download request.

In [None]:
europe = [72,25,34,40] # [North, West, South, East] bounds for Europe
# Zero-padded days of the month, '01' to '31'
# (range(31) would yield '0' to '30': an invalid day 0 and no 31st)
days = [f'{i:02d}' for i in range(1, 32)]

# Extend the list with the commented-out months to request a full year
months = ['january', 'february', 'march', 'april']#, 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']
years = ['2023']

hours = [
            '00:00', '01:00', '02:00',
            '03:00', '04:00', '05:00',
            '06:00', '07:00', '08:00',
            '09:00', '10:00', '11:00',
            '12:00', '13:00', '14:00',
            '15:00', '16:00', '17:00',
            '18:00', '19:00', '20:00',
            '21:00', '22:00', '23:00',
]

# Override the full list above: request only the 00:00 time step of each day
hours = ['00:00']
hours

In [None]:
def download_glofas_historic(client, bounds, years, months, days, download_path):
    glofas_format = ".netcdf4.zip"
    target = f'{download_path}{glofas_format}'
    if os.path.exists(target):
        # Reason to cancel download process if file exists is elaborated where method is invoked.
        print(f"Target filename already exists in target path ({target})... cancelling download")
        return
    client.retrieve(
        'cems-glofas-historical',
        {
            'system_version': 'version_3_1',
            'variable': 'river_discharge_in_the_last_24_hours',
            'format': 'netcdf4.zip',
            'hyear': years,
            'hmonth': months,
            'hday': days,
            'hydrological_model': 'lisflood',
            'product_type': 'intermediate',
            'area': bounds,
        },
        target)

In [None]:
# Download hourly ERA5-Land data: soil temperature level 1, volumetric soil water layer 1, total precipitation
def download_era5_historic(client, bounds, years, months, days, hours, download_path):
    era5_format = ".netcdf.zip"
    target = f'{download_path}{era5_format}'
    if os.path.exists(target):
        # Reason to cancel download process if file exists is elaborated where method is invoked.
        print(f"Target filename already exists in target path ({target})... cancelling download")
        return
    client.retrieve(
        'reanalysis-era5-land',
        {
            'variable': [
                'soil_temperature_level_1', 'total_precipitation', 'volumetric_soil_water_layer_1',
            ],
            'year': years,
            # CDS datasets do not share a uniform request format. Here months are expected as e.g. '01' instead of 'january'.
            # Work-around: derive zero-padded numbers from the list lengths. This is only correct
            # if `months` is a consecutive run starting at 'january' (and `days` one starting at the 1st).
            'month': [f'{i+1:02d}' for i in range(len(months))],
            'day': [f'{i+1:02d}' for i in range(len(days))],
            'time': hours,
            'format': 'netcdf.zip',
            'area': bounds,
        },
        target)
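
The month work-around in ```download_era5_historic``` derives the month numbers from the length of the list, so it would silently request the wrong months for a selection such as ```['march', 'november']```. A lookup by name avoids that. The sketch below is one way to do it; ```month_names_to_numbers``` is a hypothetical helper and not part of the notebook.

```python
# English month names as used in the GloFAS request ('hmonth').
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Map each name to its zero-padded number, e.g. 'april' -> '04'.
MONTH_NUMBERS = {name: f"{i:02d}" for i, name in enumerate(MONTH_NAMES, start=1)}


def month_names_to_numbers(months):
    """Translate month names into the two-digit strings the ERA5 request expects."""
    return [MONTH_NUMBERS[name.lower()] for name in months]
```

With this helper, the ```'month'``` entry of the ERA5 request could be written as ```month_names_to_numbers(months)```; an unknown name raises a ```KeyError``` instead of being mapped to the wrong month.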

In [None]:
# NOTE: cdsapi has no notion of the files in the current working directory. 
# Passing a download path and filename where a file already sits causes a seemingly infinite loop in the download process.
# Your cell will never finish running and resources will be wasted.
# No problem for CPDaaS since working directory is runtime bound (no persistent filesystem) and in production the file cannot already exist.
download_glofas_historic(
    copernicus,
    bounds=europe,
    years=years,
    months=months,
    days=days,
    download_path="glofas_2023"
)

In [None]:
# NOTE: cdsapi has no notion of the files in the current working directory. 
# Passing a download path and filename where a file already sits causes a seemingly infinite loop in the download process.
# Your cell will never finish running and resources will be wasted.
# No problem for CPDaaS since working directory is runtime bound (no persistent filesystem) and in production the file cannot already exist.
download_era5_historic(
    copernicus,
    bounds=europe,
    years=years,
    months=months,
    days=days,
    hours=hours,
    download_path="era5_2023"
)

In [None]:
!ls -lh

### Handle ERA5/GloFAS netcdf files (open, interpolate, reset_index, to_pandas)

In [None]:
!mkdir -p era5 glofas

In [None]:
!unzip era5_2023.netcdf.zip -d era5 && unzip glofas_2023.netcdf4.zip -d glofas
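
The two shell cells above can also be expressed in Python with the standard library, which avoids depending on ```unzip``` being installed in the runtime. This is a minimal sketch; ```extract_archive``` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a downloaded zip archive and return the names of the extracted files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

For example, ```extract_archive('era5_2023.netcdf.zip', 'era5')``` returns the list of file names in the archive, which also shows whether it really contains a ```data.nc``` as the next cell assumes.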

In [None]:
e5 = xr.open_dataset('era5/data.nc')
f = xr.open_dataset('glofas/data.nc')

#### Handle ERA5 Data

**Data**: Total Precipitation; Volumetric Soil Water Layer 1; Soil Temperature Level 1

**Mission**: We requested the variables listed above for roughly the same coordinates as the GloFAS data (they differ by about 0.05). Let's take a quick look at the dataset and prepare it for a training split, version control, and more.


In [None]:
e5

In [None]:
# Interpolate ERA5 onto the GloFAS coordinates so that both datasets share the same grid
e5_interp = e5.interp_like(f)

In [None]:
e5_interp

In [None]:
# Remove the supplementary 'expver' dimension, which appears when a request mixes ERA5 and ERA5T data (see https://confluence.ecmwf.int/display/CUSF/ERA5+CDS+requests+which+return+a+mixture+of+ERA5+and+ERA5T+data)
e5_combine = e5_interp.sel(expver=1).combine_first(e5_interp.sel(expver=5))
e5_combine.load()
e5_combine
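
To illustrate what the ```sel``` / ```combine_first``` step does, here is a toy sketch on a hand-made dataset. The values and the single ```stl1``` variable are invented for illustration; it assumes, as in the linked ECMWF page, that ```expver=1``` holds the final data and ```expver=5``` the preliminary data, with ```NaN``` wherever a version has no value.

```python
import numpy as np
import xarray as xr

time = np.array(["2023-04-01", "2023-04-02", "2023-04-03"], dtype="datetime64[ns]")

# Row 0 is expver=1 (covers the first two days), row 1 is expver=5 (covers the last day).
stl1 = np.array([
    [280.0, 281.0, np.nan],
    [np.nan, np.nan, 282.0],
])
ds = xr.Dataset(
    {"stl1": (("expver", "time"), stl1)},
    coords={"expver": [1, 5], "time": time},
)

# Take expver=1 where available and fill its gaps with expver=5.
combined = ds.sel(expver=1).combine_first(ds.sel(expver=5))
```

After this step ```combined``` has only the ```time``` dimension left, and each time step carries the value from whichever version provided it.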

### Concatenate both datasets on common columns: Latitude, Longitude, Time

In [None]:
## Join the predictand (y) onto the interpolated feature table (X)
# Set features to keep and choose target variable
X = e5_combine.to_dataframe()
y = f['dis24'].to_dataframe()

# Reset the index to include the coordinates as columns
X.reset_index(inplace=True)
y.reset_index(inplace=True)

In [None]:
# Merge features and predictand on their common coordinates (time, latitude, longitude)
data = pd.merge(X, y, on=['time', 'latitude', 'longitude'])
data

In [None]:
# Record the most recent day covered by the data (e.g. '2023-04-30') so that later runs can merge newer data more efficiently
most_recent_covered_day = str(data['time'].max()).split()[0] 
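
A later run could use this value to decide where a follow-up download should start. The sketch below shows one way to compute that date with the standard library; ```next_uncovered_day``` is a hypothetical helper, and how downstream notebooks actually use the value is not shown here.

```python
from datetime import date, timedelta


def next_uncovered_day(most_recent_covered_day):
    """Return the ISO date of the first day after the covered period."""
    last_day = date.fromisoformat(most_recent_covered_day)
    return (last_day + timedelta(days=1)).isoformat()
```

For the value mentioned in the comment above, ```next_uncovered_day('2023-04-30')``` returns ```'2023-05-01'```.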

### Serialize Concatenated Dataset

In [None]:
# Pickle and save data

FILENAME = "era5-glofas-merged.pkl"

save_df_to_cos(data, FILENAME, FILENAME)

### Persist on Cloud Object Storage

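The serialized dataset is stored in COS because the filesystem of CPDaaS runtimes is temporary and therefore unsuitable for keeping our data. The upload itself already happened in ```save_df_to_cos``` above; the cell below only verifies it.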

In [None]:
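# NOTE: check_if_file_exists is not defined in this notebook; it must be defined or imported before this cell runs.
files_copied_in_cos = check_if_file_exists(FILENAME)
files_copied_in_cos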

### Hand-off to Next Notebook

In [None]:
validation_params = {}
validation_params['most_recent_day_in_data'] = most_recent_covered_day
validation_params['serialized_data_filename'] = FILENAME
validation_params['files_copied_in_cos'] = files_copied_in_cos

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(validation_params)