# Initialize DVC and Start Tracking Merged Data

This notebook expects two things to exist before it runs: <br>**(1)** a private Git repository that will hold the DVC tracking metadata, and <br>**(2)** the historic ERA5/GloFAS data, serialized as a pickle file on Cloud Object Storage (COS) by the preceding notebook.

Steps covered in this notebook:
1. Retrieve parameters and set up the COS connection
2. Set up DVC
    - Clone empty *private* repository
    - ```dvc init```
    - Add COS instance as remote to DVC configuration file
3. Download dataset from COS
4. Track dataset (```git add```, ```dvc push```, ...)

In [None]:
# Install required packages.
# TODO: Create IBM Cloud Software Configuration for those
!pip install ibm-cos-sdk ibm_watson_studio_pipelines

In [None]:
from botocore.client import Config
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
import numpy as np
import pandas as pd

from ibm_watson_studio_pipelines import WSPipelines
import ibm_boto3

import logging
import os, types
import warnings

warnings.filterwarnings("ignore")

In [None]:
!pip install 'dvc[s3]' # dvc[all] also works, but the S3 extra suffices because COS is S3-compatible

### 1. Retrieve parameters and set up COS connection

**Note**: If you are running this notebook outside of a Watson Studio Pipeline execution, make sure to set the environment variables that the Pipeline environment would otherwise pass to the notebook.
Refer to ```credentials.py```.
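
For local runs, a helper along the following lines could populate the environment. This is a minimal sketch, not the actual contents of ```credentials.py```: the variable names are the ones this notebook reads later, while the function body and every placeholder value in angle brackets are assumptions to be replaced with your own settings.

```python
import json
import os


def set_env_variables_for_credentials():
    """Hypothetical local stand-in for the values a pipeline run would pass in."""
    # Placeholder values only; replace them with real credentials locally.
    project_cos = {
        "AUTH_ENDPOINT": "<auth-endpoint>",
        "ENDPOINT_URL": "<project-endpoint-url>",
        "API_KEY": "<project-api-key>",
        "BUCKET": "<project-bucket>",
    }
    mlops_cos = {
        "ENDPOINT_URL": "<mlops-endpoint-url>",
        "API_KEY": "<mlops-api-key>",
        "CRN": "<mlops-crn>",
        "BUCKET": "<mlops-bucket>",
    }
    # The notebook expects both credential sets as JSON strings.
    os.environ["PROJECT_COS_CREDENTIALS"] = json.dumps(project_cos)
    os.environ["MLOPS_COS_CREDENTIALS"] = json.dumps(mlops_cos)
    # Remaining variables referenced by later cells.
    os.environ["CLOUD_API_KEY"] = "<cloud-api-key>"
    os.environ["serialized_data_filename"] = "<data-filename>"
    os.environ["GIT_REPOSITORY"] = "<git-repository-url>"
    os.environ["HMAC_ADMIN_ACCESS_KEY"] = "<hmac-access-key>"
    os.environ["HMAC_ADMIN_SECRET_ACCESS_KEY"] = "<hmac-secret-access-key>"


set_env_variables_for_credentials()
```

Keep a file like this out of version control, since it holds secrets in plain text.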

In [None]:
# For local runs only: put your credentials in credentials.py, then run this cell.
from credentials2 import set_env_variables_for_credentials
set_env_variables_for_credentials()

In [None]:
## Retrieve COS credentials from global pipeline parameters
import json
# Read the JSON strings from the environment and parse them into dicts
project_cos_credentials = json.loads(os.getenv('PROJECT_COS_CREDENTIALS'))
mlops_cos_credentials = json.loads(os.getenv('MLOPS_COS_CREDENTIALS'))

## PROJECT COS 
AUTH_ENDPOINT = project_cos_credentials['AUTH_ENDPOINT']
ENDPOINT_URL = project_cos_credentials['ENDPOINT_URL']
API_KEY_COS = project_cos_credentials['API_KEY']
BUCKET_PROJECT_COS = project_cos_credentials['BUCKET']

## MLOPS COS
ENDPOINT_URL_MLOPS = mlops_cos_credentials['ENDPOINT_URL']
API_KEY_MLOPS = mlops_cos_credentials['API_KEY']
CRN_MLOPS = mlops_cos_credentials['CRN']
BUCKET_MLOPS  = mlops_cos_credentials['BUCKET']
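
If one of the environment variables is unset, ```json.loads(None)``` fails with a ```TypeError``` that does not name the missing variable. A more defensive variant could look like the sketch below. The helper name ```load_json_env``` and the ```DEMO_COS_CREDENTIALS``` variable are hypothetical and used for illustration only.

```python
import json
import os


def load_json_env(name, required_keys):
    """Parse a JSON environment variable and check that the expected keys exist."""
    raw = os.getenv(name)
    if raw is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    parsed = json.loads(raw)
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        raise KeyError(f"{name} is missing keys: {missing}")
    return parsed


# Demonstration with a made-up variable, so no real credentials are needed.
os.environ["DEMO_COS_CREDENTIALS"] = json.dumps(
    {"ENDPOINT_URL": "https://example.invalid", "BUCKET": "demo-bucket"}
)
demo_credentials = load_json_env("DEMO_COS_CREDENTIALS", ["ENDPOINT_URL", "BUCKET"])
```

In the notebook, the same helper could be called with ```PROJECT_COS_CREDENTIALS``` and ```MLOPS_COS_CREDENTIALS``` and the key lists used above.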

In [None]:
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
DATA_FILENAME = os.getenv("serialized_data_filename")

In [None]:
# # @hidden_cell
# CLOUD_API_KEY = ""
# DATA_FILENAME = ""

### 2. Set Up DVC

Clone a (preferably empty) git repository which will be used for data and model version tracking.<br> It will store ```.dvc``` files **and it will contain the remote locations as well as the corresponding access keys.**<br> Make sure to create a private repository or work with GitHub Enterprise.

The following cells expect the repository to be empty. They are meant to be non-overwriting, however, so re-running them against a repository where a step has already been completed should leave that step's result in place.
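
One way to make the "skip what is already done" behavior explicit is to check the repository state before each step. The sketch below assumes that a ```.git``` entry marks a cloned repository and a ```.dvc``` directory marks an initialized DVC project. The helper names are hypothetical and not part of this notebook.

```python
import subprocess
import tempfile
from pathlib import Path


def repo_is_cloned(repo_dir):
    """A Git working copy contains a .git entry at its root."""
    return (Path(repo_dir) / ".git").exists()


def dvc_is_initialized(repo_dir):
    """`dvc init` creates a .dvc directory at the repository root."""
    return (Path(repo_dir) / ".dvc").is_dir()


def ensure_repo(git_url, repo_dir):
    """Clone and initialize only when the corresponding step has not run yet."""
    if not repo_is_cloned(repo_dir):
        subprocess.run(["git", "clone", git_url, str(repo_dir)], check=True)
    if not dvc_is_initialized(repo_dir):
        subprocess.run(["dvc", "init"], cwd=repo_dir, check=True)


# Demonstrate the checks on a throwaway directory (no git or dvc call is made).
with tempfile.TemporaryDirectory() as tmp:
    fresh_state = (repo_is_cloned(tmp), dvc_is_initialized(tmp))
    (Path(tmp) / ".git").mkdir()
    (Path(tmp) / ".dvc").mkdir()
    prepared_state = (repo_is_cloned(tmp), dvc_is_initialized(tmp))
```

Calling ```ensure_repo``` on a directory that already passes both checks runs no commands at all.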

#### 2.1. Clone Empty Repository for Versioning w/ DVC

In [None]:
# NOTE: GIT_REPOSITORY is set as an environment variable in credentials.py
!git clone $GIT_REPOSITORY

In [None]:
!cd dvc-testing && dvc init

#### 2.2. Add IBM COS Instance to ```.dvc/config``` as Remote

To successfully complete this step, make sure that you create Cloud Object Storage "Credentials" for the COS Instance that you want to use.
<br>**Note:** Make sure to enable HMAC credentials when generating the "Credentials" in IBM Cloud.
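
The remote configuration can also be driven from Python, which avoids hard-coding the bucket and endpoint in shell commands. The sketch below only builds the ```dvc remote``` argument lists. The helper names, the example bucket, and the example endpoint are hypothetical, and whether the values should come from ```MLOPS_COS_CREDENTIALS``` depends on your setup.

```python
import subprocess


def build_remote_commands(remote_name, bucket, endpoint_url,
                          access_key_id, secret_access_key):
    """Return the dvc commands that register an S3-compatible remote."""
    return [
        # -d makes the remote the default, -f overwrites an existing entry.
        ["dvc", "remote", "add", "-d", "-f", remote_name, f"s3://{bucket}/"],
        ["dvc", "remote", "modify", remote_name, "endpointurl", endpoint_url],
        ["dvc", "remote", "modify", remote_name, "access_key_id", access_key_id],
        ["dvc", "remote", "modify", remote_name, "secret_access_key",
         secret_access_key],
    ]


def run_in_repo(commands, repo_dir):
    """Run each command inside the cloned repository and stop on the first error."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


# Build (but do not run) the commands with placeholder values.
commands = build_remote_commands(
    "ibm-cos", "example-bucket", "https://example.invalid",
    "<hmac-access-key>", "<hmac-secret-access-key>",
)
```

Passing argument lists to ```subprocess.run``` instead of a shell string keeps the secrets from being interpreted by the shell.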

In [None]:
!cd dvc-testing && dvc remote add -d -f ibm-cos s3://mlops-sustainability-data/

In [None]:
!cd dvc-testing && dvc remote modify ibm-cos endpointurl https://s3.eu-de.cloud-object-storage.appdomain.cloud

In [None]:
!cd dvc-testing && dvc remote modify ibm-cos access_key_id $HMAC_ADMIN_ACCESS_KEY

In [None]:
!cd dvc-testing && dvc remote modify ibm-cos secret_access_key $HMAC_ADMIN_SECRET_ACCESS_KEY

In [None]:
!cd dvc-testing && git commit .dvc/config -m "Configure IBM COS (S3) as remote storage"

In [None]:
!cd dvc-testing && dvc push

#### 2.3. Begin Tracking Concatenated ERA5/GloFAS Data

Tracking the whole unsplit dataset serves solely as a safeguard. We will track the train/test split data separately in a later notebook.
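
When a file is tracked with ```dvc add```, Git stores only a small ```.dvc``` pointer file that identifies the data by a content hash, while the data itself goes to the remote. The sketch below is a conceptual illustration of such a pointer, not DVC's implementation. The helper names are hypothetical, and the field names are an assumption about what a pointer typically records.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_pointer(path):
    """Collect the kind of metadata a pointer file keeps in place of the data."""
    path = Path(path)
    return {"md5": file_md5(path), "size": path.stat().st_size, "path": path.name}


# Demonstrate on a tiny throwaway file instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pkl"
    sample.write_bytes(b"hello")
    pointer = describe_pointer(sample)
```

Because the pointer changes whenever the file content changes, committing it to Git is enough to version the dataset.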

In [None]:
!cd dvc-testing && mkdir data

In [None]:
!mv era5-glofas-merged.pkl dvc-testing/data

In [None]:
!cd dvc-testing && dvc add data/era5-glofas-merged.pkl

In [None]:
# To track the changes with git, run:
!cd dvc-testing && git add data/.gitignore data/era5-glofas-merged.pkl.dvc

In [None]:
!cd dvc-testing && git commit -m "Newest ERA5xGloFAS data"

In [None]:
!cd dvc-testing && git push

In [None]:
# To enable auto staging, run:
!cd dvc-testing && dvc config core.autostage true

In [None]:
!cd dvc-testing && git config --global user.email "ilias.ennmouri@ibm.com"
!cd dvc-testing && git config --global user.name "Ilias Ennmouri"

In [None]:
!cd dvc-testing && git add data/.gitignore data/era5-glofas-merged.pkl.dvc

In [None]:
!cd dvc-testing && git commit -m "Add concatenated ERA5 and GloFAS data"

In [None]:
!cd dvc-testing && dvc push

### 3. Hand-off to Next Pipeline Node

In [None]:
validation_params = {}
validation_params["tracking_merged"] = True

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(validation_params)