# Initialize DVC and Start Tracking Merged Data

- Clean data
    - Drop columns not required for training
    - Drop rows with null values where it makes sense
      (river discharge may be NaN where there is no river; it makes sense to keep these rows so the model can learn where rivers are)
- Think about whether or not to have separate notebooks for new data retrievals and prep
- Version Control the data
- Train test splitting
- Version control again??
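
The null-handling rule from the list above can be sketched as follows. This is a minimal illustration, not the project's actual cleaning code: the column names (`station_id`, `t2m`, `river_discharge`) and the helper `clean_merged` are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_merged(df, drop_columns, keep_nan_in):
    """Drop unused columns, then drop rows with nulls outside `keep_nan_in`."""
    df = df.drop(columns=drop_columns, errors="ignore")
    # Only columns outside `keep_nan_in` must be non-null for a row to survive.
    required = [col for col in df.columns if col not in keep_nan_in]
    return df.dropna(subset=required).reset_index(drop=True)


# Hypothetical toy frame standing in for the merged ERA5/GloFAS data.
toy = pd.DataFrame(
    {
        "station_id": [1, 2, 3],  # assumed to be irrelevant for training
        "t2m": [280.1, np.nan, 279.4],
        "river_discharge": [12.5, 3.0, np.nan],  # NaN where there is no river
    }
)
cleaned = clean_merged(
    toy, drop_columns=["station_id"], keep_nan_in=["river_discharge"]
)
```

Rows with a missing `t2m` are dropped, while rows whose only missing value is `river_discharge` are kept.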

In [None]:
# Install required packages.
# TODO: Create IBM Cloud Software Configuration for those
!pip install ibm-cos-sdk ibm_watson_studio_pipelines

In [5]:
from botocore.client import Config
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
import numpy as np
import pandas as pd

from ibm_watson_studio_pipelines import WSPipelines
import ibm_boto3

import logging
import os, types
import warnings

warnings.filterwarnings("ignore")

In [6]:
!pip install 'dvc[s3]'  # 'dvc[all]' would also work, but the S3 extra suffices because COS is S3-compatible

zsh:1: /Users/ennmouri/csm/mlops-sustainability-oss/venv/bin/pip: bad interpreter: /Users/ennmouri/csm/mlops-sustainability/venv/bin/python3: no such file or directory

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m


### 1. Setup IBM Cloud and COS Credentials

**Note**: If you are running this notebook outside of a Watson Studio Pipeline execution, make sure to set the environment variables that the Pipeline environment would otherwise pass to the notebook.
Refer to ```credentials.py```.
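
For local runs, the helper only needs to populate the environment variables that later cells read. A minimal sketch of what such a module might contain, assuming the variable names used in this notebook; all values are placeholders, and the real ```credentials.py``` may be structured differently:

```python
import json
import os


def set_env_variables_for_credentials():
    """Populate the environment variables a Pipeline run would otherwise provide."""
    # Placeholder values only; never commit real keys.
    project_cos = {
        "AUTH_ENDPOINT": "<auth-endpoint>",
        "ENDPOINT_URL": "<endpoint-url>",
        "API_KEY": "<api-key>",
        "BUCKET": "<bucket>",
    }
    mlops_cos = {
        "ENDPOINT_URL": "<endpoint-url>",
        "API_KEY": "<api-key>",
        "CRN": "<crn>",
        "BUCKET": "<bucket>",
    }
    # The notebook parses these two with json.loads, so store them as JSON strings.
    os.environ["PROJECT_COS_CREDENTIALS"] = json.dumps(project_cos)
    os.environ["MLOPS_COS_CREDENTIALS"] = json.dumps(mlops_cos)
    os.environ["CLOUD_API_KEY"] = "<cloud-api-key>"
    os.environ["serialized_data_filename"] = "<data-filename>"
    os.environ["GIT_REPOSITORY"] = "<git-repository-url>"
    os.environ["HMAC_ADMIN_ACCESS_KEY"] = "<hmac-access-key>"
    os.environ["HMAC_ADMIN_SECRET_ACCESS_KEY"] = "<hmac-secret-access-key>"
```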

In [7]:
# For local runs only: put your credentials in the module imported below, then run this cell.
from credentials2 import set_env_variables_for_credentials
set_env_variables_for_credentials()

In [8]:
# Retrieve COS credentials from the global pipeline parameters
import json
# Read the JSON strings from the environment and parse them into dicts
project_cos_credentials = json.loads(os.getenv('PROJECT_COS_CREDENTIALS'))
mlops_cos_credentials = json.loads(os.getenv('MLOPS_COS_CREDENTIALS'))

## PROJECT COS 
AUTH_ENDPOINT = project_cos_credentials['AUTH_ENDPOINT']
ENDPOINT_URL = project_cos_credentials['ENDPOINT_URL']
API_KEY_COS = project_cos_credentials['API_KEY']
BUCKET_PROJECT_COS = project_cos_credentials['BUCKET']

## MLOPS COS
ENDPOINT_URL_MLOPS = mlops_cos_credentials['ENDPOINT_URL']
API_KEY_MLOPS = mlops_cos_credentials['API_KEY']
CRN_MLOPS = mlops_cos_credentials['CRN']
BUCKET_MLOPS  = mlops_cos_credentials['BUCKET']
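
`json.loads(os.getenv(...))` raises an opaque `TypeError` when a variable is unset. A small guard, sketched here with the hypothetical helper name `load_json_env`, could fail earlier with a clearer message:

```python
import json
import os


def load_json_env(name, required_keys):
    """Parse a JSON object from environment variable `name` and check its keys."""
    raw = os.getenv(name)
    if raw is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    parsed = json.loads(raw)
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        raise KeyError(f"{name} is missing keys: {missing}")
    return parsed
```

It could be called as `load_json_env("PROJECT_COS_CREDENTIALS", ["AUTH_ENDPOINT", "ENDPOINT_URL", "API_KEY", "BUCKET"])` in place of the bare `json.loads` call.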

In [9]:
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
DATA_FILENAME = os.getenv("serialized_data_filename")

In [10]:
# # @hidden_cell
# CLOUD_API_KEY = ""
# DATA_FILENAME = ""

In [11]:
# Secret (personal access token) for the git repository on public git
# <redacted: keep tokens out of the notebook and load them from the environment instead>

### 2. Set Up DVC

Clone a (preferably empty) git repository which will be used for data and model version tracking.<br> It will store ```.dvc``` files **and it will contain the remote locations as well as the corresponding access keys.**<br> Make sure to create a private repository or work with GitHub Enterprise.

The following cells expect the repository to be empty. If a step has already been completed, the corresponding cell can be skipped; the commands are meant to be non-overwriting.
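
One way to make the skipping explicit is to check the repository state before running anything. A minimal sketch, assuming the clone lives in a local directory such as `dvc-testing`; the helper name `plan_setup_commands` is hypothetical:

```python
from pathlib import Path


def plan_setup_commands(git_url, repo_dir):
    """Return (command, cwd) pairs still needed to clone and DVC-initialize repo_dir."""
    repo = Path(repo_dir)
    plan = []
    if not (repo / ".git").is_dir():
        # No clone yet: run from the current directory.
        plan.append((["git", "clone", git_url, str(repo)], None))
    if not (repo / ".dvc").is_dir():
        # Clone exists (or will exist) but DVC is not initialized inside it.
        plan.append((["dvc", "init"], str(repo)))
    return plan
```

Each pair can then be executed with `subprocess.run(command, cwd=cwd, check=True)`; an empty plan means both steps were already completed.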

#### 2.1. Clone Empty Repository for Versioning w/ DVC

In [12]:
# NOTE: env set in credentials.py
!git clone $GIT_REPOSITORY

fatal: destination path 'dvc-testing' already exists and is not an empty directory.


In [13]:
!cd dvc-testing && dvc init

zsh:1: /Users/ennmouri/csm/mlops-sustainability-oss/venv/bin/dvc: bad interpreter: /Users/ennmouri/csm/mlops-sustainability/venv/bin/python3: no such file or directory
[31mERROR[39m: failed to initiate DVC - '.dvc' exists. Use `-f` to force.
[0m

#### 2.2. Add IBM COS Instance to dvc.config as remote

To successfully complete this step, make sure that you create Cloud Object Storage "Credentials" for the COS Instance that you want to use.
<br>**Note:** Make sure to enable HMAC credentials when generating the "Credentials" in IBM Cloud.
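
Because `.dvc/config` is committed, HMAC keys written there end up in git history. If that is not desired, the `--local` option of `dvc remote modify` writes to `.dvc/config.local` instead, which DVC keeps out of git; other pipeline nodes would then have to set the keys themselves. A sketch of building the commands, with the remote name, bucket URL and endpoint taken from this notebook and the helper name `remote_setup_commands` being hypothetical:

```python
import os


def remote_setup_commands(name, url, endpoint, access_key, secret_key):
    """Build the DVC CLI calls; credentials go to .dvc/config.local via --local."""
    return [
        ["dvc", "remote", "add", "-d", "-f", name, url],
        ["dvc", "remote", "modify", name, "endpointurl", endpoint],
        ["dvc", "remote", "modify", "--local", name, "access_key_id", access_key],
        ["dvc", "remote", "modify", "--local", name, "secret_access_key", secret_key],
    ]


commands = remote_setup_commands(
    name="ibm-cos",
    url="s3://mlops-sustainability-data/",
    endpoint="https://s3.eu-de.cloud-object-storage.appdomain.cloud",
    # Placeholders are used when the variables are not set.
    access_key=os.environ.get("HMAC_ADMIN_ACCESS_KEY", "<access-key>"),
    secret_key=os.environ.get("HMAC_ADMIN_SECRET_ACCESS_KEY", "<secret-key>"),
)
```

Each entry can be run with `subprocess.run(command, cwd="dvc-testing", check=True)`.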

In [14]:
!cd dvc-testing && dvc remote add -d -f ibm-cos s3://mlops-sustainability-data/

Setting 'ibm-cos' as a default remote.

In [15]:
!cd dvc-testing && dvc remote modify ibm-cos endpointurl https://s3.eu-de.cloud-object-storage.appdomain.cloud


In [16]:
!cd dvc-testing && dvc remote modify ibm-cos access_key_id $HMAC_ADMIN_ACCESS_KEY


In [17]:
!cd dvc-testing && dvc remote modify ibm-cos secret_access_key $HMAC_ADMIN_SECRET_ACCESS_KEY


In [18]:
!cd dvc-testing && git commit .dvc/config -m "Configure IBM COS (S3) as remote storage"

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/era5-glofas-merged.pkl.dvc

nothing added to commit but untracked files present (use "git add" to track)


In [18]:
!cd dvc-testing && dvc push

Everything is up to date.

#### 2.3. Begin Tracking Concatenated ERA5/GloFAS Data

The whole unsplit dataset is tracked solely as a safety measure. We will track the train/test split data separately in a later notebook.
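
Conceptually, `dvc add` records a hash of the file in a small `.dvc` pointer file, and git tracks that pointer instead of the data itself. The sketch below only illustrates the hashing idea with the standard library; it is not DVC's implementation, and the helper name `file_md5` is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large pickles do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy file standing in for data/era5-glofas-merged.pkl.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    checksum = file_md5(sample)
```

Any change to the file's bytes changes the checksum, which is how a new data version becomes visible as a one-line diff in the pointer file.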

In [19]:
!cd dvc-testing && mkdir data

mkdir: data: File exists


In [20]:
!mv era5-glofas-merged.pkl dvc-testing/data

In [20]:
!cd dvc-testing && dvc add data/era5-glofas-merged.pkl

100% Adding...|████████████████████████████████████████|1/1 [00:00, 72.04file/s]

To track the changes with git, run:

	git add data/era5-glofas-merged.pkl.dvc

To enable auto staging, run:

	dvc config core.autostage true

In [22]:
# To track the changes with git, run:
!cd dvc-testing && git add data/.gitignore data/era5-glofas-merged.pkl.dvc

In [25]:
!cd dvc-testing && git commit -m "Newest ERA5xGloFAS data"

On branch main
Your branch is ahead of 'origin/main' by 2 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


In [26]:
!cd dvc-testing && git push

git: 'credential-manager-core' is not a git command. See 'git --help'.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.01 KiB | 1.01 MiB/s, done.
Total 8 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), done.
remote: This repository moved. Please use the new location:
remote:   https://github.com/iIias/dvc-testing.git
To https://github.com/iiias/dvc-testing.git
   eea259d..26fd6db  main -> main


In [27]:
# To enable auto staging, run (from inside the DVC repository):
!cd dvc-testing && dvc config core.autostage true

In [28]:
!cd dvc-testing && git config --global user.email "ilias.ennmouri@ibm.com"
!cd dvc-testing && git config --global user.name "Ilias Ennmouri"

In [None]:
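# Assumption: the pathspec was missing here; the pointer file created by `dvc add` is the likely target.
!cd dvc-testing && git add data/era5-glofas-merged.pkl.dvc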

In [26]:
!cd dvc-testing && git commit -m "Add merged ERA5 and GloFas data"

[main 65624de] Add merged ERA5 and GloFas data
 1 file changed, 1 insertion(+)


In [27]:
!cd dvc-testing && dvc push

  0% Transferring|                                   |0/1 [00:00<?,     ?file/s]
  9%|▉         |/Users/ennmouri/csm/mlops-s50.0M/549M [00:17<02:54,    3.00MB/s]^C

ERROR: interrupted by the user

In [None]:
#!dvc get $GIT_REPOSITORY data/era5-glofas-merged.pkl -o data/era5-glofas-merged.pkl
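
Besides the `dvc get` CLI call above, DVC's Python API offers `dvc.api.open` for reading a tracked file straight from a repository. A hedged sketch: `load_tracked_pickle` and its `opener` parameter are hypothetical names, the import is deferred so the function can be defined without DVC installed, and unpickling the merged DataFrame assumes pandas is available.

```python
import pickle


def load_tracked_pickle(path, repo, rev=None, opener=None):
    """Load a DVC-tracked pickle from `repo` at git revision `rev`."""
    if opener is None:
        import dvc.api  # deferred: only needed when no opener is injected

        opener = dvc.api.open
    # Binary mode is required for pickle data.
    with opener(path, repo=repo, rev=rev, mode="rb") as handle:
        return pickle.load(handle)
```

A call might look like `load_tracked_pickle("data/era5-glofas-merged.pkl", repo=os.environ["GIT_REPOSITORY"])`. Only unpickle data from a repository you trust.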

### 3. Hand-off to Next Pipeline Node

In [None]:
validation_params = {}
validation_params["tracking_merged"] = True

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(validation_params)
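
The flag above is set unconditionally. A sketch of deriving it from the repository state instead, assuming the directory layout used in this notebook; the helper name `is_tracked` is hypothetical:

```python
from pathlib import Path


def is_tracked(repo_dir, data_path):
    """Return True if a `.dvc` pointer file exists next to the data file."""
    pointer = Path(repo_dir) / (str(data_path) + ".dvc")
    return pointer.is_file()


validation_params = {
    "tracking_merged": is_tracked("dvc-testing", "data/era5-glofas-merged.pkl")
}
```

This only confirms that `dvc add` produced a pointer file; it does not verify that `dvc push` completed.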