# Pull the Newest Full Dataset, Create a Train-Test Split, and Track Both Packages

When this notebook is executed, we expect <br>**(1)** the dataset to be split <br>**(2)** the dataset to be tracked (since we are retrieving it via DVC)

Steps covered in this notebook:
1. Retrieve parameters
2. Download dataset via DVC and deserialize
3. Initial data preprocessing (e.g. drop single value columns)
4. Make the train-test split
5. Serialize the split data as train and test packages (train_package = X_train, y_train and *vice versa*)
6. Set up DVC
7. Track train and test package
8. Check whether both data packages are tracked via ```DVCFileSystem```

In [None]:
# Install required packages.
# TODO: Create IBM Cloud Software Configuration for those
!pip install ibm-cos-sdk ibm_watson_studio_pipelines 'dvc[all]' # 'dvc[s3]' alone may suffice, since COS is accessed through its S3-compatible API

In [None]:
from ibm_watson_studio_pipelines import WSPipelines
import ibm_boto3

from botocore.client import Config
from sklearn.model_selection import train_test_split
from dataclasses import dataclass
import numpy as np
import pandas as pd

import pickle
import dvc.api
import io

import logging
import os, types
import warnings

warnings.filterwarnings("ignore")

### Retrieve parameters

**Note**: If you are running this notebook outside of a Watson Studio Pipeline execution, make sure to set the environment variables that the Pipeline environment would otherwise have passed to the notebook.
Refer to ```credentials.py```.
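
For local runs, a helper along the following lines could populate the environment. This is only a sketch: the notebook does not show the contents of ```credentials.py```, so the function body and every placeholder value are assumptions. Only the variable names are taken from the parameter cell of this notebook.

```python
import os

# Hypothetical placeholder values; replace them with your own settings.
_LOCAL_SETTINGS = {
    "CLOUD_API_KEY": "<your IBM Cloud API key>",
    "serialized_data_filename": "<file name of the pickled dataset>",
    "GIT_REPOSITORY": "<URL of the Git repository that holds the DVC setup>",
    "REPO_NAME": "<folder name of the cloned repository>",
}


def set_env_variables_for_credentials(settings=None):
    """Copy settings into os.environ without overwriting values that already exist."""
    for name, value in (settings or _LOCAL_SETTINGS).items():
        os.environ.setdefault(name, value)
```

Using `os.environ.setdefault` means that values injected by a real pipeline run take precedence over the local placeholders.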

In [None]:
# Uncomment this cell and put your credentials in credentials.py to run locally.
# from credentials import set_env_variables_for_credentials
# set_env_variables_for_credentials()

In [None]:
CLOUD_API_KEY = os.getenv("CLOUD_API_KEY")
DATA_FILENAME = os.getenv("serialized_data_filename")
GIT_REPOSITORY = os.getenv("GIT_REPOSITORY")
REPO_NAME = os.getenv("REPO_NAME")
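
Because `os.getenv` silently returns `None` for unset variables, a missing parameter only surfaces later, for example when DVC tries to read from the repository. A small check such as the sketch below could report the problem early. The helper name `missing_env_variables` is hypothetical and not part of the original notebook.

```python
import os

# Variable names as read by the parameter cell of this notebook.
REQUIRED_VARIABLES = (
    "CLOUD_API_KEY",
    "serialized_data_filename",
    "GIT_REPOSITORY",
    "REPO_NAME",
)


def missing_env_variables(names=REQUIRED_VARIABLES, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report missing parameters without interrupting the notebook.
missing = missing_env_variables()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```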

In [None]:
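# Hard-coded override of the REPO_NAME environment variable; the shell cells below assume this folder name.
REPO_NAME = "dvc-testing"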

### DVC Pull and Deserialize Data

In [None]:
# TODO: Make pipeline param
repo = GIT_REPOSITORY

In [None]:
# Retrieve dataset from tracking information in git. The repository itself contains the remote storage info and credentials.
data = pickle.load(io.BytesIO(dvc.api.read(f"data/{DATA_FILENAME}", repo=repo, mode="rb")))

### Data Preprocessing

In [None]:
# Drop rows where at least one col-value is NaN
print(f"Dropped {len(data)-len(data.dropna(axis=0))} rows.")
data = data.dropna(axis=0)

In [None]:
# E.g. col 'step' has only a single unique value. Its existence has no effect on training and is solely a waste of resources.
# Therefore we will drop all cols with that characteristic.
for key in data.keys():
    if len(data[key].unique()) < 2:
        print(f"col '{key}' dropped because it bears no more than one unique value.")
        data = data.drop(key, axis=1)
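
The same filter can be expressed without an explicit loop. The sketch below uses `DataFrame.nunique` on a toy table; the helper name `drop_constant_columns` and the toy values are hypothetical, and only the column names `step` and `dis24` come from this notebook.

```python
import pandas as pd


def drop_constant_columns(frame):
    """Return (reduced frame, dropped names) for columns with fewer than two unique values."""
    # dropna=False counts NaN as a value, which matches the behavior of Series.unique().
    unique_counts = frame.nunique(dropna=False)
    constant = unique_counts[unique_counts < 2].index.tolist()
    return frame.drop(columns=constant), constant


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"step": [1, 1, 1], "dis24": [0.1, 0.2, 0.3]})
reduced, dropped = drop_constant_columns(toy)
print(f"Dropped columns: {dropped}")
```

Returning the dropped names alongside the reduced frame keeps the logging that the loop version provides.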

In [None]:
# Parse the 'time' column into datetime objects
data['time'] = pd.to_datetime(data['time'])

#data['latitude'] = data['latitude'].astype('category').cat.codes  # Encode coordinates as categorical codes
#data['longitude'] = data['longitude'].astype('category').cat.codes  # Encode coordinates as categorical codes

### Test Train Split

In [None]:
# train_test_split was already imported at the top of the notebook.
# The full table is stored in the pandas DataFrame 'data'.
X = data.drop('dis24', axis=1)  # Extract input features by dropping the target column
y = data['dis24']  # Extract the target column


# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
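
To make the effect of `test_size=0.2` and `random_state=42` concrete, the following sketch runs the same call on a toy table. The data is hypothetical; only the column names `time` and `dis24` come from this notebook. With ten rows, the split yields eight training rows and two test rows, and repeating the call with the same `random_state` reproduces the same partition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=10, freq="D"),
    "dis24": [float(i) for i in range(10)],
})

X = toy.drop("dis24", axis=1)  # input features
y = toy["dis24"]               # target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Features and targets stay aligned because both are indexed with the same shuffled positions.
print(len(X_train), len(X_test))
print(X_train.index.equals(y_train.index))
print(X_train.index.equals(X_train_again.index))
```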

In [None]:
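def serialize(obj, target_path):
    """Pickle obj to target_path; report failures instead of raising."""
    try:
        with open(target_path, 'wb') as _file:
            pickle.dump(obj, _file)
    except Exception as e:
        # The error is only printed, so check the output before relying on the file.
        print(f"Could not serialize to '{target_path}': {e}")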

In [None]:
train_target = "/data/train_package.pkl"

In [None]:
train_package = {}
train_package["X_train"] = X_train
train_package["y_train"] = y_train

serialize(train_package, f"{REPO_NAME}{train_target}")

In [None]:
test_target = "/data/test_package.pkl"

In [None]:
test_package = {}
test_package["X_test"] = X_test
test_package["y_test"] = y_test

serialize(test_package, f"{REPO_NAME}{test_target}")
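
A quick round trip can confirm that a package survives serialization before it is handed to DVC. The sketch below writes to a temporary directory rather than the repository folder. The `deserialize` helper and the toy package contents are hypothetical, and this variant of `serialize` lets exceptions propagate.

```python
import pickle
import tempfile
from pathlib import Path


def serialize(obj, target_path):
    """Pickle obj to target_path and let any exception propagate."""
    with open(target_path, "wb") as _file:
        pickle.dump(obj, _file)


def deserialize(source_path):
    """Load a pickled object from source_path (hypothetical counterpart to serialize)."""
    with open(source_path, "rb") as _file:
        return pickle.load(_file)


# Hypothetical toy package with the same keys as train_package.
package = {"X_train": [[1.0, 2.0], [3.0, 4.0]], "y_train": [0.5, 1.5]}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "train_package.pkl"
    serialize(package, target)
    restored = deserialize(target)

print(restored == package)
```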

### Set Up DVC

Since we assume CPDaaS as the environment, we need to clone the DVC setup repository again.
Run the line shown below.

```
!git clone https://[GIT_TOKEN]@github.com/[GIT_REPOSITORY].git
```


In [None]:
# @hidden_cell
!git clone $GIT_REPOSITORY

In [None]:
!cd dvc-testing && dvc add data/train_package.pkl data/test_package.pkl

In [None]:
!cd dvc-testing && git add data/.gitignore data/train_package.pkl.dvc data/test_package.pkl.dvc

In [None]:
!cd dvc-testing && git config --global user.email "ilias.ennmouri@ibm.com"
!cd dvc-testing && git config --global user.name "Ilias Ennmouri"

In [None]:
!cd dvc-testing && git commit -m "New train test subsets"

In [None]:
!cd dvc-testing && dvc push && git push

In [None]:
from dvc.api import DVCFileSystem

In [None]:
fs = DVCFileSystem(GIT_REPOSITORY, rev="main")

In [None]:
dvc_tracked = fs.find("/", detail=False, dvc_only=True)

In [None]:
training_tracked = train_target in dvc_tracked
training_tracked

In [None]:
test_tracked = test_target in dvc_tracked
test_tracked
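
The membership checks above compare strings exactly, so they depend on whether the paths returned by `fs.find` carry a leading slash like `train_target` and `test_target` do. That format is not verified here and may differ between DVC versions. A more tolerant comparison could look like the sketch below; the helper name `is_tracked` is hypothetical.

```python
def is_tracked(target, tracked_paths):
    """Check membership after stripping leading slashes from both sides."""
    normalized = {path.lstrip("/") for path in tracked_paths}
    return target.lstrip("/") in normalized


# Hypothetical listing, standing in for the result of fs.find(...).
example_listing = ["data/train_package.pkl", "data/test_package.pkl"]
print(is_tracked("/data/train_package.pkl", example_listing))
print(is_tracked("/data/validation_package.pkl", example_listing))
```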

In [None]:
validation_params = {}
validation_params['training_package_tracked'] = training_tracked
validation_params['test_package_tracked'] = test_tracked
validation_params['train_package_dvc_location'] = train_target
validation_params['test_package_dvc_location'] = test_target

In [None]:
pipelines_client = WSPipelines.from_apikey(apikey=CLOUD_API_KEY)
pipelines_client.store_results(validation_params)