
# Configure DAGsHub & Git

In [None]:
import os  # needed by the data-directory cell further down
import requests
import getpass
import datetime

**Set Environment Variables**


In [None]:
#@title Enter the repository name for the project:

REPO_NAME= "mark4" #@param {type:"string"}

In [None]:
#@title Enter the username of your DAGsHub account:

USER_NAME = "prayas99" #@param {type:"string"}

In [None]:
#@title Enter the email for your DAGsHub account:

EMAIL = "prayas99@gmail.com" #@param {type:"string"}

To keep your DAGsHub password out of the notebook runtime, the next cell asks for it once and exchanges it for an access token through the DAGsHub API. Only the token is stored, and it is what you will use to push your Git-tracked files, so the password is never saved as a variable.

In [None]:
r = requests.post('https://dagshub.com/api/v1/user/tokens', 
                  json={"name": f"colab-token-{datetime.datetime.now()}"}, 
                  auth=(USER_NAME, getpass.getpass('DAGsHub password:')))
r.raise_for_status()
TOKEN=r.json()['sha1']

DAGsHub password:··········
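As a rough illustration of how the token stands in for the password, the sketch below builds the kind of authenticated push URL used later in this notebook, plus a masked copy that is safe to print. The helpers `build_push_url` and `mask_credentials` are hypothetical names introduced here, and the user, repository, and token values are placeholders.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_push_url(user: str, token: str, repo: str, host: str = "dagshub.com") -> str:
    """Return an HTTPS Git URL that embeds the user name and access token."""
    # Percent-encode each piece so special characters cannot break the URL.
    safe_user = quote(user, safe="")
    safe_token = quote(token, safe="")
    safe_repo = quote(repo, safe="")
    return f"https://{safe_user}:{safe_token}@{host}/{safe_user}/{safe_repo}.git"


def mask_credentials(url: str) -> str:
    """Return the URL with the password/token part replaced by asterisks."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to hide
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder values for illustration only.
url = build_push_url("some-user", "abc123", "some-repo")
print(mask_credentials(url))
```

Printing only the masked form keeps the token out of saved cell outputs.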


**Configure Git**

In [None]:
!git config --global user.email {EMAIL}
!git config --global user.name {USER_NAME}

**Clone the Repository**

In [None]:
!git clone https://dagshub.com/{USER_NAME}/{REPO_NAME}.git

%cd {REPO_NAME}

Cloning into 'mark4'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 33 (delta 10), reused 0 (delta 0)
Unpacking objects: 100% (33/33), done.
/content/mark4


# Install and Configure DVC

**Initialize DVC**

In [None]:
# Install DVC
!pip install dvc &> /dev/null 

# Import DVC package - relevant only when working in a Colab environment
import dvc

# Initialize DVC in the local directory
!dvc init &> /dev/null 

# Track the changes with git
!git add .dvc .dvcignore .gitignore
!git commit -m "Initialize DVC"

On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean


**Configure DVC**

In [None]:
# Set DVC remote storage as 'DAGsHub storage'
!dvc remote add origin --local https://dagshub.com/{USER_NAME}/{REPO_NAME}.dvc

# General DVC configuration
!dvc remote modify --local origin auth basic
!dvc remote modify --local origin user {USER_NAME}
!dvc remote modify --local origin password {TOKEN}

# Track Files Using DVC and Git 

The data directory holds the data sets for this project, which are quite big. We therefore track that directory with DVC and use Git to track the rest of the project's files.
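DVC keeps large files out of Git by recording a content hash in a small pointer file (here `data.dvc`) while the data itself is stored elsewhere. The sketch below only illustrates the idea of content hashing with an MD5 digest, the hash type mentioned in the `dvc add` output. It is not DVC's implementation, and `file_md5` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data sets need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file: any change to the content changes the hash.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "wb") as handle:
        handle.write(b"LOAN,BAD\n1100,1\n")
    before = file_md5(sample)

    with open(sample, "ab") as handle:
        handle.write(b"1300,0\n")
    after = file_md5(sample)

print(before != after)
```

Because the pointer file changes whenever the hash changes, committing `data.dvc` to Git is enough to record which version of the data a commit used.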

**Track Files with DVC**



In [None]:
!ls

data  data.dvc	metrics.csv  params.yml


In [None]:
if not os.path.isdir('data'):
  os.mkdir('data')
%cd data

/content/mark4/data


# Put the ID of the hmeq.csv dataset here; the file is available at [LINK](https://drive.google.com/file/d/1-lwRFwFNY6n0y6aIvZnclvlKQrAcHFx-/view?usp=sharing)

In [None]:
!gdown --id 1-lwRFwFNY6n0y6aIvZnclvlKQrAcHFx-
os.chdir("..")

Downloading...
From: https://drive.google.com/uc?id=1nESRZ4AWMUhU1kkMti5U3CvaxoxE8rV5
To: /content/mark4/data/hmeq.csv
100% 403k/403k [00:00<00:00, 3.53MB/s]


In [None]:
# Add the data directory to DVC tracking
!dvc add data

Adding...: 100% 1/1 [00:00<00:00, 11.64file/s]

To track the changes with git, run:

	git add data.dvc .gitignore


In [None]:
# Track the changes with Git
!git add data.dvc .gitignore
!git commit -m "Add the data directory to DVC tracking"

[master 7a61410] Add the data directory to DVC tracking
 2 files changed, 6 insertions(+)
 create mode 100644 data.dvc


**Track Files with Git**

In [None]:
# !git add requirements.txt src/
# !git commit -m "Add requirements and src to Git tracking"

# Push the Files to the Remotes 

**Push Git tracked files**


In [None]:
!git push https://{USER_NAME}:{TOKEN}@dagshub.com/{USER_NAME}/{REPO_NAME}.git

Counting objects: 17, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (15/15), done.
Writing objects:  70% 

**Push DVC tracked files**


In [None]:
!dvc push -r origin

2 files pushed


# Importing Libraries

In [None]:
!pip3 install dagshub &> /dev/null

In [None]:
import numpy as np
import sys  
import os
import sklearn.cluster as cluster
from IPython.display import clear_output
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
from scipy.stats import skew
import lightgbm
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import average_precision_score
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance, to_graphviz
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import brier_score_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import logging
logging.getLogger().setLevel(logging.INFO)
import torch
from torch import nn, optim
from sklearn.tree import DecisionTreeClassifier
import torch.nn.functional as F
from torch.autograd import grad as torch_grad
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict
import joblib
from pathlib import Path
import dagshub
rcParams['figure.figsize'] = 9,5
randomState = 5
np.random.seed(randomState)
import pickle

In [None]:
# Messages
M_PRO_INIT = '[DEBUG] Preprocessing raw data'
M_PRO_LOAD_DATA = '     [DEBUG] Loading raw data'
M_PRO_SPLIT_DATA = '     [DEBUG] Splitting data to train and test'
M_PRO_SAVE_DATA = '     [DEBUG] Saving data to file'

M_MOD_INIT = '[DEBUG] Initialize Modeling'
M_MOD_LOAD_DATA = '     [DEBUG] Loading data sets for modeling'
M_MOD_RFC = '     [DEBUG] Running Random Forest Classifier'
M_MOD_SCORE = '     [INFO] Finished modeling with GINI,AUPRC Score:'

M_FE_PEAR =' [INFO] Taken best FE Pearson'
M_FE_PEAR =' [INFO] Finished taking best features with Pearson'

PREFIX = ""

RAW_DATA_PATH = os.path.join(PREFIX, 'data/hmeq.csv')
X_TRAIN_PATH = os.path.join(PREFIX, 'data/X_train.csv')
X_TEST_PATH = os.path.join(PREFIX, 'data/X_test.csv')
Y_TRAIN_PATH = os.path.join(PREFIX, 'data/y_train.csv')
Y_TEST_PATH = os.path.join(PREFIX, 'data/y_test.csv')

In [None]:
!dvc pull -r origin

# Pre-Processing

In [None]:
df_orig = pd.read_csv('./data/hmeq.csv')

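# Shuffle the rows, then separate the target column from the features
X = df_orig.sample(frac=1, random_state=randomState)
Y = X['BAD']
del X['BAD']

cat_cols = ['REASON', 'JOB']
num_cols = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
            'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

target_col = 'BAD'

# Hold out 20% of the rows for testing
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2,
                                                random_state=randomState)

# Numeric columns: impute with the mean, then scale to [0, 1]
num_prep = make_pipeline(SimpleImputer(strategy='mean'),
                         MinMaxScaler())
# Categorical columns: impute with the mode, then one-hot encode
# Note: scikit-learn 1.2+ renames `sparse` to `sparse_output`
cat_prep = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore', sparse=False))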
prep = ColumnTransformer([
    ('num', num_prep, num_cols),
    ('cat', cat_prep, cat_cols)],
    remainder='drop')
trainX = prep.fit_transform(trainX)
trainX = pd.DataFrame(trainX)
testX = prep.transform(testX)
testX = pd.DataFrame(testX)


scaler = StandardScaler()

# Fit only to the training data
scaler.fit(trainX)

# Now apply the transformations to the data:
trainX_scaled = scaler.transform(trainX)
testX_scaled = scaler.transform(testX)

trainX = pd.DataFrame(trainX_scaled, columns=trainX.columns)
testX = pd.DataFrame(testX_scaled, columns=testX.columns)

trainX.to_csv(X_TRAIN_PATH)
testX.to_csv(X_TEST_PATH)
trainY.to_csv(Y_TRAIN_PATH)
testY.to_csv(Y_TEST_PATH)
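The preprocessing above fits every transformer on the training split and only applies it to the test split, which keeps test-set statistics from leaking into training. A minimal pure-Python sketch of that fit/transform separation for min-max scaling follows. `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions, and the numbers are made up.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters from the training split only."""
    return min(train_values), max(train_values)


def apply_min_max(values, params):
    """Scale values with previously learned (min, max) parameters."""
    low, high = params
    span = high - low
    if span == 0:
        # A constant training column carries no scale; map everything to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]


train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

params = fit_min_max(train)                # parameters come from train only
scaled_train = apply_min_max(train, params)
scaled_test = apply_min_max(test, params)  # test reuses the train parameters

print(scaled_train)  # [0.0, 0.5, 1.0]
print(scaled_test)   # [0.25, 1.5]
```

The test value 1.5 falls outside [0, 1] because 40.0 lies beyond the range seen in training. That is expected when the scaler is fitted on the training split alone.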

In [None]:
# Process the Data
# !python src/data_preprocessing.py

[DEBUG] Preprocessing raw data 
     [DEBUG] Loading raw data
     [DEBUG] Removing punctuation from Emails
     [DEBUG] Label encoding target column
     [DEBUG] vectorizing the emails by words
     [DEBUG] Splitting data to train and test
     [DEBUG] Saving data to file


In [None]:
!dvc status

data.dvc:
	changed outs:
		modified:           data


In [None]:
# Track the Changes
!dvc add data &> /dev/null 
!git add data.dvc
!git commit -m "Process raw-data and save it to data directory"

[master d99e9b9] Process raw-data and save it to data directory
 1 file changed, 3 insertions(+), 3 deletions(-)


**Push the Files to the remotes**

In [None]:
!git push https://{USER_NAME}:{TOKEN}@dagshub.com/{USER_NAME}/{REPO_NAME}.git &> /dev/null 

!dvc push -r origin &> /dev/null 

# Create Experiments

In [None]:
print(M_MOD_INIT,'\n'+M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH,index_col=0)
X_test = pd.read_csv(X_TEST_PATH,index_col=0)
y_train = pd.read_csv(Y_TRAIN_PATH,index_col=0)
y_test = pd.read_csv(Y_TEST_PATH,index_col=0)

[DEBUG] Initialize Modeling 
     [DEBUG] Loading data sets for modeling


In [None]:
print(M_MOD_RFC)
with dagshub.dagshub_logger() as logger:
    rfc = RandomForestClassifier(n_estimators=300, min_samples_leaf=1, max_features='sqrt', bootstrap=True,
                                 random_state=2020, n_jobs=2)
    # Log the model's parameters
    logger.log_hyperparams(model_class=type(rfc).__name__)
    logger.log_hyperparams({'model': rfc.get_params()})

    # Train the model; ravel() turns the single-column label frame into a 1-D array
    rfc.fit(X_train.values, y_train.values.ravel())
    probabilities = rfc.predict_proba(X_test.values)
    testY_prob = probabilities[:, 1]
    testY_pred = rfc.predict(X_test.values)

    # Compute each score once: GINI = 2 * ROC AUC - 1
    gini = round(2 * roc_auc_score(y_test, testY_prob) - 1, 3)
    auprc = round(average_precision_score(y_test, testY_prob), 3)

    # Log the model's performance
    logger.log_metrics({'GINI': gini})
    logger.log_metrics({'AUPRC': auprc})
    print(M_MOD_SCORE, gini, auprc)

     [DEBUG] Runing Random Forest Classifier
     [INFO] Finished modeling with GINI,AUPRC Score: 0.876 0.929
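The GINI value logged above is derived from the ROC AUC as `2 * AUC - 1`. To make that relationship concrete, the sketch below computes the AUC as the fraction of (positive, negative) pairs that the scores rank correctly, counting ties as half, and then converts it to GINI. `pairwise_auc` is a hypothetical helper written for illustration and the labels and scores are toy values. The notebook itself relies on `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
gini = 2 * auc - 1
print(auc, gini)  # 0.75 0.5
```

A GINI of 0 corresponds to random ranking (AUC 0.5) and a GINI of 1 to perfect ranking (AUC 1.0).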


In [None]:
# !python3 src/modeling.py

**Track the Experiment Files**

In [None]:
!git add metrics.csv params.yml
!git commit -m "New Experiment - Random Forest Classifier with basic processing"

[master 6ee778c] New Experiment - Random Forest Classifier with basic processing
 2 files changed, 24 insertions(+)
 create mode 100644 metrics.csv
 create mode 100644 params.yml


**Push the Files to the Remotes**

In [None]:
!git push https://{USER_NAME}:{TOKEN}@dagshub.com/{USER_NAME}/{REPO_NAME}.git

Counting objects: 4, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 653 bytes | 653.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0)
To https://dagshub.com/prayas99/mark4.git
   d99e9b9..6ee778c  master -> master


# Feature Engineering

In [None]:
df_orig = pd.read_csv('./data/hmeq.csv')

# Cap extreme values at fixed upper limits
df = df_orig.copy()
df.loc[df["CLAGE"]>=600,"CLAGE"] = 600
df.loc[df["VALUE"]>=400000,"VALUE"] = 400000
df.loc[df["MORTDUE"]>=300000,"MORTDUE"] = 300000
df.loc[df["DEBTINC"]>=100,"DEBTINC"] = 100

# Binary flags: 1 when the count is at least one, otherwise 0
df["B_DEROG"] = (df["DEROG"]>=1)*1
df["B_DELINQ"] = (df["DELINQ"]>=1)*1

# Log-transform YOJ; log1p(t) computes log(t + 1)
df["YOJ"] = np.log1p(df["YOJ"])
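The cell above applies three kinds of transformation: capping extreme values, turning a count into a 0/1 flag, and log-scaling a column. A minimal sketch of each on plain Python numbers follows, with `None` standing in for a missing value. The helper names are hypothetical and the sample values are invented. The flag treats a missing count as 0, mirroring how the pandas comparison `>= 1` evaluates to False for NaN.

```python
import math


def cap(value, upper):
    """Limit a value to `upper`; missing values stay missing."""
    return None if value is None else min(value, upper)


def at_least_one(value):
    """Return 1 when the count is one or more, otherwise 0 (missing counts as 0)."""
    return int(value is not None and value >= 1)


def log_scale(value):
    """Compute log(value + 1); missing values stay missing."""
    return None if value is None else math.log1p(value)


clage = [94.4, 1168.2, None]
derog = [0, 3, None]
yoj = [0.0, 10.5, None]

print([cap(v, 600) for v in clage])      # [94.4, 600, None]
print([at_least_one(v) for v in derog])  # [0, 1, 0]
print([log_scale(v) for v in yoj])
```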

In [None]:
df.columns

Index(['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG',
       'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'B_DEROG', 'B_DELINQ'],
      dtype='object')

In [None]:
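# Shuffle the rows, then separate the target column from the features
X = df.sample(frac=1, random_state=randomState)
Y = X['BAD']
del X['BAD']

# The two new binary flags are treated as categorical columns
cat_cols = ['REASON', 'JOB','B_DEROG','B_DELINQ']
num_cols = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
            'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

target_col = 'BAD'

# Hold out 20% of the rows for testing
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.2,
                                                random_state=randomState)

# Numeric columns: impute with the mean, then scale to [0, 1]
num_prep = make_pipeline(SimpleImputer(strategy='mean'),
                         MinMaxScaler())
# Categorical columns: impute with the mode, then one-hot encode
# Note: scikit-learn 1.2+ renames `sparse` to `sparse_output`
cat_prep = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore', sparse=False))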
prep = ColumnTransformer([
    ('num', num_prep, num_cols),
    ('cat', cat_prep, cat_cols)],
    remainder='drop')
trainX = prep.fit_transform(trainX)
trainX = pd.DataFrame(trainX)
testX = prep.transform(testX)
testX = pd.DataFrame(testX)


scaler = StandardScaler()

# Fit only to the training data
scaler.fit(trainX)

# Now apply the transformations to the data:
trainX_scaled = scaler.transform(trainX)
testX_scaled = scaler.transform(testX)

trainX = pd.DataFrame(trainX_scaled, columns=trainX.columns)
testX = pd.DataFrame(testX_scaled, columns=testX.columns)

trainX.to_csv(X_TRAIN_PATH)
testX.to_csv(X_TEST_PATH)
trainY.to_csv(Y_TRAIN_PATH)
testY.to_csv(Y_TEST_PATH)

In [None]:
!dvc status

data.dvc:
	changed outs:
		modified:           data


In [None]:
# Track the Changes
!dvc add data &> /dev/null 
!git add data.dvc
!git commit -m "Done Feature Engineering"

[master 3f6f2e3] Done Feature Engineering
 1 file changed, 2 insertions(+), 2 deletions(-)


In [None]:
!git push https://{USER_NAME}:{TOKEN}@dagshub.com/{USER_NAME}/{REPO_NAME}.git &> /dev/null 

!dvc push -r origin &> /dev/null 

# Create Experiments

In [None]:
print(M_MOD_INIT,'\n'+M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH,index_col=0)
X_test = pd.read_csv(X_TEST_PATH,index_col=0)
y_train = pd.read_csv(Y_TRAIN_PATH,index_col=0)
y_test = pd.read_csv(Y_TEST_PATH,index_col=0)

[DEBUG] Initialize Modeling 
     [DEBUG] Loading data sets for modeling


In [None]:
print(M_MOD_RFC)
with dagshub.dagshub_logger() as logger:
    rfc = RandomForestClassifier(n_estimators=300, min_samples_leaf=1, max_features='sqrt', bootstrap=True,
                                 random_state=2020, n_jobs=2)
    # Log the model's parameters
    logger.log_hyperparams(model_class=type(rfc).__name__)
    logger.log_hyperparams({'model': rfc.get_params()})

    # Train the model; ravel() turns the single-column label frame into a 1-D array
    rfc.fit(X_train.values, y_train.values.ravel())
    probabilities = rfc.predict_proba(X_test.values)
    testY_prob = probabilities[:, 1]
    testY_pred = rfc.predict(X_test.values)

    # Compute each score once: GINI = 2 * ROC AUC - 1
    gini = round(2 * roc_auc_score(y_test, testY_prob) - 1, 3)
    auprc = round(average_precision_score(y_test, testY_prob), 3)

    # Log the model's performance
    logger.log_metrics({'GINI': gini})
    logger.log_metrics({'AUPRC': auprc})
    print(M_MOD_SCORE, gini, auprc)

     [DEBUG] Runing Random Forest Classifier
     [INFO] Finished modeling with GINI,AUPRC Score: 0.96 0.936
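To compare this run with the first experiment on a more familiar scale, the reported GINI scores can be converted back to ROC AUC by inverting `GINI = 2 * AUC - 1`. The snippet uses only the two GINI values printed in this notebook (0.876 before feature engineering, 0.96 after). `auc_from_gini` is a hypothetical helper.

```python
def auc_from_gini(gini):
    """Invert GINI = 2 * AUC - 1."""
    return (gini + 1) / 2


baseline_auc = auc_from_gini(0.876)  # first experiment
with_fe_auc = auc_from_gini(0.96)    # after feature engineering

print(round(baseline_auc, 3), round(with_fe_auc, 3))  # 0.938 0.98
```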


In [None]:
!git add metrics.csv params.yml
!git commit -m "New Experiment - RF with FE"

[master 5156fcf] New Experiment - RF with FE
 1 file changed, 2 insertions(+), 2 deletions(-)


**Push the Files to the Remotes**

In [None]:
!git push https://{USER_NAME}:{TOKEN}@dagshub.com/{USER_NAME}/{REPO_NAME}.git

Counting objects: 3, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 340 bytes | 340.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To https://dagshub.com/prayas99/mark4.git
   3f6f2e3..5156fcf  master -> master


# Remote

In [None]:
# !dvc remote add -d storage gdrive://1Arw_C_oswJvn1SiUeg3Q8WLu7pmDsPEb

In [None]:
# !git add .dvc/config
# !git commit -m "Configure remote storage"

[master bdbf3b1] Configure remote storage
 1 file changed, 4 insertions(+)


# Second Experiment

In [None]:
print(M_MOD_INIT,'\n'+M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH,index_col=0)
X_test = pd.read_csv(X_TEST_PATH,index_col=0)
y_train = pd.read_csv(Y_TRAIN_PATH,index_col=0)
y_test = pd.read_csv(Y_TEST_PATH,index_col=0)

[DEBUG] Initialize Modeling 
     [DEBUG] Loading data sets for modeling


In [None]:
print(X_train.head())

          0         1         2  ...        19        20        21
0 -0.749129  0.685907  0.242543  ... -0.366644  0.501769 -0.501769
1  0.054510  1.930613  1.870727  ... -0.366644  0.501769 -0.501769
2 -0.499108 -0.228296 -0.295717  ... -0.366644  0.501769 -0.501769
3  0.000935 -1.282888 -1.169475  ... -0.366644  0.501769 -0.501769
4 -1.097373 -1.487084 -1.312982  ...  2.727442 -1.992949  1.992949

[5 rows x 22 columns]


In [None]:
!git checkout HEAD~2 data.dvc

In [None]:
!dvc pull -r origin

M       data/
1 file modified


In [None]:
!dvc diff

Modified:
    data/
    data/X_test.csv
    data/X_train.csv

files summary: 2 modified


In [None]:
print(M_MOD_INIT,'\n'+M_MOD_LOAD_DATA)
X_train = pd.read_csv(X_TRAIN_PATH,index_col=0)
X_test = pd.read_csv(X_TEST_PATH,index_col=0)
y_train = pd.read_csv(Y_TRAIN_PATH,index_col=0)
y_test = pd.read_csv(Y_TEST_PATH,index_col=0)

[DEBUG] Initialize Modeling 
     [DEBUG] Loading data sets for modeling


In [None]:
print(X_train.head())

          0         1         2  ...        15        16        17
0 -0.749129  0.666588  0.219361  ... -0.523302 -0.140267 -0.182079
1  0.054510  1.883295  1.743188  ...  1.910944 -0.140267 -0.182079
2 -0.499108 -0.227050 -0.284400  ... -0.523302 -0.140267 -0.182079
3  0.000935 -1.257920 -1.102155  ... -0.523302 -0.140267 -0.182079
4 -1.097373 -1.457523 -1.236464  ... -0.523302 -0.140267 -0.182079

[5 rows x 18 columns]
