# Example of versioning ML experiments using DVC

This notebook aims to be a guideline for versioning your ML projects using DVC, from a Jupyter notebook.

This notebook allows you to experiment as much as you like, and when you are in a state that you would like to preserve for future reference as a git commit, use the DVC cells to version all your relevant files. 

The cells marked with a green markdown box are responsible for creating a snapshot of your raw data, processed data, and trained models.

This snapshot is implemented as md5 hashes of the respective files saved as text in the `.dvc` files. The hashes in the .dvc files will be part of the git commit.

## Imports and global declarations

In [30]:
import sklearn
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import pickle
import joblib
import json
import os

<div class="alert alert-block alert-success">
<h2>Download and version raw data</h2>
</div>

In [31]:
raw_data = datasets.fetch_california_housing(data_home="data/raw")
# Save the raw input data for reproducibility
!dvc commit -f raw.dvc

[0m                                                                            

## Data preprocessing

In [32]:
def to_dataframe(X, y):
    return pd.concat([
            pd.DataFrame(data=X, columns=raw_data.feature_names),
            pd.DataFrame(data=y, columns=['Value'])
        ],
        axis=1)

In [33]:
raw_df = to_dataframe(raw_data.data, raw_data.target)
raw_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [34]:
raw_df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Value
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


### Test train split

In [35]:
train_X, test_X, train_y, test_y = model_selection.train_test_split(raw_df[raw_df.columns[:-1]], raw_df['Value'])

In [36]:
train_X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
6374,9.4667,32.0,7.325758,1.037879,1558.0,2.950758,34.17,-118.02
13711,3.0587,5.0,5.284007,1.272978,2382.0,2.189338,34.08,-117.21
5543,5.611,36.0,5.727891,1.024943,996.0,2.258503,33.98,-118.4
4263,1.3157,43.0,1.911826,1.151854,3049.0,2.13366,34.1,-118.33
11504,4.7773,37.0,3.535461,0.929078,531.0,1.882979,33.74,-118.1


### Normalize feature columns by training data only

In [37]:
scaler = preprocessing.StandardScaler()
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
train_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
6374,2.947156,0.264726,0.741246,-0.122564,0.112211,-0.010226,-0.684439,0.775534
13711,-0.426527,-1.884132,-0.059881,0.355262,0.818678,-0.080542,-0.726595,1.180526
5543,0.917208,0.583075,0.114287,-0.148854,-0.369627,-0.074154,-0.773436,0.585537
4263,-1.344181,1.140186,-1.383034,0.109085,1.39054,-0.085683,-0.717227,0.620537
11504,0.478282,0.662662,-0.745964,-0.343695,-0.768301,-0.108833,-0.885854,0.735534


In [38]:
test_X_scaled = pd.DataFrame(scaler.transform(test_X), index=test_X.index, columns=test_X.columns)
test_X_scaled.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
12916,0.112904,-0.690322,0.230982,-0.19955,0.774096,-0.023481,1.409347,-0.864434
3938,0.92958,0.503487,0.013131,-0.316807,0.000754,-0.000943,-0.661018,0.495539
11816,0.438585,-1.167846,0.277412,-0.276662,-0.246167,0.00258,1.568606,-0.744436
14983,-0.020716,-1.247434,0.294054,-0.190097,-0.607117,0.026106,-1.363631,1.290523
9272,0.793222,-0.690322,0.650213,-0.136036,-0.708286,-0.03495,1.170459,-1.509421


In [39]:
train_df = pd.concat([train_X_scaled, train_y], axis=1)
test_df = pd.concat([test_X_scaled, test_y], axis=1)

<div class="alert alert-block alert-success">
<h2>Optional: Version the processed data with DVC for efficiency and/or reproducibility</h2>
</div>

In [41]:
os.makedirs('data/processed/')
train_df.to_csv('data/processed/california_households_train.csv', index_label='Index')
test_df.to_csv('data/processed/california_households_test.csv', index_label='Index')
joblib.dump(scaler, 'data/processed/california_households_scaler.pkl')
!dvc commit -f process_data

  0% Transferring|                                   |0/4 [00:00<?,     ?file/s]
![A
  0%|          |california_households_scaler.pkl   0.00/? [00:00<?,        ?B/s][A
                                                                                [A
![A
  0%|          |california_households_train.csv    0.00/? [00:00<?,        ?B/s][A

  0%|          |california_households_test.csv     0.00/? [00:00<?,        ?B/s][A[A

                                                                                [A[A
                                                                                [A
![A
  0%|          |memory://.UjZXBmfvLgdiWjagdxFmGC.tm0.00/? [00:00<?,        ?B/s][A
  0%|          |memory://.UjZXBmfvLgdiWjagdxFmGC.0.00/273 [00:00<?,        ?B/s][A

![A[A

  0%|          |.UjZXBmfvLgdiWjagdxFmGC.tmp        0.00/? [00:00<?,        ?B/s][A[A
                                                                                [A

Updating lock file 'dvc.lock'             

### Use this cell to reload processed data, after switching branches

In [42]:
train_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
test_df = pd.read_csv('data/processed/california_households_train.csv', index_col=0)
scaler = joblib.load('data/processed/california_households_scaler.pkl')

## Training

In [43]:
model = LinearRegression()
X = train_df[train_df.columns[:-1]]
y = train_df['Value']
model.fit(X, y)

<div class="alert alert-block alert-success">
<h2>Save the trained model for reproducibility</h2>
</div>

In [44]:
os.makedirs('models')
joblib.dump(model, 'models/california_households.pkl')
!dvc commit -f train_model

  0% Transferring|                                   |0/2 [00:00<?,     ?file/s]
![A
  0%|          |california_households.pkl          0.00/? [00:00<?,        ?B/s][A
                                                                                [A
![A
  0%|          |memory://.UrehUoTg9RCLnhuQcrzhfe.tm0.00/? [00:00<?,        ?B/s][A
  0%|          |memory://.UrehUoTg9RCLnhuQcrzhfe0.00/85.0 [00:00<?,        ?B/s][A

![A[A

  0%|          |.UrehUoTg9RCLnhuQcrzhfe.tmp        0.00/? [00:00<?,        ?B/s][A[A
                                                                                [A

Updating lock file 'dvc.lock'                                                   [A[A
[0m

### Use this cell to reload the model, after switching branches

In [45]:
model = joblib.load('models/california_households.pkl')

## Evaluate the model

In [46]:
predictions = model.predict(test_df[test_df.columns[:-1]])
truth = test_df['Value']
metrics_dict = {}
metrics_dict['R2'] = metrics.r2_score(truth, predictions)
metrics_dict['MAE'] = metrics.mean_absolute_error(truth, predictions)
metrics_dict['MSE'] = metrics.mean_squared_error(truth, predictions)
metrics_dict['median_absolute_error'] = metrics.median_absolute_error(truth, predictions)
metrics_dict['loss'] = metrics_dict['MSE']
pd.DataFrame(metrics_dict, index=[0])

Unnamed: 0,R2,MAE,MSE,median_absolute_error,loss
0,0.601755,0.537834,0.535798,0.42066,0.535798


<div class="alert alert-block alert-success">
<h2>Save the computed metrics for easy display in DVC and DAGsHub</h2>
</div>

In [47]:
with open('metrics/metrics.json', 'w') as f:
    json.dump(metrics_dict, f, indent=2)
!dvc commit -f eval_model

  0% Transferring|                                   |0/1 [00:00<?,     ?file/s]
![A
  0%|          |metrics.json                       0.00/? [00:00<?,        ?B/s][A
Updating lock file 'dvc.lock'                                                   [A
[0m

<div class="alert alert-block alert-success">
<h2>Versioning section - use the following cells to create a full commit of your current state</h2>
</div>

### Make sure all data and models are committed to DVC
The output of the following cell should be: `Pipeline is up to date. Nothing to reproduce.`

If you get something else, then maybe you forgot to `dvc commit` earlier in the notebook.
We recommend to make sure that the current contents in the data and models directories are to your liking,
and if so, use the commit cell below to automatically commit all current files to DVC.

In [48]:
!dvc status

Data and pipelines are up to date.                                              
[0m

In [49]:
# Use this if dvc status is not up-to-date and you're sure the current state is OK.
!dvc commit -f

[0m                                                                            