# Preparation


See the README.md file for installation and setup instructions.

**References**


https://dvc.org/doc/tutorial/define-ml-pipeline - used as example

## Initialize DVC

In [1]:
!dvc init -f

Adding '.dvc/state' to '.dvc/.gitignore'.
Adding '.dvc/lock' to '.dvc/.gitignore'.
Adding '.dvc/config.local' to '.dvc/.gitignore'.
Adding '.dvc/updater' to '.dvc/.gitignore'.
Adding '.dvc/updater.lock' to '.dvc/.gitignore'.
Adding '.dvc/state-journal' to '.dvc/.gitignore'.
Adding '.dvc/state-wal' to '.dvc/.gitignore'.
Adding '.dvc/cache' to '.dvc/.gitignore'.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+--------

In [2]:
%%bash

git add .
git commit -m "Initialize DVC"

[dvc-tutorial 39329bb] Initialize DVC
 2 files changed, 8 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config


### Files and Directories 

In [3]:
!ls -a .dvc

.            .gitignore   config       updater.lock
..           cache        updater


In [4]:
!cat .dvc/.gitignore

/state
/lock
/config.local
/updater
/updater.lock
/state-journal
/state-wal
/cache

# Version control for data

In [5]:
# Get data 

!wget -P data/ https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
!du -sh data/*

--2019-06-09 10:57:10--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com... 151.101.112.133
Connecting to raw.githubusercontent.com|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) [text/plain]
Saving to: ‘data/iris.csv’


2019-06-09 10:57:10 (42.7 MB/s) - ‘data/iris.csv’ saved [3716/3716]

4.0K	data/iris.csv


In [6]:
# Look at the data

import pandas as pd

df = pd.read_csv('data/iris.csv')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Add file under DVC control

In [7]:
%%bash

dvc add data/iris.csv
du -sh data/*

Adding 'data/iris.csv' to 'data/.gitignore'.
Saving 'data/iris.csv' to '.dvc/cache/57/fce90c81521889c736445f058c4838'.
Saving information to 'data/iris.csv.dvc'.

To track the changes with git run:

	git add data/.gitignore data/iris.csv.dvc
4.0K	data/iris.csv
4.0K	data/iris.csv.dvc


In [8]:
!git status -s data/

?? data/.gitignore
?? data/iris.csv.dvc


In [9]:
%%bash

git add .
git commit -m "Add a source dataset"

[dvc-tutorial 6eebfb1] Add a source dataset
 2 files changed, 9 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc


## What is a DVC-file?

Data file internals

> If you take a look at the DVC-file, you will see that only outputs are defined in `outs`. In this file, only one output is defined. The output contains the data file path in the repository and md5 cache. This md5 cache determines a location of the actual content file in DVC cache directory `.dvc/cache`
>> Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format

(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [1]:
!cat data/iris.csv.dvc

cat: data/iris.csv.dvc: No such file or directory


In [11]:
!du -sh .dvc/cache/*/*

4.0K	.dvc/cache/57/fce90c81521889c736445f058c4838
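
The relationship between the md5 value and the cache location can be sketched in a few lines of Python. This is a minimal illustration based only on the paths printed in this tutorial (first two hex characters as the directory, the remaining characters as the file name); the helper names `file_md5` and `cache_path` are hypothetical, and DVC's own hashing may differ in details such as line-ending handling.

```python
import hashlib
from pathlib import PurePosixPath


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Map an MD5 hex digest to <cache_dir>/<first 2 chars>/<remaining chars>."""
    return str(PurePosixPath(cache_dir) / md5_hex[:2] / md5_hex[2:])


# md5 reported for data/iris.csv when it was added in this tutorial
print(cache_path("57fce90c81521889c736445f058c4838"))
```

The printed path should match the single entry listed by `du -sh .dvc/cache/*/*` above.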


# Create ML pipeline

Stages:
- extract features 
- split dataset 
- train 
- evaluate 


## Add feature extraction stage

In [12]:
!dvc run -f stage_feature_extraction.dvc \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running command:
	python src/featurization.py
Adding 'data/iris_featurized.csv' to 'data/.gitignore'.
Saving 'data/iris_featurized.csv' to '.dvc/cache/04/ed69383af337e9dabf934cbc8abc11'.
Saving information to 'stage_feature_extraction.dvc'.

To track the changes with git run:

	git add data/.gitignore stage_feature_extraction.dvc

In [13]:
!ls

README.md                    stage_feature_extraction.dvc
data                         tutorial.ipynb
requirements.txt             venv
src


In [14]:
!cat stage_feature_extraction.dvc

md5: 7fa7e84ef809dc5478ba9d9291915c71
cmd: python src/featurization.py
wdir: .
deps:
- md5: 33729a6a870be74dc3bc2983284a22de
  path: src/featurization.py
- md5: 57fce90c81521889c736445f058c4838
  path: data/iris.csv
outs:
- md5: 04ed69383af337e9dabf934cbc8abc11
  path: data/iris_featurized.csv
  cache: true
  metric: false
  persist: false


In [15]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [16]:
!git status -s

 M data/.gitignore
 M tutorial.ipynb
?? stage_feature_extraction.dvc


In [17]:
%%bash
git add .
git commit -m "Add stage_features_extraction"

[dvc-tutorial 3befc9f] Add stage_features_extraction
 3 files changed, 67 insertions(+), 46 deletions(-)
 create mode 100644 stage_feature_extraction.dvc


## Add split train/test stage

In [26]:
!dvc run -f stage_split_dataset.dvc \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

Running command:
	python src/split_dataset.py --test_size 0.4
Adding 'data/train.csv' to 'data/.gitignore'.
Adding 'data/test.csv' to 'data/.gitignore'.
Saving 'data/train.csv' to '.dvc/cache/3b/7d94d2e2675fb9132c42b1da980794'.
Saving 'data/test.csv' to '.dvc/cache/b3/7f00bfd2ce46c61fe744d6d26b41d3'.
Saving information to 'stage_split_dataset.dvc'.

To track the changes with git run:

	git add data/.gitignore data/.gitignore stage_split_dataset.dvc

In [27]:
!cat stage_split_dataset.dvc

md5: 334b801dc736aeb2275c48e1db84cfd7
cmd: python src/split_dataset.py --test_size 0.4
wdir: .
deps:
- md5: 5da797093e3e72ca65720df86c842a39
  path: src/split_dataset.py
- md5: 04ed69383af337e9dabf934cbc8abc11
  path: data/iris_featurized.csv
outs:
- md5: 3b7d94d2e2675fb9132c42b1da980794
  path: data/train.csv
  cache: true
  metric: false
  persist: false
- md5: b37f00bfd2ce46c61fe744d6d26b41d3
  path: data/test.csv
  cache: true
  metric: false
  persist: false


## Add train stage

In [28]:
!dvc run -f stage_train.dvc \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running command:
	python src/train.py
Adding 'data/model.joblib' to 'data/.gitignore'.
Saving 'data/model.joblib' to '.dvc/cache/d6/31150e260de432a811bfe7dd8935ec'.
Saving information to 'stage_train.dvc'.

To track the changes with git run:

	git add data/.gitignore stage_train.dvc

In [29]:
!cat stage_train.dvc

md5: 1409d57473a56f42048a84c797dbbec5
cmd: python src/train.py
wdir: .
deps:
- md5: 025acbe1552887fab33f5314d036e907
  path: src/train.py
- md5: 3b7d94d2e2675fb9132c42b1da980794
  path: data/train.csv
outs:
- md5: d631150e260de432a811bfe7dd8935ec
  path: data/model.joblib
  cache: true
  metric: false
  persist: false


## Add evaluate stage

In [30]:
!dvc run -f stage_evaluate.dvc \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

Running command:
	python src/evaluate.py
Adding 'data/eval.txt' to 'data/.gitignore'.
Saving 'data/eval.txt' to '.dvc/cache/1f/7764d988d8d251dc3e9b1c5419f58b'.
Saving information to 'stage_evaluate.dvc'.

To track the changes with git run:

	git add data/.gitignore stage_evaluate.dvc

In [31]:
!cat stage_evaluate.dvc

md5: 2c5f02b139310b839b97f2a093b802b9
cmd: python src/evaluate.py
wdir: .
deps:
- md5: 025acbe1552887fab33f5314d036e907
  path: src/train.py
- md5: 9b394d26e9427759256195b47917028b
  path: src/evaluate.py
- md5: b37f00bfd2ce46c61fe744d6d26b41d3
  path: data/test.csv
- md5: d631150e260de432a811bfe7dd8935ec
  path: data/model.joblib
outs:
- md5: 1f7764d988d8d251dc3e9b1c5419f58b
  path: data/eval.txt
  cache: true
  metric: true
  persist: false


# Metrics tracking

In [32]:
!dvc metrics show

	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}

## Commit DVC pipelines

In [33]:
!git status -s

 M data/.gitignore
 M src/split_dataset.py
 M tutorial.ipynb
?? stage_evaluate.dvc
?? stage_split_dataset.dvc
?? stage_train.dvc


In [34]:
%%bash
git add .
git commit -m "Add pipelines"

[dvc-tutorial 7304361] Add pipelines
 6 files changed, 130 insertions(+), 91 deletions(-)
 create mode 100644 stage_evaluate.dvc
 create mode 100644 stage_split_dataset.dvc
 create mode 100644 stage_train.dvc


# Reproducibility

## How does it work?

> The most exciting part of DVC is reproducibility.
>> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.

> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
>> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.

> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
>
> If you run `repro` on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
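
The idea of building a dependency graph from the DVC-files can be illustrated with a small sketch. It hard-codes the `deps` and `outs` of the four stage files created in this tutorial and uses the standard-library `graphlib` module (Python 3.9+) to order them. The function names `stage_graph` and `stages_to_rerun` are hypothetical, and the logic is a simplification: it marks every downstream stage as affected, whereas DVC compares recorded checksums to decide what really needs to run.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependencies and outputs of each stage, copied from the DVC-files in this tutorial.
STAGES = {
    "stage_feature_extraction.dvc": {
        "deps": ["src/featurization.py", "data/iris.csv"],
        "outs": ["data/iris_featurized.csv"],
    },
    "stage_split_dataset.dvc": {
        "deps": ["src/split_dataset.py", "data/iris_featurized.csv"],
        "outs": ["data/train.csv", "data/test.csv"],
    },
    "stage_train.dvc": {
        "deps": ["src/train.py", "data/train.csv"],
        "outs": ["data/model.joblib"],
    },
    "stage_evaluate.dvc": {
        "deps": ["src/train.py", "src/evaluate.py", "data/test.csv", "data/model.joblib"],
        "outs": ["data/eval.txt"],
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def stages_to_rerun(stages, changed_file):
    """Return, in execution order, the stages affected by a changed file."""
    graph = stage_graph(stages)
    order = list(TopologicalSorter(graph).static_order())
    dirty = set()
    for name in order:
        # A stage is affected if it reads the changed file directly
        # or depends on a stage that is already marked as affected.
        if changed_file in stages[name]["deps"] or graph[name] & dirty:
            dirty.add(name)
    return [name for name in order if name in dirty]


print(stages_to_rerun(STAGES, "src/featurization.py"))
print(stages_to_rerun(STAGES, "src/train.py"))
```

Under these assumptions, a change to `src/featurization.py` touches all four stages, while a change to `src/train.py` touches only the train and evaluate stages.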

In [35]:
# Nothing to reproduce
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

## Add features



### Create new experiment branch

Before editing the src/featurization.py file, create and check out a new branch __ratio_features__.

In [36]:
# create new branch

!git checkout -b ratio_features
!git branch

Switched to a new branch 'ratio_features'


### Update featurization.py

In the file __src/featurization.py__, uncomment the lines:

    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']
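
As a quick sanity check of what these two lines compute, here is a dependency-free sketch that applies the same arithmetic to the first five rows of the dataset shown earlier. The tutorial itself works on pandas columns; the plain-dict representation and the helper name `add_ratio_features` are assumptions made only for this illustration.

```python
# First five rows of data/iris.csv, as shown earlier in this tutorial.
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.9, "sepal_width": 3.0, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.7, "sepal_width": 3.2, "petal_length": 1.3, "petal_width": 0.2},
    {"sepal_length": 4.6, "sepal_width": 3.1, "petal_length": 1.5, "petal_width": 0.2},
    {"sepal_length": 5.0, "sepal_width": 3.6, "petal_length": 1.4, "petal_width": 0.2},
]


def add_ratio_features(row):
    """Return a copy of the row with the two ratio features added."""
    out = dict(row)
    out["sepal_length_to_sepal_width"] = row["sepal_length"] / row["sepal_width"]
    out["petal_length_to_petal_width"] = row["petal_length"] / row["petal_width"]
    return out


featurized = [add_ratio_features(r) for r in rows]
for r in featurized:
    print(
        round(r["sepal_length_to_sepal_width"], 6),
        round(r["petal_length_to_petal_width"], 6),
    )
```

The rounded values can be compared with the two extra columns of `data/iris_featurized.csv` after the pipeline is reproduced.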

## Reproduce pipeline 

In [38]:
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Reproducing 'stage_feature_extraction.dvc'
Running command:
	python src/featurization.py
Saving 'data/iris_featurized.csv' to '.dvc/cache/cd/9e208c0232da2fb80b4c927da35dbb'.
Saving information to 'stage_feature_extraction.dvc'.
Reproducing 'stage_split_dataset.dvc'
Running command:
	python src/split_dataset.py --test_size 0.4
Saving 'data/train.csv' to '.dvc/cache/87/43ef62798f623fbaae4401f4aab654'.
Saving 'data/test.csv' to '.dvc/cache/3d/40f0c85187dda2cd9bf58b3e916630'.
Saving information to 'stage_split_dataset.dvc'.
Reproducing 'stage_train.dvc'
Running command:
	python src/train.py
Saving 'data/model.joblib' to '.dvc/cache/d7/bb60e8f731671aa212cf137c6f1e52'.
Saving information to 'stage_train.dvc'.
Reproducing 'stage_evaluate.dvc'
Running command:
	python src/evaluate.py
Saving 'data/eval.txt' to '.dvc/cache/ef/d3a0fee43c80da52d807308c56843b'.
Saving information to 'stage_evaluate.dvc'.


In [42]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,setosa,1.457143,7.0
1,4.9,3.0,1.4,0.2,setosa,1.633333,7.0
2,4.7,3.2,1.3,0.2,setosa,1.46875,6.5
3,4.6,3.1,1.5,0.2,setosa,1.483871,7.5
4,5.0,3.6,1.4,0.2,setosa,1.388889,7.0


## Compare metrics for all runs (experiments)

In [43]:
# Metrics for the current pipeline

!dvc metrics show

	data/eval.txt: {"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}

In [51]:
# Show metrics for the working tree and all committed branches

!dvc metrics show -a

Working Tree:
	data/eval.txt: {"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}
dvc-tutorial:
	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}
ratio_features:
	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}

## Commit new results

In [59]:
!git status -s

 [31mM[m tutorial.ipynb


In [60]:
!git add .
!git commit -m "New features experiment"

[ratio_features 20edbac] New features experiment
 1 file changed, 24 insertions(+), 25 deletions(-)


# Checkout (start a new experiment)

- in case the new features do not improve the results
- or we want to improve the model by changing the hyperparameters (with the OLD dataset)

## Checkout code and data files 

In [61]:
%%bash

git checkout dvc-tutorial
dvc checkout

[##############################] 100% Checkout finished!


Switched to branch 'dvc-tutorial'


In [62]:
!git branch

* dvc-tutorial
  master
  ratio_features


In [63]:
!dvc metrics show

	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}

In [65]:
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.
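
Why nothing needs to be reproduced here can be pictured as a checksum comparison: each stage file records an md5 per dependency, and a stage only counts as changed when a file in the workspace no longer matches its recorded value. The sketch below demonstrates that idea in a throwaway directory with a hypothetical one-line script; `md5_of` and `changed_deps` are illustrative names, not DVC functions, and DVC's real check covers more than this.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Hex MD5 digest of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def changed_deps(recorded, root="."):
    """Return paths whose current MD5 differs from the recorded one (or that are missing)."""
    changed = []
    for rel_path, old_md5 in recorded.items():
        full = os.path.join(root, rel_path)
        if not os.path.exists(full) or md5_of(full) != old_md5:
            changed.append(rel_path)
    return changed


# Demo in a temporary directory with a hypothetical script file.
with tempfile.TemporaryDirectory() as root:
    script = os.path.join(root, "train.py")
    with open(script, "w") as fh:
        fh.write("C = 1.0\n")
    recorded = {"train.py": md5_of(script)}   # what a stage file would store
    unchanged = changed_deps(recorded, root)  # nothing edited yet

    with open(script, "w") as fh:
        fh.write("C = 0.1\n")                 # edit the dependency
    after_edit = changed_deps(recorded, root)

print(unchanged, after_edit)
```

After `git checkout` and `dvc checkout`, the recorded and actual checksums agree again, which corresponds to the first, empty result.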

In [66]:
!dvc metrics show

	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}

In [67]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Tune the model

In [68]:
# create new branch for experiment

!git checkout -b tuning
!git branch

M	tutorial.ipynb
Switched to a new branch 'tuning'
  dvc-tutorial
  master
  ratio_features
* tuning


### Change parameters of classifier (LogisticRegression)

In the file __src/train.py__, in the constructor of LogisticRegression:

* change the C parameter to 0.1

In the end you should have:

```python
clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```

### Reproduce pipelines

In [69]:
# Re-run the pipeline

!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Reproducing 'stage_train.dvc'
Running command:
	python src/train.py
Saving 'data/model.joblib' to '.dvc/cache/be/122f7e293f77f4f636345868dd57c2'.
Saving information to 'stage_train.dvc'.
Reproducing 'stage_evaluate.dvc'
Running command:
	python src/evaluate.py
Saving 'data/eval.txt' to '.dvc/cache/04/a34aa507199d5f1f5c73a841a76463'.
Saving information to 'stage_evaluate.dvc'.

To track the changes with git run:

	git add stage_train.dvc stage_evaluate.dvc

In [70]:
!cat data/eval.txt

{"f1_score": 0.9639376218323586, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 17, 0], [0, 2, 18]]}}
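
The reported `f1_score` can be cross-checked against the confusion matrix stored in the same file. The sketch below recomputes an unweighted (macro) average of the per-class F1 scores. That the score is macro-averaged is an inference from the numbers in this tutorial, not something confirmed by `src/evaluate.py`, and `macro_f1` is a hypothetical helper.

```python
import json

# Contents of data/eval.txt after tuning, copied from the cell output above.
eval_json = (
    '{"f1_score": 0.9639376218323586, "confusion_matrix": '
    '{"classes": ["setosa", "versicolor", "virginica"], '
    '"matrix": [[23, 0, 0], [0, 17, 0], [0, 2, 18]]}}'
)


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores for a square confusion matrix."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        row_total = sum(matrix[k])                       # all cells in row k
        col_total = sum(matrix[i][k] for i in range(n))  # all cells in column k
        denom = row_total + col_total                    # equals 2*TP + FP + FN
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


metrics = json.loads(eval_json)
recomputed = macro_f1(metrics["confusion_matrix"]["matrix"])
print(recomputed, metrics["f1_score"])
```

Because per-class F1 is symmetric in false positives and false negatives, the result does not depend on whether rows hold the true or the predicted classes.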

In [71]:
!dvc metrics show -a

Working Tree:
	data/eval.txt: {"f1_score": 0.9639376218323586, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 17, 0], [0, 2, 18]]}}
dvc-tutorial:
	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}
ratio_features:
	data/eval.txt: {"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}
tuning:
	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}
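
To compare the experiments at a glance, the scores can be ranked against the baseline. The sketch below uses the `f1_score` values copied from the output above. Note that `tuning` still shows the baseline value at this point, presumably because the tuned results exist only in the working tree until they are committed.

```python
# f1_score per revision, copied from the `dvc metrics show -a` output above.
f1_by_revision = {
    "Working Tree": 0.9639376218323586,
    "dvc-tutorial": 0.7861833464670345,
    "ratio_features": 0.8084886128364389,
    "tuning": 0.7861833464670345,
}

baseline = f1_by_revision["dvc-tutorial"]
# Sort revisions from best to worst score.
ranked = sorted(f1_by_revision.items(), key=lambda item: item[1], reverse=True)

for revision, f1 in ranked:
    print(f"{revision:15s} f1={f1:.4f}  delta vs dvc-tutorial={f1 - baseline:+.4f}")

best_revision, best_f1 = ranked[0]
```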

In [72]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### Commit

In [73]:
%%bash

git add .
git commit -m "Tune model. C=0.1"

[tuning 857cbe5] Tune model. C=0.1
 4 files changed, 154 insertions(+), 147 deletions(-)


### Merge the model to dvc-tutorial

In [74]:
%%bash

git checkout dvc-tutorial
git merge tuning

Updating 7304361..857cbe5
Fast-forward
 src/train.py       |   2 +-
 stage_evaluate.dvc |   8 +-
 stage_train.dvc    |   6 +-
 tutorial.ipynb     | 285 +++++++++++++++++++++++++++--------------------------
 4 files changed, 154 insertions(+), 147 deletions(-)


Switched to branch 'dvc-tutorial'


# Share data

## Set up remote storage (e.g. cloud)

In [75]:
# Create new remote

!dvc remote add -d local /tmp/dvc

Setting 'local' as a default remote.

In [76]:
# As you can see, .dvc/config has changed

!git status -s

 M .dvc/config


In [77]:
# check config file 

!cat .dvc/config

['remote "local"']
url = /tmp/dvc
[core]
remote = local


In [78]:
%%bash

git add .
git commit -m "Add remote storage"

[dvc-tutorial 3fc009c] Add remote storage
 1 file changed, 4 insertions(+)


## Push data to remote

In [79]:
# Push data to remote

!dvc push

Preparing to upload data to '/tmp/dvc'
Preparing to collect status from /tmp/dvc
[##############################] 100% Collecting information
[##############################] 100% Analysing status.
(1/5): [##############################] 100% data/train.csv
(2/5): [##############################] 100% data/eval.txt
(3/5): [##############################] 100% data/iris_featurized.csv
(4/5): [##############################] 100% data/test.csv
(5/5): [##############################] 100% data/model.joblib

## Pull data from remote

In [80]:
!dvc pull

Preparing to download data from '/tmp/dvc'
Preparing to collect status from /tmp/dvc
[##############################] 100% Collecting information
[##############################] 100% Analysing status.
[##############################] 100% Checkout finished!

Everything is up to date.