# Install and init DVC

Prerequisites:
- DVC and the packages from `requirements.txt` are installed (if not, see README.md for instructions)
- The project repository is a Git repo



## Create and check out branch `dvc-tutorial`

```bash
git checkout -b dvc-tutorial
```

## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

```bash
dvc init
```

## Commit changes

```bash

git add .
git commit -m "Initialize DVC"
```

# Build automated pipelines

## Create `data_load` stage


```bash
# Create `data` directory

mkdir -p data
```

```bash
# Create data_load pipeline stage

dvc run -n data_load \
    -d src/data_load.py \
    -o data/iris.csv \
    -o data/classes.json \
    -p data_load \
    python src/data_load.py \
        --config=params.yaml

```

In [11]:
%%bash

du -sh data/*

4.0K	data/classes.json
4.0K	data/iris.csv


In [4]:
# Note: `tree -I <pattern>` omits files and directories that match the pattern.

!tree -I dvc-venv

.
├── README.md
├── dvc-3-automate-experiments.ipynb
├── params.yaml
├── requirements.txt
└── src
    ├── __init__.py
    ├── data_load.py
    ├── evaluate.py
    ├── featurization.py
    ├── split_dataset.py
    └── train.py

1 directory, 10 files


## dvc.yaml

In [8]:
!cat dvc.yaml

stages:
  data_load:
    cmd: python src/data_load.py --config=params.yaml
    deps:
    - src/data_load.py
    params:
    - data_load
    outs:
    - data/classes.json
    - data/iris.csv
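
The generated `dvc.yaml` mirrors the flags passed to `dvc run`: `-n` becomes the stage name, each `-d` a `deps` entry, each `-o` an `outs` entry, and `-p` the `params` list. The following is a minimal sketch of that mapping in reverse; `dvc_run_command` is a hypothetical helper written for illustration, not part of DVC.

```python
import shlex

def dvc_run_command(name, stage):
    """Build the `dvc run` argument list for one stage entry of dvc.yaml.

    Hypothetical helper: it only covers the flags used in this tutorial.
    """
    args = ["dvc", "run", "-n", name]
    for dep in stage.get("deps", []):
        args += ["-d", dep]
    for out in stage.get("outs", []):
        args += ["-o", out]
    if stage.get("params"):
        args += ["-p", ",".join(stage["params"])]
    # The stage command goes last, split into separate shell words.
    args += shlex.split(stage["cmd"])
    return args

# The `data_load` entry exactly as it appears in dvc.yaml above.
data_load = {
    "cmd": "python src/data_load.py --config=params.yaml",
    "deps": ["src/data_load.py"],
    "params": ["data_load"],
    "outs": ["data/classes.json", "data/iris.csv"],
}

command = dvc_run_command("data_load", data_load)
print(shlex.join(command))
```

The outputs come back in the order stored in `dvc.yaml`, which may differ from the order originally typed on the command line.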


## params.yaml

In [9]:
!cat params.yaml


data_load:
  raw_data_path: data/iris.csv
  classes_names_path: data/classes.json

featurize:
  features_path: data/iris_featurized.csv
  target_column: target


data_split:
  test_size: 0.2
  train_path: data/train.csv
  test_path: data/test.csv


train:
  model_path: data/model.joblib


evaluate:
  metrics_file: data/metrics.json
  confusion_matrix: data/cm.csv


## Reproduce a pipeline
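
`dvc repro` re-runs a stage only when something the stage depends on has changed; DVC records content hashes of dependencies and outputs in `dvc.lock` for this purpose. The sketch below illustrates the idea with plain MD5 file hashes. The helper names and the temporary file are assumptions made for the example, and DVC's real bookkeeping is more involved.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_changed(deps, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(md5_of(p) != recorded.get(str(p)) for p in deps)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for src/data_load.py.
    script = Path(tmp) / "data_load.py"
    script.write_text("print('load')\n")

    # Hashes captured after the first run (the role dvc.lock plays).
    lock = {str(script): md5_of(script)}
    unchanged = stage_changed([script], lock)  # nothing edited yet

    script.write_text("print('load v2')\n")
    changed = stage_changed([script], lock)    # content differs now

print(unchanged, changed)
```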

In [10]:
!dvc repro

Stage 'data_load' is cached - skipping run, checking out outputs

## Change params.yaml and reproduce 

Add a new line to the `data_load` section of `params.yaml`:
    `dummy_param: dummy_value`
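
Because the `data_load` stage tracks the whole `data_load` section (`-p data_load`), adding a key to that section counts as a parameter change for the stage. A rough sketch of that comparison follows. The two helper functions are hypothetical, and the `feature_extraction` and `split_dataset` entries anticipate stages that are added later in this tutorial; at this point only `data_load` exists.

```python
def changed_sections(old, new):
    """Top-level params.yaml sections whose contents differ."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}

def directly_affected(stage_params, changed):
    """Stages that list at least one changed section under `params`."""
    return [name for name, sections in stage_params.items() if changed & set(sections)]

old_params = {
    "data_load": {"raw_data_path": "data/iris.csv",
                  "classes_names_path": "data/classes.json"},
    "featurize": {"features_path": "data/iris_featurized.csv",
                  "target_column": "target"},
    "data_split": {"test_size": 0.2, "train_path": "data/train.csv",
                   "test_path": "data/test.csv"},
}
# Same parameters with the extra key added to the data_load section.
new_params = {**old_params,
              "data_load": {**old_params["data_load"], "dummy_param": "dummy_value"}}

# Which params.yaml sections each stage tracks (its `-p` argument).
stage_params = {
    "data_load": ["data_load"],
    "feature_extraction": ["data_load", "featurize"],
    "split_dataset": ["featurize", "data_split"],
}

changed = changed_sections(old_params, new_params)
affected = directly_affected(stage_params, changed)
print(changed, affected)
```

This only covers stages that track the edited section directly. Stages further downstream are re-run by `dvc repro` when the outputs they consume actually change.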

In [11]:
!dvc repro

Running stage 'data_load' with command:
	python src/data_load.py --config=params.yaml
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

# Build end-to-end Machine Learning pipeline

Stages:
- extract features 
- split dataset 
- train 
- evaluate 


## Add feature extraction stage

```bash

dvc run -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    -p data_load,featurize \
    python src/featurization.py \
        --config=params.yaml

```

In [12]:
!ls 

README.md                        dvc.yaml
data                             params.yaml
dvc-3-automate-experiments.ipynb requirements.txt
dvc-venv                         src
dvc.lock


In [13]:
!cat dvc.yaml

stages:
  data_load:
    cmd: python src/data_load.py --config=params.yaml
    deps:
    - src/data_load.py
    params:
    - data_load
    outs:
    - data/classes.json
    - data/iris.csv
  feature_extraction:
    cmd: python src/featurization.py --config=params.yaml
    deps:
    - data/iris.csv
    - src/featurization.py
    params:
    - data_load
    - featurize
    outs:
    - data/iris_featurized.csv


In [14]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


## Commit changes 

```bash
# Check Git status

git status -s
```

```bash
# Commit changes 

git add .
git commit -m "Add stage feature_extraction"
```

## Add split train/test stage

```bash

dvc run -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    -p featurize,data_split \
        python src/split_dataset.py \
            --config=params.yaml
```

```bash
# Commit changes

git add .
git commit -m "Add stage split_dataset"

```

In [15]:
!cat dvc.yaml

stages:
  data_load:
    cmd: python src/data_load.py --config=params.yaml
    deps:
    - src/data_load.py
    params:
    - data_load
    outs:
    - data/classes.json
    - data/iris.csv
  feature_extraction:
    cmd: python src/featurization.py --config=params.yaml
    deps:
    - data/iris.csv
    - src/featurization.py
    params:
    - data_load
    - featurize
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --config=params.yaml
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    params:
    - data_split
    - featurize
    outs:
    - data/test.csv
    - data/train.csv


## Add train stage

```bash

dvc run -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    -p data_split,train \
        python src/train.py \
            --config=params.yaml
```

```bash
# Commit changes

git add .
git commit -m "Add stage train"

```

In [16]:
!cat dvc.yaml

stages:
  data_load:
    cmd: python src/data_load.py --config=params.yaml
    deps:
    - src/data_load.py
    params:
    - data_load
    outs:
    - data/classes.json
    - data/iris.csv
  feature_extraction:
    cmd: python src/featurization.py --config=params.yaml
    deps:
    - data/iris.csv
    - src/featurization.py
    params:
    - data_load
    - featurize
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --config=params.yaml
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    params:
    - data_split
    - featurize
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py --config=params.yaml
    deps:
    - data/train.csv
    - src/train.py
    params:
    - data_split
    - train
    outs:
    - data/model.joblib


## Add evaluate stage

```bash

dvc run -n evaluate \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -d data/classes.json \
    -m data/metrics.json \
    --plots data/cm.csv \
    -p data_load,data_split,train,evaluate \
        python src/evaluate.py \
            --config=params.yaml
```

```bash
# Commit changes

git add .
git commit -m "Add stage evaluate"
```

In [17]:
!cat dvc.yaml

stages:
  data_load:
    cmd: python src/data_load.py --config=params.yaml
    deps:
    - src/data_load.py
    params:
    - data_load
    outs:
    - data/classes.json
    - data/iris.csv
  feature_extraction:
    cmd: python src/featurization.py --config=params.yaml
    deps:
    - data/iris.csv
    - src/featurization.py
    params:
    - data_load
    - featurize
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --config=params.yaml
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    params:
    - data_split
    - featurize
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py --config=params.yaml
    deps:
    - data/train.csv
    - src/train.py
    params:
    - data_split
    - train
    outs:
    - data/model.joblib
  evaluate:
    cmd: python src/evaluate.py --config=params.yaml
    deps:
    - data/classes.json
    - data/model.jobl

# Experimenting with reproducible pipelines

## How to reproduce experiments?

> The most exciting part of DVC is reproducibility.
>> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.

> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
>> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.

> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
If you run repro on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
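
The dependency graph mentioned in the quote can be illustrated with the stages defined in this tutorial: a stage depends on another stage when one of its `deps` is that stage's output. The sketch below derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). It is an illustration of the idea, not DVC's implementation, and it treats the metrics and plots files of `evaluate` as ordinary outputs.

```python
from graphlib import TopologicalSorter

# deps/outs copied from the `dvc run` commands used in this tutorial.
stages = {
    "data_load": {
        "deps": ["src/data_load.py"],
        "outs": ["data/classes.json", "data/iris.csv"],
    },
    "feature_extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["data/classes.json", "data/model.joblib",
                 "data/test.csv", "src/evaluate.py"],
        "outs": ["data/metrics.json", "data/cm.csv"],
    },
}

# Map every output file to the stage that produces it.
producer = {out: name for name, spec in stages.items() for out in spec["outs"]}

# For each stage, collect the stages whose outputs it consumes.
graph = {
    name: {producer[dep] for dep in spec["deps"] if dep in producer}
    for name, spec in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

Source files such as `src/train.py` have no producing stage, so they act as graph inputs rather than edges.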

In [27]:
# Nothing to reproduce

!dvc repro

Stage 'data_load' didn't change, skipping
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

## Experiment 1: Add features



### Create new experiment branch

Before editing `src/featurization.py`, create and check out a new branch, `exp1-ratio-features`:

```bash
# Create new branch

git checkout -b exp1-ratio-features
git branch
```

### Update featurization.py

In `src/featurization.py`, in the function `get_features()`, after the line

```python
    features = dataset.copy()
```

add lines:

```python
    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']
```

### Reproduce pipeline 

In [21]:
!dvc repro

Stage 'data_load' didn't change, skipping
Stage 'feature_extraction' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

Stage 'split_dataset' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

Stage 'train' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

Stage 'evaluate' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

In [22]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,0,1.457143,7.0
1,4.9,3.0,1.4,0.2,0,1.633333,7.0
2,4.7,3.2,1.3,0.2,0,1.46875,6.5
3,4.6,3.1,1.5,0.2,0,1.483871,7.5
4,5.0,3.6,1.4,0.2,0,1.388889,7.0


In [23]:
!git status

On branch exp1-ratio-features
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   dvc-3-automate-experiments.ipynb
	modified:   dvc.lock
	modified:   src/featurization.py

no changes added to commit (use "git add" and/or "git commit -a")


In [24]:
# Get difference with metric from previous pipeline
!dvc metrics diff --all

Path               Metric    Old      New      Change
data/metrics.json  f1_score  0.15385  0.15385  0.0

### Commit the experiment changes

```bash
# Commit changes

git add .
git commit -m "Experiment with new features"
git tag -a "exp1_ratio_features" -m "Experiment with new features"

```

## Experiment 2: Tune Logistic Regression

### Create a new experiment branch

```bash
# Create new branch for experiment

git checkout -b exp2-tuning-logreg
git branch
```

In [35]:
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
!dvc repro

Stage 'data_load' didn't change, skipping
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

### Tuning parameters

In `src/train.py`, replace the `LogisticRegression` parameters with:

```python
    clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```
__Note__: here we changed the logistic regression hyperparameter `C` to 0.1.


https://dvc.org/doc/tutorials/get-started/experiments#tuning-parameters

### Reproduce pipelines

In [25]:
# Re-run pipeline 

!dvc repro

Stage 'data_load' didn't change, skipping
Stage 'feature_extraction' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

Stage 'split_dataset' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

Running stage 'train' with command:
	python src/train.py --config=params.yaml
Updating lock file 'dvc.lock'

Running stage 'evaluate' with command:
	python src/evaluate.py --config=params.yaml
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

In [30]:
# Get difference with metric from previous pipeline

!cat data/metrics.json

{"f1_score": 0.9305555555555555}

In [31]:
!dvc metrics show

	data/metrics.json:
		f1_score: 0.9305555555555555

In [32]:
!dvc metrics diff --all

Path               Metric    Old      New      Change
data/metrics.json  f1_score  0.15385  0.93056  0.77671

### Commit changes

```bash
# Commit changes

git add .
git commit -m "Tune model. LogisticRegression. C=0.1"
git tag -a "exp2_tuning_logreg" -m "Tune model. LogisticRegression. C=0.1"

```

## Experiment 3: Use SVM

### Create a new experiment branch

```bash
# Create a new experiment branch 

git checkout -b exp3-svm
```

### Update train.py

In `src/train.py`, replace the line

```python
    clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```

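with the line below, importing `SVC` from `sklearn.svm` at the top of the file if it is not already imported: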

```python
    clf = SVC(C=0.01, kernel='linear', gamma='scale', degree=5)
```


### Reproduce pipeline 

In [42]:
!dvc repro

Stage 'data_load' didn't change, skipping
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Running stage 'train' with command:
	python src/train.py --config=params.yaml
Updating lock file 'dvc.lock'

Running stage 'evaluate' with command:
	python src/evaluate.py --config=params.yaml
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

In [33]:
!dvc metrics show

	data/metrics.json:
		f1_score: 0.9665831244778613

In [34]:
!git status

On branch exp3-svm
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   dvc.lock
	modified:   src/train.py

no changes added to commit (use "git add" and/or "git commit -a")


### Commit changes

```bash
# Commit changes

git add .
git commit -m "Experiment 3 with SVM estimator"
git tag -a "exp3_svm" -m "Experiment 3 with SVM estimator"
```

## Merge the best experiment into the `dvc-tutorial` branch

```bash
# Merge the best experiment

git checkout dvc-tutorial 
git merge exp3_svm
```
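
The choice of `exp3_svm` follows from the `f1_score` values recorded for the three experiment tags in this tutorial. A small sketch of that selection is shown here; the dictionary is filled in by hand from the reported metrics, whereas a real workflow would read the values from `data/metrics.json` at each revision.

```python
# f1_score per experiment tag, copied from the metrics reported in this tutorial.
f1_by_tag = {
    "exp1_ratio_features": 0.15384615384615383,
    "exp2_tuning_logreg": 0.9305555555555555,
    "exp3_svm": 0.9665831244778613,
}

# Pick the tag with the highest score.
best_tag = max(f1_by_tag, key=f1_by_tag.get)
print(best_tag, round(f1_by_tag[best_tag], 5))
```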

# Compare experiments

## Compare params 

In [36]:
# Show parameter changes in the workspace relative to the last commit
# (empty output means nothing changed)

!dvc params diff

In [37]:
# --all: list all parameters, even those without changes

!dvc params diff --all

Path         Param                         Old                       New
params.yaml  data_load.classes_names_path  data/classes.json         data/classes.json
params.yaml  data_load.raw_data_path       data/iris.csv             data/iris.csv
params.yaml  data_split.test_path          data/test.csv             data/test.csv
params.yaml  data_split.test_size          0.2                       0.2
params.yaml  data_split.train_path         data/train.csv            data/train.csv
params.yaml  evaluate.confusion_matrix     data/cm.csv               data/cm.csv
params.yaml  evaluate.metrics_file         data/metrics.json         data/metrics.json
params.yaml  featurize.features_path       data/iris_featurized.csv  data/iris_featurized.csv
params.yaml  featurize.target_column       target                    target
params.yaml  train.model_path              data/model.joblib         data/model.joblib

In [49]:
!dvc params diff --show-json --all

{"params.yaml": {"data_split.test_path": {"old": "data/test.csv", "new": "data/test.csv"}, "featurize.target_column": {"old": "target", "new": "target"}, "evaluate.confusion_matrix": {"old": "data/cm.csv", "new": "data/cm.csv"}, "train.model_path": {"old": "data/model.joblib", "new": "data/model.joblib"}, "data_load.classes_names_path": {"old": "data/classes.json", "new": "data/classes.json"}, "data_split.test_size": {"old": 0.2, "new": 0.2, "diff": 0.0}, "data_load.dummy_param": {"old": "dummy_value", "new": "dummy_value"}, "data_load.raw_data_path": {"old": "data/iris.csv", "new": "data/iris.csv"}, "data_split.train_path": {"old": "data/train.csv", "new": "data/train.csv"}, "evaluate.metrics_file": {"old": "data/metrics.json", "new": "data/metrics.json"}, "featurize.features_path": {"old": "data/iris_featurized.csv", "new": "data/iris_featurized.csv"}}}

In [50]:
!dvc params diff --show-md --all

| Path        | Param                        | Old                      | New                      |
|-------------|------------------------------|--------------------------|--------------------------|
| params.yaml | data_load.classes_names_path | data/classes.json        | data/classes.json        |
| params.yaml | data_load.dummy_param        | dummy_value              | dummy_value              |
| params.yaml | data_load.raw_data_path      | data/iris.csv            | data/iris.csv            |
| params.yaml | data_split.test_path         | data/test.csv            | data/test.csv            |
| params.yaml | data_split.test_size         | 0.2                      | 0.2                      |
| params.yaml | data_split.train_path        | data/train.csv           | data/train.csv           |
| params.yaml | evaluate.confusion_matrix    | data/cm.csv              | data/cm.csv              |
| params.yaml | evaluate.metrics_file        | data/metrics.json        | data/metrics.json        |

In [38]:
# To see the difference between two specific commits, both need to be specified:

!git log

commit 336832e6c8c51861d58e258b6cf7bc5ddc750459 (HEAD -> dvc-tutorial, tag: exp3_svm)
Author: Mikhail <mnrozhkov@gmail.com>
Date:   Thu Oct 22 17:05:44 2020 +0300

    Experiment 3 with SVM estimator

commit aff5b7f5d143895108b4dac9939a9c0cd06a349d (tag: exp2_tuning_logreg, exp2-tuning-logreg)
Author: Mikhail <mnrozhkov@gmail.com>
Date:   Thu Oct 22 17:04:33 2020 +0300

    Tune model. LogisticRegression. C=0.1

commit 7ab2b518063b63742a396ca83ce6a092a260589a (tag: exp1_ratio_features, exp1-ratio-features)
Author: Mikhail <mnrozhkov@gmail.com>
Date:   Wed Oct 21 17:27:03 2020 +0300

    Experiment with new features

commit 7619688214cc3b9fe3d3b59674c07c12fc134b47
Author: Mikhail <mnrozhkov@gmail.com>
Date:   Wed Oct 21 17:24:13 2020 +0300

    Add stage evaluate

commit 2a59d083d38b1a15

In [41]:
!dvc params diff 7619688214cc3b9fe3d3b59674c07c12fc134b47 HEAD^

## Show metrics

In [42]:
# Metrics for the current pipeline

!dvc metrics show

	data/metrics.json:
		f1_score: 0.9665831244778613

In [43]:
# Show metrics for all committed pipelines (all branches and tags)

!dvc metrics show -a -T

dvc-tutorial:
	data/metrics.json:
		f1_score: 0.9665831244778613
exp1-ratio-features:
	data/metrics.json:
		f1_score: 0.15384615384615383
exp2-tuning-logreg:
	data/metrics.json:
		f1_score: 0.9305555555555555
exp3-svm:
	data/metrics.json:
		f1_score: 0.9665831244778613
exp1_ratio_features:
	data/metrics.json:
		f1_score: 0.15384615384615383
exp2_tuning_logreg:
	data/metrics.json:
		f1_score: 0.9305555555555555
exp3_svm:
	data/metrics.json:
		f1_score: 0.9665831244778613

## Compare metrics (get differences)
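
`dvc metrics diff` reports, for each metric, the old value, the new value, and their difference. The arithmetic can be sketched as follows, using the `f1_score` values recorded for `exp1-ratio-features` and `exp3-svm` in this tutorial. `metrics_diff` is a hypothetical helper, and the rounding to five decimals is an assumption chosen to match the precision shown in DVC's tables.

```python
def metrics_diff(old, new, precision=5):
    """Return (metric, old, new, change) rows for two metrics dictionaries."""
    rows = []
    for metric in sorted(old.keys() | new.keys()):
        old_value, new_value = old.get(metric), new.get(metric)
        # A change is only defined when the metric exists on both sides.
        if old_value is None or new_value is None:
            change = None
        else:
            change = round(new_value - old_value, precision)
        rows.append((metric, old_value, new_value, change))
    return rows

old_metrics = {"f1_score": 0.15384615384615383}  # exp1-ratio-features
new_metrics = {"f1_score": 0.9665831244778613}   # exp3-svm

rows = metrics_diff(old_metrics, new_metrics)
for metric, old_value, new_value, change in rows:
    print(metric, round(old_value, 5), round(new_value, 5), change)
```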


In [44]:
!dvc metrics diff

In [45]:
# --all - list all metrics, even those without changes

!dvc metrics diff --all

Path               Metric    Old      New      Change
data/metrics.json  f1_score  0.96658  0.96658  0.0

* To compare the metrics of the current commit with those of another commit, specify the other (old) commit:

In [46]:
# Compare old and new branches

!dvc metrics diff exp1-ratio-features exp3-svm

Path               Metric    Old      New      Change
data/metrics.json  f1_score  0.15385  0.96658  0.81274

In [47]:
# Equivalent to `!dvc metrics diff exp1-ratio-features dvc-tutorial`, because `dvc-tutorial` is the current branch

!dvc metrics diff exp1-ratio-features

Path               Metric    Old      New      Change
data/metrics.json  f1_score  0.15385  0.96658  0.81274

In [49]:
!dvc metrics diff exp1-ratio-features --show-md

| Path              | Metric   | Old     | New     | Change   |
|-------------------|----------|---------|---------|----------|
| data/metrics.json | f1_score | 0.15385 | 0.96658 | 0.81274  |

## Build Plots


In [50]:
from IPython.display import IFrame

### Show

In [60]:
!dvc plots show  --template confusion "data/cm.csv" -x actual -y predicted -o data/plots-show.html

file:///Users/mnrozhkov/dev/dvc/dvc-3-automate-experiments/data/plots-show.html

In [61]:
IFrame(src='data/plots-show.html', width=800, height=500)
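
The `confusion` template aggregates the rows of `data/cm.csv`, which holds one `actual`/`predicted` pair per test sample (the column names come from the `-x actual -y predicted` flags above). The same counts can be computed by hand, as in this rough sketch. The sample rows and class labels are hypothetical and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same two-column layout as data/cm.csv.
sample = """actual,predicted
setosa,setosa
setosa,setosa
versicolor,virginica
versicolor,versicolor
virginica,virginica
"""

# Count how often each (actual, predicted) pair occurs.
rows = csv.DictReader(io.StringIO(sample))
counts = Counter((row["actual"], row["predicted"]) for row in rows)

# Arrange the counts as a square matrix: rows are actual, columns are predicted.
labels = sorted({label for pair in counts for label in pair})
matrix = [[counts[(actual, predicted)] for predicted in labels] for actual in labels]

for label, line in zip(labels, matrix):
    print(f"{label:>10}", line)
```

Entries on the diagonal are correct predictions; everything off the diagonal is a misclassification.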

### Diff

In [62]:
# Build confusion-matrix plots comparing two experiments
!dvc plots diff -t confusion -o data/plots-diff.html exp1-ratio-features exp3-svm -x predicted

file:///Users/mnrozhkov/dev/dvc/dvc-3-automate-experiments/data/plots-diff.html

In [64]:
IFrame(src='data/plots-diff.html', width=800, height=500)