# Install and init DVC

Prerequisites:
-  DVC and the packages from requirements.txt are installed (if not, see README.md for instructions)
-  The project directory is a Git repository
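
The two prerequisites can be checked from a notebook cell before going further. The sketch below is only an illustration: `check_prerequisites` is a hypothetical helper, and it assumes that "installed" means a `dvc` executable is on `PATH` and that the repository root contains a `.git` entry.

```python
import shutil
import tempfile
from pathlib import Path


def check_prerequisites(repo_dir="."):
    """Report whether each prerequisite appears to be met (hypothetical helper)."""
    repo = Path(repo_dir)
    return {
        # True if a `dvc` executable can be found on PATH
        "dvc_installed": shutil.which("dvc") is not None,
        # True if the directory has a `.git` entry, i.e. it is a Git repo root
        "is_git_repo": (repo / ".git").exists(),
    }


# Demo on a throwaway directory that is not a Git repository
with tempfile.TemporaryDirectory() as tmp:
    status = check_prerequisites(tmp)
    print(status["is_git_repo"])  # False: no .git entry was created
```

Calling `check_prerequisites(".")` from the project root should report `True` for both keys before `dvc init` is run.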



## Install with pip

In [17]:
!pip install "dvc>=1.0.0a6"



## Create and check out branch `dvc-tutorial`

In [19]:
!git checkout -b dvc-tutorial

Switched to a new branch 'dvc-tutorial'


## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

In [20]:
!dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

## Commit changes

In [21]:
%%bash

git add .
git commit -m "Initialize DVC"

[dvc-tutorial 789d024] Initialize DVC
 5 files changed, 89 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json


## Get data 

In [22]:
# Get data 

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
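list(data.target_names)  # no visible effect: only a cell's last expression is displayed
data.frame.to_csv('data/iris.csv')  # the index is written as an unnamed first column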

In [23]:
# Look at the data

data.frame.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [24]:
%%bash

du -sh data/*

4.0K	data/iris.csv


## Add data under DVC control

In [25]:
!dvc add data/iris.csv
!git add data/.gitignore data/iris.csv.dvc
!git commit -m "add raw data"

100% Add|██████████████████████████████████████████████|1/1 [00:02,  2.46s/file]

To track the changes with git, run:

	git add data/iris.csv.dvc data/.gitignore
[dvc-tutorial 57d46fa] add raw data
 2 files changed, 4 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc
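
After `dvc add`, Git tracks the small `data/iris.csv.dvc` file rather than the data itself; in DVC 1.x that file records an MD5 hash of the tracked file. The sketch below shows the general idea of content hashing, assuming a plain byte-for-byte MD5. `file_md5` is a hypothetical helper, and the digest it returns may not match the one DVC stores, since DVC may normalize text files before hashing.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: editing a single value in the file changes its hash
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    before = file_md5(sample)
    sample.write_bytes(b"a,b\n1,3\n")
    after = file_md5(sample)
    print(before != after)  # True
```

Because the hash depends only on the content, comparing a stored hash with a freshly computed one is enough to tell whether the data changed.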


# Build end-to-end Machine Learning pipeline
Stages:
- extract features 
- split dataset 
- train 
- evaluate 


## Add feature extraction stage

In [26]:
!dvc run -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running stage 'feature_extraction' with command:                                
	python src/featurization.py
Creating 'dvc.yaml'                                                             
Adding stage 'feature_extraction' to 'dvc.yaml'
Generating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock

In [27]:
!ls 

data	  dvc.yaml	  README.md	    src
dvc.lock  Lesson 4.ipynb  requirements.txt  venv-lesson4


In [28]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv


In [29]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0.1,Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,0,5.1,3.5,1.4,0.2,0
1,1,4.9,3.0,1.4,0.2,0
2,2,4.7,3.2,1.3,0.2,0
3,3,4.6,3.1,1.5,0.2,0
4,4,5.0,3.6,1.4,0.2,0


In [30]:
!git status -s

 M data/.gitignore
?? dvc.lock
?? dvc.yaml


In [31]:
%%bash
git add .
git commit -m "Add stage features_extraction"

[dvc-tutorial 6e5d19d] Add stage features_extraction
 3 files changed, 19 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


## Add split train/test stage

In [32]:
!dvc run -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

Running stage 'split_dataset' with command:                                     
	python src/split_dataset.py --test_size 0.4
Adding stage 'split_dataset' to 'dvc.yaml'                                      
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock data/.gitignore dvc.yaml

In [33]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
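
The stage command passes `--test_size 0.4` to `src/split_dataset.py`, whose source is not shown here. The sketch below is a hypothetical stand-in that only illustrates how such a flag could be parsed and applied; `split_rows`, `parse_args`, the default value and the fixed seed are all assumptions, and the real script may work differently.

```python
import argparse
import random


def split_rows(rows, test_size, seed=42):
    """Shuffle rows and split them into (train, test) by the given fraction."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def parse_args(argv=None):
    """Parse the --test_size flag used in the stage command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--test_size", type=float, default=0.4)
    return parser.parse_args(argv)


# Demo with row indices standing in for the 150 Iris rows
args = parse_args(["--test_size", "0.4"])
train, test = split_rows(range(150), args.test_size)
print(len(train), len(test))  # 90 60
```

A fixed seed matters for a DVC pipeline: without it, every run would produce different `train.csv` and `test.csv` files even when nothing upstream changed.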


## Add train stage

In [34]:
!dvc run -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running stage 'train' with command:                                             
	python src/train.py
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Adding stage 'train' to 'dvc.yaml'                                              
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock dvc.yaml

In [35]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib


## Add evaluate stage

In [36]:
!dvc run -n evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

Running stage 'evaluate' with command:                                          
	python src/evaluate.py
Adding stage 'evaluate' to 'dvc.yaml'                                           
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock dvc.yaml

In [37]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - data/model.joblib
    - data/test.csv
    - src/evaluate.py
    - src/train.py
    metrics:
    - data/eval.txt
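
The `deps` and `outs` entries in the `dvc.yaml` above are enough to derive the order in which the stages must run. The sketch below copies them into a plain dictionary (treating the `metrics` file as an output) and sorts the stages topologically with the standard-library `graphlib` module (Python 3.9+). `execution_order` is a hypothetical helper and is not how DVC itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage definitions copied from dvc.yaml; metrics are treated as outputs
stages = {
    "feature_extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["data/model.joblib", "data/test.csv",
                 "src/evaluate.py", "src/train.py"],
        "outs": ["data/eval.txt"],
    },
}


def execution_order(stages):
    """Return stage names so that each stage comes after its producers."""
    # Map every output path to the stage that produces it
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on each stage that produces one of its dependencies
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
# ['feature_extraction', 'split_dataset', 'train', 'evaluate']
```

Source files such as `src/train.py` have no producing stage, so they appear in the graph only as plain inputs.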


# Experimenting with reproducible pipelines

## How to reproduce experiments?

> The most exciting part of DVC is reproducibility.
> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.
>
> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.
>
> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
> If you run repro on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
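
The idea behind `dvc repro` can be sketched as a comparison between the hashes recorded at the last run and the hashes of the files as they are now. The code below is a simplified model, not DVC's implementation: `stale_stages` is a hypothetical helper, the hash strings are placeholders, and any stage downstream of a rerun stage is conservatively marked for rerun as well.

```python
# Stage definitions copied from dvc.yaml; metrics are treated as outputs
STAGES = {
    "feature_extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["data/model.joblib", "data/test.csv",
                 "src/evaluate.py", "src/train.py"],
        "outs": ["data/eval.txt"],
    },
}
ORDER = ["feature_extraction", "split_dataset", "train", "evaluate"]


def stale_stages(locked, current):
    """Return the stages to rerun, given recorded and current file hashes."""
    stale = []
    dirty = set()  # outputs of stages that are already marked for rerun
    for name in ORDER:
        changed = any(
            dep in dirty or locked.get(dep) != current.get(dep)
            for dep in STAGES[name]["deps"]
        )
        if changed:
            stale.append(name)
            dirty.update(STAGES[name]["outs"])
    return stale


# Placeholder hashes for the files edited by hand
locked = {"data/iris.csv": "h0", "src/featurization.py": "h0",
          "src/split_dataset.py": "h0", "src/train.py": "h0",
          "src/evaluate.py": "h0"}

print(stale_stages(locked, locked))  # []
print(stale_stages(locked, {**locked, "src/train.py": "h1"}))
# ['train', 'evaluate']
```

With nothing changed the list is empty, which corresponds to every stage being skipped; editing `src/featurization.py` instead would mark all four stages.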

In [38]:
# Nothing to reproduce
!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

## Experiment 1: Add features



### Create new experiment branch

Before editing `src/featurization.py`, create and check out a new branch __ratio_features__.

In [39]:
# create new branch

!git checkout -b ratio_features
!git branch

M	data/.gitignore
M	dvc.lock
M	dvc.yaml
Switched to a new branch 'ratio_features'
  dev
  develop-tutorial
  dvc-tutorial
  master
* ratio_features


### Update featurization.py

In __src/featurization.py__, uncomment the lines:

    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']
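
The two new features are plain ratios of existing columns. As a quick sanity check, the sketch below applies the same arithmetic to the first three Iris rows without pandas; the short dictionary keys mirror the snippet above and `add_ratio_features` is a hypothetical helper, not code from `src/featurization.py`.

```python
# First three Iris rows, with short column names as in the snippet above
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.9, "sepal_width": 3.0, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.7, "sepal_width": 3.2, "petal_length": 1.3, "petal_width": 0.2},
]


def add_ratio_features(row):
    """Return a copy of the row with the two ratio features added."""
    out = dict(row)
    out["sepal_length_to_sepal_width"] = row["sepal_length"] / row["sepal_width"]
    out["petal_length_to_petal_width"] = row["petal_length"] / row["petal_width"]
    return out


for row in rows:
    feats = add_ratio_features(row)
    # Round to six decimals to smooth out floating-point noise
    print(round(feats["sepal_length_to_sepal_width"], 6),
          round(feats["petal_length_to_petal_width"], 6))
# 1.457143 7.0
# 1.633333 7.0
# 1.46875 6.5
```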

### Reproduce pipeline 

In [41]:
!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Running stage 'feature_extraction' with command:
	python src/featurization.py
Updating lock file 'dvc.lock'                                                   

Running stage 'split_dataset' with command:
	python src/split_dataset.py --test_size 0.4
Updating lock file 'dvc.lock'                                                   

Running stage 'train' with command:
	python src/train.py
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Updating lock file 'dvc.lock'                                                   

Running stage 'evaluate' with command:
	python src/evaluate.py
Updating lock file 'dvc.lock'                   

In [42]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0.1,Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,sepal_length_to_sepal_width,petal_length_to_petal_width
0,0,5.1,3.5,1.4,0.2,0,1.457143,7.0
1,1,4.9,3.0,1.4,0.2,0,1.633333,7.0
2,2,4.7,3.2,1.3,0.2,0,1.46875,6.5
3,3,4.6,3.1,1.5,0.2,0,1.483871,7.5
4,4,5.0,3.6,1.4,0.2,0,1.388889,7.0


In [44]:
!git status

On branch ratio_features
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   Lesson 4.ipynb
	modified:   data/.gitignore
	modified:   dvc.lock
	modified:   dvc.yaml
	modified:   src/featurization.py

no changes added to commit (use "git add" and/or "git commit -a")


In [45]:
!git add .
!git commit -m "Experiment with new features"
!git tag -a "exp1-new-features" -m "Experiment with new features"

[ratio_features 88b8f38] Experiment with new features
 5 files changed, 851 insertions(+), 54 deletions(-)


## Experiment 2: Use SVM

### Create new experiment branch

In [47]:
!git checkout -b exp2-svm
!git branch

Switched to a new branch 'exp2-svm'
  dev
  develop-tutorial
  dvc-tutorial
* exp2-svm
  master
  ratio_features


### Update train.py

In __src/train.py__, comment out the line:

```python
    clf = LogisticRegression(C=0.01, solver='lbfgs', multi_class='multinomial', max_iter=100)
```

and uncomment the line:

```python
    # clf = SVC(C=0.1, kernel='linear', gamma='scale', degree=5)
```


### Reproduce pipeline 

In [48]:
!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Running stage 'train' with command:
	python src/train.py
Updating lock file 'dvc.lock'                                                   

Running stage 'evaluate' with command:
	python src/evaluate.py
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

	git add dvc.lock

In [49]:
!git status

On branch exp2-svm
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   dvc.lock
	modified:   src/train.py

no changes added to commit (use "git add" and/or "git commit -a")


In [50]:
!git add .
!git commit -m "Experiment 2 with SVM estimator"
!git tag -a "exp2-svm" -m "Experiment 2 with SVM estimator"

[exp2-svm 02af5c1] Experiment 2 with SVM estimator
 3 files changed, 84 insertions(+), 12 deletions(-)


## Experiment 3: Tune Logistic Regression

### Create a new experiment branch

In [52]:
# create new branch for experiment

!git checkout -b tuning
!git branch

Switched to a new branch 'tuning'
  dev
  develop-tutorial
  dvc-tutorial
  exp2-svm
  master
  ratio_features
* tuning


In [62]:
!dvc metrics show

	data/eval.txt:                                                                 
		f1_score: 1.0

In [57]:
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

### Tuning parameters

In __src/train.py__, switch back to the `LogisticRegression` constructor:

* comment out the line:
```python
    clf = SVC(C=0.1, kernel='linear', gamma='scale', degree=5)
```
* uncomment the line:
```python
    # clf = LogisticRegression(C=0.01, solver='lbfgs', multi_class='multinomial', max_iter=100)
```
* change the `C` parameter to 0.1

In the end you should get:

```python
clf = LogisticRegression(C=0.1, solver='lbfgs', multi_class='multinomial', max_iter=100)
```

https://dvc.org/doc/tutorials/get-started/experiments#tuning-parameters

### Reproduce pipelines

In [63]:
# re-run pipeline 

!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Running stage 'train' with command:
	python src/train.py
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Updating lock file 'dvc.lock'                                                   

Running stage 'evaluate' with command:
	python src/evaluate.py
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

	git add dvc.lock

In [64]:
!cat data/eval.txt

{"f1_score": 1.0, "confusion_matrix": {"classes": [0, 1, 2], "matrix": [[23, 0, 0], [0, 19, 0], [0, 0, 18]]}}
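
The metrics file is plain JSON, so it can be inspected directly. The sketch below parses the content shown above and recomputes a per-class F1 score from the confusion matrix. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes; how `src/evaluate.py` averages the per-class scores is not shown, and `per_class_f1` is a hypothetical helper.

```python
import json

# Content of data/eval.txt as printed above
raw = ('{"f1_score": 1.0, "confusion_matrix": {"classes": [0, 1, 2], '
       '"matrix": [[23, 0, 0], [0, 19, 0], [0, 0, 18]]}}')
report = json.loads(raw)
matrix = report["confusion_matrix"]["matrix"]


def per_class_f1(matrix):
    """Compute F1 for each class from a square confusion matrix."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                       # rest of row k
        fp = sum(matrix[i][k] for i in range(n)) - tp  # rest of column k
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


print(sum(map(sum, matrix)))  # 60 test samples, i.e. 0.4 * 150
print(per_class_f1(matrix))   # [1.0, 1.0, 1.0]
```

With a purely diagonal matrix every averaging scheme gives the same result, which is consistent with the reported `f1_score` of 1.0.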

In [65]:
!dvc metrics show -a

exp2-svm:                                                                       
	data/eval.txt:
		f1_score: 1.0
ratio_features:
	data/eval.txt:
		f1_score: 1.0
tuning:
	data/eval.txt:
		f1_score: 1.0

### Commit

In [66]:
%%bash

git add .
git commit -m "Tune model. LogisticRegression. C=0.1"
git tag -a "exp3-tune-logreg" -m "Tune model. LogisticRegression. C=0.1"

[tuning b1b4e56] Tune model. LogisticRegression. C=0.1
 3 files changed, 113 insertions(+), 91 deletions(-)


### Merge the tuning branch into dvc-tutorial

In [67]:
%%bash

git checkout dvc-tutorial
git merge tuning

Updating 6e5d19d..b1b4e56
Fast-forward
 Lesson 4.ipynb       | 1109 +++++++++++++++++++++++++++++++++++++++++++-------
 data/.gitignore      |    4 +
 dvc.lock             |   40 +-
 dvc.yaml             |   24 ++
 src/featurization.py |    4 +-
 src/train.py         |    2 +-
 6 files changed, 1037 insertions(+), 146 deletions(-)


Switched to branch 'dvc-tutorial'


# Compare experiment results

## Compare metrics for all runs (experiments)

In [68]:
# metrics for the current pipeline

!dvc metrics show

	data/eval.txt:                                                                 
		f1_score: 1.0

In [72]:
# show metrics for all committed pipelines (all branches and tags)

!dvc metrics show -a -T

workspace:                                                                      
	data/eval.txt:
		f1_score: 1.0
dvc-tutorial, tuning:
	data/eval.txt:
		f1_score: 1.0
exp2-svm, exp2-svm:
	data/eval.txt:
		f1_score: 1.0
ratio_features:
	data/eval.txt:
		f1_score: 1.0
exp1-new-features:
	data/eval.txt:
		f1_score: 1.0
exp3-tune-logreg:
	data/eval.txt:
		f1_score: 1.0