# Install and init DVC

Prerequisites:
- DVC and the packages from requirements.txt are installed (if not, see the README.md file for instructions)
- The project repository is a Git repository



## Install with pip

In [None]:
!pip install "dvc>=1.0.0a6"

## Create and check out branch `dvc-tutorial`

In [None]:
!git checkout -b dvc-tutorial

## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

In [None]:
!dvc init

## Commit changes

In [None]:
%%bash

git add .
git commit -m "Initialize DVC"

## Get data 

In [None]:
# Get data 

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
list(data.target_names)
data.frame.to_csv('data/iris.csv', index=False)

In [None]:
# Look at the data

data.frame.head()

In [None]:
%%bash

du -sh data/*

## Add data under DVC control

In [None]:
!dvc add data/iris.csv
!git add data/.gitignore data/iris.csv.dvc
!git commit -m "add raw data"
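
`dvc add` keeps the data file out of Git and commits only a small `data/iris.csv.dvc` file that identifies the data by a content hash. The sketch below shows how such a hash can be computed with the standard library. The helper name `file_md5` is hypothetical, and the digest may differ from the one DVC records if DVC normalizes the file (for example, its line endings) before hashing.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read fixed-size chunks so large data files never need to fit in memory.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    csv_path = Path("data/iris.csv")
    if csv_path.exists():
        # Compare this value with the hash stored in data/iris.csv.dvc.
        print(csv_path, file_md5(csv_path))
```

Because the hash depends only on the file content, re-saving identical data leaves the `.dvc` file unchanged, while any edit to the data produces a new hash that Git can track.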

# Build end-to-end Machine Learning pipeline

Stages:
- extract features
- split dataset
- train
- evaluate


## Add feature extraction stage

In [None]:
!dvc run -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py \
        --raw-dataset=data/iris.csv \
        --featurized-dataset=data/iris_featurized.csv
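
The stage command passes the input and output paths to the script as command-line options, and the same paths are declared to DVC with `-d` and `-o`. As a rough illustration of the script side, here is a minimal `argparse` sketch. The function `parse_args` and the help texts are assumptions made for this example; the actual `src/featurization.py` may parse its options differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two options used by the feature_extraction stage command."""
    parser = argparse.ArgumentParser(description="Hypothetical featurization CLI")
    parser.add_argument("--raw-dataset", required=True, help="path to the input CSV")
    parser.add_argument("--featurized-dataset", required=True, help="path to the output CSV")
    return parser.parse_args(argv)


# Same options as in the `dvc run` command above.
args = parse_args([
    "--raw-dataset=data/iris.csv",
    "--featurized-dataset=data/iris_featurized.csv",
])
# argparse exposes "--raw-dataset" as the attribute "raw_dataset".
print(args.raw_dataset, "->", args.featurized_dataset)
```

Keeping the paths on the command line, rather than hard-coded in the script, means the dependencies and outputs declared to DVC and the files the script touches stay visible in one place.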

In [None]:
!ls 

In [None]:
!cat dvc.yaml

In [None]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

In [None]:
!git status -s

In [None]:
%%bash
git add .
git commit -m "Add stage feature_extraction"

## Add split train/test stage

In [None]:
!dvc run -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
        python src/split_dataset.py \
            --featurized-dataset=data/iris_featurized.csv \
            --train-dataset=data/train.csv \
            --test-dataset=data/test.csv \
            --test-size=0.4
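
With `--test-size=0.4`, 40% of the rows go to the test set. The sketch below illustrates that arithmetic with a seeded shuffle from the standard library. It is a stand-in under stated assumptions, not the code in `src/split_dataset.py`: the function `split_rows` and the seed are hypothetical, and the real script may use a library splitter instead.

```python
import random


def split_rows(rows, test_size, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed makes the split identical on every run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))  # stand-in for the 150 rows of the Iris dataset
train, test = split_rows(rows, test_size=0.4)
print(len(train), len(test))  # 90 60
```

A fixed seed matters for reproducibility: if the split changed on every run, `data/train.csv` and `data/test.csv` would change too, and every downstream stage would be re-executed.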

In [None]:
!cat dvc.yaml

In [None]:
%%bash
git add .
git commit -m "Add stage split_dataset"

## Add train stage

In [None]:
!dvc run -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
        python src/train.py \
            --train-dataset=data/train.csv \
            --model=data/model.joblib

In [None]:
!cat dvc.yaml

In [None]:
%%bash
git add .
git commit -m "Add stage train"

## Add evaluate stage

In [None]:
!dvc run -n evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
        python src/evaluate.py \
            --raw-dataset=data/iris.csv \
            --test-dataset=data/test.csv \
            --model=data/model.joblib \
            --eval-report=data/eval.txt

In [None]:
!cat dvc.yaml

In [None]:
%%bash
git add .
git commit -m "Add stage evaluate"
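
The `-m` option marks a stage output as a metrics file, which `dvc metrics show` reads later. DVC documents JSON and YAML as metrics formats, so an evaluation script can write a flat JSON object. The sketch below is an assumption about how that could look, not the content of `src/evaluate.py`; `accuracy`, `write_metrics`, and the toy labels are hypothetical.

```python
import json
import os
import tempfile


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def write_metrics(path, metrics):
    """Write a dict of metric names and values as a JSON report."""
    with open(path, "w") as fobj:
        json.dump(metrics, fobj, indent=2, sort_keys=True)


# Toy labels for illustration only; these are not results from the Iris pipeline.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]

# Write to a temporary directory so the sketch does not touch the project files.
report_path = os.path.join(tempfile.mkdtemp(), "eval_report")
write_metrics(report_path, {"accuracy": accuracy(y_true, y_pred)})
with open(report_path) as fobj:
    print(fobj.read())
```

The path given to `-m` must be the file the script actually writes, otherwise the stage fails because the declared output is missing.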

# Experimenting with reproducible pipelines

## How to reproduce experiments?

> The most exciting part of DVC is reproducibility.
>> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.

> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
>> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.

> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
If you run repro on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
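
To make the dependency graph concrete, the sketch below rebuilds the four stages defined earlier as plain Python data and works out which stages would have to run again after a file changes. It is a simplified model, not DVC's implementation: `STAGES`, `stage_order`, and `stale_stages` are hypothetical names, and the model assumes that a re-run stage always changes its outputs. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs as declared in the `dvc run` commands above.
STAGES = {
    "feature_extraction": {
        "deps": ["src/featurization.py", "data/iris.csv"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["src/split_dataset.py", "data/iris_featurized.csv"],
        "outs": ["data/train.csv", "data/test.csv"],
    },
    "train": {
        "deps": ["src/train.py", "data/train.csv"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["src/train.py", "src/evaluate.py", "data/test.csv", "data/model.joblib"],
        "outs": ["data/eval.txt"],
    },
}


def stage_order(stages):
    """Return stage names so that each stage follows the stages it depends on."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # Map each stage to the set of stages that produce its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


def stale_stages(stages, changed_files):
    """Return the stages to re-run, in order, after the given files changed."""
    dirty = set(changed_files)
    stale = []
    for name in stage_order(stages):
        if dirty.intersection(stages[name]["deps"]):
            stale.append(name)
            # Simplification: treat every output of a re-run stage as changed.
            dirty.update(stages[name]["outs"])
    return stale


print(stale_stages(STAGES, []))                        # []
print(stale_stages(STAGES, ["src/train.py"]))          # ['train', 'evaluate']
print(stale_stages(STAGES, ["src/featurization.py"]))  # all four stages
```

An empty change list yields no stale stages, which corresponds to `dvc repro` reporting that there is nothing to reproduce.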

In [None]:
# Nothing to reproduce
!dvc repro

## Experiment 1: Add features



### Create new experiment branch

Before editing the src/featurization.py file, create and check out a new branch __exp1-ratio-features__

In [None]:
# create new branch

!git checkout -b exp1-ratio-features
!git branch

### Update featurization.py

In file __featurization.py__, in the function `get_features()`, after the line

```python
    features = dataset.copy()
```

add lines:

```python
    features['sepal_length_to_sepal_width'] = features['sepal length (cm)'] / features['sepal width (cm)']
    features['petal_length_to_petal_width'] = features['petal length (cm)'] / features['petal width (cm)']
```

### Reproduce pipeline 

In [None]:
!dvc repro

In [None]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

In [None]:
!git status

In [None]:
!git add .
!git commit -m "Experiment with new features"
!git tag -a "exp1_ratio_features" -m "Experiment with new features"

## Experiment 2: Use SVM

### Create new experiment branch

In [None]:
!git checkout -b exp2-svm
!git branch

### Update train.py

In file __train.py__, replace the line

```python
    clf = LogisticRegression(C=0.00001, solver='lbfgs', multi_class='multinomial', max_iter=100)
```

with line

```python
    clf = SVC(C=0.01, kernel='linear', gamma='scale', degree=5)
```


### Reproduce pipeline 

In [None]:
!dvc repro

In [None]:
!git status

In [None]:
!git add .
!git commit -m "Experiment 2 with SVM estimator"
!git tag -a "exp2_svm" -m "Experiment 2 with SVM estimator"

## Experiment 3: Tune Logistic Regression

### Create a new experiment branch

In [None]:
# create new branch for experiment

!git checkout -b exp3-tuning-logreg
!git branch

In [None]:
!dvc metrics show

In [None]:
# Nothing to reproduce: the new branch starts from exp2-svm,
# so neither the code nor the data files have changed
!dvc repro

### Tuning parameters

In file __train.py__:

* replace line:
```python
    clf = SVC(C=0.01, kernel='linear', gamma='scale', degree=5)
```
* with line:
```python
    clf = LogisticRegression(C=0.01, solver='lbfgs', multi_class='multinomial', max_iter=100)
```
* change the parameters: set `C` to 0.1 and `solver` to `newton-cg`

In the end you should get:

```python
clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```

https://dvc.org/doc/tutorials/get-started/experiments#tuning-parameters

### Reproduce pipelines

In [None]:
# re-run pipeline 

!dvc repro

In [None]:
!cat data/eval.txt

In [None]:
!dvc metrics show -a

### Commit

In [None]:
%%bash

git add .
git commit -m "Tune model. LogisticRegression. C=0.1, solver=newton-cg"
git tag -a "exp3_tuning_logreg" -m "Tune model. LogisticRegression. C=0.1, solver=newton-cg"

### Merge the model to dvc-tutorial

In [None]:
%%bash

git checkout dvc-tutorial
git merge exp3-tuning-logreg

# Compare experiment results

## Compare metrics for all runs (experiments)

In [None]:
# metrics for the current pipeline

!dvc metrics show

In [None]:
# show metrics for all committed pipelines (all branches and tags)

!dvc metrics show -a -T