# Install and init DVC

Prerequisites:
- DVC and the packages from requirements.txt are installed (if not, see README.md for instructions)
- The project repository is a Git repo

## Install with pip

In [None]:
# !pip install dvc==2.10.1

## Create and check out branch `experiments`

In [None]:
!git checkout -b experiments

## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

In [None]:
!dvc init

## Commit changes

In [None]:
%%bash

git add .
git commit -m "Initialize DVC"

## Review Files and Directories created by DVC

In [None]:
!ls -a .dvc 

In [None]:
!cat .dvc/.gitignore

# Quick Tour of DVC features

## Data Versioning

In [1]:
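# Get data: load the Iris dataset as a DataFrame and save it as CSV

import os

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
print(list(data.target_names))  # class names behind the numeric `target` column
os.makedirs('data', exist_ok=True)  # make sure the output directory exists
data.frame.to_csv('data/iris.csv', index=False)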

In [2]:
# Look at the data

data.frame.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [None]:
%%bash

du -sh data/*

## Add file under DVC control

In [None]:
%%bash

dvc add data/iris.csv

In [None]:
!du -sh data/*

In [None]:
!git status -s data/

In [None]:
%%bash

git add .
git commit -m "Add a source dataset"

### What is a DVC-file?

Data file internals


> If you take a look at the DVC-file, you will see that only outputs are defined in `outs`.
> In this file, only one output is defined. The output contains the data file path in the repository and md5 cache.
> This md5 cache determines a location of the actual content file in DVC cache directory `.dvc/cache`.
>
> > Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format



(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [None]:
!cat data/iris.csv.dvc
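To make the link between the `md5` value and the cache location concrete, here is a minimal sketch. It assumes the DVC 2.x cache layout, in which the first two hex characters of the hash name a subdirectory of `.dvc/cache` and the remaining characters name the file. The helpers `md5_of_file` and `cache_path_for` are hypothetical names for illustration, not part of DVC's API, and DVC's own hashing may differ in details, so the computed value is not guaranteed to match the one stored in `data/iris.csv.dvc`.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex, cache_dir=".dvc/cache"):
    """Map a hash to a cache path (assumed layout: <cache>/<first 2 chars>/<rest>)."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


# Demo on a throwaway file so the sketch runs outside the project as well.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    file_hash = md5_of_file(sample)
    print(file_hash)
    print(cache_path_for(file_hash).as_posix())
```

Inside the project, calling `cache_path_for` with the `md5` value printed by `cat data/iris.csv.dvc` should point at the cached copy of `data/iris.csv`, provided the layout assumption holds for your DVC version.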

## Create and Reproduce ML pipelines 

Stages 
- extract features 
- split dataset 
- train 
- evaluate 


### Add a pipeline stage with `dvc stage add`

In [None]:
!dvc stage add \
    -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

In [None]:
!cat dvc.yaml

### Add split train/test stage

In [None]:
!dvc stage add \
    -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

In [None]:
!cat dvc.yaml

### Add train stage

In [None]:
!dvc stage add \
    -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

In [None]:
!cat dvc.yaml

### Add evaluate stage

In [None]:
!dvc stage add \
    -n evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

In [None]:
!cat dvc.yaml
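The four stages are connected only through their files: a stage must run after any stage that produces one of its dependencies. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+). The `STAGES` dictionary is copied by hand from the `dvc stage add` calls in this notebook, and `execution_order` is a hypothetical helper, not DVC's implementation.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+

# Dependencies (-d) and outputs (-o / -m) of each stage, copied by hand.
STAGES = {
    "feature_extraction": {
        "deps": ["src/featurization.py", "data/iris.csv"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["src/split_dataset.py", "data/iris_featurized.csv"],
        "outs": ["data/train.csv", "data/test.csv"],
    },
    "train": {
        "deps": ["src/train.py", "data/train.csv"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["src/train.py", "src/evaluate.py",
                 "data/test.csv", "data/model.joblib"],
        "outs": ["data/eval.txt"],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages that produce its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
# → ['feature_extraction', 'split_dataset', 'train', 'evaluate']
```

Source files such as `src/train.py` have no producing stage, so they add no edges to the graph; they only matter when checking whether a stage needs to be re-run.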

### Run DVC pipeline (all stages)

In [None]:
!dvc repro

In [None]:
!ls 

In [None]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

In [None]:
!git status -s

In [None]:
%%bash
git add .
git commit -m "Create DVC pipeline"

### Reproduce the pipeline up to a specific stage

In [None]:
!dvc repro split_dataset
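Passing a stage name to `dvc repro` limits the run to that stage and the stages it depends on; downstream stages such as `train` and `evaluate` are left alone. A minimal sketch of that selection follows, assuming the stage relationships defined in this notebook. `UPSTREAM` and `stages_to_check` are hypothetical names used only for illustration.

```python
# Direct upstream stages of each stage, written out by hand from the pipeline.
UPSTREAM = {
    "feature_extraction": [],
    "split_dataset": ["feature_extraction"],
    "train": ["split_dataset"],
    "evaluate": ["split_dataset", "train"],
}


def stages_to_check(target, upstream=UPSTREAM):
    """Return the target and everything it depends on, upstream stages first."""
    ordered = []

    def visit(name):
        # Visit parents first so they appear before the stage itself.
        for parent in upstream[name]:
            visit(parent)
        if name not in ordered:
            ordered.append(name)

    visit(target)
    return ordered


print(stages_to_check("split_dataset"))
# → ['feature_extraction', 'split_dataset']
```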

## Collaborate on ML Experiments 

### Specify remote storage (a local directory, `/tmp/dvc`)


In [None]:
!dvc remote add -d local /tmp/dvc

### Push tracked data to remote storage

In [None]:
!dvc push

### Create tag `experiment-1`

In [None]:
!git tag -a experiment-1 -m "experiment-1"

### Check out your teammate's experiment state

In [None]:
%%bash 

git checkout experiment-1
dvc checkout

### Check Metrics

In [None]:
!dvc metrics show

### Reproduce experiment

In [None]:
# Nothing to reproduce: no dependencies have changed since the last run
!dvc repro
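`dvc repro` has nothing to do here because the hashes recorded for each stage (DVC keeps them in `dvc.lock`) still match the files on disk. The sketch below shows that kind of comparison in simplified form, assuming a plain MD5 over the file bytes. `file_md5` and `changed_deps` are hypothetical helpers, and the demo uses a throwaway file instead of the real `dvc.lock`.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Hex MD5 digest of a file's bytes (fine for small files)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(recorded):
    """Return paths that are missing or whose hash differs from the recorded one."""
    changed = []
    for path, old_hash in recorded.items():
        if not Path(path).exists() or file_md5(path) != old_hash:
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "train.csv"
    dep.write_text("a,b\n1,2\n")
    recorded = {str(dep): file_md5(dep)}   # snapshot taken after a successful run

    print(changed_deps(recorded))          # nothing changed: empty list
    dep.write_text("a,b\n1,3\n")           # edit the dependency
    print(changed_deps(recorded))          # the edited path is now reported
```

A stage with an empty result would be skipped, while any reported path would mark the stage (and the stages downstream of it) for re-execution.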

In [None]:
# Force all stages to run again, even though nothing has changed
!dvc repro -f

In [None]:
# Check Metrics
!dvc metrics show