# Install and init DVC

Prerequisites:
- DVC and the packages from requirements.txt are installed (if not, see README.md for instructions)
- The project repository is a Git repository

## Install with pip

In [None]:
!pip install dvc

## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

In [5]:
!dvc init

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

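The error above appears when `dvc init` is run in a repository that already contains a `.dvc` directory, for example when the cell is re-executed. A minimal sketch of a guard that skips initialization in that case; the function names are hypothetical and not part of DVC:

```python
import subprocess
from pathlib import Path


def needs_dvc_init(repo_dir="."):
    """Return True when repo_dir has no .dvc directory yet."""
    return not (Path(repo_dir) / ".dvc").is_dir()


def init_dvc_if_needed(repo_dir="."):
    """Run `dvc init` only in a repository that DVC has not initialized."""
    if needs_dvc_init(repo_dir):
        # check=True raises CalledProcessError if DVC reports a failure.
        subprocess.run(["dvc", "init"], cwd=repo_dir, check=True)
        return True
    return False
```

Calling `init_dvc_if_needed()` in a cell is then safe to repeat: it returns `False` and does nothing once `.dvc` exists.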
## Commit changes

In [2]:
%%bash

git add .
git commit -m "Initialize DVC"

[dvc-tutorial b2b5dd4] Initialize DVC
 5 files changed, 89 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json


## Review Files and Directories created by DVC

In [12]:
!ls -a .dvc

.          ..         .gitignore config     plots      tmp


In [14]:
!cat .dvc/.gitignore

/config.local
/tmp
/cache


# Quick Tour of DVC features

## Data Versioning

In [27]:
# Load the Iris dataset and save it as CSV (the data/ directory must already exist)

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
list(data.target_names)
data.frame.to_csv('data/iris.csv', index=False)

In [12]:
# Look at the data

data.frame.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [14]:
%%bash

du -sh data/*

4.0K	data/iris.csv


## Add file under DVC control

In [15]:
%%bash

dvc add data/iris.csv


To track the changes with git, run:

	git add data/.gitignore data/iris.csv.dvc


In [19]:
!du -sh data/*

4.0K	data/iris.csv
4.0K	data/iris.csv.dvc


In [20]:
!git status -s data/

?? data/.gitignore
?? data/iris.csv.dvc


In [9]:
%%bash

git add .
git commit -m "Add a source dataset"

[dvc-tutorial db97f80] Add a source dataset
 2 files changed, 4 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc


### What is a DVC-file?

Data file internals


> If you take a look at the DVC-file, you will see that only outputs are defined in outs.
> In this file, only one output is defined. The output contains the data file path in the repository and md5 cache.
> This md5 cache determines a location of the actual content file in DVC cache directory .dvc/cache
>
> > Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format



(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [10]:
!cat data/iris.csv.dvc

outs:
- md5: 57fce90c81521889c736445f058c4838
  path: iris.csv
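
The relationship between the `md5` value and the cache location can be sketched in a few lines. This assumes the cache layout of the DVC version used in this notebook, where the first two characters of the hash name a subdirectory of `.dvc/cache`; newer DVC releases may use a different layout. The function names are hypothetical, and `file_md5` hashes raw bytes, so its result may differ from DVC's value if DVC normalizes line endings before hashing.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5, cache_dir=".dvc/cache"):
    """Map an md5 value from a DVC-file to its assumed location in the cache."""
    return Path(cache_dir) / md5[:2] / md5[2:]


# The md5 value is the one recorded in data/iris.csv.dvc above.
print(cache_path_for("57fce90c81521889c736445f058c4838"))
```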


## Create and Reproduce ML pipelines

Stages 
- extract features 
- split dataset 
- train 
- evaluate 


### Add a pipeline stage with 'dvc run'

In [36]:
!dvc run -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

ERROR: Stage 'feature_extraction' already exists in 'dvc.yaml'.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

In [37]:
!ls 

Lesson 2.ipynb       dvc.lock             src
README.md            dvc.yaml             tutorial-Copy1.ipynb
data                 requirements.txt     venv


In [41]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv


In [46]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [43]:
!git status -s

 M README.md
 M requirements.txt
 D tutorial.ipynb
?? Lesson 2.ipynb
?? data/.gitignore
?? data/iris.csv.dvc
?? dvc.lock
?? dvc.yaml
?? tutorial-Copy1.ipynb


In [44]:
%%bash
git add .
git commit -m "Add stage features_extraction"

[dvc-tutorial 9500262] Add stage features_extraction
 8 files changed, 2688 insertions(+), 142 deletions(-)
 create mode 100644 Lesson 2.ipynb
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 rename tutorial.ipynb => tutorial-Copy1.ipynb (92%)


### Add a train/test split stage (via dvc.yaml)

In [49]:
# !dvc run -n split_dataset \
#     -d src/split_dataset.py \
#     -d data/iris_featurized.csv \
#     -o data/train.csv \
#     -o data/test.csv \
#     python src/split_dataset.py --test_size 0.4

Running command:                                                        
	python src/split_dataset.py --test_size 0.4
                                                                        
To track the changes with git, run:

	git add dvc.yaml dvc.lock data/.gitignore

In [72]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv


### Reproduce pipeline

In [73]:
!dvc repro split_dataset

ERROR: 'split_dataset' does not exist.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

## Collaborate on ML Experiments 

### Specify remote storage (local directory, e.g. /tmp/dvc)


In [53]:
# Add code here
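
One possible way to fill in this cell, as a sketch only. It builds the `dvc remote add -d <remote name> <remote url>` command that DVC itself suggests when no remote is configured. The remote name `local` is a hypothetical choice, `/tmp/dvc` is the directory named in the heading, and both helper functions are illustrative.

```python
import shlex
import subprocess


def remote_add_command(name, url, default=True):
    """Build the `dvc remote add` command as an argument list."""
    cmd = ["dvc", "remote", "add"]
    if default:
        cmd.append("-d")  # mark this remote as the default one
    return cmd + [name, url]


def add_remote(name, url, default=True):
    """Run the command in the current repository; raises if DVC reports an error."""
    subprocess.run(remote_add_command(name, url, default), check=True)


# "local" is a hypothetical remote name; /tmp/dvc is the directory from the heading.
print(shlex.join(remote_add_command("local", "/tmp/dvc")))
```

Calling `add_remote("local", "/tmp/dvc")` writes the remote to `.dvc/config`, so that file should be committed with Git afterwards.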

### Push features to remote storage

In [52]:
!dvc push

ERROR: failed to push data to the cloud - config file error: no remote specified. Create a default remote with
    dvc remote add -d <remote name> <remote url>

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

### Check out your teammate's experiment state

In [54]:
%%bash 

git checkout experiment-1

dvc checkout

### Check Metrics

In [24]:
!dvc metrics show

	data/eval.txt:
		f1_score: 0.7861833464670345

### Reproduce experiment

In [27]:
# Nothing to reproduce
!dvc repro

Stage 'data/iris.csv.dvc' didn't change, skipping                               
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

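Every stage is skipped here because none of its dependencies changed since the last run. The idea can be sketched as a comparison between recorded and current file hashes. This is a simplified illustration under the assumption that a hash per dependency was saved after the last successful run (DVC keeps such hashes in `dvc.lock`); it is not DVC's implementation, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Hex MD5 digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(recorded):
    """Return the dependency paths whose current hash differs from the recorded one.

    `recorded` maps a path to the md5 saved after the last successful run.
    A missing file counts as changed.
    """
    return [
        path
        for path, old_md5 in recorded.items()
        if not Path(path).exists() or md5_of(path) != old_md5
    ]


def should_rerun(recorded):
    """A stage is rerun only if at least one dependency changed."""
    return bool(changed_deps(recorded))
```

Passing `-f` to `dvc repro`, as in the next cell, bypasses this kind of check and reruns the stages regardless.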
In [25]:
!dvc repro -f

 M data/.gitignore
 M dvc.lock
 M dvc.yaml
 M tutorial.ipynb


In [None]:
# Check Metrics

!dvc metrics show