# DVC data tracking

## Add DVC remote storage

In [3]:
! dvc doctor

DVC version: 1.11.15 (pip)
---------------------------------
Platform: Python 3.7.6 on Darwin-18.7.0-x86_64-i386-64bit
Supports: hdfs, http, https
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Repo: dvc, git

In [5]:
! dvc remote add myremote gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP

In [9]:
! dvc remote list

myremote	gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP

In [12]:
! dvc doctor

DVC version: 1.11.15 (pip)
---------------------------------
Platform: Python 3.7.6 on Darwin-18.7.0-x86_64-i386-64bit
Supports: hdfs, http, https
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: gdrive
Repo: dvc, git

Notice that the remote storage is now listed in the DVC config:

In [14]:
! cat .dvc/config

['remote "myremote"']
    url = gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP
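
The config file uses an INI-like syntax, so it can also be inspected from Python. The sketch below is one possible way to list the configured remotes with the standard `configparser` module. The helper name `list_remotes` is hypothetical, and the sample text is copied from the output above rather than read from disk.

```python
import configparser
import re

# Sample text copied from the `cat .dvc/config` output above.
SAMPLE_CONFIG = """\
['remote "myremote"']
    url = gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP
"""


def list_remotes(config_text):
    """Return {remote_name: url} for every remote section in the text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    remotes = {}
    for section in parser.sections():
        # Section names look like: 'remote "myremote"' (quotes included).
        match = re.fullmatch(r"'?remote\s+\"([^\"]+)\"'?", section)
        if match:
            remotes[match.group(1)] = parser[section]["url"]
    return remotes


print(list_remotes(SAMPLE_CONFIG))
```

To run it against a real project, pass the contents of `.dvc/config` instead of `SAMPLE_CONFIG`.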


Do not forget to commit this change to Git:

In [10]:
! git commit .dvc/config -m "Configure local remote"

[master 76bc73d] Configure local remote
 1 file changed, 2 insertions(+)


## Add a folder with data to DVC

In [15]:
! ls data/processed

features.feather      target.feather        user_features.feather


In [16]:
! dvc add data/processed

Adding...
100% Add|██████████████████████████████████████████████|1/1 [00:01,  1.03s/file]

To track the changes with git, run:

	git add data/processed.dvc

Now look at the metafile that describes this folder:

In [18]:
! cat data/processed.dvc

outs:
- md5: ea8b8d7ea046a9f40691fda47247189b.dir
  size: 267733246
  nfiles: 4
  path: processed


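Also, the data is now stored in `.dvc/cache`, where each file is named after its MD5 hash (the first two characters form the directory name):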

In [22]:
! tree .dvc/cache

.dvc/cache
├── 16
│   └── 5404e51882c0f44029968ee161e16a
├── 60
│   └── 252c4d3b15ae011aa1ab373dd13fa1
├── d3
│   └── 8f5c849227f93171eccb777043e526
├── d4
│   └── 1d8cd98f00b204e9800998ecf8427e
└── ea
    └── 8b8d7ea046a9f40691fda47247189b.dir

5 directories, 5 files
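
The relationship between a hash and its location in the cache can be sketched in a few lines of Python. This is a minimal illustration, not DVC's own code: the helper names `file_md5` and `cache_path` are hypothetical, and DVC may preprocess some files (for example text files) before hashing, so `file_md5` is not guaranteed to reproduce every hash DVC records.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """MD5 of a file's raw bytes, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5, cache_dir=".dvc/cache"):
    """First two hash characters name the directory, the rest the file."""
    return Path(cache_dir) / md5[:2] / md5[2:]


# The hash recorded in data/processed.dvc maps onto the tree shown above.
print(cache_path("ea8b8d7ea046a9f40691fda47247189b.dir").as_posix())

# The MD5 of empty content starts with d41d8cd9, matching the d4/1d8cd9...
# entry (presumably the empty .gitkeep file; that attribution is a guess).
print(hashlib.md5(b"").hexdigest())
```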


In [25]:
! git add data/processed.dvc -f

## Push added folder to remote

In [1]:
! dvc remote list

myremote	gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP

In [32]:
! dvc push -r myremote

ERROR: failed to push data to the cloud - URL 'gdrive://18jPWh1VHDjd2sdBRqMnNvm_8uOmUhfoP' is supported but requires these missing dependencies: ['pydrive2']. To install dvc with those dependencies, run:

	pip install 'dvc[gdrive]'

See <https://dvc.org/doc/install> for more info.

In [None]:
! pip install 'dvc[gdrive]'

In [None]:
! dvc push -r myremote

  0% Querying remote cache|                          |0/1 [00:00<?,     ?file/s]
Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Enter verification code: 

In [32]:
! git commit -m "Add a folder data/processed to DVC"

[master cd80bc6] Add a folder data/processed to DVC
 1 file changed, 5 insertions(+)
 create mode 100644 data/processed.dvc


## Making changes to data

In [33]:
! ls data/processed

features.feather      target.feather        user_features.feather


The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tracked files between HEAD (last Git commit) and the current workspace (uncommitted changes, if any):

In [37]:
! dvc diff

In [39]:
! dvc diff HEAD^1

Added:
    data/processed/
    data/processed/.gitkeep
    data/processed/features.feather
    data/processed/target.feather
    data/processed/user_features.feather

files summary: 4 added, 0 deleted, 0 modified, 0 not in cache

In [38]:
! dvc status

Data and pipelines are up to date.

A new cache subdirectory `b7` containing a `.dir` file appeared.

Add a new file to the tracked folder:

In [43]:
! ls data/processed

features.feather             target.feather
scoring_target_added.feather user_features.feather


In [45]:
! dvc diff

Added:
    data/processed/scoring_target_added.feather

Modified:
    data/processed/

files summary: 1 added, 0 deleted, 0 modified, 0 not in cache
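
Conceptually, this report comes from comparing two listings of the tracked folder, each mapping a relative path to a content hash. The following is a simplified sketch of that comparison, not DVC's implementation. The function name `diff_listings` and the placeholder hashes are made up for illustration.

```python
def diff_listings(old, new):
    """Compare two {relative_path: md5} mappings of a tracked folder."""
    added = sorted(path for path in new if path not in old)
    deleted = sorted(path for path in old if path not in new)
    modified = sorted(
        path for path in new if path in old and old[path] != new[path]
    )
    return {"added": added, "deleted": deleted, "modified": modified}


# Placeholder hashes ("h1", "h2", ...) stand in for real MD5 values.
head = {
    "features.feather": "h1",
    "target.feather": "h2",
    "user_features.feather": "h3",
}
workspace = dict(head)
workspace["scoring_target_added.feather"] = "h4"

print(diff_listings(head, workspace))
```

Files whose hash is unchanged appear in none of the three lists, which is why only the new file is reported.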

In [46]:
! dvc status

data/processed.dvc:
	changed outs:
		modified:           data/processed

In [15]:
! git commit -m "Add a new file to data/processed which is tracked by DVC"

[master d66f7c2] Add a new file to data/processed which is tracked by DVC
 2 files changed, 15 insertions(+), 108 deletions(-)
 delete mode 100644 requrements.txt


Add yet another file, this time a `.txt`, to the DVC-tracked folder:

In [28]:
! dvc diff

Added:
    data/processed/dvc_test_data.txt
    data/processed/scoring_target_added.feather

Modified:
    data/processed/

files summary: 2 added, 0 deleted, 0 modified, 0 not in cache

In [29]:
! dvc status

data/processed.dvc:
	changed outs:
		modified:           data/processed

In [30]:
! cat data/processed.dvc

outs:
- md5: ea8b8d7ea046a9f40691fda47247189b.dir
  size: 267733246
  nfiles: 4
  path: processed


In [31]:
! ls data/processed

dvc_test_data.txt            target.feather
features.feather             user_features.feather
scoring_target_added.feather


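**DO NOT FORGET TO RUN `dvc add` TO TRACK CHANGES!** Until then, `data/processed.dvc` keeps the old hash and `nfiles: 4`, as the `cat` output above shows.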
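
Why the stored hash goes stale can be illustrated with a toy directory fingerprint. The sketch below hashes the sorted list of (relative path, file MD5) pairs. It is an assumption-laden simplification and does not reproduce the exact format of DVC's `.dir` files; the function name `directory_fingerprint` is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_fingerprint(root):
    """Hash the sorted (relative path, file MD5) pairs under root."""
    root = Path(root)
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        file_hash = hashlib.md5(path.read_bytes()).hexdigest()
        entry = f"{path.relative_to(root).as_posix()}:{file_hash}\n"
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "target.feather").write_bytes(b"placeholder")
    before = directory_fingerprint(folder)

    # Adding a file changes the fingerprint, so a stored hash goes stale.
    (folder / "dvc_test_data.txt").write_text("hello")
    after = directory_fingerprint(folder)

print(before != after)
```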

In [34]:
! dvc add data/processed

Adding...
100% Add|██████████████████████████████████████████████|1/1 [00:00,  1.32file/s]

To track the changes with git, run:

	git add data/processed.dvc

In [35]:
! git add data/processed.dvc

In [36]:
! git status

On branch master
Your branch is ahead of 'origin/master' by 4 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   data/processed.dvc

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	2.1-antong_step4_dvc.ipynb



In [37]:
! git commit -m "Add a new .TXT file to data/processed which is tracked by DVC"

[master 719cb56] Add a new .TXT file to data/processed which is tracked by DVC
 1 file changed, 3 insertions(+), 3 deletions(-)


## Switch between data versions

In [38]:
! git status

On branch master
Your branch is ahead of 'origin/master' by 5 commits.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	2.1-antong_step4_dvc.ipynb

nothing added to commit but untracked files present (use "git add" to track)


In [39]:
! git log -5

commit 719cb56ce0d98c5666f04983d54debce2d9591dc (HEAD -> master)
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 14:22:00 2021 +0100

    Add a new .TXT file to data/processed which is tracked by DVC

commit d66f7c20adbeea7bbd99684b564cd74f6ef9ec77
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 13:56:22 2021 +0100

    Add a new file to data/processed which is tracked by DVC

commit cd80bc6274b4e0ec9c58cfdf32c315f5b65d0c73
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 22:58:51 2021 +0100

    Add a folder data/processed to DVC

commit 76bc73dc6b879719e5817ec1ccf7e93ed7bca38b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:43:51 2021 +0100

    Configure local remote

commit 3c8d8cb9766220742025a1f6aa37c8da19e9fd7b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:02:22 2021 +0100

    Initialize DVC


There are two ways to see how the data differs between commits:
- Compare a commit against the current state of the workspace
- Compare two commits by their IDs
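
Both forms follow the same `dvc diff [a_rev] [b_rev]` pattern, where an omitted `b_rev` means the current workspace. As a small sketch, the argument list for either form could be assembled from Python like this. The helper name `dvc_diff_command` is hypothetical, and the revisions are the ones used in this notebook.

```python
def dvc_diff_command(a_rev=None, b_rev=None):
    """Build the argument list for `dvc diff [a_rev] [b_rev]`."""
    if b_rev is not None and a_rev is None:
        raise ValueError("b_rev requires a_rev")
    command = ["dvc", "diff"]
    command += [rev for rev in (a_rev, b_rev) if rev is not None]
    return command


# 1) An older commit against the current workspace.
print(" ".join(dvc_diff_command("HEAD^1")))
# 2) Two commits identified by abbreviated hashes (tags work as well).
print(" ".join(dvc_diff_command("cd80bc627", "719cb56ce")))
```

The returned list can be handed to `subprocess.run(command, check=True)` inside a repository where DVC is installed.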

In [42]:
! dvc diff HEAD^1

Added:
    data/processed/dvc_test_data.txt
    data/processed/scoring_target_added.feather

Modified:
    data/processed/

files summary: 2 added, 0 deleted, 0 modified, 0 not in cache

In [43]:
! dvc diff cd80bc627 719cb56ce

Added:
    data/processed/dvc_test_data.txt
    data/processed/scoring_target_added.feather

Modified:
    data/processed/

files summary: 2 added, 0 deleted, 0 modified

### Git tags

First, add git tags to conveniently refer to commits with different data. 

Add an annotated tag to the current commit:

In [45]:
! git tag -a v0.2 -m "dvc tag #2 - added new files"

View the added tag.

Simple view:

In [46]:
! git tag

v0.2


View the tags together with their messages:

In [47]:
! git tag -n

v0.2            dvc tag #2 - added new files


Add an annotated tag to an older commit:

In [48]:
! git tag -a v0.1 cd80bc6274b -m "dvc tag #1 - initial dvc commit"

In [49]:
! git tag -n

v0.1            dvc tag #1 - initial dvc commit
v0.2            dvc tag #2 - added new files


Check the difference between two commits (experiments):

In [50]:
! dvc diff v0.1 v0.2

Added:
    data/processed/dvc_test_data.txt
    data/processed/scoring_target_added.feather

Modified:
    data/processed/

files summary: 2 added, 0 deleted, 0 modified

In [51]:
! git log -5

commit 719cb56ce0d98c5666f04983d54debce2d9591dc (HEAD -> master, tag: v0.2)
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 14:22:00 2021 +0100

    Add a new .TXT file to data/processed which is tracked by DVC

commit d66f7c20adbeea7bbd99684b564cd74f6ef9ec77
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 13:56:22 2021 +0100

    Add a new file to data/processed which is tracked by DVC

commit cd80bc6274b4e0ec9c58cfdf32c315f5b65d0c73 (tag: v0.1)
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 22:58:51 2021 +0100

    Add a folder data/processed to DVC

commit 76bc73dc6b879719e5817ec1ccf7e93ed7bca38b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:43:51 2021 +0100

    Configure local remote

commit 3c8d8cb9766220742025a1f6aa37c8da19e9fd7b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:02:22

Switch between experiments and see the corresponding data:

1. Current experiment (v0.2):

In [53]:
! ls data/processed

dvc_test_data.txt            target.feather
features.feather             user_features.feather
scoring_target_added.feather


2. Switch to the old experiment (v0.1):

In [54]:
! git checkout v0.1

Note: checking out 'v0.1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at cd80bc6 Add a folder data/processed to DVC


In [57]:
! git log -3

commit cd80bc6274b4e0ec9c58cfdf32c315f5b65d0c73 (HEAD, tag: v0.1)
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 22:58:51 2021 +0100

    Add a folder data/processed to DVC

commit 76bc73dc6b879719e5817ec1ccf7e93ed7bca38b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:43:51 2021 +0100

    Configure local remote

commit 3c8d8cb9766220742025a1f6aa37c8da19e9fd7b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:02:22 2021 +0100

    Initialize DVC


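**DO NOT FORGET TO RUN `dvc checkout`** (`git checkout` only switches the `data/processed.dvc` metafile; `dvc checkout` then updates the files in `data/processed` to match it):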

In [59]:
! dvc checkout

M	data/processed/

In [34]:
! ls data/processed

dvc_test_data.txt            target.feather
features.feather             user_features.feather
scoring_target_added.feather
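
What `dvc checkout` does to a tracked folder can be modelled roughly as follows: remove files that the requested version does not list, then restore every listed file from the cache. This is a simplified sketch under stated assumptions. It copies files, whereas DVC may link them, and the function name `checkout`, the short fake hash, and the temporary paths are all invented for the example.

```python
import shutil
import tempfile
from pathlib import Path


def checkout(listing, cache_dir, workspace):
    """Make workspace contain exactly the files named in listing.

    listing maps relative paths to hashes; cache_dir stores each file
    under <hash[:2]>/<hash[2:]>, mirroring the layout of .dvc/cache.
    """
    cache_dir, workspace = Path(cache_dir), Path(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    # Drop files that the requested version does not contain.
    for path in [p for p in workspace.rglob("*") if p.is_file()]:
        if path.relative_to(workspace).as_posix() not in listing:
            path.unlink()
    # Restore every listed file from the cache.
    for relpath, file_hash in listing.items():
        target = workspace / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache_dir / file_hash[:2] / file_hash[2:], target)


with tempfile.TemporaryDirectory() as tmp:
    cache, work = Path(tmp, "cache"), Path(tmp, "processed")
    (cache / "ab").mkdir(parents=True)
    (cache / "ab" / "cdef").write_text("target v0.1")
    work.mkdir()
    (work / "dvc_test_data.txt").write_text("only in v0.2")

    checkout({"target.feather": "abcdef"}, cache, work)
    restored = sorted(p.name for p in work.iterdir())

print(restored)
```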


**Return** to the latest commit:

In [61]:
! git checkout -

Previous HEAD position was cd80bc6 Add a folder data/processed to DVC
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 5 commits.
  (use "git push" to publish your local commits)


In [62]:
! git log -5

commit 719cb56ce0d98c5666f04983d54debce2d9591dc (HEAD -> master, tag: v0.2)
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 14:22:00 2021 +0100

    Add a new .TXT file to data/processed which is tracked by DVC

commit d66f7c20adbeea7bbd99684b564cd74f6ef9ec77
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 13:56:22 2021 +0100

    Add a new file to data/processed which is tracked by DVC

commit cd80bc6274b4e0ec9c58cfdf32c315f5b65d0c73 (tag: v0.1)
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 22:58:51 2021 +0100

    Add a folder data/processed to DVC

commit 76bc73dc6b879719e5817ec1ccf7e93ed7bca38b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:43:51 2021 +0100

    Configure local remote

commit 3c8d8cb9766220742025a1f6aa37c8da19e9fd7b
Author: AntonGusarov <gusarov@kth.se>
Date:   Wed Feb 10 21:02:22

# DVC automated pipelines

In [19]:
! dvc doctor

DVC version: 1.11.15 (pip)
---------------------------------
Platform: Python 3.7.6 on Darwin-18.7.0-x86_64-i386-64bit
Supports: gdrive, hdfs, http, https
Cache types: reflink, hardlink, symlink
Caches: local
Remotes: gdrive
Repo: dvc, git

In [33]:
%%bash
dvc run -n load_data_stage \
        -d src/data/load.py -d src/data/process.py -d src/utils/logging.py \
        -p config/params.yaml:base,data_load \
        -o data/processed \
        python -m src.pipelines.load_data --config=config/params.yaml

ERROR: output 'data/processed' is already specified in stage: 'data/processed.dvc'.


CalledProcessError: Command 'b'dvc run -n load_data_stage \\\n        -d src/data/load.py -d src/data/process.py -d src/utils/logging.py \\\n        -p config/params.yaml:base,data_load \\\n        -o data/processed \\\n        python -m src.pipelines.load_data --config=config/params.yaml\n'' returned non-zero exit status 1.

The problem is that the `-o` folder `data/processed` is already tracked by DVC through `data/processed.dvc`, as the error message says.
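
The rule behind this error is that an output path can be owned by only one stage or `.dvc` file. A minimal sketch of such a check is shown below. It only compares identical paths, whereas DVC also considers overlapping paths, and the function name `find_output_conflicts` is hypothetical.

```python
from collections import defaultdict


def find_output_conflicts(outputs_by_owner):
    """Return {output_path: [owners]} for outputs claimed more than once."""
    owners = defaultdict(list)
    for owner, outputs in outputs_by_owner.items():
        for output in outputs:
            owners[output.rstrip("/")].append(owner)
    return {path: names for path, names in owners.items() if len(names) > 1}


conflicts = find_output_conflicts({
    "data/processed.dvc": ["data/processed"],  # created by `dvc add`
    "load_data_stage": ["data/processed"],     # requested with `-o`
})
print(conflicts)
```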

Try again to create the pipeline stage, this time without the `-o` option:

In [28]:
%%bash
dvc run -n load_data_stage \
        -d src/data/load.py -d src/data/process.py -d src/utils/logging.py \
        -p config/params.yaml:base,data_load \
        python -m src.pipelines.load_data --config=config/params.yaml

Running stage 'load_data_stage' with command:
	python -m src.pipelines.load_data --config=config/params.yaml
2021-02-12 18:33:00,230 — DATA_LOAD — INFO — Load dataset
2021-02-12 18:33:00,465 — DATA_LOAD — INFO — Process target
2021-02-12 18:33:00,491 — DATA_LOAD — INFO — Process dataset
2021-02-12 18:33:00,786 — DATA_LOAD — INFO — Save processed data and target
2021-02-12 18:33:00,984 — DATA_LOAD — DEBUG — Processed data path: data/processed/user_features.feather
2021-02-12 18:33:00,984 — DATA_LOAD — DEBUG — Processed data path: data/processed/target.feather
Creating 'dvc.yaml'
Adding stage 'load_data_stage' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock


Update git repo accordingly:

In [31]:
! git status

On branch dvc_pipelines
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	2.1-antong_step4_dvc.ipynb
	dvc.lock
	dvc.yaml

nothing added to commit but untracked files present (use "git add" to track)


In [32]:
! git add dvc.yaml dvc.lock

In [33]:
! git status

On branch dvc_pipelines
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   dvc.lock
	new file:   dvc.yaml

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	2.1-antong_step4_dvc.ipynb



Inspect the created files:

In [29]:
! cat dvc.yaml

stages:
  load_data_stage:
    cmd: python -m src.pipelines.load_data --config=config/params.yaml
    deps:
    - src/data/load.py
    - src/data/process.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - data_load


In [30]:
! cat dvc.lock

load_data_stage:
  cmd: python -m src.pipelines.load_data --config=config/params.yaml
  deps:
  - path: src/data/load.py
    md5: 0a5f96acb043c689f0ed53ec95c89c91
    size: 435
  - path: src/data/process.py
    md5: 50961f7de0d85f0141cff8623921318a
    size: 908
  - path: src/utils/logging.py
    md5: 65c8fc45ee7ec9baf85c1aa9050b27ed
    size: 1059
  params:
    config/params.yaml:
      base:
        project_dir: .
        random_state: 42
        log_level: DEBUG
      data_load:
        target: data/raw/target.feather
        dataset: data/raw/user_features.feather
        target_processed: data/processed/target.feather
        dataset_processed: data/processed/user_features.feather
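
The `params` section of the lock file stores the values of the tracked keys, so a later run can tell whether any of them changed. The sketch below shows one way such a comparison could work, including dotted keys like `featurize.features_path`. The function names are hypothetical, the locked values are copied (in part) from the output above, and the edited seed is an invented example.

```python
def select_param(params, dotted_key):
    """Resolve a key such as 'featurize.features_path' in a nested dict."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(current, locked, keys):
    """List the tracked keys whose current value differs from the lock."""
    return [
        key for key in keys
        if select_param(current, key) != select_param(locked, key)
    ]


# Values copied from the dvc.lock output above (data_load abbreviated).
locked = {
    "base": {"project_dir": ".", "random_state": 42, "log_level": "DEBUG"},
    "data_load": {"target": "data/raw/target.feather"},
}
# Hypothetical edit: someone changes the random seed in params.yaml.
current = {
    "base": {"project_dir": ".", "random_state": 7, "log_level": "DEBUG"},
    "data_load": {"target": "data/raw/target.feather"},
}

print(changed_params(current, locked, ["base", "data_load"]))
```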


In [35]:
%%bash
dvc run -n featurize_stage \
        -d src/data/load.py -d src/data/process.py -d src/utils/logging.py \
        -p config/params.yaml:base,featurize \
        python -m src.pipelines.featurize --config=config/params.yaml

Running stage 'featurize_stage' with command:
	python -m src.pipelines.featurize --config=config/params.yaml
2021-02-12 18:49:37,553 — FEATURIZE — INFO — Load dataset
2021-02-12 18:49:37,996 — FEATURIZE — INFO — Process dataset
2021-02-12 18:49:38,210 — FEATURIZE — INFO — Add target column
2021-02-12 18:49:38,992 — FEATURIZE — INFO — Process nulls
2021-02-12 18:49:39,466 — FEATURIZE — INFO — Save features
2021-02-12 18:49:39,822 — FEATURIZE — DEBUG — Features path: data/processed/features.feather
Adding stage 'featurize_stage' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml


In [36]:
! cat dvc.yaml

stages:
  load_data_stage:
    cmd: python -m src.pipelines.load_data --config=config/params.yaml
    deps:
    - src/data/load.py
    - src/data/process.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - data_load
  featurize_stage:
    cmd: python -m src.pipelines.featurize --config=config/params.yaml
    deps:
    - src/data/load.py
    - src/data/process.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - featurize


Remove the training metrics and the model output from Git tracking so that DVC can start tracking them:

In [46]:
%%bash
git rm -r --cached 'reports/raw_metrics.csv'
git commit -m "stop tracking reports/raw_metrics.csv" 

rm 'reports/raw_metrics.csv'
[dvc_pipelines e6a9b73] stop tracking reports/raw_metrics.csv
 3 files changed, 34 insertions(+), 5 deletions(-)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 delete mode 100644 reports/raw_metrics.csv


In [48]:
%%bash
git rm -r --cached 'models/model.joblib'
git commit -m "stop tracking models/model.joblib" 

rm 'models/model.joblib'
[dvc_pipelines 887a476] stop tracking models/model.joblib
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 models/model.joblib


In [49]:
%%bash
dvc run -n train_stage \
        -d src/train/train.py -d src/evaluate/metrics.py -d src/utils/logging.py \
        -o models/model.joblib \
        -p config/params.yaml:base,featurize.features_path,train \
        -m reports/raw_metrics.csv \
        python -m src.pipelines.train --config=config/params.yaml

Running stage 'train_stage' with command:
	python -m src.pipelines.train --config=config/params.yaml
{'project_dir': '.', 'random_state': 42, 'log_level': 'DEBUG'}
2021-02-12 19:25:51,164 — TRAIN — INFO — Load data
2021-02-12 19:25:51,327 — TRAIN — INFO — Instantiate model
2021-02-12 19:25:51,405 — TRAIN — INFO — Top_K 5.0% of the dataset size: 37606
2021-02-12 19:25:51,405 — TRAIN — INFO — Fold 1:
2021-02-12 19:25:51,405 — TRAIN — INFO — Train: 2020-04-30 00:00:00 - 2020-04-30 00:00:00
2021-02-12 19:25:51,406 — TRAIN — INFO — Test: 2020-05-31 00:00:00 

2021-02-12 19:25:51,508 — TRAIN — INFO — Train shapes: X - (150484, 30), y - (150484,)
2021-02-12 19:25:51,508 — TRAIN — INFO — Test shapes: X - (150411, 30), y - (150411,)
Learning rate set to 0.5
0:	learn: 0.6136792	total: 129ms	remaining: 2.45s
1:	learn: 0.5580362	total: 181ms	remaining: 1.63s
2:	learn: 0.5270051	total: 232ms	remaining: 1.31s
3:	learn: 0.5080045	total: 288ms	remaining: 1.15s
4:	learn: 0.4978499	total: 341ms	remainin

In [52]:
!git status

On branch dvc_pipelines
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   dvc.lock
	modified:   dvc.yaml
	new file:   models/.gitignore
	modified:   reports/.gitignore

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   .gitignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	2.1-antong_step4_dvc.ipynb



In [51]:
! git add models/.gitignore reports/.gitignore dvc.lock dvc.yaml

In [None]:
! dvc dag

+-----------------+
| load_data_stage |
+-----------------+
+-----------------+
| featurize_stage |
+-----------------+
+-------------+
| train_stage |
+-------------+
+--------------------+
| data/processed.dvc |
+--------------------+

In [6]:
! cat dvc.yaml

stages:
  load_data_stage:
    cmd: python -m src.pipelines.load_data --config=config/params.yaml
    deps:
    - src/data/load.py
    - src/data/process.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - data_load
  featurize_stage:
    cmd: python -m src.pipelines.featurize --config=config/params.yaml
    deps:
    - src/data/load.py
    - src/data/process.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - featurize
  train_stage:
    cmd: python -m src.pipelines.train --config=config/params.yaml
    deps:
    - src/evaluate/metrics.py
    - src/train/train.py
    - src/utils/logging.py
    params:
    - config/params.yaml:
      - base
      - featurize.features_path
      - train
    outs:
    - models/model.joblib
    metrics:
    - reports/raw_metrics.csv
    - reports/train_metrics.json
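
DVC connects stages when one stage's dependency is another stage's output. In this `dvc.yaml` no `deps` entry matches any `outs` or `metrics` entry, which is consistent with the `dvc dag` output showing disconnected boxes. The sketch below derives edges by exact path matching only. It is a simplification with a hypothetical function name, not DVC's graph builder.

```python
def pipeline_edges(stages):
    """Return (upstream, downstream) pairs where a dep matches an out."""
    producers = {}
    for name, spec in stages.items():
        for out in spec.get("outs", []) + spec.get("metrics", []):
            producers[out] = name
    edges = []
    for name, spec in stages.items():
        for dep in spec.get("deps", []):
            if dep in producers and producers[dep] != name:
                edges.append((producers[dep], name))
    return edges


# Dependencies and outputs as declared in the dvc.yaml above.
code_deps = ["src/data/load.py", "src/data/process.py", "src/utils/logging.py"]
stages = {
    "load_data_stage": {"deps": list(code_deps)},
    "featurize_stage": {"deps": list(code_deps)},
    "train_stage": {
        "deps": [
            "src/evaluate/metrics.py",
            "src/train/train.py",
            "src/utils/logging.py",
        ],
        "outs": ["models/model.joblib"],
        "metrics": ["reports/raw_metrics.csv", "reports/train_metrics.json"],
    },
}

print(pipeline_edges(stages))  # no dep matches an out, so no edges
```

Declaring the intermediate data files as `outs` of one stage and `deps` of the next would be one way to make the stages depend on each other.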


In [7]:
! cat dvc.lock

load_data_stage:
  cmd: python -m src.pipelines.load_data --config=config/params.yaml
  deps:
  - path: src/data/load.py
    md5: 0a5f96acb043c689f0ed53ec95c89c91
    size: 435
  - path: src/data/process.py
    md5: 50961f7de0d85f0141cff8623921318a
    size: 908
  - path: src/utils/logging.py
    md5: 65c8fc45ee7ec9baf85c1aa9050b27ed
    size: 1059
  params:
    config/params.yaml:
      base:
        project_dir: .
        random_state: 42
        log_level: DEBUG
      data_load:
        target: data/raw/target.feather
        dataset: data/raw/user_features.feather
        target_processed: data/processed/target.feather
        dataset_processed: data/processed/user_features.feather
featurize_stage:
  cmd: python -m src.pipelines.featurize --config=config/params.yaml
  deps:
  - path: src/data/load.py
    md5: 0a5f96acb043c689f0ed53ec95c89c91
    size: 435
  - path: src/data/process.py
    md5: 50961f7de0d85f0141cff8623921318a
    size: 908
  - path: 

## Run the pipeline

In [9]:
! dvc repro

Stage 'featurize_stage' didn't change, skipping
Stage 'train_stage' didn't change, skipping
Stage 'load_data_stage' didn't change, skipping
Data and pipelines are up to date.
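
A stage is skipped when what it depends on still matches what `dvc.lock` recorded. The sketch below models only the file-dependency part of that decision by comparing current MD5 hashes with locked ones. DVC checks more than this (for example parameters and outputs), and the function name `stage_changed` and the temporary file are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def stage_changed(locked_deps, root="."):
    """Return True if any dependency's MD5 differs from the locked one."""
    for dep in locked_deps:
        path = Path(root) / dep["path"]
        if not path.is_file():
            return True
        if hashlib.md5(path.read_bytes()).hexdigest() != dep["md5"]:
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp, "load.py")
    script.write_text("print('load')\n")
    locked_md5 = hashlib.md5(script.read_bytes()).hexdigest()
    lock = [{"path": "load.py", "md5": locked_md5}]

    unchanged = stage_changed(lock, tmp)  # nothing edited: stage is skipped
    script.write_text("print('load v2')\n")
    edited = stage_changed(lock, tmp)     # content differs: stage reruns

print(unchanged, edited)
```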

In [9]:
! dvc metrics show

	reports/train_metrics.json:
		lift_max: 2.1505471897179307
		lift_min: 2.1383669439424997
		lift_std: 0.005093759905144943
		lift_mean: 2.145029726481869
		precision_at_k_max: 0.8380311652395894
		precision_at_k_min: 0.8346540445673563
		precision_at_k_std: 0.0016701742169299028
		precision_at_k_mean: 0.8364556187842366
		recall_at_k_max: 1.0
		recall_at_k_min: 1.0
		recall_at_k_std: 0.0
		recall_at_k_mean: 1.0

In [2]:
! git rm -r --cached 'reports/train_metrics.json'

rm 'reports/train_metrics.json'


In [6]:
! git reset --soft HEAD~1

In [7]:
! git log

commit d9e2823506a24e4bfd171427f7ca51f8686997d8 (HEAD -> dvc_pipelines, tag: exp_1.0, origin/dvc_pipelines)
Author: AntonGusarov <gusarov@kth.se>
Date:   Fri Feb 12 19:42:33 2021 +0100

    pipeline is automated by DVC

commit 887a476a2fc05c092c15c7aaa11bd2dbe7d2790f
Author: AntonGusarov <gusarov@kth.se>
Date:   Fri Feb 12 19:25:33 2021 +0100

    stop tracking models/model.joblib

commit e6a9b7364068d6dbf69a18596759f02a4fd6b5f5
Author: AntonGusarov <gusarov@kth.se>
Date:   Fri Feb 12 19:23:29 2021 +0100

    stop tracking reports/raw_metrics.csv

commit 719cb56ce0d98c5666f04983d54debce2d9591dc (tag: v0.2)
Author: AntonGusarov <gusarov@kth.se>
Date:   Thu Feb 11 14:22:00 2021 +0100

    Add a new .TXT file to data/processed which is tracked by DVC

commit d66f7c20adbeea7bbd99684b564cd74f6ef9ec77
Author: AntonGusarov <gusarov@