# Tutorial: Versioning Data and Models

ML REPA School course: **Machine Learning experiments reproducibility and engineering with DVC**

## Checkout branch `tutorial`

In [None]:
!git checkout -b tutorial

## Initialize DVC

In [None]:
!dvc init

In [None]:
!ls .dvc/

In [None]:
!git status

In [None]:
!git add .dvc
!git commit -m "DVC init"

# How does data versioning work?

In [None]:
!dvc status

## Add a file under DVC control

In [None]:
# Get data 

!dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml

In [None]:
!dvc add data/data.xml -v

In [None]:
%%bash

git add data/.gitignore data/data.xml.dvc
git commit -m "Add raw data"

In [None]:
!ls .dvc/cache

In [None]:
!du -sh .dvc/cache/*/*

In [None]:
# The cached file is a full copy of data.xml, so preview only its first lines

!head -n 5 .dvc/cache/a3/04afb96060aad90176268345e10355

In [None]:
!cat data/data.xml.dvc

## Add a directory under version control

In [None]:
!git checkout -b cats-dogs-v1

In [None]:
# Download the cats-dogs dataset

!dvc get --rev cats-dogs-v1 \
          https://github.com/iterative/dataset-registry \
          use-cases/cats-dogs -o datadir

In [None]:
!ls datadir/data/train/cats 

In [None]:
!dvc add datadir

In [None]:
!cat datadir.dvc

In [None]:
!git status

In [None]:
!git add .gitignore datadir.dvc
!git commit -m "Add datadir"
!git tag -a cats-dogs-v1 -m "Create data version v1"

# Tracking changes 

## Track data status changes 

In [None]:
!dvc status

## Updating Tracked Files

In [None]:
!git checkout -b cats-dogs-v2

In [None]:
!dvc get --rev cats-dogs-v2 \
          https://github.com/iterative/dataset-registry \
          use-cases/cats-dogs -o datadir

In [None]:
!dvc status 

In [None]:
!dvc add datadir

In [None]:
!git status

In [None]:
!git add datadir.dvc
!git commit -m "Change data"
!git tag -a cats-dogs-v2 -m "Create data version v2"

# Switching versions

## Check out the initial branch `tutorial`

In [None]:
!git checkout tutorial
!dvc checkout

In [None]:
# No `datadir` directory there! 

!ls

## Switch to the first version of the Cats&Dogs data (branch cats-dogs-v1)

In [None]:
# Check out the cats-dogs-v1 branch

!git checkout cats-dogs-v1

In [None]:
# Still no `datadir` directory there! Why?

!ls

In [None]:
# DVC can see the status and show why 'datadir' is missing

!dvc status

In [None]:
# To bring the 'datadir' back we need to do `dvc checkout` 

!dvc checkout

In [None]:
!ls
!dvc status

# Store and share data 

## Setup your remote storage (local)

**IMPORTANT**: we use the `/tmp` folder in the examples in this course just for simplicity! DO NOT use the `/tmp` folder to store files long term! This folder is cleaned up frequently by your system.

In [None]:
# Create new remote

!mkdir -p /tmp/dvc
!dvc remote add -d local /tmp/dvc

In [None]:
# As you can see, .dvc/config is changed

!git status -s

In [None]:
# Check config file 

!cat .dvc/config

In [None]:
%%bash

git add .
git commit -m "Add remote storage"

## Push data to remote storage

In [None]:
# Push data to remote

!dvc push -v

In [None]:
!cat datadir.dvc

In [None]:
!ls /tmp/dvc/b6/

In [None]:
!cat /tmp/dvc/b6/923e1e4ad16ea1a7e2b328842d56a2.dir
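The `.dir` file printed above is a JSON list in which each entry pairs a file's relative path (`relpath`) with the hash of its contents (`md5`). Because of that, two versions of a tracked directory can be compared by comparing their manifests. The sketch below illustrates the idea; the two sample manifests, their file names, and their hash values are made up for illustration, and `load_manifest` and `diff_manifests` are hypothetical helpers, not part of DVC.

```python
import json

# Hypothetical manifests shaped like `.dir` cache entries.
# File names and hashes are invented for this example.
MANIFEST_V1 = """[
  {"md5": "11111111111111111111111111111111", "relpath": "data/train/cats/cat.1.jpg"},
  {"md5": "22222222222222222222222222222222", "relpath": "data/train/dogs/dog.1.jpg"}
]"""

MANIFEST_V2 = """[
  {"md5": "11111111111111111111111111111111", "relpath": "data/train/cats/cat.1.jpg"},
  {"md5": "33333333333333333333333333333333", "relpath": "data/train/dogs/dog.1.jpg"},
  {"md5": "44444444444444444444444444444444", "relpath": "data/train/cats/cat.2.jpg"}
]"""


def load_manifest(text):
    """Turn a manifest into a {relpath: md5} mapping."""
    return {entry["relpath"]: entry["md5"] for entry in json.loads(text)}


def diff_manifests(old, new):
    """List paths that were added, removed, or whose content hash changed."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(path for path in shared if old[path] != new[path]),
    }


changes = diff_manifests(load_manifest(MANIFEST_V1), load_manifest(MANIFEST_V2))
print(changes)
```

Unchanged files keep the same hash in both manifests, which is why only new or modified files need new cache entries when a directory is re-added.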

## Retrieve data from remote storage 

In [None]:
# For example, remove the local cache and the data directory

!rm -rf .dvc/cache
!rm -rf datadir

In [None]:
!ls 

In [None]:
!dvc pull -v

In [None]:
!ls

# Data Access

## Find a dataset

> You can use `dvc list` to explore a DVC repository hosted on any Git server. For example, let's see what's in the `use-cases/` directory of our dataset-registry repo:

In [None]:
!dvc list https://github.com/iterative/dataset-registry use-cases

## dvc get

In [None]:
# dvc get = just download dataset

!dvc get https://github.com/iterative/dataset-registry \
          use-cases/cats-dogs

In [None]:
# DVC doesn't control cats-dogs/ folder. There is no cats-dogs.dvc 

!ls

## dvc import 

In [None]:
# dvc import = download dataset and add it under DVC control = dvc get + dvc add

!dvc import git@github.com:iterative/example-get-started \
             data/data.xml

In [None]:
# New data.xml file and data.xml.dvc appeared 

!ls

In [None]:
!cat data.xml.dvc

## dvc import-url

This example follows "Example: Tracking a remote file" from the DVC command reference: https://dvc.org/doc/command-reference/import-url

In [None]:
!dvc import-url https://data.dvc.org/get-started/data.xml \
                 data/data.xml

In [None]:
!cat data.xml.dvc

# Special section

## What is DVC-file?

Data file internals

> If you take a look at the DVC-file, you will see that only outputs are defined in `outs`. In this file, only one output is defined. The output contains the data file path in the repository and md5 cache. This md5 cache determines a location of the actual content file in DVC cache directory `.dvc/cache`.
>
>> Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format.

(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline
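The relationship described in the quote can be sketched in a few lines of Python. This is a minimal illustration, not DVC's own code: `SAMPLE_DVC_FILE` is an assumed, simplified single-output file that reuses the hash from the cache path shown earlier in this tutorial, and `parse_outs` and `cache_path` are hypothetical helpers. The line-based reader only handles flat entries; a real project would use a YAML parser.

```python
from pathlib import PurePosixPath

# Assumed sample mirroring the layout of a single-output DVC-file.
SAMPLE_DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
"""


def parse_outs(text):
    """Collect the key/value pairs of each entry listed under `outs:`."""
    outs, in_outs = [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw.startswith((" ", "-")):
            # A top-level key: only `outs:` opens the section we want.
            in_outs = line == "outs:"
            continue
        if not in_outs:
            continue
        if line.startswith("- "):
            # A dash starts a new output entry.
            outs.append({})
            line = line[2:]
        key, _, value = line.partition(":")
        outs[-1][key.strip()] = value.strip()
    return outs


def cache_path(md5, cache_dir=".dvc/cache"):
    """First two hash characters name the directory, the rest the file."""
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


for out in parse_outs(SAMPLE_DVC_FILE):
    print(out["path"], "->", cache_path(out["md5"]))
```

The same split applies to directory outputs, whose hash carries a `.dir` suffix that stays part of the file name inside the cache.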

In [None]:
!cat datadir.dvc

In [None]:
!du -sh .dvc/cache/*/*

In [None]:
!tree .dvc/cache

In [None]:
!cat .dvc/cache/b6/923e1e4ad16ea1a7e2b328842d56a2.dir

In [None]:
!head -n 5 .dvc/cache/a3/04afb96060aad90176268345e10355

## Review Files and Directories created by DVC

> Once initialized in a project, DVC populates its installation directory (.dvc/) with the internal files and directories needed for DVC operation: https://dvc.org/doc/user-guide/dvc-files-and-directories

In [None]:
!ls -la .dvc

In [None]:
!cat .dvc/.gitignore

In [None]:
!tree .dvc/plots

In [None]:
!tree .dvc/tmp

## Explore Structure of cache directory

> There are two ways in which the data is stored in cache: As a single file (eg. data.csv), or a directory of files.
>
> For the first case, we calculate the file hash, a 32 characters long string (usually MD5). The first two characters are used to name the directory inside `.dvc/cache`, and the rest become the file name of the cached file.
>
>> Note that file hashes are calculated from file contents only. 2 or more files with different names but the same contents can exist in the workspace and be tracked by DVC, but only one copy is stored in the cache. This helps avoid data duplication in cache and remotes.

(c) dvc.org https://dvc.org/doc/user-guide/dvc-files-and-directories
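The note about identical contents can be checked with a short, self-contained sketch. It assumes a plain MD5 over the raw file bytes, which is a simplification: DVC's exact hashing details may differ between versions. The file names, the sample contents, and the dictionary standing in for the cache are all made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Two files with different names but identical contents.
    first = Path(tmp) / "labels.csv"
    second = Path(tmp) / "labels_copy.csv"
    first.write_bytes(b"id,label\n1,cat\n")
    second.write_bytes(b"id,label\n1,cat\n")

    # A content-addressed store keyed by hash keeps a single copy.
    toy_cache = {}
    for path in (first, second):
        toy_cache.setdefault(file_md5(path), path.read_bytes())

    hash_first, hash_second = file_md5(first), file_md5(second)

print(len(hash_first), hash_first == hash_second, len(toy_cache))
```

Since the file name never enters the hash, renaming a tracked file changes its DVC-file entry but not its cache entry.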

In [None]:
!ls .dvc/cache

In [None]:
!tree .dvc/cache