### An introduction to Data Version Control (DVC)

### by Kunal Pathak

Press the down arrow to move to the next slide.


### Navigation

- Use the arrows in the bottom-right corner to navigate.
- Or use the arrow keys on your keyboard.
- Press `ESC` to see an overview of the slides.


In [None]:
# Prerequisite

# Initialize Git repository
# git init

# Clone reveal.js
# git submodule add https://github.com/hakimel/reveal.js.git reveal.js

# Install dependencies inside a virtual environment
# python3 -m venv dvc_venv
# source dvc_venv/bin/activate
# pip3 install -r requirements.txt

In [1]:
# This cell contains code to allow mermaid diagrams to be displayed in Jupyter notebooks.
# This code is unrelated to the DVC topic.
# https://mermaid.js.org/ecosystem/tutorials.html#jupyter-integration-with-mermaid-js
import base64
from IPython.display import Image, display
import matplotlib.pyplot as plt

def mm(graph):
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.urlsafe_b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


# Evolution of Versioning

### Press down arrow to go to the next slide.


### Versioning with file names


![mermaid](./../assets/dvc-100.png)


In [8]:
mm("""
graph TD
		A[Original File: file.txt] --> B[file_updated.txt]
		A --> C[file_latest.txt]
		A --> D[file_changed.txt]

		B --> E[Which version?]
		C --> E
		D --> E

		E --> F[Latest version?]
		E --> G[What changes?]
		E --> H[When and by whom?]
		E --> I[Merge changes?]
""")

### Git benefits for Code Files


![mermaid](./../assets/dvc-105.png)


In [9]:
mm("""
graph TD
		A[Code versioning with Git]
		A --> B[Track exact changes]
		A --> C[Branch and merge]
		A --> D[Track commit history]
""")

### Git limitations for Big Data


![mermaid](./../assets/dvc-110.png)


In [10]:
mm("""
graph LR
		A[Version non-text files]
		B[Track large datasets]
		C[Track experiments]
		D[Version pipelines]
		E[Track models]

		A --> B
		A --> C
		A --> D
		A --> E
""")

### Data Version Control


![mermaid](./../assets/dvc-115.png)


In [3]:
mm("""
graph LR
    A[DVC Features]
    A --> B[Versions large data files]
    A --> D[Works with Git & cloud storage]
    A --> E[Versions models & pipelines]
    A --> F[Tracks experiments]
""")

### Git & DVC together


![mermaid](./../assets/dvc-120.png)


In [4]:
mm("""
graph TD
		A[Project] --> B[Code Files]
		A --> C[Data Files]
		A --> D[ML Models]
		B --> E[Git]
		C --> F[DVC]
		D --> F
		F --> G[External Storage]
""")

# DVC Basics

- This section demonstrates the basic functionality of DVC (Data Version Control).
- These are the topics covered:
  - Initialize DVC repository
  - Add data to DVC
  - Different types of DVC remotes
  - Compare DVC with Git


## Initialize DVC repository


In [9]:
# Backup cell
# Switch to the root of the repository
# import os
# os.chdir('/workspaces/dvc3/')
# Get the current directory
# current_directory = os.getcwd()
# print(current_directory)
# Print the contents of the current directory
# directory_contents = os.listdir(current_directory)
# print(directory_contents)

In [None]:
# Initialize DVC repository

!dvc init

![mermaid](./../assets/dvc-125.png)


In [11]:
mm("""
graph TD
    A[dvc init] --> B[.dvc directory]
    A --> C[.dvcignore file]
    B --> D[config]
    B --> E[cache]
    B --> F[tmp]
    B --> G[.gitignore]
""")

- The `dvc init` command initializes a DVC project in the current directory.
- This is similar to `git init` for Git, but it sets up the necessary structure for DVC to work with your data.
- After running this command, you'll notice the following changes in your VSCode sidebar:
  - A `.dvc` directory is created. This is where DVC stores its internal files and configurations.
  - A `.dvcignore` file is created. This is similar to `.gitignore` but for DVC.
- The `.dvc` directory contains:
  - `config`: DVC configuration file.
  - `cache`: Directory where DVC stores the unique copies of your data and models.
  - `tmp`: Directory where DVC stores temporary files.
  - `.gitignore`: File to ignore DVC-specific files.
- For more details, check out the detailed documentation [here](https://dvc.org/doc/command-reference/init).
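
A minimal sketch, assuming the layout described above, of how a script might check whether a directory has been initialized with `dvc init`. The helper name `looks_like_dvc_project` is hypothetical and not part of DVC; the demonstration builds a throwaway directory rather than touching a real project.

```python
import tempfile
from pathlib import Path


def looks_like_dvc_project(root):
    """Return True if `root` contains the files that `dvc init` creates."""
    root = Path(root)
    # `dvc init` creates a .dvc directory holding a config file,
    # plus a .dvcignore file next to it.
    has_config = (root / ".dvc" / "config").is_file()
    has_ignore = (root / ".dvcignore").is_file()
    return has_config and has_ignore


# Demonstration on a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    before = looks_like_dvc_project(project)  # nothing created yet
    (project / ".dvc").mkdir()
    (project / ".dvc" / "config").write_text("")
    (project / ".dvcignore").write_text("")
    after = looks_like_dvc_project(project)  # layout now present

print(before, after)
```

In a real project you would simply call `looks_like_dvc_project(".")` from the repository root.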


In [None]:
# Check what changes `dvc init` made

!git status

In [4]:
# Stage the changes caused by `dvc init`

!git add .dvc .dvcignore

In [None]:
# Commit the changes and push them

!git commit -m "Initialize DVC"
!git push origin main

### DVC Get

- Now, let's use `dvc get` to retrieve a file from a remote repository.
- This is similar to `git clone` but for data files. It doesn't clone the entire repository, just the specified data.
- The syntax is `dvc get <remote-repo> <file-path>`.


In [None]:
!dvc get https://github.com/kanad13/datasets/ video/part-monarch-metamorphosis.mp4 -o data/time-lapse.mp4

- The `dvc get` command is used to download data from DVC repositories.
  - `https://github.com/kanad13/datasets/`: The URL of the Git repository
  - `video/part-monarch-metamorphosis.mp4`: The path to the file within the source repository
  - `-o data/time-lapse.mp4`: The output path where the file will be saved locally
- After running this command, you'll see a new `data` directory in your VSCode sidebar containing the `time-lapse.mp4` file.
- We can play the file to see how it is only part of the full video.


### DVC Add & Commit

- Now that we have a data file, let's add it to DVC and commit the changes.
- The `dvc add` command starts tracking data files with DVC. It is similar to `git add`, but for data files.
- The `dvc commit` command records changes to data that DVC already tracks by updating the cache and the corresponding `.dvc` or `dvc.lock` files. It is loosely similar to `git commit`, but for data files.


In [None]:
!dvc add data/time-lapse.mp4

- The `dvc add` command adds a file or directory to DVC.
- This creates a `.dvc` file, which is a small text file containing a unique identifier (a hash) of the data and its file path.
- After running this command, you'll notice:
  - A `time-lapse.mp4.dvc` file is created in the `data` directory.
  - The actual `time-lapse.mp4` file is copied to the DVC cache (in `.dvc/cache`).
  - The `data` folder gets a `.gitignore` file of its own, which is updated to ignore the actual data file.
- **Important**
  - Your large datasets and models are not stored in the Git repository itself; they live in the DVC cache and, after a push, in remote storage.
  - We do not commit `time-lapse.mp4` to Git, since it will go to our DVC remote.
  - The metadata about this video file is versioned alongside the source code, while the original data file is added to `.gitignore`.
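
The idea behind the `.dvc` pointer file can be sketched in plain Python: hash the file's bytes and derive a cache location from that hash. This is an illustration only, not DVC's implementation. It assumes the identifier is an MD5 of the raw file content and that the cache uses a `files/md5/<first two characters>/<rest>` layout, as in recent DVC releases (older releases may differ). The helpers `file_md5` and `cache_path_for` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path, PurePosixPath


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex):
    """Map a hash to a content-addressed path (assumed cache layout)."""
    return PurePosixPath(".dvc/cache/files/md5") / md5_hex[:2] / md5_hex[2:]


# Demonstration with a tiny stand-in file instead of a real video.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    digest = file_md5(sample)

print(digest)
print(cache_path_for(digest))
```

Because the path depends only on the content, two identical files share one cache entry, and any change to the file produces a new hash and therefore a new entry.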


In [None]:
# Add the updated files to the Git repository

!git add data/time-lapse.mp4.dvc data/.gitignore
!git commit -m "Add part video file to DVC"
!git push origin main

### DVC Remotes

- DVC supports multiple types of remotes to store your data and models, such as AWS S3, Google Cloud Storage, and Azure Blob Storage.
- You can add a remote using the `dvc remote add` command.


In [None]:
!dvc remote add -d myremote gs://dvc-bucket-01
!git add .dvc/config
!git commit -m "Add GCS remote"

- The `dvc remote add` command adds a new remote storage.
- The `-d` flag sets it as the default remote.
- After running these commands:
  - The `.dvc/config` file is updated with the new remote information.
  - We commit this change to Git to version control our DVC configuration.
- You can now push your data to the remote using the `dvc push` command.
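
To make the effect on `.dvc/config` concrete, here is a small sketch that reads the default remote from an INI-style config. The `SAMPLE_CONFIG` text is an assumption about what the file typically looks like after the command above, not output captured from this project, and `default_remote` is a hypothetical helper. DVC itself does not expose this function.

```python
import configparser

# Assumed contents of .dvc/config after adding a default GCS remote.
SAMPLE_CONFIG = """
[core]
    remote = myremote
['remote "myremote"']
    url = gs://dvc-bucket-01
"""


def default_remote(config_text):
    """Return (name, url) of the default remote from INI-style config text."""
    parser = configparser.ConfigParser()
    # Strip indentation so every "key = value" line starts at column 0.
    cleaned = "\n".join(line.strip() for line in config_text.splitlines())
    parser.read_string(cleaned)
    name = parser["core"]["remote"]
    # Remote sections are assumed to be named 'remote "<name>"', quotes included.
    section = f"'remote \"{name}\"'"
    return name, parser[section]["url"]


name, url = default_remote(SAMPLE_CONFIG)
print(name, url)
```

Since this file is plain text, committing it to Git is what lets collaborators pick up the same remote configuration.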


### DVC Push

- The `dvc push` command is used to push data to a remote storage.
- `dvc push` uploads the files from your local DVC cache folder to the remote storage.
- This is similar to `git push`, but for data files.


In [None]:
# Prerequisite

# First login to the Google Cloud Platform (GCP) using the following command:
# gcloud auth application-default login

# List the projects
# gcloud projects list

# Set the project ID
# gcloud config set project YOUR_PROJECT_ID
# gcloud config set project dvc-project-436415

# Set the project ID for the quota
#gcloud auth application-default set-quota-project dvc-project-436415

# Check the project ID
# gcloud config list project

In [None]:
!dvc push

#After running `dvc push`, you won't see any changes in your local file structure, but the data will be uploaded to your GCS bucket.
# https://console.cloud.google.com/storage/browser/dvc-bucket-01

| Feature                 | DVC Get                                          | DVC Pull                                                      |
| ----------------------- | ------------------------------------------------ | ------------------------------------------------------------- |
| **Command Syntax**      | `dvc get <repo_url> <file_path>`                 | `dvc pull [file_path]`                                        |
| **Purpose**             | Retrieve files from a remote location            | Synchronize entire project with remote storage of DVC project |
| **Usage Context**       | Can be used anywhere, no DVC project required    | Must be used within an existing DVC project                   |
| **Scope**               | Individual files or directories                  | Entire project or specific tracked files/directories          |
| **Typical Use Case**    | One-time data retrieval, accessing external data | Ongoing project work, updating local data                     |
| **Project Setup**       | No setup required                                | Requires initialized DVC project                              |
| **Dependency Tracking** | Doesn't track dependencies                       | Maintains project's dependency structure                      |


### Tracking Data files with DVC

- Now, let's demonstrate how DVC handles updating and tracking data files, similar to how Git tracks code changes.


In [None]:
# Download a new, larger version of the file from the same repo, overwriting the existing one
!dvc get https://github.com/kanad13/datasets/ video/fully-monarch-metamorphosis.mp4 -o data/time-lapse.mp4 --force

- We've now downloaded a larger version of the time-lapse video, overwriting the previous file.
- This new file is similar in content but larger in size.


In [None]:
# Add the file to DVC
!dvc add data/time-lapse.mp4

In [None]:
# This command shows that the file is modified.
!git diff ./data/time-lapse.mp4.dvc
# Notice the changes in hash & size values.

In [None]:
# Commit changes to git
!git add data/time-lapse.mp4.dvc
!git commit -m "Update time-lapse video with full version"
!git push origin main

In [None]:
# Push the updated data to the DVC remote
!dvc push

![mermaid](./../assets/dvc-130.png)


In [12]:
mm("""
graph LR
    A[GitHub Repository] -->|dvc get| B[Local Project]
    B -->|dvc add| C[DVC Tracking]
    C -->|git add/commit| D[Git Repository]
    C -->|dvc push| E[Google Cloud Storage]

    subgraph Local Project
    B
    C
    D
    end

    subgraph Remote Storage
    A
    E
    end
""")

![mermaid](./../assets/dvc-135.png)


In [5]:
mm("""
graph TD
    A[Project Directory] --> B{File Type?}
    B -->|Large Data File| C[DVC Add]
    B -->|Code/Small File| D[Git Add]
    C --> E[Creates .dvc file]
    E --> F[Add .dvc file to Git]
    C --> G[Add original file to .gitignore]
    D --> H[Commit to Git repository]
    F --> H
    G --> H
    C --> I[Store data in DVC remote]
    I --> J[Push to DVC remote]
    H --> K[Push to Git remote]
""")

# Experiment Tracking


### Machine Learning Workflow


![mermaid](./../assets/dvc-140.png)


In [13]:
mm("""
graph TD
    A[Data Management & Analysis] --> B[Build & Experiment]
    B --> C[Solution Development & Test]
    C --> D[Deployment & Serving]
    D --> E[Monitor & Maintain]
    E -.-> A
    B -.-> A
""")

### ML Pipeline


![mermaid](./../assets/dvc-145.png)


In [14]:
mm("""
graph LR
    Data[("Data")]
    ParamsYAML[("params.yaml")]

    subgraph Pipeline
        DataLoad["data_load"]
        Featurize["featurize"]
        DataSplit["data_split"]
        Evaluate["Evaluate"]
        Train["Train"]
    end

    MetricsPlots["Metrics & Plots"]
    Model["Model"]

    Data --> DataLoad
    DataLoad --> Featurize
    Featurize --> DataSplit
    DataSplit --> Evaluate
    DataSplit --> Train
    Evaluate --> MetricsPlots
    Train --> Model

    ParamsYAML -.-> Pipeline
""")


- Experiment tracking means recording the details of several runs of a machine learning pipeline.

- **Items that can be tracked include**

  - hyperparameters
  - metrics like accuracy, loss, etc.
  - different versions of models and datasets.

- **Experiment Tracking is useful for**
  - `Compare` - Compare between different runs and models.
  - `Track` - Track results and performance metrics.
  - `Reproduce` - Replicate experiments using stored models, hyperparameters, etc.
  - `Audit` - Maintain history of input data used, model used, etc.
  - `Collaborate` - Share results and allow team members to reproduce experiments.
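
As a rough illustration of the "compare" and "track" ideas above, the sketch below keeps a record per run and ranks the runs by a metric. The `Run` class, the run names, and all numbers are made up for illustration; this is not how DVC stores experiments.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One tracked experiment: its hyperparameters and resulting metrics."""
    name: str
    params: dict
    metrics: dict


# Hypothetical runs with invented values.
runs = [
    Run("run-a", {"learning_rate": 0.002}, {"accuracy": 0.81, "loss": 0.19}),
    Run("run-b", {"learning_rate": 0.010}, {"accuracy": 0.86, "loss": 0.14}),
    Run("run-c", {"learning_rate": 0.050}, {"accuracy": 0.78, "loss": 0.22}),
]

# Rank runs from best to worst accuracy.
ranked = sorted(runs, key=lambda run: run.metrics["accuracy"], reverse=True)
best = ranked[0]

for run in ranked:
    print(run.name, run.params["learning_rate"], run.metrics["accuracy"])
```

Keeping parameters next to metrics is what makes a run reproducible: the best result can be traced back to the exact settings that produced it.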




### Experiment Versions


![mermaid](./../assets/dvc-150.png)


In [None]:
mm("""
gitGraph
    commit id: "Initial"
    branch experiment-1
    checkout experiment-1
    commit id: "Exp 1 - v1"
    commit id: "Exp 1 - v2"
    checkout main
    merge experiment-1
    branch experiment-2
    checkout experiment-2
    commit id: "Exp 2 - v1"
    checkout main
    branch experiment-3
    checkout experiment-3
    commit id: "Exp 3 - v1"
    checkout experiment-2
    commit id: "Exp 2 - v2"
    checkout main
    merge experiment-2
""")

### Reproducibility


![mermaid](./../assets/dvc-155.png)


In [None]:
mm("""
graph LR
    subgraph Development Environment
    A[Historical Data] --> B[Feature Engineering]
    B --> C[Train]
    C --> D[Scoring]
    D --> E[Predictions]
    C --> F[Model]
    end

    subgraph Production Environment
    G[Live Data] --> H[Feature Engineering]
    H --> I[Scoring]
    I --> J[Predictions]
    end

    F --> I
""")

### Experiment Tracking with DVC


![mermaid](./../assets/dvc-160.png)


In [6]:
mm("""
graph LR
    A[Raw Data] --> B[Data Preparation]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Model Evaluation]
    E --> F[Model Deployment]

    subgraph DVC Tracking
    G[Version Control]
    H[Metrics Logging]
    I[Hyperparameter Tracking]
    end

    B -.-> G
    C -.-> G
    D -.-> G
    D -.-> I
    E -.-> H
""")

### Comparison of Experiment Tracking Methods

| Traditional Method     | DVC-based            |
| ---------------------- | -------------------- |
| Manual Logging         | Automated Tracking   |
| Spreadsheets           | Structured Storage   |
| Manual Version Control | Git Integration      |
| Difficult to Reproduce | Easy Reproducibility |


### DVCLive

- **DVCLive**
  - It is a Python library for tracking metrics associated with machine learning experiments.
- **Key Features**
  - Logs metrics, parameters, plots, and artifacts during model training
  - Integrates with popular ML frameworks (e.g., PyTorch Lightning, Scikit-learn)
  - Provides real-time experiment logging capabilities
  - DVCLive is an ML logger similar to MLflow. Vertex AI provides logging capabilities similar to DVCLive's, but it is designed to work within the Google Cloud ecosystem.


- **Comparison of DVCLive with other related tools**
  - [link](https://github.com/iterative/dvclive?tab=readme-ov-file#comparison-to-related-technologies)
  - DVCLive is an ML Logger, similar to:
    - [MLflow](https://mlflow.org/)
    - [Weights & Biases](https://wandb.ai/site)
    - [Neptune](https://neptune.ai/)
  - The main differences with those ML Loggers are:
    - DVCLive does not require any additional services or servers to run.
    - DVCLive metrics, parameters, and plots are stored as plain text files that can be versioned by tools like Git or tracked as pointers to files in DVC storage.
    - DVCLive can save experiments or runs as hidden Git commits.
    - You can then use different options to visualize the metrics, parameters, and plots across experiments.


### DVC Studio

- **DVC Studio**
  - It is a web-based platform for managing and visualizing experiments tracked with DVC.
- **Key Features**
  - Provides a user interface for viewing, comparing, and analyzing DVC experiments
  - Visualizes plots, metrics, and other experiment data in a centralized dashboard
  - Enables real-time monitoring of ongoing experiments when integrated with DVCLive
- **Integration between DVCLive and DVC Studio**
  - `DVCLive` - Tool for logging experiment data during training
  - `DVC Studio` - Platform for visualizing and managing that data
  - `Together` - In DVC Studio, users can see live updates of metrics and plots as they're being logged by DVCLive, allowing for real-time monitoring of ongoing experiments.


### DVCLive Demo

- **DVCLive in action**
  - We will now execute a script that simulates a machine learning training process.
  - It demonstrates how to use DVCLive to log parameters and metrics.
  - It creates fake accuracy and loss values that mimic typical ML training behavior over multiple epochs.
- **DVC Studio in action**
  - Run this code 3-4 times to simulate different training runs. Use different parameters to see how they affect the results.
  - Once you have run this code a few times, you can view the results in the DVCLive dashboard and in DVC Studio.
  - DVC Studio allows you to share and visualize your DVC experiments with others.


- **Prerequisites**
  - Go to DVC Studio and create a DVC project. Grant that project access to your repo.
  - Connect VS Code to DVC Studio: in the DVC extension, open the DVC Studio page and click Get Token.


In [None]:
# DVCLive & DVCStudio Demo
# Script source - https://github.com/iterative/dvclive

import time
import random
from dvclive import Live
# Define hyperparameters for the simulated training
params = {"learning_rate": 0.002, "optimizer": "Adam", "epochs": 20}
# Start a DVCLive session
with Live() as live:
    # Log the hyperparameters
    for param in params:
        live.log_param(param, params[param])
    # Generate a random offset to add variability to the simulated metrics
    offset = random.uniform(0.1, 0.2)
    # Simulate the training process, one iteration per configured epoch
    for epoch in range(1, params["epochs"] + 1):
        # Generate a random "fuzz" value to add noise to the metrics
        fuzz = random.uniform(0.01, 0.1)
        # Calculate simulated accuracy (increases over time)
        accuracy = 1 - (2 ** -epoch) - fuzz - offset
        # Calculate simulated loss (decreases over time)
        loss = (2 ** -epoch) + fuzz + offset
        # Log the accuracy metric
        live.log_metric("accuracy", accuracy)
        # Log the loss metric
        live.log_metric("loss", loss)
        # Move to the next step (epoch) in DVCLive
        live.next_step()
        # Add a small delay to simulate the time taken for each epoch
        time.sleep(0.2)

- DVCLive relies on Git to track the generated directory.
- After you run the training code above, all the logged data is stored in the `dvclive` directory and tracked as a DVC experiment for analysis and comparison.
- DVCLive tracks different metrics and artifacts in specific locations within a directory structure:
  - Metrics:
    - General metrics: `dvclive/plots/metrics`
    - System metrics: `dvclive/plots/metrics/system`
  - Parameters: `dvclive/params.yaml`
  - Images: `dvclive/plots/images`
  - Custom plots: `dvclive/plots/custom`
  - Sklearn plots: `dvclive/plots/sklearn`
  - Artifacts:
    - Default: `.dvc` files in the root directory (e.g., `model.pt.dvc`)
    - Optional: `dvclive/artifacts/{path}` or `dvclive/artifacts/{path}.dvc`
  - Summary: `dvclive/metrics.json`
- [link](https://dvc.org/doc/dvclive?tab=Parameters#outputs)
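
Because the summary is a plain JSON file, it can be inspected without any DVC tooling. The sketch below assumes `dvclive/metrics.json` holds a flat mapping of metric names to their latest values; the helper `read_summary` is hypothetical, and the file written here is a stand-in with made-up numbers rather than the result of a real run.

```python
import json
import tempfile
from pathlib import Path


def read_summary(dvclive_dir):
    """Load the latest logged metrics from <dvclive_dir>/metrics.json."""
    with open(Path(dvclive_dir) / "metrics.json") as handle:
        return json.load(handle)


# Stand-in for a real run: write a small summary file with invented values.
with tempfile.TemporaryDirectory() as tmp:
    dvclive_dir = Path(tmp) / "dvclive"
    dvclive_dir.mkdir()
    fake_summary = {"accuracy": 0.84, "loss": 0.16, "step": 19}
    (dvclive_dir / "metrics.json").write_text(json.dumps(fake_summary))
    summary = read_summary(dvclive_dir)

print(summary["accuracy"], summary["loss"])
```

After running the demo cell above, `read_summary("dvclive")` would read the actual summary from the notebook's working directory.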


In [None]:
# Use the DVC CLI to view the experiment results
# https://github.com/iterative/dvclive?tab=readme-ov-file#dvc-cli
# A better way, though, is to use DVC Studio or the VS Code extension.

!dvc exp show

In [None]:
# Use the DVC CLI to compare plots across experiments
# https://github.com/iterative/dvclive?tab=readme-ov-file#dvc-cli
# A better way, though, is to use DVC Studio or the VS Code extension.

!dvc plots diff $(dvc exp list --names-only) --open

# DVC Pipelines


### Need for DVC Pipelines

- **Beyond Jupyter Notebooks**
  - Experimenting interactively in Jupyter notebooks is great for exploration.
  - But once you are ready to scale up your workflow, you need a more structured way to run reproducible experiments.
- **ML Pipelines**
  - ML Pipelines are a sequence of stages that process data and train models.
  - They include stages such as data preprocessing, model training, evaluation, and deployment.
- **DVC Pipelines**
  - DVC pipelines help you version control your entire machine learning workflow, including data, code, and model artifacts.
  - Pipeline definitions are stored in the `dvc.yaml` file, which describes the stages, their dependencies, and outputs.
  - DVC pipelines enable `reproducibility` by tracking the entire experiment lifecycle, from raw data to final results.


![mermaid](./../assets/dvc-145.png)


### Demo DVC Pipeline


- To demonstrate how to work with DVC pipelines, we are going to create a simple pipeline with the following stages:
  - `Prepare` - Process raw data
  - `Featurize` - Transform prepared data into feature vectors
  - `Train` - Train a machine learning model
  - `Evaluate` - Assess model performance


![mermaid](./../assets/dvc-165.png)


In [7]:
mm("""
graph LR
    A[Prepare] --> B[Featurize]
    B --> C[Train]
    C --> D[Evaluate]
""")

In [None]:
# Download some ready-made code

!wget https://code.dvc.org/get-started/code.zip && unzip code.zip && rm code.zip && rm -rf .github/ && rm -rf src/requirements.txt

In [None]:
# Download some ready-made data
!dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

# Add the data to DVC
!dvc add data/data.xml

### Create pipeline definition

- A DVC pipeline can be defined in two ways:
  - manually adding the details of each stage to the `dvc.yaml` file, or
  - using the `dvc stage add` command to define each stage of the pipeline.
- For the purpose of this demo, I have already created a `dvc.yaml` file with the pipeline definition.


In [None]:
# Prerequisite

# Define the DVC Pipeline

# Stage 1: Prepare
!dvc stage add -n prepare \
              -p prepare.seed,prepare.split \
              -d src/prepare.py -d data/data.xml \
              -o data/prepared \
              python src/prepare.py data/data.xml

# Stage 2: Featurize
!dvc stage add -n featurize \
              -p featurize.max_features,featurize.ngrams \
              -d src/featurization.py -d data/prepared \
              -o data/features \
              python src/featurization.py data/prepared data/features

# Stage 3: Train
!dvc stage add -n train \
              -p train.seed,train.n_est,train.min_split \
              -d src/train.py -d data/features \
              -o model.pkl \
              python src/train.py data/features model.pkl

# Stage 4: Evaluate
!dvc stage add -n evaluate \
              -d src/evaluate.py -d model.pkl -d data/features \
              -o eval \
              python src/evaluate.py model.pkl data/features

### Execute pipeline

- **Run pipeline**
  - To execute the pipeline, we use the `dvc repro` command.
- **dvc.lock**
  - After reproducing the pipeline, a `dvc.lock` file is created
  - This file captures the
    - state of the pipeline,
    - the dependencies and values of the parameters that were used, and
    - the outputs that were generated.
- **yaml & lock**
  - `dvc repro` relies on the dependency graph defined in `dvc.yaml`, and
  - uses `dvc.lock` to determine what needs to be run
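
The dependency-graph idea can be sketched with Python's standard `graphlib` module (Python 3.9+). This is a simplified model, not DVC's algorithm: it treats a stage as stale whenever it, or anything upstream of it, has changed, whereas DVC compares recorded hashes in `dvc.lock`. The `STAGE_DEPS` mapping mirrors the four demo stages, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Stage -> stages it depends on, mirroring the demo pipeline.
STAGE_DEPS = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def execution_order(stage_deps):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(stage_deps).static_order())


def stages_to_rerun(stage_deps, changed):
    """Return changed stages plus everything downstream, in run order."""
    order = execution_order(stage_deps)
    stale = set(changed)
    for stage in order:
        # A stage becomes stale if any of its dependencies is stale.
        if stage_deps[stage] & stale:
            stale.add(stage)
    return [stage for stage in order if stage in stale]


order = execution_order(STAGE_DEPS)
rerun = stages_to_rerun(STAGE_DEPS, {"train"})
print(order)
print(rerun)
```

With a change that only affects `train`, the sketch prints the full order `['prepare', 'featurize', 'train', 'evaluate']` followed by `['train', 'evaluate']`, so `prepare` and `featurize` would be skipped.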


In [None]:
# Execute dvc pipeline

!dvc repro

### Check metrics

- Check out the DVC extension for VS Code after the run is finished.


In [None]:
# View pipeline execution metrics
!dvc metrics show


In [None]:
# Generate pipeline execution plots
!dvc plots show

### Commit changes

- DVCLive allows you to visualize the differences between runs.
- So commit the changes and push them to the remote repository.


### Commit pipeline execution changes

- Commit the changes to the DVC & git repository.


### Change pipeline parameters

- Open the `params.yaml` file, change a value, and rerun the pipeline to see the changes in the metrics.
- For example, change the number of estimators (`train.n_est`) from 50 to 10.
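
A parameter change like this one can be thought of as a difference between two nested dictionaries, reported with dotted keys such as `train.n_est`. The sketch below is only an illustration of that idea under stated assumptions: `flatten` and `params_diff` are hypothetical helpers, and the `prepare.split` value is invented. Only the `n_est` change from 50 to 10 comes from the example above.

```python
def flatten(params, prefix=""):
    """Turn nested dicts into a flat dict with dotted keys."""
    flat = {}
    for key, value in params.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def params_diff(old, new):
    """Return {dotted_key: (old_value, new_value)} for changed parameters."""
    old_flat, new_flat = flatten(old), flatten(new)
    keys = sorted(old_flat.keys() | new_flat.keys())
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in keys
        if old_flat.get(key) != new_flat.get(key)
    }


# Hypothetical slices of params.yaml before and after the edit.
old_params = {"prepare": {"split": 0.2}, "train": {"n_est": 50}}
new_params = {"prepare": {"split": 0.2}, "train": {"n_est": 10}}

diff = params_diff(old_params, new_params)
print(diff)
```

Only `train.n_est` appears in the result, which matches the intuition that stages depending on unchanged parameters do not need to run again.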


In [None]:
# Execute dvc pipeline again

!dvc repro

# DVC will analyze the changes and only re-execute the necessary stages.
# Check out plots and metrics again.

# Commit changes to Git and DVC.

In [None]:
# View pipeline execution metrics (again)
!dvc metrics show


In [None]:
# Generate pipeline execution plots (again)
!dvc plots show

In [None]:
# Compare parameters
!dvc params diff


In [None]:
# Compare metrics
!dvc metrics diff


In [None]:
# Compare plots
!dvc plots diff

In [None]:
!dvc status