**Table of contents**<a id='toc0_'></a>    
- [Introduction to Data Version Control (DVC)](#toc1_)    
    - [What is DVC?](#toc1_1_1_)    
    - [What Does This Notebook Cover?](#toc1_1_2_)    
    - [What Can DVC Be Used For?](#toc1_1_3_)    
    - [DVC Benefits](#toc1_1_4_)    
- [Evolution of Versioning](#toc2_)    
    - [Versioning with file names](#toc2_1_1_)    
    - [Git benefits for Code Files](#toc2_1_2_)    
    - [Git limitations for Big Data](#toc2_1_3_)    
    - [Data Version Control](#toc2_1_4_)    
    - [Git & DVC together](#toc2_1_5_)    
- [DVC Basics](#toc3_)    
    - [Initialize DVC repository](#toc3_1_1_)    
    - [DVC Get](#toc3_1_2_)    
    - [DVC Add & Commit](#toc3_1_3_)    
    - [DVC Remotes](#toc3_1_4_)    
    - [DVC Push](#toc3_1_5_)    
    - [Tracking Data files with DVC](#toc3_1_6_)    
- [Experiment Tracking](#toc4_)    
    - [Machine Learning Workflow](#toc4_1_1_)    
    - [ML Pipeline](#toc4_1_2_)    
    - [Experiment Versions](#toc4_1_3_)    
    - [Reproducibility](#toc4_1_4_)    
    - [Experiment Tracking with DVC](#toc4_1_5_)    
    - [Comparison of Experiments Tracking methods](#toc4_1_6_)    
    - [DVCLive](#toc4_1_7_)    
    - [DVC Studio](#toc4_1_8_)    
    - [DVCLive Demo](#toc4_1_9_)    
- [DVC Pipelines](#toc5_)    
    - [Need for DVC Pipelines](#toc5_1_1_)    
    - [Demo DVC Pipeline](#toc5_1_2_)    
    - [Another repository](#toc5_1_3_)    
- [Benefits & Limitations of DVC](#toc6_)    
  - [Benefits of DVC for Data Science and ML Teams](#toc6_1_)    
  - [Benefits of DVC for Other Teams](#toc6_2_)    
  - [Limitations of DVC](#toc6_3_)    
- [Comparison of DVC with other tools](#toc7_)    
  - [DVC comparison with standalone tools](#toc7_1_)    
    - [MLflow](#toc7_1_1_)    
    - [Pachyderm](#toc7_1_2_)    
    - [Weights & Biases (wandb)](#toc7_1_3_)    
    - [Neptune.ai](#toc7_1_4_)    
    - [Key Takeaways - DVC vs. Standalone Tools](#toc7_1_5_)    
  - [DVC vs. Cloud Provider Offerings](#toc7_2_)    
    - [Google Cloud Platform (GCP)](#toc7_2_1_)    
    - [Amazon Web Services (AWS)](#toc7_2_2_)    
    - [Microsoft Azure](#toc7_2_3_)    
    - [Key Takeaways - DVC vs. Cloud offerings](#toc7_2_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Introduction to Data Version Control (DVC)](#toc0_)


### <a id='toc1_1_1_'></a>[What is DVC?](#toc0_)

- Data Version Control (DVC) is an open-source version control system designed to handle large datasets and machine learning models.
- It extends the capabilities of traditional version control systems, like Git, by enabling effective tracking and management of code, data, and experiments involved in data science projects.


### <a id='toc1_1_2_'></a>[What Does This Notebook Cover?](#toc0_)

This interactive Jupyter notebook serves as an introduction to DVC by covering essential commands and functionalities. It includes:

- **DVC Basics:** Setting up a DVC project and tracking data files with it.
- **Experiment Tracking:** ML workflow basics and the need for tracking and reproducing ML experiments.
- **DVC Pipelines:** What ML pipelines are and how DVC pipelines help MLOps.
- **DVC - Pros & Cons:** When to use DVC and when not to.
- **Compare DVC to other tools:** How DVC differs from standalone MLOps tools and cloud provider offerings.

By the end of this notebook, you'll have a foundational understanding of DVC and how to integrate it into your data science workflow for better project management and collaboration.


### <a id='toc1_1_3_'></a>[What Can DVC Be Used For?](#toc0_)

- **Data Management:** Track and version your datasets and model files, allowing you to go back to any version or state as needed.
- **Reproducibility:** Ensure that your machine learning experiments are reproducible by keeping a history of data, model, and code.
- **Collaboration:** Collaborate more efficiently with team members by sharing data files and experiment progress consistently.
- **Experiment Tracking:** Manage and compare multiple experiments effectively, providing insights into model performance variations.
- **Pipeline Automation:** Build and maintain data pipelines, defining dependencies between data processing stages, which can be executed and reproduced effortlessly.


### <a id='toc1_1_4_'></a>[DVC Benefits](#toc0_)

- **Customer**
  - Faster Time to Market
  - Improved Model Quality
  - Transparency and Explainability
- **Simplicity**
  - Easy to learn (git-like commands)
  - Integrates with existing tools
  - Easy Experiment Tracking
- **Growth**
  - Cost efficiency
  - Increased productivity
  - Accelerated Experimentation


In [None]:
# This cell contains code to allow mermaid diagrams to be displayed in Jupyter notebooks.
# Run this code cell so that the mermaid diagrams in the notebook are displayed correctly.
# This code is unrelated to the DVC topic.
# https://mermaid.js.org/ecosystem/tutorials.html#jupyter-integration-with-mermaid-js
import base64
from IPython.display import Image, display
import matplotlib.pyplot as plt

def mm(graph):
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.urlsafe_b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


# <a id='toc2_'></a>[Evolution of Versioning](#toc0_)


### <a id='toc2_1_1_'></a>[Versioning with file names](#toc0_)


In [None]:
mm("""
graph TD
		A[Original File: file.txt] --> B[file_updated.txt]
		A --> C[file_latest.txt]
		A --> D[file_changed.txt]

		B --> E[Which version?]
		C --> E
		D --> E

		E --> F[Latest version?]
		E --> G[What changes?]
		E --> H[When and by whom?]
		E --> I[Merge changes?]
""")

### <a id='toc2_1_2_'></a>[Git benefits for Code Files](#toc0_)


In [None]:
mm("""
graph TD
		A[Code versioning with Git]
		A --> B[Track exact changes]
		A --> C[Branch and merge]
		A --> D[Track commit history]
""")

### <a id='toc2_1_3_'></a>[Git limitations for Big Data](#toc0_)


In [None]:
mm("""
graph LR
		A[Version non-text files]
		B[Track large datasets]
		C[Track experiments]
		D[Version pipelines]
		E[Track models]

		A --> B
		A --> C
		A --> D
		A --> E
""")

### <a id='toc2_1_4_'></a>[Data Version Control](#toc0_)


In [None]:
mm("""
graph LR
    A[DVC Features]
    A --> B[Versions large data files]
    A --> D[Works with Git & cloud storage]
    A --> E[Versions models & pipelines]
    A --> F[Tracks experiments]
""")

### <a id='toc2_1_5_'></a>[Git & DVC together](#toc0_)


In [None]:
mm("""
graph TD
		A[Project] --> B[Code Files]
		A --> C[Data Files]
		A --> D[ML Models]
		B --> E[Git]
		C --> F[DVC]
		D --> F
		F --> G[External Storage]
""")

# <a id='toc3_'></a>[DVC Basics](#toc0_)

- This section demonstrates the basic functionality of DVC (Data Version Control).
- These are the topics covered:
  - Initialize DVC repository
  - Add data to DVC
  - Different types of DVC remotes
  - Compare DVC with Git


### <a id='toc3_1_1_'></a>[Initialize DVC repository](#toc0_)


In [7]:
# Backup cell
# Switch to the root of the repository
#import os
#os.chdir('/workspaces/dvc3/')
# Get the current directory
#current_directory = os.getcwd()
#print(current_directory)
# Print the contents of the current directory
#directory_contents = os.listdir(current_directory)
#print(directory_contents)

In [None]:
# Initialize DVC repository

!dvc init

In [7]:
mm("""
graph TD
    A[dvc init] --> B[.dvc directory]
    A --> C[.dvcignore file]
    B --> D[config]
    B --> E[cache]
    B --> F[tmp]
    B --> G[.gitignore]
""")

- The `dvc init` command initializes a DVC project in the current directory.
- This is similar to `git init` for Git, but it sets up the necessary structure for DVC to work with your data.
- After running this command, you'll notice the following changes in your VSCode sidebar:
  - A `.dvc` directory is created. This is where DVC stores its internal files and configurations.
  - A `.dvcignore` file is created. This is similar to `.gitignore` but for DVC.
- The `.dvc` directory contains:
  - `config`: DVC configuration file.
  - `cache`: Directory where DVC stores the unique copies of your data and models.
  - `tmp`: Directory where DVC stores temporary files.
  - `.gitignore`: File to ignore DVC-specific files.
- For more details, check out the detailed documentation [here](https://dvc.org/doc/command-reference/init).
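
As a quick sanity check, the layout described above can be verified from Python. This is a minimal sketch, not part of DVC itself: the helper name `missing_dvc_entries` and the `EXPECTED_ENTRIES` list are made up for illustration. `cache` and `tmp` are left out on the assumption that they may only appear once DVC has something to store, and other DVC versions may lay things out differently.

```python
from pathlib import Path

# Entries that `dvc init` is expected to create (assumption based on the list above).
EXPECTED_ENTRIES = [".dvc", ".dvc/config", ".dvc/.gitignore", ".dvcignore"]


def missing_dvc_entries(project_root):
    """Return the expected DVC entries that are absent under project_root."""
    root = Path(project_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# An empty result suggests that `dvc init` has been run in this directory.
print(missing_dvc_entries("."))
```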


In [None]:
# Check what changes `dvc init` made

!git status

In [11]:
#Stage the changes caused by `dvc init`

!git add .dvc .dvcignore

In [None]:
# Commit the changes and push them

!git commit -m "Initialize DVC"
!git push origin main

### <a id='toc3_1_2_'></a>[DVC Get](#toc0_)

- Now, let's use `dvc get` to retrieve a file from a remote repository.
- This is similar to `git clone` but for data files. It doesn't clone the entire repository, just the specified data.
- The syntax is `dvc get <remote-repo> <file-path>`.


In [None]:
!dvc get https://github.com/kanad13/datasets/ video/part-monarch-metamorphosis.mp4 -o data/time-lapse.mp4

- The `dvc get` command is used to download data from DVC repositories.
  - `https://github.com/kanad13/datasets/`: The URL of the Git repository
  - `video/part-monarch-metamorphosis.mp4`: The path to the file within the source repository
  - `-o data/time-lapse.mp4`: The output path where the file will be saved locally
- After running this command, you'll see a new `data` directory in your VSCode sidebar containing the `time-lapse.mp4` file.
- We can play the file to see how it is only part of the full video.
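
If you would rather drive `dvc get` from Python than from a `!` shell cell, the same command can be assembled as an argument list. The sketch below only builds and prints the command; the helper name `build_dvc_get_command` is hypothetical, and actually running the command assumes the `dvc` CLI is installed and on your `PATH`.

```python
import shlex


def build_dvc_get_command(repo_url, path, out=None, force=False):
    """Build the argument list for `dvc get` without running it."""
    command = ["dvc", "get", repo_url, path]
    if out is not None:
        command += ["-o", out]  # local output path
    if force:
        command.append("--force")  # overwrite an existing local file
    return command


command = build_dvc_get_command(
    "https://github.com/kanad13/datasets/",
    "video/part-monarch-metamorphosis.mp4",
    out="data/time-lapse.mp4",
)
print(shlex.join(command))
# To actually download the file: subprocess.run(command, check=True)
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.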


### <a id='toc3_1_3_'></a>[DVC Add & Commit](#toc0_)

- Now that we have a data file, let's start tracking it with DVC.
- The `dvc add` command starts tracking a data file with DVC. It is similar to `git add`, but for data files.
- The `dvc commit` command records changes to data that DVC already tracks by updating the cache and the corresponding `.dvc` file. Unlike `git commit`, it does not create a Git commit, and `dvc add` already does this work for newly added files.


In [None]:
!dvc add data/time-lapse.mp4

- The `dvc add` command adds a file or directory to DVC.
- This creates a `.dvc` file, which is a small text file containing a unique identifier (hash) of the data and a file path.
- After running this command, you'll notice:
  - A `time-lapse.mp4.dvc` file is created in the `data` directory.
  - The actual `time-lapse.mp4` file is copied to the DVC cache (in `.dvc/cache`).
  - The `data` folder gets a `.gitignore` file of its own, which tells Git to ignore the actual data file.
- **Important**
  - DVC does not store your large datasets or models in the Git repository. They live in the DVC cache and, once pushed, in remote storage.
  - We do not commit `time-lapse.mp4` to Git since it will go to our DVC remote.
  - The metadata about this video file is versioned alongside the source code, while the original data file is added to `.gitignore`.
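
To make the "unique identifier" idea concrete, here is a rough sketch of content addressing: the file's bytes are hashed, and the hash decides where the copy lives in the cache. The helper names are hypothetical, the use of MD5 and the two-character directory prefix are assumptions for illustration, and the exact cache layout differs between DVC versions.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(file_hash, cache_dir=".dvc/cache"):
    """Illustrative layout only: first two hex characters as a directory, the rest as the file name."""
    return Path(cache_dir) / file_hash[:2] / file_hash[2:]
```

Because the location depends only on the content, two identical files map to the same cache entry, and any change to the file produces a new entry instead of overwriting the old one.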


In [None]:
# Add the updated files to the Git repository

!git add data/time-lapse.mp4.dvc data/.gitignore
!git commit -m "Add part video file to DVC"
!git push origin main

### <a id='toc3_1_4_'></a>[DVC Remotes](#toc0_)

- DVC supports multiple types of remotes to store your data and models e.g. AWS S3, Google Cloud Storage, Azure Blob Storage, etc.
- You can add a remote using the `dvc remote add` command.


In [None]:
!dvc remote add -d myremote gs://dvc-bucket-01
!git add .dvc/config
!git commit -m "Add GCS remote"
!git push origin main

- The `dvc remote add` command adds a new remote storage.
- The `-d` flag sets it as the default remote.
- After running these commands:
  - The `.dvc/config` file is updated with the new remote information.
  - We commit this change to Git to version control our DVC configuration.
- You can now push your data to the remote using the `dvc push` command.
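
The remote settings in `.dvc/config` are plain text, so they can be inspected without DVC. The sketch below assumes the file uses the INI-like layout shown in `SAMPLE_CONFIG` (with section names of the form `'remote "<name>"'`); that sample and the helper `default_remote_url` are illustrative, so compare against your own `.dvc/config` before relying on it.

```python
import configparser

# Assumed shape of .dvc/config after the `dvc remote add -d` command above.
SAMPLE_CONFIG = """\
[core]
    remote = myremote
['remote "myremote"']
    url = gs://dvc-bucket-01
"""


def default_remote_url(config_text):
    """Return the URL of the remote that [core] marks as the default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser.get("core", "remote")
    section = f"'remote \"{name}\"'"
    return parser.get(section, "url")


print(default_remote_url(SAMPLE_CONFIG))
```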


### <a id='toc3_1_5_'></a>[DVC Push](#toc0_)

- The `dvc push` command is used to push data to a remote storage.
- `dvc push` uploads the files from your local DVC cache folder to the remote storage.
- This is similar to `git push`, but for data files.
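
Conceptually, a push only needs to transfer cache objects that the remote does not have yet. The following is a simplified model of that idea, not DVC's implementation; the function name and the hash values are made up for illustration.

```python
def objects_to_upload(local_hashes, remote_hashes):
    """Return the hashes present in the local cache but missing on the remote."""
    return sorted(set(local_hashes) - set(remote_hashes))


local_cache = ["aa11", "bb22", "cc33"]   # hypothetical object hashes in .dvc/cache
remote_store = ["aa11"]                  # hypothetical objects already in the bucket
print(objects_to_upload(local_cache, remote_store))
```

This is why pushing again without changing any data is typically fast: nothing is missing, so nothing is uploaded.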


In [17]:
# Prerequisite

# First login to the Google Cloud Platform (GCP) using the following command:
# gcloud auth application-default login

# List the projects
# gcloud projects list

# Set the project ID
# gcloud config set project YOUR_PROJECT_ID
# gcloud config set project dvc-project-436415

# Set the project ID for the quota
#gcloud auth application-default set-quota-project dvc-project-436415

# Check the project ID
# gcloud config list project

# This is a quick shortcut to the bucket I created within my own GCP project -
# https://console.cloud.google.com/storage/browser?referrer=search&project=dvc-project-436415&prefix=&forceOnBucketsSortingFiltering=true
# You must create a similar bucket in your own GCP project to have rights to push files.

In [None]:
!dvc push

# After running `dvc push`, you won't see any changes in your local file structure, but the data will be uploaded to your GCS bucket.
# https://console.cloud.google.com/storage/browser/dvc-bucket-01

| Feature                 | DVC Get                                          | DVC Pull                                                      |
| ----------------------- | ------------------------------------------------ | ------------------------------------------------------------- |
| **Command Syntax**      | `dvc get <repo_url> <file_path>`                 | `dvc pull [file_path]`                                        |
| **Purpose**             | Retrieve files from a remote location            | Synchronize entire project with remote storage of DVC project |
| **Usage Context**       | Can be used anywhere, no DVC project required    | Must be used within an existing DVC project                   |
| **Scope**               | Individual files or directories                  | Entire project or specific tracked files/directories          |
| **Typical Use Case**    | One-time data retrieval, accessing external data | Ongoing project work, updating local data                     |
| **Project Setup**       | No setup required                                | Requires initialized DVC project                              |
| **Dependency Tracking** | Doesn't track dependencies                       | Maintains project's dependency structure                      |


### <a id='toc3_1_6_'></a>[Tracking Data files with DVC](#toc0_)

- Now, let's demonstrate how DVC handles updating and tracking data files, similar to how Git tracks code changes.


In [None]:
# Download a new, larger version of the file from the same repo (using `dvc get`, not `dvc pull`)
!dvc get https://github.com/kanad13/datasets/ video/fully-monarch-metamorphosis.mp4 -o data/time-lapse.mp4 --force

- We've now downloaded a larger version of the time-lapse video, overwriting the previous file.
- This new file is similar in content but larger in size.
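
Re-running `dvc add` on a changed file rewrites its `.dvc` pointer file, and that small text change is what Git ends up versioning. The sketch below compares two pointer files field by field. The pointer contents are invented placeholders (real hashes are 32 hex characters), and the line-based parser is a deliberate simplification; `.dvc` files are YAML, so a YAML library would be the more robust choice.

```python
def parse_pointer(text):
    """Very small parser for the flat `key: value` lines of a .dvc file."""
    fields = {}
    for line in text.splitlines():
        cleaned = line.strip().lstrip("- ").strip()
        if ": " in cleaned:
            key, value = cleaned.split(": ", 1)
            fields[key] = value
    return fields


def changed_fields(old_text, new_text):
    """Map each differing field to its (old, new) pair of values."""
    old, new = parse_pointer(old_text), parse_pointer(new_text)
    keys = sorted(set(old) | set(new))
    return {key: (old.get(key), new.get(key)) for key in keys if old.get(key) != new.get(key)}


# Hypothetical pointer contents before and after the update.
OLD_POINTER = "outs:\n- md5: 1111aaaa\n  size: 100\n  path: time-lapse.mp4\n"
NEW_POINTER = "outs:\n- md5: 2222bbbb\n  size: 250\n  path: time-lapse.mp4\n"
print(changed_fields(OLD_POINTER, NEW_POINTER))
```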


In [None]:
# Add the file to DVC
!dvc add data/time-lapse.mp4

In [None]:
# This command shows that the file is modified.
!git diff ./data/time-lapse.mp4.dvc
# Notice the changes in hash & size values.

In [None]:
# Commit changes to git
!git add data/time-lapse.mp4.dvc
!git commit -m "Update time-lapse video with full version"
!git push origin main

In [None]:
# Push the updated data to the DVC remote
!dvc push

In [8]:
mm("""
graph LR
    A[GitHub Repository] -->|dvc get| B[Local Project]
    B -->|dvc add| C[DVC Tracking]
    C -->|git add/commit| D[Git Repository]
    C -->|dvc push| E[Google Cloud Storage]

    subgraph Local Project
    B
    C
    D
    end

    subgraph Remote Storage
    A
    E
    end
""")

In [None]:
mm("""
graph TD
    A[Project Directory] --> B{File Type?}
    B -->|Large Data File| C[DVC Add]
    B -->|Code/Small File| D[Git Add]
    C --> E[Creates .dvc file]
    E --> F[Add .dvc file to Git]
    C --> G[Add original file to .gitignore]
    D --> H[Commit to Git repository]
    F --> H
    G --> H
    C --> I[Store data in DVC remote]
    I --> J[Push to DVC remote]
    H --> K[Push to Git remote]
""")

# <a id='toc4_'></a>[Experiment Tracking](#toc0_)


### <a id='toc4_1_1_'></a>[Machine Learning Workflow](#toc0_)


In [None]:
mm("""
graph TD
    A[Data Management & Analysis] --> B[Build & Experiment]
    B --> C[Solution Development & Test]
    C --> D[Deployment & Serving]
    D --> E[Monitor & Maintain]
    E -.-> A
    B -.-> A
""")

### <a id='toc4_1_2_'></a>[ML Pipeline](#toc0_)


- Experiment tracking means recording the details of several runs of a machine learning pipeline.

- **Items that can be tracked include**

  - hyperparameters
  - metrics like accuracy, loss, etc.
  - different versions of models and datasets.

- **Experiment Tracking is useful for**
  - `Compare` - Compare between different runs and models.
  - `Track` - Track results and performance metrics.
  - `Reproduce` - Replicate experiments using stored models, hyperparameters, etc.
  - `Audit` - Maintain history of input data used, model used, etc.
  - `Collaborate` - Share results and allow team members to reproduce experiments.
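
Stripped of tooling, the `Compare` and `Track` steps amount to keeping a record per run and querying those records. The sketch below shows this with plain dictionaries; the run names, parameter values, and metric values are invented, and `best_run` is a hypothetical helper rather than a DVC function.

```python
# Each record stores what was tried (params) and what came out (metrics).
runs = [
    {"name": "run-a", "params": {"learning_rate": 0.002}, "metrics": {"accuracy": 0.81, "loss": 0.19}},
    {"name": "run-b", "params": {"learning_rate": 0.010}, "metrics": {"accuracy": 0.86, "loss": 0.14}},
    {"name": "run-c", "params": {"learning_rate": 0.050}, "metrics": {"accuracy": 0.78, "loss": 0.22}},
]


def best_run(runs, metric, higher_is_better=True):
    """Pick the run with the best value for the given metric."""
    chooser = max if higher_is_better else min
    return chooser(runs, key=lambda run: run["metrics"][metric])


print(best_run(runs, "accuracy")["name"])
print(best_run(runs, "loss", higher_is_better=False)["params"])
```

Experiment tracking tools automate exactly this bookkeeping, and add the data and code versions that produced each record.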


In [None]:
mm("""
graph LR
    Data[("Data")]
    ParamsYAML[("params.yaml")]

    subgraph Pipeline
        DataLoad["data_load"]
        Featurize["featurize"]
        DataSplit["data_split"]
        Evaluate["Evaluate"]
        Train["Train"]
    end

    MetricsPlots["Metrics & Plots"]
    Model["Model"]

    Data --> DataLoad
    DataLoad --> Featurize
    Featurize --> DataSplit
    DataSplit --> Evaluate
    DataSplit --> Train
    Evaluate --> MetricsPlots
    Train --> Model

    ParamsYAML -.-> Pipeline
""")


### <a id='toc4_1_3_'></a>[Experiment Versions](#toc0_)


In [None]:
mm("""
gitGraph
    commit id: "Initial"
    branch experiment-1
    checkout experiment-1
    commit id: "Exp 1 - v1"
    commit id: "Exp 1 - v2"
    checkout main
    merge experiment-1
    branch experiment-2
    checkout experiment-2
    commit id: "Exp 2 - v1"
    checkout main
    branch experiment-3
    checkout experiment-3
    commit id: "Exp 3 - v1"
    checkout experiment-2
    commit id: "Exp 2 - v2"
    checkout main
    merge experiment-2
""")

### <a id='toc4_1_4_'></a>[Reproducibility](#toc0_)


In [None]:
mm("""
graph LR
    subgraph Development Environment
    A[Historical Data] --> B[Feature Engineering]
    B --> C[Train]
    C --> D[Scoring]
    D --> E[Predictions]
    C --> F[Model]
    end

    subgraph Production Environment
    G[Live Data] --> H[Feature Engineering]
    H --> I[Scoring]
    I --> J[Predictions]
    end

    F --> I
""")

### <a id='toc4_1_5_'></a>[Experiment Tracking with DVC](#toc0_)


In [None]:
mm("""
graph LR
    A[Raw Data] --> B[Data Preparation]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Model Evaluation]
    E --> F[Model Deployment]

    subgraph DVC Tracking
    G[Version Control]
    H[Metrics Logging]
    I[Hyperparameter Tracking]
    end

    B -.-> G
    C -.-> G
    D -.-> G
    D -.-> I
    E -.-> H
""")

### <a id='toc4_1_6_'></a>[Comparison of Experiments Tracking methods](#toc0_)

| Traditional Method     | DVC-based            |
| ---------------------- | -------------------- |
| Manual Logging         | Automated Tracking   |
| Spreadsheets           | Structured Storage   |
| Manual Version Control | Git Integration      |
| Difficult to Reproduce | Easy Reproducibility |


### <a id='toc4_1_7_'></a>[DVCLive](#toc0_)

- **DVCLive**
  - It is a Python library for tracking metrics associated with machine learning experiments.
- **Key Features**
  - Logs metrics, parameters, plots, and artifacts during model training
  - Integrates with popular ML frameworks (e.g., PyTorch Lightning, Scikit-learn)
  - Provides real-time experiment logging capabilities
  - DVCLive is an ML logger similar to MLflow. Vertex AI provides similar logging capabilities, but it is designed to work within the Google Cloud ecosystem.


- **Comparison of DVCLive with other related tools**
  - [link](https://github.com/iterative/dvclive?tab=readme-ov-file#comparison-to-related-technologies)
  - DVCLive is an ML Logger, similar to:
    - [MLFlow](https://mlflow.org/)
    - [Weights & Biases](https://wandb.ai/site)
    - [Neptune](https://neptune.ai/)
  - The main differences with those ML Loggers are:
    - DVCLive does not require any additional services or servers to run.
    - DVCLive metrics, parameters, and plots are stored as plain text files that can be versioned by tools like Git or tracked as pointers to files in DVC storage.
    - DVCLive can save experiments or runs as hidden Git commits.
    - You can then use different options to visualize the metrics, parameters, and plots across experiments.


### <a id='toc4_1_8_'></a>[DVC Studio](#toc0_)

- **DVC Studio**
  - It is a web-based platform for managing and visualizing experiments tracked with DVC.
- **Key Features**
  - Provides a user interface for viewing, comparing, and analyzing DVC experiments
  - Visualizes plots, metrics, and other experiment data in a centralized dashboard
  - Enables real-time monitoring of ongoing experiments when integrated with DVCLive
- **Integration between DVCLive and DVC Studio**
  - `DVCLive` - Tool for logging experiment data during training
  - `DVC Studio` - Platform for visualizing and managing that data
  - `Together` - In DVC Studio, users can see live updates of metrics and plots as they're being logged by DVCLive, allowing for real-time monitoring of ongoing experiments.


### <a id='toc4_1_9_'></a>[DVCLive Demo](#toc0_)

- **DVCLive in action**
  - We will now execute a script that simulates a machine learning training process.
  - It demonstrates how to use DVCLive to log parameters and metrics.
  - It creates fake accuracy and loss values that mimic typical ML training behavior over multiple epochs.
- **DVC Studio in action**
  - Run this code 3-4 times to simulate different training runs. Use different parameters to see how they affect the results.
  - Once you have run this code a few times, you can view the results in the DVCLive dashboard and also in DVC Studio.
  - DVC Studio allows you to share and visualize your DVC experiments with others.


- **Prerequisites**
  - Go to DVC Studio and create a DVC project. Grant access to your repo from that project.
  - Connect VS Code to DVC Studio: open the DVC Studio page in the DVC extension and click Get Token.


In [None]:
import time
import random
from dvclive import Live

# Define hyperparameters for the simulated training
params = {"learning_rate": 0.002, "optimizer": "Adam", "epochs": 10}

# Start a DVCLive session
with Live() as live:
    # Log the hyperparameters
    for name, value in params.items():
        live.log_param(name, value)

    # Generate a random offset so that each simulated run differs slightly
    offset = random.uniform(0.1, 0.2)

    # Simulate the training process, one iteration per epoch (1..epochs inclusive)
    for epoch in range(1, params["epochs"] + 1):
        # Generate a random "fuzz" value to add noise to the metrics
        fuzz = random.uniform(0.01, 0.1)
        # Calculate simulated accuracy (increases over time)
        accuracy = 1 - (2 ** -epoch) - fuzz - offset
        # Calculate simulated loss (decreases over time)
        loss = (2 ** -epoch) + fuzz + offset
        # Log the accuracy metric
        live.log_metric("accuracy", accuracy)
        # Log the loss metric
        live.log_metric("loss", loss)
        # Move to the next step (epoch) in DVCLive
        live.next_step()
        # Add a small delay to simulate the time taken for each epoch
        time.sleep(0.2)


In [None]:

!git add .
!git commit -m "next run"
!git push origin main

- **IMPORTANT**
  - Commit after each run for best results.
  - DVCLive will show the metrics for each run and you can compare them by selecting the commits from the drop-down.


- **More about DVCLive**
  - DVCLive relies on Git to track the generated directory
  - After you run your training code above, all the logged data will be stored in the dvclive directory and tracked as a DVC experiment for analysis and comparison.
  - DVCLive tracks different metrics and artifacts in specific locations within a directory structure:
    - Metrics:
      - General metrics: `dvclive/plots/metrics`
      - System metrics: `dvclive/plots/metrics/system`
    - Parameters: `dvclive/params.yaml`
    - Images: `dvclive/plots/images`
    - Custom plots: `dvclive/plots/custom`
    - Sklearn plots: `dvclive/plots/sklearn`
    - Artifacts:
      - Default: `.dvc` files in the root directory (e.g., `model.pt.dvc`)
      - Optional: `dvclive/artifacts/{path}` or `dvclive/artifacts/{path}.dvc`
    - Summary: `dvclive/metrics.json`
  - [link](https://dvc.org/doc/dvclive?tab=Parameters#outputs)
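
Because these outputs are plain files, they can be read back with the standard library. The sketch below assumes that `dvclive/metrics.json` holds the latest values and that each metric has a tab-separated file under `dvclive/plots/metrics` with a `step` column and a column named after the metric. Both helper names are hypothetical, and the exact file layout may vary between DVCLive versions.

```python
import csv
import json
from pathlib import Path


def load_summary(live_dir="dvclive"):
    """Read the latest metric values from the summary file."""
    return json.loads((Path(live_dir) / "metrics.json").read_text())


def load_metric_history(name, live_dir="dvclive"):
    """Read (step, value) pairs for one metric from its per-step TSV file."""
    path = Path(live_dir) / "plots" / "metrics" / f"{name}.tsv"
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [(int(row["step"]), float(row[name])) for row in reader]
```

After a training run, `load_summary()` and `load_metric_history("accuracy")` would return the final values and the accuracy curve, provided the files exist in the assumed locations.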


In [None]:
# Use the DVC CLI to view the experiment results as a table
# https://github.com/iterative/dvclive?tab=readme-ov-file#dvc-cli
# A better way, though, is to use DVC Studio or the VS Code extension.

!dvc exp show

In [None]:
# Use the DVC CLI to compare plots across experiments
# https://github.com/iterative/dvclive?tab=readme-ov-file#dvc-cli
# A better way, though, is to use DVC Studio or the VS Code extension.

!dvc plots diff $(dvc exp list --names-only) --open

# <a id='toc5_'></a>[DVC Pipelines](#toc0_)


### <a id='toc5_1_1_'></a>[Need for DVC Pipelines](#toc0_)

- **Beyond Jupyter Notebooks**
  - Experimenting interactively in Jupyter notebooks is great for exploration.
  - But once you are ready to scale up your workflow, you need a more structured way to run reproducible experiments.
- **ML Pipelines**
  - ML Pipelines are a sequence of stages that process data and train models.
  - They include stages like data preprocessing, model training, evaluation, and deployment.
- **DVC Pipelines**
  - DVC pipelines help you version control your entire machine learning workflow, including data, code, and model artifacts.
  - Pipeline definitions are stored in the `dvc.yaml` file, which describes the stages, their dependencies, and outputs.
  - DVC pipelines enable `reproducibility` by tracking the entire experiment lifecycle, from raw data to final results.


### <a id='toc5_1_2_'></a>[Demo DVC Pipeline](#toc0_)


- To demonstrate how to work with DVC pipelines, we are going to create a simple pipeline with the following stages:
  - `Prepare` - Process raw data
  - `Featurize` - Transform prepared data into feature vectors
  - `Train` - Train a machine learning model
  - `Evaluate` - Assess model performance
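
The order in which these stages run follows from what each stage consumes and produces. The sketch below models the four stages as a dictionary of dependencies and outputs, loosely mirroring what a `dvc.yaml` file describes, and derives a valid execution order with a topological sort. All file paths are invented placeholders, `execution_order` is a hypothetical helper, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical stage definitions: every path here is a placeholder.
STAGES = {
    "prepare": {"deps": ["data/raw"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "train": {"deps": ["data/features"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/features"], "outs": ["metrics.json"]},
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its inputs."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
```

Declaring inputs and outputs instead of a fixed script order is what lets a pipeline tool work out which stages depend on which.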


In [None]:
mm("""
graph LR
    A[Prepare] --> B[Featurize]
    B --> C[Train]
    C --> D[Evaluate]
""")

### <a id='toc5_1_3_'></a>[Another repository](#toc0_)

- Navigate to this [github repository](https://github.com/kanad13/dvc-pipelines) to see the code for the pipeline.


# <a id='toc6_'></a>[Benefits & Limitations of DVC](#toc0_)


## <a id='toc6_1_'></a>[Benefits of DVC for Data Science and ML Teams](#toc0_)

1. Version Control: Data, models, and code
2. Reproducibility: Experiments and environments
3. Collaboration: Team sharing and open-source contribution
4. Pipeline Management: Workflow automation and dependency tracking
5. Experiment Management: Metrics tracking and visualization
6. Storage and Computation: Remote integration and resource optimization
7. Flexibility: Lightweight architecture and tool integration


## <a id='toc6_2_'></a>[Benefits of DVC for Other Teams](#toc0_)

- **DevOps and Infrastructure**
  - CI/CD integration for ML workflows
  - Infrastructure as Code compatibility
- **FinOps and Resource Management**
  - Cost optimization and predictive modeling
  - Resource utilization tracking
- **Quality Assurance**
  - Reproducible test environments
  - Model performance validation


## <a id='toc6_3_'></a>[Limitations of DVC](#toc0_)

1. Learning curve for new users
2. Limited built-in visualization tools
3. Basic collaboration features
4. Ecosystem fragmentation in MLOps toolchain


# <a id='toc7_'></a>[Comparison of DVC with other tools](#toc0_)


## <a id='toc7_1_'></a>[DVC comparison with standalone tools](#toc0_)


### <a id='toc7_1_1_'></a>[MLflow](#toc0_)

| **Aspect**          | **MLflow**                                                                 | **DVC**                                                       |
| ------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------- |
| **Purpose**         | Primarily focused on experiment tracking, model management, and deployment | Focused on data version control and pipeline management       |
| **Key Differences** | More comprehensive UI for experiment tracking and visualization            | Integrates more closely with Git for version control          |
| **Similarities**    | Both support experiment reproducibility, metrics & parameters              | Both support experiment reproducibility, metrics & parameters |


### <a id='toc7_1_2_'></a>[Pachyderm](#toc0_)

| **Feature**           | **Pachyderm**                                                | **DVC**                                                               |
| --------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------- |
| **Purpose**           | Focuses on data lineage, versioning, and pipeline automation | Focused on data version control and pipeline management               |
| **Pipeline Approach** | Uses a container-based approach for pipeline stages          | More lightweight and integrates directly with Git                     |
| **Scalability**       | Stronger support for large-scale data processing             | Simpler learning curve and easier integration into existing workflows |
| **Data Management**   | Provides data versioning and pipeline management             | Provides data versioning and pipeline management                      |
| **Collaboration**     | Supports reproducibility and collaboration                   | Supports reproducibility and collaboration                            |


### <a id='toc7_1_3_'></a>[Weights & Biases (wandb)](#toc0_)

| **Feature**                       | **Weights & Biases (wandb)**                                | **DVC**                                                       |
| --------------------------------- | ----------------------------------------------------------- | ------------------------------------------------------------- |
| **Purpose**                       | Specializes in experiment tracking and visualization for ML | Focused on data and pipeline version control                  |
| **Visualization & Collaboration** | More advanced visualization and collaboration features      | Integrates more closely with existing version control systems |
| **Hosting**                       | Offers hosted solutions                                     | Self-hosted                                                   |
| **Integration**                   | Supports tracking experiments and metrics                   | Supports tracking experiments and metrics                     |
| **Collaboration**                 | Supports team collaboration in ML projects                  | Supports team collaboration in ML projects                    |


### <a id='toc7_1_4_'></a>[Neptune.ai](#toc0_)

| **Feature**                        | **Neptune.ai**                                                       | **DVC**                                                             |
| ---------------------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Purpose**                        | Metadata store for MLOps, specialized in experiment management       | Focused on data versioning and pipeline management                  |
| **User Interface & Collaboration** | More comprehensive UI for experiment tracking and team collaboration | Integrates more closely with Git and existing development workflows |
| **Versioning & Tracking**          | Supports versioning and tracking of ML experiments                   | Supports versioning and tracking of ML experiments                  |


### <a id='toc7_1_5_'></a>[Key Takeaways - DVC vs. Standalone Tools](#toc0_)

| Aspect                    | **DVC**                                                                    | **Standalone Tools** (MLflow, Pachyderm, W&B, Neptune.ai)                                                 |
| ------------------------- | -------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Purpose**               | Focused on data version control, pipeline management, and Git integration. | Primarily focused on experiment tracking, model management, and deployment, with various specializations. |
| **Versioning & Tracking** | Git-based version control for data, pipelines, and models.                 | Focus on experiment tracking and metrics, with some supporting metadata stores and lineage.               |
| **Pipeline Management**   | Lightweight, Git-integrated pipelines via `dvc.yaml`.                      | Varies; some offer advanced pipelines (e.g., Pachyderm's container-based approach).                       |
| **Scalability**           | Scales with the configured remote storage backend.                         | Often better suited for large-scale workflows; some tools are cloud-hosted for added scalability.         |
| **Collaboration**         | Git-based collaboration for data and models.                               | Advanced collaboration features, often with built-in team and UI support.                                 |
| **Visualization**         | Basic plots via `dvc plots`; dashboards via DVC Studio or other tools.     | Richer, built-in visualizations and dashboards (e.g., W&B excels in experiment visualization).            |
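
To make the **Versioning & Tracking** row more concrete, the sketch below imitates the general idea behind Git-based data versioning: the data file is copied into a content-addressed cache, and only a small pointer file (the part that would be committed to Git) records its hash. This is a simplified illustration, not DVC's actual implementation. The helper `track_file`, the `cache` directory, and the JSON pointer format are assumptions made for this example; DVC itself writes YAML `.dvc` files.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy data_path into a content-addressed cache and write a pointer file.

    Hypothetical helper for illustration only; not part of DVC.
    """
    # MD5 serves as a content identifier here, not as a security measure.
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    # The cache location is derived from the content hash, so identical
    # content is stored once regardless of the file name.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_path, cached)
    # The pointer is tiny; this is what would be committed to Git.
    pointer = data_path.parent / (data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_path.name}))
    return pointer


def demo() -> dict:
    """Track two versions of one file and report what the cache holds."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.csv"
        cache = root / "cache"

        data.write_text("id,value\n1,10\n")
        v1 = json.loads(track_file(data, cache).read_text())

        data.write_text("id,value\n1,10\n2,20\n")
        v2 = json.loads(track_file(data, cache).read_text())

        cached_files = [p for p in cache.rglob("*") if p.is_file()]
        return {"v1": v1["md5"], "v2": v2["md5"], "cached": len(cached_files)}


if __name__ == "__main__":
    print(demo())
```

Each change to the data produces a new hash and a new cache entry, while the pointer file stays small enough to version alongside code. Restoring an older dataset then amounts to checking out the older pointer and copying the matching cache entry back.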


## <a id='toc7_2_'></a>[DVC vs. Cloud Provider Offerings](#toc0_)


### <a id='toc7_2_1_'></a>[Google Cloud Platform (GCP)](#toc0_)

| Feature/Offering    | DVC                                             | GCP                                      | Integration/Comparison                                                |
| ------------------- | ----------------------------------------------- | ---------------------------------------- | --------------------------------------------------------------------- |
| Data Storage        | Local or remote storage                         | Google Cloud Storage                     | DVC can use Cloud Storage as a remote storage backend                 |
| Version Control     | Git-like versioning for data and models         | Object Versioning (per object only)      | DVC adds Git-linked, dataset-level versioning to Cloud Storage        |
| Experiment Tracking | Basic experiment versioning through Git commits | Vertex AI Experiments                    | DVC can complement Vertex AI by versioning data used in experiments   |
| Pipeline Management | `dvc.yaml` for defining pipelines               | Cloud Composer (based on Apache Airflow) | DVC pipelines can be triggered within Cloud Composer workflows        |
| Model Registry      | Through Git and DVC remote storage              | Vertex AI Model Registry                 | DVC can version model files that are registered in Vertex AI          |
| Scalability         | Depends on storage backend                      | Highly scalable cloud infrastructure     | DVC leverages GCP's scalability when using Cloud Storage              |
| Collaboration       | Through Git and shared DVC remotes              | Various GCP collaboration tools          | DVC enhances collaboration on data and models within GCP projects     |


### <a id='toc7_2_2_'></a>[Amazon Web Services (AWS)](#toc0_)

| Feature/Offering    | DVC                                             | AWS                                        | Integration/Comparison                                              |
| ------------------- | ----------------------------------------------- | ------------------------------------------ | ------------------------------------------------------------------- |
| Data Storage        | Local or remote storage                         | Amazon S3                                  | DVC can use S3 as a remote storage backend                          |
| Version Control     | Git-like versioning for data and models         | S3 Versioning (per object only)            | DVC adds Git-linked, dataset-level versioning to S3                 |
| Experiment Tracking | Basic experiment versioning through Git commits | SageMaker Experiments                      | DVC can complement SageMaker by versioning data used in experiments |
| Pipeline Management | `dvc.yaml` for defining pipelines               | AWS Step Functions, SageMaker Pipelines    | DVC pipelines can be integrated into AWS workflow tools             |
| Model Registry      | Through Git and DVC remote storage              | SageMaker Model Registry                   | DVC can version model files that are registered in SageMaker        |
| Scalability         | Depends on storage backend                      | Highly scalable cloud infrastructure       | DVC leverages AWS's scalability when using S3                       |
| Collaboration       | Through Git and shared DVC remotes              | AWS collaboration tools (e.g., CodeCommit) | DVC enhances collaboration on data and models within AWS projects   |


### <a id='toc7_2_3_'></a>[Microsoft Azure](#toc0_)

| Feature/Offering    | DVC                                             | Azure                                     | Integration/Comparison                                              |
| ------------------- | ----------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------------- |
| Data Storage        | Local or remote storage                         | Azure Blob Storage                        | DVC can use Azure Blob Storage as a remote storage backend          |
| Version Control     | Git-like versioning for data and models         | Blob versioning (per blob only)           | DVC adds Git-linked, dataset-level versioning to Blob Storage       |
| Experiment Tracking | Basic experiment versioning through Git commits | Azure Machine Learning                    | DVC can complement Azure ML by versioning data used in experiments  |
| Pipeline Management | `dvc.yaml` for defining pipelines               | Azure Data Factory, ML Pipelines          | DVC pipelines can be integrated into Azure workflow tools           |
| Model Registry      | Through Git and DVC remote storage              | Azure Machine Learning Model Registry     | DVC can version model files that are registered in Azure ML         |
| Scalability         | Depends on storage backend                      | Highly scalable cloud infrastructure      | DVC leverages Azure's scalability when using Blob Storage           |
| Collaboration       | Through Git and shared DVC remotes              | Azure DevOps                              | DVC enhances collaboration on data and models within Azure projects |


### <a id='toc7_2_4_'></a>[Key Takeaways - DVC vs. Cloud Offerings](#toc0_)

| Aspect                  | DVC                                                                                    | Cloud Provider Offerings                                                                                        |
| ----------------------- | -------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| **Versioning**          | Cloud-agnostic, Git-like versioning for data and models across cloud storage backends. | Object-level versioning in storage services; no dataset-level history linked to code.                           |
| **Experiment Tracking** | Basic versioning via Git commits. Lacks rich experiment tracking features.             | Advanced tools (e.g., SageMaker, Vertex AI, Azure ML) offer detailed metrics and comparison capabilities.       |
| **Model Registry**      | Supports model versioning but lacks deployment and lineage features.                   | Full-featured registries with advanced capabilities (e.g., SageMaker, Azure ML Model Registry).                 |
| **Pipeline Management** | Focused on data science workflows via `dvc.yaml`.                                      | Broader workflow management, including data engineering and ETL (e.g., AWS Step Functions, Azure Data Factory). |
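
As a rough illustration of the cloud-agnostic **Versioning** row, switching providers mostly comes down to changing the remote URL in DVC's config file. The sketch below renders config text in the INI-style layout used by `.dvc/config`, once per provider. The bucket and container names are placeholders, and `render_dvc_config` is a hypothetical helper written for this example; in practice the file is normally managed with `dvc remote add` rather than written by hand.

```python
import configparser
import io

# Placeholder locations; replace with real bucket/container names.
REMOTE_URLS = {
    "gcp": "gs://example-bucket/dvcstore",
    "aws": "s3://example-bucket/dvcstore",
    "azure": "azure://example-container/dvcstore",
}


def render_dvc_config(remote_name: str, url: str) -> str:
    """Return config text declaring remote_name as the default remote."""
    config = configparser.ConfigParser()
    config["core"] = {"remote": remote_name}
    # DVC names remote sections like: ['remote "storage"']
    config[f"'remote \"{remote_name}\"'"] = {"url": url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    for provider, url in REMOTE_URLS.items():
        print(f"# {provider}")
        print(render_dvc_config("storage", url))
```

Only the `url` value differs between the three outputs, which is why pointer files and pipeline definitions committed to Git do not need to change when the storage provider does. Credentials and provider-specific options are left out of this sketch.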
