# DATA VERSIONING - DVC

1. Why do professionals need to version control their data?
2. What is the best way to version control data?
3. How do you use DVC to version control data?

Data Ingestion > Data Preprocessing > Feature Engineering > Feature Selection > Model Training > Model Evaluation > Model Monitoring > Model Maintenance > Data Versioning > Model Versioning



* What is DVC
* Its history and need
* Why professionals use data version control
* Best way to version control data
* How to use DVC with examples
* A flowchart to tie it all together

---

# 📚 Notes: **Data Versioning with DVC**

---

## 🔍 What is DVC?

**DVC (Data Version Control)** is an **open-source tool** designed to handle **version control for data, models, and machine learning pipelines**. It works **on top of Git** and manages large files, datasets, and ML experiments efficiently.

🔧 It creates small `.dvc` files that describe your data (hash, size, path). These files are versioned in Git, while the actual data stays in DVC's local cache and, once pushed, in **remote storage** (S3, Google Drive, etc.).
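
To make the pointer-file idea concrete, here is a minimal sketch in plain shell. It does not call DVC. It only imitates the idea of describing a data file by its content hash. The file name `train.csv.pointer` and the YAML layout are assumptions for illustration, loosely modelled on what a `.dvc` file contains, and `md5sum` is assumed to be available (GNU coreutils).

```bash
#!/usr/bin/env bash
# Sketch only: imitates the idea behind a .dvc pointer file, not DVC's real implementation.
set -euo pipefail

workdir="$(mktemp -d)"   # scratch directory for the demo
printf 'id,label\n1,cat\n2,dog\n' > "$workdir/train.csv"

# The content hash and size identify this exact version of the data.
hash="$(md5sum "$workdir/train.csv" | cut -d' ' -f1)"
size="$(wc -c < "$workdir/train.csv" | tr -d ' ')"

# Write a tiny metadata file; a file like this is what would be committed to Git.
# (Hypothetical name and approximate layout.)
cat > "$workdir/train.csv.pointer" <<EOF
outs:
- md5: $hash
  size: $size
  path: train.csv
EOF

cat "$workdir/train.csv.pointer"
```

The pointer stays a few lines long no matter how large the data file grows, which is why it is cheap to version in Git.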

---

## 🏛️ History of DVC

* **Created by Iterative.ai**
* First released in **2017**
* Inspired by Git but tailored for **ML workflows** and **big data**
* Gained popularity for making ML projects reproducible and collaborative

---

## 🧩 Why is DVC Needed?

Traditional version control systems like Git aren't built for:

* Large files (GBs to TBs)
* Data pipelines
* ML experiments that require reproducibility

**DVC solves these problems** by:

* Keeping data out of Git
* Tracking data & model versions
* Supporting external storage (cloud/local)
* Enabling team collaboration on big data

---

## ✅ Why Professionals Need Data Version Control

| Problem                | Why Versioning Helps                    |
| ---------------------- | --------------------------------------- |
| Data changes over time | Keep track of what changed and when     |
| Experiments fail       | Roll back to stable versions            |
| Teams collaborate      | Avoid overwriting each other's data     |
| Reproducibility        | Run same code + same data = same result |
| Compliance & audits    | Show what data was used in production   |
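
The reproducibility and audit rows both come down to being able to prove that the data has not changed. A rough sketch of that check with plain checksums follows. It is not a DVC feature, only the underlying idea. The directory layout and the manifest name `manifest.sha256` are assumptions, and GNU `sha256sum` is assumed to be installed.

```bash
#!/usr/bin/env bash
# Sketch only: detect silent data changes with a checksum manifest.
set -euo pipefail

workdir="$(mktemp -d)"
mkdir -p "$workdir/data"
printf 'a,b\n1,2\n' > "$workdir/data/train.csv"
printf 'a,b\n3,4\n' > "$workdir/data/test.csv"
cd "$workdir"

# Record the checksums of the data used for a run (hypothetical manifest name).
sha256sum data/*.csv > manifest.sha256

# Verify before re-running an experiment: nothing has changed yet.
if sha256sum --check --quiet manifest.sha256 >/dev/null 2>&1; then
  status_before="unchanged"
else
  status_before="changed"
fi

# Simulate someone appending rows to the training set.
printf '5,6\n' >> data/train.csv

if sha256sum --check --quiet manifest.sha256 >/dev/null 2>&1; then
  status_after="unchanged"
else
  status_after="changed"
fi

echo "before edit: $status_before, after edit: $status_after"
```

DVC automates this bookkeeping: the hash is stored in the `.dvc` file, so the Git history doubles as a record of which data each commit used.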

---

## 🏆 Best Way to Version Control Data

| Option                                 | Pros                           | Cons                   |
| -------------------------------------- | ------------------------------ | ---------------------- |
| **Manual naming** (e.g. `data_v1.csv`) | Easy                           | Messy, error-prone     |
| **Git only**                           | Great for code                 | Repo bloats with large files; GitHub rejects files larger than 100 MB |
| **Git + DVC** ✅                        | Scalable, efficient, organized | Needs setup            |

➡️ **Best Practice**: Use **Git + DVC + remote storage**

---

## 🛠️ How to Use DVC to Version Control Data

### 🔹 Step-by-Step

1. **Install DVC**

   ```bash
   pip install dvc
   # For the Google Drive remote used in step 5: pip install "dvc[gdrive]"
   ```

2. **Initialize Git + DVC**

   ```bash
   git init
   dvc init
   ```

3. **Add data to DVC**

   ```bash
   dvc add data/
   ```

4. **Track with Git**

   ```bash
   git add data.dvc .gitignore
   git commit -m "Add data with DVC"
   ```

5. **Push to Remote**

   ```bash
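   dvc remote add -d myremote gdrive://<folder-id>
   git add .dvc/config                   # the remote setting is stored in .dvc/config
   git commit -m "Configure DVC remote"
   dvc push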
   ```
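
The five steps can be tried end to end without cloud credentials by pointing the remote at a local directory. The sketch below assumes `git` and `dvc` are installed and skips itself otherwise. The temporary directories, the remote name `localremote`, and the demo Git identity are all placeholders rather than anything DVC requires.

```bash
#!/usr/bin/env bash
# Sketch only: round-trip a dataset through a local-directory DVC remote.
set -euo pipefail

result="skipped"
if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
  project="$(mktemp -d)"   # throwaway project directory (placeholder)
  storage="$(mktemp -d)"   # stands in for S3 / Google Drive (placeholder)
  cd "$project"

  git init -q
  dvc init

  mkdir data
  echo "id,label" > data/train.csv
  dvc add data                       # writes data.dvc and updates .gitignore
  git add data.dvc .gitignore

  dvc remote add -d localremote "$storage"
  git add .dvc/config
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Track data with DVC"
  dvc push

  # Simulate a fresh clone: drop the data and the local cache, then pull.
  rm -rf data .dvc/cache
  dvc pull
  if [ -f data/train.csv ]; then
    result="restored"
  fi
fi
echo "$result"
```

A teammate would do the same last step after `git clone`: run `dvc pull` to fetch the data that the committed `.dvc` files point to.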

---

## ✌️ Two Easy Examples

### 📘 Example 1: Image Dataset Versioning

You have `data/images/` for a classification task:

```bash
dvc add data/images/
git add data/images.dvc data/.gitignore
git commit -m "Added image dataset v1"
dvc push
```

Later, you update the dataset:

```bash
# Add new images
dvc add data/images/
git commit -am "Updated dataset to v2"
dvc push
```

To switch between versions:

```bash
git checkout <old_commit>
dvc checkout
```
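
Switching versions works because every version of the data is kept under its content hash, and the pointer committed in Git says which hash to restore. Here is a minimal sketch of that mechanism in plain shell. The `snapshot` and `restore` helpers are hypothetical stand-ins for what `dvc add` and `dvc checkout` do conceptually, not DVC's actual code, and `md5sum` is assumed to be available.

```bash
#!/usr/bin/env bash
# Sketch only: a toy content-addressed cache with snapshot/restore.
set -euo pipefail

workdir="$(mktemp -d)"
cache="$workdir/cache"
mkdir -p "$cache"

# Hypothetical helper: copy a file into the cache under its hash, print the hash.
snapshot() {
  local file="$1" hash
  hash="$(md5sum "$file" | cut -d' ' -f1)"
  cp "$file" "$cache/$hash"
  echo "$hash"
}

# Hypothetical helper: restore the version with the given hash to a path.
restore() {
  cp "$cache/$1" "$2"
}

echo "v1 images list" > "$workdir/data.txt"
v1="$(snapshot "$workdir/data.txt")"

echo "v2 images list, extended" > "$workdir/data.txt"
v2="$(snapshot "$workdir/data.txt")"

# "Check out" the old version: only the hash is needed to get the data back.
restore "$v1" "$workdir/data.txt"
cat "$workdir/data.txt"
```

In the real workflow, `git checkout <old_commit>` brings back the old hash (inside the `.dvc` file) and `dvc checkout` restores the matching data from the cache.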

---

### 📗 Example 2: Track a Model File

You trained a model and saved it to `models/model.pkl`:

```bash
dvc add models/model.pkl
git add models/model.pkl.dvc models/.gitignore
git commit -m "Track model v1"
dvc push
```

Now you can track **data + code + model** together. Reproducible and clean!

---

## 🔁 Flowchart: DVC Workflow Overview

```plaintext
+-----------------+
|    Raw Data     |
+--------+--------+
         |
   dvc add data/
         |
+--------v--------+
| .dvc File + Git |
+--------+--------+
         |
   dvc remote add
         |
+--------v--------+
|  dvc push (to   |
|  cloud/local)   |
+--------+--------+
         |
+--------v--------+
| Git commit code |
| + .dvc metadata |
+--------+--------+
         |
+--------v--------+
| Reproduce, roll |
| back, share     |
+-----------------+
```

---

## 📝 Summary

* DVC = Git for Data
* Tracks large files, models, datasets
* Works well with Git for team ML workflows
* Makes ML **reproducible, shareable, and easy to roll back**
* Easy to use with cloud remotes

---
