# Introduction to DVC

This chapter provides a comprehensive introduction to Data Version Control (DVC), a tool essential for data versioning in machine learning. Learners will explore the motivation behind data versioning, understand its differences from code versioning, and experiment with a simple classification problem. They will review basic Git commands, learn about DVC, and practice setting up a repository. The chapter concludes with an overview of DVC’s features and use cases, including versioning data and models, CI/CD for machine learning, experiment tracking, pipelines, and more.

## 1.1 Data Versioning Motivation

### Ex.1 - Anatomy of a Machine Learning Model

In this exercise, you will reinforce your understanding of how data influences model performance. You will work with the Airbnb booking dataset (`booking.csv`), which is suited to a classification task: predicting whether someone will cancel a booking. It contains several numerical and categorical columns. You will split the dataset into three mutually exclusive samples (`train_A.csv`, `train_B.csv`, and `test.csv`) using the `split_dataset.py` script. Then, for each training set, you will run the data processing and model training pipeline in `model_training.py` to train a Random Forest classifier and evaluate it on the test set. The hyperparameters defined in `params.json` are the same for both runs.

The Python scripts accept command-line arguments and are run from the shell. Feel free to explore these scripts to deepen your understanding.

**Instructions**

1. Split the dataset by running `python split_dataset.py booking.csv train_A.csv train_B.csv test.csv` in the editor shell.
2. Train a model on the first training set by running `python model_training.py <params_file> <training_file> <test_file>` with the appropriate filenames, and note the metrics.
3. Train a model on the second training set by running `python model_training.py <params_file> <training_file> <test_file>` with the appropriate filenames, and compare the metrics with the previous run.
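
The contents of `split_dataset.py` are not shown here, so the following is only a minimal sketch of what "three mutually exclusive samples" means. The function name `split_three_ways`, the 20% test fraction, and the seed are illustrative assumptions, and the integers stand in for the rows of `booking.csv`.

```python
import random


def split_three_ways(rows, test_fraction=0.2, seed=42):
    """Split rows into two training samples and one test sample.

    The fraction and the seed are illustrative assumptions, not values
    taken from split_dataset.py.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_train = (len(shuffled) - n_test) // 2

    test = shuffled[:n_test]
    train_a = shuffled[n_test:n_test + n_train]
    train_b = shuffled[n_test + n_train:]
    return train_a, train_b, test


rows = list(range(100))  # stand-in for the rows of booking.csv
train_a, train_b, test = split_three_ways(rows)

# No row may appear in more than one sample.
assert not set(train_a) & set(train_b)
assert not set(train_a) & set(test)
assert not set(train_b) & set(test)
print(len(train_a), len(train_b), len(test))  # prints: 40 40 20
```

Because the three slices are taken from one shuffled list, every row lands in exactly one sample, which is the property the exercise relies on when comparing the two training sets against a shared test set.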

```
$ python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv \
    ExampleML1/data-processed/train_A.csv \
    ExampleML1/data-processed/train_B.csv \
    ExampleML1/data-processed/test.csv

ExampleML1/data-processed/test.csv file created...
ExampleML1/data-processed/train_A.csv file created...
ExampleML1/data-processed/train_B.csv file created...
Completed!
```


```
$ python -m ExampleML1.model_training ExampleML1/config/params.json \
    ExampleML1/data-processed/train_A.csv \
    ExampleML1/data-processed/test.csv

{'Precision': 0.8914, 'Recall': 0.4613, 'F1 Score': 0.608, 'Accuracy': 0.8031}
```


```
$ python -m ExampleML1.model_training ExampleML1/config/params.json \
    ExampleML1/data-processed/train_B.csv \
    ExampleML1/data-processed/test.csv

{'Precision': 0.8421, 'Recall': 0.4908, 'F1 Score': 0.6202, 'Accuracy': 0.801}
```
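
Both runs use the same hyperparameters and the same test set, so any difference in the metrics comes from the training data alone. The short sketch below lines up the two results reported above; the helper name `metric_deltas` is an illustrative choice, and the numbers are copied from the two runs.

```python
# Metrics copied from the train_A and train_B runs above.
metrics_a = {"Precision": 0.8914, "Recall": 0.4613, "F1 Score": 0.608, "Accuracy": 0.8031}
metrics_b = {"Precision": 0.8421, "Recall": 0.4908, "F1 Score": 0.6202, "Accuracy": 0.801}


def metric_deltas(first, second):
    """Return second - first for every metric, rounded to 4 decimals."""
    return {name: round(second[name] - first[name], 4) for name in first}


for name, delta in metric_deltas(metrics_a, metrics_b).items():
    print(f"{name:<10} A={metrics_a[name]:.4f}  B={metrics_b[name]:.4f}  B-A={delta:+.4f}")
```

Precision drops by about 0.05 and recall rises by about 0.03 when switching from `train_A.csv` to `train_B.csv`, while accuracy barely moves. Without a record of which data produced which model, such a shift would be hard to explain later, which is the motivation for versioning data alongside code.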


## 1.2 DVC

### Installation (Windows)

1. Open PowerShell as Administrator

2. Install **[Chocolatey](https://chocolatey.org/install)**
    - Choose the **Individual** installation option.
    - Use the administrator PowerShell session from step 1.
    - Check that the execution policy is not `Restricted` by running
        > ```
        > $ Get-ExecutionPolicy
        > ```
        
    - If `Restricted` is returned, run
        > ```
        > Set-ExecutionPolicy Bypass -Scope Process
        > ```
        
    - Run the Chocolatey installation command
        > ```
        > $ Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
        > ```
        
    - Once it is installed, you can verify it by running:
        > ```
        > $ choco -?
        > ```

3. Install **DVC**
    > ```
    > $ choco install dvc
    > ```
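
After the installation, a quick way to confirm that both tools are reachable is to look them up on `PATH`. The sketch below uses only the Python standard library; the helper name `find_missing_tools` is an illustrative choice, and a freshly opened shell may be needed before newly installed tools become visible.

```python
import shutil


def find_missing_tools(tools=("choco", "dvc")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("choco and dvc are both available.")
```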

### Ex.2 - Working with Git CLI

Imagine you're starting a new machine learning project. You want to utilize Git to track code changes and collaborate effectively. In this exercise, you'll create a Git repository, create a new branch, add an initial file, and make your first commit. These foundational Git commands will help you later when we apply them in conjunction with DVC. You can expand on this exercise by adding more files, creating more branches, and committing those changes to further your understanding.

**Instructions**

1. Initialize a Git repository within this directory.
2. Create a new branch called `main` using the `git checkout -b <branch-name>` command.
3. Add the `greeter.py` file to the staging area.
4. Commit the changes with the commit message `Initial commit: Added greeter.py`.

```
$ git init
$ git checkout -b main
$ git add greeter.py
$ git commit -m "Initial commit: Added greeter.py"
```
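
If you prefer to script these steps, for example to set up several practice repositories, the same commands can be driven from Python. This is a minimal sketch, assuming Git is installed and the working directory contains `greeter.py`; the helper names `git_setup_commands` and `run_commands` are illustrative.

```python
import subprocess


def git_setup_commands(filename="greeter.py", branch="main"):
    """Build the Git commands from the exercise as argument lists."""
    return [
        ["git", "init"],
        ["git", "checkout", "-b", branch],
        ["git", "add", filename],
        ["git", "commit", "-m", f"Initial commit: Added {filename}"],
    ]


def run_commands(commands, cwd="."):
    """Run each command in `cwd`, stopping at the first failure."""
    for command in commands:
        print("$", " ".join(command))
        subprocess.run(command, cwd=cwd, check=True)


# This only prints the commands. Call run_commands(git_setup_commands())
# inside a project directory that contains greeter.py to execute them.
for command in git_setup_commands():
    print(" ".join(command))
```

Passing each command as a list of arguments avoids shell quoting problems with the commit message, and `check=True` stops the sequence as soon as one step fails.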

### Ex.3 - Review DVC CLI

In this exercise, you will test your knowledge of DVC (Data Version Control) commands by evaluating several statements about its CLI. Understanding these commands is essential for effective data management. Mark a statement as correct only if its description accurately reflects what the command does.

**Select all correct answers**

- [x] `dvc init` initializes a DVC repository in your working folder. **True**
- [ ] `dvc get` is used to synchronize data changes from a remote data server. **False**
    > `dvc get` downloads a specific file or directory tracked by DVC.
    > To synchronize data changes, use `dvc pull`.
- [x] `dvc checkout` is used to update all DVC-tracked files and directories to match a specific state. **True**
- [ ] `dvc add` is used to record the current state of all tracked data files. **False**
    > `dvc add` adds a data file to DVC for tracking.
    > To record the current state, use `dvc commit`.

## 1.3 DVC features and use cases

### Pipelines

- Define the pipeline in `dvc.yaml`

```
stages:
  train:
    cmd: python code/train.py
    deps:
      - code/train.py
      - data/input_data.csv
      - params/params.json
    outs:
      - model_output/model.pkl
```

- Run with
```
$ dvc repro
```
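
`dvc repro` reruns a stage only when something it depends on has changed, which it determines by comparing content hashes against those recorded in `dvc.lock`. The sketch below illustrates that idea in plain Python. It is a simplified illustration rather than DVC's actual implementation, and the helper names and the temporary stand-in file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(deps, recorded):
    """Return the dependencies whose current hash differs from the recorded one."""
    return [dep for dep in deps if recorded.get(dep) != file_hash(dep)]


with tempfile.TemporaryDirectory() as tmp:
    data = str(Path(tmp) / "input_data.csv")  # hypothetical stand-in file
    Path(data).write_text("a,b\n1,2\n")

    recorded = {data: file_hash(data)}        # snapshot taken after the first run
    print(changed_deps([data], recorded))     # prints: []

    Path(data).write_text("a,b\n1,3\n")       # edit the dependency
    print(changed_deps([data], recorded) == [data])  # prints: True
```

When the list of changed dependencies is empty, the stage can be skipped; once any dependency differs from its recorded hash, the stage is stale and has to run again.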


### Tracking metrics and plots

- Compare metrics between the current workspace and the last commit with
```
$ dvc metrics diff
```
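
`dvc metrics diff` compares metrics files, so the training step has to write its results to a file (JSON, YAML, or TOML) that is declared under `metrics:` in `dvc.yaml`. Whether `model_training.py` already does this is not shown, so the following is only a sketch; the helper name `write_metrics` and the file name `metrics.json` are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(metrics, path):
    """Write a flat metrics dictionary to a JSON file and return the path."""
    path = Path(path)
    path.write_text(json.dumps(metrics, indent=2))
    return path


# Values copied from the train_A run in Ex.1.
metrics = {"Precision": 0.8914, "Recall": 0.4613, "F1 Score": 0.608, "Accuracy": 0.8031}

with tempfile.TemporaryDirectory() as tmp:
    out = write_metrics(metrics, Path(tmp) / "metrics.json")  # hypothetical file name
    print(out.read_text())
```

In a real project the file would be written inside the repository rather than a temporary directory, so that each commit carries the metrics it produced.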

### Experiment tracking

- Run an experiment and log its metrics
```
$ dvc repro
$ dvc exp save
```
- Alternatively, combine the two steps with `dvc exp run`.
- Experiments are stored as custom Git references
    * This prevents experiments from bloating the Git commit history
    * Explicit saves can be made with `dvc exp save`
- Visualize experiments using `dvc exp show`

---------------------------