# DVC Configuration and Data Management

This chapter covers DVC setup: installation, repository initialization, and the `.dvcignore` file. It then explores the DVC cache and staging files: how to add and remove files, manage the cache, and understand how DVC uses MD5 hashes to identify content. It also explains DVC remotes, how they differ from Git remotes, and how to add, list, and modify them. Finally, it shows how to interact with remotes by pushing and pulling data, checking out specific versions, and fetching data into the cache.

## 2.1 DVC Setup and Initialization

### Verify Installation

In [1]:
!dvc version

DVC version: 3.54.1 (choco)
---------------------------
Platform: Python 3.12.5 on Windows-11-10.0.22631-SP0
Subprojects:
	dvc_data = 3.16.5
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.4.0
	scmrepo = 3.3.7
Supports:
	azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
	gdrive (pydrive2 = 1.20.0),
	gs (gcsfs = 2024.6.1),
	http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.6.1, boto3 = 1.34.162),
	ssh (sshfs = 2024.6.0)
Config:
	Global: C:\Users\Jacqueline\AppData\Local\iterative\dvc
	System: C:\ProgramData\iterative\dvc


### Data & Code Versioning

In [2]:
!git init
!dvc init

Initialized empty Git repository in C:/Users/Jacqueline/Documents/projects/CAMP-MLPRod/6-DataVersioningDVC/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


### Checking Ignored Files

`dvc check-ignore` prints the path if the file is ignored by `.dvcignore`; otherwise it prints nothing. Add `-d` to also show the pattern that matched and where it is defined.

In [3]:
!more .dvcignore

# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore

# Ignore material folder
material/*


In [4]:
!dvc check-ignore "material/1.2 Introduction to DVC.txt"

material/1.2 Introduction to DVC.txt


In [5]:
!dvc check-ignore -d "material/1.2 Introduction to DVC.txt"

.dvcignore:6:material/*	material/1.2 Introduction to DVC.txt


In [6]:
!dvc check-ignore data-sources/booking.csv
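
The last command prints nothing because no pattern in `.dvcignore` matches `data-sources/booking.csv`. As a rough mental model, the check can be sketched in Python with the standard-library `fnmatch` module. This is only an approximation: DVC applies gitignore-style rules, which `fnmatch` does not fully reproduce (for example, its `*` also matches `/`, and negated patterns are not handled). `read_patterns` and `first_matching_pattern` are hypothetical helpers, not part of DVC.

```python
from fnmatch import fnmatch

def read_patterns(dvcignore_text):
    """Return (line_number, pattern) pairs, skipping blank lines and comments."""
    patterns = []
    for number, line in enumerate(dvcignore_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            patterns.append((number, stripped))
    return patterns

def first_matching_pattern(path, patterns):
    """Return the first (line_number, pattern) that matches path, else None."""
    for number, pattern in patterns:
        if fnmatch(path, pattern):
            return number, pattern
    return None

# Contents of the .dvcignore file shown above.
DVCIGNORE = """\
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore

# Ignore material folder
material/*
"""

patterns = read_patterns(DVCIGNORE)
print(first_matching_pattern("material/1.2 Introduction to DVC.txt", patterns))
print(first_matching_pattern("data-sources/booking.csv", patterns))
```

The first call reports line 6 and the pattern `material/*`, mirroring the `-d` output above; the second returns `None`.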

## 2.2 DVC Cache and Staging Files

### Configure location for DVC Cache

```
$ dvc cache dir ~/mycache
```

### Adding Files to Cache

`dvc add` stores the file in the cache and creates a matching `.dvc` metadata file. Use `dvc add -v` for verbose output.

```
$ dvc add data.csv
```

### `booking.csv.dvc` file content as example

```
outs:
- md5: 8e30b9da0032c81edebc9f7492dcea14
  size: 3241399
  hash: md5
  path: booking.csv
```
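
The `md5` field is the hash of the file's contents, and DVC uses it to name the cached copy. The sketch below recomputes such a hash with Python's `hashlib` and derives where the object would sit in the cache. It assumes the DVC 3.x layout, in which objects live under `.dvc/cache/files/md5/` in a directory named after the first two hex characters of the hash. `file_md5` and `cache_path` are illustrative helpers, not DVC functions.

```python
import hashlib
from pathlib import PurePosixPath

def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file's raw bytes in chunks so large datasets need little memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def cache_path(md5, cache_dir=".dvc/cache"):
    """Assumed DVC 3.x layout: <cache_dir>/files/md5/<first 2 chars>/<remaining 30>."""
    return str(PurePosixPath(cache_dir) / "files" / "md5" / md5[:2] / md5[2:])

# Hash taken from the booking.csv.dvc example above.
print(cache_path("8e30b9da0032c81edebc9f7492dcea14"))
```

Running `file_md5("booking.csv")` on the tracked file should reproduce the value stored in `booking.csv.dvc` as long as the file has not changed since `dvc add`.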

### Removing from and Cleaning Cache

```
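$ dvc remove data.csv.dvc   # stop tracking data.csv (deletes the .dvc file)
$ dvc gc -w                 # delete cache files not referenced in the current workspace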
```

## 2.3 Configuring DVC Remotes

### Setting up Remotes

Add a remote by giving it a name and a storage location:

```
$ dvc remote add <name> <location>
```

### S3 bucket

```
$ dvc remote add s3_remote s3://mys3bucket
```

### GCP bucket

```
$ dvc remote add gcp_remote gs://myGCPbucket
```

### Azure

```
$ dvc remote add azure_remote azure://mycontainer/path
```
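
The three examples differ mainly in the URL scheme, which identifies the storage backend. A small sketch of that mapping, limited to the schemes shown above and treating anything else as a local directory (the `describe_remote` helper and its labels are illustrative assumptions, not DVC output):

```python
from urllib.parse import urlparse

# Schemes taken from the examples above; `dvc version` lists everything supported.
BACKENDS = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "azure": "Azure Blob Storage",
}

def describe_remote(url):
    """Return (backend, bucket_or_container, path) for a remote URL."""
    parsed = urlparse(url)
    if parsed.scheme not in BACKENDS:
        # No recognized scheme: treat the value as a local directory.
        return ("local directory", "", url)
    return (BACKENDS[parsed.scheme], parsed.netloc, parsed.path.lstrip("/"))

for url in ("s3://mys3bucket", "gs://myGCPbucket", "azure://mycontainer/path"):
    print(url, "->", describe_remote(url))
```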

In [7]:
!dvc remote add s3_remote s3://mys3bucket

### Local Remotes

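Local remotes are directories on the local file system and are useful for rapid prototyping. The `--local` flag is a separate concept: it saves the remote's settings to `.dvc/config.local`, which is not tracked by Git, instead of `.dvc/config`. The two commands below are alternatives.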

```
$ dvc remote add --local mylocalremote /tmp/dvc
$ dvc remote add mylocalremote /tmp/dvc
```

Set the default remote with the `-d` flag:

```
$ dvc remote add -d mylocalremote /tmp/dvc
```

In [8]:
!dvc remote add -d mylocalremote remote-storage/

Setting 'mylocalremote' as a default remote.


### Viewing the `.dvc\config` File

In [9]:
!more .dvc\config

[core]
    remote = mylocalremote
['remote "s3_remote"']
    url = s3://mys3bucket
['remote "mylocalremote"']
    url = ../remote-storage


### Listing Remotes

```
$ dvc remote list
```

In [10]:
!dvc remote list

s3_remote	s3://mys3bucket
mylocalremote	C:\Users\Jacqueline\Documents\projects\CAMP-MLPRod\6-DataVersioningDVC\remote-storage
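
The config file stores `../remote-storage`, yet `dvc remote list` prints an absolute path. A relative remote path in a config file is interpreted relative to the directory that contains the file (`.dvc/`), which the following sketch mimics. `resolve_remote_url` is a hypothetical helper and `/project` is a placeholder project root; POSIX-style paths are used to keep the example platform-independent.

```python
import posixpath

def resolve_remote_url(config_file, url):
    """Resolve a remote URL as stored in a DVC config file.

    Cloud URLs and absolute paths are returned unchanged; relative paths
    are joined onto the directory that holds the config file.
    """
    if "://" in url or posixpath.isabs(url):
        return url
    config_dir = posixpath.dirname(config_file)
    return posixpath.normpath(posixpath.join(config_dir, url))

print(resolve_remote_url("/project/.dvc/config", "../remote-storage"))
print(resolve_remote_url("/project/.dvc/config", "s3://mys3bucket"))
```

The first call yields `/project/remote-storage`, the sibling of `.dvc/` that the remote was created in; the S3 URL is left untouched.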


### Modifying Remote Configuration

Customize a remote's settings with `dvc remote modify`:

```
$ dvc remote modify s3_remote connect_timeout 300
```

To change a DVC remote's location, modify its `url` setting:

```
$ dvc remote modify --local <remote_name> url </path/to/new-location>
```

In [11]:
!dvc remote modify s3_remote connect_timeout 300

In [12]:
!more .dvc\config

[core]
    remote = mylocalremote
['remote "s3_remote"']
    url = s3://mys3bucket
    connect_timeout = 300
['remote "mylocalremote"']
    url = ../remote-storage
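
The file is plain text, so its contents can also be inspected from Python. The sketch below uses the standard-library `configparser` as a stand-in reader; this is an assumption made for illustration and not necessarily how DVC parses the file. `read_remotes` is a hypothetical helper, and the sample text is the output shown above.

```python
import configparser
import re

CONFIG_TEXT = """\
[core]
    remote = mylocalremote
['remote "s3_remote"']
    url = s3://mys3bucket
    connect_timeout = 300
['remote "mylocalremote"']
    url = ../remote-storage
"""

def read_remotes(config_text):
    """Return (default_remote_name, {remote_name: {option: value}})."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    default = parser.get("core", "remote", fallback=None)
    remotes = {}
    for section in parser.sections():
        # Section names look like: 'remote "s3_remote"' (quotes included).
        match = re.fullmatch(r"""'?remote "(.+)"'?""", section)
        if match:
            remotes[match.group(1)] = dict(parser[section])
    return default, remotes

default, remotes = read_remotes(CONFIG_TEXT)
print("default remote:", default)
for name, options in remotes.items():
    print(name, options)
```

Note that all values come back as strings, so `connect_timeout` is `"300"` rather than the integer `300`.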


## 2.4 Interacting with DVC Remotes

### Uploading and Retrieving Data

- Move data between the cache and the DVC remote: `dvc push` uploads from the cache, `dvc pull` downloads and updates the workspace

```
$ dvc push <target>
$ dvc pull <target>
```

- Push all data tracked in the current workspace (no target given)

```
$ dvc push
```

- Download data from the remote into the cache without changing workspace contents

```
$ dvc fetch
```

- Override the default remote with the `-r` flag

```
$ dvc push -r s3_remote data.csv
```
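
How these commands relate can be sketched with a toy model in which the remote, the cache, and the workspace are plain dictionaries. This is only an illustration under simplifying assumptions (no targets, no `-r` flag, no links between cache and workspace); `ToyRepo` and the placeholder hash `abc123` are hypothetical and not part of DVC. It shows that `dvc fetch` fills the cache only, while `dvc pull` behaves like `dvc fetch` followed by `dvc checkout`.

```python
class ToyRepo:
    """Toy model: the remote and the cache map a content hash to file contents."""

    def __init__(self):
        self.remote = {}     # DVC remote storage
        self.cache = {}      # local DVC cache
        self.workspace = {}  # files visible in the project directory (path -> contents)

    def push(self):
        """Upload cached objects to the remote."""
        self.remote.update(self.cache)

    def fetch(self):
        """Download remote objects into the cache only."""
        self.cache.update(self.remote)

    def checkout(self, tracked):
        """Place cached objects into the workspace; tracked maps path -> hash."""
        for path, content_hash in tracked.items():
            self.workspace[path] = self.cache[content_hash]

    def pull(self, tracked):
        """Behaves like fetch followed by checkout."""
        self.fetch()
        self.checkout(tracked)

repo = ToyRepo()
repo.remote["abc123"] = "booking data"   # placeholder hash and contents
tracked = {"booking.csv": "abc123"}      # what a .dvc file would record

repo.fetch()
print(repo.workspace)  # {} because fetch does not touch the workspace
repo.pull(tracked)
print(repo.workspace)  # {'booking.csv': 'booking data'}
```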

### Versioning data

- `.dvc` files are tracked by Git, not DVC
- Leverage this to check out a specific version of a data file
- Check out the `.dvc` file

```
$ git checkout <commit_hash|tag|branch>
```
    
- Retrieve the data that matches the MD5 hash recorded in the `.dvc` file

```
$ dvc checkout <target>
```

### Tracking Data Changes

- Change the data file's contents, then add it again so DVC records the new version

```
$ dvc add <target>
```
    
- Commit the changed `.dvc` file to Git

```
$ git add <target>.dvc
$ git commit <target>.dvc -m "Dataset updates"
```

- Push metadata to Git

```
$ git push origin main
```

- Upload the changed data file

```
$ dvc push
```
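
The steps above can be collected into a small script. The sketch below builds the command sequence for one data file and, by default, only prints it. `tracking_commands` and `run_all` are hypothetical helpers, and the branch name `main` and remote name `origin` are taken from the example above rather than detected.

```python
import shlex
import subprocess

def tracking_commands(target, message="Dataset updates", branch="main"):
    """Build the command sequence from the steps above for one data file."""
    return [
        ["dvc", "add", target],
        ["git", "add", f"{target}.dvc"],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
        ["dvc", "push"],
    ]

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failure

run_all(tracking_commands("dataset.csv"))
```

Calling `run_all(..., dry_run=False)` would actually execute the commands, so it needs Git, DVC, and a configured remote.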

### Ex.1 - Versioning Data using DVC Remote

In this editor exercise, you'll practice versioning your datasets and pushing them to a DVC remote. Data versioning and storage is the fundamental value proposition of DVC, and you'll learn how Git and DVC work together to achieve it. The dataset you'll be working with is a weather dataset used to predict rainfall from atmospheric conditions.

We've already initialized DVC, configured a local remote at `/tmp/dvc`, and added a setup commit.

**Instruction**

- Add `dataset.csv` to the DVC cache.
- Commit the corresponding `.dvc` file to Git, with the commit message `"tracking dataset.csv"`.
- Push the dataset to the DVC remote.
- Though you are the only one working with this DVC setup, run the `dvc pull` command to ensure everything is up to date.

```
$ dvc add dataset.csv
$ git status
$ git add .
$ git commit -m "tracking dataset.csv"
$ dvc remote list
$ dvc push
$ dvc pull
```

### Ex.2 - Checking out Versioned Data

In this editor exercise, you'll practice moving between versions of your datasets by checking out the corresponding metadata versions from the Git repository. This exercise builds on the previous one: the initial state of the weather dataset was tracked, then 1000 lines were removed from it and the result was pushed to the DVC remote. Your task is to roll back the Git commit to the previous state, check out the DVC dataset at that state, and observe the changes.

We've already initialized DVC, configured a local remote at `/tmp/dvc`, and added a setup commit. Then, we added two more commits marking the dataset tracking and changes.

**NOTE**: To roll back the last N commits in the Git repository, you can use:

```
$ git reset --hard HEAD~N
```

**Instruction**

1. Inspect the Git commit history using the `git log` command. Notice that the top two commit messages reflect the updates to the dataset. Press `q` to leave the interactive view.
2. Inspect the `md5` value in `dataset.csv.dvc` and compare it with the output of `md5sum dataset.csv`.
3. Roll back the dataset metadata file by one commit. The `md5` value in the `.dvc` file changes, but it is now inconsistent with the output of `md5sum dataset.csv`, because the data file itself has not been touched.
4. Update the dataset by checking out the version that matches the metadata file. The `md5` value in the metadata should again match the output of `md5sum dataset.csv`.


```
$ git log
$ cat dataset.csv.dvc
$ md5sum dataset.csv
$ git reset --hard HEAD~1
$ dvc checkout
$ cat dataset.csv.dvc
$ md5sum dataset.csv
```
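
The comparison in steps 2 to 4 can also be done programmatically, which helps on systems without `md5sum` (for example, Windows by default). The sketch below is a minimal version that assumes a `.dvc` file tracking a single file, as in this exercise, and reads the hash with a regular expression instead of a YAML parser. `recorded_md5`, `actual_md5`, and `is_in_sync` are hypothetical helpers, not DVC functions.

```python
import hashlib
import re

def recorded_md5(dvc_file_text):
    """Extract the first `md5:` value from the text of a .dvc file."""
    match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})", dvc_file_text, re.MULTILINE)
    return match.group(1) if match else None

def actual_md5(path):
    """Counterpart of `md5sum <path>` for files that fit in memory."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()

def is_in_sync(dvc_file, data_file):
    """True when the workspace file matches the hash recorded in the .dvc file."""
    with open(dvc_file, encoding="utf-8") as handle:
        return recorded_md5(handle.read()) == actual_md5(data_file)
```

Under these assumptions, `is_in_sync("dataset.csv.dvc", "dataset.csv")` would be expected to return `False` right after `git reset --hard HEAD~1` and `True` again after `dvc checkout`.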

-------------------