# DVC Tutorial

This notebook provides an introduction to Data Version Control (DVC) and demonstrates its key features, including versioning data, managing pipelines, and storing data remotely.

## What is DVC?
[DVC](https://dvc.org/) is an open-source version control system for data science and machine learning projects. It manages data files and models in much the same way Git manages source code.

## Before running this notebook:
 - Follow the instructions in the [README.md](README.md) file to set up the environment.
 - Select the 'DVC Tutorial' kernel for this notebook.


### Import relevant libraries
Install DVC if it is not already installed:

In [1]:
!pip install dvc




#### Initialize DVC

Running `dvc init` creates a `.dvc` subfolder in the project folder. If [.dvc](.dvc) already exists, either delete it for this tutorial or pass `-f` to force re-initialization, as follows:

In [2]:
!dvc init -f

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
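
The choice between a plain `dvc init` and `dvc init -f` can also be made programmatically. The following is a minimal sketch, not part of the original notebook: `dvc_init_command` is a hypothetical helper that only builds the argument list and does not run DVC itself.

```python
import os


def dvc_init_command(project_dir="."):
    """Return the `dvc init` argument list suited to project_dir.

    Hypothetical helper: it adds -f only when a .dvc folder already exists,
    which is the case the tutorial handles by forcing re-initialization.
    """
    cmd = ["dvc", "init"]
    if os.path.isdir(os.path.join(project_dir, ".dvc")):
        cmd.append("-f")
    return cmd


# Show the command that would be used for the current directory
print(" ".join(dvc_init_command()))
```

The returned list could be passed to `subprocess.run` in environments where the `!` shell syntax of Jupyter is not available.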


### Determine OS

In [3]:
import os
from PIL import Image
import platform
import numpy as np

os_type = platform.system()
print(f"Running on {os_type}")

Running on Windows


### Define the data directory
#### Create Directory if it doesn't exist

In [4]:
image_dir = "images"
os.makedirs(image_dir, exist_ok=True)

# Write four random 100x100 RGB images into the data directory
for i in range(4):
    data = np.random.rand(100, 100, 3) * 255
    img = Image.fromarray(data.astype("uint8")).convert("RGB")
    img.save(os.path.join(image_dir, f"image{i}.png"))

!dvc add images


To track the changes with git, run:

	git add images.dvc .gitignore

To enable auto staging, run:

	dvc config core.autostage true


⠋ Checking graph



If the directory is already tracked by Git, remove it from Git's index (the files stay on disk) and commit that change before running `dvc add`:

`git rm -r --cached 'data\example_subfolder'`

`git commit -m "stop tracking data\example_subfolder"`
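
DVC identifies each version of tracked data by a content hash recorded in the `.dvc` metafile (here `images.dvc`), so changing the images and running `dvc add` again produces a new version. The sketch below illustrates the idea for a single file using MD5. `file_md5` is a hypothetical helper, and its result is not guaranteed to match the value DVC records, in particular for directories, where DVC derives the hash differently.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use.

    Hypothetical helper for illustration; it is not DVC's own hashing code.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5` on an image before and after the next cell regenerates it should give two different digests, which is what makes the two versions distinguishable.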

In [5]:


# Overwrite the images with new random data to create a second version
for i in range(4):
    data = np.random.rand(100, 100, 3) * 255
    img = Image.fromarray(data.astype("uint8")).convert("RGB")
    img.save(os.path.join(image_dir, f"image{i}.png"))

!dvc add images


To track the changes with git, run:

	git add images.dvc

To enable auto staging, run:

	dvc config core.autostage true


⠋ Checking graph



### Storing data remotely

In [6]:
!dvc remote add -d myremote /tmp/dvcstore
!dvc push

Setting 'myremote' as a default remote.
5 files pushed
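
The remote above is a plain local directory. `/tmp` is a Unix convention, and on Windows such a path may resolve relative to the current drive, so a path built from the operating system's temp directory can be more predictable. This is a minimal sketch: `local_remote_path` is a hypothetical helper, and `dvcstore` is just the folder name reused from the command above.

```python
import os
import tempfile


def local_remote_path(name="dvcstore"):
    """Return a directory under the OS temp folder, creating it if needed.

    Hypothetical helper; the directory is meant to serve as a local DVC remote.
    """
    path = os.path.join(tempfile.gettempdir(), name)
    os.makedirs(path, exist_ok=True)
    return path


# Print the command that could replace the hard-coded /tmp path
remote_path = local_remote_path()
print(f"dvc remote add -d myremote {remote_path}")
```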


### Managing experiments

In [7]:
# Create a new branch for an experiment
!git checkout -b experiment


# Make changes to the image files
for i in range(4):
    data = np.random.rand(100, 100, 3) * 255
    img = Image.fromarray(data.astype("uint8")).convert("RGB")
    img.save(os.path.join(image_dir, f"image{i}.png"))

# Track the new version of the images with DVC
!dvc add images
# Reproduce the pipeline; this needs a dvc.yaml, which this tutorial does not define
!dvc repro

# Compare the experiment's metrics with the main branch
!git checkout main
!dvc metrics diff experiment

fatal: a branch named 'experiment' already exists



To track the changes with git, run:

	git add images.dvc

To enable auto staging, run:

	dvc config core.autostage true


⠋ Checking graph

ERROR: 'c:\Users\pss2\PycharmProjects\dataversioncontrol\dvc.yaml' does not exist


M	.dvc/config
M	.gitignore
M	binder/environment.yaml
M	dvc_tutorial.ipynb
Your branch is up to date with 'origin/main'.


Already on 'main'
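
The `dvc repro` error in the output above comes from a missing `dvc.yaml`. A small pre-flight check can surface this before the command runs. The sketch below is an illustration only: `missing_prerequisites` is a hypothetical helper that checks for two files on disk and does not validate their contents.

```python
import os


def missing_prerequisites(project_dir="."):
    """List reasons why `dvc repro` cannot run in project_dir.

    Hypothetical helper: an empty list means both checks passed.
    """
    problems = []
    if not os.path.isdir(os.path.join(project_dir, ".dvc")):
        problems.append("no .dvc directory (run `dvc init`)")
    if not os.path.isfile(os.path.join(project_dir, "dvc.yaml")):
        problems.append("no dvc.yaml (no pipeline stages defined)")
    return problems


# Report any problems for the current directory
for problem in missing_prerequisites():
    print(f"Cannot run dvc repro: {problem}")
```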


### Clean up


In [13]:
# !dvc remove images.dvc
import stat

def change_permissions(path):
    # Clear the read-only flag so the path can be deleted
    os.chmod(path, stat.S_IWRITE)

def remove_dir(dir_path):
    # Delete a directory tree bottom-up, making each entry writable first
    if os.path.exists(dir_path):
        for root, dirs, files in os.walk(dir_path, topdown=False):
            for name in files:
                fpath = os.path.join(root, name)
                change_permissions(fpath)
                os.remove(fpath)
            for name in dirs:
                fpath = os.path.join(root, name)
                change_permissions(fpath)
                os.rmdir(fpath)  # directories need rmdir, not remove
        change_permissions(dir_path)
        os.rmdir(dir_path)

    
if os_type == "Windows":
    import shutil
    if os.path.exists('.dvc'):
        change_permissions('.dvc')
        # shutil.rmtree('.dvc', ignore_errors=True)
        remove_dir('.dvc')
    if os.path.exists('images'):
        change_permissions('images')
        shutil.rmtree('images', ignore_errors=True)

else:
    !rm -rf images .dvc
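
The manual directory walk in the clean-up cell can also be expressed with `shutil.rmtree` and an error handler that clears the read-only flag and retries. This is a sketch of that alternative, aimed at read-only files on Windows: `force_rmtree` is a hypothetical name, and the version check is there because Python 3.12 introduced the `onexc` parameter and deprecated `onerror`.

```python
import os
import shutil
import stat
import sys


def _clear_readonly_and_retry(func, path, _exc):
    """Error handler: make the path writable, then retry the failed call."""
    os.chmod(path, stat.S_IWRITE)
    func(path)


def force_rmtree(path):
    """Remove a directory tree, retrying entries that fail as read-only.

    Hypothetical helper; does nothing if the path does not exist.
    """
    if not os.path.exists(path):
        return
    if sys.version_info >= (3, 12):
        shutil.rmtree(path, onexc=_clear_readonly_and_retry)
    else:
        shutil.rmtree(path, onerror=_clear_readonly_and_retry)
```

With this helper, the platform-specific branch could be reduced to calling `force_rmtree('.dvc')` and `force_rmtree('images')` on every operating system.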
