This notebook was run on http://sipecamdata.conabio.gob.mx:8888/lab? (a VPN connection is required). You can also run the cells locally with:

```
docker run -d --rm --name jupyterlab_dummy -p 8888:8888 palmoreck/jupyterlab_optimizacion:3.1.0
```

It uses the https://github.com/palmoreck/example_dvc repo.

# Install `dvc`

In [1]:
!pip install -q dvc

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


# Clone repo

In [2]:
%%bash
cd ~
git clone https://github.com/palmoreck/example_dvc.git

Cloning into 'example_dvc'...


# `dvc init`

In [7]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


# Example data

Download the data:

In [9]:
%%bash
cd ~/example_dvc
url_data=https://raw.githubusercontent.com/palmoreck/minikube_kubeflow_kale_examples/main/reference_notebooks/kale/
wget $url_data/train_titanic.csv
wget $url_data/test_titanic.csv

--2021-11-12 09:53:38--  https://raw.githubusercontent.com/palmoreck/minikube_kubeflow_kale_examples/main/reference_notebooks/kale//train_titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /palmoreck/minikube_kubeflow_kale_examples/main/reference_notebooks/kale/train_titanic.csv [following]
--2021-11-12 09:53:38--  https://raw.githubusercontent.com/palmoreck/minikube_kubeflow_kale_examples/main/reference_notebooks/kale/train_titanic.csv
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 61194 (60K) [text/plain]
Saving to: ‘train_titanic.csv’

     0K .......... .......... .......... .......... .......... 83% 1.26M 0s
    50K .........                                  

# `dvc add`

In [10]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc add train_titanic.csv
$path_dvc/dvc add test_titanic.csv


To track the changes with git, run:

	git add train_titanic.csv.dvc .gitignore

To track the changes with git, run:

	git add .gitignore test_titanic.csv.dvc
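
`dvc add` writes a small `<file>.dvc` metafile for each data file and lists the data file in `.gitignore`, so `git` tracks only the metafile. The metafile identifies the data by a content hash. As a rough illustration of that idea (not DVC's actual code), the sketch below computes an MD5 digest of a file in chunks. The name `file_md5` is hypothetical, and the digest DVC records may differ from a plain MD5 for text files in some DVC versions, so treat any comparison as a sanity check only.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `file_md5(os.path.expanduser("~/example_dvc/train_titanic.csv"))` returns a 32-character hexadecimal string.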






# Track files and commit

First configure `git` with your name and email (for example with `git config`) in a bash terminal, not in the notebook.

In [12]:
%%bash
cd ~/example_dvc
git add .gitignore train_titanic.csv.dvc
git add test_titanic.csv.dvc
git commit -m "Add dvc files for data"

[main cae2de9] Add dvc files for data
 12 files changed, 525 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/confusion_normalized.json
 create mode 100644 .dvc/plots/linear.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/simple.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 test_titanic.csv.dvc
 create mode 100644 train_titanic.csv.dvc


# `dvc remote add`

Add local remote:

In [14]:
%%bash
cd ~/example_dvc
mkdir -p ~/dvc_storage
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc remote add local ~/dvc_storage

List the remotes:

In [15]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc remote list

local	/home/myuser/dvc_storage


Show the config:

In [16]:
%%bash
cd ~/example_dvc
cat .dvc/config 

['remote "local"']
    url = /home/myuser/dvc_storage
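
The config file is INI-like, with one section per remote. As a reading aid, the sketch below parses the text shown above with Python's standard `configparser`. This is an assumption made for illustration: DVC has its own config loader, and `read_remotes` is a hypothetical helper that only handles sections shaped like `['remote "name"']`.

```python
import configparser

# Text copied from the `cat .dvc/config` output above.
CONFIG_TEXT = """\
['remote "local"']
    url = /home/myuser/dvc_storage
"""


def read_remotes(config_text):
    """Map remote names to their URLs for sections shaped like ['remote "name"']."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    remotes = {}
    prefix = 'remote "'
    for section in parser.sections():
        # The parsed section name keeps its outer single quotes; drop them.
        name = section.strip("'")
        if name.startswith(prefix) and name.endswith('"'):
            remotes[name[len(prefix):-1]] = parser[section]["url"]
    return remotes


print(read_remotes(CONFIG_TEXT))
```

With the text above, the printed dictionary maps `local` to `/home/myuser/dvc_storage`, which agrees with the `dvc remote list` output.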


Commit:

In [17]:
%%bash
cd ~/example_dvc
git commit .dvc/config -m "Configure local remote"

[main aa43094] Configure local remote
 1 file changed, 2 insertions(+)


# `dvc push`

Check that the directory `~/dvc_storage` is empty before running `dvc push`:

In [18]:
%%bash
ls -lh ~/dvc_storage

total 0


`dvc push` copies the data from the local cache to the remote storage we set up earlier:

In [19]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc push -r local

2 files pushed


Check:

In [20]:
%%bash
ls -lh ~/dvc_storage

total 8.0K
drwxr-xr-x 2 myuser myuser 4.0K Nov 12 10:01 02
drwxr-xr-x 2 myuser myuser 4.0K Nov 12 10:01 61


# Inside `~/example_dvc`, run `git push` in a bash terminal, not in the notebook.

Check in the repo that there are no .csv files.

# The CSV files have the following numbers of rows:

In [25]:
import pandas as pd
import os

In [32]:
url_csv = os.path.join(os.path.expanduser("~"),"example_dvc")
filename_train = "train_titanic.csv"
filename_test = "test_titanic.csv"
train_df = pd.read_csv(os.path.join(url_csv, filename_train))
test_df  = pd.read_csv(os.path.join(url_csv, filename_test))

In [33]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [34]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
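
When only the row count matters, loading a full data frame is not required. A minimal sketch using the standard library's `csv` module follows. `count_rows` is a hypothetical helper, and it assumes the first record is a header and that the file is UTF-8 encoded.

```python
import csv


def count_rows(path):
    """Count non-empty data records in a CSV file, excluding the header record."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header record, if the file has one
        # Blank lines come back as empty lists; leave them out of the count.
        return sum(1 for row in reader if row)
```

For example, `count_rows(os.path.join(url_csv, filename_train))` can be compared with the `RangeIndex` entry count reported by `train_df.info()`.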


# Making changes to data

Suppose you change the data, here by keeping only the first 400 lines of each file (the header plus 399 rows):

In [40]:
%%bash
head -n 400 ~/example_dvc/train_titanic.csv > ~/example_dvc/train_titanic.csv_2
mv ~/example_dvc/train_titanic.csv_2 ~/example_dvc/train_titanic.csv
head -n 400 ~/example_dvc/test_titanic.csv >  ~/example_dvc/test_titanic.csv_2
mv ~/example_dvc/test_titanic.csv_2 ~/example_dvc/test_titanic.csv
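
The same truncation can be sketched in Python for readers who prefer to stay in the notebook kernel. The helper name `keep_first_lines` is hypothetical. Like `head`, it counts physical lines, so it assumes that no CSV field contains an embedded newline.

```python
import os
import shutil
import tempfile
from itertools import islice


def keep_first_lines(path, n_lines):
    """Rewrite `path` in place so it keeps only its first `n_lines` physical lines."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it into place,
    # mirroring the `head ... > tmp` followed by `mv tmp original` pattern.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            dst.writelines(islice(src, n_lines))
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

For the case above, the call would be `keep_first_lines(os.path.expanduser("~/example_dvc/train_titanic.csv"), 400)`, and likewise for the test file.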

Run `dvc add`, then `git commit` and `dvc push`:

In [41]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc add train_titanic.csv
$path_dvc/dvc add test_titanic.csv


To track the changes with git, run:

	git add train_titanic.csv.dvc

To track the changes with git, run:

	git add test_titanic.csv.dvc






In [42]:
%%bash
cd ~/example_dvc
git commit -m "data update" -i *.dvc

[main f93871e] data update
 2 files changed, 4 insertions(+), 4 deletions(-)


In [44]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc push -r local

2 files pushed


Check:

In [45]:
%%bash
ls -lh ~/dvc_storage/*

/home/myuser/dvc_storage/02:
total 28K
-r--r--r-- 1 myuser myuser 28K Nov 12 10:01 9c9cd22461f6dbe8d9ab01def965c6

/home/myuser/dvc_storage/61:
total 60K
-r--r--r-- 1 myuser myuser 60K Nov 12 10:01 fdd54abdbf6a85b778e937122e1194

/home/myuser/dvc_storage/6f:
total 28K
-r--r--r-- 1 myuser myuser 27K Nov 12 10:18 239b05fcf9ef02f4c29999e4ef8c65

/home/myuser/dvc_storage/eb:
total 28K
-r--r--r-- 1 myuser myuser 27K Nov 12 10:18 948a8bd52f721c6ab4519cd9aac07c
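
The listing shows how this local remote is organised: each object sits in a directory named after the first two hexadecimal characters of its hash, and the file name holds the remaining 30 characters. Both versions of each file are kept, which is what makes a later rollback possible. Below is a minimal sketch of that mapping, assuming the two-level layout seen here (other DVC versions may lay out storage differently). `object_path` and `list_hashes` are hypothetical helper names, not part of DVC.

```python
from pathlib import Path


def object_path(storage_dir, md5_hex):
    """Return where an object with this 32-character hash would sit in the storage."""
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character hexadecimal MD5 digest")
    return Path(storage_dir) / md5_hex[:2] / md5_hex[2:]


def list_hashes(storage_dir):
    """Rebuild full hashes from the <2 chars>/<30 chars> layout, sorted."""
    root = Path(storage_dir)
    return sorted(
        prefix.name + obj.name
        for prefix in root.iterdir() if prefix.is_dir()
        for obj in prefix.iterdir() if obj.is_file()
    )


# Example using a hash visible in the listing above.
print(object_path("/home/myuser/dvc_storage", "029c9cd22461f6dbe8d9ab01def965c6"))
```

Calling `list_hashes(os.path.expanduser("~/dvc_storage"))` after the second push should return four hashes, two per CSV file.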


# The CSV files now have the following numbers of rows:

In [46]:
train_df = pd.read_csv(os.path.join(url_csv, filename_train))
test_df  = pd.read_csv(os.path.join(url_csv, filename_test))

In [47]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  399 non-null    int64  
 1   Survived     399 non-null    int64  
 2   Pclass       399 non-null    int64  
 3   Name         399 non-null    object 
 4   Sex          399 non-null    object 
 5   Age          321 non-null    float64
 6   SibSp        399 non-null    int64  
 7   Parch        399 non-null    int64  
 8   Ticket       399 non-null    object 
 9   Fare         399 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     398 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 37.5+ KB


In [48]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  399 non-null    int64  
 1   Pclass       399 non-null    int64  
 2   Name         399 non-null    object 
 3   Sex          399 non-null    object 
 4   Age          318 non-null    float64
 5   SibSp        399 non-null    int64  
 6   Parch        399 non-null    int64  
 7   Ticket       399 non-null    object 
 8   Fare         398 non-null    float64
 9   Cabin        85 non-null     object 
 10  Embarked     399 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 34.4+ KB


# Inside `~/example_dvc`, run `git push` in a bash terminal, not in the notebook.

# If the updated CSV files turn out to be worse (for example, models evaluated on them perform poorly), go back to the previous version of the data with `git checkout` on the `.dvc` files followed by `dvc checkout`

In [50]:
%%bash
cd ~/example_dvc
git checkout aa43094f8f9e40c5f46e11d8c131ed83663553bf train_titanic.csv.dvc

Updated 1 path from 67a9f42


In [51]:
%%bash
cd ~/example_dvc
git checkout aa43094f8f9e40c5f46e11d8c131ed83663553bf test_titanic.csv.dvc

Updated 1 path from 67a9f42


In [52]:
%%bash
cd ~/example_dvc
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc checkout

M       test_titanic.csv
M       train_titanic.csv
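
The rollback above follows a two-step pattern: `git checkout <rev> <file>.dvc` restores the metafiles from an earlier commit, then `dvc checkout` syncs the data files in the workspace with whatever those metafiles reference. The sketch below only builds the command lines, so they can be inspected before anything runs. `rollback_commands` and `run_rollback` are hypothetical helpers, and `dvc` is assumed to be on `PATH` (the notebook calls it through `/home/myuser/.local/bin/` instead).

```python
import subprocess


def rollback_commands(rev, dvc_files, dvc_bin="dvc"):
    """Build the git and dvc commands that restore data to its state at `rev`."""
    if not dvc_files:
        raise ValueError("need at least one .dvc file to roll back")
    return [
        # Step 1: restore the metafiles from the chosen commit.
        ["git", "checkout", rev, "--", *dvc_files],
        # Step 2: make the workspace data match the restored metafiles.
        [dvc_bin, "checkout"],
    ]


def run_rollback(repo_dir, rev, dvc_files, dvc_bin="dvc"):
    """Run the commands inside `repo_dir`, stopping at the first failure."""
    for cmd in rollback_commands(rev, dvc_files, dvc_bin):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

For the case above, the call would be `run_rollback(os.path.expanduser("~/example_dvc"), "aa43094", ["train_titanic.csv.dvc", "test_titanic.csv.dvc"])`. The restored `.dvc` files are left staged in `git`, so the rollback only becomes part of the history once it is committed.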


Check the number of rows:

In [53]:
train_df = pd.read_csv(os.path.join(url_csv, filename_train))
test_df  = pd.read_csv(os.path.join(url_csv, filename_test))

In [54]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [55]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


# Extra: if someone else wants to work with the data (versions 1 and 2):

Connect to the server that hosts `~/dvc_storage/`, clone the repo https://github.com/palmoreck/example_dvc, and run `dvc pull`. The clone checks out the latest pushed commit, so `dvc pull` fetches version 2 of the data (399 rows):

In [56]:
%%bash
cd ~
git clone https://github.com/palmoreck/example_dvc.git ~/new_repo/

Cloning into '/home/myuser/new_repo'...


In [57]:
%%bash
cd ~/new_repo/
path_dvc=/home/myuser/.local/bin/
$path_dvc/dvc pull -r local

A       test_titanic.csv
A       train_titanic.csv
2 files added and 2 files fetched


In [58]:
url_csv = os.path.join(os.path.expanduser("~"),"new_repo")
filename_train = "train_titanic.csv"
filename_test = "test_titanic.csv"

In [59]:
train_df = pd.read_csv(os.path.join(url_csv, filename_train))
test_df  = pd.read_csv(os.path.join(url_csv, filename_test))

In [60]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  399 non-null    int64  
 1   Survived     399 non-null    int64  
 2   Pclass       399 non-null    int64  
 3   Name         399 non-null    object 
 4   Sex          399 non-null    object 
 5   Age          321 non-null    float64
 6   SibSp        399 non-null    int64  
 7   Parch        399 non-null    int64  
 8   Ticket       399 non-null    object 
 9   Fare         399 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     398 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 37.5+ KB


In [61]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  399 non-null    int64  
 1   Pclass       399 non-null    int64  
 2   Name         399 non-null    object 
 3   Sex          399 non-null    object 
 4   Age          318 non-null    float64
 5   SibSp        399 non-null    int64  
 6   Parch        399 non-null    int64  
 7   Ticket       399 non-null    object 
 8   Fare         398 non-null    float64
 9   Cabin        85 non-null     object 
 10  Embarked     399 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 34.4+ KB


# Refs


https://dvc.org/doc/start

https://dvc.org/doc/start/data-and-model-versioning

https://dvc.org/doc/command-reference/remote
