(Optional) Step 2: Data Pipelines in DVC
===

* Last modified: April 4, 2022

**This project is developed in several steps**

**It does not work at large scale, but it is a good example of pipelines**

In [1]:
%cd dvcdemo

/workspace/dvcdemo


Project download
---

In [2]:
!wget https://code.dvc.org/get-started/code.zip
!unzip code.zip
!rm -f code.zip

--2022-06-07 16:40:57--  https://code.dvc.org/get-started/code.zip
Resolving code.dvc.org (code.dvc.org)... 104.21.81.205, 172.67.164.76, 2606:4700:3036::6815:51cd, ...
Connecting to code.dvc.org (code.dvc.org)|104.21.81.205|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://s3-us-east-2.amazonaws.com/dvc-public/code/get-started/code.zip [following]
--2022-06-07 16:40:59--  https://s3-us-east-2.amazonaws.com/dvc-public/code/get-started/code.zip
Resolving s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)... 52.219.103.65
Connecting to s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)|52.219.103.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5800 (5.7K) [application/zip]
Saving to: ‘code.zip’


2022-06-07 16:41:02 (93.3 KB/s) - ‘code.zip’ saved [5800/5800]

Archive:  code.zip
  inflating: params.yaml             
  inflating: src/evaluate.py         
  inflating: src/featurization.py    
  inflating: src/pr

In [5]:
!pip3 install --quiet -r src/requirements.txt

Stage prepare
---

In [6]:
#
# -n: stage name
# -p: parameters tracked from params.yaml
# -d: dependencies
# -o: outputs
#
cmd = """
dvc stage add -n prepare \
              -p prepare.seed,prepare.split \
              -d src/prepare.py \
              -d data/data.xml \
              -o data/prepared \
              python3 src/prepare.py data/data.xml
"""
!{cmd}

Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'

To track the changes with git, run:

    git add data/.gitignore dvc.yaml

To enable auto staging, run:

    dvc config core.autostage true

In [7]:
!cat dvc.yaml

stages:
  prepare:
    cmd: python3 src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
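
The two parameters tracked by this stage, `prepare.seed` and `prepare.split`, suggest a seeded random train/test split. The following is a minimal sketch of that idea, not the contents of `src/prepare.py`: the helper `split_rows` and the sample rows are hypothetical, and the parameter values are the ones recorded later in `dvc.lock`.

```python
import random

# Values as recorded in dvc.lock for the prepare stage.
params = {"seed": 20170428, "split": 0.2}


def split_rows(rows, seed, split):
    """Assign each row to test with probability `split`, else to train.

    Hypothetical helper: seeding a private Random instance makes the
    assignment repeatable from run to run.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < split else train).append(row)
    return train, test


# Made-up rows standing in for the records of data/data.xml.
rows = [f"row-{i}" for i in range(1000)]
train, test = split_rows(rows, params["seed"], params["split"])
again_train, again_test = split_rows(rows, params["seed"], params["split"])

print(len(train), len(test))
print(train == again_train and test == again_test)  # True: same seed, same split
```

Because the seed is a tracked parameter, changing it in `params.yaml` marks the stage as changed, so the next `dvc repro` runs it again.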


src/prepare.py
---

In [8]:
!pygmentize src/prepare.py

import io
import os
import random
import re
import sys
import xml.etree.ElementTree

import yaml

params = yaml.safe_load(open("params.yaml"))["prepare"]

if len(sys.argv) != 2:
    sys.stderr.write("Arguments error. Usage:\n")
    sys.stderr.write("\tpython prepare.py data-file\n")
    sys.exit(1)

# Test data set split ratio

Stage featurize
---

In [9]:
cmd = """
dvc stage add -n featurize \
              -p featurize.max_features,featurize.ngrams \
              -d src/featurization.py \
              -d data/prepared \
              -o data/features \
              python3 src/featurization.py data/prepared data/features
"""
!{cmd}

Adding stage 'featurize' in 'dvc.yaml'

To track the changes with git, run:

    git add data/.gitignore dvc.yaml

To enable auto staging, run:

    dvc config core.autostage true

In [10]:
!cat dvc.yaml

stages:
  prepare:
    cmd: python3 src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
  featurize:
    cmd: python3 src/featurization.py data/prepared data/features
    deps:
    - data/prepared
    - src/featurization.py
    params:
    - featurize.max_features
    - featurize.ngrams
    outs:
    - data/features
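
Each entry under `params` is a dotted key that points into `params.yaml`. The sketch below illustrates how such a key could be resolved against the parsed file. It is only an illustration: `get_param` is a hypothetical helper, the dictionary stands in for the parsed `params.yaml`, and DVC's own resolution logic may differ.

```python
def get_param(params, dotted_key):
    """Walk a nested dict following a dotted key such as 'featurize.ngrams'."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


# Stand-in for the parsed params.yaml; only the values that dvc.lock
# records for these two stages are reproduced here.
params = {
    "prepare": {"seed": 20170428, "split": 0.2},
    "featurize": {"max_features": 100, "ngrams": 1},
}

tracked = ["featurize.max_features", "featurize.ngrams"]
resolved = {key: get_param(params, key) for key in tracked}
print(resolved)  # {'featurize.max_features': 100, 'featurize.ngrams': 1}
```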


src/featurization.py
---

In [11]:
!pygmentize src/featurization.py

import os
import pickle
import sys

import numpy as np
import pandas as pd
import scipy.sparse as sparse
import yaml
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

params = yaml.safe_load(open("params.yaml"))["featurize"]

np.set_printoptions(suppress=True)

if len

Stage train
---

In [12]:
cmd = """
dvc stage add -n train \
              -p train.seed,train.n_est,train.min_split \
              -d src/train.py \
              -d data/features \
              -o model.pkl \
              python3 src/train.py data/features model.pkl
"""
!{cmd}

Adding stage 'train' in 'dvc.yaml'

To track the changes with git, run:

    git add .gitignore dvc.yaml

To enable auto staging, run:

    dvc config core.autostage true

In [13]:
!cat dvc.yaml

stages:
  prepare:
    cmd: python3 src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
  featurize:
    cmd: python3 src/featurization.py data/prepared data/features
    deps:
    - data/prepared
    - src/featurization.py
    params:
    - featurize.max_features
    - featurize.ngrams
    outs:
    - data/features
  train:
    cmd: python3 src/train.py data/features model.pkl
    deps:
    - data/features
    - src/train.py
    params:
    - train.min_split
    - train.n_est
    - train.seed
    outs:
    - model.pkl
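
The `deps` and `outs` lists are what connect the stages: a stage that depends on a path produced by another stage has to run after it. The sketch below derives an execution order from the entries above using the standard library. It illustrates the idea only and is not DVC's implementation; the names `stages`, `producer` and `graph` are introduced here for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# deps and outs copied from the dvc.yaml shown above.
stages = {
    "prepare": {
        "deps": ["data/data.xml", "src/prepare.py"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["data/prepared", "src/featurization.py"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["data/features", "src/train.py"],
        "outs": ["model.pkl"],
    },
}

# Map every output path to the stage that produces it.
producer = {out: name for name, stage in stages.items() for out in stage["outs"]}

# A stage depends on whichever stages produce its deps; deps that no
# stage produces (scripts, raw data) add no edge.
graph = {
    name: {producer[dep] for dep in stage["deps"] if dep in producer}
    for name, stage in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'featurize', 'train']
```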


src/train.py
---

In [14]:
!pygmentize src/train.py

import os
import pickle
import sys

import numpy as np
import yaml
from sklearn.ensemble import RandomForestClassifier

params = yaml.safe_load(open("params.yaml"))["train"]

if len(sys.argv) != 3:
    sys.stderr.write("Arguments error. Usage:\n")
    sys.stderr.write("\tpython train.py features model\n")
    sys.exit(1)

input = sys.argv[1

Data
---

In [15]:
#
# Download the data from the DVC example repository
#
repo = "https://github.com/iterative/dataset-registry"
src = "get-started/data.xml"
dst = "data/data.xml"

!dvc get {repo} {src} -o {dst}

ERROR: unexpected error - [Errno 17] File exists: 'data/data.xml'

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
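
The error above is raised because `data/data.xml` already exists in the working directory. One way to make the cell safe to re-run is to call `dvc get` only when the destination is missing. This is a minimal sketch under that assumption: `dvc_get_if_missing` is a hypothetical helper, and the `run` argument exists only so the function can be exercised without invoking DVC.

```python
import os
import subprocess


def dvc_get_if_missing(repo, src, dst, run=subprocess.run):
    """Call `dvc get` only when `dst` is absent; return True if it ran."""
    if os.path.exists(dst):
        print(f"{dst} already exists, skipping download")
        return False
    # Same command as the notebook cell: dvc get <repo> <src> -o <dst>
    run(["dvc", "get", repo, src, "-o", dst], check=True)
    return True
```

In this notebook it would be called as `dvc_get_if_missing(repo, src, dst)` with the variables defined in the cell above.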

Reproduction
---

In [16]:
pwd

'/workspace/dvcdemo'

In [17]:
!dvc repro

Verifying data sources in stage: 'data/data.xml.dvc'

Running stage 'prepare':
> python3 src/prepare.py data/data.xml
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'featurize':
> python3 src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (16011, 3)
The output matrix data/features/train.pkl size is (16011, 102) and data type is float64
The input data frame data/prepared/test.tsv size is (3989, 3)
The output matrix data/features/test.pkl size is (3989, 102) and data type is float64

In [18]:
!cat dvc.lock

schema: '2.0'
stages:
  prepare:
    cmd: python3 src/prepare.py data/data.xml
    deps:
    - path: data/data.xml
      md5: 079fbd15fa2c32c539c4c4e3675b514a
      size: 28890194
    - path: src/prepare.py
      md5: f09ea0c15980b43010257ccb9f0055e2
      size: 1576
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.2
    outs:
    - path: data/prepared
      md5: 2fe72f304d64c28867d884e798568460.dir
      size: 16874726
      nfiles: 2
  featurize:
    cmd: python3 src/featurization.py data/prepared data/features
    deps:
    - path: data/prepared
      md5: 2fe72f304d64c28867d884e798568460.dir
      size: 16874726
      nfiles: 2
    - path: src/featurization.py
      md5: e0265fc22f056a4b86d85c3056bc2894
      size: 2490
    params:
      params.yaml:
        featurize.max_features: 100
        featurize.ngrams: 1
    outs:
    - path: data/features
      md5: 1e28c4afe2ae56365ae96716fff987a5.dir
      size: 3122242
      nfiles: 2
  train:
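
The `md5` values in `dvc.lock` are what allow a later `dvc repro` to tell whether a dependency has changed since the stage last ran. The sketch below shows the general idea with a plain content hash. It is a simplified illustration, not DVC's code: `file_md5` and `is_changed` are hypothetical helpers, and DVC's own hashing (for example the `.dir` entries used for directories) may differ in its details.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_changed(path, recorded_md5):
    """True when the file content no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "prepare.py")
    with open(path, "w") as handle:
        handle.write("print('v1')\n")
    recorded = file_md5(path)  # what a lock file would store
    unchanged = is_changed(path, recorded)

    with open(path, "w") as handle:
        handle.write("print('v2')\n")  # edit the dependency
    changed = is_changed(path, recorded)

print(unchanged, changed)  # False True
```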
    

Visualization
---

In [19]:
!dvc dag

+-------------------+  
| data/data.xml.dvc |  
+-------------------+  
          *            
          *            
          *            
     +---------+       
     | prepare |       
     +---------+       
          *            
          *            
          *            
    +-----------+      
    | featurize |      
    +-----------+      
          *            
          *            
          *            
      +-------+        
      | train |        
      +-------+