# Pipelines in DVC

This chapter covers automating ML pipelines with DVC. It starts with a configuration file that holds settings and hyperparameters, then defines pipeline stages (dependencies, commands, and outputs) with `dvc stage add` and visualizes them as directed acyclic graphs (DAGs). Next comes pipeline execution, including local model training and how Git tracks DVC metadata. The chapter closes with metrics and plots tracking in DVC: printing metrics, creating plot files, and comparing metrics and plots across different pipeline stages.

In [1]:
# Remove dvc.yaml if it exists
import os

try:
    os.remove("dvc.yaml")
except FileNotFoundError:
    pass

## 3.1 Code organization and refactoring

### YAML Syntax

In [2]:
import json
from pprint import pprint

# Reading the json file
with open('ExampleReadYml/case1/params.json') as f:
    d = json.load(f)
pprint(d)

{'dataset': {'categorical_columns': ['type of meal',
                                     'room type',
                                     'market segment type'],
             'label_column': 'booking status',
             'numerical_columns': ['number of adults',
                                   'number of children',
                                   'number of weekend nights',
                                   'number of week nights',
                                   'car parking space',
                                   'lead time',
                                   'repeated',
                                   'P-C',
                                   'P-not-C',
                                   'average price',
                                   'special requests']},
 'pipeline': {'rfc': {'max_depth': 5, 'n_estimators': 5, 'random_state': 42}}}


In [3]:
import yaml

# Save the loaded JSON data in YAML format
with open('ExampleReadYml/case1/params-example.yaml', 'w', encoding='utf8') as f:
    yaml.dump(d, f, default_flow_style=False, allow_unicode=True)

In [4]:
!dir .\ExampleReadYml\case1 /B

params-example.yaml
params.json


In [5]:
# Loading a yaml file
with open('ExampleReadYml/case1/params-example.yaml') as f:
    p = yaml.safe_load(f)

pprint(p)

{'dataset': {'categorical_columns': ['type of meal',
                                     'room type',
                                     'market segment type'],
             'label_column': 'booking status',
             'numerical_columns': ['number of adults',
                                   'number of children',
                                   'number of weekend nights',
                                   'number of week nights',
                                   'car parking space',
                                   'lead time',
                                   'repeated',
                                   'P-C',
                                   'P-not-C',
                                   'average price',
                                   'special requests']},
 'pipeline': {'rfc': {'max_depth': 5, 'n_estimators': 5, 'random_state': 42}}}


## 3.2 Writing and visualizing DVC pipelines

### Adding preprocessing stage

We can use the `dvc stage add` command to create a stage in the `dvc.yaml` file. The command takes:

- the stage name with `-n`,
- parameters with `-p`,
- dependencies with `-d`,
- outputs with `-o`,
- and the command to run as the final argument.

In [6]:
!dvc stage add --force -n preprocess \
                       -p params.yaml:preprocess \
                       -d raw_data.csv \
                       -d preprocess.py \
                       -o processed_data.csv python preprocess.py

Added stage 'preprocess' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


In [7]:
!more dvc.yaml

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - preprocess.py
    - raw_data.csv
    params:
    - preprocess
    outs:
    - processed_data.csv
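
The same stage definition can be assembled programmatically. The sketch below builds the argument list for `dvc stage add` from Python values; `build_stage_add_args` is a hypothetical helper written for this note, not part of DVC, and it only constructs the arguments without running anything.

```python
import shlex


def build_stage_add_args(name, command, params=(), deps=(), outs=(), force=False):
    """Assemble the argument list for `dvc stage add` (hypothetical helper)."""
    args = ["dvc", "stage", "add"]
    if force:
        args.append("--force")
    args += ["-n", name]
    for param in params:
        args += ["-p", param]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    # The stage command goes last; split it so each token is its own argument
    args += shlex.split(command)
    return args


args = build_stage_add_args(
    name="preprocess",
    command="python preprocess.py",
    params=["params.yaml:preprocess"],
    deps=["raw_data.csv", "preprocess.py"],
    outs=["processed_data.csv"],
    force=True,
)
print(shlex.join(args))
```

Passing `args` to `subprocess.run(args, check=True)` would execute the command, assuming DVC is installed and the working directory is a DVC repository.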


### Visualizing DVC pipelines

In [8]:
# Print DAG on terminal
!dvc dag

+------------+ 
| preprocess | 
+------------+ 


In [9]:
# Display DAG up to a certain step
!dvc dag preprocess

+------------+ 
| preprocess | 
+------------+ 


In [10]:
# Display step outputs as nodes
!dvc dag --outs

+--------------------+ 
| processed_data.csv | 
+--------------------+ 


In [11]:
!dvc dag --dot

strict digraph  {
"preprocess";
}



### Another example - using the ExampleML1

In [12]:
!dvc remove preprocess

In [13]:
!dvc stage add --force -q \
                       -n data-preparation \
                       -d ExampleML1/data-raw/booking.csv \
                       -o ExampleML1/data-processed/train_A.csv \
                       -o ExampleML1/data-processed/train_B.csv \
                       -o ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv \
                                                          ExampleML1/data-processed/train_A.csv \
                                                          ExampleML1/data-processed/train_B.csv \
                                                          ExampleML1/data-processed/test.csv
!dvc stage add --force -q \
                       -n model-trainingA \
                       -d ExampleML1/config/params.json \
                       -d ExampleML1/data-processed/train_A.csv \
                       -d ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.model_training ExampleML1/config/params.json \
                                                           ExampleML1/data-processed/train_A.csv \
                                                           ExampleML1/data-processed/test.csv
!dvc stage add --force -q \
                       -n model-trainingB \
                       -d ExampleML1/config/params.json \
                       -d ExampleML1/data-processed/train_B.csv \
                       -d ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.model_training ExampleML1/config/params.json \
                                                           ExampleML1/data-processed/train_B.csv \
                                                           ExampleML1/data-processed/test.csv

In [14]:
!more dvc.yaml

stages:
  data-preparation:
    cmd: python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/data-raw/booking.csv
    outs:
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-processed/train_A.csv
    - ExampleML1/data-processed/train_B.csv
  model-trainingA:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/config/params.json
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-processed/train_A.csv
  model-trainingB:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_B.csv
      ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/config/params.json
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-proces

In [15]:
# Print DAG on terminal
!dvc dag

                  +------------------+                     
                  | data-preparation |                     
                  +------------------+                     
                  ***                ***                   
               ***                      ***                
             **                            **              
+-----------------+                   +-----------------+  
| model-trainingA |                   | model-trainingB |  
+-----------------+                   +-----------------+  


In [16]:
# Display step outputs as nodes
!dvc dag --outs

+------------------------------------+ 
| ExampleML1\data-processed\test.csv | 
+------------------------------------+ 
+---------------------------------------+  
| ExampleML1\data-processed\train_A.csv |  
+---------------------------------------+  
+---------------------------------------+  
| ExampleML1\data-processed\train_B.csv |  
+---------------------------------------+  


In [17]:
!dvc dag --dot

strict digraph  {
"data-preparation";
"model-trainingA";
"model-trainingB";
"data-preparation" -> "model-trainingA";
"data-preparation" -> "model-trainingB";
}



### Ex.1 - Designing a DVC pipeline

Designing a DVC pipeline, or DAG, is fundamental to leveraging DVC in your machine learning workflows. DAGs allow us to codify inputs, outputs, and execution of a certain step. The outputs of one step can serve as input to one or more steps, thereby naturally setting the right dependencies between steps.

In this exercise, you'll work on designing an ML workflow that contains four stages, namely,

- Data preprocessing (preprocess_stage)
- Data splitting (split_stage)
- Model training (train_stage)
- Model evaluation (evaluate_stage)

We will work exclusively with `dvc stage add` commands. Scroll down to the end of the shell script file (`dvc_dag_stages_add.sh`) if needed.

**Instructions:**

1. Add `processed_data.csv` as output from `preprocess_stage`.
2. Add parameters from the `split` section of the default parameter file to the `split_stage`.
3. Add `model.pkl` as one of the dependencies in the `evaluate_stage`.
4. Run the script with `bash dvc_dag_stages_add.sh` in the terminal. Notice how `dvc.yaml` gets populated.

```
# Preprocess stage - Output is processed_data.csv
dvc stage add --force -n preprocess_stage \
                      -p preprocess \
                      -d raw_data.csv \
                      -d preprocess.py \
                      -o processed_data.csv \
                      python3 preprocess.py

# Split stage - This stage uses parameters from `split` section of params.yaml
dvc stage add --force -n split_stage \
                      -p split \
                      -d processed_data.csv \
                      -d split.py \
                      -o train_data.csv \
                      -o eval_data.csv \
                      python3 split.py

# Train stage - This stage generates model.pkl as output
dvc stage add --force -n train_stage \
                      -p train \
                      -d train_data.csv \
                      -d train.py \
                      -o model.pkl \
                      python3 train.py

# Evaluate stage - This stage uses model.pkl as one of the inputs
dvc stage add --force -n evaluate_stage \
                      -p evaluate \
                      -d eval_data.csv \
                      -d model.pkl \
                      -d evaluate.py \
                      -o metrics.json \
                      python3 evaluate.py
```
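
To see how these four stages link into a DAG, the following sketch matches each stage's dependencies against the outputs of the other stages and derives one valid execution order. It is a simplified illustration of the idea, not DVC's implementation; the script dependencies (`preprocess.py` and so on) are left out because no stage produces them, and `build_graph` is a name chosen for this example.

```python
from graphlib import TopologicalSorter

# deps/outs copied from the `dvc stage add` commands above (scripts omitted)
stages = {
    "preprocess_stage": {"deps": ["raw_data.csv"], "outs": ["processed_data.csv"]},
    "split_stage": {"deps": ["processed_data.csv"], "outs": ["train_data.csv", "eval_data.csv"]},
    "train_stage": {"deps": ["train_data.csv"], "outs": ["model.pkl"]},
    "evaluate_stage": {"deps": ["eval_data.csv", "model.pkl"], "outs": ["metrics.json"]},
}


def build_graph(stages):
    """Map each stage to the set of stages whose outputs it consumes."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = build_graph(stages)
# static_order() yields every stage after all of its upstream stages
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because `evaluate_stage` needs both `eval_data.csv` and `model.pkl`, it can only run after `split_stage` and `train_stage` have finished.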

### Ex.2 - Visualizing a DVC pipeline

In this exercise, you will use the `dvc dag` command with different flags to inspect your project's pipeline. Knowing how each flag changes the output of `dvc dag` helps you manage and understand the pipeline.

The goal is not just to execute the commands but to understand how the different flags alter the output.

**Instructions:**

1. Run the `dvc dag` command without any flags and observe the output.
2. Run the `dvc dag` command with the `--outs` flag and compare the output with the previous step.
3. Run the `dvc dag` command with a `train` stage as a target and observe how the output changes.

In [18]:
# Write the four-stage dvc.yaml to visualize
with open("dvc.yaml", "w") as f:
    f.write(
        'stages:\n'
        '  preprocess:\n'
        '    cmd: python3 preprocess.py\n'
        '    deps:\n'
        '    - preprocess.py\n'
        '    - raw_data.csv\n'
        '    params:\n'
        '    - preprocess\n'
        '    outs:\n'
        '    - processed_data.csv\n'
        '  split:\n'
        '    cmd: python3 split.py\n'
        '    deps:\n'
        '    - processed_data.csv\n'
        '    - split.py\n'
        '    params:\n'
        '    - split\n'
        '    outs:\n'
        '    - eval_data.csv\n'
        '    - train_data.csv\n'
        '  train:\n'
        '    cmd: python3 train.py\n'
        '    deps:\n'
        '    - train.py\n'
        '    - train_data.csv\n'
        '    params:\n'
        '    - train\n'
        '    outs:\n'
        '    - model.pkl\n'
        '  evaluate:\n'
        '    cmd: python3 evaluate.py\n'
        '    deps:\n'
        '    - eval_data.csv\n'
        '    - evaluate.py\n'
        '    - model.pkl\n'
        '    params:\n'
        '    - evaluate\n'
        '    outs:\n'
        '    - metrics.json\n'
    )

In [19]:
!dvc dag

       +------------+     
       | preprocess |     
       +------------+     
              *           
              *           
              *           
          +-------+       
          | split |       
          +-------+       
         **        **     
       **            *    
      *               **  
+-------+               * 
| train |             **  
+-------+            *    
         **        **     
           **    **       
             *  *         
        +----------+      
        | evaluate |      
        +----------+      


In [20]:
!dvc dag --outs

             +--------------------+               
             | processed_data.csv |               
             +--------------------+               
                ***            ***                
              **                  ***             
            **                       **           
+----------------+                     **         
| train_data.csv |                      *         
+----------------+                      *         
         *                              *         
         *                              *         
         *                              *         
  +-----------+                +---------------+  
  | model.pkl |                | eval_data.csv |  
  +-----------+*               +---------------+  
                ***            ***                
                   **        **                   
                     **    **                     
                +--------------+                  
                | metrics.json 

In [21]:
!dvc dag train

+------------+ 
| preprocess | 
+------------+ 
       *       
       *       
       *       
  +-------+    
  | split |    
  +-------+    
       *       
       *       
       *       
  +-------+    
  | train |    
  +-------+    
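
The output of `dvc dag train` above contains only the target stage and the stages it depends on. A minimal sketch of that selection, assuming the stage graph is read off the `dvc.yaml` written in this exercise (`ancestors_of` is a hypothetical helper, not a DVC function):

```python
# Stage -> direct upstream stages, read off the dvc.yaml written in this exercise
graph = {
    "preprocess": set(),
    "split": {"preprocess"},
    "train": {"split"},
    "evaluate": {"split", "train"},
}


def ancestors_of(target, graph):
    """Return the target plus every stage it depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(graph[stage])
    return seen


print(sorted(ancestors_of("train", graph)))
```

The `evaluate` stage is downstream of `train`, so it is not part of the result.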


## 3.3 Executing DVC pipelines

In [22]:
# Reset the pipeline by removing dvc.yaml
import os

os.remove("dvc.yaml")

In [23]:
!dvc stage add --force -q \
                       -n data-preparation \
                       -d ExampleML1/data-raw/booking.csv \
                       -o ExampleML1/data-processed/train_A.csv \
                       -o ExampleML1/data-processed/train_B.csv \
                       -o ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv \
                                                          ExampleML1/data-processed/train_A.csv \
                                                          ExampleML1/data-processed/train_B.csv \
                                                          ExampleML1/data-processed/test.csv
!dvc stage add --force -q \
                       -n model-trainingA \
                       -d ExampleML1/config/params.json \
                       -d ExampleML1/data-processed/train_A.csv \
                       -d ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.model_training ExampleML1/config/params.json \
                                                           ExampleML1/data-processed/train_A.csv \
                                                           ExampleML1/data-processed/test.csv
!dvc stage add --force -q \
                       -n model-trainingB \
                       -d ExampleML1/config/params.json \
                       -d ExampleML1/data-processed/train_B.csv \
                       -d ExampleML1/data-processed/test.csv \
                       python -m ExampleML1.model_training ExampleML1/config/params.json \
                                                           ExampleML1/data-processed/train_B.csv \
                                                           ExampleML1/data-processed/test.csv

In [24]:
!more dvc.yaml

stages:
  data-preparation:
    cmd: python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/data-raw/booking.csv
    outs:
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-processed/train_A.csv
    - ExampleML1/data-processed/train_B.csv
  model-trainingA:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/config/params.json
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-processed/train_A.csv
  model-trainingB:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_B.csv
      ExampleML1/data-processed/test.csv
    deps:
    - ExampleML1/config/params.json
    - ExampleML1/data-processed/test.csv
    - ExampleML1/data-proces

### Dry running a pipeline

- `!dvc repro` runs the pipeline.
- `!dvc repro --dry` only shows the commands that would be executed.

In [25]:
!dvc repro --dry

Running stage 'data-preparation':
> python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv

Running stage 'model-trainingA':
> python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv ExampleML1/data-processed/test.csv

Running stage 'model-trainingB':
> python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
Use `dvc push` to send your updates to remote storage.


### Reproducing a pipeline

- `!dvc repro` runs the pipeline.
- `!dvc repro --dry` only shows the commands that would be executed.

In [26]:
!dvc repro

Running stage 'data-preparation':
> python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
ExampleML1/data-processed/test.csv file created...
ExampleML1/data-processed/train_A.csv file created...
ExampleML1/data-processed/train_B.csv file created...
Completed!
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'model-trainingA':
> python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv ExampleML1/data-processed/test.csv
{'Precision': 0.8914, 'Recall': 0.4613, 'F1 Score': 0.608, 'Accuracy': 0.8031}
Updating lock file 'dvc.lock'

Running stage 'model-trainingB':
> python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
{'Precision': 0.8421, 'Recall': 0.4908, 'F1 Score': 0.6202, 'Accuracy': 0.801}
Updating lock fi

### Reviewing the ExampleML2 case

In [55]:
# Creating the params yaml file
with open("ExampleML2/config/params.yaml", "w") as f:
    f.write(
        'preprocess:\n'
        '  drop_colnames:\n'
        '    - Date\n'
        '  target_column: RainTomorrow\n'
        '  categorical_features:\n'
        '    - Location\n'
        '    - WindGustDir\n'
        '    - WindDir9am\n'
        '    - WindDir3pm\n'
        '    - RainToday\n'
        'train_and_evaluate:\n'
        '  target_column: RainTomorrow\n'
        '  train_test_split:\n'
        '    test_size: 0.2\n'
        '    random_state: 1993\n'
        '  shuffle: true\n'
        '  shuffle_random_state: 1993\n'
        '  rfc_params:\n'
        '    n_estimators: 2\n'
        '    max_depth: 2\n'
        '    random_state: 42'
    )

In [56]:
# Creating the dvc yaml
with open("dvc.yaml", "w") as f:
    f.write(
        'stages:\n'
        '  preprocess:\n'
        '    # Run the data preprocessing script\n'
        '    cmd: python -m ExampleML2.preprocess_dataset ExampleML2/config/params.yaml'
                                                        ' ExampleML2/data-raw/weather.csv'
                                                        ' ExampleML2/data-processed/weather.csv\n'
        '    deps:\n'
        '    - ExampleML2/preprocess_dataset.py\n'
        '    - ExampleML2/data-raw/weather.csv\n'
        '    - ExampleML2/utils_and_constants.py\n'
        '    params:\n'
        '      - ExampleML2\\config\\params.yaml:\n'
        '        - preprocess\n'
        '    outs:\n'
        '    - ExampleML2/data-processed/weather.csv\n'
        '  train_and_evaluate:\n'
        '    # Run the model training and evaluation script\n'
        '    cmd: python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml'
                                                        ' ExampleML2/data-processed/weather.csv\n'
        '    deps:\n'
        '    - ExampleML2/metrics_and_plots.py\n'
        '    - ExampleML2/model.py\n'
        '    # Specify the preprocessed dataset as a dependency\n'
        '    - ExampleML2/data-processed/weather.csv\n'
        '    - ExampleML2/train_and_evaluate.py\n'
        '    - ExampleML2/utils_and_constants.py\n'
        '    params:\n'
        '      - ExampleML2\\config\\params.yaml:\n'
        '        - train_and_evaluate\n'
        '    outs:\n'
        '    - ExampleML2/metrics/metrics.json\n'
        '    - ExampleML2/images/confusion_matrix.png\n'
    )

In [57]:
!python -m ExampleML2.preprocess_dataset ExampleML2/config/params.yaml \
                                         ExampleML2/data-raw/weather.csv \
                                         ExampleML2/data-processed/weather.csv

Reading raw data and processing it...
Target encoding categorical columns...
Imputing and scaling features...
Writing processed dataset to ExampleML2/data-processed/weather.csv...
Done!


In [58]:
!python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml \
                                         ExampleML2/data-processed/weather.csv

Loading and splitting the dataset...
Dataset shape: (25000, 22)
Train set shape: (20000, 22)
Test set shape: (5000, 22)
Training and evaluating the model...
{
  "accuracy": 0.9914,
  "precision": 0.9771,
  "recall": 0.9849,
  "f1_score": 0.981
}


In [59]:
!more dvc.yaml

stages:
  preprocess:
    # Run the data preprocessing script
    cmd: python -m ExampleML2.preprocess_dataset ExampleML2/config/params.yaml ExampleML2/data-raw/weather.csv ExampleML2/data-processed/weather.csv
    deps:
    - ExampleML2/preprocess_dataset.py
    - ExampleML2/data-raw/weather.csv
    - ExampleML2/utils_and_constants.py
    params:
      - ExampleML2\config\params.yaml:
        - preprocess
    outs:
    - ExampleML2/data-processed/weather.csv
  train_and_evaluate:
    # Run the model training and evaluation script
    cmd: python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml ExampleML2/data-processed/weather.csv
    deps:
    - ExampleML2/metrics_and_plots.py
    - ExampleML2/model.py
    # Specify the preprocessed dataset as a dependency
    - ExampleML2/data-processed/weather.csv
    - ExampleML2/train_and_evaluate.py
    - ExampleML2/utils_and_constants.py
    params:
      - ExampleML2\config\params.yaml:
        - train_and_evaluate
    outs:
    

In [60]:
!more ExampleML2\config\params.yaml

preprocess:
  drop_colnames:
    - Date
  target_column: RainTomorrow
  categorical_features:
    - Location
    - WindGustDir
    - WindDir9am
    - WindDir3pm
    - RainToday
train_and_evaluate:
  target_column: RainTomorrow
  train_test_split:
    test_size: 0.2
    random_state: 1993
  shuffle: true
  shuffle_random_state: 1993
  rfc_params:
    n_estimators: 2
    max_depth: 2
    random_state: 42


### Ex.3 - Execute a ML model training pipeline

DVC pipelines are used to ensure reproducibility in your project.

In this exercise, you will build on what you learned about defining a pipeline in `dvc.yaml` and execute the steps to train a machine learning model in a structured way. Your task is to run different variants of the `dvc repro` command and understand how they differ.

**Instructions:**

1. Execute a dry run of the pipeline. Understand the steps and execution order.
2. Execute only the preprocessing stage, defined under the `preprocess` block in `dvc.yaml`. Observe changes to the `dvc.lock` file.
3. Execute only the training/evaluation stage, defined under the `train_and_evaluate` block in `dvc.yaml`. Observe changes to the `dvc.lock` file.
4. Execute the entire DVC pipeline. Notice how the caching in DVC skips the actual execution of the steps.
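
The caching behaviour in step 4 can be pictured as a hash comparison: a stage is skipped when its inputs still match what was recorded in `dvc.lock`. The sketch below is a deliberately simplified model that checks dependency files only; DVC itself also takes the stage command, parameters, and outputs into account. The names `md5_of` and `is_cached`, and the plain dictionary standing in for the lock file, are assumptions made for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def is_cached(deps, lock):
    """True when every dependency's current hash matches the recorded one."""
    return all(lock.get(str(dep)) == md5_of(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "weather.csv"  # hypothetical stand-in for a stage dependency
    dep.write_text("a,b\n1,2\n")
    lock = {str(dep): md5_of(dep)}  # simplified stand-in for dvc.lock

    unchanged = is_cached([dep], lock)  # nothing changed since the hash was recorded
    dep.write_text("a,b\n1,3\n")  # edit the dependency
    changed = is_cached([dep], lock)  # hash no longer matches

print(unchanged, changed)
```

In this model a stage is rerun only when `is_cached` returns `False` for its dependencies.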

In [61]:
# 1. Execute a dry run of the pipeline. Understand the steps and execution order.
!dvc repro --dry

Running stage 'preprocess':
> python -m ExampleML2.preprocess_dataset ExampleML2/config/params.yaml ExampleML2/data-raw/weather.csv ExampleML2/data-processed/weather.csv

Running stage 'train_and_evaluate':
> python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml ExampleML2/data-processed/weather.csv
Use `dvc push` to send your updates to remote storage.


In [62]:
# 2. Execute only the preprocessing stage of the pipeline that is specified under preprocess block in dvc.yaml.
!dvc repro preprocess

Stage 'preprocess' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.


In [63]:
# 2. Observe changes to the dvc.lock file.
!more dvc.lock

schema: '2.0'
stages:
  data-preparation:
    cmd: python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
    deps:
    - path: ExampleML1/data-raw/booking.csv
      hash: md5
      md5: 8e30b9da0032c81edebc9f7492dcea14
      size: 3241399
    outs:
    - path: ExampleML1/data-processed/test.csv
      hash: md5
      md5: 31e6b80f48ac5efcd9a32d71aef19294
      size: 584943
    - path: ExampleML1/data-processed/train_A.csv
      hash: md5
      md5: 18e286dd3cd267afc6bbf27539db3887
      size: 1169596
    - path: ExampleML1/data-processed/train_B.csv
      hash: md5
      md5: c39a9e7969c1003f0d8d394953c9880c
      size: 1169787
  model-trainingA:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/test.csv
    deps:
    - path: ExampleML1/config/params.json
      hash

In [64]:
# 3. Execute only the training/evaluation stage of the pipeline that is specified 
#    under train_and_evaluate block in dvc.yaml.
!dvc repro train_and_evaluate

Stage 'preprocess' is cached - skipping run, checking out outputs

Stage 'train_and_evaluate' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


In [65]:
# !git rm -r --cached ExampleML2/metrics/metrics.json

In [39]:
# 3. Observe changes to the dvc.lock file.
!more dvc.lock

schema: '2.0'
stages:
  data-preparation:
    cmd: python -m ExampleML1.split_dataset ExampleML1/data-raw/booking.csv ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/train_B.csv ExampleML1/data-processed/test.csv
    deps:
    - path: ExampleML1/data-raw/booking.csv
      hash: md5
      md5: 8e30b9da0032c81edebc9f7492dcea14
      size: 3241399
    outs:
    - path: ExampleML1/data-processed/test.csv
      hash: md5
      md5: 31e6b80f48ac5efcd9a32d71aef19294
      size: 584943
    - path: ExampleML1/data-processed/train_A.csv
      hash: md5
      md5: 18e286dd3cd267afc6bbf27539db3887
      size: 1169596
    - path: ExampleML1/data-processed/train_B.csv
      hash: md5
      md5: c39a9e7969c1003f0d8d394953c9880c
      size: 1169787
  model-trainingA:
    cmd: python -m ExampleML1.model_training ExampleML1/config/params.json ExampleML1/data-processed/train_A.csv
      ExampleML1/data-processed/test.csv
    deps:
    - path: ExampleML1/config/params.json
      hash

In [40]:
# 4. Execute the entire DVC pipeline. Notice how the caching in DVC skips the actual execution of the steps.
!dvc repro

Stage 'preprocess' is cached - skipping run, checking out outputs

Stage 'train_and_evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.


## 3.4 Evaluation: Metrics and plots in DVC

### Metrics: changes in dvc.yaml

In [41]:
# Prepare the yaml file
with open("dvc.yaml", "w") as f:
    f.write(
        'stages:\n'
        '  preprocess:\n'
        '    # Run the data preprocessing script\n'
        '    cmd: python -m ExampleML2.preprocess_dataset ExampleML2/config/params.yaml'
                                                        ' ExampleML2/data-raw/weather.csv'
                                                        ' ExampleML2/data-processed/weather.csv\n'
        '    deps:\n'
        '    - ExampleML2/preprocess_dataset.py\n'
        '    - ExampleML2/data-raw/weather.csv\n'
        '    - ExampleML2/utils_and_constants.py\n'
        '    params:\n'
        '      - ExampleML2\\config\\params.yaml:\n'
        '        - preprocess\n'
        '    outs:\n'
        '    - ExampleML2/data-processed/weather.csv\n'
        '  train_and_evaluate:\n'
        '    # Run the model training and evaluation script\n'
        '    cmd: python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml'
                                                        ' ExampleML2/data-processed/weather.csv\n'
        '    deps:\n'
        '    - ExampleML2/metrics_and_plots.py\n'
        '    - ExampleML2/model.py\n'
        '    # Specify the preprocessed dataset as a dependency\n'
        '    - ExampleML2/data-processed/weather.csv\n'
        '    - ExampleML2/train_and_evaluate.py\n'
        '    - ExampleML2/utils_and_constants.py\n'
        '    params:\n'
        '      - ExampleML2\\config\\params.yaml:\n'
        '        - train_and_evaluate\n'
        '    outs:\n'
        '    - ExampleML2/images/confusion_matrix.png\n'
        '    metrics:\n'
        '    - ExampleML2/metrics/metrics.json:\n'
        '        cache: false'
    )
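
For the `metrics:` entry to be useful, the training script has to write its scores to the declared file. With `cache: false`, DVC does not store the file in its cache, so the small JSON file can be committed to Git instead. Below is a minimal sketch of such a writer, assuming a flat JSON object of metric names and values; `save_metrics` is a hypothetical helper and is not taken from the ExampleML2 code, which is not shown in this chapter.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, path):
    """Write a flat dict of metric names and numeric values as JSON (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    # Temporary stand-in for ExampleML2/metrics/metrics.json
    target = Path(tmp) / "metrics" / "metrics.json"
    # Values taken from the training output shown earlier in this chapter
    save_metrics(
        {"accuracy": 0.9914, "precision": 0.9771, "recall": 0.9849, "f1_score": 0.981},
        target,
    )
    loaded = json.loads(target.read_text())

print(loaded["accuracy"])
```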

In [42]:
# Prepare the params yaml file
with open("ExampleML2/config/params.yaml", "w") as f:
    f.write(
        'preprocess:\n'
        '  drop_colnames:\n'
        '    - Date\n'
        '  target_column: RainTomorrow\n'
        '  categorical_features:\n'
        '    - Location\n'
        '    - WindGustDir\n'
        '    - WindDir9am\n'
        '    - WindDir3pm\n'
        '    - RainToday\n'
        'train_and_evaluate:\n'
        '  target_column: RainTomorrow\n'
        '  train_test_split:\n'
        '    test_size: 0.2\n'
        '    random_state: 1993\n'
        '  shuffle: true\n'
        '  shuffle_random_state: 1993\n'
        '  rfc_params:\n'
        '    n_estimators: 2\n'
        '    max_depth: 2\n'
        '    random_state: 42'
    )

In [43]:
!dvc repro

Stage 'preprocess' is cached - skipping run, checking out outputs

Stage 'train_and_evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.


### Printing DVC metrics

In [44]:
!dvc metrics show

Path                             accuracy    f1_score    precision    recall
ExampleML2\metrics\metrics.json  0.9914      0.981       0.9771       0.9849
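
`dvc metrics show` prints the contents of the JSON file listed under `metrics:` in dvc.yaml. As a rough sketch of the same lookup in plain Python, the snippet below writes a stand-in metrics file to a temporary directory and prints it as a one-row table. The values are copied from the output above; the temporary path and the `format_metrics` helper are illustrative assumptions, not part of the project or of DVC.

```python
import json
import tempfile
from pathlib import Path


def format_metrics(path):
    """Return a header line and a value line for a flat JSON metrics file."""
    metrics = json.loads(Path(path).read_text())
    names = sorted(metrics)
    header = "  ".join(f"{name:<10}" for name in names)
    values = "  ".join(f"{metrics[name]:<10}" for name in names)
    return header.rstrip(), values.rstrip()


# Stand-in for ExampleML2/metrics/metrics.json, using the values printed above.
sample = {"accuracy": 0.9914, "precision": 0.9771, "recall": 0.9849, "f1_score": 0.981}

with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "metrics.json"  # hypothetical location
    metrics_path.write_text(json.dumps(sample, indent=2))
    header, values = format_metrics(metrics_path)

print(header)
print(values)
```

Reading the file directly like this can be handy inside a training script or a test, where shelling out to the DVC command line is inconvenient.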


In [None]:
!git add .
!git commit -m "Experiment Run 1"

### Ex.4 - Tracking DVC Metrics

DVC pipelines make the steps of your project reproducible.

In this exercise, you will build on the pipeline defined in the dvc.yaml file and train a machine learning model step by step. You will run several forms of the dvc metrics command to see how they differ. The pipeline has already been run once, and the metrics file has been committed to Git.

**Instruction:**

1. Print the current metrics by running the appropriate dvc metrics subcommand.
2. Change n_estimators to 3 on line 20 of the opened `params.yaml` file.
3. Execute the DVC pipeline.
4. Compare the changed metrics with the previous run using the appropriate dvc metrics subcommand.

In [45]:
# 1. Print the current metrics by running the appropriate dvc metrics subcommand.
!dvc metrics show

Path                             accuracy    f1_score    precision    recall
ExampleML2\metrics\metrics.json  0.9914      0.981       0.9771       0.9849


In [46]:
# 2. Inspect the current params.yaml before changing n_estimators to 3.
!more ExampleML2\config\params.yaml

preprocess:
  drop_colnames:
    - Date
  target_column: RainTomorrow
  categorical_features:
    - Location
    - WindGustDir
    - WindDir9am
    - WindDir3pm
    - RainToday
train_and_evaluate:
  target_column: RainTomorrow
  train_test_split:
    test_size: 0.2
    random_state: 1993
  shuffle: true
  shuffle_random_state: 1993
  rfc_params:
    n_estimators: 2
    max_depth: 2
    random_state: 42
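
Step 2 changes a single value in this file. One way to do that without retyping the whole file is a targeted text substitution, which also leaves comments and formatting untouched. The snippet below is a minimal sketch under stated assumptions: `set_n_estimators` is a hypothetical helper, and it operates on a trimmed copy of the parameters written to a temporary directory rather than on the real `ExampleML2/config/params.yaml`.

```python
import re
import tempfile
from pathlib import Path


def set_n_estimators(params_path, new_value):
    """Replace the integer on the single `n_estimators:` line of a params file."""
    path = Path(params_path)
    text = path.read_text()
    # \g<1> re-inserts the indentation and key; only the integer is replaced.
    updated, count = re.subn(
        r"^([ \t]*n_estimators:[ \t]*)\d+",
        rf"\g<1>{int(new_value)}",
        text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError(f"expected exactly one n_estimators line, found {count}")
    path.write_text(updated)
    return updated


# Trimmed stand-in for the params file shown above.
sample = (
    "train_and_evaluate:\n"
    "  rfc_params:\n"
    "    n_estimators: 2\n"
    "    max_depth: 2\n"
    "    random_state: 42\n"
)

with tempfile.TemporaryDirectory() as tmp:
    params_path = Path(tmp) / "params.yaml"  # hypothetical location
    params_path.write_text(sample)
    result = set_n_estimators(params_path, 3)

print(result)
```

The helper raises an error if the key is missing or appears more than once, so a silent no-op cannot be mistaken for a successful parameter change.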


In [47]:
# Changing the n_estimators value in params.yaml
with open("ExampleML2/config/params.yaml", "w") as f:
    f.write("""
preprocess:
  drop_colnames:
    - Date
  target_column: RainTomorrow
  categorical_features:
    - Location
    - WindGustDir
    - WindDir9am
    - WindDir3pm
    - RainToday
train_and_evaluate:
  target_column: RainTomorrow
  train_test_split:
    test_size: 0.2
    random_state: 1993
  shuffle: true
  shuffle_random_state: 1993
  rfc_params:
    # Change number of estimators to 3
    n_estimators: 3
    max_depth: 2
    random_state: 42
""")

In [48]:
# 3. Execute the DVC pipeline.
!dvc repro

Stage 'preprocess' is cached - skipping run, checking out outputs

Running stage 'train_and_evaluate':
> python -m ExampleML2.train_and_evaluate ExampleML2/config/params.yaml ExampleML2/data-processed/weather.csv
Loading and splitting the dataset...
Dataset shape: (25000, 22)
Train set shape: (20000, 22)
Test set shape: (5000, 22)
Training and evaluating the model...
{
  "accuracy": 0.9948,
  "precision": 0.9774,
  "recall": 1.0,
  "f1_score": 0.9886
}
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


In [51]:
# Commit the updated pipeline state to Git.
!git add .
!git commit -m "Tracking metrics"

On branch master
nothing to commit, working tree clean


In [54]:
# 4. Compare the changed metrics with the previous run using the appropriate dvc metrics subcommand.
!dvc metrics diff

Path                             Metric     HEAD    workspace    Change
ExampleML2\metrics\metrics.json  accuracy   -       0.9948       -
ExampleML2\metrics\metrics.json  f1_score   -       0.9886       -
ExampleML2\metrics\metrics.json  precision  -       0.9774       -
ExampleML2\metrics\metrics.json  recall     -       1.0          -


DVC failed to load some metrics for following revisions: 'HEAD'.
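
To make the comparison concrete, the two sets of values printed earlier in this section (the first `dvc metrics show` output and the metrics printed by the rerun with `n_estimators: 3`) can be compared by hand. The snippet below is only a sketch of that arithmetic; `metrics_diff` is a hypothetical helper and does not reproduce how DVC computes or loads revisions.

```python
def metrics_diff(old, new, ndigits=4):
    """Return (metric, old, new, change) rows for two flat metric dictionaries."""
    rows = []
    for name in sorted(set(old) | set(new)):
        before = old.get(name)
        after = new.get(name)
        # A change is only defined when the metric exists on both sides.
        if before is None or after is None:
            change = None
        else:
            change = round(after - before, ndigits)
        rows.append((name, before, after, change))
    return rows


# Values from the first run (n_estimators: 2), as printed by `dvc metrics show`.
previous = {"accuracy": 0.9914, "f1_score": 0.981, "precision": 0.9771, "recall": 0.9849}
# Values from the rerun (n_estimators: 3), as printed by the training stage.
current = {"accuracy": 0.9948, "f1_score": 0.9886, "precision": 0.9774, "recall": 1.0}

rows = metrics_diff(previous, current)
for name, before, after, change in rows:
    print(f"{name:<10} {before:<8} {after:<8} {change}")
```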


### Ex.5 - Adding plots to dvc.yaml

In this exercise, you will complete the dvc.yaml file that describes a model training pipeline.

The files preprocess_dataset.py and train_and_evaluate.py handle data preprocessing and model training/evaluation respectively, using weather.csv from the raw_dataset folder as input. The training code writes two outputs: predictions.csv, which holds the predicted and actual values for the test dataset, and metrics.json, which holds structured metrics data. The predictions.csv file is used to create a confusion matrix plot.
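
A confusion matrix is a tally of (actual, predicted) label pairs, which is why a file of predictions alongside true values is enough to draw one. The sketch below shows that tally on a small made-up sample. The column names `true_label` and `predicted_label` and the sample rows are assumptions for illustration; the real predictions.csv may use different headers.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for predictions.csv; headers and rows are made up.
sample_csv = """true_label,predicted_label
0,0
0,1
1,1
1,1
0,0
"""


def confusion_counts(csv_text, actual_col="true_label", predicted_col="predicted_label"):
    """Count how often each (actual, predicted) pair occurs in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[actual_col], row[predicted_col]) for row in reader)


counts = confusion_counts(sample_csv)
for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual} predicted={predicted} count={n}")
```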

**Instruction:**

1. Set the plot target to the output file containing the predictions data.
2. Set the plot template to confusion to plot the confusion matrix.
3. Set the correct value for the cache key so that plots are tracked in the Git repository instead of the DVC remote.

---------------