# 4. Comparing training runs and hyperparameter (HP) tuning

This chapter focuses on analyzing model performance and tuning hyperparameters. You will compare metrics and plots across branches to assess changes in model performance, tune hyperparameters with scikit-learn's GridSearchCV, and automate pull requests that carry the best model configuration.

## 4.1 Comparing metrics and plots in DVC

### Running the process manually

In [1]:
!python .\ml-example3\preprocess_data.py
!python .\ml-example3\train.py

{
  "accuracy": 0.947,
  "precision": 0.988,
  "recall": 0.7702,
  "f1_score": 0.8656
}
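As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported `f1_score` can be recomputed from the other two values. A minimal sketch, using the numbers printed above (the function name `f1_from` is hypothetical and not part of `train.py`):

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics printed above.
metrics = {"accuracy": 0.947, "precision": 0.988, "recall": 0.7702, "f1_score": 0.8656}

recomputed = f1_from(metrics["precision"], metrics["recall"])
print(round(recomputed, 4))  # agrees with the reported f1_score of 0.8656
```

Because the printed precision and recall are already rounded, a recomputed value can in general differ from the reported one in the last digit.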


### Preparing the environment

In [2]:
# Remove the files from Git versioning
# so that they can be tracked by DVC instead.
!git add ml-example3\processed-data\weather.csv
!git add ml-example3\evaluation-result\confusion_matrix.png
!git commit -m "tracking ml-ex3 .. weather.csv and confusion_matrix.png"
!git rm -r --cached ml-example3\processed-data\weather.csv
!git rm -r --cached ml-example3\evaluation-result\confusion_matrix.png
!git commit -m "stop tracking ml-ex3 .. weather.csv and confusion_matrix.png"

The following paths are ignored by one of your .gitignore files:
ml-example3/processed-data/weather.csv
hint: Use -f if you really want to add them.
hint: Disable this message with "git config advice.addIgnoredFile false"
The following paths are ignored by one of your .gitignore files:
ml-example3/evaluation-result/confusion_matrix.png
hint: Use -f if you really want to add them.
hint: Disable this message with "git config advice.addIgnoredFile false"


On branch myfeature
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track4-MLTrainning.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


fatal: pathspec 'ml-example3\processed-data\weather.csv' did not match any files
fatal: pathspec 'ml-example3\evaluation-result\confusion_matrix.png' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track4-MLTrainning.ipynb

no changes added to commit (use "git add" and/or "git commit -a")

The output shows that both files were already ignored by `.gitignore` and were never tracked by Git, so these commands changed nothing and the files remain available for DVC to track.


In [3]:
# Remove any existing `dvc.yaml` file
!del dvc.yaml
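
`del` is specific to the Windows shell. If the notebook also needs to run on other platforms, one portable alternative is to delete the file from Python. A minimal sketch (the helper name `remove_if_exists` is hypothetical; `missing_ok` requires Python 3.8 or later):

```python
from pathlib import Path


def remove_if_exists(path: str) -> bool:
    """Delete the file at `path` if it exists; return True if one was removed."""
    target = Path(path)
    existed = target.is_file()
    target.unlink(missing_ok=True)  # no error when the file is absent
    return existed


remove_if_exists("dvc.yaml")
```

Unlike `!del`, this call is silent when `dvc.yaml` is absent, so the cell can be re-run safely.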

### Working with DVC pipeline for `ml-example3`

In [4]:
# Prepare the YAML file (a plain string; no f-string placeholders are needed)
yml_str = """
stages:
  preprocess:
    cmd: python ml-example3/preprocess_data.py
    deps:
    - ml-example3/raw-data/weather.csv
    - ml-example3/preprocess_data.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/processed-data/weather.csv
  train:
    cmd: python ml-example3/train.py
    deps:
    - ml-example3/metrics_and_plots.py
    - ml-example3/model.py
    - ml-example3/processed-data/weather.csv
    - ml-example3/train.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/evaluation-result/confusion_matrix.png
    metrics:
      - ml-example3/evaluation-result/metrics.json:
          cache: false
    plots:
      - ml-example3/evaluation-result/predictions.csv:
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          cache: false
"""
with open("dvc.yaml", "w") as f:
    f.write(yml_str)
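
The `train` stage declares `metrics.json` as a metrics file and `predictions.csv` as plot data with the columns `predicted_label` and `true_label`. The source of `train.py` is not shown here, so the following is only a sketch of how a training script could produce files that match this configuration. The function `write_outputs`, the output folder `demo-evaluation-result`, and the sample labels are hypothetical; only the file names and column names come from the `dvc.yaml` above.

```python
import csv
import json
from pathlib import Path


def write_outputs(out_dir, metrics, y_true, y_pred):
    """Write metrics.json and predictions.csv in the layout dvc.yaml refers to."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Flat JSON object with numeric values.
    with open(out / "metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

    # One row per sample; header names match the x/y fields of the plot.
    with open(out / "predictions.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["predicted_label", "true_label"])
        writer.writerows(zip(y_pred, y_true))


# Hypothetical values, for illustration only.
write_outputs(
    "demo-evaluation-result",
    {"accuracy": 0.75},
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 1],
)
```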

In [5]:
# Reviewing the dvc.yaml file
!more dvc.yaml


stages:
  preprocess:
    cmd: python ml-example3/preprocess_data.py
    deps:
    - ml-example3/raw-data/weather.csv
    - ml-example3/preprocess_data.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/processed-data/weather.csv
  train:
    cmd: python ml-example3/train.py
    deps:
    - ml-example3/metrics_and_plots.py
    - ml-example3/model.py
    - ml-example3/processed-data/weather.csv
    - ml-example3/train.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/evaluation-result/confusion_matrix.png
    metrics:
      - ml-example3/evaluation-result/metrics.json:
          cache: false
    plots:
      - ml-example3/evaluation-result/predictions.csv:
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          cache: false


In [6]:
# Reviewing the process
!dvc dag

+------------+ 
| preprocess | 
+------------+ 
       *       
       *       
       *       
  +-------+    
  | train |    
  +-------+    
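
DVC infers this graph from the stage definitions: `train` comes after `preprocess` because one of its `deps` is an `outs` entry of `preprocess`. The idea can be sketched with the standard library (Python 3.9 or later). The `stages` dictionary below is a hand-copied subset of the `dvc.yaml` above, and `stage_order` is a hypothetical helper, not a DVC API:

```python
from graphlib import TopologicalSorter

# Subset of dvc.yaml, copied by hand for illustration.
stages = {
    "preprocess": {
        "deps": ["ml-example3/raw-data/weather.csv"],
        "outs": ["ml-example3/processed-data/weather.csv"],
    },
    "train": {
        "deps": ["ml-example3/processed-data/weather.csv"],
        "outs": ["ml-example3/evaluation-result/confusion_matrix.png"],
    },
}


def stage_order(stages: dict) -> list:
    """Order stages so that each one runs after the stages producing its deps."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # Map each stage to the set of stages it depends on.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))  # ['preprocess', 'train']
```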


In [7]:
# Reviewing the commands that the pipeline will execute:
!dvc repro --dry

Stage 'preprocess' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.


In [8]:
!dvc repro

Stage 'preprocess' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.


### Querying and comparing DVC metrics

In [9]:
!dvc metrics show

Path                                        accuracy    f1_score    precision    recall
ml-example3\evaluation-result\metrics.json  0.947       0.8656      0.988        0.7702
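
`dvc metrics show` reports the metrics of the current workspace. To compare them against another revision, DVC provides `dvc metrics diff`; the idea behind such a comparison can be sketched in plain Python. In the sketch below, the `current` values are the ones shown above, while the `baseline` values are made up for illustration and do not come from any real run. The function name `diff_metrics` is likewise hypothetical.

```python
def diff_metrics(baseline: dict, current: dict) -> dict:
    """Return current - baseline for every metric present in both dicts."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in sorted(current)
        if name in baseline
    }


current = {"accuracy": 0.947, "precision": 0.988, "recall": 0.7702, "f1_score": 0.8656}
# Hypothetical baseline, NOT taken from an actual run.
baseline = {"accuracy": 0.9, "precision": 0.9, "recall": 0.8, "f1_score": 0.85}

changes = diff_metrics(baseline, current)
for name, change in changes.items():
    print(f"{name:<10} {change:+.4f}")
```

On the command line, a comparison of the workspace with the `master` branch would look like `dvc metrics diff master`.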


### Committing changes in the myfeature branch

In [11]:
!git branch
!git checkout myfeature

  master
* myfeature
M	Track4-MLTrainning.ipynb


Already on 'myfeature'


In [12]:
!dvc push
!git add .
!git commit -m "DVC Pipeline for ml-example3"
!git push origin myfeature

Everything is up to date.




[myfeature 1c3a9f1] DVC Pipeline for ml-example3
 1 file changed, 40 insertions(+), 60 deletions(-)


To https://github.com/jacesca/CICD-Workflow.git
   4cd58d7..1c3a9f1  myfeature -> myfeature


### Ex.1 - Adding metrics and plots to dvc.yaml
In this exercise, your task is to complete the contents of dvc.yaml that defines a model training workflow.

Here, `preprocess_dataset.py` and `train.py` perform data preprocessing and model training, taking `weather.csv` from the `raw_dataset` folder as input (in this notebook, the corresponding local paths are `ml-example3/preprocess_data.py` and `ml-example3/raw-data`). The training code outputs a `predictions.csv` file containing the predictions and the ground truth, and a `metrics.json` file containing structured metrics data. The former is used to generate a normalized confusion matrix plot that can be compared with previous commits.

**Instructions**
1. Set the metrics target to the output metrics file.
2. Set the plot target to the output file containing the predictions data.
3. Set the plot template to `confusion_normalized` to plot the normalized confusion matrix.
4. Set the correct value for the `cache` key so that plots are tracked in the Git repository instead of the DVC remote.
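
To make the third instruction more concrete, a normalized confusion matrix can be derived from the two columns of `predictions.csv`. The sketch below normalizes each true-label row so that it sums to 1, which is one common convention and an assumption here (the DVC template may normalize differently). The function name `normalized_confusion` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def normalized_confusion(rows):
    """Map (true_label, predicted_label) -> share of that true label's samples."""
    rows = list(rows)  # allow any iterable, since the rows are scanned twice
    pair_counts = Counter((r["true_label"], r["predicted_label"]) for r in rows)
    true_totals = Counter(r["true_label"] for r in rows)
    return {
        (true, pred): count / true_totals[true]
        for (true, pred), count in pair_counts.items()
    }


# Hypothetical file content; a real run would read predictions.csv from disk.
sample = io.StringIO(
    "predicted_label,true_label\n"
    "1,1\n0,1\n1,1\n1,1\n0,0\n0,0\n"
)
matrix = normalized_confusion(csv.DictReader(sample))
for (true, pred), share in sorted(matrix.items()):
    print(f"true={true} predicted={pred} share={share:.2f}")
```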

In [13]:
!more dvc.yaml


stages:
  preprocess:
    cmd: python ml-example3/preprocess_data.py
    deps:
    - ml-example3/raw-data/weather.csv
    - ml-example3/preprocess_data.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/processed-data/weather.csv
  train:
    cmd: python ml-example3/train.py
    deps:
    - ml-example3/metrics_and_plots.py
    - ml-example3/model.py
    - ml-example3/processed-data/weather.csv
    - ml-example3/train.py
    - ml-example3/utils_and_constants.py
    outs:
    - ml-example3/evaluation-result/confusion_matrix.png
    metrics:
      - ml-example3/evaluation-result/metrics.json:
          cache: false
    plots:
      - ml-example3/evaluation-result/predictions.csv:
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          cache: false
