# 3. Continuous Integration in Machine Learning

In this chapter, you'll integrate machine learning model training into a GitHub Actions pipeline using the Continuous Machine Learning (CML) GitHub Action, and generate a markdown report that includes model metrics and plots. You'll also version data with Data Version Control (DVC) to track dataset changes, set up DVC remotes, and transfer datasets to and from them. Finally, you'll explore DVC pipelines, configuring a dvc.yaml file to orchestrate reproducible model training.

## 3.1 - Model training with GitHub Actions

Working with the ml-example model, we add the results of the model evaluation to the pull request as a comment:

```
on:
  pull_request:
    branches:
      - master
      - myfeature

jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v6
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '👋 Thanks for reporting!'
            })
  ml-job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11.9
      - name: Install Dependencies
        run: |
          pip install -r requirements.txt
      - uses: iterative/setup-cml@v3
      - name: Train and Test model
        run: |
          python ml-example/train_and_test.py
      - name: Write CML report
        run: |
          echo "# ML Example 1" > report.md
          cat ml-example/result.txt >> report.md
          echo "![Confusion Matrix](ml-example/confusion-matrix.png)" >> report.md
          cml comment create report.md
        env:
          REPO_TOKEN: ${{secrets.GITHUB_TOKEN}}
```
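The `Write CML report` step assembles a markdown file with plain shell redirection before handing it to `cml comment create`. As a rough illustration, the same assembly could be sketched in Python. The function name `build_report` is hypothetical (it is not part of CML or of the course code), and the metric text in the example is a placeholder.

```python
def build_report(title: str, result_text: str, image_path: str) -> str:
    """Return markdown roughly equivalent to the echo/cat sequence in the workflow."""
    lines = [
        f"# {title}",                          # echo "# ..." > report.md
        result_text.rstrip("\n"),              # cat result.txt >> report.md
        f"![Confusion Matrix]({image_path})",  # echo "![...](...)" >> report.md
    ]
    return "\n".join(lines) + "\n"


# Placeholder result text, not a real evaluation result.
report = build_report(
    "ML Example 1",
    "accuracy: 0.9\n",
    "ml-example/confusion-matrix.png",
)
print(report)
```

Unlike the shell version, this sketch always puts the image link on its own line, even if the result file lacks a trailing newline.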

### Ex.1 - Develop a classification model
In this exercise, you'll work with the weather dataset and develop training code to predict rainfall for the next day. The preprocess_data.py script contains helper functions to pre-process the dataset. Your task is to finish the scaffolded train.py to formulate a high-level model training flow.

The model is contained in the 'ml-example2/' folder.

In [1]:
!dir "ml-example2/"

 Volume in drive C is Acer
 Volume Serial Number is 28AC-E997

 Directory of C:\Users\Jacqueline\Documents\projects\CAMP-MLEngTrack\14-CICD for Machine Learning\ml-example2

09/06/2024  05:56 PM    <DIR>          .
09/07/2024  01:22 PM    <DIR>          ..
09/07/2024  12:50 PM    <DIR>          evaluation-result
09/06/2024  06:04 PM               540 metrics_and_plots.py
09/06/2024  06:00 PM               959 model.py
09/06/2024  05:56 PM             3,553 preprocess_data.py
09/07/2024  12:50 PM    <DIR>          processed-data
09/06/2024  05:46 PM    <DIR>          raw-data
09/06/2024  05:58 PM             1,191 train.py
09/06/2024  06:04 PM               613 utils_and_constants.py
09/06/2024  06:05 PM    <DIR>          __pycache__
               5 File(s)          6,856 bytes
               6 Dir(s)  517,232,308,224 bytes free
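
The course's train.py is not reproduced in these notes, so the following is only a minimal sketch of what a high-level training flow (split, train, evaluate, report metrics) might look like. It uses a stand-in majority-class "model" and toy rows instead of the real classifier and weather dataset; every function name here is hypothetical. The metric names follow the metrics.json content printed later in this chapter.

```python
import json
import random


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle and split rows into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train_majority(train_rows):
    """Stand-in 'model': always predict the most common training label."""
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = rain)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


def run_flow(rows):
    """Split the data, fit the stand-in model, and evaluate it on the test part."""
    train_rows, test_rows = split(rows)
    majority = train_majority(train_rows)
    y_true = [label for _, label in test_rows]
    y_pred = [majority] * len(y_true)
    return evaluate(y_true, y_pred)


# Toy rows of (features, label); not the weather dataset.
toy_rows = [((i,), int(i % 4 == 0)) for i in range(40)]
print(json.dumps(run_flow(toy_rows), indent=2))
```

In the real exercise, the stand-in model would be replaced by an actual classifier and the metrics written to a JSON file rather than printed.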


### Ex.2 - Your task is to finish the scaffolded .github/workflows/train_cml.yaml to formulate a high-level model training flow.

**Instruction:**

1. Set up the CML GitHub Action iterative/setup-cml@v1 (the solution below uses the newer iterative/setup-cml@v3).
2. Add evaluation metrics data, metrics.json, to the markdown report in the Write CML report step.
3. Add confusion matrix plot, confusion_matrix.png , to the markdown report in the Write CML report step.
4. Write the correct cml comment subcommand to create a comment in the PR.

---------

```
name: comments-ml-example2

on:
  pull_request:
    branches: ["master", "myfeature"]

permissions: write-all

jobs:
  train_and_test_model_ml2:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11.9

      - name: Install Dependencies
        run: |
          pip install -r requirements.txt

      - name: Train and test model
        run: |
          python ml-example2/preprocess_data.py
          python ml-example2/train.py

      - name: Setup CML GitHub Actions
        uses: iterative/setup-cml@v3

      - name: Write CML report
        env:
          REPO_TOKEN: ${{secrets.GITHUB_TOKEN}}
        run: |
          echo "# ML Example 2" > model_eval_report.md
          cat ml-example2/evaluation-result/metrics.json >> model_eval_report.md
          echo "![Confusion Matrix Plot](ml-example2/evaluation-result/confusion_matrix.png)" >> model_eval_report.md
          cml comment create model_eval_report.md
```
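The workflow pastes metrics.json into the report with `cat`, so the comment shows raw JSON. One possible refinement, sketched here under the assumption that the metrics file is a flat JSON object, is to render it as a markdown table first. The helper name `metrics_to_markdown` is hypothetical.

```python
import json


def metrics_to_markdown(metrics: dict) -> str:
    """Render a flat metrics mapping as a two-column markdown table."""
    rows = ["| metric | value |", "| --- | --- |"]
    rows += [f"| {name} | {value} |" for name, value in metrics.items()]
    return "\n".join(rows) + "\n"


# Values copied from the metrics printed by train.py later in this chapter.
example = json.loads(
    '{"accuracy": 0.947, "precision": 0.988, "recall": 0.7702, "f1_score": 0.8656}'
)
print(metrics_to_markdown(example))
```

A script like this could run in the `Write CML report` step in place of the `cat` line, with its output appended to model_eval_report.md.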

## 3.2 - Versioning datasets with Data Version Control

### Ex.3 - Data versioning in action

Data Version Control (DVC) provides a systematic approach to versioning data, an aspect of ML projects that is often overlooked. With DVC, you can track exactly how a dataset changes over time, which makes experiments reproducible, simplifies collaboration, and helps when troubleshooting data-related problems.

In this exercise, you will practice initializing a DVC project and versioning a dataset. Git has already been initialized for this project.

**Instruction:**

1. Initialize DVC in the workspace.
2. Verify that .dvcignore file and .dvc folder are present.
3. Add dataset.csv to DVC and examine the contents of dataset.csv.dvc by opening it in the file editor.
4. Verify that DVC cache is populated by running find .dvc/cache -type f command in terminal.
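
The cache and remote paths that appear in this chapter (for example `files/md5/8b/80f6484630c3b5b0dacaabc37afaf0`) suggest that each stored file is named after its MD5 digest, with the first two hex characters used as a folder name. The sketch below reproduces that naming scheme, assuming the digest is taken over the file's raw bytes. It is an illustration of content addressing, not DVC's own code, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path, PurePosixPath


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex: str) -> PurePosixPath:
    """Map a digest to the files/md5/<2 chars>/<rest> layout seen in this chapter."""
    return PurePosixPath("files", "md5", md5_hex[:2], md5_hex[2:])


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "dataset.csv")  # stand-in file, not the course dataset
    sample.write_bytes(b"")            # an empty file
    empty_digest = md5_of_file(sample)
    print(cache_relpath(empty_digest))
```

The digest of an empty file is `d41d8cd98f00b204e9800998ecf8427e`; the same value appears among the my-remote-storage paths in the Git output in this chapter, which suggests that an empty file was stored at some point.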

---------------------

In [2]:
# Confirming we are on the myfeature branch
!git checkout myfeature
!git branch

D	.dvc/.gitignore
D	.dvc/config
M	Track3-CIML.ipynb
D	my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
D	my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
D	my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48


Already on 'myfeature'


  master
* myfeature


In [3]:
# Creating a new dataset
!copy data-sources\weather.csv data-sources\dataset.csv

        1 file(s) copied.


In [4]:
# Initializing DVC
!dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [None]:
# We need to ensure that 'data-sources\dataset.csv' is not tracked
# - ml-example2\processed-data\weather.csv
# - ml-example2\evaluation-result\metrics.json
# - ml-example2\evaluation-result\confusion_matrix.png
!git add ml-example2\processed-data\weather.csv
!git commit -m "Tracking ml-example2\processed-data\weather.csv"
!git rm -r --cached ml-example2\processed-data\weather.csv
!git commit -m "stop tracking ml-example2\processed-data\weather.csv"

In [5]:
# Adding a datafile for versioning
!dvc add data-sources/dataset.csv

⠋ Checking graph

ERROR:  output 'data-sources\dataset.csv' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'data-sources\dataset.csv'
            git commit -m "stop tracking data-sources\dataset.csv" 


In [6]:
!git add data-sources/dataset.csv.dvc
!git commit -m "dataset.csv.dvc added"

fatal: pathspec 'data-sources/dataset.csv.dvc' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	modified:   data-sources/dataset.csv
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


In [7]:
!git push origin myfeature

To https://github.com/jacesca/CICD-Workflow.git
   35ca3b4..9ab0516  myfeature -> myfeature


In [8]:
# Reviewing what happened with the versioned dataset
!dir data-sources\ /b

dataset.csv
weather.csv


In [9]:
# Reviewing the content of the new .dvc file
!more data-sources\dataset.csv.dvc

Cannot access file C:\Users\Jacqueline\Documents\projects\CAMP-MLEngTrack\14-CICD for Machine Learning\data-sources\dataset.csv.dvc


In [10]:
# Reviewing the content of the .gitignore file
# We observe that the current dataset.csv is excluded from Git versioning
!more data-sources\.gitignore

Cannot access file C:\Users\Jacqueline\Documents\projects\CAMP-MLEngTrack\14-CICD for Machine Learning\data-sources\.gitignore


In [11]:
# The file was tracked in dvc
!findstr /n .  .\.dvc\cache\files\md5\8b\80f6484630c3b5b0dacaabc37afaf0 | findstr "^[0-5]:"

FINDSTR: Cannot open .\.dvc\cache\files\md5\8b\80f6484630c3b5b0dacaabc37afaf0


## 3.3 - Interacting with DVC remotes

### Ex.4 - DVC remotes in action
In this exercise, you'll learn how to set up and use DVC remotes to store and share your datasets securely. Whether it's a colleague across the globe or your future self working on the project, DVC remotes ensure that your data is readily accessible and up-to-date. This exercise already has DVC initialized and the dataset added to DVC cache. We will be limiting ourselves to DVC remotes set up on a local filesystem.

The syntax for adding a default DVC remote is

dvc remote add -d --local <remote_name> </path/to/folder>

where -d sets the remote as the default, and --local writes the setting to .dvc/config.local, a configuration file that is not tracked by Git, instead of .dvc/config.

**Instruction:**

1. Set up a local DVC remote named myremote pointed at /tmp/dvc/localremote and verify it is empty by examining output of ls /tmp/dvc/localremote.
2. Examine the contents of .dvc/config.local. Is the default remote set correctly to myremote?
3. Run dvc push and verify that the local remote now contains the file.
4. Run dvc pull and verify that "Everything up to date" appears as shell output.

----------------

```
dvc remote add -d --local myremote /tmp/dvc/localremote
cat .dvc/config.local
dvc push
dvc pull
```
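Conceptually, pushing to a local-filesystem remote amounts to copying content-addressed files that the remote does not yet have, which is why a second push has nothing left to do. The sketch below models only that idea with two temporary folders. It is not how DVC is implemented (DVC pushes only the objects referenced by tracked files), and the function name `push` and the object name are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def push(cache_dir: Path, remote_dir: Path) -> int:
    """Copy every cache file that the remote lacks; return how many were copied."""
    copied = 0
    for src in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        dst = remote_dir / src.relative_to(cache_dir)
        if not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp, "cache"), Path(tmp, "remote")
    obj = cache / "files" / "md5" / "ab" / "cdef"  # made-up object name
    obj.parent.mkdir(parents=True)
    obj.write_text("toy data")
    first = push(cache, remote)   # copies the one missing object
    second = push(cache, remote)  # nothing left to copy
    print(first, second)
```

Because file names are derived from content, checking whether a name already exists in the remote is enough to decide whether a copy is needed.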

In [12]:
# Setting the default remote folder to track the changes on dataset
!dvc remote add -d --local myremote .\my-remote-storage\ --force

Setting 'myremote' as a default remote.


In [13]:
# Reviewing the configuration result
!more .dvc\config.local

[core]
    remote = myremote
['remote "myremote"']
    url = ../my-remote-storage
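
The config.local content above is INI-like, so the default remote and its URL can be read programmatically. The sketch below uses Python's standard `configparser` on a copy of that text. This is an assumption made for illustration: DVC has its own configuration reader, and the helper name `default_remote` is hypothetical.

```python
import configparser

# Copy of the .dvc/config.local content shown above.
CONFIG_TEXT = """\
[core]
    remote = myremote
['remote "myremote"']
    url = ../my-remote-storage
"""


def default_remote(config_text: str) -> tuple[str, str]:
    """Return the default remote's name and URL from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes exactly as written in the file.
    url = parser[f"'remote \"{name}\"'"]["url"]
    return name, url


print(default_remote(CONFIG_TEXT))
```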


In [14]:
# Sending the tracked file
!dvc push -r myremote data-sources\dataset.csv

ERROR: failed to push data to the cloud - 'data-sources/dataset.csv' does not exist as an output or a stage name in 'dvc.yaml': Stage 'data-sources/dataset.csv' not found inside 'dvc.yaml' file


In [15]:
# What is saved in the remote storage?
!dir my-remote-storage\files\md5\8b\80f6484630c3b5b0dacaabc37afaf0 /b

The system cannot find the path specified.


In [16]:
# Verifying if we are up to date
!dvc pull

Everything is up to date.


md5: 5a799ba072f8399633fbd5b922d7c499
md5: dcef62662b1736bc960c55d85071cbda
md5: ec916b9800512aebcd3824504c316d48


In [17]:
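# Making changes to the dataset: keep the first nine lines, each prefixed with its line number.
# Write to a temporary file first; redirecting onto the input file would truncate it before it is read.
!findstr /n .  data-sources\dataset.csv | findstr "^[0-9]:" > data-sources\dataset.tmp
!move /y data-sources\dataset.tmp data-sources\dataset.csv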

In [18]:
# Registering changes in dvc
!dvc add data-sources\dataset.csv

⠋ Checking graph

ERROR:  output 'data-sources\dataset.csv' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'data-sources\dataset.csv'
            git commit -m "stop tracking data-sources\dataset.csv" 


In [19]:
# Registering changes in Git
!git add data-sources\dataset.csv.dvc
!git commit -m "Dataset updates"

fatal: pathspec 'data-sources\dataset.csv.dvc' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


In [20]:
# Push metadata to Git
!git push origin myfeature

Everything up-to-date


In [21]:
# Upload changed data file
# As we have a default remote, we do not need to specify it with -r
!dvc push

Everything is up to date.


md5: 5a799ba072f8399633fbd5b922d7c499
md5: ec916b9800512aebcd3824504c316d48
md5: dcef62662b1736bc960c55d85071cbda


## 3.4 - DVC pipelines

In [22]:
# Listing the files we are going to work with
!dir ml-example2\ /b

evaluation-result
metrics_and_plots.py
model.py
preprocess_data.py
processed-data
raw-data
train.py
utils_and_constants.py
__pycache__


In [23]:
# Removing previous tracking to start with a clean process
!dvc remove data-sources\dataset.csv.dvc

ERROR: 'C:\Users\Jacqueline\Documents\projects\CAMP-MLEngTrack\14-CICD for Machine Learning\data-sources\dataset.csv.dvc' does not exist


In [24]:
# We need to ensure that the following files are not tracked
# - ml-example2\processed-data\weather.csv
# - ml-example2\evaluation-result\metrics.json
# - ml-example2\evaluation-result\confusion_matrix.png
!git add ml-example2\processed-data\weather.csv
!git commit -m "Tracking ml-example2\processed-data\weather.csv"
!git rm -r --cached ml-example2\processed-data\weather.csv
!git commit -m "stop tracking ml-example2\processed-data\weather.csv"

!git add ml-example2\evaluation-result\metrics.json
!git commit -m "Tracking ml-example2\evaluation-result\metrics.json"
!git rm -r --cached ml-example2\evaluation-result\metrics.json
!git commit -m "stop tracking ml-example2\evaluation-result\metrics.json"

!git add ml-example2\evaluation-result\confusion_matrix.png
!git commit -m "Tracking ml-example2\evaluation-result\confusion_matrix.png"
!git rm -r --cached ml-example2\evaluation-result\confusion_matrix.png
!git commit -m "stop tracking ml-example2\evaluation-result\confusion_matrix.png"

The following paths are ignored by one of your .gitignore files:
ml-example2/processed-data/weather.csv
hint: Use -f if you really want to add them.
hint: Disable this message with "git config advice.addIgnoredFile false"


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


fatal: pathspec 'ml-example2\processed-data\weather.csv' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


The following paths are ignored by one of your .gitignore files:
ml-example2/evaluation-result/metrics.json
hint: Use -f if you really want to add them.
hint: Disable this message with "git config advice.addIgnoredFile false"


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


fatal: pathspec 'ml-example2\evaluation-result\metrics.json' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


The following paths are ignored by one of your .gitignore files:
ml-example2/evaluation-result/confusion_matrix.png
hint: Use -f if you really want to add them.
hint: Disable this message with "git config advice.addIgnoredFile false"


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


fatal: pathspec 'ml-example2\evaluation-result\confusion_matrix.png' did not match any files


On branch myfeature
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Track3-CIML.ipynb
	deleted:    my-remote-storage/files/md5/5a/799ba072f8399633fbd5b922d7c499
	deleted:    my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
	deleted:    my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e
	deleted:    my-remote-storage/files/md5/dc/ef62662b1736bc960c55d85071cbda
	deleted:    my-remote-storage/files/md5/ec/916b9800512aebcd3824504c316d48

no changes added to commit (use "git add" and/or "git commit -a")


In [25]:
# Removing any previous dvc.yaml file
!del dvc.yaml

In [26]:
# Preparing the DVC pipeline to execute ML Example 2
!dvc stage add -n preprocess                             \
               -d ml-example2/raw-data/weather.csv       \
               -d ml-example2/preprocess_data.py         \
               -d ml-example2/utils_and_constants.py     \
               -o ml-example2/processed-data/weather.csv \
               python ml-example2/preprocess_data.py
!dvc stage add -n train                                              \
               -d ml-example2/metrics_and_plots.py                   \
               -d ml-example2/model.py                               \
               -d ml-example2/processed-data/weather.csv             \
               -d ml-example2/train.py                               \
               -d ml-example2/utils_and_constants.py                 \
               -o ml-example2/evaluation-result/metrics.json         \
               -o ml-example2/evaluation-result/confusion_matrix.png \
               python ml-example2/train.py

Added stage 'preprocess' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
Added stage 'train' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


In [27]:
# Reviewing the process
!dvc dag

+------------+ 
| preprocess | 
+------------+ 
       *       
       *       
       *       
  +-------+    
  | train |    
  +-------+    


In [28]:
# Reviewing the commands that the pipeline will execute:
!dvc repro --dry

Running stage 'preprocess':
> python ml-example2/preprocess_data.py

Running stage 'train':
> python ml-example2/train.py
Use `dvc push` to send your updates to remote storage.


In [29]:
# Executing the pipeline
!dvc repro

Running stage 'preprocess':
> python ml-example2/preprocess_data.py

Running stage 'train':
> python ml-example2/train.py
{
  "accuracy": 0.947,
  "precision": 0.988,
  "recall": 0.7702,
  "f1_score": 0.8656
}
Use `dvc push` to send your updates to remote storage.
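
After a run, `dvc repro` records hashes of each stage's dependencies and outputs in dvc.lock, and on later runs it skips stages whose recorded hashes still match. The sketch below illustrates only the dependency side of that check with a single stand-in file. The function names are hypothetical and the in-memory dictionary stands in for the lock file; it is not DVC's implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths: list[Path]) -> dict[str, str]:
    """Hash each dependency file by content."""
    return {str(p): hashlib.md5(p.read_bytes()).hexdigest() for p in paths}


def needs_rerun(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage is stale when any dependency hash differs from the recorded one."""
    return fingerprint(deps) != recorded


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp, "train.py")  # stand-in dependency file
    script.write_text("print('v1')\n")
    lock = fingerprint([script])             # state saved after a run
    unchanged = needs_rerun([script], lock)  # nothing changed since the run
    script.write_text("print('v2')\n")
    changed = needs_rerun([script], lock)    # content now differs
    print(unchanged, changed)
```

This is why listing every script and data file with -d matters: a change to an unlisted file would not mark the stage as stale.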


In [30]:
!dvc push

3 files pushed


In [31]:
!git add .



In [32]:
!git commit -m "DVC Pipeline Testing"

[myfeature 02be940] DVC Pipeline Testing
 3 files changed, 50 insertions(+), 25003 deletions(-)
 delete mode 100644 my-remote-storage/files/md5/8b/80f6484630c3b5b0dacaabc37afaf0
 delete mode 100644 my-remote-storage/files/md5/d4/1d8cd98f00b204e9800998ecf8427e


### Ex.5 - Creating a DVC pipeline

Imagine a simple example of a workflow where a document is printed and then scanned to create a signed PDF document, with DVC managing the dependencies and outputs of each stage.

The print stage depends on printing instructions outlined in print.sh and produces the pages output. The scan stage depends on instructions in scan.sh and pages (output of printer) and produces a signed.pdf output.

Your task is to design a DVC pipeline outlining the workflow using the dvc stage add command. Its syntax is
```
$ dvc stage add -n <stage_name> -d <dependency> -o <output> <command>
```
You can add multiple dependencies and outputs with repeated use of -d and -o flags, respectively.

NOTE: DVC has already been initialized in the exercise setup. There is no need to run dvc init again.

**Instruction:**

1. Design the print stage with print.sh as a dependency, pages as output, and ./print.sh as command.
2. Design the scan stage with scan.sh and pages as dependencies, signed.pdf as output, and ./scan.sh as command.
3. Verify dvc.yaml is written correctly.
4. Visualize the pipeline with dvc dag.


------------------------------------
```
$ dvc stage add -n print -d print.sh -o pages ./print.sh
$ dvc stage add -n scan -d scan.sh -d pages -o signed.pdf ./scan.sh
```

-------------------------------------
**dvc.yaml** file:

```
stages:
  print:
    cmd: ./print.sh
    deps:
    - print.sh
    outs:
    - pages
  scan:
    cmd: ./scan.sh
    deps:
    - pages
    - scan.sh
    outs:
    - signed.pdf
```
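The execution order follows from matching each stage's deps against other stages' outs: scan depends on pages, which print produces. A minimal sketch of that derivation, using the two stages from the dvc.yaml above written as a plain dictionary, could look like this. The function name `execution_order` is hypothetical, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# The two stages from the dvc.yaml above, written as a plain dict.
STAGES = {
    "print": {"deps": ["print.sh"], "outs": ["pages"]},
    "scan": {"deps": ["pages", "scan.sh"], "outs": ["signed.pdf"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each runs after the stages producing its deps."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))  # ['print', 'scan']
```

Dependencies such as print.sh and scan.sh are not produced by any stage, so they add no edges to the graph.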

-------------------------------------
```
$ dvc dag
+-------+  
| print |  
+-------+  
    *      
    *      
    *      
+------+   
| scan |   
+------+ 
```

------------------