# Continuous Integration in Machine Learning

In this chapter, you'll integrate machine learning model training into a GitHub Actions pipeline using the Continuous Machine Learning (CML) GitHub Action, and generate a markdown report that includes model metrics and plots. You'll also look at data versioning in machine learning, adopting Data Version Control (DVC) to track data changes, set up DVC remotes, and transfer datasets. Finally, you'll explore DVC pipelines, configuring a DVC YAML file to orchestrate reproducible model training.

## Model training with GitHub Actions

Working with the `ml-example` model, we add the results of the model evaluation to the pull request as a comment:

```yaml
on:
  pull_request:
    branches:
      - master
      - myfeature

jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v6
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '👋 Thanks for reporting!'
            })
  ml-job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11.9
      - name: Install Dependencies
        run: |
          pip install -r requirements.txt
      - uses: iterative/setup-cml@v3
      - name: Train and Test model
        run: |
          python ml-example/train_and_test.py
      - name: Write CML report
        run: |
          cat ml-example/result.txt > report.md
          echo "![Confusion Matrix](ml-example/confusion-matrix.png)" >> report.md
          cml comment create report.md
        env:
          REPO_TOKEN: ${{secrets.GITHUB_TOKEN}}
```

## Ex.1 - Develop a classification model
In this exercise, you'll work with the weather dataset and write training code to predict rainfall for the next day. `preprocess_dataset.py` contains helper functions to pre-process the dataset. Your task is to finish the scaffolded `train.py` so that it implements a high-level model training flow.
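As a rough illustration of what a high-level training flow looks like (load, split, train, evaluate, save metrics), here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: the function names, the synthetic humidity data, and the threshold rule that stands in for a real classifier are hypothetical and are not the helpers from `preprocess_dataset.py` or the course's actual `train.py`. The output location `evaluation-result/metrics.json` is likewise assumed.

```python
import json
import random
from pathlib import Path


def load_toy_data(n_rows=200, seed=0):
    """Generate synthetic (humidity, rain_tomorrow) rows. Not real weather data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        humidity = rng.uniform(20, 100)
        # Toy labelling rule with noise, purely for illustration.
        rain = int(humidity + rng.gauss(0, 10) > 70)
        rows.append((humidity, rain))
    return rows


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(train_rows):
    """Stand-in for model fitting: pick the humidity threshold that
    gives the best accuracy on the training rows."""
    def accuracy_at(threshold):
        hits = sum(int(h > threshold) == y for h, y in train_rows)
        return hits / len(train_rows)

    candidates = sorted({h for h, _ in train_rows})
    return max(candidates, key=accuracy_at)


def evaluate(threshold, test_rows):
    """Compute accuracy and a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    tn = fp = fn = tp = 0
    for humidity, label in test_rows:
        predicted = int(humidity > threshold)
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(test_rows)
    return {"accuracy": accuracy, "confusion_matrix": [[tn, fp], [fn, tp]]}


def main(output_dir="evaluation-result"):
    """Run the whole flow and write metrics.json into output_dir (assumed path)."""
    rows = load_toy_data()
    train_rows, test_rows = split_train_test(rows)
    threshold = train(train_rows)
    metrics = evaluate(threshold, test_rows)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return metrics
```

Calling `main()` runs the stages in order and returns the metrics dictionary. In the real exercise, each stage would presumably delegate to the project's own modules instead of these toy functions.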

The model is encapsulated in the `ml-example2/` folder:

    !dir "ml-example2/"

     Volume in drive C is Acer
     Volume Serial Number is 28AC-E997

     Directory of C:\Users\Jacqueline\Documents\projects\CAMP-MLEngTrack\14-CICD for Machine Learning\ml-example2

    09/06/2024  05:56 PM    <DIR>          .
    09/06/2024  06:07 PM    <DIR>          ..
    09/06/2024  06:05 PM    <DIR>          evaluation-result
    09/06/2024  06:04 PM               540 metrics_and_plots.py
    09/06/2024  06:00 PM               959 model.py
    09/06/2024  05:56 PM             3,553 preprocess_data.py
    09/06/2024  05:56 PM    <DIR>          processed-data
    09/06/2024  05:46 PM    <DIR>          raw-data
    09/06/2024  05:58 PM             1,191 train.py
    09/06/2024  06:04 PM               613 utils_and_constants.py
    09/06/2024  06:05 PM    <DIR>          __pycache__
                   5 File(s)          6,856 bytes
                   6 Dir(s)  516,870,754,304 bytes free


## Ex.2 - Finish the CML training workflow

Your task is to finish the scaffolded `.github/workflows/train_cml.yaml` to formulate a high-level model training flow.

NOTE: Use `python3` instead of `python` to run Python scripts.

- Set up the CML GitHub Action `iterative/setup-cml@v1`.
- Add the evaluation metrics data, `metrics.json`, to the markdown report in the `Write CML report` step.
- Add the confusion matrix plot, `confusion_matrix.png`, to the markdown report in the `Write CML report` step.
- Write the correct `cml comment` subcommand to create a comment in the PR.
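The report-writing step can also be sketched in Python, as an alternative to appending files with shell commands. The helper below is hypothetical: `build_report`, its parameters, and the table layout are illustrative choices, not part of the exercise scaffold. It assumes `metrics.json` holds a flat JSON object of metric names and values.

```python
import json
from pathlib import Path


def build_report(metrics_path, plot_path, report_path="report.md"):
    """Write a markdown report with a metrics table and an image link.

    Assumes metrics_path points to a flat JSON object such as
    {"accuracy": 0.9}; nested values are printed as-is.
    """
    metrics = json.loads(Path(metrics_path).read_text(encoding="utf-8"))

    lines = ["## Model metrics", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value} |")

    # Standard markdown image syntax pointing at the plot file.
    lines += ["", f"![Confusion Matrix]({plot_path})", ""]

    report = Path(report_path)
    report.write_text("\n".join(lines), encoding="utf-8")
    return report
```

After a call such as `build_report("metrics.json", "confusion_matrix.png")`, the workflow step would post the result with `cml comment create report.md`, the same command used in the workflow earlier in this chapter.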