# Continuous Integration and Deployment Pipelines

![Status](https://img.shields.io/static/v1.svg?label=Status&message=Finished&color=green)

In this section, we will develop a CI/CD pipeline for our model package and prediction serving API. CI/CD stands for continuous integration and continuous deployment. This involves automatic testing for changes that are being merged to the main or master branch of the code repository, as well as automatically building and deploying the model package and the associated API. 

Automation means that no person needs to run a script or SSH into a machine every time a change is made to the code base, which can be time consuming and error prone. Moreover, having a CI/CD pipeline means that the system is always in a releasable state so that development teams can quickly react to issues in production. 
Hence, a project that has a CI/CD pipeline can have faster release cycles, with changes to the code deployed on a regular basis, e.g. every few days instead of every few months. Because each release is small, it is less likely to break things, and it is easier to integrate our piece of software into the wider system since the changes around the model are also small.

Finally, CI/CD platforms add visibility to the release cycles, which can be important when performing audits. For our project, we will be using the [CircleCI](https://circleci.com/) platform, which has a free tier. We will upload our model package to [Gemfury](https://gemfury.com/), which is a private package index.

## CircleCI config

CircleCI is a third-party platform for managing CI/CD pipelines. It is a good all-around tool with a free tier. To set up CircleCI workflows, we only need to create a `.circleci` directory in the root of our repository which contains a `config.yml` file. We can then set up the project in CircleCI after logging in with our GitHub account.

Recall that previously, we used `tox` as our main tool to train and test our model. Then, we built the model package and uploaded it to PyPI. A similar process was done for our API which was deployed to Heroku. We will continue to use `tox` as the workhorse of our CI workflows. 

```{margin}
[`.circleci/config.yml`](https://github.com/particle1331/model-deployment/blob/main/.circleci/config.yml)
```
```YAML
version: 2


defaults: &defaults
  docker:
    - image: circleci/python:3.9.5
  working_directory: ~/project

prepare_tox: &prepare_tox
  run:
    name: Install tox
    command: |
      sudo pip install --upgrade pip
      pip install --user tox
```

Here the `&` notation is a YAML anchor: it names a block so that we can reuse it later with the alias notation `*`. The `version` key specifies the version of the CircleCI configuration format. First, we define `defaults`, which specifies the default environment settings. The other one is `prepare_tox`, which upgrades `pip` and installs `tox`. These two will be used by the jobs which we define below.

### Jobs

```{margin}
[`.circleci/config.yml`](https://github.com/particle1331/model-deployment/blob/cicd/.circleci/config.yml)
```
```YAML
jobs:
  test_app:
    <<: *defaults
    working_directory: ~/project/api
    steps:
      - checkout:
          path: ~/project
      - *prepare_tox
      - run:
          name: Running app tests
          command: |
            tox
  
  deploy_app_to_heroku:
    <<: *defaults
    steps:
      - checkout:
          path: ~/project
      - run:
          name: Deploy to Heroku
          command: |
            git subtree push --prefix api https://heroku:$HEROKU_API_KEY@git.heroku.com/$HEROKU_APP_NAME.git main
  
  train_and_upload_regression_model:
    <<: *defaults
    working_directory: ~/project/packages/regression_model
    steps:
      - checkout:
          path: ~/project
      - *prepare_tox
      - run:
          name: Fetch the data
          command: |
            tox -e fetch_data
      - run:
          name: Train the model
          command: |
            tox -e train
      - run:
          name: Test the model
          command: |
            tox
      - run:
          name: Publish model to Gemfury
          command: |
            tox -e publish_model
```
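The jobs above reuse `defaults` through `<<: *defaults` and then override `working_directory` where needed. As a rough analogy only (this is not how CircleCI parses the file), the merge behaves like dictionary unpacking in Python, where keys written explicitly in the job take precedence over the merged ones:

```python
# Rough Python analogy for the YAML merge key `<<: *defaults`.
# This is an illustration, not how CircleCI reads config.yml.
defaults = {
    "docker": [{"image": "circleci/python:3.9.5"}],
    "working_directory": "~/project",
}

# Keys written explicitly in the job override the merged defaults.
test_app = {**defaults, "working_directory": "~/project/api"}

# A job with no overrides keeps the defaults unchanged.
deploy_app_to_heroku = {**defaults}

print(test_app["working_directory"])              # ~/project/api
print(deploy_app_to_heroku["working_directory"])  # ~/project
```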

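The `<<` notation is the YAML merge key: it copies all contents of the referenced block into the current level, and keys defined explicitly (such as `working_directory`) override the merged ones. The `checkout` step checks out the source code. We set `path: ~/project` so that the whole repository lands in `~/project` even when the job's `working_directory` is a subdirectory. First, we have `test_app`, which runs the tests in the `api` directory, i.e. for the model serving API. Next, we have `deploy_app_to_heroku`, which does not run any tests. It just pushes the `api` subdirectory to Heroku using `git subtree push`. Let us look at the `tox` file for the first job: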

```{margin}
[`api/tox.ini`](https://github.com/particle1331/model-deployment/blob/main/api/tox.ini)
```
```ini
[testenv]
install_command = pip install {opts} {packages}

passenv =
	PIP_EXTRA_INDEX_URL

...
```

### Secrets

The only modification to the `tox` file above is `passenv`, which tells `tox` to pass the `PIP_EXTRA_INDEX_URL` environment variable through to the test environment. With this variable set, `pip` searches the extra index in addition to PyPI when looking for packages. Note that `HEROKU_API_KEY` and `HEROKU_APP_NAME` are also environment variables that we set in the project settings of CircleCI.

```{margin}
[`api/requirements.txt`](https://github.com/particle1331/model-deployment/blob/cicd/api/requirements.txt)
```
```text
--extra-index-url ${PIP_EXTRA_INDEX_URL}
```

In particular, we will upload the `regression_model` package to Gemfury.
We will set `PIP_EXTRA_INDEX_URL` to be the same as `GEMFURY_PUSH_URL`, which has the following format:

```
https://TOKEN:@pypi.fury.io/USERNAME/
```
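Since this URL embeds a secret token, it helps to be careful whenever it is built or printed. Below is a minimal sketch, assuming only the URL format shown above. The helper names `gemfury_index_url` and `redact` are hypothetical and not part of any Gemfury tooling:

```python
from urllib.parse import quote, urlsplit


def gemfury_index_url(token: str, username: str) -> str:
    """Build a URL in the format shown above (hypothetical helper)."""
    return f"https://{quote(token, safe='')}:@pypi.fury.io/{quote(username, safe='')}/"


def redact(url: str) -> str:
    """Hide the token part of the URL so it is safe to print in CI logs."""
    parts = urlsplit(url)
    return f"{parts.scheme}://***:@{parts.hostname}{parts.path}"


url = gemfury_index_url("TOKEN", "USERNAME")
print(url)          # https://TOKEN:@pypi.fury.io/USERNAME/
print(redact(url))  # https://***:@pypi.fury.io/USERNAME/
```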

<br>

**Remark.** It would be nice if we could set a particular index for each package, so that all other packages are installed from PyPI and our model package specifically from Gemfury, since a package of the same name may exist in PyPI. However, the `requirements.txt` method does not allow this in one file. A fix is to use multiple requirements files, with `--index-url ${PIP_EXTRA_INDEX_URL}` specified in one file to determine the index for the model package, then install using:

```
pip install -r requirements.txt -r package_requirements.txt
```

```{figure} ../../img/secrets.png
---
---
Environment variables in the CircleCI project settings.
```

### Build and upload package

Next, we have the automatic build and upload step, which fetches the data from Kaggle, trains the model, and uploads the model package to Gemfury. For other projects, the fetch part can be replaced by, for example, downloading from an S3 bucket using the AWS CLI or making a database call. These steps depend on the `tox` file in the model package:

```{margin}
[`packages/regression_model/tox.ini`](https://github.com/particle1331/model-deployment/blob/cicd/packages/regression_model/tox.ini)
```
```ini
...
[testenv]
install_command = pip install {opts} {packages}

passenv =
	KAGGLE_USERNAME
	KAGGLE_KEY
	GEMFURY_PUSH_URL
...

[testenv:fetch_data]
envdir 	 = {toxworkdir}/test_package
deps 	 = {[testenv:test_package]deps}
setenv   = {[testenv:test_package]setenv}
commands =
	kaggle competitions download -c house-prices-advanced-regression-techniques -p ./regression_model/datasets
	unzip ./regression_model/datasets/house-prices-advanced-regression-techniques.zip -d ./regression_model/datasets


[testenv:publish_model]
envdir 	 = {toxworkdir}/test_package
deps 	 = {[testenv:test_package]deps}
setenv 	 = {[testenv:test_package]setenv}
commands =
	pip install --upgrade build
	python -m build
	python publish_model.py

...
```

Here we focus on two environments. First, `fetch_data` uses the Kaggle CLI to download the data, hence the `KAGGLE_USERNAME` and `KAGGLE_KEY` secrets are required. These can be obtained from the `kaggle.json` API token file of your Kaggle account.
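A job that is missing one of these secrets tends to fail late and with an unhelpful message. As a minimal sketch (the helper `missing_secrets` is hypothetical and not part of the repository), we could check up front that every variable listed under `passenv` is set:

```python
import os

# The variables listed under `passenv` in the model package's tox file.
REQUIRED_SECRETS = ("KAGGLE_USERNAME", "KAGGLE_KEY", "GEMFURY_PUSH_URL")


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_secrets({"KAGGLE_USERNAME": "user"}))
# ['KAGGLE_KEY', 'GEMFURY_PUSH_URL']
```

Calling `missing_secrets()` without arguments checks the actual process environment, so a CI step could exit early when the returned list is not empty.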

Next, we have `publish_model`, which first builds the regression model package using the Python `build` module. This results in a `dist` directory containing build artifacts, which are then pushed to Gemfury using the `publish_model.py` script:

```{margin}
[`packages/regression_model/publish_model.py`](https://github.com/particle1331/model-deployment/blob/cicd/packages/regression_model/publish_model.py)
```
```py
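import glob
import os
import subprocess


# Push URL (includes the secret token); set in the CircleCI project settings.
push_url = os.environ["GEMFURY_PUSH_URL"]

for p in glob.glob("dist/*.whl"):
    try:
        # --fail makes curl exit non-zero on HTTP errors so that check=True raises.
        subprocess.run(["curl", "--fail", "-F", f"package=@{p}", push_url], check=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Uploading package failed on file {p}") from e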
```

### Workflows 

Next, the `config.yml` defines workflows. A workflow determines which jobs run, and in what order, for each push to the repository, based on the filters and requirements of each job. Below we define a single workflow called `regression-model`.

```{margin}
[`.circleci/config.yml`](https://github.com/particle1331/model-deployment/blob/cicd/.circleci/config.yml)
```
```YAML
...

workflows:
  version: 2
  regression-model:
    jobs:
      - test_app
      
      - train_and_upload_regression_model:
          filters:
            # Ignore any commit on any branch by default
            branches:
              ignore: /.*/
            # Only act on version tags
            tags:
              only: /^.*/

      - deploy_app_to_heroku:
          requires:
            - test_app
          filters:
            branches:
              only:
                - main
```

First, we have `test_app`, which runs the tests for the API for each commit on each branch. Next, the model package build and upload job `train_and_upload_regression_model` is triggered only when a new tag is pushed to the git repository. Note that the pattern `/^.*/` matches any tag name, so this relies on us using tags only for versions. Lastly, `deploy_app_to_heroku` is triggered for each push to `main`. Note that the app is deployed to Heroku only if the `test_app` job passes. This makes sense since we do not want to deploy an API build that fails its tests. Also note that a push to a development branch does not trigger a deploy of the API.
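To make these trigger rules concrete, here is a minimal Python sketch of which jobs run for a given push. It is only a simplified model of the filters above, not CircleCI's implementation. The function `jobs_triggered` and its arguments are hypothetical, and the sketch assumes that filter patterns must match the whole branch or tag name and that jobs without a `tags` filter are skipped on tag pushes:

```python
import re


def jobs_triggered(branch=None, tag=None, tests_pass=True):
    """Simplified model of the `regression-model` workflow filters."""
    jobs = []

    if tag is not None:
        # Tag push: only jobs that declare a `tags` filter are considered.
        if re.fullmatch(r"^.*", tag):  # tags: only: /^.*/
            jobs.append("train_and_upload_regression_model")
        return jobs

    # Branch push: test_app has no filters, so it runs on every branch.
    jobs.append("test_app")

    # train_and_upload_regression_model ignores every branch (`ignore: /.*/`),
    # so it is never added here.

    # deploy_app_to_heroku requires test_app to pass and only runs on main.
    if branch == "main" and tests_pass:
        jobs.append("deploy_app_to_heroku")

    return jobs


print(jobs_triggered(branch="cicd"))  # ['test_app']
print(jobs_triggered(branch="main"))  # ['test_app', 'deploy_app_to_heroku']
print(jobs_triggered(tag="0.1.1"))    # ['train_and_upload_regression_model']
```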


## Triggering the workflows

To trigger the workflows, we will update the model package.
Recall that the model package version used by the API has to be specified in its requirements file. This makes the model version in use transparent. So we need to do the following in sequence:

1. Bump model package version.
2. Release tag in the GitHub repository.
3. Update app requirements.

The second step automatically triggers a push of an updated model package, containing a newly trained model, to our private index. The last step triggers a build of the API with the new regression model version, which is then deployed to Heroku.

**Remark.** Note that bumping the model package version in #1 above triggers the `test_app` and `deploy_app_to_heroku` jobs, which is a waste of resources since updating the API requirements in #3 above will trigger these same jobs again. This is unavoidable since we are using a monorepo for demonstration purposes. Separating the repositories would solve this issue.

### Creating release tags

Before triggering the workflows that make a new model release and then update the app with this release, we first introduce release tags. Releases are located in the right sidebar of the repository home page on GitHub. Note that a release tag can be applied to the latest commit of a specific branch (not necessarily the `main` branch):

```{figure} ../../img/tag-name.png
---
width: 30em
---
```

This triggers the train and upload job on the repository `cicd` branch at `9dd7c59`:

```{figure} ../../img/tag-workflow.png
---
width: 40em
---
```

### Updating model version

We now proceed to updating the model. Suppose we bump the model version, e.g. because the data has been updated. Then we have to make the new model available in the private index. Once it is available, we can update the model used by the prediction server and deploy it. Here we will bump the model version from `0.1.0` to `0.1.1`. We create a new release targeting the `main` branch.

```{figure} ../../img/bump-model.png
---
width: 40em
---
```

```{figure} ../../img/bump-release.png
---
width: 40em
---
```

Here we commit a change to the model `VERSION` file on the `cicd` branch. This is merged to the `main` branch, and a corresponding release is created with the updated version as its tag name. This triggers model retraining and upload.
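The bump itself is a one-line edit, but it is easy to script. Below is a minimal sketch, assuming the `VERSION` file holds a plain `MAJOR.MINOR.PATCH` string. The helper `bump_patch` is hypothetical and not part of the repository:

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.strip().split("."))
    return f"{major}.{minor}.{patch + 1}"


print(bump_patch("0.1.0"))  # 0.1.1
```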

```{figure} ../../img/gemfury.png
---
width: 40em
---
New package model uploaded to Gemfury.
```

Once the package finishes uploading, we update the model package version of the API in the `main` branch. This triggers a deploy job. After the deployment is done, we can look at the `/api/v1/health` endpoint to see that the model version has updated.

```{figure} ../../img/api-old.png
---
width: 40em
---
```
```{figure} ../../img/api-new.png
---
width: 42em
---
Model version has updated from `0.1.0` to `0.1.1`.
```
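The same check can be scripted instead of done in the browser. The sketch below is an illustration under assumptions: it assumes the health endpoint returns JSON and that the model version is stored under a `model_version` key. Both the key name and the helper functions are hypothetical, so adjust them to the actual response of the API:

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch the JSON payload of the health endpoint of a deployed API."""
    url = f"{base_url.rstrip('/')}/api/v1/health"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def model_is_updated(health: dict, expected_version: str, key: str = "model_version") -> bool:
    """Check whether the payload reports the expected model version.

    The default key name is an assumption about the response schema.
    """
    return health.get(key) == expected_version


# Example with a hand-written payload (no network call is made here).
sample = {"model_version": "0.1.1"}
print(model_is_updated(sample, "0.1.1"))  # True
```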

<br>

```{figure} ../../img/bump-model-workflows.png
---
width: 50em
---
The completed jobs for retraining the regression model.
```

## Appendix: Triggering deployment

Note that we had to wait for the model upload to finish before deploying our app. It would be nice if we could automate this. Otherwise, we might forget to deploy the new model in the service after making a release.