# Model Deployment

![Status](https://img.shields.io/static/v1.svg?label=Status&message=Ongoing&color=orange)

<!-- Place this tag where you want the button to render. -->
<a class="github-button" href="https://github.com/particle1331/steepest-ascent" data-color-scheme="no-preference: dark; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star particle1331/steepest-ascent on GitHub">Star</a>
<!-- Place this tag in your head or just before your close body tag. -->
<script async defer src="https://buttons.github.io/buttons.js"></script> 


In this module, we look into deploying the ride duration model that has been our working example in the previous modules. Deploying a model means making its predictions available to other applications. We will look at three modes of deployment: **online** deployment, **offline** or batch deployment, and **streaming**.

In online mode, our service must be up all the time. To do this, we implement a web service which takes in HTTP requests and sends out predictions. In offline or batch mode, the service runs regularly, but not necessarily all the time. It makes predictions for a batch of examples on a schedule managed by workflow orchestration. Finally, we look at how to implement a streaming service, i.e. a machine learning service that listens to a stream of events and reacts to them, using AWS Kinesis and AWS Lambda.

```{margin}
⚠️ **Attribution:** These are notes for [Module 4: Model Deployment](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/04-deployment) of the [MLOps Zoomcamp](https://github.com/DataTalksClub/mlops-zoomcamp). The MLOps Zoomcamp is a free course from [DataTalks.Club](https://github.com/DataTalksClub).
```


## Deploying models with Flask and Docker

In this section, we develop a web server using Flask for serving model predictions. The model is obtained from an S3 artifacts store and predicts on data sent to the service by the backend. We will containerize this application using Docker. The resulting container can be deployed anywhere Docker is supported, such as Kubernetes and Elastic Beanstalk.

### Model package

Here we will package code for model prediction that will be used by the Flask application. This can also be used for offline model training or batch scoring. The directory structure of our project would look like:

```
deployment/
├── app/
│   └── main.py
├── ride_duration/
│   ├── __init__.py
│   ├── predict.py
│   ├── utils.py
│   └── VERSION
├── .env
├── Dockerfile
├── Pipfile
├── MANIFEST.in
├── Pipfile.lock
├── setup.py
├── test.py
├── train.py
└── pyproject.toml
```

First we create [`setup.py`](https://github.com/particle1331/inefficient-networks/blob/92b7232fbbba4e17a10ce6a55a37725d32371f0b/docs/notebooks/mlops/04-deployment/setup.py) and [`pyproject.toml`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/pyproject.toml) for packaging. Refer to the links to see the complete code. For `setup.py` you only have to change the package metadata (or just leave it blank) and set `install_requires` to `[]`. This list will later be filled in by a tool that integrates with Pipenv, which we use for package management.

```{margin}
[`setup.py`](https://github.com/particle1331/inefficient-networks/blob/92b7232fbbba4e17a10ce6a55a37725d32371f0b/docs/notebooks/mlops/04-deployment/setup.py)
```
```python
from pathlib import Path
from setuptools import find_packages, setup


# Package meta-data.
NAME = "ride-duration-prediction"
DESCRIPTION = ""
URL = ""
EMAIL = ""
AUTHOR = ""
REQUIRES_PYTHON = ">=3.9.0"


# The rest you shouldn't have to touch too much. Except for install_requires=[]. 
# Perhaps also the License and Trove Classifiers if publishing to PyPI (public).
...

setup(
    ...
    install_requires=[],             
    ...
    license="MIT",
    classifiers=[
        # Trove classifiers
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: Implementation :: CPython",
        "Programming Language :: Python :: Implementation :: PyPy",
    ],
)
```

Additionally, we can include a [`MANIFEST.in`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/MANIFEST.in) file to specify which files are included in the source distribution of the package. The full list can be viewed in the `SOURCES.txt` file of the generated `egg-info` folder after building the package.

```{margin}
[`MANIFEST.in`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/MANIFEST.in)
```
```
include ride_duration/*.py
include ride_duration/VERSION

recursive-exclude * __pycache__
recursive-exclude * *.py[co]
```

### Pipenv

To manage our project's package dependencies, we will use [Pipenv](https://pipenv.pypa.io/en/latest/). Pipenv generates a `Pipfile`, which supersedes the usual requirements file, and a `Pipfile.lock` containing hashes of the downloaded packages, which ensures reproducible builds.

```bash
pipenv install scikit-learn==1.0.2 flask pandas mlflow boto3 --python=3.9
pipenv install --dev requests
pipenv install --dev pipenv-setup
```

Next we install the model package locally. Here we use `pipenv-setup sync` to update `install_requires` in the `setup` script according to the packages installed using Pipenv. This makes sure there are no dependency conflicts when using the package.

```bash
pipenv-setup sync
pipenv install -e .
```

Our `Pipfile` should now look like the following. Note that `ride-duration-prediction` is installed in editable mode which is okay since the underlying code is still in development.

```{margin}
[`Pipfile`](https://github.com/particle1331/inefficient-networks/blob/32bb7aecb5b6c4999becba323d6695eb97f3cd3a/docs/notebooks/mlops/04-deployment/Pipfile)
```
```ini
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
scikit-learn = "==1.0.2"
flask = "*"
pandas = "*"
mlflow = "*"
boto3 = "*"
ride-duration-prediction = {editable = true, path = "."}

[dev-packages]
requests = "*"
pipenv-setup = "*"

[requires]
python_version = "3.9"
```

### Environmental variables

AWS credentials and other environmental variables that we will use later are saved in a `.env` file in the same directory as the `Pipfile`. These are automatically detected and loaded by Pipenv when calling `pipenv shell`. However, the shell must be restarted whenever the `.env` file is modified.

```bash
# .env
EXPERIMENT_ID=1
MODEL_RUN_ID=f4e2242a53a3410d89c061d1958ae70a
AWS_ACCESS_KEY_ID=A*************LI
AWS_SECRET_ACCESS_KEY=N*********************+9
```
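A missing variable otherwise only surfaces later, when the model download fails, so it can help to check the environment up front. The following is a minimal sketch and not part of the `ride_duration` package: `require_env` is a hypothetical helper, and the variable names are the ones from the `.env` file above.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


if __name__ == "__main__":
    # A plain dict stands in for os.environ so the check is easy to try out.
    fake_env = {"EXPERIMENT_ID": "1", "MODEL_RUN_ID": "abc123"}
    print(require_env(["EXPERIMENT_ID", "MODEL_RUN_ID"], environ=fake_env))
```

Calling `require_env(["EXPERIMENT_ID", "MODEL_RUN_ID"])` without the `environ` argument checks the real process environment.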

### Model package scripts

In the following script of the model package, we define helper functions for model training and inference. This includes the usual `load_training_dataframe` function, which creates the target column (ride duration in minutes) and filters it to a range (`[1, 60]` by default), and `prepare_features`, which creates the `PU_DO` feature by combining the IDs of the pick-up and drop-off locations.

```{margin}
[`utils.py`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/ride_duration/utils.py)
```
```python
import pandas as pd


def load_training_dataframe(file_path, y_min=1, y_max=60):
    """Load data from disk and preprocess for training."""
    
    # Load data from disk
    data = pd.read_parquet(file_path)

    # Create target column and filter outliers
    data['duration'] = data.lpep_dropoff_datetime - data.lpep_pickup_datetime
    data['duration'] = data.duration.dt.total_seconds() / 60
    data = data[(data.duration >= y_min) & (data.duration <= y_max)]

    return data


def prepare_features(input_data):
    """Prepare features for dict vectorizer."""

    X = pd.DataFrame(input_data)
    X['PU_DO'] = X['PULocationID'].astype(str) + '_' + X['DOLocationID'].astype(str)
    X = X[['PU_DO', 'trip_distance']].to_dict(orient='records')
    
    return X
```
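To make the target construction concrete, here is a dependency-free sketch of the same arithmetic on a single ride: the duration is the drop-off time minus the pick-up time, converted to minutes, and rides outside `[y_min, y_max]` are dropped. The function names are hypothetical and not part of `utils.py`.

```python
from datetime import datetime


def duration_minutes(pickup, dropoff):
    """Ride duration in minutes, mirroring `total_seconds() / 60`."""
    return (dropoff - pickup).total_seconds() / 60


def keep_ride(duration, y_min=1, y_max=60):
    """Same inclusive range filter as in `load_training_dataframe`."""
    return y_min <= duration <= y_max


if __name__ == "__main__":
    pickup = datetime(2021, 1, 1, 0, 15, 0)
    dropoff = datetime(2021, 1, 1, 0, 33, 30)
    d = duration_minutes(pickup, dropoff)
    print(d, keep_ride(d))  # 18.5 True
```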

Note that this package expects models that are pipelines such as (see **Appendix** below):

```python
pipeline = make_pipeline(
    DictVectorizer(), 
    RandomForestRegressor(**params, n_jobs=-1)
)
```

This avoids having to load the preprocessor separately from the artifacts store. Thus, our models expect the output of `prepare_features(input_data)`, where `input_data` can be a `DataFrame` with one ride per row or a list of ride feature dictionaries (e.g. obtained from a JSON payload).
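For a list of ride dictionaries, the transformation amounts to the following. This is a pure-Python illustration of the records that `prepare_features` produces, not the package code (which goes through `pandas`); `to_feature_dicts` is a hypothetical name.

```python
def to_feature_dicts(rides):
    """Build the records that the DictVectorizer in the pipeline consumes."""
    features = []
    for ride in rides:
        features.append({
            # Pick-up and drop-off IDs are combined into one categorical feature.
            "PU_DO": f"{ride['PULocationID']}_{ride['DOLocationID']}",
            "trip_distance": ride["trip_distance"],
        })
    return features


if __name__ == "__main__":
    ride = {"PULocationID": 130, "DOLocationID": 205, "trip_distance": 3.66}
    print(to_feature_dicts([ride]))
    # [{'PU_DO': '130_205', 'trip_distance': 3.66}]
```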

**Predict.** The `load_model()` function in `predict.py` is of interest. The model is loaded directly from the S3 artifacts store. This ensures that we always get a model, assuming the AWS credentials and the other environmental variables defined above are properly configured.

```{margin}
[`predict.py`](https://github.com/particle1331/inefficient-networks/blob/6015767edcbf1fa019b57ca3138ebb900c71a6a9/docs/notebooks/mlops/04-deployment/ride_duration/predict.py)
```
```python
...
from ride_duration.utils import package_dir, prepare_features


def load_model(experiment_id, run_id):
    """Get model from our S3 artifacts store."""

    source = f"s3://mlflow-models-ron/{experiment_id}/{run_id}/artifacts/model"
    model = mlflow.pyfunc.load_model(source)

    return model


def make_prediction(model, input_data: Union[list[dict], pd.DataFrame]):
    """Make prediction from features dict or DataFrame."""
    
    X = prepare_features(input_data)
    preds = model.predict(X)

    return preds
```

Note that we can also load the latest **production version** directly from the model registry as follows (no need to specify a run and experiment ID). One issue with this approach is that the Flask server can fail to start whenever the tracking server is down.

```python
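import mlflow
from mlflow.tracking import MlflowClient

# TRACKING_SERVER_HOST is assumed to be defined (host name of the MLflow tracking server).
TRACKING_URI = f"http://{TRACKING_SERVER_HOST}:5000"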

# Fetch production model from client
mlflow.set_tracking_uri(TRACKING_URI)
client = MlflowClient(tracking_uri=TRACKING_URI)
prod_model = client.get_latest_versions(name='NYCRideDurationModel', stages=['Production'])[0]

run_id = prod_model.run_id
source = prod_model.source
model = mlflow.pyfunc.load_model(source)
```
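One way to soften this failure mode is to fall back to a pinned run when the registry lookup fails. The sketch below is an assumption about how this could be wired up, not something the course code does: `load_with_fallback` is a hypothetical helper, and the two loader arguments stand in for the registry lookup above and for `load_model(experiment_id, run_id)`.

```python
def load_with_fallback(primary_loader, fallback_loader):
    """Try the primary loader first; use the fallback loader if it raises."""
    try:
        return primary_loader(), "primary"
    except Exception as exc:  # e.g. a connection error from the tracking server
        print(f"Primary loader failed ({exc!r}); using fallback.")
        return fallback_loader(), "fallback"


if __name__ == "__main__":
    def from_registry():
        # Stand-in for the registry lookup; simulates a tracking server outage.
        raise ConnectionError("tracking server unreachable")

    def from_s3():
        # Stand-in for load_model(experiment_id, run_id).
        return "model-from-s3"

    model, origin = load_with_fallback(from_registry, from_s3)
    print(model, origin)
```

The second return value records which path was taken, which is useful to log alongside the model version.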

Testing out the `load_model()` function:

```bash
❯ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:01:00)
[Clang 13.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from ride_duration.predict import load_model
>>> model = load_model(os.getenv("EXPERIMENT_ID"), os.getenv("MODEL_RUN_ID"))
>>> model
mlflow.pyfunc.loaded_model:
  artifact_path: model
  flavor: mlflow.sklearn
  run_id: f4e2242a53a3410d89c061d1958ae70a
```

<br>

```{figure} ../../../img/s3-artifacts-ss.png
---
---
Artifacts store for model runs of experiment 1.
```

### Serving predictions using Flask

For our Flask application, we simply define an endpoint that serves the model predictions. The model loads when the server starts (i.e. we don't load it every time we make a prediction). There is a single `predict_endpoint` which expects a JSON payload containing a single ride. If the payload contains more than one ride, only the first entry is predicted. For versioning, the run ID of the model is returned along with its prediction.

```{margin}
[`app/main.py`](https://github.com/particle1331/inefficient-networks/blob/57234281467bfcd332e27252b45aa460395a6227/docs/notebooks/mlops/04-deployment/app/main.py)
```
```python
...
from ride_duration.predict import load_model, make_prediction
from flask import Flask, request, jsonify


# Load model with run ID and experiment ID defined in the env.
RUN_ID = os.getenv("MODEL_RUN_ID")
EXPERIMENT_ID = os.getenv("EXPERIMENT_ID")
model = load_model(run_id=RUN_ID, experiment_id=EXPERIMENT_ID)

app = Flask('duration-prediction')


@app.route('/predict', methods=['POST'])
def predict_endpoint():
    """Predict duration of a single ride using NYCRideDurationModel."""
    
    ride = request.get_json()
    preds = make_prediction(model, ride)

    return jsonify({
        'duration': float(preds[0]),
        'model_version': RUN_ID,
    })


if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=9696)
```
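Since the endpoint only scores the first ride, a small guard in front of `make_prediction` can make the expected payload explicit. The following is a sketch under that assumption; `normalize_payload` and `REQUIRED_FIELDS` are hypothetical names and not part of `app/main.py`.

```python
REQUIRED_FIELDS = {"PULocationID", "DOLocationID", "trip_distance"}


def normalize_payload(payload):
    """Return a one-ride list from a ride dict or a non-empty list of ride dicts."""
    if isinstance(payload, dict):
        payload = [payload]
    if not isinstance(payload, list) or not payload:
        raise ValueError("Expected a ride dict or a non-empty list of ride dicts.")

    ride = payload[0]  # Only the first entry is scored, as in the endpoint above.
    if not isinstance(ride, dict):
        raise ValueError("Each ride must be a JSON object.")

    missing = REQUIRED_FIELDS - set(ride)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return [ride]


if __name__ == "__main__":
    ride = {"PULocationID": 130, "DOLocationID": 205, "trip_distance": 3.66}
    print(normalize_payload(ride) == normalize_payload([ride, ride]))  # True
```

In a Flask view, a `ValueError` raised here could be turned into a `400` response instead of letting the request fail inside the model code.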

We also define a script for testing the endpoint. Note that this same script can be used without modification to test remote hosts using port forwarding.

```{margin}
[`test.py`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/test.py)
```
```python
import json
import requests


ride = [{
    'PULocationID': 130,
    'DOLocationID': 205,
    'trip_distance': 3.66,
}]


if __name__ == "__main__":
    
    host = 'http://0.0.0.0:9696'
    url = f'{host}/predict'
    response = requests.post(url, json=ride)
    result = response.json()
    
    print(result)
```

### Dockerfile

For our `Dockerfile`, we start by installing Pipenv. Next we copy `Pipfile` and `Pipfile.lock` as well as the files for installing the model package. We also copy the files for the web service. Finally, we install everything using Pipenv, expose port `9696`, and configure the entrypoint (i.e. serve the main app on `0.0.0.0:9696` using Gunicorn).

```{margin}
[`Dockerfile`](https://github.com/particle1331/inefficient-networks/blob/69fc955d08a8bbb8d214cc753af35250c34f4a27/docs/notebooks/mlops/04-deployment/Dockerfile)
```
```Dockerfile
FROM python:3.9.13-slim

RUN pip install -U pip
RUN pip install pipenv

WORKDIR /app

COPY [ "Pipfile", "Pipfile.lock",  "./"]
COPY [ "setup.py", "pyproject.toml", "MANIFEST.in",  "./"]
COPY [ "ride_duration",  "./ride_duration"]
COPY [ "app",  "./app"]

RUN pipenv install --system --deploy

EXPOSE 9696

# https://stackoverflow.com/a/71092624/1091950
ENTRYPOINT [ "gunicorn", "--bind=0.0.0.0:9696", "--timeout=600", "app.main:app" ]
```

Building the image:

```bash
docker build -t ride-duration-prediction-service:v1 .
```
```bash
[+] Building 72.3s (13/13) FINISHED
 => [internal] load build definition from Dockerfile               0.1s
 => => transferring dockerfile: 388B                               0.0s
 => [internal] load .dockerignore                                  0.1s
 => => transferring context: 2B                                    0.0s
 => [internal] load metadata for docker.io/library/python:3.9.13-  3.4s
 => [internal] load build context                                  0.1s
 => => transferring context: 80.02kB                               0.1s
 => [1/8] FROM docker.io/library/python:3.9.13-slim@sha256:451ccc  0.0s
 => CACHED [2/8] RUN pip install -U pip                            0.0s
 => CACHED [3/8] RUN pip install pipenv                            0.0s
 => CACHED [4/8] WORKDIR /app                                      0.0s
 => [5/8] COPY [ Pipfile, Pipfile.lock, setup.py, pyproject.toml,  0.1s
 => [6/8] COPY [ ride_duration,  ./ride_duration]                  0.1s
 => [7/8] COPY [ app,  ./app]                                      0.0s
 => [8/8] RUN pipenv install --system --deploy                    64.8s
 => exporting to image                                             3.6s
 => => exporting layers                                            3.6s
 => => writing image sha256:1b2da34ca1b3504d45df527049f788a485713  0.0s
 => => naming to docker.io/library/ride-duration-prediction-servi  0.0s
```

Running the container (we load the environmental variables into the container with `--env-file .env`):

```bash
docker run --env-file .env -it --rm -p 9696:9696 ride-duration-prediction-service:v1
```
```
[2022-06-20 11:12:08 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2022-06-20 11:12:08 +0000] [1] [INFO] Listening at: http://0.0.0.0:9696 (1)
[2022-06-20 11:12:08 +0000] [1] [INFO] Using worker: sync
[2022-06-20 11:12:08 +0000] [9] [INFO] Booting worker with pid: 9
Downloading model f4e2242a53a3410d89c061d1958ae70a from S3...
2022/06/20 11:22:28 WARNING mlflow.pyfunc: Detected one or more mismatches between the model's dependencies and the current Python environment:
 - psutil (current: uninstalled, required: psutil==5.9.1)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.
```

The warning shows a mismatch between the training environment and the current one (`psutil` is not installed in the container). We run the test script in another terminal to check that the service still works:

```bash
❯ python test.py
{'duration': 18.210770674183355, 'model_version': 'f4e2242a53a3410d89c061d1958ae70a'}
```

Note that after the initial loading time, subsequent predictions are returned almost instantaneously. This relies on the model being loaded only once, when the server starts, rather than on every request.

## Streaming: Deploying models with Kinesis and Lambda

A streaming service consists of **producers** and **consumers**. Producers push events to the event stream, and consuming services react to the events in this stream. Recall that a web service exhibits a 1-1 relationship, so that there is an explicit connection between user and service. On the other hand, the relationship between producing and consuming services can be 1-many or many-many. The connection is only implicit, since the producer does not know which consumers will react or how many. This setup can be scaled to many services or models.

For example, when a user uses our ride hailing app, the backend can send an event to the stream containing all information about the ride. Services then react to this event, e.g. one consuming service predicts the tip and sends a push notification to the user asking for it. Another consuming service, which makes a better ride duration prediction but takes more time to do so, can update the prediction that was initially given to the user by the online web service.
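The one-to-many relationship can be illustrated with a toy in-memory stream. This is only a sketch of the producer and consumer pattern, not of Kinesis or Lambda: the `EventStream` class, the two consumer functions, the `fare` field, and the rules they apply are all made up for illustration.

```python
class EventStream:
    """Toy event stream: producers put events, every subscriber sees each one."""

    def __init__(self):
        self.consumers = []

    def subscribe(self, consumer):
        self.consumers.append(consumer)

    def put(self, event):
        # The producer does not know which consumers react, or how many.
        return [consumer(event) for consumer in self.consumers]


def tip_consumer(event):
    # Hypothetical rule: suggest a tip of 15% of the fare.
    return {"ride_id": event["ride_id"], "suggested_tip": round(0.15 * event["fare"], 2)}


def duration_consumer(event):
    # Hypothetical stand-in for a slower, more accurate duration model.
    return {"ride_id": event["ride_id"], "duration": 5.0 * event["trip_distance"]}


if __name__ == "__main__":
    stream = EventStream()
    stream.subscribe(tip_consumer)
    stream.subscribe(duration_consumer)
    for reaction in stream.put({"ride_id": "r1", "fare": 20.0, "trip_distance": 3.0}):
        print(reaction)
```

Adding another model means subscribing one more consumer. The producer code does not change.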

## Deploying batch predictions

For use cases that do not require the responsiveness of a web service, we can implement an offline service that makes batch predictions. Typically, offline services run at fixed intervals, e.g. daily, weekly, or monthly. A critical element of this is **workflow orchestration**: we regularly pull data from a database, make predictions on that data, and then write the predictions to a database, to a file that is uploaded to S3, or to an analytics dashboard, thereby refreshing it.
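The read, predict, write loop at the core of such a job can be sketched as follows. This is an illustration under several assumptions: `score_batch`, the CSV layout, and the constant-speed stand-in model are hypothetical, and scheduling is left to whatever orchestration tool triggers the script.

```python
import csv
import io


def score_batch(reader, writer, predict):
    """Read rides, predict a duration for each, and write the results as CSV."""
    rows = list(csv.DictReader(reader))
    out = csv.DictWriter(writer, fieldnames=["ride_id", "predicted_duration"])
    out.writeheader()
    for row in rows:
        out.writerow({
            "ride_id": row["ride_id"],
            "predicted_duration": predict(float(row["trip_distance"])),
        })
    return len(rows)


if __name__ == "__main__":
    # In-memory buffers stand in for the input table and the output file.
    source = io.StringIO("ride_id,trip_distance\na,1.0\nb,2.5\n")
    sink = io.StringIO()
    n = score_batch(source, sink, predict=lambda distance: 5.0 * distance)
    print(f"Scored {n} rides")
    print(sink.getvalue())
```

Because the function only depends on file-like objects and a `predict` callable, the same code can be pointed at real files and a loaded model.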

```
=== TODO (waiting for new video with orchestration :) ===
```

<!-- ### Scoring script

```{margin}
[`score.py`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/score.py)
```
```python
from ride_duration.utils import load_training_dataframe
from ride_duration.predict import load_model, make_prediction


def generate_uuids(n):
    ride_ids = []
    for i in range(n):
        ride_ids.append(str(uuid.uuid4()))
    return ride_ids


def apply_model(
    input_file: str, 
    run_id: str, 
    output_file: str
) -> None:
    
    print(f'Reading the data from {input_file}...')
    df = load_training_dataframe(input_file)
    df['ride_id'] = generate_uuids(len(df))

    print(f'Loading the model with RUN_ID={run_id}...')
    model = load_model()

    print(f'Applying the model...')
    preds = make_prediction(model, df)

    print(f'Saving the result to {output_file}...')
    df_result = pd.DataFrame()
    df_result['ride_id'] = df['ride_id']
    df_result['lpep_pickup_datetime'] = df['lpep_pickup_datetime']
    df_result['PULocationID'] = df['PULocationID']
    df_result['DOLocationID'] = df['DOLocationID']
    df_result['actual_duration'] = df['duration']
    df_result['predicted_duration'] = preds
    df_result['diff'] = df_result['actual_duration'] - df_result['predicted_duration']
    df_result['model_version'] = run_id
    df_result.to_parquet(output_file, index=False)


def run(taxi_type: str, year: int, month: int, run_id: str) -> None:

    source_url = 'https://s3.amazonaws.com/nyc-tlc/trip+data'
    input_file = f'{source_url}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
    output_file = f'output/{taxi_type}/{year:04d}-{month:02d}.parquet'

    apply_model(
        input_file=input_file,
        run_id=run_id,
        output_file=output_file
    )


if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--taxi_type", default='green', type=str)
    parser.add_argument("--year", default=2021, type=int)
    parser.add_argument("--month", default=1, type=int)
    parser.add_argument("--run_id", type=str)
    parser.add_argument("--experiment_id", type=int)
    args = parser.parse_args()
    
    run(
        taxi_type=args.taxi_type,
        year=args.year,
        month=args.month,
        run_id=args.run_id
    )
``` -->

<!-- ```bash
pipenv install --dev python-dotenv
python score.py
``` -->

## Appendix: Train script

For training models that we use to serve predictions in our API, we use the following script. This trains a model using the `ride_duration` package (which ensures smooth integration with the Flask API) and logs the trained model to a remote MLflow tracking server. The tracking server host is provided as a command line argument.

```{margin}
[`train.py`](https://github.com/particle1331/inefficient-networks/blob/217134c84bb323452bf0dc3e8b6a6a04fea8f06b/docs/notebooks/mlops/04-deployment/train.py)
```
```python
import mlflow

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

from ride_duration.utils import load_training_dataframe, prepare_features


def setup(tracking_server_host):
    TRACKING_URI = f"http://{tracking_server_host}:5000"
    mlflow.set_tracking_uri(TRACKING_URI)
    mlflow.set_experiment("nyc-taxi-experiment")


def run_training(X_train, y_train, X_valid, y_valid):
    with mlflow.start_run():
        params = {
            'n_estimators': 100,
            'max_depth': 20
        }
        
        pipeline = make_pipeline(
            DictVectorizer(), 
            RandomForestRegressor(**params, n_jobs=-1)
        )
        
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_valid)
        rmse = mean_squared_error(y_valid, y_pred, squared=False)
        
        mlflow.log_params(params)
        mlflow.log_metric("rmse_valid", rmse)
        mlflow.sklearn.log_model(pipeline, artifact_path='model')


if __name__ == "__main__":

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--tracking-server-host", type=str)
    parser.add_argument("--train_path", type=str)
    parser.add_argument("--valid_path", type=str)
    args = parser.parse_args()

    # Getting data from disk
    train_data = load_training_dataframe(args.train_path)
    valid_data = load_training_dataframe(args.valid_path)

    # Preprocessing dataset
    X_train = prepare_features(train_data.drop(['duration'], axis=1))
    X_valid = prepare_features(valid_data.drop(['duration'], axis=1))
    y_train = train_data.duration.values
    y_valid = valid_data.duration.values

    # Push training to server
    setup(args.tracking_server_host)
    run_training(X_train, y_train, X_valid, y_valid)
```