# "MLOps project - part 3a: Machine Learning Model Deployment" 
> "Deploying machine learning models in production."

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [mlops]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

So far in this series of blog posts, we have seen how to track experiments and how to build a machine learning pipeline for model training. But what should we do with the trained model now? That's right: we have to deploy it into production, so people can use it to make inferences.

In this blog post, we will see what machine learning model deployment is and which options are available for it. In the next post, we will deploy our own trained model for customer sentiment analysis in production.

Let's get started.


# Machine Learning Model Deployment

There are multiple options for model deployment. First, we need to ask whether we need the predictions immediately or whether they can wait for an hour, a day, etc.

- If we can wait a bit, we can go for **batch or offline** deployment. In this case, the model doesn't need to be running all the time, and we can call it with a batch of data at a regular time interval.
- Otherwise, we need the model predictions as soon as possible, and the model should be running all the time. This is called **online** deployment. Online deployment has multiple variants as well:
    - *Web service*: We deploy our model as a web service, send it HTTP(S) requests, and get the predictions back in the responses.
    - *Streaming*: There is a stream of events, and the model listens to the events and reacts to them.
    
![](images/model-deployment/1.jpg)


## Batch or Offline Mode

In this case, we need to apply the model to a batch of data at a regular interval. It can be every 10 minutes, every half an hour, every day, every week, etc.

Usually, we have a database that holds the data and a job that contains the model. The job pulls the data from the database and applies the model to it. It can then save the predictions into another database, so other jobs can use them for other purposes, such as a report or a dashboard.


![](images/model-deployment/2.jpg)
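
To make the batch flow more concrete, here is a minimal sketch of such a job. It is only an illustration under a few assumptions: `predict_sentiment` is a hypothetical keyword rule standing in for a trained model, the `reviews` and `predictions` tables are made up, and a single in-memory SQLite database plays the role of both the source and the result database.

```python
import sqlite3


def predict_sentiment(text: str) -> str:
    """Hypothetical stand-in for a trained model: a naive keyword rule."""
    return "positive" if "good" in text.lower() else "negative"


def run_batch_job(conn: sqlite3.Connection) -> int:
    """Score every review that has no prediction yet and store the result."""
    rows = conn.execute(
        "SELECT id, text FROM reviews "
        "WHERE id NOT IN (SELECT review_id FROM predictions)"
    ).fetchall()
    conn.executemany(
        "INSERT INTO predictions (review_id, label) VALUES (?, ?)",
        [(review_id, predict_sentiment(text)) for review_id, text in rows],
    )
    conn.commit()
    return len(rows)


# In-memory database so the sketch runs without any setup (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE predictions (review_id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO reviews (text) VALUES (?)",
    [("Good product",), ("Arrived broken",)],
)

# A scheduler (cron, an orchestrator, etc.) would call this at each interval.
scored = run_batch_job(conn)
```

Because the job only picks up rows without a prediction, running it again on the next interval scores just the data that arrived in the meantime.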



## Online Deployment 

### Web Service

The other common way of deploying models is as a web service, which falls into the category of online deployment. In this case, you have a web service that hosts the machine learning model. This service needs to be up and running all the time. It is also possible to use serverless services, such as a service deployed on Cloud Run. There will be a very small delay that needs to be taken into consideration; if it is not acceptable, you need to go for a more real-time architecture. This case is more like a one-to-one relationship between client and server.

![](images/model-deployment/3.jpg)
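
The request/response pattern can be sketched with nothing but the Python standard library. This is a toy example, not a production setup: `predict_sentiment` is a hypothetical stand-in for the trained model, the `/predict` path and the JSON fields are assumptions, and the server speaks plain HTTP on localhost instead of HTTPS.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_sentiment(text: str) -> str:
    """Hypothetical stand-in for a trained model: a naive keyword rule."""
    return "positive" if "good" in text.lower() else "negative"


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body and run the model on it.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": predict_sentiment(payload["text"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def request_prediction(port: int, text: str) -> dict:
    """Client side: one request in, one prediction out."""
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    client.request(
        "POST",
        "/predict",
        body=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    result = json.loads(client.getresponse().read())
    client.close()
    return result


# Port 0 lets the OS pick a free port; a real service would use a fixed one.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = request_prediction(server.server_address[1], "Good product")

server.shutdown()
server.server_close()
```

The client waits for the answer to its own request, which is the one-to-one relationship described above.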


### Streaming

In this case, we have producers and consumers. Producers push events into an event stream, and multiple services, or consumers, read from the stream and react to the events. This is more like a one-to-many or many-to-many relationship between producer(s) and consumers. For example, the producer can be a user with an app that interacts with the backend and produces some events. These events then go to an event stream, and multiple services can do different jobs on them. The difference from the web service option is that there is no explicit connection between the producer and the consumers here. The producer just pushes an event, and some services will process it. The results of these consumers may go to another event stream to be used by other consumers and services. There is no limit to how far this chain can go.


![](images/model-deployment/4.jpg)
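
The fan-out idea can be illustrated with a toy in-memory stream. This is only a sketch: a real system would use an event streaming platform instead of the hypothetical `EventStream` class below, and `predict_sentiment` and `text_length` are made-up consumers.

```python
import queue
import threading


class EventStream:
    """Toy in-memory stream: every subscriber receives every event."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self) -> queue.Queue:
        inbox = queue.Queue()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, event) -> None:
        # The producer does not know who consumes the event.
        for inbox in self._subscribers:
            inbox.put(event)


STOP = object()  # sentinel telling consumers that the stream has ended


def consume(inbox: queue.Queue, handler, results: list) -> None:
    """Read events until STOP arrives and react to each one."""
    while True:
        event = inbox.get()
        if event is STOP:
            break
        results.append(handler(event))


def predict_sentiment(event: dict) -> str:
    """Hypothetical model consumer: a naive keyword rule."""
    return "positive" if "good" in event["text"].lower() else "negative"


def text_length(event: dict) -> int:
    """A second, unrelated consumer working on the same events."""
    return len(event["text"])


stream = EventStream()
sentiments, lengths = [], []
workers = [
    threading.Thread(target=consume, args=(stream.subscribe(), predict_sentiment, sentiments)),
    threading.Thread(target=consume, args=(stream.subscribe(), text_length, lengths)),
]
for worker in workers:
    worker.start()

# The producer just pushes events and moves on.
for text in ["Good product", "Arrived broken"]:
    stream.publish({"text": text})
stream.publish(STOP)

for worker in workers:
    worker.join()
```

Adding another consumer only takes another `subscribe()` call; the producer code does not change.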



Check the following video to learn more about different deployment options:

> youtube: https://youtu.be/JMGe4yIoBRA


# Model Deployment Tools

There are many ways to deploy your model into production. [This blog post](https://getindata.com/blog/machine-learning-model-serving-tools-comaprison-kserve-seldon-core-bentoml/) does a nice comparison of three popular tools: [Seldon Core](https://www.seldon.io/solutions/open-source-projects/core), [KServe](https://kserve.github.io/website/0.9/), and [BentoML](https://www.bentoml.com/). 

The comparison is focused on 9 main areas of model serving tools:

- ability to serve models from standard frameworks, including Scikit-Learn, PyTorch, Tensorflow and XGBoost
- ability to serve custom models / models from niche frameworks
- ability to pre-process/post-process data
- impact on the development workflow and existing codebase
- availability of the documentation
- DevOps operability
- autoscaling capabilities
- available interfaces for obtaining predictions
- infrastructure management

> **KServe**: KServe (named KFServing before version 0.7) is an open-source, Kubernetes-based tool providing custom abstraction (Kubernetes Custom Resource Definition) to define Machine Learning model serving capabilities. Its main focus is to hide the underlying complexity of such deployments so that its users only need to focus on the ML-related parts. It supports many advanced features such as autoscaling, scaling-to-zero, canary deployments, automatic request batching as well as many popular ML frameworks out-of-the-box.

> **Seldon Core**: Seldon Core is an open source tool developed by Seldon Technologies Ltd, as a building block of the larger (paid) Seldon Deploy solution. It’s similar to KServe in terms of the approach - it provides high level Kubernetes CRD and supports canary deployments, A/B testing as well as Multi-Armed-Bandit deployments. 

> **BentoML**: BentoML is a Python framework for wrapping the machine learning models into deployable services. It provides a simple object-oriented interface for packaging ML models and creating HTTP(s) services for them. BentoML offers in-depth integration with popular ML frameworks, so that all of the complexity related to packaging the models and their dependencies is hidden. BentoML-packaged models can be deployed in many runtimes, which include plain Kubernetes Clusters, Seldon Core, KServe, Knative as well as cloud-managed, serverless solutions like AWS Lambda, Azure Functions or Google Cloud Run.

It's really informative, and I highly recommend checking it out.

In this blog post, I show how to deploy our own example, customer sentiment analysis, as a web service. I will do it in three ways: Google Cloud Run, Vertex AI, and KServe with ZenML.



## Cloud Run



## Vertex AI

## KServe with ZenML

> [KServe](https://github.com/kserve/kserve) provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX. It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability. KServe is being used across various organizations.


![](images/model-deployment/kserve.png)
*[source](https://github.com/kserve/kserve)*
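
Once a model is served with KServe, clients can call it over HTTP. The sketch below builds a request for KServe's V1 inference protocol, which expects a JSON body with an `instances` list at `/v1/models/<model_name>:predict`. The host `kserve.example.com` and the model name `sentiment` are placeholders, and the shape of each instance depends on the deployed model, so treat this as an assumption-laden illustration.

```python
import json
import urllib.request


def build_predict_request(host: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build (but do not send) a V1-protocol prediction request."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and model name; replace them with your own deployment.
request = build_predict_request("kserve.example.com", "sentiment", ["Good product"])

# Sending the request needs a running InferenceService:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)["predictions"]
```

Depending on how the cluster ingress is set up, the request may also need a `Host` header that matches the service's hostname.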