# 09. PyTorch Model Deployment

Let's bring FoodVision to life and make it publicly accessible.

**We are going to deploy our FoodVision model to the internet as a usable app!**


## What is machine learning model deployment?

***Machine learning model deployment** is the process of making your machine learning model accessible for others.*

For example, someone takes a photo of food on their smartphone and our FoodVision model classifies it as pizza, steak, or sushi.

As another example, *an operating system may lower its resource consumption based on a machine learning model predicting how much power someone generally uses at specific times of day.*

Deployed models can also interact with other programs and models. For example, a Tesla car's computer vision system interacts with the car's route planning program, and the route planning program in turn gets inputs and feedback from the driver.


## Why deploy a machine learning model?

One of the most important philosophical questions in machine learning is:
***if a machine learning model never leaves a notebook, does it exist?***

---
Deploying a model is as important as training one.

Although you can get a pretty good idea of how your model is going to function by evaluating it on a well-crafted test set or visualizing its results, you never really know how it will perform until you release it into the wild.

***Having people who've never used your model interact with it will often reveal edge cases you never thought of during training.***

For example, what happens if someone were to upload a photo that wasn't of food to our FoodVision model?

One solution would be to create another model that classifies images as "food" or "not food" and pass the target image through that model first.

Then if the image is of "food", it goes to our FoodVision model and gets classified into pizza, steak, or sushi.

And if it's "not food", a message is displayed.

But what if these predictions were wrong?

What happens then?

You can see how these questions could keep going.

This highlights the importance of model deployment: it helps you figure out errors in your model that aren't obvious during training/testing.
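
The "food" / "not food" gate described above can be sketched in a few lines. This is only an illustration of the control flow, not the FoodVision implementation: `two_stage_predict`, `dummy_is_food` and `dummy_classify_food` are hypothetical names, and the two dummy functions stand in for trained models.

```python
from typing import Callable, Dict

NOT_FOOD_MESSAGE = "No food detected. Please upload a photo of pizza, steak or sushi."


def two_stage_predict(
    image: dict,
    is_food: Callable[[dict], bool],
    classify_food: Callable[[dict], Dict[str, float]],
) -> str:
    """Pass the image through the gate model first, then through the food classifier."""
    if not is_food(image):
        return NOT_FOOD_MESSAGE
    probs = classify_food(image)  # maps class name -> prediction probability
    return max(probs, key=probs.get)


# Hypothetical stand-ins for two trained models (assumptions for illustration only)
def dummy_is_food(image: dict) -> bool:
    return image.get("contains_food", False)


def dummy_classify_food(image: dict) -> Dict[str, float]:
    return {"pizza": 0.80, "steak": 0.15, "sushi": 0.05}


print(two_stage_predict({"contains_food": True}, dummy_is_food, dummy_classify_food))
print(two_stage_predict({"contains_food": False}, dummy_is_food, dummy_classify_food))
```

Note that the gate model can be wrong too, which is exactly the kind of follow-up question deployment tends to surface.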

---

***Once you've got a good model, deployment is a good next step. Monitoring then involves seeing how your model performs on the most important data split: data from the real world.***

## Different types of machine learning model deployment

Whole books could be written on the different types of machine learning model deployment (and many good ones are listed in [PyTorch Extra Resources](https://www.learnpytorch.io/pytorch_extra_resources/#resources-for-machine-learning-and-deep-learning-engineering)).

---
Let's start with a simple question:
> What is the most ideal scenario for our machine learning model to be used?

And then work backward from there.

In the case of FoodVision, our ideal scenario might be:
- Someone takes a photo on a mobile device (through an app or web browser).
- The prediction comes back fast.

Easy.

So we have two main criteria:
1. The model should work on a mobile device
2. The model should make predictions *fast* (because a slow app is a boring app).

And of course, depending on our use case, our requirements may vary.

We may notice the above two points break down into another two questions:
1. **Where's it going to go?** - As in, where is it going to be stored?
2. **How's it going to function?** - As in, does it return predictions immediately? Or do they come later?

![](09-deployment-questions-to-ask.png)

*When starting to deploy machine learning models, it's helpful to start by asking what's the most ideal use case and then work backwards from there, asking where the model is going to go and then how it's going to function.*

## Where's it going to go?

When you deploy your machine learning model, where does it live?

The main debate here is usually on-device (also called edge/in the browser) or on the cloud (a computer/server that isn't the actual device someone/something calls the model from).

Both have their pros and cons.

| Deployment location | Pros | Cons |
|:--|:--|:--|
| **On-device (edge/in the browser)** | Can be very fast (since no data leaves the device) | Limited compute power (larger models take longer to run) |
| - | Privacy preserving (again no data has to leave the device) | Limited storage space (smaller model size required) |
| - | No internet connection required (sometimes) | Device-specific skills often required |
| **On cloud** | Near unlimited compute power (can scale up when needed) | Costs can get out of hand (if proper scaling limits aren't enforced) |
| - | Can deploy one model and use everywhere (via API) | Predictions can be slower due to data having to leave device and predictions having to come back (network latency) |
| - | Links into existing cloud ecosystem | Data has to leave device (this may cause privacy concerns) |

There are more details to these but I've left resources in the extra-curriculum to learn more.

Let's give an example.

If we're deploying FoodVision as an app, we want it to perform well and fast.

<div class='alert alert-success'>

So which model would we prefer?
1. A model on-device that performs at 95% accuracy with an inference time (latency) of one second per prediction.
2. A model on the cloud that performs at 98% accuracy with an inference time of 10 seconds per prediction (bigger, better model but takes longer to compute).

</div>

We've made these numbers up but they showcase a potential difference between on-device and on the cloud.

***Option 1** could potentially be a smaller, less performant model that runs fast because it's able to fit on a mobile device.*

***Option 2** could potentially be a larger, more performant model that requires more compute and storage but takes a bit longer to run because we have to send data off the device and get it back (so even though the actual prediction might be fast, the network time and data transfer have to be factored in).*
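
To make the trade-off concrete, here is a small sketch that treats total latency as compute time plus network time and then filters the options against some requirements. The accuracies and totals (1 second and 10 seconds) are the made-up numbers from above. The split of the cloud option into 2 seconds of compute and 8 seconds of network, the `DeploymentOption` class, the `pick_option` helper and the requirement thresholds are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeploymentOption:
    name: str
    accuracy: float         # fraction of correct predictions
    compute_seconds: float  # time the model itself takes per prediction
    network_seconds: float  # time to send data off the device and get a result back

    @property
    def total_latency(self) -> float:
        return self.compute_seconds + self.network_seconds


def pick_option(
    options: List[DeploymentOption], min_accuracy: float, max_latency: float
) -> Optional[DeploymentOption]:
    """Return the fastest option that meets both requirements, or None if none do."""
    candidates = [
        option
        for option in options
        if option.accuracy >= min_accuracy and option.total_latency <= max_latency
    ]
    return min(candidates, key=lambda option: option.total_latency) if candidates else None


# Made-up numbers: the compute/network split for the cloud option is an assumption
options = [
    DeploymentOption("on-device", accuracy=0.95, compute_seconds=1.0, network_seconds=0.0),
    DeploymentOption("cloud", accuracy=0.98, compute_seconds=2.0, network_seconds=8.0),
]

for option in options:
    print(f"{option.name}: accuracy={option.accuracy:.2f}, total latency={option.total_latency:.1f}s")

best = pick_option(options, min_accuracy=0.95, max_latency=2.0)
print(f"Preferred option: {best.name if best else 'none'}")
```

Changing `min_accuracy` or `max_latency` is a quick way to see how different use cases lead to different answers.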

**For FoodVision, we'd likely prefer option 1, because the small hit in performance is outweighed by the faster inference speed.**

![](09-model-deployment-on-device-vs-cloud.png)

*In the case of a Tesla car's computer vision system, which would be better? A smaller model that performs well on device (model is on the car) or a larger model that performs better that's on the cloud? In this case, you'd much prefer the model being on the car. The extra network time it would take for data to go from the car to the cloud and then back to the car just wouldn't be worth it (or potentially even impossible with poor signal areas).*

***Note**: For a full example of seeing what it's like to deploy a PyTorch model to an edge device, see the PyTorch tutorial on achieving real-time inference (30fps+) with a computer vision model on a Raspberry Pi.*

## How's it going to function?

Back to the ideal use case, when you deploy your machine learning model, **how should it work?**

1. *As in, would you like predictions returned immediately?*

2. *Or is it okay for them to happen later?*

These two scenarios are generally referred to as:
- Online (real-time) - *Predictions/inference happen **immediately***. For example, someone uploads an image, the image gets transformed and predictions are returned, or someone makes a purchase and the transaction is verified to be non-fraudulent by a model so the purchase can go through.
- Offline (batch) - *Predictions/inference happen **periodically***. For example, a photos application sorts your images into different categories (such as beach, mealtime, family, friends) whilst your mobile device is plugged into charge.

***Note:** `Batch` refers to inference being performed on multiple samples at a time. However, to add a little confusion, batch processing can happen immediately/online (multiple images being classified at once) and/or offline (multiple images being predicted/trained on at once).*

The main difference between the two is whether predictions are made immediately or periodically.

Periodically can have a varying timescale too, from every few seconds to every few hours or days.

And you can mix and match the two.
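
The difference between the two modes can be sketched with a toy example. Everything here is hypothetical: `predict` is a stand-in for a trained model that labels a file by its name, and `OfflineQueue` is an illustrative name for whatever mechanism collects samples and processes them later.

```python
from typing import Dict, List


def predict(sample: str) -> str:
    """Hypothetical stand-in for a trained model: labels a file by its name."""
    for label in ("pizza", "steak", "sushi"):
        if label in sample:
            return label
    return "unknown"


def predict_online(sample: str) -> str:
    """Online (real-time): predict as soon as the sample arrives."""
    return predict(sample)


class OfflineQueue:
    """Offline (batch): collect samples now, predict on all of them later."""

    def __init__(self) -> None:
        self.pending: List[str] = []

    def submit(self, sample: str) -> None:
        self.pending.append(sample)

    def run(self) -> Dict[str, str]:
        results = {sample: predict(sample) for sample in self.pending}
        self.pending.clear()
        return results


# Online: the caller gets an answer straight away
print(predict_online("pizza_001.jpg"))

# Offline: nothing is predicted until the periodic job runs
queue = OfflineQueue()
queue.submit("steak_002.jpg")
queue.submit("sushi_003.jpg")
batch_results = queue.run()
print(batch_results)
```

In practice the call to `run()` would be triggered on a schedule (for example, overnight or while a device is charging) rather than by hand.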

In the case of FoodVision, we'd want our inference pipeline to happen online (real-time), so when someone uploads an image of pizza, steak, or sushi, the prediction results are returned immediately (any slower than real-time would make a boring experience).

But for our training pipeline, it's okay for it to happen in a batch (offline) fashion, which is what we've been doing throughout the previous chapters.

## Ways to deploy a machine learning model

We've discussed a couple of options for deploying machine learning models (on-device and cloud).

Each of these has its own specific requirements, and there are many tools to help meet them.

| Tool/Resource | Deployment Type|
|:-|:-|
| [Google's ML Kit](https://developers.google.com/ml-kit)| On-device (Android and iOS)|
| [Apple's Core ML](https://developer.apple.com/documentation/coreml) and [`coremltools` Python package](https://apple.github.io/coremltools/docs-guides/) | On-device (all Apple devices)|
| [Amazon Web Services' (AWS) SageMaker](https://aws.amazon.com/sagemaker/) | Cloud |
| [Google Cloud's Vertex AI](https://cloud.google.com/vertex-ai) | Cloud |
| [Microsoft's Azure Machine Learning](https://azure.microsoft.com/en-au/services/machine-learning/) | Cloud|
| [Hugging Face Spaces](https://huggingface.co/spaces) | Cloud |
| API with [FastAPI](https://fastapi.tiangolo.com/) | Cloud/self-hosted server |
| API with [TorchServe](https://pytorch.org/serve/) | Cloud/self-hosted server |
| [ONNX (Open Neural Network Exchange)](https://onnx.ai/) | Many/general |
| Many more... | - |

***Note**: An application programming interface (API) is a way for two (or more) computer programs to interact with each other. For example, if your model was deployed as an API, you would be able to write a program that could send data to it and then receive predictions back.*
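
The send-data-in, get-predictions-back idea from the note can be sketched without any web framework. In this minimal example a plain function call stands in for the network, and JSON strings stand in for the request and response bodies. `fake_model`, `handle_request` and the `image_id` field are hypothetical names chosen for illustration. A real deployment would put something like FastAPI or TorchServe in front of the model.

```python
import json

CLASS_NAMES = ["pizza", "steak", "sushi"]


def fake_model(image_id: str) -> list:
    """Hypothetical stand-in for a trained model: returns fixed probabilities."""
    return [0.7, 0.2, 0.1]


def handle_request(request_body: str) -> str:
    """Server side: parse a JSON request, run the model, return a JSON response."""
    payload = json.loads(request_body)
    probs = fake_model(payload["image_id"])
    best_index = max(range(len(probs)), key=probs.__getitem__)
    response = {
        "label": CLASS_NAMES[best_index],
        "probabilities": dict(zip(CLASS_NAMES, probs)),
    }
    return json.dumps(response)


# Client side: send data, receive predictions back
request_body = json.dumps({"image_id": "photo_001"})
response = json.loads(handle_request(request_body))
print(response["label"])
print(response["probabilities"])
```

The client never needs to know how the model works, only what to send and what comes back.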

Which option you choose will be highly dependent on what you're building/who you're working with.

But with so many options, it can be very intimidating.

So it's best to start small and keep it simple.

And one of the best ways to do so is by turning your machine learning model into a demo app with [***Gradio***](https://gradio.app/) and then deploying it on Hugging Face Spaces.

We'll be doing just that with FoodVision later on.

![](09-tools-and-places-to-deploy-ml-models.png)

*A handful of places and tools to host and deploy machine learning models. There are plenty we've missed, feel free to find them later.*
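
As a rough preview of the Gradio route mentioned above, a demo boils down to a prediction function plus a description of its inputs and outputs. The sketch below is not the FoodVision app: `predict` returns placeholder probabilities instead of running a model, and `build_demo` is a hypothetical helper name. Gradio is imported inside `build_demo` so that `predict` can be tried without Gradio installed.

```python
import time

CLASS_NAMES = ["pizza", "steak", "sushi"]


def predict(img):
    """Return (class name -> probability, seconds taken) for one image.

    The probabilities are placeholders; a real app would run a trained model here.
    """
    start = time.perf_counter()
    probs = [0.7, 0.2, 0.1]  # placeholder values, not real model outputs
    pred_labels_and_probs = dict(zip(CLASS_NAMES, probs))
    pred_time = round(time.perf_counter() - start, 5)
    return pred_labels_and_probs, pred_time


def build_demo():
    """Wrap predict() in a Gradio interface (requires `pip install gradio`)."""
    import gradio as gr

    return gr.Interface(
        fn=predict,
        inputs=gr.Image(type="pil"),
        outputs=[
            gr.Label(num_top_classes=3, label="Predictions"),
            gr.Number(label="Prediction time (s)"),
        ],
        title="FoodVision Mini (sketch)",
    )


labels_and_probs, seconds = predict(None)
print(labels_and_probs)

# To try the demo locally, uncomment the next line:
# build_demo().launch()
```

Returning a dictionary of class names to probabilities is the shape a Gradio `Label` output expects, which is why `predict` is written that way.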

## What we're going to cover

Let's become a machine learning engineer and actually deploy a machine learning model.

Our goal is to deploy our FoodVision model via a demo Gradio app with the following metrics:
1. **Performance**: 95% accuracy.
2. **Speed**: real-time inference of 30 FPS+ (each prediction has a latency lower than ~0.03s).
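
The link between the two speed figures is simple arithmetic: at 30 frames per second, each prediction has a budget of 1/30 ≈ 0.033 seconds. The sketch below turns both goals into a check and shows one way to time a prediction function. The helper names (`fps_to_latency`, `average_latency`, `meets_goals`) are hypothetical, and the lambda being timed is a trivial stand-in, not a real model.

```python
from timeit import default_timer as timer
from typing import Callable, Sequence

TARGET_ACCURACY = 0.95
TARGET_FPS = 30


def fps_to_latency(fps: float) -> float:
    """Seconds available per prediction at a given frame rate."""
    return 1.0 / fps


def average_latency(predict_fn: Callable, samples: Sequence) -> float:
    """Average seconds per prediction over a collection of samples."""
    start = timer()
    for sample in samples:
        predict_fn(sample)
    return (timer() - start) / len(samples)


def meets_goals(accuracy: float, latency: float) -> bool:
    """True if a model hits both the accuracy and the speed target."""
    return accuracy >= TARGET_ACCURACY and latency <= fps_to_latency(TARGET_FPS)


print(f"Latency budget at {TARGET_FPS} FPS: {fps_to_latency(TARGET_FPS):.4f} seconds per prediction")

# Timing a trivial stand-in function (a real model would be much slower)
latency = average_latency(lambda sample: sample, samples=range(100))
print(f"Stand-in latency: {latency:.6f} seconds per prediction")
```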

We'll start by running an experiment to compare our best two models so far: EffNetB2 and ViT feature extractors.

Then we'll deploy the one which performs closest to our goal metrics.

***Finally, we'll finish with a (BIG) surprise bonus.***

| Topic |
|:-|
|**0. Setting up** |
|**1. Get data** |
|**2. FoodVision Mini model deployment experiment outline**|
|**3. Creating an EffNetB2 feature extractor**|
|**4. Creating a ViT feature extractor**|
|**5. Making predictions with our trained models and timing them**|
|**6. Comparing model results, prediction times and size**|
|**7. Bringing FoodVision to life by creating a Gradio demo**|
|**8. Turning our FoodVision Gradio demo into a deployable app**|
|**9. Deploying our Gradio demo to Hugging Face Spaces**|
|**10. Creating a BIG surprise**|
|**11. Deploying our BIG surprise**|

## 0. Setting up

As we've done previously, let's make sure we've got all of the modules we'll need for this section.

We'll import the Python scripts (such as `data_setup.py` and `engine.py`) we created in 05. PyTorch Going Modular.

To do so, we'll download the `going_modular` directory from the `pytorch-deep-learning` repository (if we don't already have it).

We'll also get the `torchinfo` package if it's not available.

`torchinfo` will help later on to give us a visual representation of our model.

And since later on we'll be using `torchvision`, we'll make sure we've got the latest versions.

In [1]:
import torch
import torchvision

print(f"torch version: {torch.__version__}")
print(f"torchvision version: {torchvision.__version__}")

torch version: 2.2.2
torchvision version: 0.17.2
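
Printed versions can also be turned into an explicit check. The sketch below parses version strings like the ones above and compares them against minimum versions. The helper names and the minimums used here (1.12 for `torch`, 0.13 for `torchvision`) are illustrative assumptions, not requirements stated in this section.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '2.2.2' or '0.17.2+cu121' into (major, minor)."""
    core = version.split("+")[0]  # drop a local build suffix such as '+cu121'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def meets_minimum(version: str, minimum: Tuple[int, int]) -> bool:
    """True if the version is at least the given (major, minor) minimum."""
    return major_minor(version) >= minimum


# Version strings copied from the output above; the minimums are assumptions
print(meets_minimum("2.2.2", (1, 12)))   # True
print(meets_minimum("0.17.2", (0, 13)))  # True
```

In a notebook, the same helpers could be called with `torch.__version__` and `torchvision.__version__` directly.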


Now we'll continue with the regular imports, set up device-agnostic code, and this time we'll also get the `helper_functions.py` script from GitHub.

The `helper_functions.py` script contains several functions we created in previous sections:
- `set_seeds()` - sets the random seeds
- `download_data()` - to download a data source given a link
- `plot_loss_curves()` - to inspect our model's training results.

In [2]:
# Continue with regular imports
import matplotlib.pyplot as plt
import torch
import torchvision

from torch import nn
from torchvision import transforms
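
# Try to get torchinfo, install it if it isn't available
try:
    from torchinfo import summary
except ImportError:
    print("[INFO] Couldn't find torchinfo... installing it.")
    !pip install -q torchinfo
    from torchinfo import summary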

# Try to import the going_modular directory, download it from GitHub if it doesn't work
try:
    from going_modular.going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves
except ImportError:
    # Get the going_modular scripts
    print("[INFO] Couldn't find going_modular or helper_functions scripts... downloading them from GitHub.")
    !git clone https://github.com/mrdbourke/pytorch-deep-learning
    !mv pytorch-deep-learning/going_modular .
    !mv pytorch-deep-learning/helper_functions.py . # get the helper_functions.py script
    !rm -rf pytorch-deep-learning
    from going_modular.going_modular import data_setup, engine
    from helper_functions import download_data, set_seeds, plot_loss_curves

[INFO] Couldn't find going_modular or helper_functions scripts... downloading them from GitHub.
Cloning into 'pytorch-deep-learning'...
remote: Enumerating objects: 4356, done.
remote: Counting objects: 100% (321/321), done.
remote: Compressing objects: 100% (143/143), done.
remote: Total 4356 (delta 213), reused 255 (delta 177), pack-reused 4035 (from 1)
Receiving objects: 100% (4356/4356), 654.51 MiB | 22.92 MiB/s, done.
Resolving deltas: 100% (2584/2584), done.
Updating files: 100% (248/248), done.


In [4]:
# device agnostic code
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'