# Design Patterns for Resilient Serving

- Software deployed into production environments is expected to be resilient and require little in the way of human intervention to keep it running
- Stateless serving
    - This pattern allows the serving infrastructure to scale and handle thousands or even millions of prediction requests per second
- Batch Serving
    - Asynchronously handle occasional or periodic requests for millions to billions of predictions
    
- Continued model evaluation
    - Handles the common problem of detecting when a deployed model is no longer fit for purpose

## Design Pattern 16: Stateless Serving Function

- A stateless function is a function whose outputs are determined purely by its inputs
- A function that maintains a counter of the number of times it has been called is stateful
    - The state can be held in class variables
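
As a minimal sketch of the distinction, the two callables below do the same arithmetic, but only the first is stateless. The names `double` and `CallCounter` are made up for illustration.

```python
def double(x: float) -> float:
    """Stateless: the result depends only on the argument."""
    return 2 * x


class CallCounter:
    """Stateful: the result also depends on how often it has been called."""

    calls = 0  # class variable shared by every instance

    def __call__(self, x: float):
        CallCounter.calls += 1
        return 2 * x, CallCounter.calls


counter = CallCounter()
first = counter(3.0)   # same input ...
second = counter(3.0)  # ... but a different output, because the counter moved on
```

Any number of clients can share `double` safely, whereas sharing `counter` means every client sees the effect of every other client's calls.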
    
- Because stateless components don't have any state, they can be shared by multiple clients
- Servers typically create a pool of stateless components and use them to service client requests as they come in
- On the other hand, stateful components will need to represent each client's conversational state
- Lifecycle of stateless components needs to be managed by the server
    - e.g. initialised on the client's first request and then destroyed when the client terminates or times out
    - Because of these factors stateless components are highly scalable
    - Whereas stateful components are expensive and difficult to manage
- Web applications transfer state from the clients to the server with each call

- In ML a lot of state is captured during training
- When the model is exported / saved we require the model framework to keep track of these stateful variables

- Using stateless functions simplifies the server code and makes it more scalable, but can make client code more complicated
    - e.g. a spelling correction model has to know the previous few words in order to correct a word, so the client has to keep that context and send it with each request
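
The spelling example can be sketched as follows, assuming a toy lookup table in place of a real model. `CORRECTIONS`, `correct_word` and `SpellingClient` are hypothetical names; the point is only that the history lives in the client while the server function stays stateless.

```python
from collections import deque

# Hypothetical toy "model": corrections that depend on the previous word.
CORRECTIONS = {
    ("machine", "lerning"): "learning",
    ("design", "patern"): "pattern",
}


def correct_word(previous_words: tuple, word: str) -> str:
    """Stateless server function: all the context arrives with the call."""
    if previous_words:
        return CORRECTIONS.get((previous_words[-1], word), word)
    return word


class SpellingClient:
    """Client that owns the conversational state (the last few words)."""

    def __init__(self, context_size: int = 3):
        self.history = deque(maxlen=context_size)

    def submit(self, word: str) -> str:
        fixed = correct_word(tuple(self.history), word)
        self.history.append(fixed)
        return fixed


client = SpellingClient()
corrected = [client.submit(w) for w in ["machine", "lerning", "design", "patern"]]
```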

### Problem

- Imagine we have a trained deep learning model (Keras flavour)
- The model predicts how positive a film review is, and therefore has an embedding layer
- We use this model to make predictions via the `predict` call

- There are several problems with carrying out inference by calling `model.predict()` on an in-memory object
    - The entire Keras model has to be loaded into memory; with all its embeddings and layers this can be very large for deep learning models
    - The preceding architecture imposes limits on the latency that can be achieved because calls to the `predict()` method have to be sent one by one
    - The model's input and output formats are the ones that are most effective for training and may not be user friendly. The model may output [logits](https://en.wikipedia.org/wiki/Logit) but clients may want the sigmoid of that, so that the output range is between 0 and 1 and can be interpreted as a probability. The model may also have been trained on compressed binary records, whereas the input from the client may be JSON.

### Solution

1. Export the model into a format that captures the mathematical core of the model and is programming language agnostic
2. In the production system, the formula consisting of the "forward" calculations of the model is restored as a stateless function
3. The stateless function is deployed into a framework that provides a REST endpoint

#### Model Export

- Can use [ONNX](https://onnx.ai)
- Training-only settings such as learning rate and dropout don't need to be saved. An export format like ONNX keeps only what is needed for inference.
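
The idea of exporting only the mathematical core can be illustrated without any ML framework. The sketch below is not ONNX; it uses plain JSON and a made-up linear model (`trained_state`, `export_model` and `restore_forward` are hypothetical names) to show training-only state being dropped at export and the forward calculation being restored as a stateless function.

```python
import json

# Hypothetical in-memory state of a tiny linear model after training.
trained_state = {
    "weights": [0.8, -0.3],
    "bias": 0.1,
    "learning_rate": 0.01,   # training-only
    "dropout_rate": 0.5,     # training-only
    "optimizer_step": 1200,  # training-only
}

INFERENCE_KEYS = ("weights", "bias")


def export_model(state: dict) -> str:
    """Serialise only what the forward pass needs, as language-neutral JSON."""
    return json.dumps({key: state[key] for key in INFERENCE_KEYS}, sort_keys=True)


def restore_forward(exported: str):
    """Rebuild the forward calculation as a stateless function."""
    params = json.loads(exported)
    weights, bias = params["weights"], params["bias"]

    def forward(features):
        return sum(w * x for w, x in zip(weights, features)) + bias

    return forward


exported = export_model(trained_state)
forward = restore_forward(exported)
```

A real export format also records the graph of operations, not just the parameters, which is what makes it usable from other languages and runtimes.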

#### Inference in Python

- In production the model's formula is restored along with any other files as a stateless function
- The function must conform to a specific model signature with the input and output variable names and input types
- The signature specifies what the input and output of the model is 
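
One way to picture a signature is as a small contract that is checked before the model runs. This is a framework-free sketch under assumed names (`Signature`, `REVIEW_SIGNATURE`, `review_text`, `positivity`), with a placeholder rule standing in for the restored model.

```python
from dataclasses import dataclass


@dataclass
class Signature:
    inputs: dict    # input name -> expected Python type
    outputs: tuple  # output names


# Hypothetical signature for the film review model.
REVIEW_SIGNATURE = Signature(inputs={"review_text": str}, outputs=("positivity",))


def validate(signature: Signature, request: dict) -> None:
    """Reject requests whose names or types do not match the signature."""
    missing = set(signature.inputs) - set(request)
    unexpected = set(request) - set(signature.inputs)
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
    for name, expected_type in signature.inputs.items():
        if not isinstance(request[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")


def serve(request: dict) -> dict:
    validate(REVIEW_SIGNATURE, request)
    # Placeholder scoring rule, standing in for the restored model.
    score = 1.0 if "great" in request["review_text"].lower() else 0.0
    return {"positivity": score}
```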

#### Create Web Endpoint

- Should define the serving function as a global variable or a singleton class so that it isn't reloaded in response to every request.
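
A minimal sketch of this, using only the standard library's WSGI calling convention: the model is loaded once into a module-level variable and every request reuses it. `load_model`, `app` and `call_app` are hypothetical names, and the scoring rule is a stand-in for a real model.

```python
import io
import json

LOAD_COUNT = 0


def load_model():
    """Hypothetical loader; stands in for reading an exported model from disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda text: 1.0 if "great" in text.lower() else 0.0


# Loaded once at import time and reused by every request.
MODEL = load_model()


def app(environ, start_response):
    """Minimal WSGI application exposing the model as a JSON POST endpoint."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    score = MODEL(payload.get("review_text", ""))
    body = json.dumps({"positivity": score}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(review_text: str) -> dict:
    """Helper that invokes the app in-process, without starting a server."""
    raw = json.dumps({"review_text": review_text}).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": io.BytesIO(raw)}
    response = app(environ, lambda status, headers: None)
    return json.loads(b"".join(response))
```

To try it over HTTP, the same `app` could be passed to `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a production deployment would normally sit behind a proper application server instead.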

### Why it Works

#### AutoScaling

- Scaling to millions of requests per second is a well understood engineering problem
- Rather than building services unique to ML we can rely on decades of engineering work that has gone into building resilient web applications
- Modern cloud services offer auto-scaling services
- Some ML frameworks have their own serving subsystem:
    - PyTorch: TorchServe
    - TensorFlow: TensorFlow Serving
    
#### Fully Managed

- Some cloud providers also offer fully managed model serving, e.g. Amazon SageMaker

### Trade-Offs and Alternatives

#### Custom Serving Function

- Use a custom function to return the desired result to the client
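
For instance, a custom serving function might accept JSON and convert the model's logit into a probability. The sketch below assumes a made-up `raw_model` and a request shape of `{"features": [...]}`; neither comes from a real framework.

```python
import json
import math


def raw_model(features):
    """Hypothetical stand-in for the exported model; returns a logit."""
    return 2.0 * features[0] - 1.0 * features[1]


def sigmoid(logit: float) -> float:
    """Numerically stable sigmoid, mapping a logit into the range 0 to 1."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def serving_fn(request_json: str) -> str:
    """Custom serving function: JSON in, probability out."""
    features = json.loads(request_json)["features"]
    probability = sigmoid(raw_model(features))
    return json.dumps({"probability": probability})
```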

#### Multiple Signatures

- There can be inexpensive (e.g. sigmoid) and expensive functions to run at inference time
- Returning both to the client with every request would be wasteful
- The client must explicitly request the inference from the expensive function
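
A rough sketch of multiple signatures is a lookup table from signature name to serving function, with the cheap one as the default. The signature names and the "explain" output (a simple leave-one-out contribution estimate) are illustrative assumptions, chosen only because they are visibly more expensive than a sigmoid.

```python
import math


def forward(features):
    """Hypothetical model core returning a logit."""
    return sum(features)


def predict_probability(features):
    """Cheap signature: one sigmoid on top of the logit."""
    return {"probability": 1.0 / (1.0 + math.exp(-forward(features)))}


def predict_with_contributions(features):
    """Expensive signature: re-runs the model once per feature with that
    feature zeroed out, to estimate its contribution to the logit."""
    baseline = forward(features)
    contributions = [
        baseline - forward(features[:i] + [0.0] + features[i + 1:])
        for i in range(len(features))
    ]
    return {**predict_probability(features), "contributions": contributions}


SIGNATURES = {
    "serving_default": predict_probability,
    "explain": predict_with_contributions,
}


def serve(features, signature="serving_default"):
    """Clients get the cheap output unless they name the expensive signature."""
    return SIGNATURES[signature](features)
```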

#### Prediction Library

- Instead of deploying the serving function as a microservice that can be called via a REST API, implement the prediction code as a library function
- The library would load the model the first time it is called
- Developers who need to predict with the library can then include the library with their application

- A library function is a better alternative than a microservice if the model cannot be called over a network for physical or performance reasons
- The library function also places the computational burden on the client, and this might be preferable from a budgetary standpoint

- The drawback is that maintenance and updates of the model are difficult
- All client code that uses the model will have to be updated to use the new version of the library
- The more often the model is updated, the more attractive the microservice approach becomes
- The library approach is also restricted to programming languages for which the libraries are written
- Whereas a REST API opens up the model to applications written in any modern language

## Design Pattern 17: Batch Serving

- Carries out inference on a larger number of instances all at once

### Problem

- Commonly, predictions are carried out one at a time and on demand
    - e.g. working out if a credit card transaction is fraudulent or not
- A deployed model is typically set up to process one instance per request
- The serving framework is architected to process an individual request synchronously and as quickly as possible
- This is usually a microservice

- There are circumstances where predictions need to be carried out asynchronously over large volumes of data
    - e.g. do we need to buy more stock? This can be decided hourly or daily, not every time an item is sold
- Sending millions of requests to an endpoint that is only capable of handling a single request at a time may overwhelm the model, with an effect similar to a DDoS attack

### Solution

- The batch serving pattern uses a distributed data processing infrastructure (MapReduce, Spark, BigQuery, Apache Beam) to carry out ML inference on a large number of instances asynchronously
- Using distributed systems to carry out one-off predictions is not very efficient; a large volume of data needs to be passed through inference to make it worthwhile
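
The shape of the pattern can be shown on a single machine: split the instances into chunks, score each chunk on a worker, and stitch the results back together. In this sketch a local thread pool stands in for the workers of a system like Spark or Beam, and `predict_batch` is a made-up model call.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_batch(instances):
    """Hypothetical model call that scores a whole chunk in one go."""
    return [2 * x + 1 for x in instances]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_serve(instances, chunk_size=1000, workers=4):
    """Fan chunks out to workers and flatten the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(predict_batch, chunked(instances, chunk_size))
        return [prediction for chunk in results for prediction in chunk]


predictions = batch_serve(list(range(10_000)))
```

The per-chunk overhead is why this only pays off at volume: for a handful of instances, the cost of splitting and scheduling outweighs the work itself.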

### Why it Works

- The stateless serving function is set up for low-latency serving to support thousands of simultaneous queries
- Using it for occasional or periodic processing of large volumes of data can be time consuming and expensive
- If requests are not latency sensitive it is more cost-effective to use a distributed data processing architecture

### Trade-Offs and Alternatives

- Batch serving pattern depends on the ability to split a task across multiple workers
- Even though batch serving is used when latency is not a concern, it is possible to incorporate precomputed results and periodic refreshing to use this in scenarios where the space of possible prediction inputs is limited

#### Batch and Stream Pipelines

- Frameworks like Spark or Apache Beam are useful when the input needs preprocessing before it can be supplied to the model, when the outputs require postprocessing, or when either is hard to express in SQL

- Apache Beam is useful if the client code needs to maintain state 
    - e.g. time-windowed average
    - Stop users making multiple comments on a post
        - State needed to keep count of how many times a user commented on a post
- We can do the distributed processing and maintain state with Apache Beam
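
To make the kind of state concrete, here is a time-windowed average per key in plain Python. This is not the Apache Beam API; it only illustrates the bookkeeping that a stream-processing framework would otherwise manage across workers. `WindowedAverage` is a hypothetical name, and the sketch assumes timestamps arrive in increasing order for each key.

```python
from collections import defaultdict, deque


class WindowedAverage:
    """Average of the values seen in the last `window_seconds`, per key."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # key -> deque of (timestamp, value)

    def add(self, key, timestamp: float, value: float) -> float:
        window = self.events[key]
        window.append((timestamp, value))
        # Evict events that have fallen out of the window.
        while window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        return sum(v for _, v in window) / len(window)
```

The same structure, with a count instead of an average, would cover the comment-limiting example.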

#### Cached Results of Batch Serving

- Compute and cache work ahead of time
    - e.g. if we have 10 million users and 10,000 items and we want the top 5 ranked items for each user, it would not be feasible to compute this in near real time
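
A scaled-down sketch of this idea: score every user and item pair in a batch job, keep only the top 5 per user, and answer requests with a dictionary lookup. The `score` function is a made-up deterministic stand-in for a model, and the sizes are far smaller than in the example above.

```python
import heapq


def score(user_id: int, item_id: int) -> float:
    """Hypothetical model score; deterministic so the example is reproducible."""
    return ((user_id * 31 + item_id * 17) % 101) / 100.0


def precompute_top_k(user_ids, item_ids, k=5):
    """Batch job: rank all items for each user and keep the k best."""
    cache = {}
    for user_id in user_ids:
        cache[user_id] = heapq.nlargest(
            k, item_ids, key=lambda item: score(user_id, item)
        )
    return cache


# Computed ahead of time, e.g. by a scheduled batch job.
TOP_ITEMS = precompute_top_k(range(100), range(1000))


def recommend(user_id: int):
    """Serving path: a lookup instead of scoring every item on demand."""
    return TOP_ITEMS.get(user_id, [])
```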

#### Lambda Architecture

- A production ML system that supports both online serving and batch serving is called a [Lambda Architecture](https://oreil.ly/jLZ46)
- Such an ML system allows practitioners to trade off between latency (via the stateless serving function pattern) and throughput (via the batch serving pattern)

- Typically, a Lambda architecture is supported by having separate systems for online serving and batch serving
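
As a loose, single-process illustration of the two paths, the sketch below keeps a batch-computed view alongside an online function and reports which path answered. `LambdaServing` and its method names are hypothetical, and a real deployment would run the two paths as separate systems as noted above.

```python
class LambdaServing:
    """Toy front end over a batch path and an online path."""

    def __init__(self, online_predict, batch_predict):
        self.online_predict = online_predict  # one instance, low latency
        self.batch_predict = batch_predict    # many instances, high throughput
        self.batch_view = {}

    def refresh(self, inputs):
        """Batch path: recompute predictions in bulk, e.g. on a schedule."""
        inputs = list(inputs)
        self.batch_view = dict(zip(inputs, self.batch_predict(inputs)))

    def predict(self, x):
        """Use a precomputed result if there is one, else compute it now."""
        if x in self.batch_view:
            return self.batch_view[x], "batch"
        return self.online_predict(x), "online"


service = LambdaServing(
    online_predict=lambda x: x * 2,
    batch_predict=lambda xs: [x * 2 for x in xs],
)
service.refresh([1, 2, 3])
```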

## Design Pattern 18: Continued Model Evaluation