# Design Patterns for Resilient Serving

- Software deployed into production environments is expected to be resilient and require little in the way of human intervention to keep it running
- Stateless serving
    - This pattern allows the serving infrastructure to scale and handle thousands or even millions of prediction requests per second
- Batch Serving
    - Asynchronously handle occasional or periodic requests for millions to billions of predictions
    
- Continued model evaluation
    - Patterns that handle the common problem of detecting when a deployed model is no longer fit-for-purpose

## Design Pattern 16: Stateless Serving Function

- A stateless function is a function whose outputs are determined purely by its inputs
- A function that maintains a counter of the number of times it's been called is stateful
    - Can be done via class variables
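
A minimal sketch of the difference, using a hypothetical tax calculation (the names `add_tax` and `CountingTaxCalculator` are illustrative, not from the source):

```python
def add_tax(price: float, rate: float) -> float:
    """Stateless: the output depends only on the arguments."""
    return price * (1.0 + rate)


class CountingTaxCalculator:
    """Stateful: a class variable remembers how many calls have been made."""

    call_count = 0  # shared state that persists between calls

    def add_tax(self, price: float, rate: float) -> float:
        # Each call changes the class-level counter, so calls are not independent.
        CountingTaxCalculator.call_count += 1
        return price * (1.0 + rate)
```

Any number of clients can call `add_tax` concurrently without coordination, while the counter in `CountingTaxCalculator` has to be stored and kept consistent somewhere.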
    
- Because stateless components don't have any state, they can be shared by multiple clients
- Servers typically create a pool of stateless components and use them to service client requests as they come in
- On the other hand, stateful components need to represent each client's conversational state
- The lifecycle of stateful components needs to be managed by the server
    - e.g. initialised on a client's first request and then destroyed when the client terminates or times out
    - Because of these factors, stateless components are highly scalable
    - Whereas stateful components are expensive and difficult to manage
- Web applications transfer state from the clients to the server with each call

- In ML a lot of state is captured during training
- When the model is exported / saved, we require the model framework to keep track of these stateful variables

- Using stateless functions simplifies the server code and makes it more scalable, but can make client code more complicated
    - e.g. A spelling correction model has to know the previous few words in order to correct a word, so the client has to keep track of them and send them with each request
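
The spelling example can be sketched as follows. This is a toy illustration under assumptions: the correction rules, `correct_word`, and `SpellingClient` are all hypothetical stand-ins for a real model and client.

```python
from typing import List, Sequence

# Hypothetical toy rules: (previous word, typed word) -> correction.
CONTEXT_RULES = {
    ("over", "their"): "there",
    ("in", "there"): "their",
}


def correct_word(previous_words: Sequence[str], word: str) -> str:
    """Stateless server-side function: all context arrives with the request."""
    previous = previous_words[-1] if previous_words else ""
    return CONTEXT_RULES.get((previous, word), word)


class SpellingClient:
    """The client owns the conversational state and sends it with each call."""

    def __init__(self, context_size: int = 3) -> None:
        self.context_size = context_size
        self.history: List[str] = []

    def type_word(self, word: str) -> str:
        # Send the last few words along with the new word on every request.
        context = self.history[-self.context_size:]
        corrected = correct_word(context, word)
        self.history.append(corrected)
        return corrected
```

The server function stays trivially shareable, and the extra bookkeeping (`history`) lives in the client.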

### Problem

- Imagine we have a trained deep learning model (Keras flavour)
- The model predicts how positive the reviews for a film are, and therefore has an embedding layer
- We use this model to make predictions via the `predict` call

- There are several problems with carrying out inference by calling `model.predict()` on an in-memory object
    - The entire Keras model has to be loaded into memory; with all its embeddings and layers, this can be very large for deep learning models
    - This architecture imposes limits on the latency that can be achieved, because calls to the `predict()` method have to be sent one by one
    - The model's input and output formats are chosen for training and may not be user friendly. The model may output [logits](https://en.wikipedia.org/wiki/Logit), but clients may want the sigmoid of that so that the output lies between 0 and 1 and can be interpreted as a probability. The model may also have been trained on compressed binary records, whereas the client may send JSON.
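
A rough sketch of the last point: a thin wrapper that accepts JSON and converts logits to probabilities. The JSON field names (`reviews`, `positive_probability`) and the `predict_logits` callable are assumptions made for illustration.

```python
import json
import math
from typing import Callable, List


def sigmoid(logit: float) -> float:
    """Map a raw logit to the (0, 1) range without overflowing for large inputs."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def serve_json(
    request_body: str,
    predict_logits: Callable[[List[str]], List[float]],
) -> str:
    """Accept client-friendly JSON and return probabilities instead of logits."""
    payload = json.loads(request_body)
    logits = predict_logits(payload["reviews"])  # model-native output
    probabilities = [sigmoid(z) for z in logits]
    return json.dumps({"positive_probability": probabilities})
```

Here `predict_logits` stands in for whatever the trained model exposes; the wrapper is the only part clients need to know about.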

### Solution

1. Export the model into a format that captures the mathematical core of the model and is programming language agnostic
2. In the production system, the formula consisting of the "forward" calculations of the model is restored as a stateless function
3. The stateless function is deployed into a framework that provides a REST endpoint

#### Model Export

- Can use [ONNX](https://onnx.ai)
- Training-only settings such as learning rate and dropout don't need to be saved. This is what ONNX does.
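
To illustrate the idea without depending on any framework, here is a minimal sketch that exports only the parameters needed for the forward pass. This is not ONNX or its API; the `training_state` dictionary and the tiny linear model are hypothetical.

```python
import json

# Hypothetical training-time state for a tiny linear model.
training_state = {
    "weights": [0.5, -1.0],
    "bias": 0.25,
    "learning_rate": 0.01,             # training-only
    "dropout_rate": 0.3,               # training-only
    "optimizer_momentum": [0.0, 0.0],  # training-only
}

# Only these entries are needed to compute a prediction.
INFERENCE_KEYS = ("weights", "bias")


def export_for_inference(state: dict) -> str:
    """Serialise only the parameters the forward pass needs."""
    return json.dumps({key: state[key] for key in INFERENCE_KEYS})


def load_forward_function(exported: str):
    """Restore the exported parameters as a stateless prediction function."""
    params = json.loads(exported)

    def predict(features):
        weighted = sum(w * x for w, x in zip(params["weights"], features))
        return weighted + params["bias"]

    return predict
```

The exported string is language agnostic (plain JSON here), and the restored `predict` depends only on its input and the frozen parameters.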

#### Inference in Python

- In production the model's formula is restored along with any other files as a stateless function
- The function must conform to a specific model signature with the input and output variable names and input types
- The signature specifies what the input and output of the model is 
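
One way to picture a signature is as a small contract that the serving function checks before running the model. The sketch below is an assumption-laden illustration: `SIGNATURE`, `make_serving_function`, and the field names are hypothetical and do not correspond to any framework's signature format.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: input names with their types, plus output names.
SIGNATURE = {
    "inputs": {"review_text": str},
    "outputs": ("positive_probability",),
}


def make_serving_function(
    forward: Callable[[str], float],
) -> Callable[[Dict[str, Any]], Dict[str, float]]:
    """Wrap a restored forward function so that it enforces the signature."""

    def serve(inputs: Dict[str, Any]) -> Dict[str, float]:
        # Reject requests that do not match the declared input names and types.
        for name, expected_type in SIGNATURE["inputs"].items():
            if name not in inputs:
                raise ValueError(f"missing input: {name}")
            if not isinstance(inputs[name], expected_type):
                raise TypeError(f"{name} must be {expected_type.__name__}")
        score = forward(inputs["review_text"])
        return {SIGNATURE["outputs"][0]: score}

    return serve
```

Calling `make_serving_function(forward)({"review_text": "great film"})` returns a dictionary keyed by the declared output name.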

#### Create Web Endpoint

- Should define the serving function as a global variable or a singleton class so that it isn't reloaded in response to every request.

### Why it Works

#### AutoScaling

- Scaling to millions of requests per second is a well understood engineering problem
- Rather than building services unique to ML we can rely on decades of engineering work that has gone into building resilient web applications
- Modern cloud services offer auto-scaling services
- Some ML frameworks have their own serving subsystem:
    - PyTorch: TorchServe
    - TensorFlow: TensorFlow Serving
    
#### Fully Managed

- Some cloud providers also offer fully managed model serving, e.g. Amazon SageMaker

### Trade-Offs and Alternatives