# Design Patterns for Resilient Serving

- Software deployed into production environments is expected to be resilient and require little in the way of human intervention to keep it running
- Stateless serving
    - This pattern allows the serving infrastructure to scale and handle thousands or even millions of prediction requests per second
- Batch Serving
    - Asynchronously handle occasional or periodic requests for millions to billions of predictions
    
- Continued model evaluation
    - Handles the common problem of detecting when a deployed model is no longer fit for purpose

## Design Pattern 16: Stateless Serving Function

- A stateless function is a function whose outputs are determined purely by its inputs
- A function that maintains a counter of the number of times it's been called is stateful
    - Can be done via class variables
    
- Because stateless components don't have any state, they can be shared by multiple clients
- Servers typically create a pool of stateless components and use them to service client requests as they come in
- On the other hand, stateful components need to represent each client's conversational state
- The lifecycle of stateful components needs to be managed by the server
    - e.g. initialised on the client's first request and then destroyed when the client terminates or times out
    - Because of these factors, stateless components are highly scalable
    - Whereas stateful components are expensive and difficult to manage
- Web applications are often designed so that the client transfers state to the server with each call, which lets the server components stay stateless

- In ML a lot of state is captured during training
- When the model is exported / saved we require the model framework to keep track of these stateful variables

- When stateless functions are used, it simplifies the server code and makes it more scalable but can make client code more complicated
    - e.g. A spelling correction model has to know the previous few words in order to correct words
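
The difference between the two kinds of function can be sketched in a few lines of Python. The names `scale` and `CountingScaler` are invented for illustration; the counter is kept in a class variable, as mentioned above.

```python
def scale(x: float, factor: float) -> float:
    """Stateless: the output depends only on the arguments."""
    return x * factor


class CountingScaler:
    """Stateful: remembers how many times it has been called."""

    calls = 0  # class variable, shared by every instance

    def scale(self, x: float, factor: float) -> float:
        CountingScaler.calls += 1  # state changes on every call
        return x * factor
```

Any number of clients can share `scale` safely, whereas `CountingScaler` gives results (its counter) that depend on the history of earlier calls.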

### Problem

- Imagine we have a trained deep learning model (Keras flavour)
- The model predicts how positive a film review is, and therefore has an embedding layer
- We use this model to make predictions via the `predict` call

- There are several problems with carrying out inference by calling `model.predict()` on an in-memory object
    - The entire Keras model has to be loaded into memory; with all of its embeddings and layers this can be very large for deep learning models
    - The preceding architecture imposes limits on the latency that can be achieved because calls to the `predict()` method have to be sent one by one
    - The model's input and output formats are the ones most effective for training and may not be user friendly. The model may output [logits](https://en.wikipedia.org/wiki/Logit), but clients may want the sigmoid of that so that the output lies between 0 and 1 and can be interpreted as a probability. The model may also have been trained on compressed binary records, whereas the input from the client may be JSON.

### Solution

1. Export the model into a format that captures the mathematical core of the model and is programming language agnostic
2. In the production system, the formula consisting of the "forward" calculations of the model is restored as a stateless function
3. The stateless function is deployed into a framework that provides a REST endpoint

#### Model Export

- Can use [ONNX](https://onnx.ai)
- Training-only settings such as learning rate and dropout don't need to be saved; only what the forward calculation needs is exported. This is what ONNX does.

#### Inference in Python

- In production the model's formula is restored along with any other files as a stateless function
- The function must conform to a specific model signature with the input and output variable names and input types
- The signature specifies what the input and output of the model is 

#### Create Web Endpoint

- Should define the serving function as a global variable or a singleton class so that it isn't reloaded in response to every request.
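
A minimal sketch of loading the serving function only once, assuming a toy word-weight "model" in place of a real exported one. `get_serving_fn` and `handle_request` are hypothetical names, and `functools.lru_cache` is used here as one simple way to get singleton-like behaviour.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_serving_fn():
    """Build the serving function once; later calls reuse the cached one."""
    # Assumption: a real service would restore the exported model here.
    weights = {"great": 1.5, "boring": -1.5}

    def serving_fn(text: str) -> float:
        return sum(weights.get(word, 0.0) for word in text.lower().split())

    return serving_fn


def handle_request(text: str) -> float:
    """Request handler: never reloads the model itself."""
    return get_serving_fn()(text)
```

After several requests, `get_serving_fn.cache_info().misses` stays at 1, which confirms the expensive load happened only on the first call.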

### Why it Works

#### AutoScaling

- Scaling to millions of requests per second is a well understood engineering problem
- Rather than building services unique to ML we can rely on decades of engineering work that has gone into building resilient web applications
- Modern cloud services offer auto-scaling services
- Some ML frameworks have their own serving subsystem:
    - PyTorch: TorchServe
    - TensorFlow: TensorFlow Serving
    
#### Fully Managed

- Some cloud providers also offer fully managed model serving, e.g. Amazon SageMaker

### Trade-Offs and Alternatives

#### Custom Serving Function

- Use a custom function to return the desired result to the client
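
As a rough illustration, a custom serving function might accept JSON, call the model, and convert the raw logit into a probability before replying. Here `fake_model` is a hypothetical stand-in for the exported model, and the JSON field names are assumptions.

```python
import json
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function: maps a logit into (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def fake_model(review: str) -> float:
    """Hypothetical stand-in for the exported model; returns a logit."""
    return 2.0 if "great" in review.lower() else -2.0


def serve(request_body: str) -> str:
    """Accept JSON from the client and return a probability as JSON."""
    review = json.loads(request_body)["review"]
    probability = sigmoid(fake_model(review))
    return json.dumps({"positive_probability": round(probability, 4)})
```

The client never sees logits or the training input format; it only deals with JSON in and a probability out.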

#### Multiple Signatures

- There can be inexpensive (e.g. sigmoid) and expensive functions to run at inference time
- It would not be good to return both to the client with each request
- The client must explicitly request the inference from the expensive function
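
One way to picture this is a lookup table of named signatures, where the cheap one is the default and the expensive one runs only when asked for. The signature names `default` and `explain`, and the toy `WEIGHTS` model, are assumptions made for this sketch.

```python
import math

WEIGHTS = {"great": 2.0, "dull": -2.0}  # hypothetical toy model


def cheap_signature(text: str) -> dict:
    """Inexpensive output: just the probability."""
    logit = sum(WEIGHTS.get(w, 0.0) for w in text.lower().split())
    return {"probability": 1.0 / (1.0 + math.exp(-logit))}


def expensive_signature(text: str) -> dict:
    """Costlier output: probability plus per-word contributions."""
    result = cheap_signature(text)
    result["contributions"] = {w: WEIGHTS.get(w, 0.0) for w in text.lower().split()}
    return result


SIGNATURES = {"default": cheap_signature, "explain": expensive_signature}


def predict(text: str, signature: str = "default") -> dict:
    """Run only the signature the client explicitly asked for."""
    if signature not in SIGNATURES:
        raise ValueError(f"unknown signature: {signature}")
    return SIGNATURES[signature](text)
```

Calling `predict("great film")` skips the extra work, while `predict("great film", signature="explain")` pays for it.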

#### Prediction Library

- Instead of deploying the serving function as a microservice that can be called via a REST API, implement the prediction code as a library function
- The library would load the model the first time it is called
- Developers who need to predict with the library can then include the library with their application

- A library function is a better alternative than a microservice if the model cannot be called over a network for physical or performance reasons
- The library function also places the computational burden on the client, and this might be preferable from a budgetary standpoint

- The drawback is that maintenance and updates of the model are difficult
- All client code that uses the model will have to be updated to use the new version of the library
- The more often the model is updated, the more attractive the microservice approach becomes
- The library approach is also restricted to programming languages for which the libraries are written
- Whereas a REST API opens up the model to applications written in any modern language

## Design Pattern 17: Batch Serving

- Carries out inference on a larger number of instances all at once

### Problem

- Predictions are carried out one at a time and on demand
    - e.g. working out if a credit card transaction is fraudulent or not
- When a model is deployed it is typically set up to process one instance at a time
- The serving framework is architected to process an individual request synchronously and as quickly as possible
- This is usually a microservice

- There are circumstances where predictions need to be carried out asynchronously over large volumes of data
    - e.g. do we need to buy more stock? This can be decided hourly or daily rather than every time an item is sold
- Attempting to take an endpoint capable of handling only a single request at a time and sending it millions of requests may overwhelm the model (similar in effect to a DDoS attack)

### Solution

- The batch serving pattern uses a distributed data processing infrastructure (MapReduce, Spark, BigQuery, Apache Beam) to carry out ML inference on a large number of instances asynchronously
- Using distributed systems to carry out one-off predictions is not very efficient; a large volume of data needs to be passed in for inference to make it worthwhile
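
The idea of splitting a large inference job across workers can be sketched on a single machine. In this sketch a thread pool merely stands in for a distributed framework such as Spark or Beam, and `predict_one` is a hypothetical placeholder for the model.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_one(x: float) -> float:
    """Hypothetical stand-in for the model's forward calculation."""
    return 2.0 * x + 1.0


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_chunk(chunk):
    """Work done by a single worker on its share of the data."""
    return [predict_one(x) for x in chunk]


def batch_predict(instances, chunk_size=1000, workers=4):
    """Split the instances into chunks and score the chunks in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(predict_chunk, chunked(instances, chunk_size))
        return [y for chunk in results for y in chunk]
```

The per-chunk overhead is only worth paying when there are many instances, which mirrors why one-off predictions are a poor fit for this pattern.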

### Why it Works

- The stateless serving function is set up for low-latency serving to support thousands of simultaneous queries
- Using this for periodic processing can be time consuming and expensive
- If requests are not latency sensitive it is more cost-effective to use a distributed data processing architecture

### Trade-Offs and Alternatives

- Batch serving pattern depends on the ability to split a task across multiple workers
- Even though batch serving is used when latency is not a concern, it is possible to incorporate precomputed results and periodic refreshing to use this in scenarios where the space of possible prediction inputs is limited

#### Batch and Stream Pipelines

- Frameworks like Spark or Apache Beam are useful when the input needs preprocessing before it can be supplied to the model, when outputs require postprocessing, or when either is hard to express in SQL

- Apache Beam is useful if the client code needs to maintain state 
    - e.g. time-windowed average
    - Stop users making multiple comments on a post
        - State needed to keep count of how many times a user commented on a post
- We can do the distributed processing and maintain state with Apache Beam

#### Cached Results of Batch Serving

- Compute and cache work ahead of time
    - e.g. if we have 10 million users and 10,000 items and we want the top 5 ranked items per customer, this would not be feasible to compute in near real time
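
A small-scale sketch of the idea: a batch job scores every user/item pair ahead of time and stores only the top 5 per user, so the online path becomes a dictionary lookup. The `score` function is a made-up deterministic stand-in for the model, and an in-memory dict stands in for a real cache or database.

```python
import heapq


def score(user_id: int, item_id: int) -> float:
    """Hypothetical stand-in for the model's relevance score."""
    return ((user_id * 31 + item_id * 17) % 101) / 100.0


def build_cache(user_ids, item_ids, k=5):
    """Batch job: precompute the top-k items for every user."""
    item_ids = list(item_ids)
    cache = {}
    for user in user_ids:
        cache[user] = heapq.nlargest(k, item_ids, key=lambda item: score(user, item))
    return cache


def recommend(cache, user_id):
    """Online path: a cheap lookup instead of scoring every item."""
    return cache.get(user_id, [])
```

The cache would be refreshed periodically by rerunning `build_cache`, which is where the batch infrastructure does the heavy lifting.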

#### Lambda Architecture
- A production ML system that supports both online serving and batch serving is called a [Lambda Architecture](https://oreil.ly/jLZ46)
- Such an ML system allows practitioners to trade off between latency (via the stateless serving function pattern) and throughput (via the batch serving pattern)

- Typically, a Lambda architecture is supported by having separate systems for online serving and batch serving

## Design Pattern 18: Continued Model Evaluation

### Problem

- The model has been trained and is performing well in the wild
- The model needs to be monitored for unexpected data changes, drops in accuracy, etc.
- The world is dynamic, whereas an ML model is usually a static model built from historical data
- Once a model goes into production it begins to degrade and its predictions can become increasingly unreliable
- Two main reasons for degradation
    - Data drift
    - Concept drift
- Concept drift:
    - The relationship between model inputs and outputs has changed
    - The underlying model assumptions have changed
- Data drift:
    - Change in data being fed into the model for prediction as compared to the data that was used for training
    - e.g. feature distribution shifts
    - Categories are added or removed from features
    
- Model deployment is a continuous process
- To solve for drift it is necessary to update your training dataset to retrain your model with fresh data to improve predictions
    - How do we know re-training is necessary?
    - How often do we re-train?
    - This can be both costly and time consuming
    - Each step of the model development cycle adds additional overhead of development, monitoring and maintenance
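
Two crude checks for the data drift symptoms listed above (new categories and shifted feature distributions) could look like the following. These are simple heuristics chosen for illustration, not the method any particular monitoring tool uses, and the function names are invented.

```python
from statistics import mean, stdev


def new_categories(training_values, serving_values):
    """Categories seen at serving time that never appeared in training."""
    return set(serving_values) - set(training_values)


def mean_shift(training_values, serving_values):
    """Shift of the serving mean, in units of training standard deviations."""
    spread = stdev(training_values)  # needs at least two training values
    gap = abs(mean(serving_values) - mean(training_values))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread
```

A large `mean_shift` or a non-empty `new_categories` result is a hint that the serving data no longer resembles the training data and that retraining may be worth considering.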

### Solution

- Continuous monitoring
- Use the evaluation metrics you used for training

#### Concept

- Continuous evaluation requires access to raw prediction request data and the predictions the model generated as well as ground truth, all in the same place

- In most situations it may take time for the ground truth labels to become available.
    - e.g. churn, may not be known until the next subscription cycle
    - e.g. financial forecasting, the true revenue isn't known until after that quarter's close and earnings report
    
#### Saving Predictions

- Save predictions on a sample of requests
    - Saving a sample over all of them reduces load on the serving system
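
Sampling can be as simple as logging each request with a fixed probability. In this sketch `PredictionLogger` is a hypothetical class and an in-memory list stands in for the real table or log sink.

```python
import random


class PredictionLogger:
    """Keep roughly `sample_rate` of all request/prediction pairs."""

    def __init__(self, sample_rate: float, seed=None):
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)
        self.records = []  # stand-in for a database table or log sink

    def maybe_log(self, request, prediction) -> bool:
        """Log the pair with probability `sample_rate`; report whether it was kept."""
        if self._rng.random() < self.sample_rate:
            self.records.append({"request": request, "prediction": prediction})
            return True
        return False
```

A rate of `0.1` keeps about one request in ten, which cuts the write load on the serving system by roughly 90%.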
    
#### Capturing Ground Truth

- Important to capture ground truth for each instance sent to the model for prediction

- Ground truth labels can be derived from how users interact with the model:
    - By having users take a specific action, it is possible to obtain implicit feedback for a model's prediction or to produce a ground truth label
    - e.g. A user chooses the alternate route in Google Maps; the chosen route serves as an implicit ground truth
    - e.g. When a user rates a recommended movie, this is a clear indication of the ground truth for a model that is built to predict user ratings in order to surface recommendations
    
- Important to keep in mind how the feedback loop of a model is captured
    - e.g. predicting when a customer abandons their basket
        - Model states customer has abandoned their basket
        - Business sends offer
        - User behaviour now influenced and we will never know if the basket was truly abandoned or not
        - Read up on "counterfactual reasoning"
        
#### Continuous Evaluation

- Output captures model version and timestamp of predictions
- This is done so we can compare the output of models

### Why it Works

- When training DS models there is an assumption that the train, validation and test set come from the same distribution.
- When we deploy models to production we assume future data will be similar to past data
- Many ML models in the wild encounter rapidly changing, nonstationary data and models become stale over time. This negatively impacts the quality of predictions

- Continuous model evaluation provides a framework to evaluate a deployed model's performance exclusively on new data.
- This allows us to detect staleness as early as possible
- This helps us determine how frequently to retrain a model

- By capturing prediction inputs and outputs and comparing with ground truth it's possible to quantifiably track model performance or measure how different model versions perform with A/B testing in the current environment

### Trade-Offs and Alternatives

- Continuous evaluation provides a means to monitor model performance and keep models in production fresh
- In this way continuous evaluation provides a trigger for when to re-train.
- It is important to consider tolerance thresholds for model performance, the trade-offs they pose, and the role of scheduled retraining
- Tools like [TFX](https://www.tensorflow.org/tfx/guide/tfdv) can help detect data and concept drift

#### Triggers for Retraining

- Should the model be retrained as soon as its performance dips?
    - Depends. Tied heavily to the business use case and should be discussed alongside evaluation metrics and model assessment
    - Retraining can be expensive (cost wise). The trade-off here is what amount of deterioration of performance is acceptable in relation to this cost


- Threshold for re-training could be an absolute value e.g. if accuracy falls below 95% then retrain or when accuracy takes a downward trend
- What's important when choosing the threshold is to make sure it's similar to that used for checking the model during training
- More sensitive / higher thresholds mean models remain fresher but at the higher cost of retraining. The opposite happens for less sensitive / lower thresholds
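
The two kinds of threshold mentioned above (an absolute value and a downward trend) can be expressed as small predicates over a history of evaluation scores. The 0.95 default mirrors the example above; the three-step trend window is an arbitrary assumption.

```python
def below_threshold(history, threshold=0.95):
    """True when the most recent score is under the absolute threshold."""
    return bool(history) and history[-1] < threshold


def downward_trend(history, window=3):
    """True when the score has dropped `window` evaluations in a row."""
    recent = history[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


def should_retrain(history, threshold=0.95, window=3):
    """Fire the retraining trigger if either condition holds."""
    return below_threshold(history, threshold) or downward_trend(history, window)
```

Raising `threshold` or shrinking `window` makes the trigger more sensitive, which keeps the model fresher at the price of more frequent retraining.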


- Important to track and validate the triggers as well
- Not knowing when your model is re-trained will lead to issues

##### Serverless Triggers

- e.g. AWS Lambda
- Triggers could be messages published to a message queue, a change notification from a cloud storage bucket indicating a file has landed, or HTTPS requests
- Once the event has fired, the function code is executed

#### Scheduled Retraining

- Continuous evaluation provides a crucial signal for knowing when it's necessary to retrain your model
- Continued evaluation may happen every day, scheduled re-training may occur every week, or every month

- Once a new version of the model is trained, its performance is evaluated against the current model version
- The updated model is deployed as a replacement only if it outperforms the previous model with respect to the test set of the current data

- The frequency of scheduled retraining will depend on the business case, prevalence of new data and the cost (time and money).
- Sometimes the time horizon of the model naturally determines when to schedule retraining of jobs
    - e.g. for a model that predicts next quarter's earnings, you will only get ground truth once a quarter. It doesn't make sense to train more frequently than that
- If the volume and frequency of new data is high, it would be beneficial to retrain more frequently

#### Estimating Retraining Interval

- A cheap tactic to understand how data and concept drift affect your model is to train on stale data and assess the performance of that model on more current data
- This mimics the continued evaluation process in an offline environment

## Design Pattern 19: Two-Phase Predictions

### Problem

- Cannot rely on users having an internet connection, so models cannot always make predictions via HTTP requests
- Models deployed at the edge - meaning they are on the device
- Given device constraints, models need to be smaller and to balance trade-off between complexity and size


- To convert a trained model into a format that works on edge devices, models often go through a process known as `quantization` where learned model weights are represented with fewer bytes
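
To make the idea concrete, here is a minimal sketch of linear 8-bit quantization in plain Python. It is an illustration of the principle only, not how any particular framework implements it; real toolchains use more refined schemes.

```python
def quantize(weights, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1]."""
    levels = 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [code * scale + lo for code in codes]
```

Each weight now fits in one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half a quantization step (`scale / 2`), which is the size versus accuracy trade-off in miniature.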


- Need to trade off model size and model accuracy

### Solution

Problem split into two parts:
- Smaller cheaper model deployed on device
- Second more complex model deployed in the cloud and triggered when needed
- This of course requires you to have a problem that can be split into two parts
    - e.g. smart home device. Saying the activation word "Alexa" would be detected by the model running on the edge, but the following question may be sent to the cloud model
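
The control flow of the two phases can be sketched as follows. Both "models" are trivial placeholders (a string check and an echo), and all function names are hypothetical; the point is only that the cloud model is called when the on-device model says so.

```python
def on_device_wake_word(audio_text: str, wake_word: str = "alexa") -> bool:
    """Phase 1: small, cheap model on the device (here a simple string check)."""
    return audio_text.lower().startswith(wake_word)


def cloud_model(query: str) -> str:
    """Phase 2: hypothetical stand-in for the larger model in the cloud."""
    return f"answer to: {query}"


def handle_audio(audio_text: str, wake_word: str = "alexa"):
    """Send the query to the cloud only after the device detects the wake word."""
    if not on_device_wake_word(audio_text, wake_word):
        return None  # nothing leaves the device
    query = audio_text[len(wake_word):].strip(" ,")
    return cloud_model(query)
```

Most audio never triggers phase 2, so the device stays responsive and the cloud model only sees the requests that need it.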
    
### Trade-Offs and Alternatives
- User may not have any internet connection so the two step approach may not be viable
- Have smaller version of complex model for use offline
- Better to have some offline support than none
- Best to use a tool that allows you to quantize your model's weights and other math operations during training as well as after it
    - Known as [quantization aware training](https://oreil.ly/ABd8r)
    - e.g. Google Translate. Robust online model, but smaller versions of the model can be downloaded for use offline
    
#### Offline Support for Specific Use Cases
- Make part of your application available for use offline
- You could cache ML results etc...
- An example is Google Maps, where you can download an area of the map but only driving directions are available

#### Handling Many Predictions in Real Time
- A model may need to make thousands of requests in a day e.g. sensors
- Would be inefficient to send requests continuously to the cloud
- Instead, deploy a model on the edge to identify anomalies, which get sent to the cloud, reducing the number of requests
- A variation of two-phase prediction. The difference here is that both the offline and cloud models perform the same prediction task but with different inputs

## Design Pattern 20: Keyed Predictions

- Typically model in prod makes predictions on same features the model was trained on
- It can be advantageous for a model to also pass through a client-supplied key.
- This is called keyed predictions

### Problem

- You need to process millions of inputs and map them to their corresponding output
- You could process them serially, but it would be much more advantageous to process them using distributed data processing, collect the outputs and send them back
- The problem with distributed computing is that the outputs will be jumbled in terms of order
- Requiring that the outputs be ordered the same way poses scalability challenges, and providing the outputs unordered requires clients to know which output corresponds to which input

- Same problem occurs with an online system serving an array of instances
- Processing a large number of instances locally will lead to hot spots
- Some server nodes may have higher loads than others
- Hot spots can cause you to buy more powerful hardware than you need
- Many online systems impose a limit on the number of instances that can be sent in a request

### Solution 

- The solution is to pass through keys 
- Have client supply a key associated with each input 
- Have the model return the key and the output 
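
A small sketch of the pass-through idea, using a thread pool as a stand-in for distributed workers. `predict_with_key` is a hypothetical model wrapper (the "prediction" is just a sum), and results are deliberately collected in completion order to show that the key is what ties each output back to its input.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def predict_with_key(key, features):
    """The model returns the client's key untouched alongside its output."""
    return key, sum(features)


def keyed_batch_predict(keyed_inputs, workers=4):
    """Score inputs in parallel; outputs may arrive in any order."""
    outputs = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(predict_with_key, key, features)
                   for key, features in keyed_inputs.items()]
        for future in as_completed(futures):  # completion order is arbitrary
            key, prediction = future.result()
            outputs[key] = prediction  # the key restores the mapping
    return outputs
```

Because every output carries its key, the server never has to preserve ordering and the client can still match each prediction to the input that produced it.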