# Reproducibility Design Patterns

- Unit testing produces a deterministic output
- This reproducibility is difficult in ML
- Many models start with random values which are adjusted during training
- It is possible to set a `random_state` which returns the same results each time
- Beyond the random seed, many other artifacts need to be fixed to ensure reproducibility during training
- ML also has multiple stages (e.g. training, deployment), and each of these needs to be reproducible as well

## Design Pattern 21: Transform

- This pattern makes moving ML into production easier by keeping inputs, features and transforms carefully separate

### Problem

- Inputs to ML models are not the features that the ML model uses
- e.g. words are not directly used in models and need to be converted into some numerical form
- Need to keep track of transforms, otherwise we can't reproduce them in prod


- Training-serving skew is caused by differences between training and prod.
- e.g. if Wednesday is encoded as a 3 during training, prod needs to use the same encoding. Some libraries may encode Tuesday as a 3 instead

### Solution

- Explicitly capture the transformations applied to convert the model inputs into features
- In sklearn you would pickle the transformer
- Load in the pickled transformer and use this to transform new data into the required model inputs
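
A minimal sketch of the idea, using only the standard library rather than sklearn. The `fit_vocabulary` and `transform` helpers and the sample data are hypothetical; the point is that the mapping fitted at training time is saved and reloaded rather than re-derived in prod.

```python
import pickle

# Hypothetical training inputs: raw day-of-week strings.
training_days = ["monday", "wednesday", "tuesday", "wednesday", "friday"]


def fit_vocabulary(values):
    """Map each distinct raw value to an integer index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def transform(vocabulary, values):
    """Convert raw inputs into the integer features the model expects."""
    return [vocabulary[value] for value in values]


# Training time: fit the transform and serialise it alongside the model.
vocabulary = fit_vocabulary(training_days)
saved_transform = pickle.dumps(vocabulary)  # in practice, written to a file

# Serving time: load the saved transform instead of re-deriving it.
loaded_vocabulary = pickle.loads(saved_transform)
features = transform(loaded_vocabulary, ["wednesday", "friday"])
```

Because the serving code reuses the saved mapping, "wednesday" gets the same index in prod as it did in training.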

## Design Pattern 22: Repeatable Splitting

- Ensure that sampling is repeatable and reproducible
- Needs a deterministic hash function to split data into train, validation and test sets

### Problem 

- Not good to randomly split rows in data
    - It's rare that rows in data are independent
    - e.g. a model to predict flight delays. On a day with lots of delays, many flights on that day will be delayed. Having some of these flights in training and some in test leads to data leakage. The leakage is due to correlated rows, which is a common problem that we need to avoid during ML training.
- Randomly splitting rows is also bad as it can lead to a different dataset each time which is bad for reproducibility and comparing models.
- Set a random seed on the splitting or store the data in advance to get around this

- For ML we need repeatable splitting and ensure correlated rows fall into the same split.
    - e.g. don't want flights on 28th June 2022 in both train and test set
    
### Solution

- Find the column that captures the correlation relationship between rows
    - In our example of plane delays it would be the `date` column
- Hash the values in the column and apply a modulo to split the data into train, validation and test
    - All flights on a given day will have the same hash value because they share the same `date`
    - This makes it repeatable

- Take the hash modulo ten, e.g. `% 10`
    - If the value is < 8 it goes into training
    - If the value is =8 it goes into validation
    - If the value is =9 it goes into test
    - This is how we get the 80%, 10% and 10% split for training, validation and test sets
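
The steps above can be sketched as follows. This assumes dates are stored as strings and uses `hashlib.sha256` as the hash; the choice of SHA-256 and the `assign_split` name are illustrative assumptions, and any deterministic hash would do.

```python
import hashlib


def assign_split(date: str) -> str:
    """Deterministically assign a date string to train, validation or test."""
    # hashlib gives the same digest on every run. Python's built-in hash()
    # is salted per process for strings, so it would not be repeatable.
    digest = hashlib.sha256(date.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "validation"
    return "test"


# Hypothetical rows: two flights on the same day and one on another day.
flights = [
    {"date": "2022-06-28", "flight": "AA100"},
    {"date": "2022-06-28", "flight": "BA200"},
    {"date": "2022-06-29", "flight": "AA100"},
]
splits = [assign_split(row["date"]) for row in flights]
```

Both flights on 2022-06-28 always land in the same split, whichever split that happens to be.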
    
- The `date` column makes sense to split on here because:
    - Rows at the same date tend to be correlated
    - `date` is not an input into the model. We can extract other features instead such as day of week.
    - We must have enough `date` values. We are computing the hash and finding the modulo with respect to 10, we need at least 10 unique hash values. The more unique values we have the better. A rule of thumb: number of unique values should be 3 - 5x the denominator for the modulo. In our case we want 40 or so unique dates
    - Labels have to be well distributed among dates. If all delays happen on July 1st and there are no delays on other days of the year, this wouldn't work since the split dataset will be skewed
    
_We can automate checking whether the label distributions are similar across the three datasets by using the Kolmogorov-Smirnov test. Just plot the cumulative distribution functions of the label in the three datasets and find the maximum distance between each pair. The smaller the maximum distance, the better the split._
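
A rough sketch of that check, computing only the maximum distance between two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic, without a p-value). The `ks_statistic` name and the sample delays are made up for illustration; SciPy's `scipy.stats.ks_2samp` is the usual library route.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_distance = 0.0
    # The largest gap between two step functions occurs at a data point.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_distance = max(max_distance, abs(cdf_a - cdf_b))
    return max_distance


# Hypothetical label values (delay in minutes) in two splits.
train_delays = [0, 5, 10, 15, 20, 60]
skewed_test_delays = [60, 70, 80]
distance = ks_statistic(train_delays, skewed_test_delays)
```

Running this for each pair of splits gives three distances; a large value for any pair suggests the labels are not evenly spread across the hashed column.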

### Trade-Offs and Alternatives

#### Random Split

- If the rows are not correlated we can do a random split which is repeatable
- If there is no natural column to split by, hash the entire row of data by converting it to a string and take the modulo like above
- If there are duplicate rows they will have the exact same hash and end up in the same split. If this is not what you desire add a unique ID column

#### Split on Multiple Columns

- If a combination of columns captures when two rows are correlated, simply concatenate the fields before computing the hash
- This would help in the flight delay scenario above, where we would concatenate the airport name and date, hash the result and then take the modulo.
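
A sketch of hashing on several columns, assuming rows are plain dictionaries. The `bucket_for` helper, the `|` separator and the column names are hypothetical. Passing every column reproduces the "hash the entire row" fallback from the random split notes.

```python
import hashlib


def bucket_for(row: dict, columns: list, num_buckets: int = 10) -> int:
    """Hash the concatenated values of `columns` into one of `num_buckets`."""
    # A separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    key = "|".join(str(row[column]) for column in columns)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


row_a = {"airport": "LHR", "date": "2022-06-28", "delay": 12}
row_b = {"airport": "LHR", "date": "2022-06-28", "delay": 45}

# Rows sharing airport and date fall into the same bucket.
same_bucket = bucket_for(row_a, ["airport", "date"]) == bucket_for(row_b, ["airport", "date"])

# Hashing all columns (in a fixed order) treats the whole row as the key.
whole_row_bucket = bucket_for(row_a, sorted(row_a))
```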

#### Sequential Split

- Time series will need sequential splits
    - e.g. train on past 45 days to predict the next 14 days
- Such splits are also useful for fast-changing environments
    - e.g. bad actors in fraud quickly adapt to the fraud algorithm. As a result the model needs to be repeatedly re-trained on the latest data.
        - Not sufficient to generate a random split from historical data because the goal is to predict behaviour that the bad actors will exhibit in the future
        - The indirect goal is the same as that of a time-series model, in that a good model will be able to train on historical data and predict future fraud. The data has to be split sequentially in terms of time to correctly evaluate this.
- Another place where sequential splits make sense is high correlations between successive times.
    - e.g. weather forecasting.
    - The weather on consecutive days is highly correlated
    - Not reasonable to put Jan 14th in training set and Jan 15th in test set because there will be leakage
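
A small sketch of a sequential split, assuming each row carries a `date` and that a single cutoff date separates training from evaluation. The `sequential_split` helper and the generated rows are illustrative only.

```python
from datetime import date, timedelta


def sequential_split(rows, cutoff):
    """Train on rows strictly before `cutoff`, test on rows on or after it."""
    train = [row for row in rows if row["date"] < cutoff]
    test = [row for row in rows if row["date"] >= cutoff]
    return train, test


start = date(2022, 1, 1)
# Hypothetical daily observations: 45 days of history plus a 14-day horizon.
rows = [{"date": start + timedelta(days=i), "value": i} for i in range(59)]
train, test = sequential_split(rows, cutoff=start + timedelta(days=45))
```

Every training row is older than every test row, so the evaluation mimics predicting the future from the past. Strongly autocorrelated data may also need a gap between the two sets, which this sketch does not add.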

#### Stratified Splits

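- Split so that each subset keeps roughly the same label distribution as the full dataset, which matters when labels are imbalanced
- Widely documented elsewhere, so not covered in detail here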

#### Unstructured Data

- Photos, videos, text
- Use metadata to split the samples
    - Be careful of leakage, e.g. videos shot on the same day

## Design Pattern 23: Bridged Schema

- Provides a way to adapt the data used to train a model from its older, original data schema to a newer, better data schema
- Useful when an input provider improves their data feed, because it can take time to collect enough data in the new schema
- This pattern allows us to use as much of the new data as is available, but augment it with some of the older data to improve accuracy

### Problem

- We have a point of sale application that determines how much to tip the delivery driver
- One of the inputs is "cash" or "card"
- Card is then updated to now be "gift card", "debit card", "credit card"
- This is valuable because the tipping behaviour varies between the cards
- At prediction time this information is already available and we would like to use it as soon as possible
- Cannot train a model on exclusively new data because the quantity of new data is rather small

### Solution

- The solution is to bridge the schema of the old data to match the new data
- Then we train an ML model using as much of the new data as is available and augment it with older data

#### Bridged Schema

- In the new schema the card category is much more granular ("gift card", "debit card", "credit card")
- We know that a transaction coded as "card" in the old data would have been one of these types, but the actual type was not recorded
- It is possible to bridge the schema probabilistically or statically
- Static is recommended 

#### Probabilistic Method

- Estimated from newer data that 10% are gift cards, 30% are debit cards and 60% are credit cards
- Each time an older training example is loaded we generate a random number in \[0, 100\)
    - \[0, 10\) = gift card
    - \[10, 40\) = debit card
    - \[40, 100\) = credit card
- Provided we train for enough epochs, any training example would be presented as all three categories, but proportional to the actual frequency of occurrence
- New training examples will use the actual recorded value


- Justification is we treat each older example as having happened hundreds of times 
- As the trainer goes through the data, in each epoch, we simulate one of those instances
- In the simulation, we expect that 10% of the time that a card was used the transaction would have occurred with a gift card
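
A sketch of the probabilistic method using the 10/30/60 estimates above. The `bridge_probabilistically` name is hypothetical, and the random generator is seeded only so the sketch is repeatable.

```python
import random


def bridge_probabilistically(payment_type, rng):
    """Replace the old 'card' value with a sampled new-schema card type."""
    if payment_type != "card":
        return payment_type  # 'cash' and new-schema values pass through
    draw = rng.random() * 100  # uniform in [0, 100)
    if draw < 10:
        return "gift card"
    if draw < 40:
        return "debit card"
    return "credit card"


rng = random.Random(42)
# Simulate loading the same old example over many epochs.
simulated = [bridge_probabilistically("card", rng) for _ in range(10_000)]
shares = {
    kind: simulated.count(kind) / len(simulated)
    for kind in ("gift card", "debit card", "credit card")
}
```

Over many draws the shares approach 0.1, 0.3 and 0.6, which is what the static method encodes directly.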

#### Static Method

- Categorical variables usually one-hot-encoded
- If we train for long enough the average one-hot encoded value presented would be `[0, 0.1, 0.3, 0.6]` where the first value is the cash category


- To bridge the older data to the newer schema we can transform the older categorical data into this representation, where we insert the a priori probability of the new classes as estimated from the training data.
- Newer data on the other hand will have `[0, 0, 1, 0]` for a transaction that is known to have been paid by a debit card
- Static method is preferred because it is effectively what happens if the probabilistic method runs for long enough.
- It is also simpler to implement since every card payment from the old data will have the exact same value (`[0, 0.1, 0.3, 0.6]`)
- We can update the older data in one line of code
- This is also computationally less expensive
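
A sketch of the static method, assuming the category order `[cash, gift card, debit card, credit card]` used above. The `encode_payment` helper and constant names are illustrative.

```python
CATEGORIES = ["cash", "gift card", "debit card", "credit card"]
# A priori card-type frequencies estimated from the newer data.
CARD_PRIORS = {"gift card": 0.1, "debit card": 0.3, "credit card": 0.6}


def encode_payment(payment_type):
    """One-hot encode new-schema values; spread old 'card' over the priors."""
    if payment_type == "card":
        return [CARD_PRIORS.get(category, 0.0) for category in CATEGORIES]
    return [1.0 if category == payment_type else 0.0 for category in CATEGORIES]


old_card = encode_payment("card")
new_debit = encode_payment("debit card")
```

Every old "card" row gets the same vector, so the bridge is a single deterministic transform rather than a per-epoch random draw.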

#### Augmented Data

- If 95% of the data is in the old schema and 5% is in the new schema, what should the data split be?
- Models need to make predictions on new unseen data. The unseen data in this case will exclusively be in the new schema.
- Could set aside 2,000 examples in the new schema and add this to your evaluation set along with some from the bridged schema.

- How do we know 2,000 examples is enough? We can test this by evaluating the metric of the current production model (trained on old schema) on subsets of its evaluation dataset and determining how large the subset has to be before the evaluation metric is consistent.
- Start with a randomly selected sample size of 100 and increase in steps of 100 to 5000. This is only for the new datapoints
    - At each step take a random sample and calculate the evaluation metric
    - Re-run each step 25 times to calculate the standard deviation of the metric
    - Plot a graph where the x-axis is the evaluation set size and the y-axis is the standard deviation of the evaluation metric at each size
    - The point where the line plateaus is the ideal number of data points for the evaluation set
- Down side is we don't know how many older examples we need
    - This is a hyperparameter we will need to tune
- For best results, use the smallest number of older examples that we can get away with
- Over time, as the number of new examples grows, we'll rely less and less on bridged examples
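
The evaluation-size procedure can be sketched like this. The per-example errors are simulated stand-ins for a real model's evaluation results, the metric is assumed to be mean absolute error, and the step size is coarsened to 500 (rather than 100) purely to keep the sketch quick. The `metric_spread` name is hypothetical.

```python
import random
import statistics


def metric_spread(errors, sample_size, repeats, rng):
    """Std dev of the mean error across `repeats` random samples."""
    means = []
    for _ in range(repeats):
        sample = rng.sample(errors, sample_size)
        means.append(sum(sample) / sample_size)
    return statistics.stdev(means)


rng = random.Random(0)
# Hypothetical per-example absolute errors of the production model.
errors = [abs(rng.gauss(0, 1)) for _ in range(10_000)]

# Re-run each evaluation size 25 times and record the spread of the metric.
spread_by_size = {
    size: metric_spread(errors, size, repeats=25, rng=rng)
    for size in range(100, 5001, 500)
}
```

Plotting `spread_by_size` and looking for where the curve flattens gives a rough answer to how many new-schema examples the evaluation set needs.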

### Trade-Offs and Alternatives

#### Union Schema

- Could just take a union of both data sets. This would make the new and old datasets compatible, so the possible values would be `[cash, card, debit card, credit card, gift card]`
- But new predictions will never have `card` as a value, so this breaks the schema!

#### Cascade Method

- Impute the value, could take mean.
- Or use the a priori frequencies as shown above


- Can add a cascade model. See Design pattern 8.
    - Train a model on the new data with card types
    - Output of this model is used to train the second model
    
#### Handling New Features

- Bridging might be needed when the input provider adds extra information
- If we have new features that we want to use immediately, we should bridge the older data (where this new feature will be missing) by imputing a value for this feature
- To impute we should try to use:
    - The mean value of the feature if the feature is numeric and normally distributed
    - The median if the feature is numeric and skewed or has lots of outliers
    - Median value of the feature if the feature is categorical and sortable
    - Mode of the feature if the feature is categorical and not sortable
    - Frequency of the feature being true if it's boolean
    
- Taking the example of taxi journeys if the new feature is number of idle minutes we could use the median value 
- If the feature is a boolean for raining or not raining, its imputed value can be something like 0.02 if it rains 2% of the time in the new data

- The cascade pattern approach remains viable for all these cases, but a static imputation is often simpler and often sufficient
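
A sketch of the imputation rules above, using the standard `statistics` module. The `impute_value` helper, its `kind` labels and the sample values are hypothetical, and deciding which kind a feature belongs to is left to the caller.

```python
import statistics


def impute_value(values, kind):
    """Pick a static fill value for a feature missing from the older data."""
    if kind == "numeric_normal":
        return statistics.mean(values)
    if kind == "numeric_skewed":
        return statistics.median(values)
    if kind == "categorical_sortable":
        # median_low always returns a value that actually occurs in the data.
        return statistics.median_low(values)
    if kind == "categorical_unsortable":
        return statistics.mode(values)
    if kind == "boolean":
        return sum(values) / len(values)  # frequency of True
    raise ValueError(f"unknown feature kind: {kind}")


# Hypothetical values observed in the newer data.
idle_minutes = impute_value([0, 1, 2, 3, 60], "numeric_skewed")
raining = impute_value([True] * 2 + [False] * 98, "boolean")
colour = impute_value(["red", "blue", "red"], "categorical_unsortable")
```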

## Design Pattern 24: Windowed Inference

- Handles instances where the model requires an ongoing sequence of instances in order to run inference
- This pattern is useful when an ML model requires features that need to be computed from aggregates over time windows

### Problem

- Imagine we have flight data looking at arrival delays for flights
- The arrival delays will naturally exhibit variability
- But it still should be possible to note unusually large arrival delays
- The definition of "unusual" will vary by context
    - Early in the day flights are delayed less
    - In the afternoon flights are delayed more
    - The context here is "time of day"


- Determining whether a specific delay is anomalous depends on a time context.
    - e.g. arrival delays observed over the past two hours
- To determine that a delay is anomalous requires that we first sort the dataframe based on time
- We then apply an anomaly detection function to sliding windows of two hours
- The function to detect the anomaly can be complex but a simple thing to do is highlight values 4 standard deviations from the mean in the two hour window


- This works on training data because the entire dataframe is at hand
- When running inference we will not have the entire dataframe available
- In prod, we will be receiving flight information one by one as each flight arrives
- All that we will have is a single delay value at a timestamp 
    - `2022-01-24 07:30:00, 43.0`
- Given that the flight above is 43 minutes delayed is that unusual or not?
- Normally, to carry out inference on a flight we only need the features of that flight
- In this case, however, the model requires information about all flights to the airport between `05:30` and `07:30`
- It is not possible to carry out inference one flight at a time. We need to somehow provide the model with information about all the previous flights


- How do we carry out inference when the model requires not just one instance, but a sequence of instances?

### Solution

- Carry out stateful stream processing, that is, processing that keeps track of the model state through time


- Sliding window applied to flight arrival data.
    - The sliding window will be over 2 hours but the window can be closed more often such as every 10 mins
    - In such a case, aggregate values will be calculated every 10 mins over the previous 2 hours
- Internal model state (this could be a list of flights) is updated with flight information every time a new flight arrives, thus building a 2-hour historical record of flight data
- Every time a window is closed (e.g. every 10 mins), a time series ML model is trained on the 2-hour list of flights. This model is then used to predict future flight delays and the confidence bounds of such predictions
- Time series model parameters are externalised into a state variable. Could use ARIMA or an LSTM; in such cases, the model params would be the ARIMA coefficients or the LSTM weights.
    - To keep the code understandable, we will use a zero-order regression model, so our model parameters will be the average flight delay and the variance of the flight delays over the two-hour window
- When a flight arrives, its arrival delay can be classified as anomalous or not using the externalised model state
- Every time the window is closed, the output is extracted. The output here is the externalised model state and consists of the model parameters.
    - The params here are the mean delay time and the acceptable deviation, which is 4 * the standard deviation
    - Any new flight coming in is compared against the model state and, if outside the acceptable deviation, is considered anomalous
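
A single-process sketch of the idea, without a real stream processing framework. The `DelayAnomalyDetector` class and the delay values are hypothetical; the model state is just the mean and 4 * standard deviation of delays in the last two hours.

```python
import statistics
from collections import deque
from datetime import datetime, timedelta


class DelayAnomalyDetector:
    """Keeps a sliding window of recent delays and flags unusual arrivals."""

    def __init__(self, window=timedelta(hours=2), num_std=4.0):
        self.window = window
        self.num_std = num_std
        self.history = deque()  # (timestamp, delay) pairs inside the window
        self.mean = None  # externalised model state
        self.allowed_deviation = None

    def add_flight(self, timestamp, delay):
        """Update internal state and drop flights older than the window."""
        self.history.append((timestamp, delay))
        while self.history[0][0] <= timestamp - self.window:
            self.history.popleft()

    def close_window(self):
        """Refit the zero-order model on the flights currently in the window."""
        delays = [delay for _, delay in self.history]
        if len(delays) >= 2:
            self.mean = statistics.mean(delays)
            self.allowed_deviation = self.num_std * statistics.stdev(delays)

    def is_anomalous(self, delay):
        """Compare one delay against the externalised model state."""
        if self.mean is None:
            return False  # no model state yet
        return abs(delay - self.mean) > self.allowed_deviation


detector = DelayAnomalyDetector()
first_arrival = datetime(2022, 1, 24, 5, 30)
# Hypothetical delays (minutes) for flights arriving every 10 minutes.
for i, delay in enumerate([5, 7, 6, 8, 5, 6, 7, 9, 6, 5, 7, 6]):
    detector.add_flight(first_arrival + timedelta(minutes=10 * i), delay)
detector.close_window()
flagged = detector.is_anomalous(43.0)  # the 07:30 flight from the notes
```

In prod, `close_window` would be triggered on a schedule (e.g. every 10 mins) while `is_anomalous` runs per flight against the latest state.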

### Trade-Offs and Alternatives

#### High-throughput data stream
- If we receive 5000 items a second then the in-memory dataframe over 10 mins will contain 3 million rows
- The memory requirements can become considerable


- Storing all the records in order to compute the model parameters at the end of the window can become problematic
- When the data stream is high throughput, it becomes important to be able to update the model parameters with each element
- An online learning model may be best here (see page 279 in book)
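
One way to update the mean and variance with each element, instead of storing every record, is Welford's algorithm. This is a swapped-in illustration rather than the book's code. It tracks statistics over everything seen so far, not a sliding window, and the `RunningStats` name is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for delay in [5, 7, 6, 8]:  # hypothetical delays arriving one by one
    stats.update(delay)
```

Memory use stays constant no matter how many items arrive, at the cost of not being able to forget old items without extra bookkeeping.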

#### Batching Requests
- Applies when the model is deployed in the cloud but the client is embedded in a device
- Sending inference requests one by one to the cloud might be overwhelming
- Collect a number of requests
- Then service those requests as a batch 
- See Design Pattern 19: Two-Phase Prediction
- This is suitable for latency tolerant use cases
- If we collect inputs for 5 minutes then the client will have to tolerate delays of up to 5 minutes to get back predictions
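
A client-side sketch of collecting requests and servicing them as a batch. It flushes on batch size rather than on a 5-minute timer to stay simple; the `RequestBatcher` class and the `predict_batch` callable (standing in for the call to the cloud model) are assumptions.

```python
class RequestBatcher:
    """Collects requests and sends them to the model in batches."""

    def __init__(self, predict_batch, max_batch_size):
        self.predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.pending = []

    def submit(self, request):
        """Queue a request; return predictions only when a batch is sent."""
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size:
            return self.flush()
        return []

    def flush(self):
        """Send whatever is pending, e.g. when a timer expires."""
        batch, self.pending = self.pending, []
        return self.predict_batch(batch) if batch else []


# Hypothetical model: doubles each input.
batcher = RequestBatcher(lambda batch: [x * 2 for x in batch], max_batch_size=3)
results = [batcher.submit(x) for x in (1, 2, 3, 4)]
leftover = batcher.flush()
```

The first two submissions return nothing, the third returns predictions for all three, and the fourth waits until `flush` is called.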

## Design Pattern 25: Workflow Pipeline

- The workflow pipeline addresses the problem of creating an end-to-end reproducible pipeline by containerising and orchestrating the steps in our machine learning process

### Problem

- A data scientist is able to run data processing, training and model deployment steps in an end-to-end fashion with a single notebook
- However, as each step in the ML pipeline becomes more complex and more people within the organisation want to contribute to this code base, running these steps from a single notebook does not scale 


- In monolithic applications the application's logic is handled by a single program.
- To test a small feature in a monolithic app we must run the entire program. The same goes for deploying or debugging
- Deploying a small bug fix requires deploying the entire application
- When the codebase is inextricably linked, it becomes difficult for individual developers to debug errors and work independently on different parts of the application
- Recently monolithic apps have been replaced by a microservices architecture, where individual pieces of business logic are built and deployed as isolated packages of code.
- With microservices, a large application is split into smaller, more manageable parts so that developers can build, debug and deploy pieces of an application independently


- When someone is building an ML model on their own, a "monolithic" approach may be faster to iterate on.
- It also works because one person is actively involved in developing and maintaining each piece of the pipeline e.g.
    - Data collection
    - Data validation
    - Data preprocessing
    - Model building
    - Training and validation
    - Model deployment
- When scaling this workflow different parts of the organisation may be responsible for different steps
- To scale the ML workflow, we need a way for the team building out the model to run trials independently of the data processing step
- We also need to keep track of performance for each step of the pipeline and manage the output files generated by each part of the process


- When the development of each step is complete, we'll want to schedule operations like retraining, or create event-triggered pipeline runs that are invoked in response to changes in your environment
    - e.g. new training data added to the bucket
- In such cases it'll be necessary for the solution to allow us to run the entire workflow end-to-end in one call while still being able to track output and trace errors from individual steps
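
A toy, in-process stand-in for that requirement: one call runs every step, records each step's output and reports which step failed. It does not containerise or orchestrate anything, and the `run_pipeline` helper and step functions are hypothetical.

```python
def run_pipeline(steps, initial_input):
    """Run named steps in order, keeping each output and the first error."""
    outputs = {}
    data = initial_input
    for name, step in steps:
        try:
            data = step(data)
        except Exception as error:  # record which step broke, then stop
            return {"status": "failed", "failed_step": name,
                    "error": repr(error), "outputs": outputs}
        outputs[name] = data
    return {"status": "succeeded", "failed_step": None,
            "error": None, "outputs": outputs}


def collect(_):
    return [3.0, 1.0, None, 2.0]  # hypothetical raw data with a bad row


def validate(rows):
    return [row for row in rows if row is not None]


def preprocess(rows):
    top = max(rows)
    return [row / top for row in rows]


def train(rows):
    return {"mean": sum(rows) / len(rows)}  # placeholder "model"


steps = [("collect", collect), ("validate", validate),
         ("preprocess", preprocess), ("train", train)]
result = run_pipeline(steps, initial_input=None)
```

Because each step is a separate function with a recorded output, one team could swap out `train` and re-run it on the saved `preprocess` output without touching the earlier steps.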