# Reproducibility Design Patterns

- Unit testing produces a deterministic output
- This reproducibility is difficult in ML
- Many models start with random values which are adjusted during training
- It is possible to set a `random_state` which returns the same results each time
- Beyond the random seed, many other artifacts need to be fixed to ensure reproducibility during training
- ML also has multiple stages (e.g. training and deployment), and each of these needs to be reproducible as well

## Design Pattern 21: Transform

- This pattern makes moving ML models into production easier by keeping inputs, features and transforms carefully separate

### Problem

- Inputs to ML models are not the features that the model actually uses
- e.g. words are not directly used in models and need to be converted into some numerical form
- We need to keep track of the transforms, otherwise we can't reproduce them in prod


- Training-serving skew is caused by differences between training and prod.
- e.g. if Wednesday is encoded as 3 during training, we need to know this in prod. Some libraries may encode Tuesday as 3 instead

### Solution

- Explicitly capture the transformations applied to convert the model inputs into features
- In sklearn you would pickle the transformer
- Load in the pickled transformer and use this to transform new data into the required model inputs
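
A minimal sketch of the save-then-reload idea, using only the standard library. A plain dict stands in for a fitted scikit-learn transformer here, and `fit_day_encoder` and `transform_days` are hypothetical helper names, not library functions.

```python
import pickle


def fit_day_encoder(days):
    """Learn a day -> integer mapping from the training data (sorted for determinism)."""
    return {day: index for index, day in enumerate(sorted(set(days)))}


def transform_days(encoder, days):
    """Apply a previously fitted mapping; unseen values raise KeyError."""
    return [encoder[day] for day in days]


# Training time: fit the transform and persist it alongside the model.
train_days = ["wednesday", "monday", "tuesday", "wednesday"]
encoder = fit_day_encoder(train_days)
saved_bytes = pickle.dumps(encoder)  # in practice this would be written to a file

# Serving time: load the same transform instead of re-fitting it on new data.
loaded_encoder = pickle.loads(saved_bytes)
serving_features = transform_days(loaded_encoder, ["wednesday", "monday"])
```

Because the mapping is loaded rather than re-fitted, "wednesday" gets the same integer in prod as it did during training.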

## Design Pattern 22: Repeatable Splitting

- Ensure that sampling is repeatable and reproducible
- Needs a deterministic hash function to split data into train, validation and test sets

### Problem 

- Not good to randomly split rows in data
    - It's rare that rows in data are independent
    - e.g. a model to predict flight delays: on a day with lots of delays, many flights on that day will be delayed. Having some of these flights in training and some in test leaks information between the splits. The leakage is due to correlated rows, which is a common problem that we need to avoid during ML training.
- Randomly splitting rows is also bad as it can lead to a different dataset each time which is bad for reproducibility and comparing models.
- Set a random seed on the splitting or store the data in advance to get around this

- For ML we need repeatable splitting and ensure correlated rows fall into the same split.
    - e.g. don't want flights on 28th June 2022 in both train and test set
    
### Solution

- Find the column that captures the correlation relationship between rows
    - In our example of plane delays it would be the `date` column
- Hash the values in the column and apply a modulo to split the data into train, validation and test
    - All flights on a given day will have the same hash value because they share the same `date`
    - This makes it repeatable

- Take the hash modulo 10, i.e. `% 10`
    - If the value is < 8 it goes into training
    - If the value is =8 it goes into validation
    - If the value is =9 it goes into test
    - This is how we get the 80%, 10% and 10% split for training, validation and test sets
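
The bucketing above can be sketched as follows. This is an illustration under assumptions: `hashlib.sha256` is used as the deterministic hash (Python's built-in `hash()` is salted per process for strings, so it would not be repeatable), and `assign_split` is a hypothetical helper name.

```python
import hashlib


def assign_split(key: str) -> str:
    """Map a split key (e.g. a date string) to train/validation/test via hash % 10."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"       # buckets 0-7: roughly 80%
    if bucket == 8:
        return "validation"  # bucket 8: roughly 10%
    return "test"            # bucket 9: roughly 10%


# Hypothetical flight rows; only the `date` column drives the split.
flights = [
    {"date": "2022-06-28", "flight": "AA100"},
    {"date": "2022-06-28", "flight": "BA200"},
    {"date": "2022-06-29", "flight": "AA100"},
]
splits = [assign_split(flight["date"]) for flight in flights]
```

Both flights on 2022-06-28 land in the same split, and re-running the code gives the same assignment every time.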
    
- The `date` column here makes sense to split on because:
    - Rows on the same date tend to be correlated
    - `date` is not an input into the model. We can extract other features instead, such as day of week.
    - We must have enough `date` values. Since we compute the hash and take it modulo 10, we need at least 10 unique hash values, and the more unique values the better. A rule of thumb: the number of unique values should be 3-5x the modulo denominator, so in our case we want 40 or so unique dates
    - Labels have to be well distributed among dates. If all delays happen on July 1st and there are no delays on other days of the year, this wouldn't work since the split datasets will be skewed
    
_We can automate checking whether the label distributions are similar across the three datasets by using the Kolmogorov-Smirnov test. Plot the cumulative distribution functions of the label in the three datasets and find the maximum distance between each pair. The smaller the maximum distance, the better the split._
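
As a rough sketch, the maximum distance between two empirical cumulative distribution functions can be computed with the standard library alone. `ks_statistic` is a hypothetical helper name; it returns only the distance, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(labels_a, labels_b):
    """Maximum vertical gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(labels_a), sorted(labels_b)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical delay labels (minutes) from two of the splits.
train_delays = [0, 5, 10, 15, 60]
test_delays = [0, 5, 12, 20, 55]
distance = ks_statistic(train_delays, test_delays)
```

Running this for each pair of splits (train/validation, train/test, validation/test) gives three distances to compare.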

### Trade-Offs and Alternatives

#### Random Split

- If the rows are not correlated we can do a random split which is repeatable
- If there is no natural column to split by, hash the entire row of data by converting it to a string and take the modulo like above
- If there are duplicate rows they will have the exact same hash and end up in the same split. If this is not what you desire add a unique ID column
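
A possible sketch of hashing a whole row, assuming rows are dicts and that JSON with sorted keys is an acceptable string form. `row_bucket` is a hypothetical helper name.

```python
import hashlib
import json


def row_bucket(row: dict) -> int:
    """Serialise the full row to a string, hash it, and take the result modulo 10."""
    # sort_keys makes the string independent of the order the keys were inserted in.
    serialized = json.dumps(row, sort_keys=True, default=str)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10


row = {"carrier": "AA", "delay_minutes": 12}
duplicate = {"delay_minutes": 12, "carrier": "AA"}
same_bucket = row_bucket(row) == row_bucket(duplicate)
```

Duplicate rows share a bucket here; adding a unique ID field to each row changes the serialised string, so duplicates are then hashed independently.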

#### Split on Multiple Columns

- If a combination of columns captures when two rows are correlated, simply concatenate the fields before computing the hash
- This would help in the flight delay example above, where we would concatenate the airport name and date, hash the result and then take the modulo.
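
A short sketch of splitting on several columns, assuming the rows are dicts with `airport` and `date` fields (both field names and the `split_bucket` helper are illustrative).

```python
import hashlib


def split_bucket(row: dict, columns: list) -> int:
    """Concatenate the chosen columns, hash the result, and take it modulo 10."""
    # A separator keeps ("AB", "C") and ("A", "BC") from producing the same key.
    key = "|".join(str(row[column]) for column in columns)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10


flight_a = {"airport": "DFW", "date": "2022-06-28", "delay_minutes": 35}
flight_b = {"airport": "DFW", "date": "2022-06-28", "delay_minutes": 5}
bucket_a = split_bucket(flight_a, ["airport", "date"])
bucket_b = split_bucket(flight_b, ["airport", "date"])
```

Rows that share both airport and date get the same bucket regardless of their other fields.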

#### Sequential Split

- Time series will need sequential splits
    - e.g. train on past 45 days to predict the next 14 days
- Such splits are also useful in fast-changing environments
    - e.g. bad actors in fraud quickly adapt to the fraud algorithm. As a result the model needs to be repeatedly re-trained on the latest data.
        - It is not sufficient to generate a random split from historical data because the goal is to predict behaviour that the bad actors will exhibit in the future
        - The indirect goal is the same as that of a time-series model: a good model will be able to train on historical data and predict future fraud. The data has to be split sequentially in terms of time to correctly evaluate this.
- Another place where sequential splits make sense is high correlations between successive times.
    - e.g. weather forecasting.
    - The weather on consecutive days is highly correlated
    - Not reasonable to put Jan 14th in training set and Jan 15th in test set because there will be leakage
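
A sequential split can be sketched as a cutoff on the time column. The 45-day and 14-day windows come from the example above; the row layout and the `sequential_split` name are assumptions.

```python
from datetime import date, timedelta


def sequential_split(rows, cutoff, train_days=45, test_days=14):
    """Train on the `train_days` before `cutoff`; test on the `test_days` starting at it."""
    train_start = cutoff - timedelta(days=train_days)
    test_end = cutoff + timedelta(days=test_days)
    train = [row for row in rows if train_start <= row["date"] < cutoff]
    test = [row for row in rows if cutoff <= row["date"] < test_end]
    return train, test


# Hypothetical daily observations covering 90 consecutive days.
start = date(2022, 1, 1)
rows = [{"date": start + timedelta(days=i), "value": i} for i in range(90)]
train, test = sequential_split(rows, cutoff=date(2022, 3, 1))
```

Every training row is strictly earlier than every test row, so the evaluation mimics predicting the future from the past.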

#### Stratified Splits

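- Split within each stratum (e.g. each label class) so that every split keeps roughly the same proportions; well documented elsewhere and simple enough to look up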

#### Unstructured Data

- Photos, videos, text
- Use metadata to split the samples
    - Be careful of leakage, e.g. videos shot on the same day

## Design Pattern 23: Bridged Schema

- Provides a way to adapt the data used to train a model from its older, original data schema to newer, better data
- Useful when an input provider improves their feed, since it can take time before enough data is available in the new schema
- This pattern allows us to use as much of the new data as is available, but augment it with some of the older data to improve accuracy

### Problem

- We have a point of sale application that determines how much to tip the delivery driver
- One of the inputs is "cash" or "card"
- Card is then updated to now be "gift card", "debit card", "credit card"
- This is valuable because the tipping behaviour varies between the cards
- At prediction time this information is already available and we would like to use it as soon as possible
- Cannot train a model on exclusively new data because the quantity of new data is rather small

### Solution

- The solution is to bridge the schema of the old data to match the new data
- Then we train an ML model using as much of the new data as is available, augmented with older data

#### Bridged Schema

- In the new schema the card category is much more granular ("gift card", "debit card", "credit card")
- We know that a transaction coded as "card" in the old data would have been one of these types, but the actual type was not recorded
- It is possible to bridge the schema probabilistically or statically
- Static is recommended 

#### Probabilistic Method

- Estimated from newer data that 10% are gift cards, 30% are debit cards and 60% are credit cards
- Each time an older training example is loaded we generate a random number between \[0, 100\)
    - < 10 = gift card
    - \[10, 40\) = debit card
    - >= 40 = credit card
- Provided we train for enough epochs, any training example would be presented as all three categories, in proportion to the actual frequency of occurrence
- New training examples will use the actual recorded value


- The justification is that we treat each older example as having happened hundreds of times
- As the trainer goes through the data, in each epoch, we simulate one of those instances
- In the simulation, we expect that 10% of the time that a card was used, the transaction would have occurred with a gift card
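
The probabilistic method could look roughly like this, assuming the old value is the string `"card"` and using the 10/30/60 proportions from above. `bridge_probabilistic` is a hypothetical helper name.

```python
import random


def bridge_probabilistic(payment_type: str, rng: random.Random) -> str:
    """Replace an old-schema "card" value with a randomly drawn new-schema card type."""
    if payment_type != "card":
        return payment_type  # cash and new-schema values pass through unchanged
    draw = rng.random() * 100  # uniform in [0, 100)
    if draw < 10:
        return "gift card"
    if draw < 40:
        return "debit card"
    return "credit card"


rng = random.Random(0)  # seeded only to keep this demo repeatable
# One old example as it would be presented across 10,000 epochs.
presented = [bridge_probabilistic("card", rng) for _ in range(10_000)]
gift_share = presented.count("gift card") / len(presented)
```

Over many epochs `gift_share` approaches 0.10, which is the sense in which each old example is "seen" as all three card types.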

#### Static Method

- Categorical variables are usually one-hot encoded
- If we train for long enough, the average one-hot encoded value presented for an old card payment would be `[0, 0.1, 0.3, 0.6]`, where the first value is the cash category


- To bridge the older data to the newer schema, we can transform the older categorical data into this representation, where we insert the a priori probability of the new classes as estimated from the training data.
- Newer data, on the other hand, will have `[0, 0, 1, 0]` for a transaction that is known to have been paid by debit card
- The static method is preferred because it is effectively what happens if the probabilistic method runs for long enough.
- It is also simpler to implement since every card payment from the old data will have the exact same value (`[0, 0.1, 0.3, 0.6]`)
- We can update the older data in one line of code
- This is also computationally less expensive
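
A sketch of the static method, assuming the category order cash, gift card, debit card, credit card used in the vectors above. `encode_payment` is a hypothetical helper name.

```python
CATEGORIES = ["cash", "gift card", "debit card", "credit card"]
CARD_PRIOR = [0.0, 0.1, 0.3, 0.6]  # a priori card-type frequencies from the newer data


def encode_payment(payment_type: str) -> list:
    """One-hot encode new-schema values; old-schema "card" gets the prior vector."""
    if payment_type == "card":
        return list(CARD_PRIOR)
    return [1.0 if payment_type == category else 0.0 for category in CATEGORIES]


old_example = encode_payment("card")
new_example = encode_payment("debit card")
```

Every old card payment maps to the same vector, so no random draw is needed at training time.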

#### Augmented Data

- If 95% of the data is in the old schema and 5% is in the new one, how should the data be split?
- Models need to make predictions on new unseen data. The unseen data in this case will exclusively be in the new schema.
- Could set aside 2,000 examples in the new schema and add this to your evaluation set along with some from the bridged schema.

- How do we know 2,000 examples is enough? We can test this by evaluating the metric of the current production model (trained on the old schema) on subsets of its evaluation dataset and determining how large the subset has to be before the evaluation metric is consistent.
- Start with a randomly selected sample size of 100 and increase in steps of 100 up to 5,000. This is only for the new data points
    - At each step take a random sample and calculate the evaluation metric
    - Re-run each step 25 times to calculate the standard deviation of the metric
    - Plot the standard deviation of the evaluation metric (y-axis) against the evaluation size (x-axis)
    - The point where the line plateaus is the ideal number of data points for the evaluation set
- Down side is we don't know how many older examples we need
    - This is a hyperparameter we will need to tune
- For best results, use the smallest number of older examples that we can get away with
- Over time, as the number of new examples grows, we'll rely less and less on bridged examples
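
The evaluation-size check can be sketched as below. It assumes the metric is the mean of per-example scores (e.g. accuracy from 0/1 correctness), and the scores here are synthetic stand-ins for a real model's results. `metric_std_by_size` is a hypothetical helper name.

```python
import random
import statistics


def metric_std_by_size(scores, sizes, repeats=25, seed=0):
    """For each evaluation size, resample `repeats` times and return the metric's std dev."""
    rng = random.Random(seed)
    std_by_size = {}
    for size in sizes:
        # Metric assumed to be the mean per-example score of a random subset.
        metrics = [sum(rng.sample(scores, size)) / size for _ in range(repeats)]
        std_by_size[size] = statistics.stdev(metrics)
    return std_by_size


# Synthetic per-example correctness (1 = correct, 0 = wrong) for 5,000 examples.
data_rng = random.Random(42)
scores = [1 if data_rng.random() < 0.8 else 0 for _ in range(5000)]

# Steps of 500 keep the demo quick; the notes above use steps of 100.
std_by_size = metric_std_by_size(scores, sizes=range(500, 5001, 500))
```

Plotting `std_by_size` against the size and looking for the plateau gives a candidate evaluation-set size.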

### Trade-Offs and Alternatives