# Reproducibility Design Patterns

- Unit tests produce deterministic output: the same input always gives the same result
- This kind of reproducibility is difficult to achieve in ML
- Many models start with random values that are adjusted during training
- It is possible to set a random seed (e.g. `random_state`) so that the same results are returned each time
- Beyond the random seed, many other artifacts need to be fixed to ensure reproducibility during training
- ML also has multiple stages (e.g. training, deployment), and each of these needs to be reproducible as well
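
A minimal sketch of the seeding idea, using Python's standard `random` module rather than any particular ML library. `init_weights` is a hypothetical helper introduced only for illustration; libraries such as scikit-learn expose the same idea through a `random_state` parameter.

```python
import random

def init_weights(n, seed=None):
    """Return n pseudo-random starting weights; a fixed seed makes them repeatable."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Same seed, same starting weights on every run.
seeded_a = init_weights(3, seed=42)
seeded_b = init_weights(3, seed=42)
print(seeded_a == seeded_b)  # True
```

A seed only fixes the random initialisation. It does not fix the data, the library versions or the other artifacts mentioned above.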

## Design Pattern 21: Transform

- This pattern makes moving ML into production easier by keeping inputs, features and transforms carefully separate

### Problem

- The inputs to an ML model are not the features that the model actually uses
- e.g. words are not used directly in models and need to be converted into some numerical form
- We need to keep track of the transforms, otherwise we can't reproduce them in prod


- Training-serving skew is caused by differences between training and prod.
- e.g. if Wednesday is encoded as 3 during training, prod must encode it the same way. Some libraries may encode Tuesday as 3 instead

### Solution

- Explicitly capture the transformations applied to convert the model inputs into features
- In sklearn you would pickle the fitted transformer
- At serving time, load the pickled transformer and use it to transform new data into the features the model expects
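
The save-and-reload step can be sketched with the standard library alone. This is an illustration under assumptions, not the sklearn API: the fitted transform is a plain dict standing in for a fitted transformer object, and `fit_day_encoder` and `transform_days` are hypothetical helpers.

```python
import pickle

def fit_day_encoder(days):
    """Learn a day-name -> integer mapping from the training data."""
    # Sort so the mapping does not depend on the order the rows arrive in.
    return {day: code for code, day in enumerate(sorted(set(days)))}

def transform_days(mapping, days):
    """Apply a previously fitted mapping to new data."""
    return [mapping[day] for day in days]

# Training time: fit the transform and serialise it alongside the model.
mapping = fit_day_encoder(["monday", "tuesday", "wednesday", "wednesday"])
saved = pickle.dumps(mapping)  # in practice these bytes would be written to a file

# Serving time: load the same transform, so "wednesday" keeps its training code.
loaded = pickle.loads(saved)
print(transform_days(loaded, ["wednesday"]) == transform_days(mapping, ["wednesday"]))  # True
```

Because the serving code never refits the encoder, it cannot silently assign a different code to the same day.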

## Design Pattern 22: Repeatable Splitting

- Ensure that sampling is repeatable and reproducible
- This requires a deterministic hash function to split the data into train, validation and test sets

### Problem 

- Not good to randomly split rows in data
    - It's rare that rows in data are independent
    - e.g. a model to predict flight delays. On a day with lots of delays, many flights on that day will be delayed. Having some of these flights in training and some in test leaks information between the splits. This leakage from correlated rows is a common problem and must be avoided during ML training.
- Randomly splitting rows is also bad as it can lead to a different dataset each time which is bad for reproducibility and comparing models.
- Set a random seed on the splitting or store the data in advance to get around this

- For ML we need splitting that is repeatable and that keeps correlated rows in the same split.
    - e.g. we don't want flights on 28th June 2022 in both the train and test sets
    
### Solution

- Find the column that captures the correlation relationship between rows
    - In our example of plane delays it would be the `date` column
- Hash the values in the column and apply a modulo to split the data into train, validation and test
    - All flights on a given day will have the same hash value because they share the same `date`
    - This makes the split repeatable

- Take the hash modulo ten, e.g. `% 10`
    - If the value is < 8 it goes into training
    - If the value is 8 it goes into validation
    - If the value is 9 it goes into test
    - This is how we get the 80%, 10% and 10% split for training, validation and test sets
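
The hash-and-modulo rule above can be sketched as follows. The choice of SHA-256 from `hashlib` is an assumption made for illustration (any deterministic hash would do), `assign_split` is a hypothetical helper, and the flight rows are made-up data.

```python
import hashlib

def assign_split(date_str):
    """Deterministically map a date string to 'train', 'validation' or 'test'."""
    # hashlib digests are identical on every run, unlike the built-in hash(),
    # which is salted per process for strings.
    digest = hashlib.sha256(date_str.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"       # buckets 0-7: roughly 80% of dates
    if bucket == 8:
        return "validation"  # bucket 8: roughly 10% of dates
    return "test"            # bucket 9: roughly 10% of dates

# Made-up rows: two flights share a date, so they must share a split.
flights = [
    {"date": "2022-06-28", "flight": "A"},
    {"date": "2022-06-28", "flight": "B"},
    {"date": "2022-06-29", "flight": "C"},
]
for row in flights:
    row["split"] = assign_split(row["date"])

print(flights[0]["split"] == flights[1]["split"])  # True
```

The split depends only on the `date` value, so rerunning the code, or adding new rows later, never moves an existing date into a different split.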
    
- The `date` column makes sense to split on here because:
    - Rows at the same date tend to be correlated
    - `date` is not an input into the model. We can extract other features instead such as day of week.
    - We must have enough `date` values. Since we compute the hash modulo 10, we need at least 10 unique hash values, and the more unique values we have the better. A rule of thumb: the number of unique values should be 3 to 5 times the modulo divisor. In our case we want 40 or so unique dates
    - Labels have to be well distributed among dates. If all delays happened on July 1st and none on other days of the year, this wouldn't work, since the split datasets would be skewed
    
_We can automate checking whether the label distributions are similar across the three datasets by using the Kolmogorov-Smirnov test. Just plot the cumulative distribution functions of the label in the three datasets and find the maximum distance between each pair. The smaller the maximum distance, the better the split._
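
The maximum distance between two cumulative distribution functions can be computed directly, without plotting. The sketch below is a plain-Python version of the two-sample Kolmogorov-Smirnov statistic; `ks_statistic` is a hypothetical helper and the label values are made up. In practice a library routine such as `scipy.stats.ks_2samp` computes the same statistic along with a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # always occurs at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Made-up delay labels (minutes) for two splits.
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0, identical distributions
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # 1.0, no overlap at all
```

Running this on each pair of splits (train vs validation, train vs test, validation vs test) gives three distances to compare.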