# Reproducibility Design Patterns

- Unit tests are expected to produce deterministic output
- This kind of reproducibility is hard to achieve in ML
- Many models start from random values that are adjusted during training
- Setting a `random_state` (a random seed) makes that randomness return the same results on every run
- Beyond the random seed, many other artifacts need to be fixed to make training reproducible
- ML also has multiple stages (e.g. training, deployment), and each of these needs to be reproducible as well
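
The effect of fixing a seed can be sketched with the standard library alone. `init_weights` is a hypothetical helper standing in for a model's random initialisation; its `seed` argument plays the role that `random_state` plays in libraries that accept one.

```python
import random


def init_weights(n_weights, seed=None):
    """Return n_weights pseudo-random starting weights between -1 and 1.

    Hypothetical helper: `seed` plays the role of `random_state`.
    """
    # A private generator keeps the seed from affecting other code.
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]


# Same seed: identical starting weights on every run.
run_a = init_weights(5, seed=42)
run_b = init_weights(5, seed=42)

# Different seed: a different set of starting weights.
run_c = init_weights(5, seed=7)

print(run_a == run_b)
print(run_a == run_c)
```

With `seed=None` the generator is seeded from a source that changes between runs, which is the non-reproducible default the notes above warn about.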

## Design Pattern 21: Transform

- This pattern makes moving ML into production easier by keeping inputs, features and transforms carefully separate

### Problem

- The inputs to an ML model are not the features that the model actually uses
- e.g. words are not used directly in models and need to be converted into some numerical form
- We need to keep track of these transforms, otherwise we can't reproduce them in prod


- Training-serving skew is caused by differences between training and prod.
- e.g. if Wednesday is encoded as 3 during training, prod needs to use the same encoding. A different library may encode Tuesday as 3

### Solution

- Explicitly capture the transformations applied to convert the model inputs into features
- In sklearn you would pickle the fitted transformer
- Load the pickled transformer and use it to transform new data into the features the model expects
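
A minimal sketch of the save-and-reload step, using only the standard library. `DayOfWeekEncoder` is a hypothetical stand-in for a fitted sklearn transformer, and the file name `day_encoder.pkl` is an assumption. The point is that the mapping learned at training time is the one applied at serving time.

```python
import os
import pickle
import tempfile


class DayOfWeekEncoder:
    """Hypothetical stand-in for a fitted sklearn-style transformer."""

    def fit(self, days):
        # Fix the category -> integer mapping once, from the training data.
        self.mapping_ = {day: i for i, day in enumerate(sorted(set(days)))}
        return self

    def transform(self, days):
        # Raises KeyError for a day that was not seen during fit.
        return [self.mapping_[day] for day in days]


# Training time: fit the transform and save it alongside the model.
train_days = ["monday", "tuesday", "wednesday", "thursday", "friday"]
encoder = DayOfWeekEncoder().fit(train_days)
train_features = encoder.transform(train_days)

path = os.path.join(tempfile.mkdtemp(), "day_encoder.pkl")
with open(path, "wb") as f:
    pickle.dump(encoder, f)

# Serving time: load the saved transform instead of fitting a new one.
with open(path, "rb") as f:
    serving_encoder = pickle.load(f)

serving_features = serving_encoder.transform(["wednesday", "monday"])
print(serving_features)
```

Because the serving code never re-derives the mapping, a day keeps the integer it had during training regardless of which library or ordering produced it.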

## Design Pattern 22: Repeatable Splitting

- Ensure that sampling is repeatable and reproducible
- Use a deterministic hash function to split the data into train, validation and test sets

### Problem 

- Randomly splitting the rows of a dataset is a bad idea
    - Rows in a dataset are rarely independent
    - e.g. a model to predict flight delays: on a day with lots of delays, many of that day's flights will be delayed. Putting some of these flights in training and some in test leaks information between the splits. This leakage from correlated rows is a common problem and needs to be avoided during ML training.
- Random splitting can also produce a different dataset on each run, which hurts reproducibility and makes models hard to compare.
- Setting a random seed for the split, or storing the split data in advance, fixes the repeatability issue

- For ML we need splitting that is repeatable and that keeps correlated rows in the same split.
    - e.g. we don't want flights from 28th June 2022 in both the train and test sets
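
One way to get both properties is to hash the column that carries the correlation (here the flight date) and pick the split from the hash. The sketch below is an illustration under stated assumptions: `assign_split`, the 80/10/10 proportions, and the sample `flights` rows are all hypothetical.

```python
import hashlib


def assign_split(key, train_pct=80, valid_pct=10):
    """Map a key (e.g. a flight date) to 'train', 'validation' or 'test'.

    Hypothetical helper; the percentages are illustrative defaults.
    """
    # Python's built-in hash() is salted per process for strings, so a
    # cryptographic digest is used to get the same bucket on every run.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "validation"
    return "test"


# Made-up rows for illustration only.
flights = [
    {"flight": "AA100", "date": "2022-06-28"},
    {"flight": "BA200", "date": "2022-06-28"},
    {"flight": "DL300", "date": "2022-06-29"},
    {"flight": "UA400", "date": "2022-06-30"},
]

# Split on the date, not the row, so every flight from one day
# lands in the same split.
for row in flights:
    row["split"] = assign_split(row["date"])
    print(row["flight"], row["date"], row["split"])
```

No seed or stored copy of the split is needed: the assignment depends only on the key, so re-running the split on the same data gives the same result. The realised proportions only approach 80/10/10 when there are many distinct keys.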