# The Need for Machine Learning Design Patterns

## What are Design Patterns

- Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that the solution can be used repeatedly.

Building production machine learning models is increasingly an engineering discipline. As machine learning becomes more mainstream, it is important that practitioners take advantage of tried and proven methods to address recurring problems.

## The Machine Learning Process

1. The first step is typically training (after the data has been prepped and cleaned).
2. Test model performance against data outside of the training set. This is model evaluation.
3. Repeat steps 1 and 2 as needed. New features may be engineered or added, different model hyperparameters tried, and so on.
4. Serve the model to make predictions:
    - Online: get predictions in near real time
    - Batch: generate predictions on a large dataset offline
5. New data may be ingested continuously and need to be processed immediately before it is sent to the model for training or prediction. This is known as streaming.
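
Steps 1 to 4 can be sketched with a toy model. This is only an illustration, assuming a one-feature linear regression fitted by least squares on synthetic data; the function names `train`, `predict`, and `evaluate` are hypothetical and not taken from any library.

```python
import random


def train(examples):
    """Fit y = slope * x + intercept by ordinary least squares on (x, y) pairs."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def evaluate(model, examples):
    """Mean squared error of the model on (x, y) pairs."""
    return sum((predict(model, x) - y) ** 2 for x, y in examples) / len(examples)


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus a little noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in range(100)]
rng.shuffle(data)
train_set, test_set = data[:80], data[80:]

model = train(train_set)               # Step 1: training
test_mse = evaluate(model, test_set)   # Step 2: evaluation on held-out data

online_prediction = predict(model, 3.0)                       # Step 4: online, one request
batch_predictions = [predict(model, x) for x in range(1000)]  # Step 4: batch, many rows offline
```

Step 3 would wrap the first two steps in a loop, changing features or hyperparameters between iterations until the held-out error is acceptable.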

## Roles

There are many roles relating to data and machine learning. Here are just a few:

- Data scientist: Collects, interprets, and processes datasets, and runs exploratory data analysis. As it relates to machine learning, a data scientist may work on data collection, feature engineering, model building, and more.
- Data engineer: Focuses on the infrastructure and workflows powering an organisation's data. They may help with data ingestion, data pipelines, and how data is stored and transferred. They implement infrastructure and pipelines around data.
- Machine learning engineer: Similar to a data engineer, but the focus is on ML models. They take models developed by data scientists and manage the infrastructure and operations around training and deploying those models. ML engineers help build production systems that handle updating models, model versioning, and serving predictions to end users.
- Research scientist: Focuses primarily on finding and developing new algorithms to advance the discipline of ML.
- Data analyst: Evaluates and gathers insights from data, then summarises these insights for other teams within their organisation.

## Common Challenges in Machine Learning

### Data Quality 

ML models are only as reliable as the data used to train them. If you train on an incomplete dataset, on data with poorly selected features, or on data that doesn't accurately represent the population using the model, your model's predictions will be a direct reflection of that data. This is referred to as "garbage in, garbage out". There are four components to data quality:
- **Accuracy**: Refers to both the training data features and the ground truth labels corresponding to those features. Understanding where the data comes from and any potential errors in the data collection process can help ensure feature accuracy. It is important to screen for typos, duplicate entries, measurement inconsistencies, missing features, and any other type of error that may impact data quality. Duplicate entries may cause the model to assign more weight to those entries. Label accuracy is just as important as feature accuracy. Models rely solely on the ground truth labels in the training set to update weights and minimise loss, so incorrectly labelled training examples can produce misleading model accuracy.
- **Completeness**: Ensure the training data contains a varied representation of each label. For example, in a model that predicts dog breed, if all the training photos are close-ups of the head, the model may struggle with full-body shots. With tabular data, ensure the data isn't biased towards a particular area. For example, if you are building a house price estimator and all the data points are houses with 5+ rooms, the model will give unreliable results when predicting on houses with fewer than 5 rooms.
- **Consistency**: If the labelling of data is done by a group of people, it is important to agree on standards for this process. Every person has their own biases, so shared standards help ensure consistency across the dataset.
- **Timeliness**: It is important to record as much information as possible about a particular data point, and to make sure that information is reflected when you transform the data into features. More specifically, keep track of the timestamp of when an event occurred and when it was added to the dataset.
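
Some of these checks are easy to automate. The following is a minimal sketch, assuming tabular records stored as dictionaries for the house price example; the feature names in `REQUIRED_FEATURES` and the helper functions are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical feature names for the house price example.
REQUIRED_FEATURES = ("rooms", "area_m2", "price")


def find_duplicates(records):
    """Return each record that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(r.items())) for r in records)
    return [dict(key) for key, n in counts.items() if n > 1]


def find_incomplete(records, required=REQUIRED_FEATURES):
    """Return the indices of records where a required feature is missing or None."""
    return [
        i for i, r in enumerate(records)
        if any(r.get(feature) is None for feature in required)
    ]


def value_counts(records, feature):
    """Count how often each value of a feature occurs, to spot gaps in coverage."""
    return Counter(r[feature] for r in records if r.get(feature) is not None)


records = [
    {"rooms": 5, "area_m2": 140, "price": 500000},
    {"rooms": 6, "area_m2": 180, "price": 650000},
    {"rooms": 5, "area_m2": 140, "price": 500000},   # duplicate of the first record
    {"rooms": 7, "area_m2": None, "price": 800000},  # missing feature value
]

duplicates = find_duplicates(records)        # accuracy: duplicate entries
incomplete = find_incomplete(records)        # accuracy: missing features
room_counts = value_counts(records, "rooms") # completeness: no houses under 5 rooms
```

Here `room_counts` contains no key below 5, which is the kind of coverage gap described under completeness.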

### Reproducibility

ML models have an inherent element of randomness, e.g. when model weights are initialised with random values. These weights then converge during training as the model iterates and learns from the data. As a result, the same code run on the same training data will produce slightly different models. This introduces the challenge of reproducibility.

To address this, it is common to set random seed values, which ensure the same randomness is applied to each training run.
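
The effect of a seed can be shown with Python's standard `random` module. This is a minimal sketch; `init_weights` is a hypothetical helper, and real ML frameworks keep their own random state that would need to be seeded separately.

```python
import random


def init_weights(n, seed=None):
    """Return n small random weights; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


run_a = init_weights(4, seed=42)
run_b = init_weights(4, seed=42)  # same seed: identical starting weights
run_c = init_weights(4, seed=7)   # different seed: different starting weights
```

Runs `run_a` and `run_b` start from the same weights, so any remaining difference between two training runs must come from somewhere other than initialisation.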

Reproducibility can also refer to the training environment. With large datasets, training may be distributed across multiple workers to speed it up. With this acceleration, however, comes the added challenge of repeatability when you rerun code that makes use of distributed training.

### Data Drift

Data drift refers to the challenge of ensuring ML models stay relevant, and that model predictions are an accurate reflection of the environment in which they're being used. To solve for drift, it is important to continually update the training dataset, retrain the model, and modify the weight your model assigns to particular groups of data.
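
One way to modify the weight given to groups of data is to down-weight older examples. The sketch below assumes an exponential decay with a 30-day half-life; both the decay scheme and the half-life are illustrative assumptions, not a recommendation from the source.

```python
def recency_weights(ages_in_days, half_life_days=30.0):
    """Weight 1.0 for a brand-new example, halving every half_life_days."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical observations: the newest values are higher than the oldest.
ages = [0, 30, 60, 365]          # days since each example was recorded
values = [200, 180, 160, 100]

weights = recency_weights(ages)
plain_mean = sum(values) / len(values)
recent_mean = weighted_mean(values, weights)  # pulled towards the newest values
```

Many training APIs accept per-example weights, so a list like `weights` could be passed along with the refreshed training set when the model is retrained.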

### Scale

It's common to encounter scaling challenges in data collection and preprocessing, training, and serving. The size of the dataset typically dictates the tooling required for the solution.

ML engineers are responsible for determining the necessary infrastructure for a specific training job. Image models typically require much more infrastructure, such as GPUs.

In the context of serving, the infrastructure needed to support a team of data scientists getting predictions is vastly different from the infrastructure needed to support a production model receiving millions of prediction requests every hour. Developers and ML engineers are responsible for handling these scaling challenges.

### Multiple Objectives

A single team is often responsible for building a machine learning model, but many teams across the organisation may use the model. Inevitably, these teams have different definitions of success.

For example, suppose we build a model to identify cracks in glass. As a data scientist, you want to minimise the cross-entropy loss. The product manager, on the other hand, may want to minimise the number of defective glass panes. Finally, the exec team may have a goal to increase profits by 30%. Each goal differs in what we're optimising for, and balancing these differing needs within the organisation can present a challenge.

As data scientists, we can translate the product team's needs into the context of the model, e.g. false negatives are 5x more costly than false positives.
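
A cost ratio like this can be turned into a concrete decision rule. The sketch below assumes the model outputs a crack probability per pane and picks the classification threshold with the lowest total cost; the scores, labels, and candidate thresholds are made-up illustrative values.

```python
FN_COST = 5.0  # a missed crack (false negative) costs 5x a false alarm
FP_COST = 1.0  # a good pane flagged as cracked (false positive)


def total_cost(scores, labels, threshold):
    """Sum the cost of all errors made when flagging scores >= threshold as cracked."""
    cost = 0.0
    for score, is_cracked in zip(scores, labels):
        predicted_cracked = score >= threshold
        if is_cracked and not predicted_cracked:
            cost += FN_COST
        elif predicted_cracked and not is_cracked:
            cost += FP_COST
    return cost


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t))


# Hypothetical model scores and ground truth labels (True means cracked).
scores = [0.1, 0.3, 0.4, 0.45, 0.6, 0.8, 0.9]
labels = [False, False, True, False, False, True, True]
candidates = [0.2, 0.35, 0.5, 0.7]

chosen = best_threshold(scores, labels, candidates)
```

With these values the chosen threshold is 0.35, which accepts two false alarms to avoid missing the cracked pane scored at 0.4. If both error types cost the same, a threshold of 0.7 would make fewer mistakes overall, which shows how the product team's priorities change the model's operating point.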

When defining the goals of the model, it's important to consider the needs of different teams across an organisation and how each team's needs relate back to the model.