# Problem Representation Design Pattern

This chapter looks at the different types of machine learning problems and analyses how the model architectures vary depending on the problem.

The input and output types are two key factors impacting the model architecture. For example, the output required from a model could determine whether we choose a regression or a classification model. Special neural network layers exist for specific types of input data: convolutional layers for images, speech, text, and other data with spatiotemporal correlation, and recurrent networks for sequential data. Special classes of solutions exist for commonly occurring problems like recommendations (matrix factorization) or time series (ARIMA). A group of simpler models together with common idioms can be used to solve more complex problems, e.g. text generation often involves a classification model whose outputs are postprocessed using a beam search algorithm.

## Design Pattern 5: Reframing

This pattern refers to changing the output of an ML problem. For example, we could take something that is intuitively a regression problem and instead pose it as a classification problem.

### Problem

The first step of building any ML solution is framing the problem. Is this a supervised learning problem? Or unsupervised? What are the features? If it is a supervised problem what are the labels? What amount of error is acceptable? Of course, the answers to these questions must be considered in context with the training, the task at hand, and the metrics for success.

For example, if we wanted to predict the amount of rainfall in a given area, we could make this a regression problem. We could also treat this as a time series problem. There are lots of adjustments we can make to improve our model. Is regression the only way we can pose this task? Perhaps we can reframe our machine learning objective in a way that improves our task performance.

### Solution

If we use a regression model to predict the amount of rainfall, we are limited to predicting a single real-valued number. We can instead reframe this as a classification problem, where one approach is to model a discrete probability distribution: we bin the amount of rainfall into classes, e.g. `0-0.05mm`, `0.5 - 1.0mm` etc., and each class has an associated probability.

Both the regression approach and the reframed classification approach give a prediction of the rainfall. However, the classification approach allows the model to capture the probability distribution of rainfall of different quantities.

### Why It Works

By reframing we lose a little precision due to bucketing, but gain the expressiveness of a full probability density function. The discretised predictions provided by the classification model are more adept at learning a complex target than a more rigid regression model.

An added advantage of this classification framing is that we obtain a posterior probability distribution over our predicted values, which provides more nuanced information. Suppose the learned distribution is bimodal. By modelling the classification output as a discrete probability distribution, the model is able to capture the bimodal structure of the predictions, whereas a regression model would lose this information.
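To make the bimodal case concrete, here is a minimal sketch with made-up numbers (the sample values and the bin width are illustrative assumptions, not data from the text). A single point estimate such as the mean lands between the two modes, where no observations occur, while a binned distribution keeps both peaks visible.

```python
from collections import Counter
from statistics import mean

# Hypothetical target values drawn from two clusters (around 3 and around 9).
samples = [2.8, 2.9, 3.0, 3.1, 3.2, 8.8, 8.9, 9.0, 9.1, 9.2]

# A regression model trained with squared error tends towards the mean.
point_estimate = mean(samples)

def bin_index(value, width=2.0):
    """Map a value to the index of its fixed-width bin: [0, 2), [2, 4), ..."""
    return int(value // width)

# A classification framing instead estimates a probability for every bin.
counts = Counter(bin_index(v) for v in samples)
n_bins = max(counts) + 1
distribution = [counts.get(i, 0) / len(samples) for i in range(n_bins)]

print("point estimate:", point_estimate)
print("binned distribution:", distribution)
```

The bin that contains the point estimate has zero probability mass in this toy data, which is exactly the information a single-number prediction hides.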

#### Capturing Uncertainty

The dataset of babies born to 25-year-old mothers at 38 weeks shows a normal distribution of weights with a mean of 7.5 pounds. There is a nontrivial likelihood (33%) that a given baby is less than 6.5 pounds or more than 8.5 pounds (this is 1 STD either side of the mean, page 83 in book). The width of this distribution indicates the irreducible error inherent to the problem of predicting baby weight. If we framed it as a regression problem, the best RMSE we could obtain is the standard deviation of the distribution.
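The link between RMSE and the standard deviation can be checked numerically. The sketch below uses synthetic weights (a mean of 7.5 and a standard deviation of 1.0, with an arbitrary random seed; the data is simulated, not the real dataset) and compares the RMSE of the best constant prediction with the spread of the data.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Synthetic stand-in for baby weights at one fixed combination of inputs.
weights = [random.gauss(7.5, 1.0) for _ in range(10_000)]

def rmse(prediction, targets):
    """Root mean squared error of a single constant prediction."""
    return sqrt(mean((t - prediction) ** 2 for t in targets))

best_constant = mean(weights)
rmse_at_mean = rmse(best_constant, weights)
spread = pstdev(weights)

# Any other constant prediction does worse than predicting the mean.
rmse_shifted = rmse(best_constant + 0.5, weights)

print(rmse_at_mean, spread, rmse_shifted)
```

For a fixed set of inputs, no point prediction can beat the mean under squared error, so the standard deviation acts as a floor on RMSE.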

If we frame it as a regression problem, we would have to state the prediction result as 7.5 +/- 1.0 (or whatever the STD is). Yet the width of the distribution will vary for different combinations of inputs, and so learning the width is another machine learning problem in itself. For example, at the 36th week, for mothers of the same age, the standard deviation is 1.16 pounds.

Had the distribution been multimodal (with multiple peaks), the case for reframing the problem as a classification problem would have been even stronger. However, it is helpful to realise that because of the law of large numbers, as long as we capture all of the relevant inputs, many of the distributions we will encounter on large datasets will be bell-shaped, although other distributions are possible. The wider the bell curve, and the more the width varies at different values of the inputs, the more important it is to capture uncertainty and the stronger the case for reframing the regression problem as a classification one.

By reframing the problem, we train the model as a multiclass classifier that learns a discrete probability distribution for the given training examples. These discretised predictions are more flexible in terms of capturing uncertainty and better able to approximate the complex target than a regression model. At inference time, the model then predicts a collection of probabilities corresponding to these potential outputs. That is, we obtain a discrete PDF giving the relative likelihood of any specific weight.
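As a rough illustration of what such an inference output looks like, the sketch below turns raw scores into a discrete PDF with a softmax. The bucket boundaries and the scores are hypothetical values chosen for the example, not outputs of a trained model.

```python
from math import exp

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weight buckets in pounds and hypothetical model scores.
bucket_labels = ["<6.5", "6.5-7.5", "7.5-8.5", ">8.5"]
logits = [0.2, 1.5, 1.4, 0.1]

pdf = softmax(logits)
for label, p in zip(bucket_labels, pdf):
    print(f"{label:>8} lb: {p:.2f}")

# Probability that the weight falls between 6.5 and 8.5 pounds.
p_middle = pdf[1] + pdf[2]
most_likely = bucket_labels[pdf.index(max(pdf))]
```

Because the output is a full distribution, questions such as "how likely is a weight in this range?" can be answered by summing bucket probabilities, which a single regression output cannot do.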

### Trade-Offs and Alternatives

There is rarely just one way to frame a problem. For example, bucketizing the output values of a regression is an approach to reframing the problem as a classification task. Another approach is multitask learning that combines both tasks (classification and regression) into a single model using multiple prediction heads. With any reframing technique, being aware of data limitations or the risk of introducing label bias is important.

#### Bucketised Outputs

The typical approach to reframing a regression task as a classification task is to bucketise the output values. For example, if our model is to be used to indicate how much rain we will get on a given day, we may bucket the values into, say, 5 groups.

The regression problem now becomes a classification problem. Intuitively, it is easier to predict 5 categories than to predict a single continuous number. By using categorical outputs, the model is incentivised less for getting arbitrarily close to the actual output value since we've essentially changed the output label to a range of values instead of a single number.
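A minimal sketch of the bucketing step is shown below. The bucket edges are hypothetical and would in practice be chosen from the distribution of the training labels.

```python
from bisect import bisect_right

# Hypothetical edges (in mm) separating 5 rainfall buckets.
bucket_edges = [0.5, 2.0, 5.0, 10.0]
n_classes = len(bucket_edges) + 1

def to_class(rainfall_mm):
    """Return the index of the bucket that contains the rainfall amount."""
    return bisect_right(bucket_edges, rainfall_mm)

def one_hot(index, size):
    """Encode a class index as a one-hot list."""
    return [1 if i == index else 0 for i in range(size)]

rainfall = [0.0, 0.7, 3.2, 12.5]             # continuous regression targets
class_ids = [to_class(r) for r in rainfall]  # classification targets
labels = [one_hot(c, n_classes) for c in class_ids]

print(class_ids)
print(labels)
```

Every value inside a bucket maps to the same label, which is where the small loss of precision from reframing comes from.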

#### Restricting the Prediction Range

For a given problem the prediction range may be `[3,20]`. If we train a regression model there is always a chance that the model may make predictions outside of this range. One way to limit the range is to reframe the problem. For example, we could build a DNN where the last layer is a sigmoid layer and then map its `[0,1]` output to the range `[3,20]`.
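The mapping from the sigmoid output to the target range is a simple affine rescaling. The sketch below applies it to a few hypothetical raw outputs of the final layer; the function names are illustrative.

```python
from math import exp

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    z = exp(x)
    return z / (1.0 + z)

def scale_to_range(x, low=3.0, high=20.0):
    """Map an unbounded raw output onto [low, high] via a sigmoid."""
    return low + (high - low) * sigmoid(x)

raw_outputs = [-50.0, -1.0, 0.0, 1.0, 50.0]  # hypothetical pre-activation values
predictions = [scale_to_range(x) for x in raw_outputs]

print(predictions)
```

However extreme the raw output, the prediction stays inside the allowed range, and a raw output of 0 maps to the midpoint.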

#### Label Bias

It is important to consider the nature of the target label when reframing the problem. For example, suppose we reframed our recommendation model as a classification task that predicts the likelihood a user will click on a certain video thumbnail. This seems like a reasonable reframing since our goal is to provide content a user will select and watch. But be careful. This change of label is not actually in line with our prediction task. By optimising for user clicks, our model will inadvertently promote clickbait and not actually recommend content of use to the user.

Instead, a more advantageous label would be video watch time, reframing our recommendation as a regression instead. Alternatively, we could predict the likelihood that the user will watch at least half the video clip.

## Design Pattern 6: Multilabel

The multilabel design pattern refers to a problem where we assign more than one label to a given training example.

### Problem

Often prediction tasks involve applying a single classification to a given training example. This prediction is determined from N possible classes where N is greater than 1. In this case, it's common to use softmax as the activation function for the output layer. Using softmax, the output of our model is an N-element array, where the sum of all the values adds up to 1. Each value indicates the probability that a particular example is associated with the class at that index.

For example, if our model is classifying images as cats, dogs or rabbits, the softmax output might look like this for a given image: `[0.89, 0.02, 0.09]`. This means our model is predicting an 89% chance the image is a cat. Because each image can only have one possible label in this scenario, we can take the `argmax` (index of the highest probability) to determine our model's predicted class. The less common scenario is when each training example can be assigned more than one label, which is what this pattern addresses.

The multilabel design pattern exists for models trained on all data modalities. For image classification, in the earlier example we could instead have used images which depicted multiple animals, and could therefore have multiple labels. The same can be applied to text models, e.g. a news article could belong to many different categories.

The design pattern can also apply to tabular datasets, for example healthcare data could be used to predict multiple conditions.

### Solution

The solution is to use a sigmoid activation function in our final layer instead of a softmax. Each individual value in a sigmoid array is a float between 0 and 1. That is to say, when implementing the Multilabel design pattern, our label needs to be multi-hot encoded. The length of the multi-hot array corresponds to the number of classes in our model, and each output in this label array will be a sigmoid value.

The main difference between the sigmoid and softmax is that, for our three-class example, the softmax array is guaranteed to contain three values that sum to 1, whereas the sigmoid output will contain three values each between 0 and 1.

The sigmoid is a nonlinear, continuous and differentiable activation function that takes the outputs of each neuron in the previous layer in the model and squashes the value of those outputs between 0 and 1.
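The difference between the two activations is easiest to see on the same set of raw scores. The sketch below uses hypothetical scores for an image that shows both a cat and a rabbit; the class order and values are assumptions made for illustration.

```python
from math import exp

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    largest = max(logits)
    exps = [exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash a single raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

class_names = ["cat", "dog", "rabbit"]

# Multi-hot label for a hypothetical image showing a cat and a rabbit.
label = [1, 0, 1]

# Hypothetical raw scores from the final layer for that image.
logits = [2.5, -3.0, 1.8]

softmax_out = softmax(logits)               # values sum to 1
sigmoid_out = [sigmoid(x) for x in logits]  # each value is independent

for name, s, g in zip(class_names, softmax_out, sigmoid_out):
    print(f"{name:>6}: softmax={s:.2f} sigmoid={g:.2f}")
```

With softmax, a high score for "cat" necessarily pushes "rabbit" down. With a sigmoid per class, both can be close to 1 at the same time, which is what a multi-hot label needs.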

### Trade-Offs and Alternatives

- **Multiclass classification**: Each example can have only 1 label
- **Binary classification**: The number of classes is 2
- **Multilabel classification**: Each example can have many labels

In a multiclass scenario we use softmax, and in a binary classification scenario we use a sigmoid. In a multilabel scenario we use a sigmoid for each label.

For the multilabel scenario we can use the binary cross entropy loss because a multilabel problem is essentially `n` smaller binary classification problems.
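A minimal sketch of binary cross-entropy applied to a multi-hot label follows. The predictions are made-up values, and the clipping constant `eps` is an implementation detail assumed here to avoid taking the log of zero.

```python
from math import log

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

label = [1, 0, 1]                  # multi-hot target
good_prediction = [0.9, 0.1, 0.8]  # hypothetical sigmoid outputs
poor_prediction = [0.2, 0.7, 0.4]

loss_good = binary_cross_entropy(label, good_prediction)
loss_poor = binary_cross_entropy(label, poor_prediction)

print(loss_good, loss_poor)
```

Each label contributes its own binary loss term, so a mistake on one label is penalised regardless of how well the others are predicted.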

#### Parsing Sigmoid Results

By applying a sigmoid per class we obtain a probability per class. To assign labels to a given prediction we can say if the probability of a label is above 50% it should be assigned to the data point. Additionally we can also apply `n_specific_tag` / `n_total_examples` as a threshold for each class. Here, `n_specific_tag` is the number of examples with one tag in the dataset and `n_total_examples` is the total number of examples in the training set across all tags. This ensures that the model is doing better than guessing a certain label based on its occurrence in the training dataset.
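Both thresholding rules can be sketched in a few lines. The training labels and the prediction below are hypothetical, and the helper names are illustrative.

```python
def class_frequencies(multi_hot_labels):
    """Fraction of training examples carrying each tag."""
    n_total_examples = len(multi_hot_labels)
    n_classes = len(multi_hot_labels[0])
    return [
        sum(row[c] for row in multi_hot_labels) / n_total_examples
        for c in range(n_classes)
    ]

def assign_labels(probs, thresholds):
    """Return indices of classes whose probability exceeds its threshold."""
    return [i for i, (p, t) in enumerate(zip(probs, thresholds)) if p > t]

# Hypothetical multi-hot training labels for 3 tags.
train_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]
sigmoid_output = [0.70, 0.30, 0.45]  # hypothetical prediction

fixed = assign_labels(sigmoid_output, [0.5] * 3)
per_class = assign_labels(sigmoid_output, class_frequencies(train_labels))

print(fixed, per_class)
```

In this toy data the first tag appears in 80% of examples, so a probability of 0.70 is no better than its base rate, while the rarer tags clear their much lower thresholds.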

For a more precise approach, read this [paper](https://pralab.diee.unica.it/sites/default/files/pillai_PR2013_Thresholding_0.pdf). It uses S-Cut for optimizing your model's F-measure.

#### Dataset Considerations

Dataset balancing is important for ML models and is more nuanced for the Multilabel design pattern.

For the model to learn what each unique label is, we'll want to ensure the training dataset consists of varied combinations of each tag. If two labels frequently occur together, the model may not learn to classify one of them when it appears on its own. To account for this, think about the relationships between the labels and count the number of training examples that belong to each overlapping combination of labels.
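Counting label combinations is straightforward with a `Counter`. The tag sets below are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical tag sets for a handful of training examples.
examples = [
    {"cat"},
    {"cat", "dog"},
    {"cat", "dog"},
    {"dog", "cat"},
    {"rabbit"},
    {"cat", "dog"},
]

# Count how often each exact combination of labels occurs.
combo_counts = Counter(frozenset(tags) for tags in examples)

# Find labels that never appear on their own in the dataset.
all_labels = set().union(*examples)
never_alone = sorted(
    label for label in all_labels if combo_counts[frozenset({label})] == 0
)

for combo, count in combo_counts.most_common():
    print(sorted(combo), count)
print("never seen alone:", never_alone)
```

Here "dog" only ever appears together with "cat", which is the kind of gap that would suggest collecting or sampling more dog-only examples.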

We can consider hierarchical labels if the dataset allows. e.g.
```
animal -> invertebrate -> arthropod -> spider
```
There are two common approaches for handling hierarchical labels:
- Use a flat approach and put every label in the same output array. Make sure there are enough samples at each leaf node
- Use the cascade design pattern. Build one model to identify higher-level labels. Based on the higher-level classification, send the example to a separate model for a more specific classification task. E.g. if the higher-level model predicts a datapoint to be an "animal", we then send the datapoint to different model(s) to apply more granular labels.

The flat approach is more straightforward. However, it might cause the model to lose information about more detailed label classes, since there will naturally be more training examples with higher-level labels in the dataset.

#### Inputs with Overlapping Labels

The multilabel approach to predictions is useful when labels overlap. For example, suppose an image contains multiple fashion items and two people were to label the same item in it as:
- Long sleeved blazer
- Double breasted blazer

Both labels are correct, but depending on who labelled the image the model may predict things differently. This is where multilabel is useful, because it allows us to associate both overlapping labels with an image.

#### One Versus Rest

Another technique for handling multilabel classification is to train multiple binary classifiers instead of one multilabel model. This approach is called _one versus rest_. We would train a binary classifier for each label.

This can help with rare categories since the model will be performing only one classification task at a time on each input. The disadvantage of this approach is the added complexity of training many different classifiers.
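The one versus rest idea can be sketched with any binary classifier. To stay dependency-free, the example below uses a toy nearest-centroid classifier as a stand-in (an assumption made for brevity, not a recommended model), trained once per label on hypothetical 2-D features.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroid_classifier(points, targets):
    """Toy binary classifier: store the mean point of each class."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    positives = [p for p, t in zip(points, targets) if t == 1]
    negatives = [p for p, t in zip(points, targets) if t == 0]
    return centroid(positives), centroid(negatives)

def predict_binary(model, point):
    """Predict 1 if the point is closer to the positive centroid."""
    pos_c, neg_c = model
    closer_to_pos = squared_distance(point, pos_c) < squared_distance(point, neg_c)
    return 1 if closer_to_pos else 0

# Hypothetical features with multi-hot labels for 3 tags.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.0, 0.0], [0.1, 1.0]]
Y = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1], [1, 1, 0]]

# One independent binary classifier per label.
n_labels = len(Y[0])
models = [
    fit_centroid_classifier(X, [row[j] for row in Y]) for j in range(n_labels)
]

def predict_multilabel(point):
    """Run every per-label classifier and collect a multi-hot prediction."""
    return [predict_binary(m, point) for m in models]

print(predict_multilabel([0.05, 0.05]), predict_multilabel([1.0, 1.0]))
```

Each classifier sees the same inputs but a different binary target, so the number of models to train and maintain grows with the number of labels.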

## Design Pattern 7: Ensembles

This pattern combines multiple ML models and aggregates their results to make predictions. Ensembles can be an effective means to improve performance and produce predictions that are better than those of any single model.

### Problem

Imagine we have trained a model such that the error on the training set is almost zero. However, in production or on the holdout set, a lot of our predictions are wrong. What went wrong, and how can we fix it?

Error in an ML model can be broken down into three parts: irreducible error, error due to bias, and error due to variance.
- **Irreducible error**: This is an inherent error resulting from noise in the dataset, the framing of the problem or bad training examples e.g. measurement errors. We can't do much about this error type.
- **Bias and variance**: These are reducible errors, and here we can influence our model's performance.
    - Bias is the model's inability to learn enough about the relationship between the features and labels
    - Variance captures the model's inability to generalise to new, unseen examples

A model with high bias oversimplifies the relationship between the features and labels and is said to underfit. A model with high variance has learned too much about the training data and is said to overfit. The ideal model has low bias and low variance, but in practice this is difficult to achieve. This tension is known as the bias-variance trade-off.

### Solution

Ensemble methods are meta-algorithms that combine several machine learning models as a technique to decrease the bias and/or variance and improve model performance. By building several models with different inductive biases and aggregating their outputs, we hope to get a model with better performance.

#### Bagging

Bagging is short for bootstrap aggregating. It is a type of parallel ensembling method used to address high variance in ML models. The bootstrap part of bagging refers to the datasets used for training the ensemble members. Specifically, if there are $k$ submodels, then there are $k$ separate datasets used for training each submodel of the ensemble. Each dataset is constructed by randomly sampling (with replacement) from the original training dataset. This means there is a high probability that any of the $k$ datasets will be missing some training examples, but also any dataset will likely have repeated training examples.
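The bootstrap step can be sketched as follows. The dataset is a stand-in list of integers, and the seed and the value of `k` are arbitrary choices for the example.

```python
import random

random.seed(42)  # arbitrary seed for repeatability

def bootstrap_sample(dataset):
    """Draw len(dataset) examples uniformly at random, with replacement."""
    return [random.choice(dataset) for _ in dataset]

dataset = list(range(1000))  # stand-in for 1000 training examples
k = 5                        # number of submodels in the ensemble

bootstrap_sets = [bootstrap_sample(dataset) for _ in range(k)]

for i, sample in enumerate(bootstrap_sets):
    unique_fraction = len(set(sample)) / len(dataset)
    print(f"dataset {i}: {unique_fraction:.1%} of original examples present")
```

When sampling with replacement to the original size, each bootstrap set is expected to contain roughly 63% of the distinct original examples (about 1 - 1/e), with the remainder of the set made up of repeats.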

A good example of bagging is the random forest model. Each tree is trained on a randomly sampled subset of the entire training data, then the tree predictions are aggregated to produce a prediction.

Model averaging as seen in bagging is a powerful and reliable method for reducing model variance. As we'll see, different ensemble methods combine multiple submodels in different ways, sometimes using different models, different algorithms, or even different objective functions. With bagging, the model and algorithms are the same. For example, with random forest, the submodels are all short decision trees.

#### Boosting

Boosting is another ensemble technique. Unlike bagging, boosting ultimately constructs an ensemble model with more capacity than the individual member models. For this reason, boosting provides a more effective means of reducing bias than variance. The idea behind boosting is to iteratively build an ensemble of models where each successive model focuses on learning the examples the previous model got wrong. In short, boosting iteratively improves upon a sequence of weak learners, taking a weighted average to ultimately yield a strong learner.

At the start of the boosting procedure, a simple base model `f_0` is selected. For a regression task, the base model could just be the average target value: `f_0 = np.mean(Y_train)`. For the first iteration step, the residuals `delta_1` are measured and approximated via a separate model. This residual model can be anything, but typically isn't complicated, e.g. a weak learner such as a decision tree. The approximation provided by the residual model is then added to the current prediction, and the process continues.

After many iterations, the residuals tend towards zero and the prediction gets better and better at modelling the original training dataset.
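The loop described above can be sketched with a one-split regression stump as the weak learner. The data is hypothetical, the stump is a deliberately simple stand-in for a decision tree, and the number of iterations is arbitrary.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(targets, predictions):
    """Mean squared error between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(targets, predictions))

# Hypothetical 1-D training data.
X_train = [1, 2, 3, 4, 5, 6, 7, 8]
Y_train = [1.2, 1.9, 3.1, 3.9, 8.2, 8.8, 9.1, 9.9]

f_0 = mean(Y_train)                   # base model: the average target value
predictions = [f_0 for _ in X_train]
history = [mse(Y_train, predictions)]

for step in range(10):
    residuals = [y - p for y, p in zip(Y_train, predictions)]  # delta_i
    stump = fit_stump(X_train, residuals)                      # weak learner
    predictions = [p + stump(x) for p, x in zip(predictions, X_train)]
    history.append(mse(Y_train, predictions))

print([round(h, 4) for h in history])
```

The training error recorded in `history` never increases from one step to the next, because each stump is the least-squares fit to the current residuals. Practical libraries usually also scale each weak learner by a learning rate, which is omitted here.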

Some well-known boosting algorithms are AdaBoost, Gradient Boosting Machines and XGBoost.

#### Stacking

Stacking is an ensemble method that combines the outputs of a collection of models to make a prediction. The initial models, which are typically of different model types, are trained to completion on the full training dataset. Then, a secondary meta-model is trained using the initial model outputs as features. This second meta-model learns how to best combine the outcomes of the initial models to decrease the training error, and it can be any type of machine learning model.

To do this we train all models in the ensemble on the full training dataset. These submodels are incorporated into a larger stacking ensemble model as individual inputs. We then train a model on the outputs of these sub-models.
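A minimal stacking sketch follows, with two deliberately simple base models (a least-squares line and a one-split step function) and a linear meta-model fitted on their outputs. The data, the fixed split point and the choice of a linear meta-model without an intercept are all assumptions made to keep the example short.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Least-squares line y = a * x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - a * x_bar
    return lambda x: a * x + b

def fit_step(xs, ys, threshold):
    """Piecewise-constant model with one fixed split point."""
    left = mean(y for x, y in zip(xs, ys) if x <= threshold)
    right = mean(y for x, y in zip(xs, ys) if x > threshold)
    return lambda x: left if x <= threshold else right

def fit_meta(p1, p2, ys):
    """Least-squares weights (w1, w2) for y ~ w1 * p1 + w2 * p2."""
    s11 = sum(a * a for a in p1)
    s12 = sum(a * b for a, b in zip(p1, p2))
    s22 = sum(b * b for b in p2)
    t1 = sum(a * y for a, y in zip(p1, ys))
    t2 = sum(b * y for b, y in zip(p2, ys))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return mean((y - p) ** 2 for y, p in zip(ys, preds))

# Hypothetical 1-D training data.
X_train = [1, 2, 3, 4, 5, 6, 7, 8]
Y_train = [1.2, 1.9, 3.1, 3.9, 8.2, 8.8, 9.1, 9.9]

# Step 1: train the initial models on the full training dataset.
linear_model = fit_linear(X_train, Y_train)
step_model = fit_step(X_train, Y_train, threshold=4)

# Step 2: their outputs become the input features of the meta-model.
p_linear = [linear_model(x) for x in X_train]
p_step = [step_model(x) for x in X_train]
w1, w2 = fit_meta(p_linear, p_step, Y_train)

def stacked_predict(x):
    """Combine the base model outputs using the learned weights."""
    return w1 * linear_model(x) + w2 * step_model(x)

mse_linear = mse(Y_train, p_linear)
mse_step = mse(Y_train, p_step)
mse_stacked = mse(Y_train, [stacked_predict(x) for x in X_train])
print(mse_linear, mse_step, mse_stacked)
```

Because the meta-model could reproduce either base model by setting one weight to 1 and the other to 0, its training error cannot be worse than that of the better base model.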

### Why It Works

Model averaging methods like bagging work because typically the individual models that make up the ensemble model will not all make the same errors on the test set. In an ideal situation, each individual model is off by a random amount, so when their results are averaged, the random errors cancel out, and the prediction is closer to the correct answer. There is wisdom in the crowd.

Boosting works because the model is punished more and more according to the residuals at each iteration step. With each iteration, the ensemble model is encouraged to get better and better at predicting those hard-to-predict examples. Stacking works because it combines the best of both bagging and boosting. The secondary model can be thought of as a more sophisticated version of model averaging.

#### Bagging

If the errors in each model are perfectly correlated, model averaging doesn't help at all. On the other hand, if the errors are perfectly uncorrelated, the variance decreases with the number of models (k): `var/k`. So the expected squared error shrinks in proportion to `1/k` as the number of models `k` in the ensemble grows. Overall, on average, the ensemble will perform at least as well as any of the individual models in the ensemble. Furthermore, if the models make independent errors, i.e. their errors are not correlated, then the ensemble will perform significantly better. The key to success with bagging is model diversity.
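The `var/k` behaviour can be checked with a small simulation. The sketch below assumes each model's error is an independent draw from a zero-mean Gaussian with unit variance, which is an idealisation. The seed and trial count are arbitrary.

```python
import random
from statistics import pvariance

random.seed(1)  # arbitrary seed for repeatability

def ensemble_error_variance(k, n_trials=5000, sigma=1.0):
    """Variance of the averaged error of k models with independent errors."""
    averaged_errors = [
        sum(random.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(n_trials)
    ]
    return pvariance(averaged_errors)

for k in (1, 2, 5, 10):
    estimate = ensemble_error_variance(k)
    print(f"k={k:>2}: simulated variance {estimate:.3f}, var/k = {1.0 / k:.3f}")
```

If the same error were shared by every model (perfect correlation), averaging would return that error unchanged and the variance would stay at its single-model value.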

Model averaging can even benefit neural networks trained on the same dataset. In fact, one recommended solution to fix the high variance of neural networks is to train multiple models and aggregate their predictions.

#### Boosting

The boosting algorithm works by iteratively improving the model to reduce the prediction error. Each new weak learner corrects for the mistakes of the previous prediction by modeling the residuals `delta_i` of each step. The final prediction is the sum of the outputs from the base learner and each of the successive weak learners. Boosting iteratively builds a strong learner from a sequence of weak learners that model the residual error of the previous iteration.

Thus, the resulting ensemble model becomes successively more and more complex, having more capacity than any one of its members. This also explains why boosting is particularly good for combating high bias. By iteratively focusing on the hard-to-predict examples, boosting effectively decreases the bias of the resulting model.

#### Stacking

Stacking can be thought of as an extension of simple model averaging where we train `k` models to completion on the training dataset, then average the results to determine a prediction. Simple model averaging is similar to bagging, but the models in the ensemble could be of different types, while for bagging, the models are of the same type. More generally, we could modify the averaging step to take a weighted average, for example, to give more weight to one model in our ensemble over the others. The weighting could be based on the relative accuracy of the models.

Stacking is a more advanced version of model averaging, where instead of taking the average or weighted average, we train a second ML model on the outputs to learn how best to combine the results of the models in our ensemble to produce a prediction. This provides all the benefits of decreasing variance, as with bagging techniques, but also controls for high bias.

### Trade-Offs and Alternatives

#### Increased Training and Design Time

By using an ensemble model we've introduced an additional amount of overhead in our model development, not to mention maintenance, inference complexity and resource usage if the ensemble model goes into production. This can become impractical as the number of models in the ensemble increases.

We should carefully consider if the increased overhead is worth the complexity.

#### Dropout as Bagging

Techniques like dropout provide a powerful and effective alternative. Dropout is known as a regularisation technique in deep learning but can also be understood as an approximation to bagging. Dropout in a neural network randomly turns off neurons for each mini-batch of training, essentially evaluating a bagged ensemble of exponentially many neural networks. Dropout is not the same as bagging though. In the case of bagging, the models are independent, while when training with dropout, the models share parameters. Also, in bagging the models are trained to convergence on their respective training datasets. However, when training with dropout, the ensemble member models would only be trained for a single training step because different nodes are dropped out in each iteration of the training loop.

#### Decreased Model Interpretability

In deep learning understanding why our model makes predictions is already difficult. This problem is compounded with ensemble models.

#### Choosing the Right Tool for the Problem

Some ensemble techniques are better at addressing bias or variance than others:

| Problem                     | Ensemble Solution |
|-----------------------------|------------------:|
| High Bias (underfitting)    |          Boosting |
| High Variance (overfitting) |           Bagging |

Using the wrong ensemble method for our problem won't necessarily improve performance, and it will add unneeded overhead.

#### Other Ensembles

There is an array of ensembles to choose from: Bayesian, neural nets, RL etc... The ensemble design pattern encompasses techniques that combine multiple ML models to improve overall model performance and can be particularly useful when addressing common training issues like high bias or high variance.

## Design Pattern 8: Cascade

This pattern addresses situations where an ML problem can be profitably broken into a series of ML problems.

### Problem

What happens if we need to predict a value during both usual and unusual activity? The model will ignore the unusual activity because it's rare. If the unusual activity is also associated with abnormal values, then trainability suffers.

For example, how do we identify resellers? A store makes millions of transactions and only a few thousand are reseller transactions. We don't really know at the time of purchase whether the item is being bought by a retail buyer or a reseller.

If we have labelled instances of reseller transactions, we can overweight these instances when training the model. This is suboptimal because we need to get the more common retail buyer use case as correct as possible. We don't want to trade off accuracy between the two types of customers. However, these types of customers behave differently: a retail buyer may return an item within a week, whereas a reseller will only return the item if they cannot sell it, which can be several months later.

A way to address this problem is with the cascade design pattern. We break the problem into four parts:
1. Predicting whether a specific transaction is by a reseller
2. Training one model on sales to retail buyers
3. Training the second model on sales to resellers
4. In production, combining the output of the three separate models to predict return likelihood for every item purchased and the probability that the transaction is by a reseller.

This allows different decisions on items likely to be returned depending on the type of buyer and ensures that the models in steps 2 and 3 are as accurate as possible on their segment of the training data. Each of these models is relatively easy to train. The first is simply a classifier, and if the unusual activity is extremely rare, we can use the rebalancing pattern to address this. The next two models are essentially classification models trained on different segments of the training data. The combination is deterministic since we choose which model to run based on whether the activity belonged to a reseller.

The problem comes during prediction. At prediction time, we don't have true labels just the output of the first classification model. Based on the output of the first model we will have to determine which of the two sales models to invoke. The problem is that we are training on labels, but at inference time, we will have to make decisions based on predictions. And predictions have errors. So, the second and third models will be required to make predictions on data that they might have never seen during training.

How do we train a cascade of models where the output of one model is an input to the following model or determines the selection of subsequent models?

### Solution

An ML problem where the output of one model is an input to the following model or determines the selection of subsequent models is called a `cascade`. Special care is needed when training these models.

For example, a problem which sometimes involves unusual circumstances can be solved by treating it as a cascade of four ML models:
1. A classification model to identify the circumstance
2. One model trained on unusual circumstances
3. A separate model trained on typical circumstances
4. A model to combine the output of the two separate models, because the output is a probabilistic combination of two outputs
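One simple way to read the fourth step is as a probability-weighted blend of the two downstream predictions. The sketch below assumes that interpretation (a learned combining model could take other forms), and all values and names are hypothetical.

```python
def combine_predictions(p_unusual, pred_unusual, pred_typical):
    """Probability-weighted blend of the two downstream predictions."""
    if not 0.0 <= p_unusual <= 1.0:
        raise ValueError("p_unusual must be a probability in [0, 1]")
    return p_unusual * pred_unusual + (1.0 - p_unusual) * pred_typical

# Hypothetical outputs of models 1 to 3 for a single example.
p_unusual = 0.1      # model 1: probability the circumstance is unusual
pred_unusual = 12.0  # model 2: prediction assuming unusual circumstances
pred_typical = 2.0   # model 3: prediction assuming typical circumstances

combined = combine_predictions(p_unusual, pred_unusual, pred_typical)
print(combined)
```

When the classifier is certain either way, the blend reduces to the output of the corresponding downstream model.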

This looks similar to an ensemble of models, but it is considered separately because of the special experiment design required when building a cascade.

Imagine we want to know where to stock bicycles at stations, so we wish to predict the distance between rental and return stations. The goal of the model is to predict the distance we need to transport the bicycle back to the rental location given features such as time of day, location of rental etc... Rentals longer than four hours are very different from shorter rentals in terms of behaviour, and the stocking algorithm will require both outputs (the probability that a rental is longer than four hours and the distance the bicycle needs to be transported). However, only a small fraction of rentals involve such abnormal trips.

One solution is to train a model to classify trips into long or typical trips. It can be tempting to split the training set into two parts based on the actual duration of the rental and train the next two models, one on long rentals and the other on typical rentals. The problem is that the classification model just discussed will have errors, and these errors will be passed to the models downstream.

Instead, after training the classification model, we need to use the predictions of this model to create the training dataset for the next set of models. For example, we could create the training dataset for the model that predicts the distance of typical rentals: take the examples where `predicted_trip_type = 'typical'` and use these as training data to make predictions on the distance. We do the same for the examples where `predicted_trip_type = 'long'`.
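The key point, splitting the downstream training data by predicted rather than actual trip type, can be sketched as below. The rentals, the rule-based stand-in for the trained classifier and the mean-distance stand-ins for the downstream models are all hypothetical.

```python
from statistics import mean

# Hypothetical rentals: (start_hour, actual_duration_hours, distance_km).
rentals = [
    (8, 0.3, 1.2), (9, 0.5, 2.0), (10, 5.0, 0.4), (11, 6.5, 0.2),
    (17, 0.4, 1.8), (18, 0.6, 2.4), (12, 4.5, 3.0), (13, 0.7, 0.3),
]

def classify_trip(start_hour):
    """Stand-in for a trained classifier using inference-time features only."""
    return "long" if 10 <= start_hour <= 13 else "typical"

# Build the downstream training sets from the classifier's predictions,
# not from the true duration, so they match what is seen at inference.
training_sets = {"typical": [], "long": []}
for start_hour, _actual_duration, distance in rentals:
    predicted_trip_type = classify_trip(start_hour)
    training_sets[predicted_trip_type].append(distance)

# Stand-in downstream models: predict the mean distance of each segment.
distance_models = {kind: mean(ds) for kind, ds in training_sets.items()}

def predict_distance(start_hour):
    """Route the example through the classifier, then the matching model."""
    return distance_models[classify_trip(start_hour)]

print(training_sets)
print(predict_distance(8), predict_distance(12))
```

The last rental is actually short (0.7 hours) but is predicted as long, so it lands in the training data of the "long" model. This is intentional: the downstream model learns from the same kind of misrouted examples it will receive in production.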

Finally, our evaluation and prediction should take into account that we need to use three trained models, not just one. This is what we term the Cascade design pattern.

Whenever upstream models are retrained, the downstream models should also be retrained.

### Trade-Offs and Alternatives