# Problem Representation Design Patterns

To limit our discussion and stay away from areas of active research, we will ignore patterns and idioms associated with specialized machine learning domains. Instead, we will focus on `regression` and `classification` and examine problem representation patterns in just these two types of ML models.

The problem representation design patterns include:

1. `Reframing`  
It takes a problem that is intuitively a regression problem and poses it as a classification problem (and vice versa).

2. `Multilabel`  
It handles the case that training examples can belong to more than one class.

3. `Ensembles`  
It solves a problem by training multiple models and aggregating their responses.

4. `Cascade`  
It addresses situations where a machine learning problem can be profitably broken into a series (or cascade) of ML problems.

5. `Neutral Class`  
It looks at how to handle situations where experts disagree.

6. `Rebalancing`  
It recommends approaches to handle highly skewed or imbalanced data.

## #1: Reframing

The Reframing design pattern refers to changing the representation of the output of a machine learning problem. For example, we could take something that is intuitively a regression problem and instead pose it as a classification problem (and vice versa).

### Problem

The first step of building any machine learning solution is framing the problem. Is this a supervised learning problem? Or unsupervised? What are the features? If it is a supervised problem, what are the labels? What amount of error is acceptable? Of course, the answers to these questions must be considered in context with the training data, the task at hand, and the metrics for success.

For example, suppose we wanted to build a machine learning model to predict future rainfall amounts in a given location. Starting broadly, would this be a regression or classification task? Well, since we’re trying to predict rainfall amount (for example, 0.3 cm), it makes sense to consider this as a time-series forecasting problem: given the current and historical climate and weather patterns, what amount of rainfall should we expect in a given area in the next 15 minutes?

Alternatively, because the label (the amount of rainfall) is a real number, we could build a regression model. As we start to develop and train our model, we find (perhaps not surprisingly) that weather prediction is harder than it sounds. Our predicted rainfall amounts are all off because, for the same set of features, it sometimes rains 0.3 cm and other times it rains 0.5 cm.

What should we do to improve our predictions? Should we add more layers to our network? Or engineer more features? Perhaps more data will help? Maybe we need a different loss function?

Any of these adjustments could improve our model. But wait. Is regression the only way we can pose this task? Perhaps we can reframe our machine learning objective in a way that improves our task performance.

### Solution

The core issue here is that rainfall is probabilistic. For the same set of features, it sometimes rains 0.3 cm and other times it rains 0.5 cm. Yet, even if a regression model were able to learn the two possible amounts, it is limited to predicting only a single number.

Instead of trying to predict the amount of rainfall as a regression task, we can reframe our objective as a classification problem. There are different ways this can be accomplished. One approach is to model a discrete probability distribution. Instead of predicting rainfall as a real-valued output, we model the output as a multiclass classification giving the probability that the rainfall in the next 15 minutes is within a certain range of rainfall amounts.
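The bucketing step can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the bucket edges, the bucket midpoints, and the function names `rainfall_to_class` and `expected_rainfall` are all assumptions, since the text does not specify any ranges.

```python
from bisect import bisect_right

# Hypothetical bucket edges in cm (an assumption; the text gives no ranges).
# They define five classes: [0, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 1.0), [1.0, inf)
BUCKET_EDGES = [0.1, 0.3, 0.5, 1.0]


def rainfall_to_class(amount_cm):
    """Map a real-valued rainfall amount to the index of its bucket."""
    return bisect_right(BUCKET_EDGES, amount_cm)


def expected_rainfall(probs, bucket_midpoints):
    """Collapse a predicted distribution back into one number, if one is needed."""
    return sum(p * m for p, m in zip(probs, bucket_midpoints))


# Training labels become class indices instead of real numbers.
labels = [rainfall_to_class(a) for a in [0.0, 0.3, 0.5, 2.0]]

# A classifier can now express "0.3 cm or 0.5 cm are both likely" as a
# distribution with two high-probability buckets (values invented for illustration).
predicted_probs = [0.05, 0.05, 0.45, 0.40, 0.05]
```

The design choice here is that the model keeps the uncertainty in its output: a downstream consumer can read the whole distribution, or reduce it to a point estimate with something like `expected_rainfall` when a single number is required.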

## #2: Multilabel

The Multilabel design pattern refers to problems where we can assign more than one label to a given training example. For neural networks, this design requires changing the activation function used in the final output layer of the model and choosing how our application will parse model output. Note that this is different from multiclass classification problems, where a single example is assigned exactly one label from a group of many (> 1) possible classes.

### Problem

Often, model prediction tasks involve applying a single classification to a given training example. This prediction is determined from N possible classes where N is greater than 1. In this case, it’s common to use softmax as the activation function for the output layer. Using softmax, the output of our model is an N-element array whose values sum to 1. Each value indicates the probability that a particular training example is associated with the class at that index. Because the values must sum to 1, softmax cannot express that an example belongs to several classes at once.
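To make the softmax behavior concrete, here is a small pure-Python sketch. The input scores are invented for illustration, and a real model would compute this inside its framework's output layer rather than by hand.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])

# Single-label prediction: pick the class with the highest probability.
predicted_class = probs.index(max(probs))
```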

### Solution

The solution for building models that can assign more than one label to a given training example is to use the `sigmoid` activation function in our final output layer. Rather than generating an array where all values sum to 1 (as in softmax), each individual value in a sigmoid array is a float between 0 and 1.

That is to say, when implementing the Multilabel design pattern, our label needs to be multi-hot encoded. The length of the multi-hot array corresponds to the number of classes in our model, and each element of the model's output array is the sigmoid value for the class at that index.

For example, suppose our model classifies images as cat, dog, or rabbit, and our training dataset includes images with more than one animal. The sigmoid output for an image that contained a cat and a dog but not a rabbit might look like the following: [.92, .85, .11].

This output means the model is 92% confident the image contains a cat, 85% confident it contains a dog, and 11% confident it contains a rabbit.
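How the application parses this output is a separate decision. One common option, sketched below, is to compare each value against a threshold. The 0.5 threshold and the helper name `predicted_labels` are assumptions for illustration; the class order follows the cat, dog, rabbit example.

```python
import math

CLASSES = ["cat", "dog", "rabbit"]


def sigmoid(z):
    """Squash one raw score into (0, 1), independently of the other classes."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_labels(probs, threshold=0.5):
    """Return every class whose probability clears the (assumed) threshold."""
    return [name for name, p in zip(CLASSES, probs) if p >= threshold]


# Multi-hot training label for an image with a cat and a dog but no rabbit.
multi_hot_label = [1, 1, 0]

# The sigmoid output from the example; the values need not sum to 1.
model_output = [0.92, 0.85, 0.11]
labels = predicted_labels(model_output)
```

The right threshold depends on the application, and it can differ per class.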

## #3: Ensembles

The Ensembles design pattern refers to techniques in machine learning that combine multiple machine learning models and aggregate their results to make predictions. Ensembles can be an effective means to improve performance and produce predictions that are better than any single model.

### Problem

No machine learning model is perfect. To better understand where and how our model is wrong, the error of an ML model can be broken down into three parts:
- irreducible error
- error due to bias
- error due to variance

The `irreducible error` is the inherent error in the model resulting from noise in the dataset, the framing of the problem, or bad training examples, like measurement errors or confounding factors. Just as the name implies, we can’t do much about irreducible error.

The other two, the bias and the variance, are referred to as the reducible error, and here is where we can influence our model’s performance. In short, the bias is the model’s inability to learn enough about the relationship between the model’s features and labels, while the variance captures the model’s inability to generalize on new, unseen examples.

A model with high bias oversimplifies the relationship and is said to be `underfit`. A model with high variance has learned too much about the training data and is said to be `overfit`.
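For squared-error loss, this three-part breakdown can be written explicitly. As a sketch, assume notation that does not appear in the text above: labels are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fit on a randomly drawn training set. The expected error at a point $x$ then splits as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The first two terms depend on the model and are the reducible part; the last term does not depend on the model at all.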

Is there a way to mitigate this bias–variance trade-off on small- and medium-scale problems?

### Solution

Ensemble methods are meta-algorithms that combine several machine learning models as a technique to decrease the bias and/or variance and improve model performance. Generally speaking, the idea is that combining multiple models helps to improve the machine learning results.

By building several models with different inductive biases and aggregating their outputs, we hope to get a model with better performance. Some commonly used ensemble methods include:
- **bagging**  
Bagging (short for bootstrap aggregating) is a type of parallel ensembling method and is used to address high variance in machine learning models.

- **boosting**  
Boosting is another ensemble technique. However, unlike bagging, boosting ultimately constructs an ensemble model with more capacity than the individual member models. For this reason, boosting provides a more effective means of reducing bias than variance.

- **stacking**  
Stacking is an ensemble method that combines the outputs of a collection of models to make a prediction. The initial models, which are typically of different model types, are trained to completion on the full training dataset. Then, a secondary meta-model is trained using the initial model outputs as features. This second meta-model learns how to best combine the outcomes of the initial models to decrease the training error and can be any type of machine learning model.
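Of the three methods, bagging is the simplest to sketch by hand. The example below is a minimal illustration under stated assumptions: the member model is a 1-nearest-neighbor regressor on a single feature (chosen only because it is a high-variance model that fits in a few lines), and the function names and the default of 25 members are hypothetical.

```python
import random


def fit_1nn(xs, ys):
    """A deliberately high-variance member model: 1-nearest-neighbor regression."""
    points = list(zip(xs, ys))

    def predict(x):
        # Return the label of the closest training point.
        return min(points, key=lambda p: abs(p[0] - x))[1]

    return predict


def bagged_models(xs, ys, n_models=25, seed=0):
    """Train each member on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Aggregate by averaging the member predictions (regression case)."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)
```

Because every member sees a different resample of the data, their individual errors partly cancel when averaged, which is the mechanism by which bagging targets variance. For classification, the aggregation step would typically be a majority vote instead of a mean.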

## #4: Cascade

The Cascade design pattern addresses situations where a machine learning problem can be profitably broken into a series of ML problems. Such a cascade often requires careful design of the ML experiment.

### Problem

What happens if we need to predict a value during both usual and unusual activity? The model will learn to ignore the unusual activity because it is rare. If the unusual activity is also associated with abnormal values, then trainability suffers.

How do we train a cascade of models where the output of one model is an input to the following model or determines the selection of subsequent models?

### Solution

Any machine learning problem where the output of one model is an input to the following model or determines the selection of subsequent models is called a cascade. Special care has to be taken when training a cascade of ML models.

For example, a machine learning problem that sometimes involves unusual circumstances can be solved by treating it as a cascade of four machine learning problems:

1. A classification model to identify the circumstance
2. One model trained on unusual circumstances
3. A separate model trained on typical circumstances
4. A model to combine the output of the two separate models, because the output is a probabilistic combination of the two outputs.
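At prediction time, the four steps above can be sketched as a single function. This is an illustration only: the three models are passed in as plain callables with hypothetical names, and step 4 is replaced by a fixed probability-weighted average rather than a trained combining model.

```python
def cascade_predict(features, classify_unusual, typical_model, unusual_model):
    """Run the cascade for one example and return the combined prediction."""
    # Step 1: probability that this example is an unusual circumstance.
    p_unusual = classify_unusual(features)
    # Step 2: prediction from the model trained on unusual circumstances.
    y_unusual = unusual_model(features)
    # Step 3: prediction from the model trained on typical circumstances.
    y_typical = typical_model(features)
    # Step 4 (simplified assumption): a probability-weighted average stands in
    # for the combining model described in the text.
    return p_unusual * y_unusual + (1.0 - p_unusual) * y_typical


# Stub models with made-up outputs, only to show how the pieces connect.
prediction = cascade_predict(
    {"example_feature": 1.0},
    classify_unusual=lambda f: 0.25,
    typical_model=lambda f: 10.0,
    unusual_model=lambda f: 100.0,
)
```

The part that needs care is training, not inference: the downstream models should be trained on the examples the upstream classifier actually routes to them, so that they see the same kind of input in training as they will in production.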

## #5: Neutral Class

In many classification situations, creating a neutral class can be helpful. For example, instead of training a binary classifier that outputs the probability of an event, train a three-class classifier that outputs disjoint probabilities for Yes, No, and Maybe. Disjoint here means that the classes do not overlap. A training pattern can belong to only one class, and so there is no overlap between Yes and Maybe, for example. The Maybe in this case is the neutral class.

### Problem

Imagine that we are trying to create a model that provides guidance on pain relievers. There are two choices, ibuprofen and acetaminophen. It turns out that, in our historical dataset, acetaminophen tends to be prescribed preferentially to patients at risk of stomach problems, and ibuprofen tends to be prescribed preferentially to patients at risk of liver damage. Beyond that, things tend to be quite random; some physicians default to acetaminophen and others to ibuprofen. Training a binary classifier on such a dataset will lead to poor accuracy because the model will need to get the essentially arbitrary cases correct.

### Solution

If all we have is a historical dataset, we would need to get a labeling service involved. We could ask the human labelers to validate the doctor’s original choice and answer the question of whether an alternate pain medication would be acceptable. Cases where either medication would be acceptable can then be assigned to the neutral class.
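One way the labelers' answers could be turned into three-class training labels is sketched below. The mapping is an assumption made for illustration: any case where the labelers say the alternative would also be acceptable goes to a `"neutral"` class, and the function name and record layout are hypothetical.

```python
def three_class_label(prescribed, alternative_acceptable):
    """Derive a training label from the prescription and the labelers' answer."""
    if alternative_acceptable:
        # Either drug would have been fine, so the original choice was
        # essentially arbitrary: assign the neutral class.
        return "neutral"
    # The choice mattered, so keep the prescribed drug as the label.
    return prescribed


# Hypothetical records: (drug prescribed, labelers say the alternative is acceptable).
records = [
    ("acetaminophen", False),
    ("ibuprofen", False),
    ("ibuprofen", True),
    ("acetaminophen", True),
]
training_labels = [three_class_label(drug, ok) for drug, ok in records]
```

With labels built this way, the model is no longer penalized for failing to reproduce arbitrary choices; it only has to learn that those cases belong to the neutral class.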

## #6: Rebalancing

The Rebalancing design pattern provides various approaches for handling datasets that are inherently imbalanced. By this we mean datasets where one label makes up the majority of the dataset, leaving far fewer examples of other labels.

This design pattern does not address scenarios where a dataset lacks representation for a specific population or real-world environment. Cases like this can often only be solved by additional data collection. The Rebalancing design pattern primarily addresses how to build models with datasets where few examples exist for a specific class or classes.

### Problem

Take, for example, a fraud detection use case, where we are building a model to identify fraudulent credit card transactions. Fraudulent transactions are much rarer than regular transactions, and as such, there is less data on fraud cases available to train a model. A common pitfall in training models with imbalanced label classes is relying on misleading accuracy values for model evaluation.
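A small calculation shows why accuracy misleads here. The numbers are invented for illustration: assume 1,000 transactions of which 5 are fraudulent, and a degenerate "model" that always predicts "not fraud".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical dataset: 5 fraud cases (label 1) among 1,000 transactions.
y_true = [1] * 5 + [0] * 995
# A model that never flags fraud.
y_pred = [0] * 1000

baseline_accuracy = accuracy(y_true, y_pred)  # 99.5% of predictions are correct
baseline_recall = recall(y_true, y_pred)      # yet no fraud case is caught
```

The model scores 99.5% accuracy while catching none of the fraud, which is why metrics such as recall or precision on the minority class are usually more informative for this kind of problem.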

### Solution

First, since accuracy can be misleading on imbalanced datasets, it’s important to choose an appropriate evaluation metric when building our model. Then, there are various techniques we can employ for handling inherently imbalanced datasets at both the dataset and model level. 

- **Downsampling** changes the balance of our underlying dataset by reducing the number of majority-class examples used during training. By contrast, weighting changes how our model handles certain classes.

- **Upsampling** duplicates examples from our minority class, and often involves applying augmentations to generate additional samples.

- **Weighted Classes** changes the weight our model gives to examples from each class. Note that this is a different use of the term “weight” than the weights (or parameters) learned by our model during training, which we typically do not set manually.
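The three techniques can be sketched in plain Python as follows. These are simplified illustrations under stated assumptions: the function names are hypothetical, upsampling is shown as plain duplication without augmentation, and the class weights use the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is one choice among several.

```python
import random
from collections import Counter


def downsample(examples, labels, majority, keep_fraction, seed=0):
    """Keep every minority example and a random fraction of the majority class."""
    rng = random.Random(seed)
    kept = [
        (x, y)
        for x, y in zip(examples, labels)
        if y != majority or rng.random() < keep_fraction
    ]
    return [x for x, _ in kept], [y for _, y in kept]


def upsample(examples, labels, minority, factor):
    """Repeat each minority example so it appears `factor` times in total."""
    out_x, out_y = [], []
    for x, y in zip(examples, labels):
        copies = factor if y == minority else 1
        out_x.extend([x] * copies)
        out_y.extend([y] * copies)
    return out_x, out_y


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}
```

Downsampling and upsampling change what the model sees, while class weights leave the data untouched and instead scale each example's contribution to the loss. Whichever is used, evaluation should still be done on data with the original class balance.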