# Problem Representation Design Pattern

This chapter looks at the different types of machine learning problems and analyses how the model architectures vary depending on the problem.

The input and output types are two key factors that affect the model architecture. For example, the output required from a model can determine whether we choose a regression or a classification model. Special neural network layers exist for specific types of input data: convolutional layers for images, speech, text, and other data with spatiotemporal correlation; recurrent networks for sequential data. Special classes of solutions exist for commonly occurring problems such as recommendations (matrix factorization) or time series (ARIMA). A group of simpler models together with common idioms can be used to solve more complex problems, e.g. text generation often involves a classification model whose outputs are postprocessed using a beam search algorithm.

## Design Pattern 5: Reframing

This pattern refers to changing the output of an ML problem. For example, we could take something that is intuitively a regression problem and instead pose it as a classification problem.

### Problem

The first step of building any ML solution is framing the problem. Is this a supervised learning problem? Or unsupervised? What are the features? If it is a supervised problem what are the labels? What amount of error is acceptable? Of course, the answers to these questions must be considered in context with the training, the task at hand, and the metrics for success.

For example, if we wanted to predict the amount of rainfall in a given area, we could make this a regression problem. We could also treat it as a time series problem. There are lots of adjustments we can make to improve our model. But is regression the only way we can pose this task? Perhaps we can reframe our machine learning objective in a way that improves our task performance.

### Solution

If we use a regression model to predict the amount of rainfall, we are limited to a prediction of a single number. We can reframe this as a classification problem, where one approach is to model a discrete probability distribution: we bin the amount of rainfall into classes, e.g. `0-0.05mm`, `0.5 - 1.0mm` etc., and each class has an associated probability. The alternative remains a regression model that predicts a single real-valued number.

Both the regression approach and the reframed classification approach give a prediction of the rainfall. However, the classification approach allows the model to capture the probability distribution of rainfall of different quantities.

### Why It Works

By reframing we lose a little precision due to bucketing, but gain the expressiveness of a full probability density function. The discretised predictions provided by the classification model are more adept at learning a complex target than a more rigid regression model.

An added advantage of this classification framing is that we obtain the posterior probability distribution of our predicted values, which provides more nuanced information. Suppose the learned distribution is bimodal. By modelling the classification output as a discrete probability distribution, the model is able to capture the bimodal structure of the predictions, whereas a regression model would lose this information.
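The bimodal case can be illustrated with a minimal sketch. The target values and bucket edges below are made up for illustration; the point is only that a mean-style regression prediction lands between the two modes, while bucketed class probabilities keep both peaks.

```python
import statistics
from collections import Counter

# Hypothetical target values drawn from two clusters (a bimodal target).
targets = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]

# A regression model trained with squared error tends towards the mean,
# which here falls between the two modes where no data actually lies.
regression_prediction = statistics.mean(targets)

# Assumed buckets for the reframed problem: [0, 2), [2, 4), [4, 6).
bucket_edges = [0.0, 2.0, 4.0, 6.0]
n_buckets = len(bucket_edges) - 1


def bucket_index(value):
    """Return the index of the bucket whose half-open range contains value."""
    for i in range(n_buckets):
        if bucket_edges[i] <= value < bucket_edges[i + 1]:
            return i
    raise ValueError(f"{value} is outside the bucket range")


# Empirical class probabilities, standing in for a classifier's output.
counts = Counter(bucket_index(t) for t in targets)
class_probabilities = [counts[i] / len(targets) for i in range(n_buckets)]

print(regression_prediction)  # close to 3.0, a value that never occurs
print(class_probabilities)    # mass sits in the first and last buckets
```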

#### Capturing Uncertainty

Looking at the dataset of babies born to 25-year-old mothers at 38 weeks shows a normal distribution with a mean at 7.5 pounds. There is a nontrivial likelihood (33%) that a given baby is less than 6.5 pounds or more than 8.5 pounds (this is 1 STD either side of the mean, page 83 in book). The width of this distribution indicates the irreducible error inherent to the problem of predicting baby weight. If we framed it as a regression problem, the best RMSE we could obtain is the standard deviation of the distribution.
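As a rough check of these two claims, the sketch below assumes the weights follow a normal distribution with mean 7.5 and standard deviation 1.0 pounds (the figures quoted above) and uses synthetic samples rather than the real dataset. It computes the probability of falling more than one standard deviation from the mean, which comes out at roughly a third, and shows that a constant prediction at the mean has an RMSE close to the standard deviation.

```python
import math
from statistics import NormalDist

# Assumed distribution: mean 7.5 lb, standard deviation 1.0 lb.
weights = NormalDist(mu=7.5, sigma=1.0)

# Probability of a weight below 6.5 lb or above 8.5 lb.
prob_outside = weights.cdf(6.5) + (1.0 - weights.cdf(8.5))

# Synthetic samples, seeded so the run is reproducible.
samples = weights.samples(10_000, seed=0)


def rmse(constant_prediction):
    """RMSE of always predicting the same value for every sample."""
    squared_errors = [(s - constant_prediction) ** 2 for s in samples]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rmse_at_mean = rmse(7.5)   # close to the standard deviation
rmse_off_mean = rmse(7.0)  # any other constant does worse

print(round(prob_outside, 3))
print(round(rmse_at_mean, 3), round(rmse_off_mean, 3))
```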

If we frame it as a regression problem, we would have to state the prediction result as 7.5 +/- 1.0 (or whatever the STD is). Yet the width of the distribution will vary for different combinations of inputs, and so learning the width is another machine learning problem in itself. For example, at the 36th week, for mothers of the same age, the standard deviation is 1.16 pounds.

Had the distribution been multimodal (with multiple peaks), the case for reframing the problem as a classification problem would have been even stronger. However, it is helpful to realise that because of the law of large numbers, as long as we capture all of the relevant inputs, many of the distributions we will encounter on large datasets will be bell-shaped, although other distributions are possible. The wider the bell curve, and the more the width varies at different values of inputs, the more important it is to capture uncertainty and the stronger the case for reframing the regression problem as a classification one.

By reframing the problem, we train the model as a multiclass classification that learns a discrete probability distribution for the given training examples. These discretised predictions are more flexible in terms of capturing uncertainty and better able to approximate the complex target than a regression model. At inference time, the model then predicts a collection of probabilities corresponding to these potential outputs. That is, we obtain a discrete PDF giving the relative likelihood of any specific weight.
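A minimal sketch of what this looks like at inference time, assuming five hypothetical weight buckets and made-up logits standing in for a trained model's final-layer output. The softmax turns the logits into a discrete PDF, from which we can read the most likely bucket, an expected value, or the probability of a range.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical bucket midpoints in pounds and hypothetical model logits.
bucket_midpoints = [5.5, 6.5, 7.5, 8.5, 9.5]
logits = [0.2, 1.5, 2.4, 1.4, 0.1]

pdf = softmax(logits)

# Single most likely bucket.
most_likely_weight = bucket_midpoints[pdf.index(max(pdf))]

# Expected value, comparable to a regression-style point estimate.
expected_weight = sum(p * m for p, m in zip(pdf, bucket_midpoints))

# Probability that the weight falls in the two lowest buckets.
prob_low_weight = pdf[0] + pdf[1]

print([round(p, 3) for p in pdf])
print(most_likely_weight, round(expected_weight, 2), round(prob_low_weight, 3))
```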

### Trade-Offs and Alternatives

There is rarely just one way to frame a problem. For example, bucketizing the output values of a regression is an approach to reframing the problem as a classification task. Another approach is multitask learning that combines both tasks (classification and regression) into a single model using multiple prediction heads. With any reframing technique, being aware of data limitations or the risk of introducing label bias is important.

#### Bucketised outputs

The typical approach to reframing a regression task as a classification task is to bucketise the output values. For example, if our model is to be used to indicate how much rain we will get on a given day, we may bucket the values into, say, 5 groups.

The regression problem now becomes a classification problem. Intuitively, it is easier to predict 5 categories than to predict a single continuous number. By using categorical outputs, the model is incentivised less for getting arbitrarily close to the actual output value since we've essentially changed the output label to a range of values instead of a single number.
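Bucketising the labels can be sketched with the standard library. The bucket boundaries and class names here are hypothetical; in practice they would be chosen from the data, for example so that each class has enough examples.

```python
from bisect import bisect_right

# Hypothetical boundaries in mm: four edges give five classes.
bucket_edges = [0.5, 2.0, 5.0, 10.0]
class_names = ["0-0.5mm", "0.5-2mm", "2-5mm", "5-10mm", "10mm+"]


def bucketise(rain_mm):
    """Map a continuous rainfall amount to a class index (lower edge inclusive)."""
    return bisect_right(bucket_edges, rain_mm)


def one_hot(index, n_classes):
    """Encode a class index as a one-hot label for a softmax classifier."""
    return [1 if i == index else 0 for i in range(n_classes)]


rainfall = [0.0, 0.7, 3.2, 12.5]
class_indices = [bucketise(r) for r in rainfall]
labels = [one_hot(i, len(class_names)) for i in class_indices]

for amount, index in zip(rainfall, class_indices):
    print(amount, "->", class_names[index])
```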

#### Restricting the Prediction Range

For a given problem the prediction range may be `[3,20]`. If we train a regression model, there is always a chance that the model may make predictions outside of this range. One way to limit the range is to reframe the problem. For example, we could build a DNN where the last layer is a sigmoid layer and then map its `[0,1]` output to the range `[3,20]`.
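The mapping itself is a simple affine rescaling of the sigmoid output. The sketch below shows only that final step in plain Python; `raw_output` is a stand-in for whatever value the last layer of the network produces before the activation.

```python
import math

LOW, HIGH = 3.0, 20.0  # the allowed prediction range from the example


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bounded_prediction(raw_output):
    """Squash an unbounded value into [0, 1], then rescale to [LOW, HIGH]."""
    return LOW + (HIGH - LOW) * sigmoid(raw_output)


for raw_output in [-50.0, -1.0, 0.0, 1.0, 50.0]:
    print(raw_output, "->", round(bounded_prediction(raw_output), 3))
```

However extreme the raw output is, the prediction stays inside the range; a raw output of 0 maps to the midpoint, 11.5.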

#### Label Bias

It is important to consider the nature of the target label when reframing the problem. For example, suppose we reframed our recommendation model as a classification task that predicts the likelihood a user will click on a certain video thumbnail. This seems like a reasonable reframing since our goal is to provide content a user will select and watch. But be careful. This change of label is not actually in line with our prediction task. By optimising for user clicks, our model will inadvertently promote clickbait and not actually recommend content of use to the user.

Instead, a more advantageous label would be video watch time, reframing our recommendation as a regression instead. Alternatively, we could predict the likelihood that the user will watch at least half the video clip.

## Design Pattern 6: Multilabel

The multilabel design pattern refers to a problem where we assign more than one label to a given training example.

### Problem

Often prediction tasks involve applying a single classification to a given training example. This prediction is determined from N possible classes, where N is greater than 1. In this case, it's common to use softmax as the activation function for the output layer. Using softmax, the output of our model is an N-element array, where the sum of all the values adds up to 1. Each value indicates the probability that a particular example is associated with the class at that index.

For example, if our model is classifying images as cats, dogs or rabbits, the softmax output might look like this for a given image: `[0.89, 0.02, 0.09]`. This means our model is predicting an 89% chance the image is a cat. Because each image can only have one possible label in this scenario, we can take the `argmax` (index of the highest probability) to determine our model's predicted class. The less-common scenario is when each training example can be assigned more than one label, which is what this pattern addresses.
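Taking the `argmax` of the softmax output from the example can be written in a few lines of plain Python. The class ordering is an assumption: it follows the order in which the classes are listed above.

```python
# Assumed class order, matching the example: cats, dogs, rabbits.
class_names = ["cat", "dog", "rabbit"]
softmax_output = [0.89, 0.02, 0.09]

# argmax: the index of the highest probability.
predicted_index = max(range(len(softmax_output)), key=softmax_output.__getitem__)
predicted_class = class_names[predicted_index]

print(predicted_class, softmax_output[predicted_index])
```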

The multilabel design pattern exists for models trained on all data modalities. For image classification, in the earlier example we could instead have used images that depict multiple animals and could therefore have multiple labels. The same can be applied to text models, e.g. a news article could belong to many different categories.

The design pattern can also apply to tabular datasets; for example, healthcare data could be used to predict multiple conditions.

### Solution

The solution is to use a sigmoid activation function in our final layer instead of a softmax. Each individual value in a sigmoid array is a float between 0 and 1. That is to say, when implementing the Multilabel design pattern, our label needs to be multi-hot encoded. The length of the multi-hot array corresponds to the number of classes in our model, and each output in this label array will be a sigmoid value.

The main difference between the sigmoid and softmax is that the softmax array is guaranteed to contain three values that sum to 1, whereas the sigmoid output will contain three values each between 0 and 1.

The sigmoid is a nonlinear, continuous and differentiable activation function that takes the outputs of each neuron in the previous layer in the model and squashes the value of those outputs between 0 and 1.
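The difference between the two activations is easy to see by applying both to the same raw outputs. The logits and the multi-hot label below are made up for a three-class example; they are not taken from a real model.

```python
import math


def sigmoid(z):
    """Squash a single value into the range (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Normalise all values jointly so that they sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final-layer outputs for the classes cat, dog, rabbit.
logits = [2.0, -1.0, 0.5]

softmax_output = softmax(logits)
sigmoid_output = [sigmoid(z) for z in logits]

# A multi-hot label for an image that contains both a cat and a rabbit.
multi_hot_label = [1, 0, 1]

print([round(p, 3) for p in softmax_output], round(sum(softmax_output), 3))
print([round(p, 3) for p in sigmoid_output], round(sum(sigmoid_output), 3))
```

The softmax values compete with each other, so a high score for one class pushes the others down. The sigmoid values are independent, so several classes can score highly at once, which is what a multi-hot label requires.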

### Trade-Offs and Alternatives

- **Multiclass classification**: Each example can have only 1 label
- **Binary classification**: The number of classes is 2
- **Multilabel classification**: Each example can have many labels

In a multiclass scenario we use softmax, and in a binary classification scenario we use sigmoid. In a multilabel scenario we use a sigmoid for each label.

For the multilabel scenario we can use the binary cross entropy loss because a multilabel problem is essentially `n` smaller binary classification problems.
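A minimal sketch of this idea: compute the binary cross entropy separately for each label and average the results. Averaging across labels is an assumption about the reduction; the probabilities are made up to contrast a good and a bad prediction.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Loss for one binary decision; eps guards against log(0)."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def multilabel_loss(multi_hot_label, sigmoid_output):
    """Treat each label as its own binary problem and average the losses."""
    losses = [
        binary_cross_entropy(y, p)
        for y, p in zip(multi_hot_label, sigmoid_output)
    ]
    return sum(losses) / len(losses)


multi_hot_label = [1, 0, 1]
good_prediction = [0.9, 0.1, 0.8]  # agrees with the label on every class
bad_prediction = [0.2, 0.9, 0.3]   # disagrees with the label on every class

good_loss = multilabel_loss(multi_hot_label, good_prediction)
bad_loss = multilabel_loss(multi_hot_label, bad_prediction)

print(round(good_loss, 3), round(bad_loss, 3))
```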

#### Parsing Sigmoid Results

By applying a sigmoid per class we obtain a probability per class. To assign labels to a given prediction, we can say that if the probability of a label is above 50% it should be assigned to the data point. Alternatively, we can apply `n_specific_tag` / `n_total_examples` as a threshold for each class. Here, `n_specific_tag` is the number of examples with a given tag in the dataset and `n_total_examples` is the total number of examples in the training set across all tags. This ensures that the model is doing better than guessing a certain label based on its occurrence in the training dataset.

For a more precise approach, read this [paper](https://pralab.diee.unica.it/sites/default/files/pillai_PR2013_Thresholding_0.pdf). It uses S-Cut for optimising your model's F-measure.
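The two simpler thresholding rules, a fixed 50% cut-off and a per-class `n_specific_tag` / `n_total_examples` cut-off, can be compared on a toy example. The training labels and the sigmoid output below are hypothetical; this sketch does not implement S-Cut.

```python
# Hypothetical multi-hot training labels for three tags.
training_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]
n_total_examples = len(training_labels)

# Per-class threshold: n_specific_tag / n_total_examples.
frequency_thresholds = [
    sum(column) / n_total_examples for column in zip(*training_labels)
]


def assign_labels(sigmoid_output, thresholds):
    """Assign a label wherever the probability exceeds that class's threshold."""
    return [int(p > t) for p, t in zip(sigmoid_output, thresholds)]


sigmoid_output = [0.7, 0.3, 0.1]

fixed_result = assign_labels(sigmoid_output, [0.5, 0.5, 0.5])
frequency_result = assign_labels(sigmoid_output, frequency_thresholds)

print(frequency_thresholds)  # [0.8, 0.2, 0.2]
print(fixed_result)          # [1, 0, 0]
print(frequency_result)      # [0, 1, 0]
```

With the frequency-based rule, the first tag is not assigned because 0.7 is below what its 80% base rate would already suggest, while the second tag is assigned because 0.3 is above its 20% base rate.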

#### Dataset Considerations

Dataset balancing is important for ML models and is more nuanced for the Multilabel design pattern.

For the model to learn what each unique label is, we'll want to ensure the training dataset consists of varied combinations of each tag. If two labels frequently occur together, the model may not learn to classify one of them when it appears on its own. To account for this, think about the relationships between the labels and count the number of training examples that belong to each overlapping combination of labels.
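Counting label combinations is straightforward with `collections.Counter`. The tag sets below are a made-up dataset, chosen so that one label never appears on its own.

```python
from collections import Counter

# Hypothetical training examples, each with its set of tags.
examples = [
    {"cat"},
    {"cat", "dog"},
    {"cat", "dog"},
    {"dog", "rabbit"},
    {"cat", "dog"},
]

# How often each exact combination of labels occurs.
combination_counts = Counter(frozenset(tags) for tags in examples)

# How often each label occurs overall, in any combination.
per_label_counts = Counter(tag for tags in examples for tag in tags)

# How often each label occurs on its own.
solo_counts = {
    tag: combination_counts[frozenset({tag})] for tag in per_label_counts
}

for combination, count in combination_counts.most_common():
    print(sorted(combination), count)
print(solo_counts)
```

Here "dog" appears in four examples but never alone, which is the kind of gap this check is meant to surface.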

We can consider hierarchical labels if the dataset allows. e.g.
```
animal -> invertebrate -> arthropod -> spider
```
There are two common approaches for handling hierarchical labels:
- Use a flat approach and put every label in the same output array. Make sure there are enough samples at each leaf node
- Use the Cascade design pattern. Build one model to identify higher-level labels. Based on the higher-level classification, send the example to a separate model for a more specific classification task. E.g. if the higher-level model predicts a datapoint to be an "animal", we then send the datapoint to different model(s) to apply more granular labels.

The flat approach is more straightforward. However, it might cause the model to lose information about more detailed label classes, since there will naturally be more training examples with higher-level labels in the dataset.

#### Inputs with Overlapping Labels

The multilabel approach to predictions is useful when labels overlap. For example, if an image contains multiple fashion items and two people were to label the items in it, they might choose:

- Long sleeved blazer
- Double breasted blazer

Both labels are correct, but an issue arises because, depending on who labelled the image, the model may predict things differently. This is where multilabel is useful, because it allows us to associate both overlapping labels with an image.

#### One Versus Rest

Another technique for handling multilabel classification is to train multiple binary classifiers instead of one multilabel model. This approach is called _one versus rest_. We would train a binary classifier for each label.

This can help with rare categories, since the model will be performing only one classification task at a time on each input. The disadvantage of this approach is the added complexity of training many different classifiers.
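A minimal sketch of one versus rest, assuming a tiny hand-made dataset and a plain logistic regression trained by gradient descent as the binary classifier. The feature values, tag names, learning rate and epoch count are all illustrative choices, not recommendations; any binary classifier could be substituted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary_classifier(features, targets, lr=0.5, epochs=2000):
    """Fit a logistic regression with batch gradient descent."""
    n_examples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n_examples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_examples
    return weights, bias


def predict_probability(model, x):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical data: two features and two tags that can co-occur.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
multi_hot_labels = [[0, 0], [0, 1], [1, 0], [1, 1]]
tag_names = ["tag_a", "tag_b"]

# One versus rest: one independent binary classifier per tag.
models = [
    train_binary_classifier(features, [row[k] for row in multi_hot_labels])
    for k in range(len(tag_names))
]


def predict_labels(x, threshold=0.5):
    """Run every binary classifier and collect the decisions as multi-hot."""
    return [int(predict_probability(m, x) > threshold) for m in models]


for x in features:
    print(x, "->", predict_labels(x))
```

Each classifier sees the full dataset but only its own column of the label matrix, which is also where the extra training and maintenance cost comes from.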

## Design Pattern 7: Ensembles

