# Problem Representation Design Pattern

This chapter looks at the different types of machine learning problems and analyses how the model architectures vary depending on the problem.

The input and output types are two key factors impacting the model architecture. For example, the output required from a model can determine whether we choose a regression or a classification model. Special neural network layers exist for specific types of input data: convolutional layers for images, speech, text and other data with spatiotemporal correlation, and recurrent networks for sequential data. Special classes of solutions exist for commonly occurring problems such as recommendations (matrix factorization) or time series (ARIMA). A group of simpler models together with common idioms can be used to solve more complex problems, e.g. text generation often involves a classification model whose outputs are postprocessed using a beam search algorithm.

## Design Pattern 5: Reframing

This pattern refers to changing the representation of the output of an ML problem. For example, we could take something that is intuitively a regression problem and instead pose it as a classification problem.

### Problem

The first step of building any ML solution is framing the problem. Is this a supervised learning problem? Or unsupervised? What are the features? If it is a supervised problem what are the labels? What amount of error is acceptable? Of course, the answers to these questions must be considered in context with the training, the task at hand, and the metrics for success.

For example, if we wanted to predict the amount of rainfall in a given area, we could make this a regression problem. We could also treat it as a time-series problem. There are lots of adjustments we can make to improve our model. But is regression the only way we can pose this task? Perhaps we can reframe our machine learning objective in a way that improves our task performance.

### Solution

If we use a regression model to predict the amount of rainfall, we are limited to predicting a single number. We can instead reframe this as a classification problem that models a discrete probability distribution: the rainfall amount is binned into classes, e.g. `0 - 0.5mm`, `0.5 - 1.0mm` etc., and each class has an associated probability. The alternative remains a regression model that predicts a single real-valued number.

Both the regression approach and the reframed classification approach give a prediction of the rainfall. However, the classification approach allows the model to capture the probability distribution over different rainfall quantities.
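As a minimal sketch of the binning step, the snippet below turns continuous rainfall amounts into class indices that a classifier could be trained on. The bin edges, the `rainfall_to_class` and `class_label` helpers, and the sample observations are all illustrative assumptions, not values from the book.

```python
from bisect import bisect_right

# Hypothetical bin edges in mm; the last bin is open-ended (2.0 mm and above).
BIN_EDGES = [0.0, 0.5, 1.0, 1.5, 2.0]


def rainfall_to_class(mm: float) -> int:
    """Map a rainfall amount to the index of the bin that contains it."""
    if mm < 0:
        raise ValueError("rainfall cannot be negative")
    # bisect_right counts the edges <= mm; subtracting 1 gives a 0-based bin index.
    return bisect_right(BIN_EDGES, mm) - 1


def class_label(idx: int) -> str:
    """Human-readable name for a bin index."""
    lo = BIN_EDGES[idx]
    if idx == len(BIN_EDGES) - 1:
        return f"{lo}+ mm"
    return f"{lo}-{BIN_EDGES[idx + 1]} mm"


# Made-up observations, used only to show the mapping.
observations = [0.0, 0.3, 0.7, 0.7, 1.2, 3.4]
labels = [rainfall_to_class(x) for x in observations]
print([class_label(i) for i in labels])
```

The resulting integer labels would be the targets of a multiclass classifier. The choice of bin width trades off precision against the number of examples available per class.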

### Why It Works

By reframing, we lose a little precision due to bucketing but gain the expressiveness of a full probability density function. The discretised predictions provided by the classification model are more adept at learning a complex target than a more rigid regression model.

An added advantage of the classification framing is that we obtain a posterior probability distribution over the predicted values, which provides more nuanced information. Suppose the learned distribution is bimodal. By modelling the output as a discrete probability distribution, the classification model is able to capture the bimodal structure of the predictions, whereas a regression model, which outputs a single number, would lose this information.
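The following sketch illustrates the bimodal case on synthetic data. It assumes a made-up target with two modes (near 2 and near 8) and compares the best constant prediction under squared error, which is the mean, with a binned distribution. The mode locations, spread and bin layout are assumptions chosen only for illustration.

```python
import random
from collections import Counter

random.seed(0)

# Synthetic bimodal target: half the samples near 2, half near 8 (made-up values).
samples = [random.gauss(2.0, 0.5) for _ in range(5000)] + [
    random.gauss(8.0, 0.5) for _ in range(5000)
]

# Under squared error, the best single-number prediction is the mean.
mean_prediction = sum(samples) / len(samples)


def to_bin(x: float) -> int:
    """Unit-width bins [0, 1), [1, 2), ..., with out-of-range values clamped to 0..9."""
    return min(max(int(x), 0), 9)


counts = Counter(to_bin(x) for x in samples)
probs = [counts[b] / len(samples) for b in range(10)]

print(f"single-number prediction: {mean_prediction:.2f}")
for b, p in enumerate(probs):
    print(f"[{b}, {b + 1}): {p:.3f}")
```

The mean lands between the two modes, in a region where almost no samples occur, while the binned distribution keeps both peaks visible.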

#### Capturing Uncertainty

The dataset of babies born to 25-year-old mothers at 38 weeks shows a normal distribution of birth weights with a mean of 7.5 pounds. There is a nontrivial likelihood (33%) that a given baby weighs less than 6.5 pounds or more than 8.5 pounds (this is 1 STD either side of the mean, page 83 in book). The width of this distribution indicates the irreducible error inherent to the problem of predicting baby weight. If we frame it as a regression problem, the best RMSE we can obtain is the standard deviation of the distribution.
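The claim that the best achievable RMSE equals the standard deviation can be checked numerically. The sketch below uses synthetic weights drawn from a normal distribution with mean 7.5 and an assumed standard deviation of 1.0 pound; it is not the natality dataset, and the `rmse` helper is defined here for illustration.

```python
import math
import random
import statistics

random.seed(42)

# Synthetic stand-in for the weights of one group of babies (assumed mean 7.5, STD 1.0).
weights = [random.gauss(7.5, 1.0) for _ in range(20000)]


def rmse(prediction: float, targets: list) -> float:
    """Root mean squared error of a single constant prediction."""
    return math.sqrt(sum((t - prediction) ** 2 for t in targets) / len(targets))


# With identical inputs, a regression model can do no better than predicting the mean.
best_constant = statistics.fmean(weights)
std = statistics.pstdev(weights)

# Fraction of babies more than 1 pound away from 7.5 (about one third for a normal curve).
outside = sum(1 for w in weights if abs(w - 7.5) > 1.0) / len(weights)

print(f"RMSE of mean prediction: {rmse(best_constant, weights):.3f}")
print(f"standard deviation:      {std:.3f}")
print(f"fraction outside 1 STD:  {outside:.3f}")
```

Any other constant prediction gives a larger RMSE, which is why the spread of the distribution sets a floor on the regression error for that combination of inputs.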

If we frame it as a regression problem, we would have to state the prediction result as 7.5 +/- 1.0 (or whatever the STD is). Yet the width of the distribution will vary for different combinations of inputs, and so learning the width is another machine learning problem in itself. For example, at the 36th week, for mothers of the same age, the standard deviation is 1.16 pounds.

Had the distribution been multimodal (with multiple peaks), the case for reframing the problem as a classification problem would have been even stronger. However, it is helpful to realise that because of the law of large numbers, as long as we capture all of the relevant inputs, many of the distributions we will encounter on large datasets will be bell-shaped, although other distributions are possible. The wider the bell curve, and the more its width varies across different input values, the more important it is to capture uncertainty and the stronger the case for reframing the regression problem as a classification one.

By reframing the problem, we train the model as a multiclass classifier that learns a discrete probability distribution for the given training examples. These discretised predictions are more flexible in terms of capturing uncertainty and better able to approximate a complex target than a regression model. At inference time, the model predicts a collection of probabilities corresponding to these potential outputs. That is, we obtain a discrete PDF giving the relative likelihood of any specific weight.
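A rough sketch of the inference step follows. It assumes the classifier emits one logit per weight bucket and applies a softmax to turn the logits into a discrete PDF. The bucket centres and logit values are hypothetical, not outputs of a trained model.

```python
import math


def softmax(logits: list) -> list:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical bucket centres in pounds and made-up logits for one example.
bin_centers = [5.5, 6.5, 7.5, 8.5, 9.5]
logits = [0.1, 1.2, 2.0, 1.1, 0.0]

pdf = softmax(logits)

# Two ways to read the distribution: the most likely bucket, or the expected value.
most_likely = bin_centers[pdf.index(max(pdf))]
expected = sum(p * c for p, c in zip(pdf, bin_centers))

for c, p in zip(bin_centers, pdf):
    print(f"{c} lb: {p:.3f}")
print(f"most likely bucket: {most_likely} lb, expected weight: {expected:.2f} lb")
```

Unlike a single regression output, the full list of probabilities shows how confident the model is and whether more than one bucket is plausible.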

#### Changing the Objective

