# Data Representation Design Patterns

ML models use a mathematical function that is defined to operate only on specific types of data. Real-world data often can't be plugged straight into a model, so we sometimes need to transform features into a form the model can use. The process of creating features to represent the input data is called feature engineering, so we can think of feature engineering as a way of selecting the data representation.

Some models can learn the data representation themselves, e.g. a decision tree can learn the threshold at which to split a float feature. Embeddings are another design pattern, one that deep neural networks are capable of learning. Embeddings are dense and lower-dimensional compared to the input, which can be sparse. The learning algorithm needs to extract the most salient information from the input and represent it more concisely in the feature. The process of learning features to represent the input is called feature extraction. We can think of learnable data representations such as embeddings as automatically engineered features.

Data representation doesn't have to be either learned or fixed; a hybrid is also possible. The hashed feature design pattern is deterministic, but doesn't require the model to know all the potential values that a particular input can take.

## Simple Data Representations 

### Numerical Inputs

Most ML models work on numerical inputs. If a feature is numerical, we can (which doesn't mean we should) leave it unchanged.

#### Why Scaling is Desirable

Many ML models use an optimiser that is tuned to work well with numbers in the range [-1, 1], so scaling the numeric values to lie in that range can be beneficial.

Gradient descent optimisers require more steps to converge as the curvature of the loss function increases. This is because the derivatives of features with larger relative magnitudes tend to be larger as well, and so lead to abnormal weight updates. The large weight updates require more steps to converge and thereby increase the computational load. "Centring" the data to lie in the [-1, 1] range makes the error function more spherical. Models trained on transformed data therefore tend to converge faster and are faster and cheaper to train. In addition, the [-1, 1] range offers the most floating point precision.
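
The convergence argument can be checked with a small experiment. The following is a minimal sketch on synthetic data, not a benchmark: the two features, their scales, the tolerance and the helper `gd_steps` are all assumptions made for illustration, and z-score scaling is used in place of [-1, 1] scaling to keep the code short.

```python
import numpy as np

def gd_steps(X, y, tol=1e-8, max_steps=100_000):
    """Count gradient descent steps until the MSE loss is within tol of its optimum."""
    n = len(y)
    hessian = X.T @ X / n
    # Step size 1/L, where L is the largest curvature of the quadratic loss.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w_opt = np.linalg.lstsq(X, y, rcond=None)[0]

    def loss(w):
        return 0.5 * np.mean((X @ w - y) ** 2)

    best = loss(w_opt)
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        if loss(w) - best < tol:
            return step
        w -= lr * (X.T @ (X @ w - y) / n)
    return max_steps

# Hypothetical data: two features whose magnitudes differ by a factor of 30.
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=500)
x2 = rng.normal(0.0, 30.0, size=500)
X_raw = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * x2 + rng.normal(0.0, 0.1, size=500)

# Z-score scaling gives both features comparable magnitude.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

steps_raw = gd_steps(X_raw, y)
steps_scaled = gd_steps(X_scaled, y)
print(f"raw: {steps_raw} steps, scaled: {steps_scaled} steps")
```

On the unscaled data the loss surface is a long, narrow valley, so the step size that is safe for the steep direction is far too small for the shallow one. After scaling the surface is close to spherical and the same procedure needs far fewer steps.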

A quick test is to load the diabetes dataset from scikit-learn and train a linear model with scaled and unscaled features. With a model that uses just one input feature, you should typically see roughly a 9% improvement in training time.
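
A minimal sketch of that test follows. It assumes the single feature is column 2 (BMI) and that "improvement" means wall-clock training time measured with `timeit`. Timings are noisy and machine-dependent, so treat any percentage as indicative only. Note also that scikit-learn's copy of this dataset is already mean-centred and normalised by default, and that `LinearRegression` solves least squares directly rather than by gradient descent.

```python
import timeit

from sklearn import datasets, linear_model

# Load the diabetes dataset and keep one input feature (column 2, BMI).
X, y = datasets.load_diabetes(return_X_y=True)
raw = X[:, None, 2]  # shape (n_samples, 1)

# Min-max scale the feature to [-1, 1].
max_raw, min_raw = raw.max(), raw.min()
scaled = (2 * raw - max_raw - min_raw) / (max_raw - min_raw)

def train_raw():
    linear_model.LinearRegression().fit(raw, y)

def train_scaled():
    linear_model.LinearRegression().fit(scaled, y)

# Time many repeated fits to average out noise.
raw_time = timeit.timeit(train_raw, number=1000)
scaled_time = timeit.timeit(train_scaled, number=1000)
print(f"raw: {raw_time:.3f}s, scaled: {scaled_time:.3f}s")
```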

Another important reason to scale is that some ML models are sensitive to the relative magnitudes of the different features. For example, the K-Means algorithm uses Euclidean distance as its proximity measure and will end up relying heavily on features with larger magnitudes. Lack of scaling also affects L1 and L2 regularisation: the magnitude of a feature's weight depends on the magnitude of that feature's values, so different features are affected differently by regularisation. By scaling all features to lie in [-1, 1], we ensure that there is not much difference in the relative magnitudes of different features.
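
The effect on a distance-based method can be seen with three points. This is an illustrative sketch only: the customers, the feature bounds `lo` and `hi`, and the helpers `dist` and `scale` are hypothetical.

```python
import numpy as np

# Hypothetical customers: [age in years, income in dollars].
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])  # very different age, similar income
c = np.array([26.0, 58_000.0])  # similar age, somewhat different income

def dist(u, v):
    """Euclidean distance, as used by K-Means."""
    return float(np.linalg.norm(u - v))

# Unscaled: income dominates, so a appears closer to b than to c.
print(dist(a, b), dist(a, c))

# Scale each feature to [-1, 1] using assumed "reasonable" bounds.
lo = np.array([18.0, 20_000.0])
hi = np.array([90.0, 200_000.0])

def scale(x):
    return (2 * x - hi - lo) / (hi - lo)

# Scaled: both features contribute, and a is now closer to c.
print(dist(scale(a), scale(b)), dist(scale(a), scale(c)))
```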

#### Linear Scaling

Four forms of linear scaling are typically employed:

- **Min-max scaling**: The numeric value is linearly scaled so that the minimum value the input can take maps to -1 and the maximum to 1: ```x1_scaled = (2*x1 - max_x1 - min_x1) / (max_x1 - min_x1)```. The issue with this scaler is that the min and max values need to be estimated from the training dataset, and they are often outliers. The real data often gets shrunk to a very narrow range within the [-1, 1] band.
- **Clipping (with min-max)**: Helps address the problem of outliers by using "reasonable" values instead of estimating the min and max from the training dataset. The numeric value is linearly scaled between these two reasonable bounds, then clipped to lie in the range [-1, 1]. This has the effect of treating outliers as -1 or 1.
- **Z-score normalisation**: Addresses the problem of outliers without requiring prior knowledge of what the reasonable range is, by linearly scaling the input using the mean and standard deviation estimated over the training dataset: ```x1_scaled = (x1 - mean_x1) / stddev_x1```. The name of the method reflects the fact that the scaled value has zero mean and is normalised by the standard deviation so that it has unit variance over the training dataset. The scaled value is unbounded, but it does lie in [-1, 1] the majority of the time (roughly 68%, if the underlying distribution is normal). Values outside this range get rarer the larger their absolute value gets, but are still present.
- **Winsorizing**: Uses the empirical distribution of the training dataset to clip the data to the 10th and 90th percentiles (or the 5th and 95th, and so on). The winsorized value is then min-max scaled.
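
The four scalers above can be sketched in a few lines of NumPy. The function names, the toy `train` array and the clipping bounds of 0 and 10 are assumptions for illustration. In practice the statistics (min, max, mean, standard deviation, percentiles) would be estimated once on the training set and reused at prediction time.

```python
import numpy as np

def min_max_scale(x, lo, hi):
    """Linearly map [lo, hi] onto [-1, 1]."""
    return (2 * x - hi - lo) / (hi - lo)

def clip_scale(x, lo, hi):
    """Min-max scale with fixed 'reasonable' bounds, then clip to [-1, 1]."""
    return np.clip(min_max_scale(x, lo, hi), -1.0, 1.0)

def z_score(x, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return (x - mean) / std

def winsorize_scale(x, lo_value, hi_value):
    """Clip to the given percentile values, then min-max scale."""
    return min_max_scale(np.clip(x, lo_value, hi_value), lo_value, hi_value)

# Hypothetical training values with a single outlier.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

mm = min_max_scale(train, train.min(), train.max())
cl = clip_scale(train, 0.0, 10.0)  # 0 and 10 are assumed reasonable bounds
zs = z_score(train, train.mean(), train.std())
p10, p90 = np.percentile(train, [10, 90])
ws = winsorize_scale(train, p10, p90)

print(mm.round(2))  # the outlier squeezes the other values towards -1
print(cl.round(2))  # the outlier is treated as 1
```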

Min-max scaling and clipping tend to work best on uniformly distributed data, and Z-score normalisation on normally distributed data.

Don't throw away outliers. If we throw away outliers, the model will treat outliers it meets in production as the most extreme value it saw in training. We should keep outliers in the training dataset so the model can better reflect how to behave when they are encountered. It is valid to throw away invalid data, but it is not acceptable to throw away valid data. Thus, we are only justified in throwing away data points that are invalid.


#### Non-Linear Transformations

If the data is neither uniformly distributed nor distributed like a bell curve, it is better to apply a non-linear transformation, such as taking the logarithm, before scaling. Other common transformations are the sigmoid and polynomial expansions (square, square root, cube, cube root and so on). We'll know that we have a good transformation function if the distribution of the transformed value becomes uniform or normally distributed.

For example, with Wikipedia page views we can apply a logarithm, take the fourth root, and then scale linearly to get a roughly bell-shaped curve.
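
A minimal sketch of that chain of transformations is given here. It uses synthetic heavy-tailed values as a stand-in for page views, not real Wikipedia data. The Pareto distribution, the sample size and the `skewness` helper are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed "page view" values (classical Pareto, minimum 1).
views = 1.0 + rng.pareto(1.0, size=10_000)

def skewness(x):
    """Sample skewness: close to 0 for a symmetric distribution."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

logged = np.log(views)   # compress the long right tail
rooted = logged ** 0.25  # fourth root
# Linear (min-max) scaling to [-1, 1], with bounds estimated from the data.
scaled = (2 * rooted - rooted.max() - rooted.min()) / (rooted.max() - rooted.min())

# Skewness should shrink towards 0 as the transformations are applied.
print(skewness(views), skewness(logged), skewness(scaled))
```

Plotting a histogram of `scaled` is the more direct check; the skewness values are just a quick numeric proxy for how symmetric the result is.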

It can be difficult to devise a linearising function that makes the distribution look like a bell curve. An easier approach is to `bucketize` the number of views, choosing the bucket boundaries to fit the desired output distribution. A principled approach to choosing these buckets is histogram equalisation, where the bins of the histogram are chosen based on quantiles of the raw distribution. In the ideal situation, histogram equalisation results in a uniform distribution.
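
Quantile-based bucketing can be sketched with NumPy alone. The synthetic `views` values and the choice of ten buckets are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed values standing in for page views.
views = 1.0 + rng.pareto(1.0, size=10_000)

n_buckets = 10
# Interior boundaries at the 10th, 20th, ..., 90th percentiles of the raw data.
boundaries = np.quantile(views, np.linspace(0, 1, n_buckets + 1)[1:-1])

# np.digitize maps each value to a bucket index in 0..n_buckets-1.
bucket_ids = np.digitize(views, boundaries)

# Each bucket holds roughly the same number of values (a uniform histogram).
counts = np.bincount(bucket_ids, minlength=n_buckets)
print(counts)
```

The boundaries would be computed on the training data and stored, so that the same bucketing can be applied to new values at prediction time.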

Another technique for handling skewed distributions is to use a parametric transformation such as the Box-Cox transformation. Box-Cox chooses a single parameter, lambda, to control the "heteroscedasticity" so that the variance no longer depends on the magnitude. Here, the variance among rarely viewed Wikipedia pages will be much smaller than the variance among frequently viewed pages, and Box-Cox tries to equalise the variance across all ranges of the number of views. This can be done in scikit-learn with `sklearn.preprocessing.power_transform`, setting `method` to `box-cox`.
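
A short sketch of the scikit-learn call follows, using synthetic log-normal values in place of real page-view counts (the distribution parameters are assumptions).

```python
import numpy as np
from sklearn.preprocessing import power_transform

rng = np.random.default_rng(0)
# Hypothetical, strictly positive, right-skewed values.
views = rng.lognormal(mean=3.0, sigma=1.5, size=5_000)

# power_transform expects a 2D array of shape (n_samples, n_features).
# By default it also standardises the output to zero mean and unit variance.
transformed = power_transform(views.reshape(-1, 1), method="box-cox")

print(transformed.mean(), transformed.std())
```

Box-Cox requires strictly positive input. To reuse the fitted lambda on new data, the estimator form `sklearn.preprocessing.PowerTransformer` can be fitted on the training set and then applied with `transform`.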

#### Array of Numbers

Common idioms to handle arrays of numbers:
- Representing the input array in terms of its bulk statistics. For example, length of array, average, median, min, max and so forth.
- Representing the input array in terms of its empirical distribution, i.e. by the 10th, 20th percentile and so on.
- If the array is ordered in a specific way (e.g. time or size), representing the input array by the last three or some other fixed number of items. For arrays of length less than 3, the feature is padded to a length of three with missing values.

All of these end up representing the variable-length array of data as a fixed-length feature.
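
The three idioms can be sketched as small helper functions. The names `bulk_stats`, `percentiles` and `last_n` are hypothetical, the first two assume a non-empty array, and padding at the front with NaN is one possible choice for the missing values.

```python
import numpy as np

def bulk_stats(arr):
    """Length, mean, median, min and max of a non-empty array."""
    a = np.asarray(arr, dtype=float)
    return [len(a), a.mean(), np.median(a), a.min(), a.max()]

def percentiles(arr, qs=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Empirical distribution summarised by fixed percentiles."""
    return np.percentile(np.asarray(arr, dtype=float), qs).tolist()

def last_n(arr, n=3, pad=np.nan):
    """Last n items of an ordered array, front-padded with missing values."""
    tail = list(arr)[-n:]
    return [pad] * (n - len(tail)) + tail

print(bulk_stats([1, 2, 3, 4]))
print(last_n([5, 6, 7, 8]))
print(last_n([9]))
```

Whatever the length of the input, each helper returns a list of fixed length, which is what a model with a fixed input shape needs.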

### Categorical Inputs