# Data Representation Design Patterns

At the heart of any machine learning model is a mathematical function that
is defined to operate on specific types of data only. At the same time,
real-world machine learning models need to operate on data that may not be
directly pluggable into the mathematical function.

The process of creating features to represent the input data is called `feature engineering`, and so we can think of feature engineering as a way of selecting the data representation.

The process of learning features to represent the input data is called `feature extraction`, and we can think of learnable data representations (like embeddings) as automatically engineered features.

The data representation design patterns include:

1. `Hashed Feature`  
It involves encoding categorical inputs as unique strings and hashing them.

2. `Embeddings`  
It's a technique for representing high-cardinality data such as inputs with many possible categories or text data. Embeddings represent data in multidimensional space, where the dimension is dependent on our data and prediction task.

3. `Feature Cross`  
It's an approach that joins two features to extract relationships that may not have been easily captured by encoding the features on their own.

4. `Multimodal Input`  
It addresses the problem of how to combine inputs of different types into the same model, and how a single feature can be represented multiple ways.

## #1: Hashed Feature

The Hashed Feature design pattern addresses three possible problems
associated with categorical features:
- incomplete vocabulary
- model size due to cardinality
- cold start.

It does so by grouping the categorical
features and accepting the trade-off of collisions in the data representation.

### Problem
One-hot encoding a categorical input variable requires knowing the
vocabulary beforehand. This is not a problem if the input variable is
something like the language a book is written in or the day of the week
that traffic level is being predicted.

What if the categorical variable in question is something like the
hospital_id of where the baby is born or the physician_id of the
person delivering the baby? Categorical variables like these pose a few problems:

- Knowing the vocabulary requires extracting it from the training
data. Due to random sampling, it is possible that the training data
does not contain all the possible hospitals or physicians. The
vocabulary might be incomplete.

- The categorical variables have high cardinality. Instead of having feature vectors with three languages or seven days, we have
feature vectors whose length is in the thousands to millions. Such
feature vectors pose several problems in practice. They involve so
many weights that the training data may be insufficient. Even if
we can train the model, the trained model will require a lot of
space to store because the entire vocabulary is needed at serving
time. Thus, we may not be able to deploy the model on smaller
devices.

- After the model is placed into production, new hospitals might be
built and new physicians hired. The model will be unable to make
predictions for these, and so a separate serving infrastructure will
be required to handle such cold-start problems.
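
The vocabulary and cold-start problems can be illustrated with a minimal sketch of one-hot encoding. The hospital IDs and the helper names `build_vocabulary` and `one_hot` are hypothetical, chosen only for illustration.

```python
def build_vocabulary(training_values):
    """Map each distinct category seen in training to a column index."""
    return {value: index for index, value in enumerate(sorted(set(training_values)))}


def one_hot(value, vocabulary):
    """Return a one-hot vector; raises KeyError for categories not seen in training."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[value]] = 1
    return vector


# Hypothetical hospital IDs sampled for training.
training_hospital_ids = ["H-101", "H-205", "H-101", "H-330"]
vocabulary = build_vocabulary(training_hospital_ids)

known = one_hot("H-205", vocabulary)

try:
    one_hot("H-999", vocabulary)  # a hospital that did not appear in training
    unseen_ok = True
except KeyError:
    unseen_ok = False  # the encoder has no column for the new hospital
```

The feature vector is as long as the vocabulary, and the vocabulary itself has to be shipped with the model, which is where the size and cold-start concerns come from.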

### Solution
The Hashed Feature design pattern represents a categorical input variable
by doing the following:

1. Converting the categorical input into a unique string.

2. Invoking a deterministic (no random seeds or salt) and portable
(so that the same algorithm can be used in both training and
serving) hashing algorithm on the string.

3. Taking the remainder when the hash result is divided by the desired number of buckets. Typically the hashing algorithm returns an integer that can be negative, and in many languages the modulo of a negative integer is also negative, so the absolute value of the result is taken.

In the Hashed Feature design pattern, we have to use a fingerprint hashing
algorithm and not a cryptographic hashing algorithm. This is because the
goal of a fingerprint function is to produce a deterministic and unique
value. If you think about it, this is a key requirement of preprocessing
functions in machine learning, since we need to apply the same function
during model serving and get the same hashed value.
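
The three steps can be sketched as follows. This is an illustration under stated assumptions, not the pattern's reference implementation: FNV-1a is used here only as a stand-in for a simple, deterministic, non-cryptographic hash, and the names `fnv1a_64` and `hashed_feature` and the hospital IDs are hypothetical.

```python
def fnv1a_64(text: str) -> int:
    """64-bit FNV-1a: a simple, deterministic, non-cryptographic hash."""
    hash_value = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
    for byte in text.encode("utf-8"):
        hash_value ^= byte
        # Multiply by the 64-bit FNV prime and keep only the low 64 bits.
        hash_value = (hash_value * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return hash_value


def hashed_feature(value, num_buckets: int) -> int:
    """Stringify the input, hash it, then take the remainder."""
    as_string = str(value)            # step 1: categorical input as a string
    hashed = fnv1a_64(as_string)      # step 2: deterministic, portable hash
    # Step 3: abs() only matters when the hash or the language's modulo can be negative.
    return abs(hashed % num_buckets)


bucket_seen = hashed_feature("H-101", num_buckets=10)
bucket_unseen = hashed_feature("H-999", num_buckets=10)  # a new value still gets a bucket
```

Python's built-in `hash()` is not a good fit for this sketch because, for strings, it is randomized per process by default, so training and serving could disagree on the bucket.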

## #2: Embeddings

Embeddings are a learnable data representation that maps high-cardinality data into a lower-dimensional space in such a way that the information relevant to the learning problem is preserved. Embeddings are at the heart of modern-day machine learning and have various incarnations throughout the field.

### Problem

Machine learning models systematically look for patterns in data that capture how the properties of the model’s input features relate to the output label. As a result, the data representation of the input features directly affects the quality of the final model.

While handling structured, numeric input is fairly straightforward, the data needed to train a machine learning model can come in myriad varieties, such as categorical features, text, images, audio, time series, and many more.

For these data representations, we need a meaningful numeric value to supply our machine learning model so these features can fit within the typical training paradigm.

Embeddings provide a way to handle some of these disparate data types in a way that preserves similarity between items and thus improves our model’s ability to learn those essential patterns.

### Solution
The Embeddings design pattern addresses the problem of representing high-cardinality data densely in a lower dimension by passing the input data through an embedding layer that has trainable weights.

This will map the high-dimensional, categorical input variable to a real-valued vector in some low-dimensional space. The weights to create the dense representation are learned as part of the optimization of the model. In practice, these embeddings end up capturing closeness
relationships in the input data.
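
As a rough sketch of what an embedding layer computes, the lookup below selects one row of a weight table per category, which is equivalent to multiplying a one-hot vector by the weight matrix. The vocabulary size, embedding dimension, and function names are assumptions for illustration, and the weights are random here rather than learned.

```python
import random


def make_embedding_table(vocabulary_size, embedding_dim, seed=0):
    """Randomly initialised weights; in a real model these are updated during training."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocabulary_size)]


def embed(category_index, table):
    """Embedding lookup: select the row for this category."""
    return table[category_index]


def embed_via_one_hot(category_index, table):
    """Equivalent view: a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if i == category_index else 0.0 for i in range(len(table))]
    embedding_dim = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(table)))
            for j in range(embedding_dim)]


# 1000 categories are represented by 4 numbers each instead of a 1000-long one-hot vector.
table = make_embedding_table(vocabulary_size=1000, embedding_dim=4)
vector = embed(42, table)
```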

## #3: Feature Cross

The Feature Cross design pattern helps models learn relationships between
inputs faster by explicitly making each combination of input values a
separate feature.

### Problem
Consider the dataset in the figure below and the task of creating a binary classifier that separates the + and − labels.

<img src="images/feature-cross-problem.png" width=400>

Using only the x_1 and x_2 coordinates, it is not possible to find a linear boundary that separates the + and − classes. This means that to solve this problem, we have to make the model more
complex, perhaps by adding more layers to the model. However, a simpler
solution exists.

### Solution

In machine learning, feature engineering is the process of using domain knowledge to create new features that aid the machine learning process and increase the predictive power of our model. One commonly used feature engineering technique is creating a feature cross.

A feature cross is a synthetic feature formed by concatenating two or more categorical features in order to capture the interaction between them. By joining two features in this way, it is possible to encode nonlinearity into the model, which can allow for predictive abilities beyond what each of the features would have been able to provide individually.

Feature crosses provide a way to have the ML model learn relationships between the features faster. While more complex models like neural networks and trees can learn feature crosses on their own, using feature crosses explicitly can allow us to get away with training just a linear model.

Consequently, feature crosses can speed up model training (less expensive) and reduce model complexity (less training data is needed).
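
A minimal sketch of the idea, assuming an XOR-like layout in which a point is labeled + when x_1 and x_2 share a sign (the figure's actual data may differ). Each coordinate is bucketized into a category, the two categories are concatenated into one crossed feature, and a linear model with one weight per crossed category separates the classes. The points, weights, and function names are hypothetical.

```python
def bucketize(coordinate):
    """Turn a numeric coordinate into a two-value categorical feature."""
    return "pos" if coordinate >= 0 else "neg"


def feature_cross(x1, x2):
    """Concatenate the two categorical features into one crossed category."""
    return bucketize(x1) + "_x_" + bucketize(x2)


# One weight per crossed category, i.e. a linear model over the one-hot encoded cross.
# The weights are set by hand for illustration; a real model would learn them.
cross_weights = {
    "pos_x_pos": 1.0,
    "neg_x_neg": 1.0,
    "pos_x_neg": -1.0,
    "neg_x_pos": -1.0,
}


def predict(x1, x2):
    score = cross_weights[feature_cross(x1, x2)]
    return "+" if score > 0 else "-"


# Hypothetical points: (x1, x2, label).
points = [(1.5, 2.0, "+"), (-1.0, -0.5, "+"), (2.0, -1.0, "-"), (-2.5, 1.0, "-")]
predictions = [predict(x1, x2) for x1, x2, _ in points]
```

No single weight on `bucketize(x1)` or `bucketize(x2)` alone could produce these labels, which is the interaction the cross makes explicit.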

## #4: Multimodal Input

The Multimodal Input design pattern addresses the problem of representing different types of data or data that can be expressed in complex ways by concatenating all the available data representations.

### Problem


Typically, an input to a model can be represented as a number or as a category, an image, or free-form text. Many off-the-shelf models are defined for specific types of input only. A standard image classification model such as ResNet-50, for example, cannot handle inputs other than images.

To understand the need for multimodal inputs, let’s say we’ve got a camera capturing footage at an intersection to identify traffic violations. We want our model to handle both image data (camera footage) and some metadata about when the image was captured (time of day, day of week, weather, etc.).

This problem also occurs when training a structured data model where one of the inputs is free-form text. Unlike numerical data, images and text cannot be fed directly into a model. As a result, we’ll need to represent image and text inputs in a way our model can understand (usually using the Embeddings design pattern), then combine these inputs with other tabular features.

### Solution

For example, we might want to predict a restaurant patron’s rating based on their review text and other attributes such as what they paid and whether it was lunch or dinner.

<img src="images/multimodal-input.png" width=600>

We’ll first combine the numerical and categorical features. There are three possible options for meal_type, so we can turn this into a one-hot encoding and will represent dinner as `[0, 0, 1]`. With this categorical feature represented as an array, we can now combine it with meal_total by adding the price of the meal as the fourth element of the array: `[0, 0, 1, 30.5]`.
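
The tabular part of the input can be sketched as below. The three meal types and their order are an assumption (the text only fixes dinner as the third position), and `encode_tabular` is a hypothetical helper.

```python
# Assumed vocabulary and ordering; only "dinner" in the last slot comes from the text.
MEAL_TYPES = ["breakfast", "lunch", "dinner"]


def encode_tabular(meal_type, meal_total):
    """One-hot encode meal_type, then append meal_total as the last element."""
    if meal_type not in MEAL_TYPES:
        raise ValueError(f"unknown meal_type: {meal_type!r}")
    one_hot = [1 if meal_type == candidate else 0 for candidate in MEAL_TYPES]
    return one_hot + [meal_total]


tabular_features = encode_tabular("dinner", 30.5)  # [0, 0, 1, 30.5]
```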

The Embeddings design pattern is a common approach to encoding text for machine learning models. Once the review text has been embedded, we need to flatten the embedding in order to concatenate it with meal_type and meal_total.

We could then use a series of Dense layers to transform that very large array into smaller ones ending with our output that is an array of three numbers.
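
The flatten-then-reduce step might look like the following sketch. Everything numeric here is an assumption for illustration: a review of 5 tokens, 8 dimensions per token, and layer widths of 16 and 3. The weights are random rather than trained, so the output values carry no meaning beyond their shape.

```python
import random


def flatten(matrix):
    """Flatten a (num_tokens x embedding_dim) matrix into one long vector."""
    return [value for row in matrix for value in row]


def make_layer(num_inputs, num_outputs, rng):
    """Random weights and zero biases for a fully connected layer (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_inputs)]
               for _ in range(num_outputs)]
    biases = [0.0] * num_outputs
    return weights, biases


def dense(inputs, layer, relu=False):
    """Fully connected layer: weighted sum plus bias, with an optional ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, value) for value in outputs] if relu else outputs


rng = random.Random(0)
# Hypothetical embedded review: 5 tokens, each an 8-dimensional vector.
token_embeddings = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]

flat = flatten(token_embeddings)                           # 5 * 8 = 40 values
hidden = dense(flat, make_layer(40, 16, rng), relu=True)   # 40 -> 16
review_vector = dense(hidden, make_layer(16, 3, rng))      # 16 -> 3
```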

We now need to concatenate these three numbers, which form the sentence embedding of the review, with the earlier inputs:
```
[0, 0, 1, 30.5, 0.75, -0.82, 0.45]
```
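
In code, this final step is a plain concatenation of the per-modality vectors. The sketch below reuses the numbers from the example above; `concatenate_inputs` is a hypothetical helper name.

```python
def concatenate_inputs(*parts):
    """Join the per-modality feature vectors into a single model input."""
    combined = []
    for part in parts:
        combined.extend(part)
    return combined


tabular_features = [0, 0, 1, 30.5]       # one-hot meal_type followed by meal_total
review_embedding = [0.75, -0.82, 0.45]   # three-number representation of the review text

model_input = concatenate_inputs(tabular_features, review_embedding)
```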