
Chapter 2: Data Representation Design Patterns

At the heart of any ML model is a mathematical function that operates on specific data types. Real-world data may not be directly pluggable into the mathematical function. Therefore, we need data representations.

The process of creating features to represent the input data is called feature engineering, and so we can think of feature engineering as a way of selecting the data representation.

The process of learning features to represent the input data is called feature extraction, and we can think of learnable data representations (like embeddings) as automatically engineered features.

The data representation doesn't need to be learned or fixed - a hybrid is also possible.

We may also need to combine input data of different types, or represent the same data in more than one way - this is multimodal input.

This chapter covers simple data representations and four design patterns: Hashed Feature, Embeddings, Feature Cross, and Multimodal Input.


Simple Data Representations

Not a feature representation design pattern, but a common practice in ML models.

Numerical Inputs

For numerical values, we often scale them to take values in [-1, 1]. Why?

  • ML frameworks use optimizers that are tuned to work well with numbers in this range, so scaling helps the optimizer converge faster and can improve model accuracy.
  • Some ML algorithms are sensitive to the relative magnitude of features (e.g. K-Means).
  • It also helps L1 and L2 regularization: since the penalty treats all weights the same way, features should not differ wildly in magnitude.

What are different types of scaling?

  • Linear scaling
    • Min-max scaling
    • Clipping
    • Z-score normalization
    • Winsorizing (clip data outside, for example, the 10th and 90th percentiles)
  • Nonlinear transformations
    • Used when data is skewed, or is neither uniformly nor normally (Gaussian) distributed
    • We apply a nonlinear transformation before scaling (to make the distribution look more bell-shaped)
      • Custom functions: e.g. log $\rightarrow$ fourth root $\rightarrow$ ...
      • Bucketize: so bucket boundaries fit the desired distribution
      • Box-Cox transform: this method chooses its single parameter ($\lambda$) to control the "heteroscedasticity", so that the variance no longer depends on the magnitude.
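A minimal sketch of these scaling options using scikit-learn; the synthetic, skewed, positive-valued feature is made up for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

# Hypothetical skewed, positive-valued feature (e.g. trip distances in km).
x = np.random.lognormal(mean=1.0, sigma=0.75, size=(1000, 1))

# Min-max scaling: maps values linearly into [-1, 1].
minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

# Clipping / winsorizing: clamp values outside the 10th and 90th percentiles.
lo, hi = np.percentile(x, [10, 90])
winsorized = np.clip(x, lo, hi)

# Z-score normalization: zero mean, unit variance.
zscored = StandardScaler().fit_transform(x)

# Box-Cox transform (requires strictly positive inputs): chooses lambda so the
# transformed data looks closer to Gaussian.
boxcox = PowerTransformer(method='box-cox').fit_transform(x)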

Categorical Inputs

Most ML models operate on numerical values. Thus, we need to transform our categorical data into numbers.

  • One-hot encoding: converts a categorical feature into a vector whose length equals the vocabulary size

    (eg "English" $\rightarrow$ [0, 0, 1, 0, ..., 0])

  • Array: if the array of categories is of fixed length, we can treat each array position as a feature.
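A quick one-hot encoding sketch using pandas; the language column is a made-up example:

import pandas as pd

# Hypothetical categorical feature with a small vocabulary of languages.
df = pd.DataFrame({'language': ['English', 'French', 'Swahili', 'English']})

# One-hot encoding: one column per vocabulary entry, a single 1 per row.
one_hot = pd.get_dummies(df['language'])
print(one_hot)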


Design Pattern 1: Hashed Feature

There are certain problems with categorical features. Namely,

  • incomplete vocabulary: training data does not contain all the possible values.
  • high cardinality: a feature vector may have a length of thousands to millions.
  • cold start: after the model is placed into production, new data is introduced.

Hashed Feature design pattern represents categorical variables by:

  1. Converting the categorical input into a unique string.
  2. Applying a deterministic hashing algorithm on the string.
  3. Taking the remainder of hash result divided by the desired number of buckets.

It's easy to see that all three above-mentioned issues are addressed.
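A minimal sketch of these three steps, using TensorFlow's non-cryptographic string hashing op as a stand-in for the fingerprint-plus-modulo pipeline mentioned under "Order of operations" below (the airport codes and bucket count are made up):

import tensorflow as tf

NUM_BUCKETS = 10  # a hyperparameter; see "Hyperparameter tuning" below

# 1. The categorical input as strings.
airports = tf.constant(['ORD', 'JFK', 'SFO', 'BTV'])

# 2 & 3. Apply a deterministic (fingerprint-style) hash and take the
# remainder modulo the desired number of buckets.
buckets = tf.strings.to_hash_bucket_fast(airports, NUM_BUCKETS)
print(buckets.numpy())  # bucket ids in [0, NUM_BUCKETS)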

Tradeoffs and alternatives

  • Bucket collision: different values may share the same bucket.
  • Skew: very different values can share a bucket, and a very frequent value (e.g. Chicago's busy airport) can dominate a rare one (e.g. a small Vermont airport) placed in the same bucket.
  • Aggregate features: it may be helpful to add an aggregate feature so the difference between different-values-placed-in-the-same-bucket is preserved. (eg. number_of_flights)
  • Hyperparameter tuning: to find the best number of buckets.
  • Cryptographic hash: not reproducible in the deterministic way we need. That's why we use fingerprint hashing.
  • Order of operations: the order of operations (e.g. ABS(MOD(FINGERPRINT(value), num_buckets))) is important for reproducibility.
  • Empty hash buckets: It would be useful to apply L2 regularization to lower the weights associated with an empty bucket to near zero.

Design Pattern 2: Embeddings

Embeddings are a learnable data representation that map data into a lower-dimensional space in such a way that the information relevant to the learning problem is preserved. They provide a way to handle disparate data types in a way that preserves similarity between items and thus improves our model's ability to learn those essential patterns.

Remember one-hot encoding? What if we had a very large number of categories? Also, it treats the categorical variables as independent, which might not be the case.

Embeddings address both problems.

Text embeddings

Text provides a natural setting where it is advantageous to use an embedding layer. To do that in Keras, see the following example:

# Assumes a DataFrame |titles_df| with a |title| column, and that |embed_dim|
# and |N_CLASSES| have been chosen for the task at hand.
from tensorflow import keras

# First, we create a tokenization for each word in our vocabulary.
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(titles_df.title)

# Second, we use the |texts_to_sequences| method to convert words into their indices.
integerized_titles = tokenizer.texts_to_sequences(titles_df.title)

# Third, we pad sentences so all of them have the same length.
VOCAB_SIZE = len(tokenizer.index_word)
MAX_LEN = max(len(sequence) for sequence in integerized_titles)

def create_sequences(texts, max_len=MAX_LEN):
  sequences = tokenizer.texts_to_sequences(texts)
  padded_seqs = keras.preprocessing.sequence.pad_sequences(sequences,
                                                           maxlen=max_len,
                                                           padding='post')
  return padded_seqs

# Finally, we can pass our padded sequences to the model. The embedding layer
# outputs one vector per token, so we average them before the dense layer.
model = keras.models.Sequential([
  keras.layers.Embedding(input_dim=VOCAB_SIZE + 1,
                         output_dim=embed_dim,  # arbitrary embedding length
                         input_shape=[MAX_LEN]),
  keras.layers.GlobalAveragePooling1D(),
  keras.layers.Dense(N_CLASSES, activation='softmax')  # example: sentiment analysis
])
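To train this sketch end to end, something like the following could work, assuming an array of integer class labels (here called labels, a hypothetical name) is available:

# Build padded inputs from the titles and fit the model.
X = create_sequences(titles_df.title)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, labels, epochs=5)  # |labels| is assumed to hold integer class ids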

Image embeddings

Data types such as images or audio consist of dense, high-dimensional vectors. Therefore, lowering their dimensionality by learning embeddings is essential.

For image embeddings, there are several pretrained CNN architectures - like Inception or ResNet - available. We usually use these pretrained CNNs by removing the last softmax layer to obtain a lower-dimension embedding for our images. Then we can plug that embedding layer into our network for our purpose. Suppose we have an image captioning task at hand:

[Figure: embeddings01 - image captioning with a pretrained CNN embedding]
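For example, a pretrained ResNet50 with its classification head removed can serve as an image-embedding layer. A sketch, using ResNet's default input size and a random stand-in batch:

import tensorflow as tf
from tensorflow import keras

# Pretrained CNN with the final softmax layer removed; global average pooling
# turns the last feature map into a fixed-length embedding vector.
image_embedder = keras.applications.ResNet50(include_top=False,
                                             weights='imagenet',
                                             pooling='avg',
                                             input_shape=(224, 224, 3))
image_embedder.trainable = False  # keep the pretrained weights frozen

# Stand-in batch of images; real images would first go through
# keras.applications.resnet50.preprocess_input.
images = tf.random.uniform((4, 224, 224, 3))
embeddings = image_embedder(images)  # shape: (4, 2048)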

Tradeoffs and alternatives

  • Choosing the embedding dimension: by lowering the dimensionality, we lose some information. The optimal embedding dimension is a hyperparameter that we need to tune.
  • Autoencoders: Training embeddings in a supervised way may require a lot of labeled data. If that is not possible, we can set up an autoencoder network to obtain the embeddings (see the sketch at the end of this list).

[Figure: embeddings02 - autoencoder architecture with an embedding bottleneck]

  • Context language models: a pretrained text embedding, like Word2Vec or BERT, can be added to an ML model to process text features in conjunction with learned embeddings from image/video/... inputs.
    • Word2Vec: Continuous Bag of Words (CBOW) + skip-gram
    • BERT: masked language model + next sentence prediction. Therefore, it can understand the difference between Apple as the fruit or as the company.
  • Embeddings in a data warehouse: we can load a pretrained model into our data warehouse and use it to transform a text column into an embedding array. (More on this in Chapter 6)
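A minimal Keras autoencoder sketch, referenced from the "Autoencoders" bullet above; the input width and bottleneck size are arbitrary choices for illustration:

from tensorflow import keras

INPUT_DIM = 100   # width of the hypothetical input features
EMBED_DIM = 8     # size of the bottleneck, i.e. the learned embedding

inputs = keras.Input(shape=(INPUT_DIM,))
encoded = keras.layers.Dense(32, activation='relu')(inputs)
bottleneck = keras.layers.Dense(EMBED_DIM, activation='relu')(encoded)
decoded = keras.layers.Dense(32, activation='relu')(bottleneck)
outputs = keras.layers.Dense(INPUT_DIM)(decoded)

# Train to reconstruct the input; no labels needed.
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')

# After training, reuse only the encoder to produce embeddings.
encoder = keras.Model(inputs, bottleneck)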

Design Pattern 3: Feature Cross

A feature cross is formed by concatenating two or more categorical features in order to capture the interaction between them. It would also make it possible to encode nonlinearity into the model.

Complex models like neural networks and tree ensembles learn such interactions automatically, but an explicit feature cross can help simpler linear models improve. It also allows data warehouse queries to produce quick analytical reports.

Tradeoffs

  • Handling numerical features: the space of a continuous variable is infinite, so we should bucketize numerical data to make it categorical before crossing.
  • Handling high cardinality: crossing features multiplies their cardinalities, so the crossed feature may need to be hashed or passed through an embedding layer.
  • Need for regularization: crossed features are sparse and high-cardinality, so regularization helps.
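A minimal sketch of a feature cross on two hypothetical categorical columns (day of week × hour bucket), built by simple string concatenation and then one-hot encoded:

import pandas as pd

df = pd.DataFrame({'day_of_week': ['Mon', 'Tue', 'Mon'],
                   'hour_bucket': ['am', 'am', 'pm']})

# Feature cross: concatenate the two categorical values into one feature...
df['day_x_hour'] = df['day_of_week'] + '_' + df['hour_bucket']

# ...which can then be one-hot encoded (or hashed, as in the Hashed Feature
# pattern, when the crossed cardinality gets large).
crossed_one_hot = pd.get_dummies(df['day_x_hour'])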

Design Pattern 4: Multimodal Input

Sometimes it is desirable to feed different data types into the model for more accurate predictions. For example, if we have image data together with its metadata, we may be able to make more accurate predictions for traffic control.

In addition to mixing different data types, we may also want to represent the same data in different ways to make it easier for our model to identify patterns.

  • Combining different types of data, like images + metadata
  • Representing complex data in multiple ways

Tabular data

Suppose we have five-star review ratings of restaurants. We may want to use the rating as-is, and also categorize it as "dislike" for {1, 2, 3} and "like" for {4, 5}. That is because the rating, as measured on a 1-to-5-star scale, does not necessarily increase linearly in effect.
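A tiny sketch of keeping both representations of the same column (the ratings are invented):

import pandas as pd

ratings = pd.DataFrame({'rating': [1, 3, 5, 4, 2]})

# Keep the raw numeric rating...
ratings['rating_raw'] = ratings['rating']

# ...and also a categorical "like"/"dislike" view of the same column.
ratings['sentiment'] = ratings['rating'].map(
    lambda r: 'like' if r >= 4 else 'dislike')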

Text

Text can be represented as embeddings, as we have seen before. Or, we can represent it as a bag of words (BoW). BoW does not preserve the order of our text, but it does detect the presence or absence of certain words.

  • BoW does not require training and its encoding can be used in simpler models, like XGBoost or linear regression.
  • We can extract tabular features from text too. Like {title_len, word_count, ends_with_q_mark,...}.

Using these representations, next to embeddings, may also improve deep learning models.
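A bag-of-words sketch using scikit-learn's CountVectorizer; the example titles are invented:

from sklearn.feature_extraction.text import CountVectorizer

titles = ['how to bake bread', 'how to train a model']

# Bag of words: one column per vocabulary word, counts per title;
# word order is discarded but presence/absence (and counts) are kept.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(titles)
print(vectorizer.get_feature_names_out())
print(bow.toarray())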

Images

Images can be represented as arrays of pixel values (the output of Flatten(), as in dense networks built for MNIST) or in a tiled structure (the output of CNN layers). The following code gives an example using the Keras Functional API:

# Assumes N_CLASSES has been set for the task at hand.
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Flatten, Conv2D, MaxPooling2D,
                                     Dense, concatenate)

image_input = Input(shape=(28, 28, 3))

# Pixel values: the flattened raw image.
pixel_layer = Flatten()(image_input)

# Tiled representation: feature maps from a small convolutional stack.
tiled_layer = Conv2D(filters=16, kernel_size=3, activation='relu')(image_input)
tiled_layer = MaxPooling2D()(tiled_layer)
tiled_layer = Flatten()(tiled_layer)

# Concatenate both representations into a single layer
merged_image_layers = concatenate([pixel_layer, tiled_layer])

# Output layer
merged_dense = Dense(16, activation='relu')(merged_image_layers)
merged_output = Dense(N_CLASSES)(merged_dense)

# Build the model
model = Model(inputs=image_input, outputs=merged_output)

Images + Metadata

[Figure: image_metadata - combining image and metadata inputs in a single model]

  • Caveat: DL models are inherently difficult to explain. However, there are several techniques for explaining image models that can highlight the pixels that affected the model's predictions. By combining image data with metadata, these features become dependent on one another, and therefore it can be difficult to explain how the model is making its predictions.
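Continuing the Keras Functional API example above, here is a sketch of how the image branch could be combined with a small metadata branch; the metadata width of 10 is an arbitrary, hypothetical choice:

# Metadata branch (10 is a hypothetical number of metadata features).
metadata_input = Input(shape=(10,))
metadata_dense = Dense(8, activation='relu')(metadata_input)

# Combine the image representations from above with the metadata branch.
combined = concatenate([merged_image_layers, metadata_dense])
combined_dense = Dense(16, activation='relu')(combined)
combined_output = Dense(N_CLASSES, activation='softmax')(combined_dense)

multimodal_model = Model(inputs=[image_input, metadata_input],
                         outputs=combined_output)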