"""
Title: FeatureSpace advanced use cases
Author: [Dimitre Oliveira](https://www.linkedin.com/in/dimitre-oliveira-7a1a0113a/)
Date created: 2023/07/01
Last modified: 2023/07/01
Description: How to use FeatureSpace for advanced preprocessing use cases.
Accelerator: None
"""
"""
## Introduction
This example extends the
[Structured data classification with FeatureSpace](https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/)
code example to cover more complex use cases of the
[`keras.utils.FeatureSpace`](https://keras.io/api/utils/feature_space/)
preprocessing utility, such as feature hashing, feature crosses, handling missing values, and
integrating [Keras preprocessing layers](https://keras.io/api/layers/preprocessing_layers/)
with `FeatureSpace`.

The general task is still structured data classification (also known as tabular data
classification) using data that includes numerical features, integer categorical
features, and string categorical features.
"""
"""
### The dataset
[Our dataset](https://archive.ics.uci.edu/dataset/222/bank+marketing) is provided by a
Portuguese banking institution.
It's a CSV file with 4119 rows. Each row contains information about marketing campaigns
based on phone calls, and each column describes an attribute of the client. We use the
features to predict whether the client subscribed ('yes') or not ('no') to the product
(bank term deposit).
Here's the description of each feature:
Column| Description| Feature Type
------|------------|-------------
Age | Age of the client | Numerical
Job | Type of job | Categorical
Marital | Marital status | Categorical
Education | Education level of the client | Categorical
Default | Has credit in default? | Categorical
Housing | Has housing loan? | Categorical
Loan | Has personal loan? | Categorical
Contact | Contact communication type | Categorical
Month | Last contact month of year | Categorical
Day_of_week | Last contact day of the week | Categorical
Duration | Last contact duration, in seconds | Numerical
Campaign | Number of contacts performed during this campaign and for this client | Numerical
Pdays | Number of days that passed by after the client was last contacted from a previous campaign | Numerical
Previous | Number of contacts performed before this campaign and for this client | Numerical
Poutcome | Outcome of the previous marketing campaign | Categorical
Emp.var.rate | Employment variation rate | Numerical
Cons.price.idx | Consumer price index | Numerical
Cons.conf.idx | Consumer confidence index | Numerical
Euribor3m | Euribor 3 month rate | Numerical
Nr.employed | Number of employees | Numerical
Y | Has the client subscribed a term deposit? | Target
**Important note regarding the feature `duration`**: this attribute highly affects the
output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a
call is performed. Also, after the end of the call y is obviously known. Thus, this input
should only be included for benchmark purposes and should be discarded if the intention
is to have a realistic predictive model. For this reason we will drop it.
"""
"""
## Setup
"""
import pandas as pd
import tensorflow as tf
from pathlib import Path
from zipfile import ZipFile
from tensorflow.keras.utils import FeatureSpace
"""
## Load the data
Let's download the data and load it into a Pandas dataframe:
"""
data_url = "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
data_zipped_path = tf.keras.utils.get_file("bank_marketing.zip", data_url, extract=True)
keras_datasets_path = Path(data_zipped_path).parents[0]
with ZipFile(f"{keras_datasets_path}/bank-additional.zip", "r") as zip:
    # Extract files
    zip.extractall(path=keras_datasets_path)

dataframe = pd.read_csv(
    f"{keras_datasets_path}/bank-additional/bank-additional.csv", sep=";"
)
"""
We will create a new feature, `previously_contacted`, based on `pdays`, to be able to
demonstrate some useful preprocessing techniques. According to the dataset
documentation, `pdays = 999` means that the client was not previously contacted, so
let's create a feature to capture that.
"""
# Dropping `duration` to avoid target leakage
dataframe.drop("duration", axis=1, inplace=True)

# Creating the new feature `previously_contacted`
dataframe["previously_contacted"] = dataframe["pdays"].map(
    lambda x: 0 if x == 999 else 1
)
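"""
As a quick optional sanity check (this step is not required for the rest of the
example), let's confirm the mapping worked by counting how many clients fall into each
category:
"""
print(dataframe["previously_contacted"].value_counts())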
"""
The dataset includes 4119 samples with 21 columns per sample (20 features, plus the
target label). Here's a preview of a few samples:
"""
print(f"Dataframe shape: {dataframe.shape}")
print(dataframe.head())
"""
The column `y` indicates whether the client has subscribed to a term deposit or not.
"""
"""
## Train/validation split
Let's split the data into a training and validation set:
"""
valid_dataframe = dataframe.sample(frac=0.2, random_state=0)
train_dataframe = dataframe.drop(valid_dataframe.index)
print(
    f"Using {len(train_dataframe)} samples for training and "
    f"{len(valid_dataframe)} for validation"
)
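"""
Datasets like this one are often imbalanced, so it's worth verifying (another optional
check) that the label distribution is similar across both splits:
"""
print(train_dataframe["y"].value_counts(normalize=True))
print(valid_dataframe["y"].value_counts(normalize=True))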
"""
## Generating TF datasets

Let's generate
[`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) objects
for each dataframe. Since our target column `y` is a string, we also need to encode it
as an integer to be able to train our model with it. To achieve this we will create a
`StringLookup` layer that will map the strings "no" and "yes" to 0 and 1
respectively.
"""
label_lookup = tf.keras.layers.StringLookup(
    # the order here is important since the first index will be encoded as 0
    vocabulary=["no", "yes"],
    num_oov_indices=0,
)


def encode_label(x, y):
    encoded_y = label_lookup(y)
    return x, encoded_y


def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("y")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.map(encode_label, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
valid_ds = dataframe_to_dataset(valid_dataframe)
"""
Each `Dataset` yields a tuple `(input, target)` where `input` is a dictionary of features
and `target` is the value `0` or `1`:
"""
for x, y in dataframe_to_dataset(train_dataframe).take(1):
    print(f"Input: {x}")
    print(f"Target: {y}")
"""
## Preprocessing

Usually our data is not in the proper or best format for modeling. This is why most of
the time we need to do some kind of preprocessing on the features, to make them
compatible with the model or to extract the most out of them for the task. We need to do
this preprocessing step for training, but at inference we also need to make sure that
the data goes through the same process. This is where a utility like `FeatureSpace`
shines: we can define all the preprocessing once and re-use it at different stages of
our system.

Here we will see how to use `FeatureSpace` to perform more complex transformations,
show its flexibility, and then combine everything together into a single component to
preprocess data for our model.
"""
"""
The `FeatureSpace` utility learns how to process the data by using the `adapt()`
function to learn from it. This requires a dataset containing only features, so let's
create one, together with a utility function to show the preprocessing examples in
practice:
"""
train_ds_with_no_labels = train_ds.map(lambda x, _: x)


def example_feature_space(dataset, feature_space, feature_names):
    feature_space.adapt(dataset)
    for x in dataset.take(1):
        inputs = {feature_name: x[feature_name] for feature_name in feature_names}
        preprocessed_x = feature_space(inputs)
        print(f"Input: {[{k:v.numpy()} for k, v in inputs.items()]}")
        print(
            f"Preprocessed output: {[{k:v.numpy()} for k, v in preprocessed_x.items()]}"
        )
"""
### Feature hashing
"""
"""
**Feature hashing** means hashing or encoding a set of values into a defined number of
bins. In this case we have `campaign` (the number of contacts performed during this
campaign and for a client), a numerical feature that can assume a varying range of
values, and we will hash it into 4 bins. This means that any possible value of the
original feature will be placed into one of those 4 bins. The output here can be a
one-hot encoded vector or a single number.
"""
feature_space = FeatureSpace(
    features={
        "campaign": FeatureSpace.integer_hashed(num_bins=4, output_mode="one_hot")
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["campaign"])
"""
**Feature hashing** can also be used for string features.
"""
feature_space = FeatureSpace(
    features={
        "education": FeatureSpace.string_hashed(num_bins=3, output_mode="one_hot")
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["education"])
"""
For numerical features we can get a similar behavior by using the `float_discretized`
option. The main difference between this and `integer_hashed` is that with the former we
bin the values while keeping some numerical relationship (close values will likely be
placed in the same bin), while with the latter (hashing) we cannot guarantee that close
numbers will be hashed into the same bin; it depends on the hashing function.
"""
feature_space = FeatureSpace(
    features={"age": FeatureSpace.float_discretized(num_bins=3, output_mode="one_hot")},
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["age"])
"""
### Feature indexing
"""
"""
**Indexing** a string feature essentially means creating a discrete numerical
representation for it. This is especially important for string features, since most
models only accept numerical features. This transformation will place the string values
into different categories. The output here can be a one-hot encoded vector or a single
number.

Note that by specifying `num_oov_indices=1` we leave one spot in our output vector for
OOV (out of vocabulary) values. This is an important tool to handle missing or unseen
values after training (values that were not seen during the `adapt()` step).
"""
feature_space = FeatureSpace(
    features={
        "default": FeatureSpace.string_categorical(
            num_oov_indices=1, output_mode="one_hot"
        )
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["default"])
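"""
To see the OOV slot in action, here is a minimal sketch (the made-up value below is
ours, and the exact position of the OOV slot in the one-hot vector depends on the
adapted vocabulary): feeding a value that was never seen during `adapt()` activates the
reserved OOV bucket instead of raising an error.
"""
# Hypothetical probe value, never present in the `default` column
unseen_input = {"default": tf.constant("a-brand-new-category")}
print(feature_space(unseen_input))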
"""
We can also do **feature indexing** for integer features. This can be quite important
for some datasets where categorical features are represented by numbers, for instance
features like `sex` or `gender`, where values like `1` and `0` do not have a numerical
relationship between them; they are just different categories. This behavior can be
perfectly captured by this transformation.

On this dataset we can use the feature that we created, `previously_contacted`. For this
case we want to explicitly set `num_oov_indices=0`; the reason is that we only expect
two possible values for the feature, anything else would be either wrong input or an
issue with the data creation. For this reason we would probably just want the code to
throw an error, so that we can be aware of the issue and fix it.
"""
feature_space = FeatureSpace(
    features={
        "previously_contacted": FeatureSpace.integer_categorical(
            num_oov_indices=0, output_mode="one_hot"
        )
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["previously_contacted"])
"""
### Feature crosses (mixing features of diverse types)

With **crosses** we can do feature interactions between an arbitrary number of features
of mixed types, as long as they are categorical features. Instead of having a feature
{'age': 20} and another {'job': 'entrepreneur'}, we can have
{'age_X_job': 20_entrepreneur}, and with `FeatureSpace` and **crosses** we can apply
specific preprocessing to each individual feature and to the feature cross itself. This
option can be very powerful for specific use cases; here it might be a good option,
since age combined with job can have different meanings for the banking domain.

We will cross `age` and `job` and hash the combination output of them into a vector
representation of size 8. The output here can be a one-hot encoded vector or a single
number.

Sometimes the combination of multiple features can result in a very large feature
space; think about crossing someone's ZIP code with their last name, the possibilities
would be in the thousands. This is why the `crossing_dim` parameter is so important: it
limits the output dimension of the cross feature.

Note that the combination of possible values of the 6 bins of `age` and the 12 values
of `job` would be 72, so by choosing `crossing_dim = 8` we are choosing to constrain the
output vector.
"""
feature_space = FeatureSpace(
    features={
        "age": FeatureSpace.integer_hashed(num_bins=6, output_mode="one_hot"),
        "job": FeatureSpace.string_categorical(
            num_oov_indices=0, output_mode="one_hot"
        ),
    },
    crosses=[
        FeatureSpace.cross(
            feature_names=("age", "job"),
            crossing_dim=8,
            output_mode="one_hot",
        )
    ],
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["age", "job"])
"""
### FeatureSpace using a Keras preprocessing layer

To be really flexible and extensible, we cannot rely only on those pre-defined
transformations; we must be able to re-use other transformations from the
Keras/TensorFlow ecosystem and customize our own. This is why `FeatureSpace` is also
designed to work with
[Keras preprocessing layers](https://keras.io/guides/preprocessing_layers/). This way we
can use sophisticated data transformations provided by the framework; you can even
create your own custom Keras preprocessing layers and use them in the same way.

Here we are going to use the
[`tf.keras.layers.TextVectorization`](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/#textvectorization-class)
preprocessing layer to create a TF-IDF
feature from our data. Note that this feature is not a really good use case for TF-IDF;
this is just for demonstration purposes.
"""
custom_layer = tf.keras.layers.TextVectorization(output_mode="tf_idf")

feature_space = FeatureSpace(
    features={
        "education": FeatureSpace.feature(
            preprocessor=custom_layer, dtype="string", output_mode="float"
        )
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["education"])
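"""
You can take this one step further with a fully custom layer. Below is a hypothetical
sketch (the `LogPlusOne` layer is our own, not part of Keras, and we assume that
`FeatureSpace.adapt()` simply skips preprocessors that don't define an `adapt()`
method): a layer that log-transforms the skewed `campaign` count feature.
"""


class LogPlusOne(tf.keras.layers.Layer):
    # Hypothetical custom preprocessing layer: log1p-transform a count feature
    def call(self, inputs):
        return tf.math.log1p(tf.cast(inputs, "float32"))


feature_space = FeatureSpace(
    features={
        "campaign": FeatureSpace.feature(
            preprocessor=LogPlusOne(), dtype="int64", output_mode="float"
        )
    },
    output_mode="dict",
)
example_feature_space(train_ds_with_no_labels, feature_space, ["campaign"])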
"""
## Configuring the final `FeatureSpace`

Now that we know how to use `FeatureSpace` for more complex use cases, let's pick the
ones that look most useful for this task and create the final `FeatureSpace` component.

To configure how each feature should be preprocessed,
we instantiate a `keras.utils.FeatureSpace`, and we
pass to it a dictionary that maps the name of our features
to the feature transformation function.
"""
feature_space = FeatureSpace(
    features={
        # Categorical features encoded as integers
        "previously_contacted": FeatureSpace.integer_categorical(num_oov_indices=0),
        # Categorical features encoded as string
        "marital": FeatureSpace.string_categorical(num_oov_indices=0),
        "education": FeatureSpace.string_categorical(num_oov_indices=0),
        "default": FeatureSpace.string_categorical(num_oov_indices=0),
        "housing": FeatureSpace.string_categorical(num_oov_indices=0),
        "loan": FeatureSpace.string_categorical(num_oov_indices=0),
        "contact": FeatureSpace.string_categorical(num_oov_indices=0),
        "month": FeatureSpace.string_categorical(num_oov_indices=0),
        "day_of_week": FeatureSpace.string_categorical(num_oov_indices=0),
        "poutcome": FeatureSpace.string_categorical(num_oov_indices=0),
        # Categorical features to hash and bin
        "job": FeatureSpace.string_hashed(num_bins=3),
        # Numerical features to hash and bin
        "pdays": FeatureSpace.integer_hashed(num_bins=4),
        # Numerical features to discretize into bins
        "age": FeatureSpace.float_discretized(num_bins=4),
        # Numerical features to normalize
        "campaign": FeatureSpace.float_normalized(),
        "previous": FeatureSpace.float_normalized(),
        "emp.var.rate": FeatureSpace.float_normalized(),
        "cons.price.idx": FeatureSpace.float_normalized(),
        "cons.conf.idx": FeatureSpace.float_normalized(),
        "euribor3m": FeatureSpace.float_normalized(),
        "nr.employed": FeatureSpace.float_normalized(),
    },
    # Specify feature crosses with a custom crossing dim
    crosses=[
        FeatureSpace.cross(feature_names=("age", "job"), crossing_dim=8),
        FeatureSpace.cross(
            feature_names=("default", "housing", "loan"), crossing_dim=6
        ),
        FeatureSpace.cross(
            feature_names=("poutcome", "previously_contacted"), crossing_dim=2
        ),
    ],
    output_mode="concat",
)
"""
## Adapt the `FeatureSpace` to the training data

Before we start using the `FeatureSpace` to build a model, we have
to adapt it to the training data. During `adapt()`, the `FeatureSpace` will:

- Index the set of possible values for categorical features.
- Compute the mean and variance for numerical features to normalize.
- Compute the value boundaries for the different bins for numerical features to
discretize.
- Perform any other kind of preprocessing required by custom layers.

Note that `adapt()` should be called on a `tf.data.Dataset` which yields dicts
of feature values -- no labels.

But first, let's batch the datasets:
"""
train_ds = train_ds.batch(32)
valid_ds = valid_ds.batch(32)
train_ds_with_no_labels = train_ds.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)
"""
At this point, the `FeatureSpace` can be called on a dict of raw feature values, and
because we set `output_mode="concat"` it will return a single concatenated vector for
each sample, combining encoded features and feature crosses.
"""
for x, _ in train_ds.take(1):
    preprocessed_x = feature_space(x)
    print(f"preprocessed_x shape: {preprocessed_x.shape}")
    print(f"preprocessed_x sample: \n{preprocessed_x[0]}")
"""
## Saving the `FeatureSpace`

At this point we can choose to save our `FeatureSpace` component. This has many
advantages, like re-using it on different experiments that use the same model, saving
time if you need to re-run the preprocessing step, and mainly for model deployment,
where by loading it you can be sure that you will be applying the same preprocessing
steps no matter the device or environment. This is a great way to reduce
[training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew).
"""
feature_space.save("myfeaturespace.keras")
"""
## Preprocessing with `FeatureSpace` as part of the tf.data pipeline

We will opt to use our component asynchronously by making it part of the tf.data
pipeline, as noted in the
[previous guide](https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/).
This enables asynchronous parallel preprocessing of the data on CPU before it
hits the model. Usually, this is always the right thing to do during training.

Let's create a training and validation dataset of preprocessed batches:
"""
preprocessed_train_ds = train_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

preprocessed_valid_ds = valid_ds.map(
    lambda x, y: (feature_space(x), y), num_parallel_calls=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)
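"""
As a quick optional check, the preprocessed datasets now yield the concatenated
feature vector directly, so the model never sees the raw feature dict:
"""
for x, _ in preprocessed_train_ds.take(1):
    print(f"Preprocessed batch shape: {x.shape}")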
"""
## Model

We will take advantage of our `FeatureSpace` component to build the model. Since we
want the model to be compatible with our preprocessing function, let's use the
`FeatureSpace` feature map as the input of our model.
"""
encoded_features = feature_space.get_encoded_features()
print(encoded_features)
"""
This model is quite trivial and only for demonstration purposes, so don't pay too much
attention to the architecture.
"""
x = tf.keras.layers.Dense(64, activation="relu")(encoded_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs=encoded_features, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
"""
## Training
Let's train our model for 20 epochs. Note that feature preprocessing is happening as part
of the tf.data pipeline, not as part of the model.
"""
model.fit(
    preprocessed_train_ds, validation_data=preprocessed_valid_ds, epochs=20, verbose=2
)
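"""
Optionally, we can evaluate the trained model on the preprocessed validation set to get
its final loss and accuracy:
"""
loss, accuracy = model.evaluate(preprocessed_valid_ds, verbose=0)
print(f"Validation accuracy: {accuracy:.2%}")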
"""
## Inference on new data with the end-to-end model

Now, we can build our inference model (which includes the `FeatureSpace`) to make
predictions based on dicts of raw feature values, as follows:
"""
"""
### Loading the `FeatureSpace`

First let's load the `FeatureSpace` that we saved a few moments ago. This can be quite
handy if you train a model but want to do inference at a different time, possibly using
a different device or environment.
"""
loaded_feature_space = tf.keras.models.load_model("myfeaturespace.keras")
"""
### Building the inference end-to-end model
To build the inference model we need both the feature input map and the encoded Keras
tensors produced by the preprocessing.
"""
dict_inputs = loaded_feature_space.get_inputs()
encoded_features = loaded_feature_space.get_encoded_features()
print(encoded_features)
print(dict_inputs)
outputs = model(encoded_features)
inference_model = tf.keras.Model(inputs=dict_inputs, outputs=outputs)
sample = {
    "age": 30,
    "job": "blue-collar",
    "marital": "married",
    "education": "basic.9y",
    "default": "no",
    "housing": "yes",
    "loan": "no",
    "contact": "cellular",
    "month": "may",
    "day_of_week": "fri",
    "campaign": 2,
    "pdays": 999,
    "previous": 0,
    "poutcome": "nonexistent",
    "emp.var.rate": -1.8,
    "cons.price.idx": 92.893,
    "cons.conf.idx": -46.2,
    "euribor3m": 1.313,
    "nr.employed": 5099.1,
    "previously_contacted": 0,
}
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = inference_model.predict(input_dict)
print(
    f"This particular client has a {100 * predictions[0][0]:.2f}% probability "
    "of subscribing to a term deposit, as evaluated by our model."
)