## Feature Engineering using TFX Pipeline and TensorFlow Transform

Transform input data and train a model with a TFX pipeline.

Feature engineering can improve the predictive quality of your data, reduce its dimensionality, or both. One benefit of TFX is that you write the transformation code once, and the resulting transforms are applied consistently in training and serving, which avoids training/serving skew.

We will add a Transform component to the pipeline. The Transform component is implemented using the TensorFlow Transform (tf.Transform) library.
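To make the training/serving skew concern concrete, here is a minimal sketch in plain Python. It does not use tf.Transform; the helper names `fit_scaler` and `apply_scaler` and the sample values are assumptions made for illustration. The idea it shows is that statistics are computed once over the training data and then reused unchanged wherever the feature is produced.

```python
import statistics


def fit_scaler(values):
    """Compute scaling statistics once, over the training data."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


def apply_scaler(value, stats):
    """Scale a single value with previously computed statistics."""
    return (value - stats["mean"]) / stats["std"]


# Illustrative body masses in grams (hypothetical training sample).
train_body_mass = [3750.0, 3800.0, 3250.0, 3450.0]
stats = fit_scaler(train_body_mass)

# Training and serving call the same function with the same statistics,
# so an identical raw input always yields an identical feature value.
train_feature = apply_scaler(3750.0, stats)
serving_feature = apply_scaler(3750.0, stats)

# Skew: a serving path that recomputes statistics from its own traffic
# produces a different feature value for the same raw input.
skewed_stats = fit_scaler([3750.0, 4675.0])
skewed_feature = apply_scaler(3750.0, skewed_stats)

print(train_feature == serving_feature)  # consistent path
print(train_feature == skewed_feature)   # inconsistent path
```

Writing the transformation once and reusing its statistics is what removes the second, inconsistent path in this sketch.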

In [1]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

TensorFlow version: 2.8.1
TFX version: 1.7.1


In [2]:
import os

PIPELINE_NAME = "penguin-transform"

# Output directory to store artifacts generated from the pipeline.
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)
# Path to a SQLite DB file to use as an MLMD storage.
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')
# Output directory where created models from the pipeline will be exported.
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)

from absl import logging
logging.set_verbosity(logging.INFO)  # Set default logging level.

### Prepare example data

We will download the example dataset for use in our TFX pipeline. The dataset we are using is the Palmer Penguins dataset.

Unlike previous tutorials, which used an already preprocessed dataset, we will use the raw Palmer Penguins dataset.

Because the TFX ExampleGen component reads inputs from a directory, we need to create a directory and copy the dataset to it.

In [3]:
import urllib.request
import tempfile

DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data')  # Create a temporary directory.
_data_path = 'https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins_size.csv'
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
urllib.request.urlretrieve(_data_path, _data_filepath)

('/tmp/tfx-datalvyu4pva/data.csv', <http.client.HTTPMessage at 0x7fbfba1883a0>)

Take a quick look at what the raw data looks like.

In [4]:
!head {_data_filepath}

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
Adelie,Torgersen,39.3,20.6,190,3650,MALE
Adelie,Torgersen,38.9,17.8,181,3625,FEMALE
Adelie,Torgersen,39.2,19.6,195,4675,MALE
Adelie,Torgersen,34.1,18.1,193,3475,NA


Some entries have missing values, which are represented as NA. In this tutorial we simply delete every row that contains an NA value.

In [5]:
!sed -i '/\bNA\b/d' {_data_filepath}
!head {_data_filepath}

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
Adelie,Torgersen,39.3,20.6,190,3650,MALE
Adelie,Torgersen,38.9,17.8,181,3625,FEMALE
Adelie,Torgersen,39.2,19.6,195,4675,MALE
Adelie,Torgersen,41.1,17.6,182,3200,FEMALE
Adelie,Torgersen,38.6,21.2,191,3800,MALE
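The `sed -i` form used above follows GNU sed syntax and may behave differently on other platforms. As a portable alternative, here is a rough sketch using only the Python standard library. The function name `drop_rows_with_na` is hypothetical, and it drops rows where a whole field equals `NA`, which matches the `sed` pattern for the rows shown above but is not guaranteed to be identical for every possible input.

```python
import csv
import os
import tempfile


def drop_rows_with_na(csv_path, na_token="NA"):
    """Rewrite csv_path in place without rows that have a field equal to na_token.

    Returns the number of rows that were dropped.
    """
    with open(csv_path, newline="") as src:
        rows = list(csv.reader(src))
    header, records = rows[0], rows[1:]
    kept = [row for row in records if na_token not in row]

    # Write to a temporary file in the same directory, then swap it in,
    # so a failure midway does not leave a half-written data file.
    directory = os.path.dirname(os.path.abspath(csv_path))
    with tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False, suffix=".tmp"
    ) as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(kept)
    os.replace(dst.name, csv_path)
    return len(records) - len(kept)


# Small self-contained demonstration on rows taken from the raw data.
sample = (
    "species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.1,18.7,181,3750,MALE\n"
    "Adelie,Torgersen,NA,NA,NA,NA,NA\n"
    "Adelie,Torgersen,34.1,18.1,193,3475,NA\n"
)
demo_dir = tempfile.mkdtemp(prefix="na-demo")
demo_path = os.path.join(demo_dir, "data.csv")
with open(demo_path, "w", newline="") as f:
    f.write(sample)

dropped = drop_rows_with_na(demo_path)
with open(demo_path) as f:
    remaining_lines = f.read().splitlines()
print(dropped, len(remaining_lines))
```

In the notebook, the equivalent call would be `drop_rows_with_na(_data_filepath)` in place of the `sed` line.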


You should see seven columns that describe penguins. As in the previous tutorials, we will use four of them as features ('culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g') and predict the 'species' of a penguin.

The only difference is that the input data is not preprocessed. The remaining columns, 'island' and 'sex', are not used in this tutorial.
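The column choice described above can be sketched in plain Python as follows. This is only an illustration of which columns are kept and which are ignored, not how the TFX pipeline itself selects features; the names `FEATURE_KEYS`, `LABEL_KEY`, and `select_columns` are assumptions introduced for this example.

```python
import csv
import io

# Hypothetical constants naming the four numeric features and the label.
FEATURE_KEYS = [
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]
LABEL_KEY = "species"


def select_columns(csv_text, feature_keys, label_key):
    """Return a list of (features, label) pairs, ignoring all other columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(feature_keys + [label_key]) - set(reader.fieldnames or [])
    if missing:
        raise KeyError(f"columns not found in header: {sorted(missing)}")
    examples = []
    for row in reader:
        features = {key: float(row[key]) for key in feature_keys}
        examples.append((features, row[label_key]))
    return examples


# One row copied from the cleaned data; 'island' and 'sex' are dropped.
sample = (
    "species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.1,18.7,181,3750,MALE\n"
)
examples = select_columns(sample, FEATURE_KEYS, LABEL_KEY)
print(examples[0])
```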