# Simple TFX Pipeline Tutorial using Penguin dataset

***A short tutorial on running a simple TFX pipeline.***

In this notebook-based tutorial, we will create and run a [TFX pipeline](https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines)
for a simple classification model.
The pipeline will consist of three essential TFX components: `ExampleGen`,
`Trainer` and `Pusher`. Together they cover a minimal ML workflow: importing
data, training a model, and exporting the trained model.

### Import

In [1]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

2022-10-16 14:54:38.360439: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-16 14:54:38.360483: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


TensorFlow version: 2.9.2
TFX version: 1.10.0


### Set up variables

The following variables define the pipeline. You can customize them as
needed. By default, all output from the pipeline is written under the
current directory.

In [2]:
import os

PIPELINE_NAME = "penguin-simple"

# Output directory to store artifacts generated from the pipeline.
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)

# Path to a SQLite DB file to use as an MLMD storage.
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')

# Output directory where created models from the pipeline will be exported.
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)

from absl import logging
logging.set_verbosity(logging.INFO)  # Set default logging level.

### Dataset

We will use the [Penguin dataset](https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv). The dataset has four features: `culmen_length_mm`, `culmen_depth_mm`, `flipper_length_mm`, and `body_mass_g`. The cells below expect a local copy of the CSV at `data/data.csv`.

All features are already normalized to the range [0, 1]. We will build a
classification model that predicts the `species` of penguins.

In [3]:
DATA_ROOT = "data"
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
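
The cell above only points at a local path. If `data/data.csv` is not there yet, one way to fetch it is sketched below, assuming network access. `download_csv` is a hypothetical helper name, not part of TFX; the URL is the one linked in the dataset description.

```python
import os
import urllib.request

# URL taken from the dataset link in the text above.
DATA_URL = ("https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/"
            "examples/penguin/data/labelled/penguins_processed.csv")


def download_csv(url: str, data_root: str, filename: str = "data.csv") -> str:
    """Downloads `url` to `data_root/filename` and returns the local path."""
    os.makedirs(data_root, exist_ok=True)  # Create the data directory if needed.
    filepath = os.path.join(data_root, filename)
    urllib.request.urlretrieve(url, filepath)
    return filepath


# Usage (hypothetical, requires network access):
# _data_filepath = download_csv(DATA_URL, "data")
```

Keeping the file name fixed as `data.csv` matches the path used by the surrounding cells.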

In [4]:
!head {_data_filepath}

species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
0,0.29818181818181805,0.5833333333333334,0.3898305084745763,0.1527777777777778
0,0.16727272727272732,0.7380952380952381,0.3559322033898305,0.20833333333333334
0,0.26181818181818167,0.892857142857143,0.3050847457627119,0.2638888888888889
0,0.24727272727272717,0.5595238095238096,0.15254237288135594,0.2569444444444444
0,0.25818181818181823,0.773809523809524,0.3898305084745763,0.5486111111111112
0,0.32727272727272727,0.5357142857142859,0.1694915254237288,0.1388888888888889
0,0.23636363636363636,0.9642857142857142,0.3220338983050847,0.3055555555555556


You should see five columns. `species` is one of 0, 1 or 2, and all
other features should have values between 0 and 1.
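
These expectations can also be checked programmatically rather than by eye. The following is a minimal sketch using only the standard `csv` module; `check_penguin_csv` is a hypothetical helper, and the column names are taken from the CSV header shown above.

```python
import csv

# Column names as they appear in the CSV header.
FEATURE_COLUMNS = [
    "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"
]
LABEL_COLUMN = "species"


def check_penguin_csv(path: str) -> int:
    """Checks label and feature ranges; returns the number of data rows."""
    num_rows = 0
    with open(path, newline="") as f:
        # Data starts on line 2 because line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            if int(row[LABEL_COLUMN]) not in (0, 1, 2):
                raise ValueError(
                    f"line {line_no}: unexpected species {row[LABEL_COLUMN]!r}")
            for name in FEATURE_COLUMNS:
                value = float(row[name])
                if not 0.0 <= value <= 1.0:
                    raise ValueError(
                        f"line {line_no}: {name}={value} is outside [0, 1]")
            num_rows += 1
    return num_rows


# Usage (hypothetical):
# print(check_penguin_csv("data/data.csv"))
```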

## Create a pipeline

TFX pipelines are defined using Python APIs. We will define a pipeline that
consists of the following three components.
- `CsvExampleGen`: Reads in data files and converts them to the TFX internal format
for further processing. There are multiple
[ExampleGen](https://www.tensorflow.org/tfx/guide/examplegen)s for various
formats. In this tutorial, we will use `CsvExampleGen`, which takes CSV file input.
- `Trainer`: Trains an ML model. The
[Trainer component](https://www.tensorflow.org/tfx/guide/trainer) requires
model definition code from users. You can use TensorFlow APIs to specify how to
train a model and save it in the _SavedModel_ format.
- `Pusher`: Copies the trained model outside of the TFX pipeline. The
[Pusher component](https://www.tensorflow.org/tfx/guide/pusher) can be thought
of as the deployment step for the trained ML model.

Before actually defining the pipeline, we need to write the model code for the
`Trainer` component.

### Write model training code

We will create a simple DNN model for classification using the TensorFlow Keras
API. This model training code will be saved to a separate `train.py` file.

In this tutorial we will use the
[Generic Trainer](https://www.tensorflow.org/tfx/guide/trainer#generic_trainer)
of TFX, which supports Keras-based models. You need to write a Python file
containing a `run_fn` function, which is the entry point for the `Trainer`
component.

In [10]:
_trainer_module_file = "train.py"

In [11]:
%%writefile {_trainer_module_file}

from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2


_FEATURE_KEYS = [
    'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'
]
_LABEL_KEY = 'species'


# Define feature and label spec
_FEATURE_SPEC = {
    **{
        feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)
        for feature in _FEATURE_KEYS
    },
    _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)
}


# Define batch size
_TRAIN_BATCH_SIZE = 20
_EVAL_BATCH_SIZE = 10


# Define data inputs
def _input_fn(file_pattern: List[str],
              data_accessor: tfx.components.DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int = 200) -> tf.data.Dataset:
    """Generates features and label for training.

    Args:
      file_pattern: List of paths or patterns of input tfrecord files.
      data_accessor: DataAccessor for converting input to RecordBatch.
      schema: Schema of the input data.
      batch_size: Number of examples in a single batch.

    Returns:
      A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
    """
    # repeat() makes the dataset loop indefinitely; the amount of training is
    # bounded by the step counts passed to model.fit() in run_fn.
    return data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(
            batch_size=batch_size, label_key=_LABEL_KEY),
        schema=schema).repeat()


# Define model
def _build_keras_model() -> tf.keras.Model:
    inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]
    d = keras.layers.concatenate(inputs)
    for _ in range(2):
        d = keras.layers.Dense(8, activation='relu')(d)
    outputs = keras.layers.Dense(3)(d)

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
      optimizer=keras.optimizers.Adam(1e-2),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

    model.summary(print_fn=logging.info)
    return model


# Define trainer
def run_fn(fn_args: tfx.components.FnArgs):
    """Train the model based on given args.

    Args:
      fn_args: Holds args used to train the model as name/value pairs.
    """

    # This schema is usually either an output of SchemaGen or a manually-curated
    # version provided by the pipeline author. A schema can also be derived from
    # the TFT graph if a Transform component is used. When neither is available,
    # `schema_from_feature_spec` can generate a schema from a very simple
    # feature_spec, but the returned schema will be very primitive.
    schema = schema_utils.schema_from_feature_spec(_FEATURE_SPEC)

    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        schema,
        batch_size=_TRAIN_BATCH_SIZE)
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        schema,
        batch_size=_EVAL_BATCH_SIZE)

    model = _build_keras_model()
    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps)

    # Saved model directory.
    model.save(fn_args.serving_model_dir, save_format='tf')

Writing train.py
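
The last `Dense(3)` layer in `_build_keras_model` has no activation, so the model outputs raw logits, which is why the loss is created with `from_logits=True`. Below is a minimal pure-Python sketch of how such logits map to class probabilities and a predicted species index. The helper names and the logit values are made up for illustration and are not part of `train.py`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Converts raw logits into probabilities that sum to 1."""
    shift = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits: List[float]) -> int:
    """Returns the index of the largest logit (the predicted species id)."""
    return max(range(len(logits)), key=lambda i: logits[i])


example_logits = [2.0, 0.5, -1.0]  # Hypothetical model output for one penguin.
probs = softmax(example_logits)
print(predict_class(example_logits))  # 0
```

Because softmax preserves ordering, taking the argmax of the logits gives the same class as taking the argmax of the probabilities.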


Now you have completed all preparation steps to build a TFX pipeline.

### Write a pipeline definition

We define a function to create a TFX pipeline. A `Pipeline` object
represents a TFX pipeline which can be run using one of the pipeline
orchestration systems that TFX supports.


In [12]:
def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     module_file: str, serving_model_dir: str,
                     metadata_path: str) -> tfx.dsl.Pipeline:
    """Creates a three component penguin pipeline with TFX."""
    # Brings data into the pipeline.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Uses user-provided Python function that trains a model.
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs['examples'],
        train_args=tfx.proto.TrainArgs(num_steps=100),
        eval_args=tfx.proto.EvalArgs(num_steps=5))

    # Pushes the model to a filesystem destination.
    pusher = tfx.components.Pusher(
        model=trainer.outputs['model'],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_model_dir)))

    # The following three components will be included in the pipeline.
    components = [
        example_gen,
        trainer,
        pusher,
    ]

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        metadata_connection_config=(
            tfx.orchestration.metadata.sqlite_metadata_connection_config(
                metadata_path)),
        components=components)
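
`TrainArgs(num_steps=100)` and `EvalArgs(num_steps=5)` count batches, not epochs. Since `_input_fn` in `train.py` repeats the dataset, the step counts are what bound training. A small sketch of the resulting example counts, using the batch sizes from `train.py`; `examples_seen` is a hypothetical helper used only for this arithmetic.

```python
TRAIN_STEPS = 100      # tfx.proto.TrainArgs(num_steps=100)
EVAL_STEPS = 5         # tfx.proto.EvalArgs(num_steps=5)
TRAIN_BATCH_SIZE = 20  # _TRAIN_BATCH_SIZE in train.py
EVAL_BATCH_SIZE = 10   # _EVAL_BATCH_SIZE in train.py


def examples_seen(num_steps: int, batch_size: int) -> int:
    """Number of examples consumed when running `num_steps` batches."""
    return num_steps * batch_size


print(examples_seen(TRAIN_STEPS, TRAIN_BATCH_SIZE))  # 2000
print(examples_seen(EVAL_STEPS, EVAL_BATCH_SIZE))    # 50
```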

## Run the pipeline

TFX supports multiple orchestrators to run pipelines.
In this tutorial we will use `LocalDagRunner`, which is included in the TFX
Python package and runs pipelines in the local environment.
We often call TFX pipelines "DAGs", which stands for directed acyclic graphs.

`LocalDagRunner` enables fast iteration during development and debugging.
TFX also supports other orchestrators, including Kubeflow Pipelines and Apache
Airflow, which are suitable for production use cases. See
[TFX on Cloud AI Platform Pipelines](https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines)
or
[TFX Airflow Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/airflow_workshop)
to learn more about other orchestration systems.
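
To make the "DAG" idea concrete, here is a conceptual sketch of ordering this pipeline's three components so that each one runs after the components whose outputs it consumes. It illustrates topological ordering in general and is not how `LocalDagRunner` is implemented; `UPSTREAM` and `topological_order` are hypothetical names.

```python
from collections import deque
from typing import Dict, List

# Each component maps to the components whose outputs it consumes.
UPSTREAM: Dict[str, List[str]] = {
    "CsvExampleGen": [],
    "Trainer": ["CsvExampleGen"],  # examples=example_gen.outputs['examples']
    "Pusher": ["Trainer"],         # model=trainer.outputs['model']
}


def topological_order(upstream: Dict[str, List[str]]) -> List[str]:
    """Returns an order in which every node comes after its upstream nodes."""
    remaining = {node: len(deps) for node, deps in upstream.items()}
    ready = deque(node for node, count in remaining.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # Release every node that was waiting on the one just scheduled.
        for other, deps in upstream.items():
            if node in deps:
                remaining[other] -= 1
                if remaining[other] == 0:
                    ready.append(other)
    if len(order) != len(upstream):
        raise ValueError("cycle detected: the graph is not a DAG")
    return order


print(topological_order(UPSTREAM))  # ['CsvExampleGen', 'Trainer', 'Pusher']
```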

In [13]:
tfx.orchestration.LocalDagRunner().run(
  _create_pipeline(
      pipeline_name=PIPELINE_NAME,
      pipeline_root=PIPELINE_ROOT,
      data_root=DATA_ROOT,
      module_file=_trainer_module_file,
      serving_model_dir=SERVING_MODEL_DIR,
      metadata_path=METADATA_PATH))

INFO:absl:Generating ephemeral wheel package for '/media/hoang/665083FD5083D26B/Windows.old/Users/hoang/Desktop/workspace/mlops-labs/TFX-components/simple-pipeline/train.py' (including modules: ['train']).
INFO:absl:User module package has hash fingerprint version b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691.
INFO:absl:Executing: ['/usr/bin/python3.7', '/tmp/tmpp3__f3zi/_tfx_generated_setup.py', 'bdist_wheel', '--bdist-dir', '/tmp/tmpr95awj_f', '--dist-dir', '/tmp/tmptbfn241x']


running bdist_wheel
running build
running build_py
creating build
creating build/lib
copying train.py -> build/lib
installing to /tmp/tmpr95awj_f
running install
running install_lib
copying build/lib/train.py -> /tmp/tmpr95awj_f
running install_egg_info


INFO:absl:Successfully built user code wheel distribution at 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl'; target user module is 'train'.
INFO:absl:Full user module path is 'train@pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl'
INFO:absl:Using deployment config:
 executor_specs {
  key: "CsvExampleGen"
  value {
    beam_executable_spec {
      python_executor_spec {
        class_path: "tfx.components.example_gen.csv_example_gen.executor.Executor"
      }
    }
  }
}
executor_specs {
  key: "Pusher"
  value {
    python_class_executable_spec {
      class_path: "tfx.components.pusher.executor.Executor"
    }
  }
}
executor_specs {
  key: "Trainer"
  value {
    python_class_executable_spec {
      class_path: "tfx.components.trainer.executor.GenericExecutor"
    }
  }
}
custom_driver_specs {
  ke

running egg_info
creating tfx_user_code_Trainer.egg-info
writing tfx_user_code_Trainer.egg-info/PKG-INFO
writing dependency_links to tfx_user_code_Trainer.egg-info/dependency_links.txt
writing top-level names to tfx_user_code_Trainer.egg-info/top_level.txt
writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'
reading manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'
writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'
Copying tfx_user_code_Trainer.egg-info to /tmp/tmpr95awj_f/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3.7.egg-info
running install_scripts
creating /tmp/tmpr95awj_f/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691.dist-info/WHEEL
creating '/tmp/tmptbfn241x/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl' and adding '/tmp/tmpr95awj_f' to it
adding 'train.py'
adding 'tfx_user_code_Trainer-

INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Going to run a new execution 1
INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=1, input_dict={}, output_dict=defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-simple/CsvExampleGen/examples/1"
custom_properties {
  key: "input_fingerprint"
  value {
    string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1665904369,sum_checksum:1665904369"
  }
}
custom_properties {
  key: "name"
  value {
    string_value: "penguin-simple:2022-10-16T14:57:05.044610:CsvExampleGen:1:examples:0"
  }
}
custom_properties {
  key: "span"
  value {
    int_value: 0
  }
}
name: "penguin-simple:2022-10-16T14:57:05.044610:CsvExampleGen:1:examples:0"
, artifact_type: name: "Examples"
properties {
  key: "span"
  value: INT
}
pro

INFO:absl:Processing input csv data data/* to TFExample.
INFO:absl:Examples generated.
INFO:absl:Value type <class 'NoneType'> of key version in exec_properties is not supported, going to drop it
INFO:absl:Value type <class 'list'> of key _beam_pipeline_args in exec_properties is not supported, going to drop it
INFO:absl:Cleaning up stateless execution info.
INFO:absl:Execution 1 succeeded.
INFO:absl:Cleaning up stateful execution info.
INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-simple/CsvExampleGen/examples/1"
custom_properties {
  key: "input_fingerprint"
  value {
    string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1665904369,sum_checksum:1665904369"
  }
}
custom_properties {
  key: "name"
  value {
    string_value: "penguin-simple:2022-10-16T14:57:05.044610:CsvExampleGen:1:examples:0"
  }
}
custom_properties {
  key: "span"
  value {
    int_value: 0
  }
}
custom_properties

INFO:absl:Train on the 'train' split when train_args.splits is not set.
INFO:absl:Evaluate on the 'eval' split when eval_args.splits is not set.
INFO:absl:udf_utils.get_fn {'custom_config': 'null', 'train_args': '{\n  "num_steps": 100\n}', 'eval_args': '{\n  "num_steps": 5\n}', 'module_path': 'train@pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl'} 'run_fn'
INFO:absl:Installing 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl' to a temporary directory.
INFO:absl:Executing: ['/usr/bin/python3.7', '-m', 'pip', 'install', '--target', '/tmp/tmpts_1md8b', 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl']


Processing ./pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl


INFO:absl:Successfully installed 'pipelines/penguin-simple/_wheels/tfx_user_code_Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691-py3-none-any.whl'.
INFO:absl:Training model.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.


Installing collected packages: tfx-user-code-Trainer
Successfully installed tfx-user-code-Trainer-0.0+b229a6e2bc03c2463f2b827910ea514c8ff9a4546130850c7ee2967d293ab691


2022-10-16 14:57:11.402496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-10-16 14:57:11.402536: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-10-16 14:57:11.402558: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (jump-windows): /proc/driver/nvidia/version does not exist
2022-10-16 14:57:11.402897: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen

INFO:tensorflow:Assets written to: pipelines/penguin-simple/Trainer/model/2/Format-Serving/assets
INFO:absl:Training complete. Model written to pipelines/penguin-simple/Trainer/model/2/Format-Serving. ModelRun written to pipelines/penguin-simple/Trainer/model_run/2
INFO:absl:Cleaning up stateless execution info.
INFO:absl:Execution 2 succeeded.
INFO:absl:Cleaning up stateful execution info.
INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'model': [Artifact(artifact: uri: "pipelines/penguin-simple/Trainer/model/2"
custom_properties {
  key: "name"
  value {
    string_value: "penguin-simple:2022-10-16T14:57:05.044610:Trainer:2:model:0"
  }
}
custom_properties {
  key: "tfx_version"
  value {
    string_value: "1.10.0"
  }
}
name: "penguin-simple:2022-10-16T14:57:05.044610:Trainer:2:model:0"
, artifact_type: name: "Model"
base_type: MODEL
)], 'model_run': [Artifact(artifact: uri: "pipelines/penguin-simple/Trainer/model_run/2"
custom_properties {
  key: "name"
  value {

INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component Pusher is finished.
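
The run also records its executions and artifacts in the SQLite file at `METADATA_PATH`. A minimal sketch for peeking at that file with the standard `sqlite3` module follows; it only lists table names and makes no assumptions about the MLMD schema. `list_tables` is a hypothetical helper.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Returns the names of all tables in a SQLite database file."""
    # Note: sqlite3.connect creates an empty file if `db_path` does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage after the pipeline run (hypothetical):
# print(list_tables(METADATA_PATH))
```

For anything beyond a quick look, the ML Metadata Python API is the better way to query this store.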


The `Pusher` component copies the trained model to `SERVING_MODEL_DIR`, which
is the `serving_model/penguin-simple` directory.

In [15]:
!find {SERVING_MODEL_DIR}

serving_model/penguin-simple
serving_model/penguin-simple/1665907034
serving_model/penguin-simple/1665907034/assets
serving_model/penguin-simple/1665907034/keras_metadata.pb
serving_model/penguin-simple/1665907034/saved_model.pb
serving_model/penguin-simple/1665907034/variables
serving_model/penguin-simple/1665907034/variables/variables.data-00000-of-00001
serving_model/penguin-simple/1665907034/variables/variables.index
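
The export lands in a subdirectory whose name is an integer (`1665907034` in the listing above). Assuming that naming holds for every push, the newest export can be found by taking the largest integer name. `latest_model_dir` is a hypothetical helper, not a TFX API.

```python
import os
from typing import Optional


def latest_model_dir(serving_model_dir: str) -> Optional[str]:
    """Returns the version subdirectory with the largest integer name, if any."""
    versions = [
        name for name in os.listdir(serving_model_dir)
        if name.isdigit()
        and os.path.isdir(os.path.join(serving_model_dir, name))
    ]
    if not versions:
        return None
    # Compare as integers so that "100" ranks above "99".
    return os.path.join(serving_model_dir, max(versions, key=int))


# Usage (hypothetical; loading requires TensorFlow):
# model_path = latest_model_dir(SERVING_MODEL_DIR)
# loaded = tf.saved_model.load(model_path)
```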
