## Simple TFX Pipeline Tutorial using the Penguin dataset

### A short tutorial to run a simple TFX pipeline

This notebook follows the TFX tutorial at https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple

Modifications were made so that the notebook can be run locally.

In [1]:
# Check the TensorFlow and TFX versions
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

TensorFlow version: 2.8.1
TFX version: 1.7.1


### Set up variables

Set up the variables used to define the pipeline.

In [2]:
import os

PIPELINE_NAME='penguin-simple'

# Output directory to store artifacts generated from the pipeline
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)
# Path to a SQLite DB file to use as MLMD (ML Metadata) storage.
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME,'metadata.db')
# Output directory where created models from the pipeline will be exported
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)

from absl import logging
logging.set_verbosity(logging.INFO) # set default logging level.

### Prepare example data

We will download the example Palmer Penguins dataset.

There are 4 numeric features in this dataset:

* culmen_length_mm
* culmen_depth_mm
* flipper_length_mm
* body_mass_g
    
All features have already been normalized to the range [0, 1]. We will build a classification model that predicts the species of a penguin.
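
The notebook does not say how the features were normalized. As a minimal sketch, assuming plain min-max scaling (one common way to map values into [0, 1]), the idea looks like this. The function name `min_max_normalize` and the raw flipper lengths are hypothetical and not taken from the dataset.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw flipper lengths in millimetres (illustrative only).
raw_flipper_mm = [172.0, 190.0, 231.0]
scaled = min_max_normalize(raw_flipper_mm)
print(scaled)
```

After scaling, the smallest input becomes 0 and the largest becomes 1, which matches the value range described for the processed CSV.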

Because TFX ExampleGen reads inputs from a directory, we need to create a directory and copy the dataset into it.

In [3]:
import urllib.request
import tempfile

DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data')  # Create a temporary directory.
_data_url = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv'
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
urllib.request.urlretrieve(_data_url, _data_filepath)

('/tmp/tfx-data0s2g_5f4/data.csv', <http.client.HTTPMessage at 0x7f37ebebd700>)

Take a quick look at the CSV file.

In [5]:
!head {_data_filepath}

species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
0,0.29818181818181805,0.5833333333333334,0.3898305084745763,0.1527777777777778
0,0.16727272727272732,0.7380952380952381,0.3559322033898305,0.20833333333333334
0,0.26181818181818167,0.892857142857143,0.3050847457627119,0.2638888888888889
0,0.24727272727272717,0.5595238095238096,0.15254237288135594,0.2569444444444444
0,0.25818181818181823,0.773809523809524,0.3898305084745763,0.5486111111111112
0,0.32727272727272727,0.5357142857142859,0.1694915254237288,0.1388888888888889
0,0.23636363636363636,0.9642857142857142,0.3220338983050847,0.3055555555555556


You should see five columns: **species** is one of 0, 1, or 2, and all other features have values between 0 and 1.
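
The same expectations can also be checked programmatically rather than by eye. The following is a small sketch using only the standard library. The helper `check_penguin_rows` is a hypothetical name introduced here, and the sample text reuses the header and first two rows shown above.

```python
import csv
import io

FEATURES = ['culmen_length_mm', 'culmen_depth_mm',
            'flipper_length_mm', 'body_mass_g']


def check_penguin_rows(csv_text):
    """Return a list of problems found; an empty list means every row passed."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        species = row['species']
        if species not in ('0', '1', '2'):
            problems.append(f'line {line_no}: unexpected species {species!r}')
        for name in FEATURES:
            if not 0.0 <= float(row[name]) <= 1.0:
                problems.append(f'line {line_no}: {name} outside [0, 1]')
    return problems


sample = (
    'species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n'
    '0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\n'
    '0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\n'
)
problems = check_penguin_rows(sample)
print(problems)
```

To check the downloaded file instead of the sample, pass the file contents, for example `check_penguin_rows(open(_data_filepath).read())`.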