```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. 

```

# TensorFlow Extended Interactive Pipeline - Working version
- Python 3.9.7
- TensorFlow version: 2.5.1
- TFX version: 1.2.0

Releases: https://github.com/tensorflow/tfx/blob/master/RELEASE.md

Extended and adapted from: https://github.com/Dawit-1621/Machine-Learning-Model-Development-using-TensorFlow-Extended-TFX-/blob/main/Machine_Learning_Model_Development_using_TensorFlow_Extended_TFX.ipynb

## Table of contents

- [Jupyter Lab and Python Environment Setup](#toc00)
- [TFX Pipeline Setup and Raw Data Download](#toc01)
- [Data Ingestion](#toc02)
- [Data Validation (with Statistics and Schema)](#toc03)
- [Data Preprocessing](#toc04)

---
<a id='toc00'></a>

## Jupyter Lab and Python Environment Setup
In the terminal/shell
```
#pip install virtualenv
#pip install virtualenvwrapper
#brew install pyenv-virtualenv

pyenv install 3.9.7
pyenv virtualenv 3.9.7 tfx3.9.7
pyenv versions

pyenv shell tfx3.9.7
python -V

pip install tfx
#pip install apache-beam[interactive]
#sudo apt-get install libbz2-dev

jupyter lab --no-browser


#Use Python 3 kernel in Jupyter
```

---
<a id='toc01'></a>

## TFX Pipeline Setup

In [1]:
!python -V

Python 3.9.7


In [2]:
!which python

/home/ksatola/.pyenv/shims/python


In [3]:
!jupyter kernelspec list

Available kernels:
  python3     /home/ksatola/.local/share/jupyter/kernels/python3
  tfx3.9.7    /home/ksatola/.local/share/jupyter/kernels/tfx3.9.7


In [4]:
!jupyter --paths

config:
    /home/ksatola/.jupyter
    /usr/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
data:
    /home/ksatola/.local/share/jupyter
    /usr/local/share/jupyter
    /usr/share/jupyter
runtime:
    /home/ksatola/.local/share/jupyter/runtime


In [5]:
!tfx

2021-09-18 14:11:01.778586: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-18 14:11:01.778631: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Usage: tfx [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  pipeline
  run
  template  [Experimental] Helps creating a new TFX pipeline scaffold.


### 1 - Setup and Imports

In [6]:
import os
import pprint
import tempfile
import urllib

from tfx.components import CsvExampleGen
from tfx.components import StatisticsGen
from tfx.components import SchemaGen
from tfx.components import ExampleValidator
from tfx.components import Transform
import absl
import tensorflow as tf
import tensorflow_model_analysis as tfma
tf.get_logger().propagate = False
pp = pprint.PrettyPrinter()

from tfx import v1 as tfx
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip

In [7]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))

from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))

TensorFlow version: 2.5.1
TFX version: 1.2.0


**tfx.components.CsvExampleGen** The csv examplegen component takes csv data, and generates train and eval examples for downstream components

In [8]:
!pip install utils

Defaulting to user installation because normal site-packages is not writeable


In [10]:
#Set up pipeline paths
#This is the root directory for your TFX pip package installation
_tfx_root = tfx.__path__[0]
print(_tfx_root)

#This is the directory containing the TFX Chicago Taxi Pipeline example
_taxi_root = os.path.join(_tfx_root, 'examples/chicago_taxi_pipeline')
print(_taxi_root)

#This is the path where your model will be pushed for serving
_serving_model_dir = os.path.join(tempfile.mkdtemp(), 'serving_model/taxi_simple')
print(_serving_model_dir)

#Set up logging
absl.logging.set_verbosity(absl.logging.INFO)

/usr/local/lib/python3.8/dist-packages/tfx/v1
/usr/local/lib/python3.8/dist-packages/tfx/v1/examples/chicago_taxi_pipeline
/tmp/tmpgbacxoj_/serving_model/taxi_simple


### 2 - Download example dataset

In [11]:
#Download example dataset
_data_root = tempfile.mkdtemp(prefix='tfx-data')
print(_data_root)

DATA_PATH = "https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv"
_data_filepath = os.path.join(_data_root, "data.csv")
print(_data_filepath)

urllib.request.urlretrieve(DATA_PATH, _data_filepath)

/tmp/tfx-data1hdfpoe3
/tmp/tfx-data1hdfpoe3/data.csv


('/tmp/tfx-data1hdfpoe3/data.csv', <http.client.HTTPMessage at 0x7f26accbd730>)

In [12]:
!head {_data_filepath}

pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,tips
,12.45,5,19,6,1400269500,,,,,0.0,,,Credit Card,Chicago Elite Cab Corp. (Chicago Carriag,0,,0.0
,0,3,19,5,1362683700,,,,,0,,,Unknown,Chicago Elite Cab Corp.,300,,0
60,27.05,10,2,3,1380593700,41.836150155,-87.648787952,,,12.6,,,Cash,Taxi Affiliation Services,1380,,0.0
10,5.85,10,1,2,1382319000,41.985015101,-87.804532006,,,0.0,,,Cash,Taxi Affiliation Services,180,,0.0
14,16.65,5,7,5,1369897200,41.968069,-87.721559063,,,0.0,,,Cash,Dispatch Taxi Affiliation,1080,,0.0
13,16.45,11,12,3,1446554700,41.983636307,-87.723583185,,,6.9,,,Cash,,780,,0.0
16,32.05,12,1,1,1417916700,41.953582125,-87.72345239,,,15.4,,,Cash,,1200,,0.0
30,38.45,10,10,5,1444301100,41.839086906,-87.714003807,,,14.6,,,Cash,,2580,,0.0
11,14.65,1,1,3,1358

In [14]:
#InteractiveContext will allow to run TFX components interactively in a notebook to visualize its output
#It also setups metadata.sqlite engine
context = InteractiveContext(pipeline_root='../tfx_poc')



In [15]:
import pandas as pd
df = pd.read_csv(_data_filepath)
df.head()

Unnamed: 0,pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,tips
0,,12.45,5,19,6,1400269500,,,,,0.0,,,Credit Card,Chicago Elite Cab Corp. (Chicago Carriag,0.0,,0.0
1,,0.0,3,19,5,1362683700,,,,,0.0,,,Unknown,Chicago Elite Cab Corp.,300.0,,0.0
2,60.0,27.05,10,2,3,1380593700,41.83615,-87.648788,,,12.6,,,Cash,Taxi Affiliation Services,1380.0,,0.0
3,10.0,5.85,10,1,2,1382319000,41.985015,-87.804532,,,0.0,,,Cash,Taxi Affiliation Services,180.0,,0.0
4,14.0,16.65,5,7,5,1369897200,41.968069,-87.721559,,,0.0,,,Cash,Dispatch Taxi Affiliation,1080.0,,0.0


In [16]:
df.shape

(15002, 18)

In [17]:
df.describe()

Unnamed: 0,pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,trip_seconds,dropoff_community_area,tips
count,15000.0,15002.0,15002.0,15002.0,15002.0,15002.0,15000.0,15000.0,14519.0,14519.0,15002.0,1.0,10761.0,14996.0,14495.0,15002.0
mean,22.250267,11.768216,6.585655,13.632316,4.186642,1408495000.0,41.903046,-87.657551,41.902672,-87.654113,2.87282,17031080000.0,17031350000.0,777.627501,20.967782,1.076674
std,19.414828,11.53885,3.390997,6.620927,2.015694,29160430.0,0.037751,0.067846,0.038478,0.056616,15.276007,,331224.3,977.538769,17.641056,2.15834
min,1.0,0.0,1.0,0.0,1.0,1357000000.0,41.694879,-87.913625,41.663671,-87.913625,0.0,17031080000.0,17031010000.0,0.0,1.0,0.0
25%,8.0,5.85,4.0,9.0,2.0,1384622000.0,41.880994,-87.655998,41.880994,-87.656804,0.0,17031080000.0,17031080000.0,360.0,8.0,0.0
50%,8.0,7.85,7.0,15.0,4.0,1407260000.0,41.892508,-87.633308,41.893216,-87.634156,1.0,17031080000.0,17031240000.0,540.0,12.0,0.0
75%,32.0,12.45,10.0,19.0,6.0,1431339000.0,41.921877,-87.626211,41.922686,-87.626215,2.5,17031080000.0,17031830000.0,960.0,32.0,2.0
max,77.0,700.07,12.0,23.0,7.0,1483116000.0,42.009623,-87.572782,42.021224,-87.540936,1710.0,17031080000.0,17031980000.0,72120.0,77.0,47.0


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15002 entries, 0 to 15001
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   pickup_community_area   15000 non-null  float64
 1   fare                    15002 non-null  float64
 2   trip_start_month        15002 non-null  int64  
 3   trip_start_hour         15002 non-null  int64  
 4   trip_start_day          15002 non-null  int64  
 5   trip_start_timestamp    15002 non-null  int64  
 6   pickup_latitude         15000 non-null  float64
 7   pickup_longitude        15000 non-null  float64
 8   dropoff_latitude        14519 non-null  float64
 9   dropoff_longitude       14519 non-null  float64
 10  trip_miles              15002 non-null  float64
 11  pickup_census_tract     1 non-null      float64
 12  dropoff_census_tract    10761 non-null  float64
 13  payment_type            15002 non-null  object 
 14  company                 9862 non-null 

---
<a id='toc02'></a>

## Data Ingestion

### 3 - Data Ingestion with ExampleGen
The **ExampleGen** TFX pipeline component ingests data into TFX pipelines. It consumes external files/services to generate Examples/Observations which will be read by other TFX components. It also provides consistent and configurable partition, and shuffles the dataset for ML best practice. It uses [TFRecord data format](https://www.tensorflow.org/tutorials/load_data/tfrecord) based on Protocol Buffers.

The ExampleGen component is usually at the start of a TFX pipeline. It will:
- Split the data into training and evaluation set (by default, 2/3 training + 1/3 for eval)
- Convert data into the tf.Example format
- Copy data into the _tfx_root directory for other components to access ExampleGen takes as input the path to your data source. In our case, this is the _data_root path tha contains the downloaded CSV.


- **INPUT**: Raw data.
- **OUTPUT**: Data in a form which can be consumed by a pipeline.

#### Test: Create TFRecord dataset

In [19]:
with tf.io.TFRecordWriter("test.tfrecord") as w:
    w.write(b"First record") #bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
    w.write(b"Second record")
    
for record in tf.data.TFRecordDataset("test.tfrecord"):
    print(record)

tf.Tensor(b'First record', shape=(), dtype=string)
tf.Tensor(b'Second record', shape=(), dtype=string)


#### Test: Convert custom data to TFRecord data structure and import it to the TFX pipeline using ImportExampleGen

In [20]:
import csv
from tqdm import tqdm

#Create a separate tmp folder for TFRecord file
_data_root2 = tempfile.mkdtemp(prefix='tfx-data-coverted')
print(_data_root2)

#Convert dataset to files containing the TFRecord data structure
original_data_file = _data_filepath #input file
tfrecords_filename = "chicago_taxi.tfrecords"
_data_filepath2 = os.path.join(_data_root2, tfrecords_filename) #output file
tf_record_writer = tf.io.TFRecordWriter(_data_filepath2)

#Helper functions
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Convert CSV (a part of it) to TFRecord
# Will not work with NaN values in columns!
with open(original_data_file) as csv_file:
    reader = csv.DictReader(csv_file, delimiter=",", quotechar='"')
    for row in tqdm(reader):
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    "fare": _float_feature(float(row["fare"])),
                    "trip_start_month": _int64_feature(int(row["trip_start_month"])),
                    "company": _bytes_feature(row["company"]),
                }
            )
        )
        tf_record_writer.write(example.SerializeToString())
    tf_record_writer.close()

/tmp/tfx-data-covertedjbm9xpqa


15002it [00:00, 28237.37it/s]


In [21]:
print(_data_root2)

from tfx.components import ImportExampleGen

example_gen = ImportExampleGen(input_base=_data_root2)
context.run(example_gen)

/tmp/tfx-data-covertedjbm9xpqa


INFO:absl:Running driver for ImportExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for ImportExampleGen
INFO:absl:Generating examples.


INFO:absl:Reading input TFRecord data /tmp/tfx-data-covertedjbm9xpqa/*.
INFO:absl:Examples generated.
INFO:absl:Running publisher for ImportExampleGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,1
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } ImportExampleGen at 0x7f26accbdc70.inputs{}.outputs['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28149b1a00.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0.exec_properties['input_base']/tmp/tfx-data-covertedjbm9xpqa['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1451758,xor_checksum:1631967181,sum_checksum:1631967181"
.component.inputs,{}
.component.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28149b1a00.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
.inputs,{}
.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28149b1a00.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"
.exec_properties,"['input_base']/tmp/tfx-data-covertedjbm9xpqa['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1451758,xor_checksum:1631967181,sum_checksum:1631967181"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28149b1a00.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/ImportExampleGen/examples/1
.span,0
.split_names,"[""train"", ""eval""]"
.version,0

0,1
['input_base'],/tmp/tfx-data-covertedjbm9xpqa
['input_config'],"{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }"
['output_config'],"{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }"
['output_data_format'],6
['output_file_format'],5
['custom_config'],
['range_config'],
['span'],0
['version'],
['input_fingerprint'],"split:single_split,num_files:1,total_bytes:1451758,xor_checksum:1631967181,sum_checksum:1631967181"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28149b1a00.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/ImportExampleGen/examples/1) at 0x7f28149b1880.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/ImportExampleGen/examples/1.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/ImportExampleGen/examples/1
.span,0
.split_names,"[""train"", ""eval""]"
.version,0


#### Test: Define custom splits (training, evaluation, test sets with a ratio of 6:2:2)

In [22]:
print(_data_root)
print(_data_filepath)

/tmp/tfx-data1hdfpoe3
/tmp/tfx-data1hdfpoe3/data.csv


In [23]:
from tfx.proto import example_gen_pb2

output = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=6),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=2)
    ])
)

example_gen = CsvExampleGen(input_base=_data_root, output_config=output)
context.run(example_gen)

INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Generating examples.
INFO:absl:Processing input csv data /tmp/tfx-data1hdfpoe3/* to TFExample.
INFO:absl:Examples generated.
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,2
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } CsvExampleGen at 0x7f2814306ac0.inputs{}.outputs['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28143063a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0.exec_properties['input_base']/tmp/tfx-data1hdfpoe3['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 6,  ""name"": ""train""  },  {  ""hash_buckets"": 2,  ""name"": ""eval""  },  {  ""hash_buckets"": 2,  ""name"": ""test""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"
.component.inputs,{}
.component.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28143063a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.inputs,{}
.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28143063a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"
.exec_properties,"['input_base']/tmp/tfx-data1hdfpoe3['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 6,  ""name"": ""train""  },  {  ""hash_buckets"": 2,  ""name"": ""eval""  },  {  ""hash_buckets"": 2,  ""name"": ""test""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28143063a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/2
.span,0
.split_names,"[""train"", ""eval"", ""test""]"
.version,0

0,1
['input_base'],/tmp/tfx-data1hdfpoe3
['input_config'],"{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }"
['output_config'],"{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 6,  ""name"": ""train""  },  {  ""hash_buckets"": 2,  ""name"": ""eval""  },  {  ""hash_buckets"": 2,  ""name"": ""test""  }  ]  } }"
['output_data_format'],6
['output_file_format'],5
['custom_config'],
['range_config'],
['span'],0
['version'],
['input_fingerprint'],"split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28143063a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/2) at 0x7f281436d820.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/2.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/2
.span,0
.split_names,"[""train"", ""eval"", ""test""]"
.version,0


In [24]:
#Inspect generated artifacts
for artifact in example_gen.outputs['examples'].get():
    print(artifact)

Artifact(artifact: id: 2
type_id: 14
uri: "../tfx_poc/CsvExampleGen/examples/2"
properties {
  key: "split_names"
  value {
    string_value: "[\"train\", \"eval\", \"test\"]"
  }
}
custom_properties {
  key: "file_format"
  value {
    string_value: "tfrecords_gzip"
  }
}
custom_properties {
  key: "input_fingerprint"
  value {
    string_value: "split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"
  }
}
custom_properties {
  key: "payload_format"
  value {
    string_value: "FORMAT_TF_EXAMPLE"
  }
}
custom_properties {
  key: "span"
  value {
    int_value: 0
  }
}
custom_properties {
  key: "state"
  value {
    string_value: "published"
  }
}
custom_properties {
  key: "tfx_version"
  value {
    string_value: "1.2.0"
  }
}
state: LIVE
, artifact_type: id: 14
name: "Examples"
properties {
  key: "span"
  value: INT
}
properties {
  key: "split_names"
  value: STRING
}
properties {
  key: "version"
  value: INT
}
)


#### Test: Preserve externally made split

In [26]:
round(df.shape[0]*.6)

9001

In [27]:
round(df.shape[0]*.8)

12002

In [28]:
# Split df into 3 subsets (train, eval, test)

split1 = round(df.shape[0]*.6)
split2 = round(df.shape[0]*.8)

df_train = df.iloc[0:split1].copy()
df_eval =  df.iloc[0:split2].copy()
df_test =  df.iloc[split2:].copy()

In [29]:
print(_data_root)
print(_data_filepath)

#Create a separate tmp folder for subfolders with dataset splits
_data_root3 = tempfile.mkdtemp(prefix='tfx-data-external-splits') #output
print(_data_root3)

os.mkdir(os.path.join(_data_root3, 'train'))
os.mkdir(os.path.join(_data_root3, 'eval'))
os.mkdir(os.path.join(_data_root3, 'test'))

df_train.to_csv(path_or_buf=os.path.join(_data_root3, 'train/data-train.csv'))
df_eval.to_csv(path_or_buf=os.path.join(_data_root3, 'eval/data-eval.csv'))
df_test.to_csv(path_or_buf=os.path.join(_data_root3, 'test/data-test.csv'))

/tmp/tfx-data1hdfpoe3
/tmp/tfx-data1hdfpoe3/data.csv
/tmp/tfx-data-external-splitsem7le30w


In [30]:
from tfx.proto import example_gen_pb2

input = example_gen_pb2.Input(splits=[
        example_gen_pb2.Input.Split(name='train', pattern='train/*'),
        example_gen_pb2.Input.Split(name='eval', pattern='eval/*'),
        example_gen_pb2.Input.Split(name='test', pattern='test/*')
    ]
)

example_gen = CsvExampleGen(input_base=_data_root3, input_config=input)
context.run(example_gen)

INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Generating examples.
INFO:absl:Processing input csv data /tmp/tfx-data-external-splitsem7le30w/train/* to TFExample.
INFO:absl:Processing input csv data /tmp/tfx-data-external-splitsem7le30w/eval/* to TFExample.
INFO:absl:Processing input csv data /tmp/tfx-data-external-splitsem7le30w/test/* to TFExample.
INFO:absl:Examples generated.
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,3
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } CsvExampleGen at 0x7f26ac112a00.inputs{}.outputs['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f26ac112e20.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0.exec_properties['input_base']/tmp/tfx-data-external-splitsem7le30w['input_config']{  ""splits"": [  {  ""name"": ""train"",  ""pattern"": ""train/*""  },  {  ""name"": ""eval"",  ""pattern"": ""eval/*""  },  {  ""name"": ""test"",  ""pattern"": ""test/*""  }  ] }['output_config']{}['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:train,num_files:1,total_bytes:1257775,xor_checksum:1631967242,sum_checksum:1631967242 split:eval,num_files:1,total_bytes:1687383,xor_checksum:1631967242,sum_checksum:1631967242 split:test,num_files:1,total_bytes:425123,xor_checksum:1631967242,sum_checksum:1631967242"
.component.inputs,{}
.component.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f26ac112e20.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.inputs,{}
.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f26ac112e20.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"
.exec_properties,"['input_base']/tmp/tfx-data-external-splitsem7le30w['input_config']{  ""splits"": [  {  ""name"": ""train"",  ""pattern"": ""train/*""  },  {  ""name"": ""eval"",  ""pattern"": ""eval/*""  },  {  ""name"": ""test"",  ""pattern"": ""test/*""  }  ] }['output_config']{}['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:train,num_files:1,total_bytes:1257775,xor_checksum:1631967242,sum_checksum:1631967242 split:eval,num_files:1,total_bytes:1687383,xor_checksum:1631967242,sum_checksum:1631967242 split:test,num_files:1,total_bytes:425123,xor_checksum:1631967242,sum_checksum:1631967242"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f26ac112e20.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/3
.span,0
.split_names,"[""train"", ""eval"", ""test""]"
.version,0

0,1
['input_base'],/tmp/tfx-data-external-splitsem7le30w
['input_config'],"{  ""splits"": [  {  ""name"": ""train"",  ""pattern"": ""train/*""  },  {  ""name"": ""eval"",  ""pattern"": ""eval/*""  },  {  ""name"": ""test"",  ""pattern"": ""test/*""  }  ] }"
['output_config'],{}
['output_data_format'],6
['output_file_format'],5
['custom_config'],
['range_config'],
['span'],0
['version'],
['input_fingerprint'],"split:train,num_files:1,total_bytes:1257775,xor_checksum:1631967242,sum_checksum:1631967242 split:eval,num_files:1,total_bytes:1687383,xor_checksum:1631967242,sum_checksum:1631967242 split:test,num_files:1,total_bytes:425123,xor_checksum:1631967242,sum_checksum:1631967242"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f26ac112e20.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/3) at 0x7f26ac0f8fa0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/3.span0.split_names[""train"", ""eval"", ""test""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/3
.span,0
.split_names,"[""train"", ""eval"", ""test""]"
.version,0


#### Convert CSV file to tf.Example (Default)

In [31]:
#Convert CSV to tf.Example
#Get all files form the _data_root folder
print(_data_root)
example_gen = CsvExampleGen(input_base=_data_root)
context.run(example_gen)

/tmp/tfx-data1hdfpoe3


INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Generating examples.
INFO:absl:Processing input csv data /tmp/tfx-data1hdfpoe3/* to TFExample.
INFO:absl:Examples generated.
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,4
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } CsvExampleGen at 0x7f28144cd9a0.inputs{}.outputs['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0.exec_properties['input_base']/tmp/tfx-data1hdfpoe3['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"
.component.inputs,{}
.component.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.inputs,{}
.outputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"
.exec_properties,"['input_base']/tmp/tfx-data1hdfpoe3['input_config']{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }['output_config']{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }['output_data_format']6['output_file_format']5['custom_config']None['range_config']None['span']0['version']None['input_fingerprint']split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/4
.span,0
.split_names,"[""train"", ""eval""]"
.version,0

0,1
['input_base'],/tmp/tfx-data1hdfpoe3
['input_config'],"{  ""splits"": [  {  ""name"": ""single_split"",  ""pattern"": ""*""  }  ] }"
['output_config'],"{  ""split_config"": {  ""splits"": [  {  ""hash_buckets"": 2,  ""name"": ""train""  },  {  ""hash_buckets"": 1,  ""name"": ""eval""  }  ]  } }"
['output_data_format'],6
['output_file_format'],5
['custom_config'],
['range_config'],
['span'],0
['version'],
['input_fingerprint'],"split:single_split,num_files:1,total_bytes:1922812,xor_checksum:1631967114,sum_checksum:1631967114"

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/4
.span,0
.split_names,"[""train"", ""eval""]"
.version,0


In [32]:
#Let's examine the output artifacts of ExampleGen. This component produces two artifacts, training examples and evaluation examples:
artifact = example_gen.outputs['examples'].get()[0]
print(artifact.split_names, artifact.uri)

["train", "eval"] ../tfx_poc/CsvExampleGen/examples/4


In [34]:
#Let's look at the first three training examples
#Get the URI of the output artifact representing the training examples
train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')

#Get the list of files in this directory
tfrecord_filenames = [os.path.join(train_uri, name) for name in os.listdir(train_uri)]

#Create a 'TFRecordDataset' to read thes files
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type = 'GZIP')

#Iterate over the first 1 records and decode them.
for tfrecord in dataset.take(1):
    serialized_example = tfrecord.numpy()
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    pp.pprint(example)

features {
  feature {
    key: "company"
    value {
      bytes_list {
        value: "Chicago Elite Cab Corp. (Chicago Carriag"
      }
    }
  }
  feature {
    key: "dropoff_census_tract"
    value {
      int64_list {
      }
    }
  }
  feature {
    key: "dropoff_community_area"
    value {
      int64_list {
      }
    }
  }
  feature {
    key: "dropoff_latitude"
    value {
      float_list {
      }
    }
  }
  feature {
    key: "dropoff_longitude"
    value {
      float_list {
      }
    }
  }
  feature {
    key: "fare"
    value {
      float_list {
        value: 12.449999809265137
      }
    }
  }
  feature {
    key: "payment_type"
    value {
      bytes_list {
        value: "Credit Card"
      }
    }
  }
  feature {
    key: "pickup_census_tract"
    value {
      int64_list {
      }
    }
  }
  feature {
    key: "pickup_community_area"
    value {
      int64_list {
      }
    }
  }
  feature {
    key: "pickup_latitude"
    value {
      float_list {
   

---
<a id='toc03'></a>

## Data Validation

### TensorFlow Data Validation (TFDV) - Standalone Package
TFDV can be used as a standalone package but is also a part of TFX

In [41]:
import tensorflow_data_validation as tfdv

# Generate summary statistics of data

# From CSV
stats1 = tfdv.generate_statistics_from_csv(data_location=_data_filepath, delimiter=',')

# From TFRecord
stats2 = tfdv.generate_statistics_from_tfrecord(data_location=_data_filepath2)



#stats1
```
feature {
  name: "payment_type"
  type: BYTES
  domain: "payment_type"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "company"
  value_count {
    min: 1
    max: 1
  }
  type: BYTES
  domain: "company"
  presence {
    min_count: 1
  }
}
...
```

In [55]:
# Generate data schema
schema1 = tfdv.infer_schema(stats1)
schema2 = tfdv.infer_schema(stats2)

tfdv.display_schema(schema1)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'payment_type',STRING,required,,'payment_type'
'company',STRING,optional,single,'company'
'pickup_community_area',INT,optional,single,-
'fare',FLOAT,required,,-
'trip_start_month',INT,required,,-
'trip_start_hour',INT,required,,-
'trip_start_day',INT,required,,-
'trip_start_timestamp',INT,required,,-
'pickup_latitude',FLOAT,optional,single,-
'pickup_longitude',FLOAT,optional,single,-


  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'payment_type',"'Cash', 'Credit Card', 'Dispute', 'No Charge', 'Pcard', 'Prcard', 'Unknown'"
'company',"'0118 - 42111 Godfrey S.Awir', '0694 - 59280 Chinesco Trans Inc', '1085 - 72312 N and W Cab Co', '2092 - 61288 Sbeih company', '2192 - 73487 Zeymane Corp', '2192 - Zeymane Corp', '2733 - 74600 Benny Jona', '2809 - 95474 C & D Cab Co Inc.', '2823 - 73307 Seung Lee', '3011 - 66308 JBL Cab Inc.', '3094 - 24059 G.L.B. Cab Co', '3152 - 97284 Crystal Abernathy', '3201 - C&D Cab Co Inc', '3201 - CID Cab Co Inc', '3253 - 91138 Gaither Cab Co.', '3319 - CD Cab Co', '3385 - 23210 Eman Cab', '3385 - Eman Cab', '3623 - 72222 Arrington Enterprises', '3897 - 57856 Ilie Malec', '3897 - Ilie Malec', '4053 - 40193 Adwar H. Nikola', '4053 - Adwar H. Nikola', '4197 - 41842 Royal Star', '4197 - Royal Star', '4615 - 83503 Tyrone Henderson', '4615 - Tyrone Henderson', '4623 - Jay Kim', '5006 - 39261 Salifu Bawa', '5006 - Salifu Bawa', '5074 - 54002 Ahzmi Inc', '5074 - Ahzmi Inc', '5129 - 87128', '5129 - 98755 Mengisti Taxi', '5129 - Mengisti Taxi', '5724 - KYVI Cab Inc', '585 - 88805 Valley Cab Co', '585 - Valley Cab Co', '5864 - 73614 Thomas Owusu', '5864 - Thomas Owusu', '5874 - 73628 Sergey Cab Corp.', '5874 - Sergey Cab Corp.', '5997 - 65283 AW Services Inc.', '5997 - AW Services Inc.', '6057 - 24657 Richard Addo', '6488 - 83287 Zuha Taxi', '6574 - Babylon Express Inc.', '6742 - 83735 Tasha ride inc', '6743 - Luhak Corp', 'Blue Ribbon Taxi Association Inc.', 'C & D Cab Co Inc', 'Chicago Elite Cab Corp.', 'Chicago Elite Cab Corp. (Chicago Carriag', 'Chicago Medallion Leasing INC', 'Chicago Medallion Management', 'Choice Taxi Association', 'Dispatch Taxi Affiliation', 'KOAM Taxi Association', 'Northwest Management LLC', 'Taxi Affiliation Services', 'Top Cab Affiliation'"


#### Compare Datasets and Looking for Anomalies

In [50]:
# Compare datasets statistics
train_dataset_file = os.path.join(_data_root3, 'train/data-train.csv')
eval_dataset_file = os.path.join(_data_root3, 'eval/data-eval.csv')

train_stats = tfdv.generate_statistics_from_csv(data_location=train_dataset_file, delimiter=',')
eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_dataset_file, delimiter=',')

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats, lhs_name='VALIDATION', rhs_name='TRAINING')



In [52]:
schema3 = tfdv.infer_schema(train_stats)

#Detect anomalies
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema3)

tfdv.display_anomalies(anomalies)

  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'company',Unexpected string values,"Examples contain values missing from the schema: 2809 - 95474 C & D Cab Co Inc. (<1%), 3897 - Ilie Malec (<1%), 5864 - Thomas Owusu (<1%)."


anomalies
```
anomaly_info {
  key: "company"
  value {
    description: "Examples contain values missing from the schema: 2809 - 95474 C & D Cab Co Inc. (<1%), 3897 - Ilie Malec (<1%), 5864 - Thomas Owusu (<1%). "
    severity: ERROR
    short_description: "Unexpected string values"
    reason {
      type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
      short_description: "Unexpected string values"
      description: "Examples contain values missing from the schema: 2809 - 95474 C & D Cab Co Inc. (<1%), 3897 - Ilie Malec (<1%), 5864 - Thomas Owusu (<1%). "
    }
    path {
      step: "company"
    }
  }
}
anomaly_name_format: SERIALIZED_PATH
```

#### Updating the Schema

In [58]:
# Load schema from its serialized location
schema4 = tfdv.load_schema_text('schema.txt')

In [60]:
# Modify fare feature schema manually
fare_feature = tfdv.get_feature(schema4, 'fare')
fare_feature.presence.min_fraction = 0.8

In [61]:
# Modify payment type
payment_type_domain = tfdv.get_domain(schema4, 'payment_type')
payment_type_domain.value.remove('Pcard')

In [62]:
# Save/serialize the schema
tfdv.write_schema_text(schema4, 'schema_manual.txt')

In [63]:
# Revalidate the statistics to view the updated anomalies
updated_anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema4)
tfdv.display_anomalies(updated_anomalies)

  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'payment_type',Unexpected string values,Examples contain values missing from the schema: Pcard (<1%).
'company',Unexpected string values,"Examples contain values missing from the schema: 2809 - 95474 C & D Cab Co Inc. (<1%), 3897 - Ilie Malec (<1%), 5864 - Thomas Owusu (<1%)."


#### Data Skew and Drift

In [70]:
# Compare the skew (L-infinity norm of the difference between serving_statistics of two datasets)
tfdv.get_feature(schema4, 'fare').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema4, serving_statistics=eval_stats)
tfdv.display_anomalies(skew_anomalies)

  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'payment_type',Unexpected string values,Examples contain values missing from the schema: Pcard (<1%).


In [71]:
# Compare the drift (ex. stats from yesterday to stats from today)
tfdv.get_feature(schema4, 'fare').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema4, previous_statistics=eval_stats)
tfdv.display_anomalies(drift_anomalies)

  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'payment_type',Unexpected string values,Examples contain values missing from the schema: Pcard (<1%).


### 4 - The StatisticsGen TFX Pipeline Component
The **StatisticsGen** TFX pipeline component generates features statistics over both training and serving data, which can be used by other pipeline components. **StatisticsGen** uses Beam to scale to large datasets.
- **INTPUT**: datasets created by an ExampleGen pipeline component.
- **OUTPUT**: Dataset statistics.

In [72]:
#The StatisticsGen component computes statistics over the dataset for data analysis 
#StatisticsGen takes as input the dataset we just ingested using ExampleGen
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Running driver for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running publisher for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,7
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } StatisticsGen at 0x7f25d6e9f370.inputs['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0.outputs['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""].exec_properties['stats_options_json']None['exclude_splits'][]"
.component.inputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"
.component.outputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.inputs,"['examples'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"
.outputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"
.exec_properties,['stats_options_json']None['exclude_splits'][]

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/4
.span,0
.split_names,"[""train"", ""eval""]"
.version,0

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"

0,1
['stats_options_json'],
['exclude_splits'],[]

0,1
['examples'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Examples' (1 artifact) at 0x7f28144cd3a0.type_nameExamples._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type_name,Examples
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Examples' (uri: ../tfx_poc/CsvExampleGen/examples/4) at 0x7f28147cc1f0.type<class 'tfx.types.standard_artifacts.Examples'>.uri../tfx_poc/CsvExampleGen/examples/4.span0.split_names[""train"", ""eval""].version0"

0,1
.type,<class 'tfx.types.standard_artifacts.Examples'>
.uri,../tfx_poc/CsvExampleGen/examples/4
.span,0
.split_names,"[""train"", ""eval""]"
.version,0

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"


In [73]:
context.show(statistics_gen.outputs['statistics'])

### 5 - The SchemaGen TFX Pipeline Component
Some TFX components use a description of your input data called a schema. The schema is an instance of schema.proto. It can specify data types for feature values, whether a feature has to be present in all examples, allowed value ranges, and other properties. A **SchemaGen** pipeline component will automatically generate a schema by inferring types, categories, and ranges from the training data.
- **INPUT**: statistics from a StatisticsGen component.
- **OUTPUT**: Data schema proto.

In [74]:
#the SchemaGen component generates a schema based on your data statistics.
#SchemaGen uses the TensorFlow Data Validation
#SchemaGen will take as input the statistics that we generated with StatisticsGen, looking at the training split by default.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'], infer_feature_shape=False)
context.run(schema_gen)

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Running driver for SchemaGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running publisher for SchemaGen
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,8
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } SchemaGen at 0x7f26ac25bc70.inputs['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""].outputs['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6.exec_properties['infer_feature_shape']0['exclude_splits'][]"
.component.inputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"
.component.outputs,['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.inputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"
.outputs,['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6
.exec_properties,['infer_feature_shape']0['exclude_splits'][]

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"

0,1
['schema'],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type_name,Schema
._artifacts,[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
[0],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type,<class 'tfx.types.standard_artifacts.Schema'>
.uri,../tfx_poc/SchemaGen/schema/6

0,1
['infer_feature_shape'],0
['exclude_splits'],[]

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"

0,1
['schema'],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type_name,Schema
._artifacts,[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
[0],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type,<class 'tfx.types.standard_artifacts.Schema'>
.uri,../tfx_poc/SchemaGen/schema/6


Each feature in your dataset shows up as a row in the schema table, alongside its properties. The schema also captures all the values that a categorical feature takes on, denoted as its domain.

In [75]:
context.show(schema_gen.outputs['schema'])

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'company',STRING,required,,'company'
'dropoff_census_tract',INT,required,,-
'dropoff_community_area',INT,required,,-
'dropoff_latitude',FLOAT,required,,-
'dropoff_longitude',FLOAT,required,,-
'fare',FLOAT,required,single,-
'payment_type',STRING,required,single,'payment_type'
'pickup_census_tract',INT,required,,-
'pickup_community_area',INT,required,,-
'pickup_latitude',FLOAT,required,,-


  pd.set_option('max_colwidth', -1)


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'company',"'0118 - 42111 Godfrey S.Awir', '1085 - 72312 N and W Cab Co', '2192 - 73487 Zeymane Corp', '2733 - 74600 Benny Jona', '3011 - 66308 JBL Cab Inc.', '3152 - 97284 Crystal Abernathy', '3201 - C&D Cab Co Inc', '3201 - CID Cab Co Inc', '3253 - 91138 Gaither Cab Co.', '3319 - CD Cab Co', '3385 - 23210 Eman Cab', '3385 - Eman Cab', '3623 - 72222 Arrington Enterprises', '3897 - 57856 Ilie Malec', '4053 - 40193 Adwar H. Nikola', '4197 - 41842 Royal Star', '4197 - Royal Star', '4615 - 83503 Tyrone Henderson', '4615 - Tyrone Henderson', '4623 - Jay Kim', '5006 - 39261 Salifu Bawa', '5074 - 54002 Ahzmi Inc', '5074 - Ahzmi Inc', '5129 - 87128', '5129 - 98755 Mengisti Taxi', '585 - 88805 Valley Cab Co', '5864 - Thomas Owusu', '5874 - 73628 Sergey Cab Corp.', '5874 - Sergey Cab Corp.', '5997 - 65283 AW Services Inc.', '6488 - 83287 Zuha Taxi', '6574 - Babylon Express Inc.', '6742 - 83735 Tasha ride inc', 'Blue Ribbon Taxi Association Inc.', 'C & D Cab Co Inc', 'Chicago Elite Cab Corp.', 'Chicago Elite Cab Corp. (Chicago Carriag', 'Chicago Medallion Leasing INC', 'Chicago Medallion Management', 'Choice Taxi Association', 'Dispatch Taxi Affiliation', 'KOAM Taxi Association', 'Northwest Management LLC', 'Taxi Affiliation Services', 'Top Cab Affiliation', '0694 - 59280 Chinesco Trans Inc', '2092 - 61288 Sbeih company', '2192 - Zeymane Corp', '2809 - 95474 C & D Cab Co Inc.', '2823 - 73307 Seung Lee', '3094 - 24059 G.L.B. Cab Co', '3897 - Ilie Malec', '4053 - Adwar H. Nikola', '5006 - Salifu Bawa', '5129 - Mengisti Taxi', '5724 - KYVI Cab Inc', '585 - Valley Cab Co', '5864 - 73614 Thomas Owusu', '5997 - AW Services Inc.', '6057 - 24657 Richard Addo', '6743 - Luhak Corp'"
'payment_type',"'Cash', 'Credit Card', 'Dispute', 'No Charge', 'Pcard', 'Unknown', 'Prcard'"


### 6 - The ExampleValidator TFX Pipeline Component
The **ExampleValidator** pipeline component identifies anomalies in training and serving data. It can detect different classes of anomalies in the data. For example it can:
- Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.
- Detect training-serving skew by comparing training and serving data.
- Detect data drift by looking at a series of data. The **ExampleValidator** pipeline component identifies any anomalies in the example data by comparing data statistics computed by the **StatisticsGen** pipeline component against a schema (by **SchemaGen**). The inferred schema codifies properties which the input data is expected to satisfy, and can be modified by the developer.


- **INPUT**: A schema from a SchemaGen component, and statistics from a StatisticsGen component.
- **OUTPUT**: Validation results.

In [76]:
#ExampleValidator
example_validator = ExampleValidator(statistics=statistics_gen.outputs['statistics'], schema=schema_gen.outputs['schema'])
context.run(example_validator)

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Running driver for ExampleValidator
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for ExampleValidator
INFO:absl:Validating schema against the computed statistics for split train.
INFO:absl:Validation complete for split train. Anomalies written to ../tfx_poc/ExampleValidator/anomalies/9/Split-train.
INFO:absl:Validating schema against the computed statistics for split eval.
INFO:absl:Validation complete for split eval. Anomalies written to ../tfx_poc/ExampleValidator/anomalies/9/Split-eval.
INFO:absl:Running publisher for ExampleValidator
INFO:absl:MetadataStore with DB connection initialized


0,1
.execution_id,9
.component,"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } ExampleValidator at 0x7f26ac0b3fd0.inputs['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6.outputs['anomalies'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleAnomalies' (1 artifact) at 0x7f25f42dc8b0.type_nameExampleAnomalies._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""].exec_properties['exclude_splits'][]"
.component.inputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6"
.component.outputs,"['anomalies'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleAnomalies' (1 artifact) at 0x7f25f42dc8b0.type_nameExampleAnomalies._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
.inputs,"['statistics'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]['schema'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6"
.outputs,"['anomalies'] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleAnomalies' (1 artifact) at 0x7f25f42dc8b0.type_nameExampleAnomalies._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"
.exec_properties,['exclude_splits'][]

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"
['schema'],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"

0,1
.type_name,Schema
._artifacts,[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
[0],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type,<class 'tfx.types.standard_artifacts.Schema'>
.uri,../tfx_poc/SchemaGen/schema/6

0,1
['anomalies'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleAnomalies' (1 artifact) at 0x7f25f42dc8b0.type_nameExampleAnomalies._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleAnomalies
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleAnomalies'>
.uri,../tfx_poc/ExampleValidator/anomalies/9
.span,0
.split_names,"[""train"", ""eval""]"

0,1
['exclude_splits'],[]

0,1
['statistics'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleStatistics' (1 artifact) at 0x7f26ac25b580.type_nameExampleStatistics._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"
['schema'],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'Schema' (1 artifact) at 0x7f26ac25bd90.type_nameSchema._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type_name,ExampleStatistics
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleStatistics' (uri: ../tfx_poc/StatisticsGen/statistics/5) at 0x7f26ac25b1c0.type<class 'tfx.types.standard_artifacts.ExampleStatistics'>.uri../tfx_poc/StatisticsGen/statistics/5.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleStatistics'>
.uri,../tfx_poc/StatisticsGen/statistics/5
.span,0
.split_names,"[""train"", ""eval""]"

0,1
.type_name,Schema
._artifacts,[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
[0],function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'Schema' (uri: ../tfx_poc/SchemaGen/schema/6) at 0x7f25d6e9fe80.type<class 'tfx.types.standard_artifacts.Schema'>.uri../tfx_poc/SchemaGen/schema/6

0,1
.type,<class 'tfx.types.standard_artifacts.Schema'>
.uri,../tfx_poc/SchemaGen/schema/6

0,1
['anomalies'],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Channel of type 'ExampleAnomalies' (1 artifact) at 0x7f25f42dc8b0.type_nameExampleAnomalies._artifacts[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
.type_name,ExampleAnomalies
._artifacts,"[0] function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
[0],"function toggleTfxObject(element) {  var objElement = element.parentElement;  if (objElement.classList.contains('collapsed')) {  objElement.classList.remove('collapsed');  objElement.classList.add('expanded');  } else {  objElement.classList.add('collapsed');  objElement.classList.remove('expanded');  } } Artifact of type 'ExampleAnomalies' (uri: ../tfx_poc/ExampleValidator/anomalies/9) at 0x7f25d6e9f880.type<class 'tfx.types.standard_artifacts.ExampleAnomalies'>.uri../tfx_poc/ExampleValidator/anomalies/9.span0.split_names[""train"", ""eval""]"

0,1
.type,<class 'tfx.types.standard_artifacts.ExampleAnomalies'>
.uri,../tfx_poc/ExampleValidator/anomalies/9
.span,0
.split_names,"[""train"", ""eval""]"


In [77]:
#Visualize the anomalies as a table
context.show(example_validator.outputs['anomalies'])

  pd.set_option('max_colwidth', -1)


---
<a id='toc04'></a>

## Data Preprocessing

In [None]:
_taxi_constants_module_file = 'taxi_constants.py'

In [None]:
%%writefile {_taxi_constants_module_file}

# Categorical features are assumed to each have a maximum value in the dataset.
MAX_CATEGORICAL_FEATURE_VALUES = [24, 31, 12]

CATEGORICAL_FEATURE_KEYS = [
    'trip_start_hour', 'trip_start_day', 'trip_start_month',
    'pickup_census_tract', 'dropoff_census_tract', 'pickup_community_area',
    'dropoff_community_area'
]

DENSE_FLOAT_FEATURE_KEYS = ['trip_miles', 'fare', 'trip_seconds']

# Number of buckets used by tf.transform for encoding each feature.
FEATURE_BUCKET_COUNT = 10

BUCKET_FEATURE_KEYS = [
    'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
    'dropoff_longitude'
]

# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform
VOCAB_SIZE = 1000

# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.
OOV_SIZE = 10

VOCAB_FEATURE_KEYS = [
    'payment_type',
    'company',
]

# Keys
LABEL_KEY = 'tips'
FARE_KEY = 'fare'

def transformed_name(key):
    return key + '_xf'

In [None]:
_taxi_transform_module_file = 'taxi_transform.py'

In [None]:
%%writefile {_taxi_transform_module_file}

import tensorflow as tf
import tensorflow_transform as tft

import taxi_constants

_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS
_VOCAB_SIZE = taxi_constants.VOCAB_SIZE
_OOV_SIZE = taxi_constants.OOV_SIZE
_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT
_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS
_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS
_FARE_KEY = taxi_constants.FARE_KEY
_LABEL_KEY = taxi_constants.LABEL_KEY
_transformed_name = taxi_constants.transformed_name


def preprocessing_fn(inputs):
  """tf.transform's callback function for preprocessing inputs.
  Args:
    inputs: map from feature keys to raw not-yet-transformed features.
  Returns:
    Map from string feature key to transformed feature operations.
  """
  outputs = {}
  for key in _DENSE_FLOAT_FEATURE_KEYS:
  # Preserve this feature as a dense float, setting nan's to the mean.
    outputs[_transformed_name(key)] = tft.scale_to_z_score(_fill_in_missing(inputs[key]))

  for key in _VOCAB_FEATURE_KEYS:
  # Build a vocabulary for this feature.
    outputs[_transformed_name(key)] = tft.compute_and_apply_vocabulary(_fill_in_missing(inputs[key]),
    top_k=_VOCAB_SIZE,
    num_oov_buckets=_OOV_SIZE)

  for key in _BUCKET_FEATURE_KEYS:
    outputs[_transformed_name(key)] = tft.bucketize(_fill_in_missing(inputs[key]), _FEATURE_BUCKET_COUNT)

  for key in _CATEGORICAL_FEATURE_KEYS:
    outputs[_transformed_name(key)] = _fill_in_missing(inputs[key])

# Was this passenger a big tipper?
  taxi_fare = _fill_in_missing(inputs[_FARE_KEY])
  tips = _fill_in_missing(inputs[_LABEL_KEY])
  outputs[_transformed_name(_LABEL_KEY)] = tf.where(
  tf.math.is_nan(taxi_fare),
  tf.cast(tf.zeros_like(taxi_fare), tf.int64),
  # Test if the tip was > 20% of the fare.
  tf.cast(tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))

  return outputs


def _fill_in_missing(x):
  """Replace missing values in a SparseTensor.
  Fills in missing values of `x` with '' or 0, and converts to a dense tensor.
  Args:
    x: A `SparseTensor` of rank 2.  Its dense shape should have size at most 1
      in the second dimension.
  Returns:
    A rank 1 tensor where missing values of `x` have been filled in.
  """
  if not isinstance(x, tf.sparse.SparseTensor):
    return x

  default_value = '' if x.dtype == tf.string else 0
  return tf.squeeze(
      tf.sparse.to_dense(
          tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
          default_value),
      axis=1)

In [None]:
#Transform performs Data/Feature Engineering steps on the dataset.
#It uses TensorFlow Transform
#It takes data from ExampleGen, the schema from SchemaGen, and a module that contains user-defined Transform code as an input.
transform = Transform(examples=example_gen.outputs['examples'], schema=schema_gen.outputs['schema'],module_file=os.path.abspath(_taxi_transform_module_file))
context.run(transform)

In [None]:
transform.outputs

In [None]:

train_uri = transform.outputs['transform_graph'].get()[0].uri
os.listdir(train_uri)

In [None]:
#Get the URI of the output artifact
train_uri = os.path.join(transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')

#Get the list of files in this directory
tfrecord_filenames = [os.path.join(train_uri, name) for name in os.listdir(train_uri)]

#Create a "TFRecordDataset" to read these file
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type='GZIP')

#Iterate over the first 3 records and decode them.
for tfrecord in dataset.take(3):
  serialized_example = tfrecord.numpy()
  example = tf.train.Example()
  example.ParseFromString(serialized_example)
  pp.pprint(example)

In [None]:
_taxi_trainer_module_file = 'taxi_trainer.py'

In [None]:
%%writefile {_taxi_trainer_module_file}

from typing import List, Text

import os
import absl
import datetime
import tensorflow as tf
import tensorflow_transform as tft

from tfx import v1 as tfx
from tfx_bsl.public import tfxio

import taxi_constants

_DENSE_FLOAT_FEATURE_KEYS = taxi_constants.DENSE_FLOAT_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = taxi_constants.VOCAB_FEATURE_KEYS
_VOCAB_SIZE = taxi_constants.VOCAB_SIZE
_OOV_SIZE = taxi_constants.OOV_SIZE
_FEATURE_BUCKET_COUNT = taxi_constants.FEATURE_BUCKET_COUNT
_BUCKET_FEATURE_KEYS = taxi_constants.BUCKET_FEATURE_KEYS
_CATEGORICAL_FEATURE_KEYS = taxi_constants.CATEGORICAL_FEATURE_KEYS
_MAX_CATEGORICAL_FEATURE_VALUES = taxi_constants.MAX_CATEGORICAL_FEATURE_VALUES
_LABEL_KEY = taxi_constants.LABEL_KEY
_transformed_name = taxi_constants.transformed_name


def _transformed_names(keys):
  return [_transformed_name(key) for key in keys]


def _get_serve_tf_examples_fn(model, tf_transform_output):
  """Returns a function that parses a serialized tf.Example and applies TFT."""

  model.tft_layer = tf_transform_output.transform_features_layer()

  @tf.function
  def serve_tf_examples_fn(serialized_tf_examples):
    """Returns the output to be used in the serving signature."""
    feature_spec = tf_transform_output.raw_feature_spec()
    feature_spec.pop(_LABEL_KEY)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
    transformed_features = model.tft_layer(parsed_features)
    return model(transformed_features)

  return serve_tf_examples_fn


def _input_fn(file_pattern: List[Text],
              data_accessor: tfx.components.DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for tuning/training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    tf_transform_output: A TFTransformOutput.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_transformed_name(_LABEL_KEY)),
      tf_transform_output.transformed_metadata.schema)


def _build_keras_model(hidden_units: List[int] = None) -> tf.keras.Model:
  """Creates a DNN Keras model for classifying taxi data.

  Args:
    hidden_units: [int], the layer sizes of the DNN (input layer first).

  Returns:
    A keras Model.
  """
  real_valued_columns = [
      tf.feature_column.numeric_column(key, shape=())
      for key in _transformed_names(_DENSE_FLOAT_FEATURE_KEYS)
  ]
  categorical_columns = [
      tf.feature_column.categorical_column_with_identity(
          key, num_buckets=_VOCAB_SIZE + _OOV_SIZE, default_value=0)
      for key in _transformed_names(_VOCAB_FEATURE_KEYS)
  ]
  categorical_columns += [
      tf.feature_column.categorical_column_with_identity(
          key, num_buckets=_FEATURE_BUCKET_COUNT, default_value=0)
      for key in _transformed_names(_BUCKET_FEATURE_KEYS)
  ]
  categorical_columns += [
      tf.feature_column.categorical_column_with_identity(  # pylint: disable=g-complex-comprehension
          key,
          num_buckets=num_buckets,
          default_value=0) for key, num_buckets in zip(
              _transformed_names(_CATEGORICAL_FEATURE_KEYS),
              _MAX_CATEGORICAL_FEATURE_VALUES)
  ]
  indicator_column = [
      tf.feature_column.indicator_column(categorical_column)
      for categorical_column in categorical_columns
  ]

  model = _wide_and_deep_classifier(
      # TODO(b/139668410) replace with premade wide_and_deep keras model
      wide_columns=indicator_column,
      deep_columns=real_valued_columns,
      dnn_hidden_units=hidden_units or [100, 70, 50, 25])
  return model


def _wide_and_deep_classifier(wide_columns, deep_columns, dnn_hidden_units):
  """Build a simple keras wide and deep model.

  Args:
    wide_columns: Feature columns wrapped in indicator_column for wide (linear)
      part of the model.
    deep_columns: Feature columns for deep part of the model.
    dnn_hidden_units: [int], the layer sizes of the hidden DNN.

  Returns:
    A Wide and Deep Keras model
  """
  # Following values are hard coded for simplicity in this example,
  # However prefarably they should be passsed in as hparams.

  # Keras needs the feature definitions at compile time.
  # TODO(b/139081439): Automate generation of input layers from FeatureColumn.
  input_layers = {
      colname: tf.keras.layers.Input(name=colname, shape=(), dtype=tf.float32)
      for colname in _transformed_names(_DENSE_FLOAT_FEATURE_KEYS)
  }
  input_layers.update({
      colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
      for colname in _transformed_names(_VOCAB_FEATURE_KEYS)
  })
  input_layers.update({
      colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
      for colname in _transformed_names(_BUCKET_FEATURE_KEYS)
  })
  input_layers.update({
      colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
      for colname in _transformed_names(_CATEGORICAL_FEATURE_KEYS)
  })

  # TODO(b/161952382): Replace with Keras preprocessing layers.
  deep = tf.keras.layers.DenseFeatures(deep_columns)(input_layers)
  for numnodes in dnn_hidden_units:
    deep = tf.keras.layers.Dense(numnodes)(deep)
  wide = tf.keras.layers.DenseFeatures(wide_columns)(input_layers)

  output = tf.keras.layers.Dense(
      1, activation='sigmoid')(
          tf.keras.layers.concatenate([deep, wide]))

  model = tf.keras.Model(input_layers, output)
  model.compile(
      loss='binary_crossentropy',
      optimizer=tf.keras.optimizers.Adam(lr=0.001),
      metrics=[tf.keras.metrics.BinaryAccuracy()])
  model.summary(print_fn=absl.logging.info)
  return model


# TFX Trainer will call this function.
def run_fn(fn_args: tfx.components.FnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """
  # Number of nodes in the first layer of the DNN
  first_dnn_layer_size = 100
  num_dnn_layers = 4
  dnn_decay_factor = 0.7

  tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)

  train_dataset = _input_fn(fn_args.train_files, fn_args.data_accessor, 
                            tf_transform_output, 40)
  eval_dataset = _input_fn(fn_args.eval_files, fn_args.data_accessor, 
                           tf_transform_output, 40)

  model = _build_keras_model(
      # Construct layers sizes with exponetial decay
      hidden_units=[
          max(2, int(first_dnn_layer_size * dnn_decay_factor**i))
          for i in range(num_dnn_layers)
      ])

  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=fn_args.model_run_dir, update_freq='batch')
  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps,
      callbacks=[tensorboard_callback])

  signatures = {
      'serving_default':
          _get_serve_tf_examples_fn(model,
                                    tf_transform_output).get_concrete_function(
                                        tf.TensorSpec(
                                            shape=[None],
                                            dtype=tf.string,
                                            name='examples')),
  }
  model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)

In [None]:
trainer = tfx.components.Trainer(
    module_file=os.path.abspath(_taxi_trainer_module_file),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=10000),
    eval_args=tfx.proto.EvalArgs(num_steps=5000))
context.run(trainer)

In [None]:

model_artifact_dir = trainer.outputs['model'].get()[0].uri
pp.pprint(os.listdir(model_artifact_dir))
model_dir = os.path.join(model_artifact_dir, "Format-Serving")
pp.pprint(os.listdir(model_dir))

In [None]:
!pip install tensorflow

In [None]:
%reload_ext tensorboard

In [None]:
#lets connect to TensorBoard to the Trainer to analyzer our model's training
model_run_artifact_dir = trainer.outputs['model_run'].get()[0].uri
%load_ext tensorboard 
%tensorboard --logdir {model_run_artifact_dir}

In [None]:
#Evaluator
#The evaluator component computes model performance metrices over the evaluation set.
#It uses the TensorFlow Model Analysis library.
#The Evaluator can aslo optionally validate that a newly trained model is better that the previous model.
eval_config = tfma.EvalConfig(
    model_specs=[
        # This assumes a serving model with signature 'serving_default'. If
        # using estimator based EvalSavedModel, add signature_name: 'eval' and 
        # remove the label_key.
        tfma.ModelSpec(label_key='tips')
    ],
    metrics_specs=[
        tfma.MetricsSpec(
            # The metrics added here are in addition to those saved with the
            # model (assuming either a keras model or EvalSavedModel is used).
            # Any metrics added into the saved model (for example using
            # model.compile(..., metrics=[...]), etc) will be computed
            # automatically.
            # To add validation thresholds for metrics saved with the model,
            # add them keyed by metric name to the thresholds map.
            metrics=[
                tfma.MetricConfig(class_name='ExampleCount'),
                tfma.MetricConfig(class_name='BinaryAccuracy',
                  threshold=tfma.MetricThreshold(
                      value_threshold=tfma.GenericValueThreshold(
                          lower_bound={'value': 0.5}),
                      # Change threshold will be ignored if there is no
                      # baseline model resolved from MLMD (first run).
                      change_threshold=tfma.GenericChangeThreshold(
                          direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                          absolute={'value': -1e-10})))
            ]
        )
    ],
    slicing_specs=[
        # An empty slice spec means the overall slice, i.e. the whole dataset.
        tfma.SlicingSpec(),
        # Data can be sliced along a feature column. In this case, data is
        # sliced along feature column trip_start_hour.
        tfma.SlicingSpec(feature_keys=['trip_start_hour'])
    ])

In [None]:
#Let's configura the Evaluator and run it.
model_resolver = tfx.dsl.Resolver(strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model = tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing = tfx.dsl.Channel( type=tfx.types.standard_artifacts.ModelBlessing)).with_id('latest_blessed_model_resolver')
context.run(model_resolver)

evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model= trainer.outputs['model'],
    baseline_model = model_resolver.outputs['model'],
    eval_config= eval_config
)
context.run(evaluator)

In [None]:
#Let's examine the output artifacts of Evaluator
evaluator.outputs

In [None]:
context.show(evaluator.outputs['evaluation'])

In [None]:
import tensorflow_model_analysis as tfma

#Get the TFMA output result path and load the result
PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri
tfma_result = tfma.load_eval_result(PATH_TO_RESULT)

#Show data sliced along feature column trip_start_hour.
tfma.view.render_slicing_metrics(
    tfma_result, slicing_column='trip_start_hour'
)

In [None]:
blessing_uri = evaluator.outputs.blessing.get()[0].uri
!ls -l {blessing_uri}

In [None]:

PATH_TO_RESULT = evaluator.outputs['evaluation'].get()[0].uri
print(tfma.load_validation_result(PATH_TO_RESULT))

In [None]:
#Pusher
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem = tfx.proto.PushDestination.Filesystem(
            base_directory = _serving_model_dir
        )
    )
)

context.run(pusher)

In [None]:
#Let's examine the outputs artifacts of Pusher.
pusher.outputs

In [None]:
push_uri = pusher.outputs.pushed_model.get()[0].uri
model = tf.saved_model.load(push_uri)

for item in model.signatures.items():
  pp.pprint(item)