# Requirements

TensorFlow Transform (`tf.Transform`) depends on both `TensorFlow` and `Apache Beam`. If you have installed `TensorFlow Extended (TFX)`, no additional packages are needed. Otherwise, `TensorFlow` and `Apache Beam` are the minimum requirements, and you can install `tensorflow-transform` once they are in place.

In [None]:
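!pip install tensorflow==1.14.0
# Assumption: 0.14.0 is the tensorflow-transform release matching TensorFlow 1.14
!pip install tensorflow-transform==0.14.0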

In [None]:
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

print("TF version: {}".format(tf.__version__))

# Define a preprocessing function

Two kinds of operations can be used inside a `preprocessing_fn`:
* Any function that accepts and returns tensors. These operations are added to the TensorFlow graph.
* Any of the analyzers provided by `tf.Transform`. Analyzers also accept and return tensors, but their operations are not added to the TensorFlow graph. They run outside the graph, and their result is embedded in the graph as a constant tensor.

An analyzer takes input with a shape such as `(batch_size,)` but returns a tensor with a shape such as `()`. It computes its result from a full pass over the entire dataset, not from a single batch or a single example.
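The difference between a full-pass analyzer and a per-batch computation can be illustrated without TensorFlow. The sketch below is a plain-Python emulation, not the `tf.Transform` API: the helper names `analyze_mean` and `center` and the two-batch split are hypothetical and exist only for illustration.

```python
# Plain-Python emulation of the analyze/transform split (not the tf.Transform API).

def analyze_mean(batches):
    """Full pass over every batch; returns one scalar, like an analyzer's () result."""
    total, count = 0.0, 0
    for batch in batches:
        total += sum(batch)
        count += len(batch)
    return total / count


def center(batch, mean_constant):
    """Per-batch operation: subtract a precomputed constant from each element."""
    return [value - mean_constant for value in batch]


# Hypothetical data, split into two batches of different sizes.
batches = [[1.0, 2.0], [3.0]]

# Analyzer-style: the mean is computed once over all batches, then reused.
global_mean = analyze_mean(batches)
centered = [center(b, global_mean) for b in batches]

# Per-batch mean, shown for contrast: each batch is centered on its own mean.
per_batch_centered = [center(b, sum(b) / len(b)) for b in batches]

print(global_mean)
print(centered)
print(per_batch_centered)
```

With the global mean, every batch is shifted by the same constant, so the result does not depend on how the data happens to be batched. Centering each batch on its own mean gives different values for the same inputs.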

In [None]:
def preprocessing_fn(inputs):
    x = inputs["x"]
    y = inputs["y"]
    s = inputs["s"]
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        "x_centered": x_centered,
        "y_normalized": y_normalized,
        "s_integerized": s_integerized,
        "x_centered_times_y_normalized": x_centered_times_y_normalized
    }

# Apache Beam Implementation

In [None]:
raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'}
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        's': tf.FixedLenFeature([], tf.string),
        'y': tf.FixedLenFeature([], tf.float32),
        'x': tf.FixedLenFeature([], tf.float32),
    }))

# A tft_beam.Context is required so that temporary files can be written to temp_dir
with tft_beam.Context(temp_dir="/tmp"):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

    transformed_data, transformed_metadata = transformed_dataset

In [6]:
transformed_data, type(transformed_data)

([{'s_integerized': 0,
   'x_centered': -1.0,
   'x_centered_times_y_normalized': -0.0,
   'y_normalized': 0.0},
  {'s_integerized': 1,
   'x_centered': 0.0,
   'x_centered_times_y_normalized': 0.0,
   'y_normalized': 0.5},
  {'s_integerized': 0,
   'x_centered': 1.0,
   'x_centered_times_y_normalized': 1.0,
   'y_normalized': 1.0}],
 list)
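The values in this output can be checked by hand. The sketch below recomputes them in plain Python, assuming that `tft.mean` subtracts the dataset-wide mean, that `tft.scale_to_0_1` applies `(y - min) / (max - min)`, and that `tft.compute_and_apply_vocabulary` assigns indices in order of descending frequency. It is an emulation for this three-row dataset, not a substitute for the library, and the variable `expected` is introduced here only for illustration.

```python
from collections import Counter

raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'},
]

xs = [float(row['x']) for row in raw_data]
ys = [float(row['y']) for row in raw_data]
ss = [row['s'] for row in raw_data]

# "Analyzer" results: each is a single value computed over the whole dataset.
x_mean = sum(xs) / len(xs)
y_min, y_max = min(ys), max(ys)
# Most frequent token gets index 0 (no frequency ties occur in this dataset).
vocab = {token: index
         for index, (token, _) in enumerate(Counter(ss).most_common())}

# Per-row transformation using the constants computed above.
expected = []
for x, y, s in zip(xs, ys, ss):
    x_centered = x - x_mean
    y_normalized = (y - y_min) / (y_max - y_min)
    expected.append({
        's_integerized': vocab[s],
        'x_centered': x_centered,
        'x_centered_times_y_normalized': x_centered * y_normalized,
        'y_normalized': y_normalized,
    })

for row in expected:
    print(row)
```

Here `'hello'` occurs twice and `'world'` once, which is why `'hello'` maps to 0 and `'world'` to 1.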