TensorFlow offers efficient APIs for preprocessing data.

In [43]:
import tensorflow as tf
import numpy as np

## All preprocessing layers
`['CategoryCrossing',
 'CategoryEncoding',
 'CenterCrop',
 'Discretization',
 'Hashing',
 'IntegerLookup',
 'Normalization',
 'PreprocessingLayer',
 'RandomContrast',
 'RandomCrop',
 'RandomFlip',
 'RandomHeight',
 'RandomRotation',
 'RandomTranslation',
 'RandomWidth',
 'RandomZoom',
 'Rescaling',
 'Resizing',
 'StringLookup',
 'TextVectorization']`

All of them are Keras layers:

In [11]:
issubclass(tf.keras.layers.experimental.preprocessing.TextVectorization, tf.keras.layers.Layer)

True

### NLP preprocessing

#### TextVectorization

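Converts text into sequences of integers: each token is replaced by its index in a vocabulary that the layer learns through `adapt`.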

In [51]:
text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(standardize=None)

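The `standardize` argument accepts `None`, `lower_and_strip_punctuation`, or a `Callable`. With `standardize=None` the text is left as-is, so casing and punctuation are preserved.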
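To make the effect of standardization concrete, here is a minimal pure-Python sketch of what a "lowercase and strip punctuation" step does to a string. It is only an approximation: the helper name `lower_and_strip_punctuation_sketch` is hypothetical, and using `string.punctuation` as the set of characters to remove is an assumption that may not match the layer's own rule exactly. A real `Callable` passed to the layer would most likely need to operate on string tensors rather than Python strings.

```python
import string

# Translation table that deletes ASCII punctuation (assumed character set).
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def lower_and_strip_punctuation_sketch(text: str) -> str:
    """Lowercase the text, then remove ASCII punctuation characters."""
    return text.lower().translate(_PUNCTUATION_TABLE)


print(lower_and_strip_punctuation_sketch("I am going to the gym, today!"))

# After standardization, differently cased inputs become identical strings.
print(
    lower_and_strip_punctuation_sketch("i Am going")
    == lower_and_strip_punctuation_sketch("I am going")
)
```

With a step like this applied, `I` and `i` would share one vocabulary entry instead of getting separate indices.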

In [52]:
text_vectorizer.adapt(["I am going to the gym today", "They had a bady", "i Am going"])

In [53]:
text_vectorizer(["I am going"])

<tf.Tensor: shape=(1, 3), dtype=int64, numpy=array([[13, 10,  2]])>

In [54]:
text_vectorizer(["i Am going"])

<tf.Tensor: shape=(1, 3), dtype=int64, numpy=array([[ 6, 14,  2]])>

In [55]:
text_vectorizer(["She am going"])

<tf.Tensor: shape=(1, 3), dtype=int64, numpy=array([[ 1, 10,  2]])>

In [56]:
text_vectorizer.get_vocabulary()

['',
 '[UNK]',
 'going',
 'today',
 'to',
 'the',
 'i',
 'had',
 'gym',
 'bady',
 'am',
 'a',
 'They',
 'I',
 'Am']
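The outputs above can be reproduced by hand from the vocabulary, which helps to see what the layer is doing. The sketch below is a plain-Python illustration, not the layer's implementation: it copies the vocabulary printed by `get_vocabulary()`, splits on whitespace, and maps unknown tokens to the `[UNK]` index. The `vectorize` helper is a hypothetical name.

```python
# Vocabulary copied from the get_vocabulary() output shown above.
vocabulary = [
    '', '[UNK]', 'going', 'today', 'to', 'the', 'i', 'had',
    'gym', 'bady', 'am', 'a', 'They', 'I', 'Am',
]

# Each token's integer id is simply its position in the vocabulary.
token_to_index = {token: index for index, token in enumerate(vocabulary)}
OOV_INDEX = token_to_index['[UNK]']


def vectorize(sentence: str) -> list:
    """Split on whitespace and look up each token, falling back to [UNK]."""
    return [token_to_index.get(token, OOV_INDEX) for token in sentence.split()]


print(vectorize("I am going"))    # [13, 10, 2]
print(vectorize("i Am going"))    # [6, 14, 2]
print(vectorize("She am going"))  # [1, 10, 2]
```

`She` never appeared in the adapted data, so it maps to index 1, the `[UNK]` slot. Because no standardization was applied, `I`/`i` and `Am`/`am` are separate vocabulary entries.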

### Normalization

This layer normalizes its input feature-wise. You can either pass data to `adapt`, which computes the mean and variance from it, or supply the mean and variance yourself as arguments.

In [57]:
normalization = tf.keras.layers.experimental.preprocessing.Normalization()

In [58]:
normalization.adapt(np.array([[1., -1], [2., 0], [3., 1]], np.float32))

In [59]:
normalization(np.array([[1., -1], [2., 0], [3., 1]], np.float32))

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-1.2247448, -1.2247448],
       [ 0.       ,  0.       ],
       [ 1.2247448,  1.2247448]], dtype=float32)>
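The result above can be checked by hand: for each feature (column), subtract the mean and divide by the standard deviation. The following is a minimal standard-library sketch of that arithmetic, not the layer's actual implementation. It assumes the population variance is used, which is consistent with the values printed above.

```python
import math
from statistics import fmean, pvariance

data = [[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]]

# Transpose so that each tuple holds all values of one feature.
columns = list(zip(*data))
means = [fmean(column) for column in columns]
stds = [math.sqrt(pvariance(column)) for column in columns]

# Normalize every value with the statistics of its own feature.
normalized = [
    [(value - mean) / std for value, mean, std in zip(row, means, stds)]
    for row in data
]

for row in normalized:
    print(row)
```

Both features have a variance of 2/3, so the first and last rows come out at roughly -1.2247 and 1.2247, and the middle row is exactly zero.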

Other important layers include:

`StringLookup` - Maps strings from a vocabulary to integer indices. This is similar to `TextVectorization`, except that each input string is looked up as a whole rather than being split into tokens.

In [None]:
TensorFlow also offers a number of image preprocessing and image augmentation layers. The latest ones are available in the T