In TensorFlow 2.x, many public datasets are officially registered. By default, six categories of datasets are available to TensorFlow, each containing several well-known datasets:
* `audio`
* `image`
* `structured`
* `text`
* `translate`
* `video`

For details on the datasets in each category, see https://www.tensorflow.org/datasets/datasets, or call `tfds.list_builders()` to list the available dataset builders.

The built-in public datasets are convenient for developing or tuning an ML model. You can also package your own data as a TensorFlow dataset (TFDS), and you can ask the TensorFlow development team to release a new dataset to the public.

If you would like to release your own dataset, see https://www.tensorflow.org/datasets/add_dataset.

In [0]:
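!pip install tensorflow-gpu==2.0.0
!pip install tensorflow_hub
# tensorflow_datasets is imported below; install it if your environment does not already provide it
!pip install tensorflow_datasets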

## Tensorflow Datasets (tfds)

All datasets share the same data structure, so they can be loaded and used through the same API.



In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import numpy as np

In [0]:
print("Tensorflow Version: {}".format(tf.__version__))

Tensorflow Version: 2.0.0


### List Tensorflow dataset builders.

In [0]:
# would output 
# ['abstract_reasoning',
#  'bair_robot_pushing_small',
#  'caltech101',
#  'cats_vs_dogs',
#  'celeb_a',
#  'celeb_a_hq',
#  'chexpert',
# ...
# ]
available_datasets = tfds.list_builders()

In [0]:
print("Number of datasets {} and top 3 are: {}.".format(
    len(available_datasets), available_datasets[:3]))

Number of datasets 132 and top 3 are: ['abstract_reasoning', 'aeslc', 'aflw2k3d'].


### Load the dataset via builder.

In [0]:
builder_name = "imdb_reviews"
assert builder_name in available_datasets, "Builder is not valid."

imdbr_builder = tfds.builder(builder_name)

### Prepare the dataset.

Depending on the dataset you are going to use, you may need to download it first.

In most cases, `download_and_prepare()` takes care of downloading and preparing the data.

In [0]:
imdbr_builder.download_and_prepare()

After downloading, the dataset is stored by default under `$HOME/tensorflow_datasets/<dataset_name>`.
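As a small sketch of where to look on disk, the path above can be built with the standard library. The helper name `default_dataset_dir` is hypothetical and not part of TFDS. It only assumes the default location stated above and does not check any custom data directory you may have configured.

```python
from pathlib import Path


def default_dataset_dir(dataset_name):
    """Return the default TFDS location for a dataset (hypothetical helper)."""
    # Assumption: the default root is $HOME/tensorflow_datasets, as stated above.
    return Path.home() / "tensorflow_datasets" / dataset_name


imdb_dir = default_dataset_dir("imdb_reviews")
# The directory only exists after download_and_prepare() has run.
print(imdb_dir, "exists:", imdb_dir.exists())
```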

Now you can load the dataset. The `split` parameter selects which split of the dataset is returned. The splits defined in `tfds.Split` are `ALL`, `TEST`, `TRAIN` and `VALIDATION`. If you leave `split` at its default, all available splits are returned.

In [0]:
for name in dir(tfds.Split):
  if name[0] != "_": print(name)

ALL
TEST
TRAIN
VALIDATION


In [0]:
imdbr_ds = tfds.load("imdb_reviews", split=tfds.Split.TRAIN)
assert isinstance(imdbr_ds, tf.data.Dataset)

### Quick view of the dataset.

Printing the loaded dataset shows two properties, `shapes` and `types`. `shapes` maps each feature name to its shape, and `types` maps each feature name to its data type, such as `string` or `int64`.

In [0]:
imdbr_ds

<_OptionsDataset shapes: {label: (), text: ()}, types: {label: tf.int64, text: tf.string}>

You can access the data by using `take()`.

In [0]:
imdbr_eg, = imdbr_ds.take(count=1)
text, label = imdbr_eg["text"], imdbr_eg["label"]
print(text)
print(label)

tf.Tensor(b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's 
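The printed text is a byte string, which is why `clich\xc3\xa9` appears instead of `cliché`. The following sketch, which uses only plain Python on a fragment copied from the output above, shows how such bytes decode as UTF-8. The `<br />` cleanup is an illustrative assumption and not something the notebook itself does.

```python
# A fragment of the review above, as raw bytes (what `.numpy()` returns for a string tensor).
raw = b"It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />"

# Decode the UTF-8 bytes into a Python string.
decoded = raw.decode("utf-8")

# Optional cleanup (assumption): replace the HTML line breaks left in the review text.
cleaned = decoded.replace("<br />", " ").strip()

print(decoded)
print(cleaned)
```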

You can also take several examples at once and iterate over them.

In [0]:
imdbr_batch_eg = imdbr_ds.take(count=10)

# iterate each data point
for item in imdbr_batch_eg:
  print(item["text"], item["label"])

tf.Tensor(b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's 

### Dataset Information

The `info` attribute of the dataset builder holds the dataset information.

In [0]:
print(imdbr_builder.info)

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=0.1.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word

# TFDS as an Input Pipeline

## Batch Data

For training, a TFDS dataset (a `tf.data.Dataset`) gives you a convenient way to feed data into the model. The following is a simple example.

`repeat()` restarts from the first element once the last one has been reached, so the dataset never runs out.

`shuffle(buffer_size)` draws elements at random from a buffer that holds `buffer_size` elements.

`batch(count)` groups `count` consecutive elements into one batch.

`prefetch(count)` prepares up to `count` elements ahead of time while the current ones are being consumed.

In [0]:
imdbr_train = imdbr_ds.repeat().prefetch(tf.data.experimental.AUTOTUNE).shuffle(128).batch(32)
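To make the semantics of the transformations described above concrete, here is a plain-Python sketch of `repeat`, `shuffle` and `batch` built on generators. It is an illustration under simplifying assumptions, not TensorFlow's implementation, and the function names merely mirror the `tf.data` methods. `prefetch` is left out because it changes only when elements are produced, not which elements are produced.

```python
import itertools
import random


def repeat(items):
    """Yield the items endlessly, restarting from the first after the last."""
    return itertools.cycle(items)


def shuffle(stream, buffer_size, seed=0):
    """Emit random elements from a buffer of at most `buffer_size` elements."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    stream = iter(stream)
    buffer = list(itertools.islice(stream, buffer_size))
    for item in stream:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]      # emit a random buffered element
        buffer[idx] = item     # refill its slot with the next element
    rng.shuffle(buffer)        # stream exhausted: drain what is left
    yield from buffer


def batch(stream, count):
    """Group `count` consecutive elements into lists."""
    stream = iter(stream)
    while True:
        chunk = list(itertools.islice(stream, count))
        if not chunk:
            return
        yield chunk


data = list(range(10))
pipeline = batch(shuffle(repeat(data), buffer_size=4), count=3)
# The repeated stream is infinite, so take only the first five batches.
first_batches = list(itertools.islice(pipeline, 5))
print(first_batches)
```

With a buffer of 4, the very first element emitted can only be one of the first four inputs, which shows why a small `buffer_size` shuffles only locally.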

You can chain these transformations in a different order, or leave some of them out, to meet your requirements.

In [0]:
imdbr_train_2 = imdbr_ds.repeat().prefetch(tf.data.experimental.AUTOTUNE).batch(32)

In [0]:
imdbr_train

<BatchDataset shapes: {label: (None,), text: (None,)}, types: {label: tf.int64, text: tf.string}>

In [0]:
for batch in imdbr_train:
  # in each batch, there are two attributes: text and label
  print("There are {} attributes.".format(len(batch)))
  print()
  print("Use `.items()` to fetch the data arrays.")
  print(batch.items())
  print()
  print("Use `key` to fetch data, e.g. batch['text'].")
  print(batch["text"])
  break

There are 2 attributes.

Use `.items()` to fetch the data arrays.
dict_items([('label', <tf.Tensor: id=329, shape=(8,), dtype=int64, numpy=array([0, 1, 1, 1, 0, 1, 0, 1])>), ('text', <tf.Tensor: id=330, shape=(8,), dtype=string, numpy=
array([b'There is no way to avoid a comparison between The Cat in the Hat and The Grinch Who Stole Christmas, so let\'s get that part out of the way. First of all, let me start by saying that I think Grinch was an underrated and unappreciated film. Cat was... well, just awful.<br /><br />Jim Carey was cast because he is a brilliant physical comedian, and fearlessly commits to over the top, outrageous characters. Mike Myers fell back on his old bag of tricks.<br /><br />Why, why, why Mike Myers?? The kids could care less, and the Austin Powers demographic isn\'t going to spy this film. So, what was the studio thinking?<br /><br />The Cat was also apparently related to Linda Richmond. Can we talk? Why a New York Accent? Not entirely consistent with anything Dr. Seu

## Processing / Mapping Batch Data for Training

During training, the data must be fetched and preprocessed automatically. In TensorFlow 2.x, the `.map()` function handles this by applying a function to every element of the dataset.

In [0]:
def parse_fn(batch):
  # each batch is a dict, see `BatchDataset shapes: {label: (None,), text: (None,)}`;
  # return it as an (inputs, labels) tuple
  return batch["text"], batch["label"]

In [0]:
imdbr_ds_map = imdbr_train.map(parse_fn)
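As a plain-Python analogue of what `.map(parse_fn)` does, the sketch below applies the same function to a list of dictionary batches with the built-in `map`. The sample batches are made up for illustration, and real batches hold tensors rather than lists.

```python
def parse_fn(batch):
    # Turn a {"text": ..., "label": ...} dict into an (inputs, labels) tuple.
    return batch["text"], batch["label"]


# Hypothetical stand-ins for two batches of the dataset.
batches = [
    {"text": ["good film", "dull plot"], "label": [1, 0]},
    {"text": ["loved it"], "label": [1]},
]

# Apply parse_fn to every batch, as Dataset.map does element by element.
mapped = list(map(parse_fn, batches))
for texts, labels in mapped:
    print(texts, labels)
```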

# Simple Training Example

Here we demonstrate how to use the mapping function in the training process. In the following, we combine a pretrained model from TensorFlow Hub (a well-known repository of pretrained models hosted by Google) with custom layers to solve a simple binary classification problem. Each example in the IMDB dataset consists of two attributes, a review text and a binary label. The label reflects the review score: a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10.
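The scoring rule can be written down as a short function. This is only a sketch of the rule stated above: the name `score_to_label` is hypothetical, and encoding negative as `0` and positive as `1` is an assumption. Scores of 5 and 6 match neither definition, so the sketch returns `None` for them.

```python
def score_to_label(score):
    """Map a 1-10 review score to a binary label (hypothetical helper)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score <= 4:
        return 0      # negative review (assumed encoding)
    if score >= 7:
        return 1      # positive review (assumed encoding)
    return None       # neither negative nor positive under the rule above


for s in (2, 4, 5, 7, 10):
    print(s, score_to_label(s))
```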

## Build a Model

You can easily embed a SavedModel as a layer of a larger model. Pass `trainable=True` to `hub.KerasLayer()` if you want to retrain or fine-tune the SavedModel. Otherwise, the embedded SavedModel acts only as an inference layer.

In [0]:
model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1", 
                   output_shape=[128], input_shape=[], dtype=tf.string), 
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

In [0]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 124,644,769
Trainable params: 2,081
Non-trainable params: 124,642,688
_________________________________________________________________
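The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic, taking the embedding shape `(973771, 128)` from the first variable printed in the weights section at the end of this notebook. A dense layer has `inputs * units` kernel weights plus `units` biases.

```python
# Embedding table inside the hub layer: one 128-dim vector per row.
vocab_rows, embedding_dim = 973771, 128
embedding_params = vocab_rows * embedding_dim

# Dense(16) on a 128-dim input: kernel plus bias.
dense_params = embedding_dim * 16 + 16

# Dense(1) on a 16-dim input: kernel plus bias.
output_params = 16 * 1 + 1

trainable_params = dense_params + output_params
total_params = embedding_params + trainable_params

print(embedding_params, dense_params, output_params)
print(trainable_params, total_params)
```

Only the two dense layers are trainable here because the hub layer was created without `trainable=True`.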


After building the model, we compile it and start training with the `.fit()` function.

## Training

In [0]:
model.compile(optimizer="adam", 
              loss="mean_squared_error", 
              metrics=["accuracy"])
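The cell above trains with mean squared error. For a sigmoid output, binary cross-entropy is a common alternative, and the plain-Python sketch below compares the two losses on single predictions. It is an illustration only: the helper names are hypothetical, the clipping constant is an assumption, and the numbers do not come from this notebook's model.

```python
import math


def mse(y_true, p):
    """Squared error for one prediction."""
    return (y_true - p) ** 2


def bce(y_true, p, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# A confident correct prediction and a confident wrong one, both for a positive label.
for p in (0.95, 0.05):
    print("p={:.2f}  mse={:.4f}  bce={:.4f}".format(p, mse(1, p), bce(1, p)))
```

Squared error is bounded by 1 on a probability, while cross-entropy grows without bound as a wrong prediction becomes more confident.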

Because `repeat()` makes the dataset endless, `fit()` needs the `steps_per_epoch` argument to know where each epoch ends.

In [0]:
steps_per_epoch = 25000 // 32

model.fit(imdbr_ds_map, 
          epochs=10, 
          steps_per_epoch=steps_per_epoch)

Train for 781 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f645600cf28>
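The step count reported above follows from integer division of the training set size by the batch size. A short sketch of that arithmetic, using the 25,000 training examples and the batch size of 32 from this notebook:

```python
num_train_examples = 25000
batch_size = 32

# Whole batches that fit into one pass over the training data.
steps_per_epoch = num_train_examples // batch_size

# Examples actually consumed per epoch, and the remainder that does not fill a batch.
examples_per_epoch = steps_per_epoch * batch_size
leftover = num_train_examples - examples_per_epoch

print(steps_per_epoch, examples_per_epoch, leftover)
```

Because the dataset repeats, the leftover examples are not lost. They simply roll over into the next epoch.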

## Evaluate

In [0]:
imdbr_ds_TEST = tfds.load("imdb_reviews", split=tfds.Split.TEST)
imdbr_test = imdbr_ds_TEST.batch(1000)
imdbr_test_map = imdbr_test.map(parse_fn)

In [0]:
model.evaluate(imdbr_test_map)



[0.14263434290885926, 0.793]

## Predict

You can now type some sentences and let the model predict their scores.

In [0]:
imdbr_predict_data = np.array(["Hello world, Tensorflow 2.0!",
                               "Test",
                               "1,2,3", 
                               "Nice story, very good tech.", 
                               "It is bad in telling the whole story."])
model.predict(imdbr_predict_data)

array([[0.58443654],
       [0.5376988 ],
       [0.5217016 ],
       [0.9476365 ],
       [0.28862488]], dtype=float32)
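Each prediction is a sigmoid output between 0 and 1. A minimal sketch of turning the values printed above into class labels, assuming a decision threshold of 0.5 (the notebook itself does not fix a threshold):

```python
# Predictions copied from the output above, one row per input sentence.
predictions = [[0.58443654], [0.5376988], [0.5217016], [0.9476365], [0.28862488]]

threshold = 0.5  # assumption: values at or above the threshold count as positive

labels = [int(row[0] >= threshold) for row in predictions]
print(labels)
```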

You can also predict on examples taken from a TensorFlow dataset.

In [0]:
imdbr_predict_data_2, = imdbr_ds_TEST.take(1)
imdbr_predict_data_2_label = imdbr_predict_data_2["label"].numpy()
imdbr_predict_data_2 = np.array([imdbr_predict_data_2["text"].numpy().decode("UTF-8")])
imdbr_predict_data_2, imdbr_predict_data_2_label

(array(["I've watched the movie actually several times. And what i want to say about it is the only thing that made this movie high rank was the Burak Altay's incredible performance, absolutely nothing but that. Not even those silly model named Deniz Akkaya and some of these popular names at times in the movie... Burak is definitely very talented i've seen a few jobs he made and been through. Even though this is kind of horror movie, he's doing really good job in comedy movies and also in dramas too. I bet most of you all saw Asmali Konak the movie and TV series, those two would go for an example... All i'm gonna say is you better watch out for the new works coming out from Burak then you'll see.. Keep the good work bro, much love.."],
       dtype='<U733'), 1)

In [0]:
model.predict(imdbr_predict_data_2)

array([[0.7412185]], dtype=float32)

## Export a Model

It is simple to save the trained model. 

### keras h5 format

In [0]:
model.save("imdbr.h5", include_optimizer=True)

### SavedModel format

In [0]:
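# Note: this API is deprecated; the log below recommends `model.save(..., save_format="tf")` instead.
tf.keras.experimental.export_saved_model(model, "./imdbr")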

Instructions for updating:
Please use `model.save(..., save_format="tf")` or `tf.keras.models.save_model(..., save_format="tf")`.

Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.

INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: None
INFO:tensorflow:Signatures INCLUDED in export for Train: ['train']
INFO:tensorflow:Signatures INCLUDED in export for Eval: None

'list' object has no attribute 'name'

INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: None
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: ['eval']

'list' object has no attribute 'name'

INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures INCLUDED in export for Eval: None

'list' object has no attribute 'name'

INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./imdbr/saved_model.pb

## View the weights

It is simple to convert the model weights into different formats. You can access the `.weights` attribute to fetch all weights.

In [0]:
model.weights

[<tf.Variable 'Variable:0' shape=(973771, 128) dtype=float32, numpy=
 array([[ 8.16051960e-01,  3.51142138e-02, -3.23012704e-03, ...,
         -2.68346369e-02, -4.95113060e-02,  1.63635537e-02],
        [ 6.26858950e-01,  1.22697828e-02, -5.47063090e-02, ...,
          7.37033179e-03, -7.09955022e-02, -7.38640875e-02],
        [-1.19781224e-02, -4.65284288e-02, -4.78441420e-04, ...,
          1.27587229e-01,  1.21942766e-01, -3.41316722e-02],
        ...,
        [-1.96687713e-01,  1.08706986e-03, -8.78152475e-02, ...,
          1.77201703e-01, -1.04697749e-01,  4.37770933e-02],
        [-2.06271142e-01, -2.30716169e-02, -7.13550597e-02, ...,
         -3.06905657e-02, -1.14739329e-01, -5.64496219e-02],
        [-1.38024807e-01, -1.27754761e-02, -6.36914894e-02, ...,
          1.03301600e-01,  6.30483963e-04, -4.08159662e-03]], dtype=float32)>,
 <tf.Variable 'dense/kernel:0' shape=(128, 16) dtype=float32, numpy=
 array([[ 0.15137854,  0.04307334, -0.07645128, ..., -0.0343016 ,
         