In [0]:
!pip install tensorflow-gpu==2.0.0alpha
!pip install tensorflow_hub
!pip install tensorflow-datasets

TensorFlow 2.x comes with a growing collection of registered public datasets. They are grouped into six categories:
* `audio` 
* `image`  
* `structured`
* `text`
* `translate`
* `video`

For details on the datasets in each category, see https://www.tensorflow.org/datasets/datasets, or call `tfds.list_builders()` to list the available dataset builders.

The public datasets are convenient for developing or tuning an ML model. You can also package your own data as a TensorFlow Dataset (TFDS), and you can ask the TensorFlow team to release a new dataset to the public.

If you would like to release your own dataset, see https://www.tensorflow.org/datasets/add_dataset.

## Tensorflow Datasets (tfds)

Every dataset is exposed through the same data structure, so different datasets can be loaded and used through the same API.



In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import numpy as np

In [0]:
print("Tensorflow Version: {}".format(tf.__version__))

Tensorflow Version: 2.0.0-alpha0


### List Tensorflow dataset builders.

In [0]:
# would output 
# ['abstract_reasoning',
#  'bair_robot_pushing_small',
#  'caltech101',
#  'cats_vs_dogs',
#  'celeb_a',
#  'celeb_a_hq',
#  'chexpert',
# ...
# ]
available_datasets = tfds.list_builders()

In [0]:
print("Number of datasets {} and top 3 are: {}.".format(
    len(available_datasets), available_datasets[:3]))

Number of datasets 70 and top 3 are: ['abstract_reasoning', 'bair_robot_pushing_small', 'caltech101'].


### Load the dataset via builder.

In [0]:
builder_name = "imdb_reviews"
assert builder_name in available_datasets, "Builder is not valid."

imdbr_builder = tfds.builder(builder_name)

### Prepare the dataset.

Depending on the dataset you use, you may need to download it first.

In most cases, calling `download_and_prepare()` on the builder is enough to prepare the dataset.

In [0]:
imdbr_builder.download_and_prepare()

After the download, the dataset is stored by default under `$HOME/tensorflow_datasets/<dataset_name>`.

Now you can load the dataset. The `split` parameter selects which split is returned. `tfds.Split` defines four names: `ALL`, `TEST`, `TRAIN` and `VALIDATION`. If you leave `split` at its default, every available split is returned (as a dictionary keyed by split name).

In [0]:
for name in dir(tfds.Split):
  if name[0] != "_": print(name)

ALL
TEST
TRAIN
VALIDATION


In [0]:
imdbr_ds = tfds.load("imdb_reviews", split=tfds.Split.TRAIN)
assert isinstance(imdbr_ds, tf.data.Dataset)

### Quick view to the dataset.

The representation of a loaded dataset shows two attributes, `shapes` and `types`. `shapes` lists each feature name together with its shape. `types` lists the data type of each feature, such as `string` or `int64`.

In [0]:
imdbr_ds

TensorShape([])

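You can access the data by using `take()`. **Note that `take(count=1)` returns the first `count` elements of the dataset; it does not sample at random.** The order of those elements may still differ between runs if the underlying files are shuffled when the dataset is loaded.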

In [0]:
imdbr_eg, = imdbr_ds.take(count=1)
text, label = imdbr_eg["text"], imdbr_eg["label"]
print(text)
print(label)

tf.Tensor(b"This is the most depressing film I have ever seen. I first saw it as a child and even thinking about it now really upsets me. I know it was set in a time when life was hard and I know these people were poor and the crops were vital. Yes, I get all that. What I find hard to take is I can't remember one single light moment in the entire film. Maybe it was true to life, I don't know. I'm quite sure the acting was top notch and the direction and quality of filming etc etc was wonderful and I know that every film can't have a happy ending but as a family film it is dire in my opinion.<br /><br />I wouldn't recommend it to anyone who wants to be entertained by a film. I can't stress enough how this film affected me as a child. I was talking about it recently and all the sad memories came flooding back. I think it would have all but the heartless reaching for the Prozac.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)


You can also take several examples at once and iterate over them.

In [0]:
imdbr_batch_eg = imdbr_ds.take(count=10)

# iterate each data point
for item in imdbr_batch_eg:
  print(item["text"], item["label"])

tf.Tensor(b"This is the most depressing film I have ever seen. I first saw it as a child and even thinking about it now really upsets me. I know it was set in a time when life was hard and I know these people were poor and the crops were vital. Yes, I get all that. What I find hard to take is I can't remember one single light moment in the entire film. Maybe it was true to life, I don't know. I'm quite sure the acting was top notch and the direction and quality of filming etc etc was wonderful and I know that every film can't have a happy ending but as a family film it is dire in my opinion.<br /><br />I wouldn't recommend it to anyone who wants to be entertained by a film. I can't stress enough how this film affected me as a child. I was talking about it recently and all the sad memories came flooding back. I think it would have all but the heartless reaching for the Prozac.", shape=(), dtype=string) tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(b"I saw this film on True Movies (which

### Dataset Information

You can view the dataset information through the `info` attribute of the dataset builder.

In [0]:
print(imdbr_builder.info)

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=0.1.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    urls=['http://ai.stanford.edu/~amaas/data/sentiment/'],
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string, encoder=None)
    },
    total_num_examples=100000,
    splits={
        'test': <tfds.core.SplitInfo num_examples=25000>,
        'train': <tfds.core.SplitInfo num_examples=25000>,
        'unsupervised': <tfds.core.SplitInfo num_examples=50000>
    },
    supervised_keys=('text', 'label'),
    citation='"""
        @InProceedings{maas-EtAl:2011:ACL-HLT2011,
          author    = {Maas, Andrew L.  and  Daly, 

# TFDS as an Input Pipeline

## Batch Data

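A dataset loaded through tfds is a `tf.data.Dataset`, which gives you a convenient way to feed data into the model during training. The following is a simple example.

`repeat()` restarts from the first element once the last one has been reached, so the dataset never runs out.

`shuffle(buffer_size)` keeps a buffer of `buffer_size` elements and draws the next element at random from that buffer. A larger buffer gives a more thorough shuffle.

`batch(batch_size)` groups `batch_size` consecutive elements into one batch.

`prefetch(buffer_size)` prepares up to `buffer_size` elements ahead of time while the current one is being consumed.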

In [0]:
imdbr_train = imdbr_ds.repeat().shuffle(64).batch(8).prefetch(tf.data.experimental.AUTOTUNE)
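
To make the role of `buffer_size` more concrete, here is a minimal pure-Python sketch of buffer-based shuffling. It is an illustration of the general idea, not TensorFlow's implementation, and the helper name `buffered_shuffle` is hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in buffer-shuffled order (illustrative sketch only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random element from the buffer and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=8))
print(shuffled[:10])
```

In this sketch the k-th element emitted can only come from the first `k + buffer_size` inputs, so a small buffer (such as 64 for 25,000 reviews) only mixes examples locally, and `buffer_size=1` leaves the order unchanged.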

You can chain these operations in a different order, or leave some out, to suit your needs.

In [0]:
imdbr_train_2 = imdbr_ds.repeat().batch(32).prefetch(tf.data.experimental.AUTOTUNE)
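
The order of the operations matters. As a rough pure-Python analogy (using `itertools`, not `tf.data`; the helper `batched` is hypothetical), the sketch below contrasts repeating before batching with batching before repeating.

```python
from itertools import cycle, islice


def batched(iterable, batch_size):
    """Group consecutive elements into lists of at most batch_size."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


data = [0, 1, 2, 3, 4]

# Analogous to repeat().batch(2): batches may straddle the end of one pass.
repeat_then_batch = list(islice(batched(cycle(data), 2), 5))
print(repeat_then_batch)  # [[0, 1], [2, 3], [4, 0], [1, 2], [3, 4]]

# Analogous to batch(2).repeat(): each pass ends with a smaller final batch.
batch_then_repeat = list(islice(cycle(list(batched(data, 2))), 5))
print(batch_then_repeat)  # [[0, 1], [2, 3], [4], [0, 1], [2, 3]]
```

With 25,000 training reviews and a batch size of 32, which does not divide 25,000 evenly, repeating before batching means some batches mix the end of one pass with the start of the next.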

In [0]:
imdbr_train

<PrefetchDataset shapes: {text: (None,), label: (None,)}, types: {text: tf.string, label: tf.int64}>

In [0]:
for batch in imdbr_train:
  # each batch has two attributes: text and label
  print("There are {} attributes.".format(len(batch)))
  print()
  print("Use `.items()` to fetch the data arrays.")
  print(batch.items())
  print()
  print("Use `key` to fetch data, e.g. batch['text'].")
  print(batch["text"])
  break

There are 2 attributes.

Use `.items()` to fetch the data arrays.
dict_items([('text', <tf.Tensor: id=375, shape=(8,), dtype=string, numpy=
array([b'A very close and sharp discription of the bubbling and dynamic emotional world of specialy one 18year old guy, that makes his first experiences in his gay love to an other boy, during an vacation with a part of his family.<br /><br />I liked this film because of his extremly clear and surrogated storytelling , with all this "Sound-close-ups" and quiet moments wich had been full of intensive moods.<br /><br />',
       b"Another case of a decent DVD case betraying the shot-on-video quality of the film. <br /><br />It wasn't that bad. Rochon does a serviceable job and Damn! the cast is good looking. I've never seen that many musclebound guys hang out together on a regular basis. This movie really wanted to make you think Rochon was the killer, but it was not to be. My biggest problem with the film was that by the end, I didn't much care who was the k

## Processing / Mapping Batch Data for Training

During training, the data must be fetched and preprocessed automatically as the trainer consumes it. In TensorFlow 2.x, the dataset's `.map()` function does this by applying a function to every element.

In [0]:
def parse_fn(batch):
  # refer to `PrefetchDataset shapes: {text: (None,), label: (None,)}`
  return batch["text"], batch["label"]

In [0]:
imdbr_ds_map = imdbr_train.map(parse_fn)
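
The mapping step turns each dictionary batch into a `(features, labels)` tuple, which is the form Keras `fit()` expects from a dataset. As a small sketch of that transformation, the example below applies the same function to plain Python dictionaries. The `toy_batches` data is made up for illustration, and real batches hold `tf.Tensor` values rather than lists.

```python
def parse_fn(batch):
    # Same logic as the notebook cell: dict in, (features, labels) out.
    return batch["text"], batch["label"]


# Hypothetical stand-in batches; not taken from the IMDB dataset.
toy_batches = [
    {"text": ["good film", "dull plot"], "label": [1, 0]},
    {"text": ["loved it", "too long"], "label": [1, 0]},
]

pairs = [parse_fn(b) for b in toy_batches]
features, labels = pairs[0]
print(features)
print(labels)
```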

# Simple Training Example

Here we demonstrate how to use the mapped dataset in a training process. We combine a pretrained model from TensorFlow Hub (a repository of pretrained models hosted by Google) with a few custom layers to solve a simple binary classification problem. The IMDB dataset consists mainly of two attributes, a review text and a binary flag that reflects the review score. A negative review (flag 0) has a score <= 4 out of 10, and a positive review (flag 1) has a score >= 7 out of 10.

## Build a Model

In [0]:
model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1", 
                   output_shape=[128], input_shape=[], dtype=tf.string), 
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

In [0]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 124,644,769
Trainable params: 2,081
Non-trainable params: 124,642,688
_________________________________________________________________


After building the model, we compile it and start training with the `.fit()` function.

## Training

In [0]:
model.compile(optimizer="adam", 
              loss="mean_squared_error", 
              metrics=["accuracy"])
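
The cell above trains with `mean_squared_error`. For a sigmoid output on a binary label, binary cross-entropy is a common alternative. The pure-Python sketch below compares the two per-example losses to show how they differ. It only illustrates the formulas and is not what the notebook ran; the helper names and the `eps` clipping value are assumptions made for this example.

```python
import math


def squared_error(y_true, y_pred):
    """Per-example squared error."""
    return (y_true - y_pred) ** 2


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Per-example binary cross-entropy, clipped to avoid log(0)."""
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# A confidently wrong prediction: true label 1, predicted probability 0.01.
print(squared_error(1, 0.01))
print(binary_cross_entropy(1, 0.01))
```

Squared error is bounded by 1 for probabilities in [0, 1], whereas cross-entropy grows without bound as a prediction becomes confidently wrong, so it penalizes such mistakes much more heavily.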

Because `repeat()` makes the dataset infinite, `fit()` needs the `steps_per_epoch` argument to know where each epoch ends.
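
The number of steps is simple arithmetic on the split size and the batch size. The sketch below uses a hypothetical helper (its name and its `drop_remainder` flag are not part of the notebook) with the 25,000 training examples reported in the dataset information.

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_remainder=True):
    """Number of batches needed to cover num_examples once."""
    if drop_remainder:
        # Ignore a final partial batch.
        return num_examples // batch_size
    # Count a final partial batch as one extra step.
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(25000, 8))                         # 3125
print(steps_per_epoch(25000, 32))                        # 781
print(steps_per_epoch(25000, 32, drop_remainder=False))  # 782
```

A batch size of 8 divides 25,000 evenly, so both conventions give 3125 steps. With a batch size of 32 they differ by one step.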

In [0]:
steps_per_epoch = 25000 // 8

model.fit(imdbr_ds_map, 
          epochs=5, 
          steps_per_epoch=steps_per_epoch)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f51c57fbf60>

## Evaluate

In [0]:
imdbr_ds_TEST = tfds.load("imdb_reviews", split=tfds.Split.TEST)
imdbr_test = imdbr_ds_TEST.batch(1000)
imdbr_test_map = imdbr_test.map(parse_fn)

In [0]:
model.evaluate(imdbr_test_map)

     25/Unknown - 5s 193ms/step - loss: 0.1442 - accuracy: 0.7910

[0.14415757179260255, 0.791]

## Predict

You can now type a few sentences and predict their scores. The model outputs a value between 0 and 1; values closer to 1 indicate a positive review.

In [0]:
imdbr_predict_data = np.array(["Hello world, Tensorflow 2.0!",
                               "Test",
                               "1,2,3", 
                               "Nice story, very good tech.", 
                               "It is bad in telling the whole story."])
model.predict(imdbr_predict_data)

array([[0.54320854],
       [0.55411524],
       [0.41002184],
       [0.8421466 ],
       [0.28753376]], dtype=float32)
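
To turn these scores into class labels, one common choice is to threshold them. The sketch below applies a threshold of 0.5 (an assumption, not something the notebook specifies) to the scores printed above.

```python
# Scores copied from the prediction output above.
scores = [0.54320854, 0.55411524, 0.41002184, 0.8421466, 0.28753376]

THRESHOLD = 0.5  # assumed cut-off between negative (0) and positive (1)

labels = [1 if s >= THRESHOLD else 0 for s in scores]
print(labels)  # [1, 1, 0, 1, 0]
```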

You can also run prediction on examples taken from a TensorFlow dataset.

In [0]:
imdbr_predict_data_2, = imdbr_ds_TEST.take(1)
imdbr_predict_data_2 = np.array([imdbr_predict_data_2["text"].numpy().decode("UTF-8")])
imdbr_predict_data_2

array(["I've watched the movie actually several times. And what i want to say about it is the only thing that made this movie high rank was the Burak Altay's incredible performance, absolutely nothing but that. Not even those silly model named Deniz Akkaya and some of these popular names at times in the movie... Burak is definitely very talented i've seen a few jobs he made and been through. Even though this is kind of horror movie, he's doing really good job in comedy movies and also in dramas too. I bet most of you all saw Asmali Konak the movie and TV series, those two would go for an example... All i'm gonna say is you better watch out for the new works coming out from Burak then you'll see.. Keep the good work bro, much love.."],
      dtype='<U733')

In [0]:
model.predict(imdbr_predict_data_2)

array([[0.71404517]], dtype=float32)

## Export a Model

It is simple to save the trained model. 

In [0]:
model.save("imdbr.h5", include_optimizer=True)

## View the weights

It is simple to transform the model weights into different formats. You can access the `.weights` attribute to fetch all of them.

In [0]:
model.weights

[<tf.Variable 'dense/kernel:0' shape=(128, 16) dtype=float32, numpy=
 array([[ 0.05103301, -0.09987675, -0.02509112, ..., -0.07569588,
         -0.03953914,  0.15262881],
        [ 0.10278282,  0.21286143,  0.08690263, ..., -0.20403317,
          0.21032059,  0.12193388],
        [-0.1462182 , -0.09899718, -0.3288752 , ...,  0.04722484,
         -0.18032527,  0.3393469 ],
        ...,
        [ 0.02443554, -0.19251642, -0.18985495, ...,  0.07229494,
         -0.3040509 ,  0.21191847],
        [ 0.08839218,  0.15343545,  0.13075665, ...,  0.11757858,
          0.20322913, -0.21505691],
        [ 0.05775356, -0.01526705,  0.02360127, ...,  0.1987856 ,
         -0.06532481,  0.0211694 ]], dtype=float32)>,
 <tf.Variable 'dense/bias:0' shape=(16,) dtype=float32, numpy=
 array([-0.05715915,  0.02144725,  0.08002391,  0.0697367 ,  0.11318063,
         0.05424993,  0.0753491 ,  0.15644073,  0.07161279,  0.08346546,
         0.06093833,  0.06376005,  0.07308774,  0.1033849 ,  0.04979755,
      