In [0]:
!pip install tensorflow-gpu==2.0.0alpha tensorflow_datasets

In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [4]:
print("Tensorflow version: {}".format(tf.__version__))

Tensorflow version: 2.0.0-alpha0


# TensorFlow Datasets Reference

* https://www.tensorflow.org/datasets/overview

# Splits

All `DatasetBuilder`s expose various data subsets defined as `tfds.Split`s (e.g. `tfds.Split.TRAIN` or `tfds.Split.TEST`). A given dataset's splits are defined in `tfds.DatasetBuilder.info.splits`. They are accessible via `tfds.load` and `tfds.DatasetBuilder.as_dataset`, both of which take a `split=` parameter.

## Using splits

In [0]:
combined_split = tfds.Split.TRAIN + tfds.Split.TEST

In [0]:
combined_ds = tfds.load("mnist", split=combined_split)

You can also merge all splits and load the whole dataset.

In [0]:
combined_ds_all = tfds.load("mnist", split=tfds.Split.ALL)

Keep in mind that each of these `split=` options merges the selected splits and returns them as a single dataset, so you still have to partition that dataset yourself (for example into training and evaluation parts) before training.

In [12]:
combined_ds

<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

## Using subsplits

You can further split the dataset into customized splits via `tfds.Split.subsplit`. 

There are three ways to define subsplits:
* specify number of subsplits
* specify a percentage slice
* specify weights

There are several **warnings** to keep in mind when using `tfds.Split.subsplit`:
* TFDS does not currently guarantee the order of the data on disk. If you regenerate the data, the subsplits may no longer be the same.
* If the total number of examples is not divisible by 100, the remaining examples may not be evenly distributed among the subsplits.
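
The second warning can be illustrated with a small pure-Python model. This is only a sketch: it assumes example `i` is assigned to bucket `i % 100` and that each subsplit owns a contiguous range of buckets. That assignment rule and the helper name `subsplit_sizes` are assumptions made for illustration, not the TFDS implementation.

```python
def subsplit_sizes(num_examples, bucket_ranges, num_buckets=100):
    """Count examples per subsplit when example i lands in bucket i % num_buckets.

    bucket_ranges holds one (start, stop) bucket interval per subsplit.
    Illustrative model only; this is not TFDS code.
    """
    sizes = []
    for start, stop in bucket_ranges:
        # An example belongs to a subsplit if its bucket falls in [start, stop).
        count = sum(1 for i in range(num_examples) if start <= i % num_buckets < stop)
        sizes.append(count)
    return sizes


halves = [(0, 50), (50, 100)]
# 200 is a multiple of 100, so both halves get the same number of examples.
print(subsplit_sizes(200, halves))  # [100, 100]
# With 250 examples, the 50 leftover examples all land in buckets 0..49.
print(subsplit_sizes(250, halves))  # [150, 100]
```

Under this model, two nominally equal halves differ by 50 examples as soon as the dataset size is not a multiple of 100.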



### Specify number of subsplits

In [0]:
train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(2)

In [0]:
train_half_dataset = tfds.load("mnist", split=train_half_1)

### Specify a percentage slice

Here we use another API, `tfds.percent[<start percentage>:<end percentage>]`, to slice the dataset.

In [0]:
first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])
last_2_percent = tfds.Split.TRAIN.subsplit(tfds.percent[-2:])
middle_50_percent = tfds.Split.TRAIN.subsplit(tfds.percent[25:75])

In [0]:
middle_50_dataset = tfds.load("mnist", split=middle_50_percent)
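
The bracket syntax reads like ordinary Python slicing over the range 0 to 100. The sketch below shows how such slices, including negative bounds like `[-2:]`, could resolve to explicit percent bounds. The class `_Percent` and the helper `percent_bounds` are hypothetical stand-ins written for this illustration; they are not part of TFDS.

```python
class _Percent:
    """Hypothetical stand-in for `tfds.percent`: indexing it returns the slice."""

    def __getitem__(self, item):
        if not isinstance(item, slice):
            raise TypeError("percent must be indexed with a slice, e.g. percent[:10]")
        return item


percent = _Percent()


def percent_bounds(percent_slice):
    """Resolve a percent slice to explicit (start, stop) bounds within 0..100."""
    # slice.indices() applies the usual Python rules for None and negative bounds.
    start, stop, step = percent_slice.indices(100)
    if step != 1:
        raise ValueError("only contiguous percent slices are handled in this sketch")
    return start, stop


print(percent_bounds(percent[:10]))    # (0, 10)
print(percent_bounds(percent[-2:]))    # (98, 100)
print(percent_bounds(percent[25:75]))  # (25, 75)
```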

### Specify weights

In [0]:
half, quarter1, quarter2 = tfds.Split.TRAIN.subsplit([2, 1, 1])

In [0]:
half_dataset = tfds.load("mnist", split=half)
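
To make the weights concrete: `[2, 1, 1]` asks for parts in the ratio 2:1:1, i.e. one half and two quarters. The following sketch converts relative weights into contiguous percent ranges. The function name `weights_to_percent_ranges` and the floor-based rounding are assumptions for illustration; TFDS may round differently when the weights do not divide 100 evenly.

```python
def weights_to_percent_ranges(weights):
    """Turn relative weights into contiguous (start, stop) percent ranges."""
    total = sum(weights)
    ranges = []
    cumulative = 0
    for weight in weights:
        # Each range starts where the previous one stopped.
        start = 100 * cumulative // total
        cumulative += weight
        stop = 100 * cumulative // total
        ranges.append((start, stop))
    return ranges


print(weights_to_percent_ranges([2, 1, 1]))  # [(0, 50), (50, 75), (75, 100)]
```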

## Composing split, adding, and subsplitting

It is possible to compose the above operations.

In [0]:
# Merge the middle half of the TRAIN split with the TEST split
split = tfds.Split.TRAIN.subsplit(tfds.percent[25:75]) + tfds.Split.TEST


# Merge the TRAIN and TEST splits, then divide the result into 2 parts
first_half, second_half = (tfds.Split.TRAIN + tfds.Split.TEST).subsplit(2)

In [0]:
first_half_ds = tfds.load("mnist", split=first_half)

Note that a split cannot be added to itself, and `subsplit` can only be applied once to a given split.

```python
#
# WRONG
# Overlap between splits. Split {'train'} has been added with itself.
#
wrong_split = tfds.Split.TRAIN.subsplit(tfds.percent[:25]) + tfds.Split.TRAIN
tfds.load("mnist", split=wrong_split)

#
# WRONG
# Trying to slice Split train which has already been sliced
#
wrong_split1, wrong_split2 = tfds.Split.TRAIN.subsplit(tfds.percent[:25]).subsplit(2)
tfds.load("mnist", split=wrong_split1)

#
# WRONG
# Trying to slice Split train which has already been sliced
#
wrong_split = (tfds.Split.TRAIN.subsplit(tfds.percent[:25]) +
               tfds.Split.TEST.subsplit(tfds.percent[:25])).subsplit(tfds.percent[:50])
tfds.load("mnist", split=wrong_split)
```
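
The two rules behind these errors can be captured in a toy model. The class `SketchSplit` below is hypothetical and exists only for this illustration: it tracks which base splits an expression contains and whether it has already been sliced. It is not the TFDS class and does not load any data.

```python
class SketchSplit:
    """Toy model of a split expression; not the TFDS class."""

    def __init__(self, names, sliced=False):
        self.names = frozenset(names)  # base splits involved, e.g. {"train"}
        self.sliced = sliced           # True once subsplit() has been applied

    def __add__(self, other):
        # Rule 1: the same base split must not appear on both sides.
        overlap = self.names & other.names
        if overlap:
            raise ValueError("Overlap between splits: {}".format(sorted(overlap)))
        return SketchSplit(self.names | other.names, self.sliced or other.sliced)

    def subsplit(self):
        # Rule 2: an expression that was already sliced cannot be sliced again.
        if self.sliced:
            raise ValueError("Split has already been sliced")
        return SketchSplit(self.names, sliced=True)


train, test = SketchSplit({"train"}), SketchSplit({"test"})

# Allowed: different base splits, each sliced at most once.
ok = train.subsplit() + test
ok_too = (train + test).subsplit()

# The three WRONG cases above, expressed in the toy model.
for build in (
    lambda: train.subsplit() + train,
    lambda: train.subsplit().subsplit(),
    lambda: (train.subsplit() + test.subsplit()).subsplit(),
):
    try:
        build()
    except ValueError as err:
        print("rejected:", err)
```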

## Datasets using non-conventionally named splits

For datasets whose splits do not use the conventional names (`tfds.Split.{TRAIN, TEST, VALIDATION}`), you can still use the split API by passing the custom name, e.g. `tfds.Split('customized_split')`.

In [0]:
coco2014_split = tfds.Split("test2015") + tfds.Split.TEST
ds = tfds.load("coco2014", split=coco2014_split)

# Dataset Builder


You can package your own data as a TensorFlow dataset via `DatasetBuilder`. The process involves the following steps:
* Writing `my_dataset.py`
* Specify `DatasetInfo`
* Downloading and extracting source data
* Specify dataset splits
* Writing an example generator
* Dataset configuration
* Create your own `FeatureConnector`
* Adding the dataset to `tensorflow/datasets` (optional)
* Large datasets and distributed generation
* Testing `MyDataset`

TFDS provides a way to transform datasets into a standard format, preprocess them so that they are ready for a machine learning pipeline, and, more importantly, expose a standard input pipeline using `tf.data`. To enable this, each dataset must implement a subclass of `DatasetBuilder`. The implementation specifies:
* where the data is coming from (e.g. URL)
* what the dataset looks like (e.g. features)
* how the dataset should be split (e.g. TRAIN or TEST)

## Writing `my_dataset.py`

### Using the default template

If you want to contribute a new dataset to the official TensorFlow Datasets repository, it is recommended to start by auto-generating the required Python scripts. Afterwards, search for `TODO(my_dataset)` and make the necessary modifications.

In [0]:
!git clone https://github.com/tensorflow/datasets.git
!mv datasets tensorflow_dataset

!python tensorflow_dataset/tensorflow_datasets/scripts/create_new_dataset.py \
  --dataset my_dataset \
  --type image   # text, audio, translation, ...

Alternatively, install the package and run the script as a module:

In [0]:
!pip install tensorflow_datasets
!python -m tensorflow_datasets.scripts.create_new_dataset --dataset dataset_name --type dataset_type

By default, two files are generated: `my_dataset.py` and `my_dataset_test.py`.

### Starting from `DatasetBuilder`

Each dataset is defined as a subclass of `tfds.core.DatasetBuilder` that implements the necessary methods:
* `_info`
* `_download_and_prepare`
* `_as_dataset`
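
The shape of this contract can be sketched without TFDS installed. In the example below, `SketchDatasetBuilder` is a simplified, hypothetical stand-in for the real base class, and the method signatures, return values, and in-memory "data" are all assumptions chosen for illustration. Only the three method names come from the list above.

```python
import abc


class SketchDatasetBuilder(abc.ABC):
    """Simplified stand-in for the base class; the real one lives in tfds.core."""

    @abc.abstractmethod
    def _info(self):
        """Describe the dataset (name, features, ...)."""

    @abc.abstractmethod
    def _download_and_prepare(self, dl_manager):
        """Fetch the source data and prepare it, grouped by split."""

    @abc.abstractmethod
    def _as_dataset(self, split):
        """Return the examples that belong to the requested split."""


class MyDataset(SketchDatasetBuilder):
    def _info(self):
        # Hypothetical metadata; a real builder would return a DatasetInfo object.
        return {"name": "my_dataset", "features": {"image": (28, 28, 1), "label": ()}}

    def _download_and_prepare(self, dl_manager):
        # Hypothetical in-memory data instead of a real download.
        return {"train": ["example-0", "example-1"], "test": ["example-2"]}

    def _as_dataset(self, split):
        return self._download_and_prepare(dl_manager=None)[split]


builder = MyDataset()
print(builder._info()["name"])
print(builder._as_dataset("train"))
```

A subclass that omits any of the three methods cannot be instantiated in this sketch, which mirrors the idea that every dataset has to provide all of them.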