
In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Exploring the Splits API

## Setup

We'll start by importing TensorFlow and TensorFlow Datasets.

In [None]:
# Colab-only magic that selects TensorFlow 2.x; skipped if it is unavailable.
try:
    %tensorflow_version 2.x
except Exception:
    pass

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

print("\u2022 Using TensorFlow Version:", tf.__version__)

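## Exploring the Splits API

Passing a list of split names to the `split` argument returns one dataset per name. The cell below loads the `train` and `test` splits of MNIST and counts the records in each.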

In [None]:
train_ds, test_ds = tfds.load('mnist:3.*.*', split=['train', 'test'])

print(len(list(train_ds)))
print(len(list(test_ds)))

With the Splits API, we can specify slicing instructions as strings. For example, the cell below merges the training and test sets by passing the string `'train+test'` to the `split` argument.

In [None]:
combined = tfds.load('mnist:3.*.*', split='train+test')

print(len(list(combined)))

We can also use Python-style slice notation to specify the data we want. For example, the string `'train[:10000]'` selects the first 10,000 records of the `train` split, as shown below:

In [None]:
first10k = tfds.load('mnist:3.*.*', split='train[:10000]')

print(len(list(first10k)))

The Splits API also lets us specify the percentage of the data we want to use. For example, we can select the first 20% of the training set with the string `'train[:20%]'`, as shown below:

In [None]:
first20p = tfds.load('mnist:3.*.*', split='train[:20%]')

print(len(list(first20p)))

We can see that `first20p` contains 12,000 records, which is indeed 20% of the 60,000 records in the training set.
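
To make that arithmetic explicit, here is a minimal pure-Python sketch that estimates how many records a percentage slice selects. The helper `percent_slice_size` is hypothetical (it is not part of TFDS), and it assumes the slice boundaries fall on whole records. TFDS applies its own rounding when they do not, which this sketch does not try to reproduce.

```python
def percent_slice_size(num_examples, start_pct=None, stop_pct=None):
    """Estimate the number of records in 'split[start_pct%:stop_pct%]'.

    Hypothetical helper for illustration only; not a TFDS function.
    """
    def to_index(pct, default):
        if pct is None:
            return default
        index = round(num_examples * pct / 100)
        # Negative percentages count back from the end, like Python slices.
        return index + num_examples if index < 0 else index

    start = to_index(start_pct, 0)
    stop = to_index(stop_pct, num_examples)
    return max(0, stop - start)


TRAIN_SIZE = 60_000  # size of the MNIST training set

print(percent_slice_size(TRAIN_SIZE, stop_pct=20))  # 12000
```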

Because the slices are string-based, we can generate them programmatically and build some fairly complex splits. For example, the list comprehensions below create five complementary pairs of validation and training sets, in the style of 5-fold cross-validation (each call returns a list of 5 datasets).

In [None]:
val_ds = tfds.load('mnist:3.*.*', split=['train[{}%:{}%]'.format(k, k+20) for k in range(0, 100, 20)])

train_ds = tfds.load('mnist:3.*.*', split=['train[:{}%]+train[{}%:]'.format(k, k+20) for k in range(0, 100, 20)])
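
It can help to look at the split strings themselves before handing them to `tfds.load`. The sketch below rebuilds the same strings with plain Python, so it needs neither TensorFlow nor a dataset download. The names `fold_width`, `val_splits`, and `train_splits` are introduced here for illustration only.

```python
fold_width = 20  # each validation fold covers 20% of the training set

# Validation folds: consecutive 20% windows of the training split.
val_splits = ['train[{}%:{}%]'.format(k, k + fold_width)
              for k in range(0, 100, fold_width)]

# Training folds: everything before the window plus everything after it.
train_splits = ['train[:{}%]+train[{}%:]'.format(k, k + fold_width)
                for k in range(0, 100, fold_width)]

for val, train in zip(val_splits, train_splits):
    print(val, '|', train)
```

The first pair printed is `train[0%:20%] | train[:0%]+train[20%:]` and the last is `train[80%:100%] | train[:80%]+train[100%:]`, so each validation window is exactly the part left out of its training counterpart.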

In [None]:
val_ds

In [None]:
train_ds

In [None]:
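# val_ds and train_ds are Python lists of datasets, so these print the number of folds, not records.
print(len(list(val_ds)))
print(len(list(train_ds)))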

We can also compose new datasets by using pieces from different splits. For example, we can create a new dataset from the first 10% of the test set and the last 80% of the training set, as shown below.

In [None]:
composed_ds = tfds.load('mnist:3.*.*', split='test[:10%]+train[-80%:]')

print(len(list(composed_ds)))
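
As a rough sanity check on the record count, the same composition can be mimicked with ordinary Python lists, where slicing and `+` play the roles of the bracketed ranges and the `+` in the split string. This is only an analogy under the assumption that the MNIST test and training sets hold 10,000 and 60,000 records respectively. The lists `test_ids` and `train_ids` are stand-ins, not real data.

```python
# Stand-in record indices; the sizes match the MNIST test and train splits.
test_ids = list(range(10_000))
train_ids = list(range(60_000))

first_10pct_test = test_ids[:len(test_ids) * 10 // 100]       # like 'test[:10%]'
last_80pct_train = train_ids[-(len(train_ids) * 80 // 100):]  # like 'train[-80%:]'

# '+' concatenates the two pieces, as it does in the split string.
composed = first_10pct_test + last_80pct_train

print(len(composed))  # 49000
```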