In [3]:
# default_exp data.processing.tf_data

%reload_ext autoreload
%autoreload 2

In [1]:
import tensorflow as tf

In [2]:
tf.__version__

'2.2.0'

In [3]:
import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.set_printoptions(precision=4)

# tf.data: Build TensorFlow input pipelines
https://tensorflow.google.cn/guide/data

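The tf.data API lets you build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces the tf.data.Dataset abstraction, which represents a sequence of elements, in which each element consists of one or more components.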

For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset:

A data source constructs a Dataset from data stored in memory or in one or more files.

A data transformation constructs a dataset from one or more tf.data.Dataset objects.



## Basic mechanics
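To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in the recommended TFRecord format, you can use tf.data.TFRecordDataset().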
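The two in-memory sources differ in how they split the input. The sketch below is a minimal illustration with a made-up 3x2 list named `data` (not part of the original notebook): from_tensors() keeps the whole input as one element, while from_tensor_slices() yields one element per row.

```python
import tensorflow as tf

# Hypothetical toy input: 3 rows, 2 columns.
data = [[1, 2], [3, 4], [5, 6]]

# from_tensors wraps the entire input as a single element of shape (3, 2).
whole = tf.data.Dataset.from_tensors(data)

# from_tensor_slices slices along the first dimension: 3 elements of shape (2,).
rows = tf.data.Dataset.from_tensor_slices(data)

whole_elems = [e.numpy().tolist() for e in whole]
row_elems = [e.numpy().tolist() for e in rows]

print(len(whole_elems), whole.element_spec.shape)
print(len(row_elems), rows.element_spec.shape)
```

from_tensor_slices() is usually the one to reach for when each row is a separate training example.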

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as Dataset.map(), and multi-element transformations such as Dataset.batch(). See the documentation for tf.data.Dataset for a complete list of transformations.
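As a rough sketch of such chaining, the snippet below applies a per-element map() followed by a multi-element batch(). The names `source`, `doubled`, and `batched`, the doubling function, and the batch size of 4 are arbitrary choices for illustration.

```python
import tensorflow as tf

source = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])

# Per-element transformation: double every value.
doubled = source.map(lambda x: x * 2)

# Multi-element transformation: group consecutive elements into batches of 4.
# The last batch is smaller because 6 is not a multiple of 4.
batched = doubled.batch(4)

batches = [b.numpy().tolist() for b in batched]
print(batches)  # [[16, 6, 0, 16], [4, 2]]
```

Each call returns a new Dataset and leaves `source` unchanged, so intermediate stages can be reused.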


The Dataset object is a Python iterable. This makes it possible to consume its elements using a for loop:

In [4]:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

In [5]:
for elem in dataset:
  print(elem.numpy())

8
3
0
8
2
1


In [6]:
# Or by explicitly creating a Python iterator using iter and consuming its elements using next:
it = iter(dataset)

print(next(it).numpy())

8


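Alternatively, you can consume dataset elements using the reduce transformation, which reduces all elements to produce a single result. The following example illustrates how to use the reduce transformation to compute the sum of a dataset of integers.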

In [7]:
print(dataset.reduce(0, lambda state, value: state + value).numpy())

22
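The same pattern works for other aggregations, as long as the initial state has the same dtype and shape as the value the function returns. The following sketch, using the same integers, counts the elements and finds the maximum. The variable names and the choice of tf.int32.min as the starting state are illustrative assumptions.

```python
import tensorflow as tf

numbers = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])

# Count elements: ignore the value and add 1 to the running state.
count = numbers.reduce(0, lambda state, _: state + 1).numpy()

# Maximum: start from the smallest int32 so any element replaces it.
initial = tf.constant(tf.int32.min, dtype=tf.int32)
largest = numbers.reduce(
    initial, lambda state, value: tf.maximum(state, value)).numpy()

print(count, largest)  # 6 8
```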


### Dataset structure
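A dataset contains elements that each have the same (nested) structure, and the individual components of the structure can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, or tf.data.Dataset.

The Dataset.element_spec property allows you to inspect the type of each element component. The property returns a nested structure of tf.TypeSpec objects, matching the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example: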

In [8]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec


TensorSpec(shape=(10,), dtype=tf.float32, name=None)

In [9]:
dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec


(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))

In [10]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec


(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

In [11]:
# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))

dataset4.element_spec


SparseTensorSpec(TensorShape([3, 4]), tf.int32)

In [12]:
# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type


tensorflow.python.framework.sparse_tensor.SparseTensor

The Dataset transformations support datasets of any structure. When using the Dataset.map() and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of the function:

In [13]:
dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1


<TensorSliceDataset shapes: (10,), types: tf.int32>

In [14]:
for z in dataset1:
  print(z.numpy())


[5 1 9 4 1 9 4 9 1 3]
[8 9 5 8 4 3 1 8 9 5]
[1 9 9 3 7 7 8 7 9 1]
[3 5 5 4 1 5 2 8 2 4]


In [15]:
dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2


<TensorSliceDataset shapes: ((), (100,)), types: (tf.float32, tf.int32)>

In [16]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3


<ZipDataset shapes: ((10,), ((), (100,))), types: (tf.int32, (tf.float32, tf.int32))>

In [17]:
for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))


shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
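The same unpacking applies to the functions passed to Dataset.map() and Dataset.filter(). The sketch below uses a small hypothetical (features, labels) dataset, not one from this notebook. Because each element is a 2-tuple, both functions receive two arguments.

```python
import tensorflow as tf

# Hypothetical toy data: 4 examples with 2 features each, plus a label.
features = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]])
labels = tf.constant([0, 1, 0, 1])
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

# filter: the predicate receives (x, y) and must return a scalar boolean.
positives = pairs.filter(lambda x, y: tf.equal(y, 1))

# map: the function receives (x, y) and may return a new structure.
sums = positives.map(lambda x, y: (tf.reduce_sum(x), y))

result = [(int(s.numpy()), int(y.numpy())) for s, y in sums]
print(result)  # [(7, 1), (15, 1)]
```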


# nb_export

In [2]:
from nbdev.export import *
notebook2script()

Converted 00_core.ipynb.
Converted 00_template.ipynb.
Converted active_learning.ipynb.
Converted algo_dl_keras.ipynb.
Converted algo_ml_eda.ipynb.
Converted algo_ml_tree_catboost.ipynb.
Converted algo_ml_tree_lgb.ipynb.
Converted algo_rs_associated_rules.ipynb.
Converted algo_rs_match_deepmatch.ipynb.
Converted algo_rs_matrix.ipynb.
Converted algo_rs_search_vector_faiss.ipynb.
Converted algo_seq_embeding.ipynb.
Converted algo_seq_features_extraction_text.ipynb.
Converted datastructure_dict_list_set.ipynb.
Converted datastructure_matrix_sparse.ipynb.
Converted engineering_concurrency.ipynb.
Converted engineering_nbdev.ipynb.
Converted engineering_panel.ipynb.
Converted engineering_snorkel.ipynb.
Converted index.ipynb.
Converted math_func_basic.ipynb.
Converted math_func_loss.ipynb.
Converted operating_system_command.ipynb.
Converted plot.ipynb.
Converted utils_functools.ipynb.
Converted utils_json.ipynb.
Converted utils_pickle.ipynb.
Converted utils_time.ipynb.


In [7]:
!nbdev_build_docs

No notebooks were modified
converting /Users/luoyonggui/PycharmProjects/nbdevlib/index.ipynb to README.md
