# TF DEV Summit - Modu Labs


## TF.DATA - Deep NLP 신성진

## Where TF starts: handling data quickly and easily is what matters most

## **The tf.data mission**

Input pipelines for TensorFlow should be:

**Fast** : to keep up with GPUs and TPUs

**Flexible** : to handle diverse data sources and use cases

**Easy to use** : to democratize machine learning

## Extract, Transform, Load for TensorFlow

```python
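import tensorflow as tf  # TF 1.x API

# file_pattern, features (the parsing spec), NUM_EPOCHS and BATCH_SIZE are assumed to be defined.

# Extract
files = tf.data.Dataset.list_files(file_pattern)
dataset = tf.data.TFRecordDataset(files)

# Transform
dataset = dataset.shuffle(10000)  # shuffle buffer of 10,000 elements
dataset = dataset.repeat(NUM_EPOCHS)
# Each element is one serialized tf.Example, so it is parsed one record at a time.
dataset = dataset.map(lambda x: tf.parse_single_example(x, features))
dataset = dataset.batch(BATCH_SIZE)

# Load
iterator = dataset.make_one_shot_iterator()  # sequential access
features = iterator.get_next()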
```
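The `10000` passed to `shuffle` is a buffer size: elements are drawn at random from a sliding buffer rather than from the whole dataset. The following is a minimal pure-Python sketch of that behaviour, followed by batching. The helper names `buffered_shuffle` and `batch` are hypothetical, and the code only models the documented semantics; it is not TensorFlow's implementation.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the incoming one
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


records = range(10)  # stands in for parsed records
batches = list(batch(buffered_shuffle(records, buffer_size=4, seed=0), 3))
```

With a buffer of 1 the order is unchanged, and only a buffer at least as large as the dataset gives a uniform shuffle, which is why the buffer size matters.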

## Performance

- CNN benchmarks reach > **13,000 images/second** with tf.data -> about 2x the performance of 8 months earlier (DGX, ImageNet)

- Uses the TensorFlow benchmarks project; see www.tensorflow.org/performance/datasets_performance

- New `tf.contrib.data.prefetch_to_device()` for GPUs in TF 1.8 (available in tf-nightly)

```python

# How to go faster? Parallelize!

#Extract
files = tf.data.Dataset.list_files(file_pattern)
# dataset = tf.data.TFRecordDataset(files) ->
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=32)

#Transform

# shuffle_and_repeat: avoids the stall between epochs while the shuffle buffer refills
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(10000, NUM_EPOCHS))

# map_and_batch: fused map + batch; the map runs at the same time as the data transfer
dataset = dataset.apply(
    tf.contrib.data.map_and_batch(lambda x: ..., BATCH_SIZE))

#Load

# prefetch_to_device: the next batch is already waiting in GPU memory
dataset = dataset.apply(tf.contrib.data.prefetch_to_device("/gpu:0"))
iterator = dataset.make_one_shot_iterator()  # sequential access
features = iterator.get_next()

```
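The idea behind prefetching is that the next batch is produced while the current one is being consumed. The following is a rough sketch of that overlap using a background thread and a bounded queue. The `prefetch` and `load_batches` helpers are hypothetical stand-ins and say nothing about how TensorFlow schedules the work internally.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size=1):
    """Yield items from iterable, producing up to buffer_size items ahead."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in iterable:
                q.put(item)  # blocks once the buffer is full
        finally:
            q.put(_DONE)     # always signal the end, even on error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


def load_batches(n):
    """Stands in for reading and decoding n batches."""
    for i in range(n):
        yield [i, i + 1]


consumed = [b for b in prefetch(load_batches(5), buffer_size=2)]
```

The order of the batches is preserved; only the timing changes, because the producer can run ahead of the consumer by at most `buffer_size` items.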

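## Flexibility

- Supports `tf.SparseTensor` (since TF 1.5) -> useful when handling complex categorical data or embedding models

- Custom Python code via `Dataset.from_generator()` -> wraps a plain Python generator as a dataset

- Custom C++ code via DatasetOpKernel plugins

-> Useful in practice when building new datasets or improving existing ones.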
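As an illustration of the `Dataset.from_generator()` route, here is a small sketch. The generator `variable_length_sequences` and the wrapper `build_dataset` are hypothetical names, and the element type and shape are assumptions chosen for the example. TensorFlow is imported inside the wrapper, so the generator can be tried on its own.

```python
def variable_length_sequences(max_len=4):
    """Plain Python generator: yields [1], [1, 2], ... up to max_len items."""
    for n in range(1, max_len + 1):
        yield list(range(1, n + 1))


def build_dataset():
    """Wrap the generator as a dataset (requires TensorFlow to be installed)."""
    import tensorflow as tf  # lazy import: only needed when this is called

    return tf.data.Dataset.from_generator(
        variable_length_sequences,             # a callable returning an iterator
        output_types=tf.int64,                 # assumed element dtype
        output_shapes=tf.TensorShape([None]),  # 1-D, length not known in advance
    )
```

Because the generator is ordinary Python, any custom reading or preprocessing logic can live there, at the cost of running in the Python interpreter rather than in the TensorFlow runtime.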

## Ease of Use

**No more struggling just to read in your data**

### Use Python for loops in eager execution mode

```python
tf.enable_eager_execution()  # eager mode must be enabled at program startup

# Extract
files = tf.data.Dataset.list_files(file_pattern)
dataset = tf.data.TFRecordDataset(files)

# Transform
dataset = dataset.shuffle(10000)
dataset = dataset.repeat(NUM_EPOCHS)
dataset = dataset.map(lambda x: tf.parse_single_example(x, features))
dataset = dataset.batch(BATCH_SIZE)

# Eager execution makes the dataset a normal Python iterable.

for batch in dataset:
    train_model(batch)
```


### Standard methods for CSV files and protocol buffers (TF 1.8)

```python
tf.enable_eager_execution()

# make_batched_features_dataset reads tf.Example protocol buffers in batches
dataset = tf.contrib.data.make_batched_features_dataset(
    file_pattern, BATCH_SIZE, features, num_epochs=NUM_EPOCHS)

# For speed, binary formats such as tf.Example / TFRecord are generally recommended,
# but you do not always have a large dataset in that form.

# Kaggle example (CSV data)
# $ pip install kaggle
# $ kaggle datasets download -d therohk/million-headlines -p .

# The column names below come from the Kaggle CSV above.
for batch in dataset:
    train_model(batch["publish_date"], batch["headline_text"])

```
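The loop above treats each batch as a mapping from column name to a batch of values. The following is a pure-Python sketch of that shape using only the standard `csv` module. The `csv_batches` helper is hypothetical and the sample rows are made up for illustration; only the two column names are taken from the Kaggle example.

```python
import csv
import io
from itertools import islice

# Made-up rows; only the column names mirror the Kaggle headlines file.
SAMPLE_CSV = """publish_date,headline_text
20030219,first made-up headline
20030219,second made-up headline
20030220,third made-up headline
"""


def csv_batches(text, batch_size):
    """Yield batches as {column_name: [values, ...]} dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    while True:
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        # Turn a list of row dicts into one dict of column lists.
        yield {name: [row[name] for row in rows] for name in reader.fieldnames}


batches = list(csv_batches(SAMPLE_CSV, batch_size=2))
```

Each batch is column-oriented, so `batch["headline_text"]` gives all headlines of that batch at once, which is the form a model's input usually wants.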

### Integration with Estimators (and Keras coming soon!)

```python
def input_fn():
    dataset = tf.contrib.data.make_csv_dataset(
        "*.csv", BATCH_SIZE, num_epochs=NUM_EPOCHS)
    return dataset

# Train an estimator on the dataset
tf.estimator.Estimator(model_fn=train_model).train(input_fn=input_fn)

```

Further information

- www.tensorflow.org/programmers_guide/datasets
- www.tensorflow.org/performance/datasets_performance

# Hands-on Practice and Usage

https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428

1. Importing Data: create the dataset
 - From numpy
 - From tensor
 - From a placeholder
 - From generator

2. Create an Iterator: build an Iterator instance from the created dataset
 - One shot Iterator
 - Initializable Iterator
 - Reinitializable Iterator
 - Feedable Iterator

3. Consuming Data: use the created iterator to get elements from the dataset and feed them to the model
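Steps 2 and 3 can be sketched for the initializable case, where the data is supplied through a placeholder when the iterator is initialized. This is a hedged example that assumes the TF 1.x graph-mode API used throughout these notes. The function names and the sample rows are hypothetical, and TensorFlow is imported lazily inside the function that needs it.

```python
def make_rows(n):
    """Build n rows of two floats each (plain Python, no TensorFlow needed)."""
    return [[float(i), float(i) + 0.5] for i in range(n)]


def first_element_via_initializable_iterator(rows):
    """TF 1.x graph-mode sketch: feed data through a placeholder at init time."""
    import tensorflow as tf  # lazy import; this function targets the TF 1.x API

    x = tf.placeholder(tf.float32, shape=[None, 2])
    dataset = tf.data.Dataset.from_tensor_slices(x)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        # Unlike a one-shot iterator, this one must be initialized explicitly,
        # and that is the moment the placeholder receives its data.
        sess.run(iterator.initializer, feed_dict={x: rows})
        return sess.run(next_element)
```

Calling `first_element_via_initializable_iterator(make_rows(3))` under TF 1.x would return the first row. Re-running the initializer with a different `feed_dict` is what lets the same iterator switch between, say, training and validation data.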

```python
import tensorflow as tf
import numpy as np

# with keras
```

```python
# Handling numpy data

x = np.random.sample((300, 2))
print(x.shape)
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)

iterator = dataset.make_one_shot_iterator()
el = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(el))

features, labels = (np.random.sample((100, 2)), np.random.sample((100, 1)))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
```

Output:

```
(300, 2)
[0.82791794 0.90319535]
```

```python
dataset
```

Output:

```
<TensorSliceDataset shapes: ((2,), (1,)), types: (tf.float64, tf.float64)>
```
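The shapes `((2,), (1,))` in that output follow from slicing every input along its first axis: 100 rows of features and 100 rows of labels become 100 `(features, label)` pairs. The following is a small pure-Python model of that slicing. The `tensor_slices` helper is hypothetical and the values are made up; it only mirrors the element structure, not the TensorFlow implementation.

```python
def tensor_slices(*columns):
    """Yield one tuple per row, slicing every input along its first axis."""
    lengths = {len(c) for c in columns}
    if len(lengths) != 1:
        raise ValueError("all inputs must share the same first dimension")
    yield from zip(*columns)


# Made-up data with the same outer shapes as in the notebook cell.
features = [[0.1 * i, 0.2 * i] for i in range(100)]  # shape (100, 2)
labels = [[float(i % 2)] for i in range(100)]        # shape (100, 1)

elements = list(tensor_slices(features, labels))
first_features, first_label = elements[0]
```

Each element pairs one feature row of length 2 with one label row of length 1, which matches the per-element shapes reported by the dataset.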