<a href="https://colab.research.google.com/github/iypc-team/CoLab/blob/master/Defcon_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import glob, os, shutil
from os.path import *
from google.colab import drive, files

contentPth = os.getcwd()
drivePth = abspath('/content/gdrive')
myDrivePth = abspath('/content/gdrive/My Drive')
pythonPth = abspath('/content/gdrive/My Drive/PythonFiles')
tfImagePth = abspath('/content/gdrive/My Drive/Tensorflow Images')
dataPth = abspath('/content/data')

deletePth = abspath('/content/sample_data')
if exists(deletePth):
    print(deletePth)
    shutil.rmtree(deletePth)

drive.mount('/content/gdrive', force_remount=True)

if not exists(dataPth):
    try:
        shutil.copytree(src=tfImagePth, dst=dataPth)
    except Exception as err:
        print(err)

os.chdir(pythonPth)        
print(f'cwd: {os.getcwd()}')
from BashColors import C

In [None]:
os.chdir(dataPth)
fileNames=[]
imageDataList = glob.glob('**', recursive=True)
for item in sorted(imageDataList):
    filPth = abspath(item)
    if isdir(item):
        print(f'{C.IBlue}{filPth}')
    elif isfile(item):
        fileNames.append(filPth)
        print(f'{C.IWhite}{filPth}')

In [None]:
import tensorflow as tf
print('tensorflow:', tf.__version__)

In [None]:
import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# np.set_printoptions(precision=4)

Create the `image.ImageDataGenerator`

In [None]:
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    rotation_range=180,
    zoom_range=5
    )

In [None]:
images, labels = next(img_gen.flow_from_directory(dataPth))

In [None]:
print(images.dtype, images.shape)
# print()
print(labels.dtype, labels.shape)
# print()

imgCount=0
for img in images:
    imgCount+=1
    print(f'\n\n{C.IBlue}{imgCount}.\n{C.IWhite}{img}')

In [None]:
ds  = tf.data.Dataset.from_generator(
    lambda: img_gen.flow_from_directory(dataPth), 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([13,256,256,3], [13,3]))

defconData = tf.data.Dataset.from_generator(
    lambda: img_gen.flow_from_directory(dataPth), 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([13,256,256,3], [13,3]))

print(ds.element_spec)
print('defconData', defconData.element_spec)

In [None]:
for images, labels in defconData.take(1):
    print('images.shape: ', images.shape)
    print('labels.shape: ', labels.shape)


In [None]:
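def process_path(file_path):
    # The class label is the name of the file's parent directory.
    label = tf.strings.split(file_path, os.sep)[-2]
    return tf.io.read_file(file_path), label

# Build a dataset of file paths, then map the parser over it.
list_ds = tf.data.Dataset.list_files(fileNames)
labeled_defconData = list_ds.map(process_path)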


### Decoding image data and resizing it

When training a neural network on real-world image data, it is often necessary
to convert images of different sizes to a common size, so that they may be
batched into a fixed size.

Build the filenames dataset from the `fileNames` list collected earlier:

In [None]:
# list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
list_ds = tf.data.Dataset.list_files(file_pattern=fileNames)

list_ds
defconData

Write a function that manipulates the dataset elements.

In [None]:
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
    parts = tf.strings.split(filename, os.sep)
    label = parts[-2]
    
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [128, 128])
    return image, label
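
The label lookup in `parse_image` assumes each image sits in a directory named after its class. As a plain-Python sketch of the same indexing, the snippet below splits a path on `os.sep` and takes the second-to-last component. The function name `label_from_path` and the example path are hypothetical and are not taken from this notebook's data.

```python
import os

def label_from_path(path):
    # Split on the OS path separator and take the second-to-last component,
    # which is the name of the file's parent directory.
    parts = path.split(os.sep)
    return parts[-2]

# Hypothetical path, for illustration only.
example_path = os.sep.join(['', 'content', 'data', 'class_a', 'img_001.jpg'])
print(label_from_path(example_path))
```

If the images were nested one level deeper, the index would need to change accordingly.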

Test that it works.

In [None]:
file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
    plt.figure()
    plt.imshow(image)
    plt.title(label.numpy().decode('utf-8'))
    plt.axis('on')

show(image, label)

Map it over the dataset.

In [None]:
images_ds = list_ds.map(parse_image)
for image, label in images_ds.take(13):
  show(image, label)

### Applying arbitrary Python logic

For performance reasons, use TensorFlow operations for
preprocessing your data whenever possible. However, it is sometimes useful to
call external Python libraries when parsing your input data. You can use the `tf.py_function()` operation in a `Dataset.map()` transformation.

In [None]:
import scipy.ndimage as ndimage

def fixed_rotate_image(image):
    image = ndimage.rotate(image, angle=270.0)
    return image

def random_rotate_image(image):
    image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
    return image

In [None]:
image, label = next(iter(images_ds))
image = fixed_rotate_image(image)
show(image, label)

The same caveats apply to using this function with `Dataset.map` as with `Dataset.from_generator`: you need to describe the return shapes and types when you apply the function.

In [None]:
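def tf_random_rotate_image(image, label):
    im_shape = image.shape
    # Wrap the NumPy/SciPy rotation so it can run inside `Dataset.map`.
    [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
    # `tf.py_function` loses static shape information, so restore it.
    image.set_shape(im_shape)
    return image, label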

You can work with `tf.train.Example` protos outside of a `tf.data.Dataset` to understand the data:

In [None]:
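# `tf_parse` and `raw_example` are not defined anywhere in this notebook;
# define them before running this cell.
img, txt = tf_parse(raw_example)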
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")

In [None]:
# Pass the function itself (not a call) and map it over the file paths.
decoded = list_ds.map(parse_image)
decoded

In [None]:
image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape

<a id="time_series_windowing"></a>

### Time series windowing

For an end-to-end time series example see: [Time series forecasting](../../tutorials/text/time_series.ipynb).

Time series data is often organized with the time axis intact.

Use a simple `Dataset.range` to demonstrate:

In [None]:
range_ds = tf.data.Dataset.range(100000)

Typically, models based on this sort of data will want a contiguous time slice. 

The simplest approach would be to batch the data:

#### Using `batch`

In [None]:
batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())

Or to make dense predictions one step into the future, you might shift the features and labels by one step relative to each other:

In [None]:
def dense_1_step(batch):
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())

To predict a whole window instead of a fixed offset you can split the batches into two parts:

In [None]:
batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Inputs: all except the last 5 steps
          batch[-5:])   # Labels: the last 5 steps

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())
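
The two slicing patterns above can be checked on a plain Python list. This is only a sketch with lists standing in for tensors; the `toy_` names are illustrative and not part of this notebook.

```python
# Toy stand-in for one batch of 15 consecutive time steps.
toy_batch = list(range(15))

# One-step-ahead pairing: each label is the value that follows its feature.
toy_dense_features, toy_dense_labels = toy_batch[:-1], toy_batch[1:]

# Window pairing: everything except the last 5 steps predicts the last 5.
toy_window_features, toy_window_labels = toy_batch[:-5], toy_batch[-5:]

print(toy_window_features, '=>', toy_window_labels)
```

With a batch of 15, the window pairing yields 10 input steps and 5 label steps.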

To allow some overlap between the features of one batch and the labels of another, use `Dataset.zip`:

In [None]:
feature_length = 10
label_length = 3

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])

predicted_steps = tf.data.Dataset.zip((features, labels))

for features, label in predicted_steps.take(5):
  print(features.numpy(), " => ", label.numpy())
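
To see which elements end up paired, here is a plain-Python sketch of the same batch, skip, and zip logic on a short list. It models the pairing only, not `tf.data` itself, and the `toy_` names are illustrative.

```python
toy_feature_length = 10
toy_label_length = 3
toy_values = list(range(40))

# Split into consecutive full-length chunks (the `batch` step).
toy_chunks = [toy_values[i:i + toy_feature_length]
              for i in range(0, len(toy_values) - toy_feature_length + 1,
                             toy_feature_length)]

# Pair chunk i with the first few items of chunk i + 1 (the `skip(1)` and `zip` steps).
toy_pairs = [(toy_chunks[i], toy_chunks[i + 1][:toy_label_length])
             for i in range(len(toy_chunks) - 1)]

for chunk, label in toy_pairs:
    print(chunk, '=>', label)
```

Each feature chunk is followed directly by its labels, so the labels of one pair are also the first features of the next.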

#### Using `window`

While using `Dataset.batch` works, there are situations where you may need finer control. The `Dataset.window` method gives you complete control, but requires some care: it returns a `Dataset` of `Datasets`. See [Dataset structure](#dataset_structure) for details.

In [None]:
window_size = 5

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)

The `Dataset.flat_map` method can take a dataset of datasets and flatten it into a single dataset:

In [None]:
for x in windows.flat_map(lambda x: x).take(30):
  print(x.numpy(), end=' ')

In nearly all cases, you will want to `.batch` the dataset first:

In [None]:
def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())

Now, you can see that the `shift` argument controls how much each window moves over.

Putting this together you might write this function:

In [None]:
def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows


In [None]:
ds = make_window_dataset(range_ds, window_size=10, shift=5, stride=3)

for example in ds.take(10):
  print(example.numpy())
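
As a rough mental model of how `window_size`, `shift`, and `stride` interact, the following plain-Python sketch produces the same kind of windows from a list. It assumes incomplete windows are dropped (as with `drop_remainder=True`); the function name `sliding_windows` is hypothetical and this is not how `tf.data` implements windowing.

```python
def sliding_windows(values, window_size, shift=1, stride=1):
    # Distance between the first and last element of one window.
    span = (window_size - 1) * stride
    windows = []
    start = 0
    # Keep only complete windows.
    while start + span < len(values):
        # `stride` sets the gap between elements inside a window.
        windows.append(values[start:start + span + 1:stride])
        # `shift` sets how far the window start moves each time.
        start += shift
    return windows

toy_windows = sliding_windows(list(range(100)), window_size=10, shift=5, stride=3)
print(toy_windows[0])  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
print(toy_windows[1])  # [5, 8, 11, 14, 17, 20, 23, 26, 29, 32]
```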

Then it's easy to extract labels, as before:

In [None]:
dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())

### Resampling

When working with a dataset that is very class-imbalanced, you may want to resample the dataset. `tf.data` provides two methods to do this. The credit card fraud dataset is a good example of this sort of problem.

Note: See [Imbalanced Data](../tutorials/keras/imbalanced_data.ipynb) for a full tutorial.


In [None]:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')

In [None]:
creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])

Now, check the distribution of classes. It is highly skewed:

In [None]:
def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts

In [None]:
counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)
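
The `reduce` call above threads a running state through every batch. A plain-Python analogue using `functools.reduce` is sketched below; the label batches are made up for illustration and the `toy_` names are not part of this notebook.

```python
from functools import reduce

def toy_count(counts, batch_labels):
    # Return an updated copy of the running totals for one batch of labels.
    return {
        'class_0': counts['class_0'] + sum(1 for y in batch_labels if y == 0),
        'class_1': counts['class_1'] + sum(1 for y in batch_labels if y == 1),
    }

# Made-up label batches, for illustration only.
toy_label_batches = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]

toy_totals = reduce(toy_count, toy_label_batches, {'class_0': 0, 'class_1': 0})
toy_total = toy_totals['class_0'] + toy_totals['class_1']
toy_fractions = [toy_totals['class_0'] / toy_total, toy_totals['class_1'] / toy_total]
print(toy_totals, toy_fractions)
```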

A common approach to training with an imbalanced dataset is to balance it. `tf.data` includes a few methods which enable this workflow:

#### Datasets sampling

One approach to resampling a dataset is to use `sample_from_datasets`. This is more applicable when you have a separate `data.Dataset` for each class.

Here, just use filter to generate them from the credit card fraud data:

In [None]:
negative_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==0)
    .repeat())
positive_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==1)
    .repeat())

In [None]:
for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())

To use `tf.data.experimental.sample_from_datasets` pass the datasets, and the weight for each:

In [None]:
balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)

Now the dataset produces examples of each class with 50/50 probability:

In [None]:
for features, labels in balanced_ds.take(10):
  print(labels.numpy())
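
Conceptually, weighted sampling picks one source per output element according to the given weights. The plain-Python sketch below illustrates that idea with two endless iterators; it is not the `tf.data` implementation, and `sample_from_streams` is a hypothetical helper.

```python
import itertools
import random

def sample_from_streams(streams, weights, n, seed=0):
    rng = random.Random(seed)
    iterators = [iter(s) for s in streams]
    samples = []
    for _ in range(n):
        # Choose which stream supplies the next element.
        idx = rng.choices(range(len(iterators)), weights=weights)[0]
        samples.append(next(iterators[idx]))
    return samples

# Endless streams of one label each, standing in for the filtered datasets.
toy_negatives = itertools.repeat(0)
toy_positives = itertools.repeat(1)

toy_sample = sample_from_streams([toy_negatives, toy_positives], [0.5, 0.5], n=10000)
toy_positive_fraction = sum(toy_sample) / len(toy_sample)
print(toy_positive_fraction)
```

The sources must not run out before sampling ends, which is why the datasets above call `.repeat()`.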

#### Rejection resampling

One problem with the above `experimental.sample_from_datasets` approach is that
it needs a separate `tf.data.Dataset` per class. Using `Dataset.filter`
works, but results in all the data being loaded twice.

The `data.experimental.rejection_resample` function can be applied to a dataset to rebalance it, while only loading it once. Elements will be dropped from the dataset to achieve balance.

`data.experimental.rejection_resample` takes a `class_func` argument. This `class_func` is applied to each dataset element, and is used to determine which class an example belongs to for the purposes of balancing.

The elements of `creditcard_ds` are already `(features, label)` pairs. So the `class_func` just needs to return those labels:

In [None]:
def class_func(features, label):
  return label

The resampler also needs a target distribution, and optionally an initial distribution estimate:

In [None]:
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)

The resampler deals with individual examples, so you must `unbatch` the dataset before applying the resampler:

In [None]:
resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)

The resampler returns `(class, example)` pairs, where `class` is the output of `class_func`. In this case, the `example` was already a `(feature, label)` pair, so use `map` to drop the extra copy of the labels:

In [None]:
balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

Now the dataset produces examples of each class with 50/50 probability:

In [None]:
for features, labels in balanced_ds.take(10):
  print(labels.numpy())
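
The idea behind rejection resampling can be sketched in plain Python: keep each element with a probability proportional to `target / initial` for its class, scaled so the most under-represented class is always kept. This is a simplified illustration under that assumption, not necessarily the exact algorithm TensorFlow uses, and the helper names and toy label counts are hypothetical.

```python
import random

def acceptance_probs(initial_dist, target_dist):
    # Ratio of desired to observed frequency for each class.
    ratios = [t / i for t, i in zip(target_dist, initial_dist)]
    max_ratio = max(ratios)
    # Scale so the largest ratio maps to probability 1.
    return [r / max_ratio for r in ratios]

def rejection_resample_plain(labels, initial_dist, target_dist, seed=0):
    rng = random.Random(seed)
    probs = acceptance_probs(initial_dist, target_dist)
    # Keep each label with the acceptance probability of its class.
    return [y for y in labels if rng.random() < probs[y]]

# Toy data: 1% positives, 99% negatives.
toy_labels = [1] * 1000 + [0] * 99000
toy_probs = acceptance_probs([0.99, 0.01], [0.5, 0.5])
toy_kept = rejection_resample_plain(toy_labels, [0.99, 0.01], [0.5, 0.5])
toy_kept_positive_fraction = sum(toy_kept) / len(toy_kept)
print(toy_probs, len(toy_kept), toy_kept_positive_fraction)
```

Most of the majority class is discarded, which is the price of loading the data only once.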

## Iterator Checkpointing

TensorFlow supports [taking checkpoints](https://www.tensorflow.org/guide/checkpoint) so that when your training process restarts it can restore the latest checkpoint to recover most of its progress. In addition to checkpointing the model variables, you can also checkpoint the progress of the dataset iterator. This is useful if you have a large dataset and don't want to start from the beginning on each restart. Note, however, that iterator checkpoints may be large, since transformations such as `shuffle` and `prefetch` require buffering elements within the iterator.

To include your iterator in a checkpoint, pass the iterator to the `tf.train.Checkpoint` constructor.

In [None]:
range_ds = tf.data.Dataset.range(20)

iterator = iter(range_ds)
ckpt = tf.train.Checkpoint(step=tf.Variable(0), iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, '/tmp/my_ckpt', max_to_keep=3)

print([next(iterator).numpy() for _ in range(5)])

save_path = manager.save()

print([next(iterator).numpy() for _ in range(5)])

ckpt.restore(manager.latest_checkpoint)

print([next(iterator).numpy() for _ in range(5)])

Note: It is not possible to checkpoint an iterator which relies on external state such as a `tf.py_function`. Attempting to do so will raise an exception complaining about the external state.
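
To make the save and restore behaviour concrete, here is a plain-Python analogy: an iterator that exposes its position as state. It is only a toy model of what checkpointing an iterator means; the class `CheckpointableRange` is hypothetical and unrelated to how `tf.train.Checkpoint` stores iterator state.

```python
class CheckpointableRange:
    """Toy iterator over range(n) whose position can be saved and restored."""

    def __init__(self, n):
        self.n = n
        self.position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= self.n:
            raise StopIteration
        value = self.position
        self.position += 1
        return value

    def save(self):
        # The only state needed to resume is the current position.
        return {'position': self.position}

    def restore(self, state):
        self.position = state['position']

toy_iterator = CheckpointableRange(20)
toy_first = [next(toy_iterator) for _ in range(5)]    # [0, 1, 2, 3, 4]
toy_state = toy_iterator.save()
toy_second = [next(toy_iterator) for _ in range(5)]   # [5, 6, 7, 8, 9]
toy_iterator.restore(toy_state)
toy_third = [next(toy_iterator) for _ in range(5)]    # [5, 6, 7, 8, 9] again
print(toy_first, toy_second, toy_third)
```

After the restore, iteration resumes from the saved position rather than from the start.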

## Using tf.data with tf.keras

The `tf.keras` API simplifies many aspects of creating and executing machine
learning models. Its `.fit()`, `.evaluate()`, and `.predict()` APIs accept datasets as inputs. Here is a quick dataset and model setup:

In [None]:
train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)

In [None]:
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])

Passing a dataset of `(feature, label)` pairs is all that's needed for `Model.fit` and `Model.evaluate`:

In [None]:
model.fit(fmnist_train_ds, epochs=2)

If you pass an infinite dataset, for example by calling `Dataset.repeat()`, you just need to also pass the `steps_per_epoch` argument:

In [None]:
model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)
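
A quick bit of arithmetic shows what `steps_per_epoch` changes. The sketch below assumes the standard Fashion-MNIST training split of 60,000 images and the batch size of 32 used above; the variable names are illustrative.

```python
import math

num_examples = 60000   # size of the Fashion-MNIST training split
batch_size = 32        # matches .batch(32) above

# Steps in one full pass over the finite dataset.
steps_for_full_pass = math.ceil(num_examples / batch_size)

# With a repeated dataset, an "epoch" is whatever steps_per_epoch says it is.
steps_per_epoch = 20
epochs = 2
examples_seen = steps_per_epoch * epochs * batch_size

print(steps_for_full_pass, examples_seen)
```

So the repeated-dataset call above trains on far fewer examples per epoch than a full pass does.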

For evaluation, pass the dataset directly; specifying the number of evaluation steps is optional:

In [None]:
loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)

For long datasets, set the number of steps to evaluate:

In [None]:
loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)

The labels are not required when calling `Model.predict`.

In [None]:
predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)

But the labels are ignored if you do pass a dataset containing them:

In [None]:
result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)