In [1]:
import os
import glob
from joblib import Parallel, delayed

# Convert classified tile image chips to TFRecords

We assume that we have a dataset consisting of one folder of images, and a parallel folder of label images. The files in each folder are identically named and the label image gives the ground-truth classification of each image in the images folder.
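As a minimal sketch of the pairing this layout implies, the snippet below matches image and label paths by filename and fails loudly if either folder holds a file the other lacks. The helper name `pair_by_basename` is hypothetical and is not part of `dl_segmentation_utils`.

```python
import os


def pair_by_basename(image_paths, label_paths):
    """Pair image and label paths that share the same filename (hypothetical helper)."""
    labels_by_name = {os.path.basename(p): p for p in label_paths}
    image_names = {os.path.basename(p) for p in image_paths}
    # Names present in only one of the two folders
    unmatched = image_names.symmetric_difference(labels_by_name)
    if unmatched:
        raise ValueError(f"Files without a counterpart: {sorted(unmatched)}")
    # Sort so that the pairing is deterministic regardless of glob order
    return [(p, labels_by_name[os.path.basename(p)]) for p in sorted(image_paths)]
```

The two input lists would typically come from `glob.glob` calls on the `images` and `labels` subfolders.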

This notebook contains approaches for efficiently converting such datasets into TFRecords.

## Convert classified image folders to tfrecords

For each pair of image/label files, we want to create a TensorFlow Example and then write these Examples to a .tfrecords file.

To parallelise this step, we'll create sharded output files, i.e. one .tfrecords file for each worker.
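One way to picture the sharding is as a split of the file list into contiguous index ranges, one per output file. The sketch below assumes an even contiguous split; `shard_ranges` is a hypothetical helper, and the package's own functions may divide the files differently.

```python
def shard_ranges(num_files, num_shards):
    """Return [start, end) index ranges splitting num_files into num_shards contiguous shards."""
    base, extra = divmod(num_files, num_shards)
    ranges = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards take one additional file each
        end = start + base + (1 if shard < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

For example, 10 files split into 3 shards gives ranges of 4, 3 and 3 files.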

We have two alternative approaches. The first uses TensorFlow code for image I/O and decoding; this code releases the GIL, so it can profitably be multithreaded. The second uses GDAL to decode images; this code cannot usefully be multithreaded, so it uses multiprocessing instead.

**Unless you have a specific reason to do otherwise, I recommend you use the multiprocessing approach; the difference in speed is minimal and it's a lot more flexible.**

### Define path to the image chips

Define 3 variables:

* `IMAGE_FOLDER` - the folder where the image chips have been created. Should contain subfolders called `images` and `labels`

* `TF_FOLDER` - the folder where the TFRecord files should be created

* `TF_DATASET` - the filename stem for the TFRecord files. The output filenames will be in the format `TF_DATASET-00001-of-nnnnn`
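The naming pattern can be illustrated with a short sketch. The 5-digit zero padding is inferred from the pattern above; whether shard numbering starts at 0 or 1, and whether an extension is appended, are not stated here, so treat `shard_filename` as a hypothetical helper rather than the package's behaviour.

```python
def shard_filename(stem, shard_index, num_shards):
    """Build a name such as 'stem-00001-of-00012' (assumes 5-digit zero-padded fields)."""
    return f"{stem}-{shard_index:05d}-of-{num_shards:05d}"


example_name = shard_filename("lusaka_50cm_rgb_2017", 1, 12)
```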


In [6]:
IMAGE_FOLDER = r'/mnt/c/Users/harry/Documents/Data/airbus_50cm_0.5m_4pad_256_lusaka_2017-closest/'

# Absolute-path alternative (unused; the relative path below is the one in effect):
# _tf_base = r'/mnt/c/Users/harry/Documents/Data/lusaka_tfrecords'
_tf_base = r'../../Data/lusaka_tfrecords'
TF_FOLDER = os.path.join(_tf_base, 'lusaka_50cm_rgb_2017')
TF_DATASET = "lusaka_50cm_rgb_2017"


---
## Approach 1: Translate to tfrecords via multithreaded process.

This is taken from https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/inception/inception/data/build_image_data.py#L291

This uses TensorFlow libraries for all the time-consuming parts, such as reading the image files. Because these routines release the GIL while they run, they can be parallelised through multithreading. However, this means that we're limited to what they provide in terms of supported data types etc. In particular, they don't support GeoTIFFs or images with more than 8 bits per band.

So this approach can only be used for 8-bit RGB imagery in JPG/PNG/BMP format.
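To illustrate why threads help only when the work releases the GIL, here is a minimal standard-library sketch: plain file reads block in the operating system, so several of them can overlap in a thread pool. It is an analogy for the TensorFlow-based reader, not the implementation used by `images_to_tfrecords_mt`; the names `read_bytes` and `read_all` are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    # CPython releases the GIL while a blocking file read is in progress
    with open(path, "rb") as f:
        return f.read()


def read_all(paths, num_threads=4):
    # pool.map() returns results in the same order as the inputs
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(read_bytes, paths))


# Small demonstration on throwaway files of 1, 2 and 3 bytes
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(3):
        p = os.path.join(tmp, f"chip_{i}.bin")
        with open(p, "wb") as f:
            f.write(bytes([i]) * (i + 1))
        demo_paths.append(p)
    demo_sizes = [len(b) for b in read_all(demo_paths)]
```

Pure-Python decoding work would not benefit in the same way, because it holds the GIL for its whole duration.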

### Translate all the tiffs to PNGs if needed

The multithreaded process can only read PNG/JPG/BMP format images. If you have 3-band 8-bit imagery in some other format (e.g. GeoTIFF) then you can use this cell to translate them all to PNGs in order to then use the multithreaded translation to TFRecords. 

In [None]:
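# define a folder for the translated PNG image chips
# (raw string, so that "\t" is not read as a tab character)
png_images = r"path\to\png_output"

import rasterio

def translate_tif_to_png(tif_path):
    """Translate one GeoTIFF chip to a PNG at the mirrored location under png_images."""
    with rasterio.open(tif_path, 'r') as ds:
        # mirror the path relative to IMAGE_FOLDER, whether or not it ends in a slash
        rel_path = os.path.relpath(tif_path, IMAGE_FOLDER)
        out_path = os.path.join(png_images, os.path.splitext(rel_path)[0] + '.png')
        # exist_ok avoids a race when several workers create the same folder
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        with rasterio.open(out_path, 'w', driver='PNG',
                           width=ds.width, height=ds.height, count=ds.count,
                           dtype=ds.dtypes[0], nodata=ds.nodata, transform=ds.transform,
                           crs=ds.crs) as dst:
            dst.write(ds.read())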

images = glob.glob(os.path.join(IMAGE_FOLDER, 'images',  '*.tif'))
labels = glob.glob(os.path.join(IMAGE_FOLDER, 'labels',  '*.tif'))
Parallel(n_jobs=8)(delayed(translate_tif_to_png)(t) for t in images)
Parallel(n_jobs=8)(delayed(translate_tif_to_png)(t) for t in labels)
IMAGE_FOLDER = png_images

In [13]:
IMAGE_FOLDER = png_images

### Run multithreaded TFRecord creation

In [1]:
from dl_segmentation_utils import images_to_tfrecords_mt
import tensorflow as tf
import numpy as np

We can use this function to translate a complete image folder tree to sharded tfrecords if the images are 8-bit PNG or JPG files (although it is currently coded to look only for PNG). PNGs can be transcoded to JPG along the way, which decreases file size but is lossy.

This process is fast because it uses optimised TF code throughout.

In [None]:
images_to_tfrecords_mt(name="airbus_raw_mumbai_png", directory=IMAGE_FOLDER, out_directory=TF_FOLDER,
                       num_shards=12, num_threads=12,
                       dltile_from_filename=True, convert_png_to_jpg=False,
                       store_as_array=False)

---

## Approach 2: Translate to tfrecords via multiprocessing

TensorFlow's native code doesn't have any readers for TIFF files, and its image handling in general seems to be strongly oriented around 3-band 8-bit images, so it isn't helpful for multispectral TIFF files with higher bit depths.

We'll use rasterio (a wrapper around GDAL) to load the image data. However, within this code we still use the TF API to read the raw file from disk into memory; rasterio then parses it as an in-memory dataset. This is much faster than having rasterio read the files from disk itself.
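The read-then-parse pattern described above can be sketched as follows. This is a minimal illustration that assumes `tensorflow` and `rasterio` are installed; `read_tif_via_memory` is a hypothetical name, and the actual code inside `images_to_tfrecords_mp` may differ.

```python
def read_tif_via_memory(path):
    """Read a raster file's bytes with the TF file API, then decode them with rasterio."""
    # Imported here so that defining the function does not require either library
    import tensorflow as tf
    from rasterio.io import MemoryFile

    # Step 1: pull the raw, still-encoded file contents into memory
    with tf.io.gfile.GFile(path, "rb") as f:
        raw = f.read()

    # Step 2: let rasterio (GDAL) parse those bytes as an in-memory dataset
    with MemoryFile(raw) as memfile:
        with memfile.open() as ds:
            # Array with shape (bands, rows, cols), in the file's native dtype
            return ds.read()
```

Because the decoding is done by GDAL rather than by TensorFlow's image ops, this route is not restricted to 3-band 8-bit imagery.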

We can't profitably use multithreading because the GIL won't be released, so it won't be any faster. Instead we use multiprocessing: the images are split into shards / batches, and each process writes one or more separate tfrecords files.

In [3]:
from dl_segmentation_utils import images_to_tfrecords_mp

In [4]:
import tensorflow as tf

In [8]:
images_to_tfrecords_mp(name=TF_DATASET, directory=IMAGE_FOLDER, out_directory=TF_FOLDER,
                       num_shards=12, num_proc=12,
                       dltile_from_filename=True, file_ext='tif',
                       store_as_array=True)

Determining list of input files and labels from /mnt/c/Users/harry/Documents/Data/airbus_50cm_0.5m_4pad_256_lusaka_2017-closest/.
Found 3444 tif image files and 3444 label files inside /mnt/c/Users/harry/Documents/Data/airbus_50cm_0.5m_4pad_256_lusaka_2017-closest/.


---
## Approach 3: Don't translate to tfrecords files; define a mapping in the pipeline


(TBD)

In [45]:
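# `res` is not defined in this notebook; from its use here it is a sequence of
# (image_path, label_path) tuples
img_files = tf.constant([tpl[0] for tpl in res])
lbl_files = tf.constant([tpl[1] for tpl in res])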

In [46]:
ds = tf.data.Dataset.from_tensor_slices((img_files, lbl_files))

In [None]:
from dl_segmentation_utils import convert_to_example
# todo - intended this to be private, expose it on the package if we do want to do this.
from dl_segmentation_utils._img_to_tf_threaded import _process_image, ImageCoder

coder = ImageCoder()

def eg_from_image_paths(image_file, label_file):
    img_arr, h, w, c, k = _process_image(image_file, coder)
    target_arr, _, _, _, k2 = _process_image(label_file, coder)
    # image and label must carry the same tile key
    assert k == k2
    eg = convert_to_example(img_arr, target_arr, h, w, c, h, w, k)
    #[img_arr, lbl_arr, shp, dlkey] = tf.py_function(_process_image_and_lbl, image_file, label_file)
    #img_arr.set_shape(shp)
    # image array, label array, image shape and tile key
    return img_arr, target_arr, (h, w, c), k

In [None]:
ds = ds.map(eg_from_image_paths)

In [None]:

# convert classified image folders to tfrecords

def tfrecord_from_images(imgpath, lblpath):
    # parse the dl tile key back out of the filename
    # (unfinished: the key is parsed but nothing is written yet)
    dl_tile_key = os.path.basename(imgpath).split()

infiles = glob.glob(r'E:\Temp\mumbai_esri_train2m\images\*.tif')
len(infiles)