In this competition, we deal with TIFF images. I had never worked with this format before, so my first question was: which library should I use to read it? The most obvious choices were the `Pillow` and `tifffile` packages. However, neither worked: the former failed with `DecompressionBombError` (!), while the latter complained about a NumPy array being non-contiguous for one of the images in the dataset.

[After some searching](https://stackoverflow.com/questions/7569553/working-with-tiffs-import-export-in-python-using-numpy), I stumbled upon the `osgeo` package, which provides the Python bindings for GDAL. As the name suggests, the library was written for geospatial data such as maps and satellite images. But it has TIFF reading functionality built in, so why not apply it to medical imagery, right?

Here I show my approach to reading the dataset images. It worked for me right out of the box, without any of the errors I encountered with the other libraries. But if you know a better/faster way, please let me know!

In [1]:
import gc
import glob
import json
from os.path import basename, dirname, splitext
import numpy as np
from osgeo import gdal

In [2]:
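def human_readable_size(arr: np.ndarray) -> str:
    """Gets array's size as a verbose, human-readable string."""
    
    n = arr.nbytes
    units = ('bytes', 'Kb', 'Mb', 'Gb')  # 1024-based units
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available,
        # so that the value and the unit label never get out of sync.
        if n < 1024 or unit == units[-1]:
            break
        n /= 1024
    return f'{n:.3f} {unit}'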

In [3]:
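def read_tiff(path: str) -> np.ndarray:
    """Reads a TIFF file into a (channels, height, width) uint8 array."""
    
    dataset = gdal.Open(path, gdal.GA_ReadOnly)
    if dataset is None:
        # gdal.Open returns None on failure unless gdal.UseExceptions() is enabled.
        raise OSError(f'Cannot open file: {path}')
    n_channels = dataset.RasterCount
    width = dataset.RasterXSize
    height = dataset.RasterYSize
    # Assumes 8-bit samples; wider samples would be cast to uint8 on assignment.
    image = np.zeros((n_channels, height, width), dtype=np.uint8)
    for i in range(n_channels):
        band = dataset.GetRasterBand(i+1)  # GDAL bands are 1-indexed
        channel = band.ReadAsArray()
        image[i] = channel
    return image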
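
Since a single decoded image can take several gigabytes, it may be handy to read only a rectangular window instead of the whole raster. The sketch below assumes that `ReadAsArray` on a GDAL band accepts a pixel offset and a window size as its first four positional arguments. `iter_windows` and `read_window` are hypothetical helper names that are not part of the original notebook, and the tile size is arbitrary.

```python
from typing import Iterator, Tuple

import numpy as np


def iter_windows(width: int, height: int, tile: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yields (xoff, yoff, xsize, ysize) windows covering the whole raster."""
    for yoff in range(0, height, tile):
        for xoff in range(0, width, tile):
            # Windows on the right and bottom edges are clipped to the raster.
            yield xoff, yoff, min(tile, width - xoff), min(tile, height - yoff)


def read_window(dataset, xoff: int, yoff: int, xsize: int, ysize: int) -> np.ndarray:
    """Reads one window from an opened GDAL dataset as (channels, ysize, xsize)."""
    bands = [
        # Assumption: positional arguments are (xoff, yoff, win_xsize, win_ysize).
        dataset.GetRasterBand(i + 1).ReadAsArray(xoff, yoff, xsize, ysize)
        for i in range(dataset.RasterCount)
    ]
    return np.stack(bands)
```

With a dataset opened via `gdal.Open(path, gdal.GA_ReadOnly)`, iterating over `iter_windows(dataset.RasterXSize, dataset.RasterYSize, 1024)` and passing each window to `read_window` should keep only one tile in memory at a time.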

In [4]:
meta = []

for filename in glob.glob('/kaggle/input/hubmap-kidney-segmentation/**/*.tiff'):
    print(f'Processing file: {filename}')
    identifier, _ = splitext(basename(filename))
    subset = basename(dirname(filename))
    img = read_tiff(filename)
    meta.append(dict(
        identifier=identifier,
        filename=filename,
        subset=subset,
        memory_bytes=img.nbytes,
        memory_human_readable=human_readable_size(img),
        image_shape=img.shape
    ))
    del img
    gc.collect()

Processing file: /kaggle/input/hubmap-kidney-segmentation/train/095bf7a1f.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/1e2425f28.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/54f2eec69.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/cb2d976f4.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/aaa6a05cc.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/0486052bb.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/2f6ecfcdf.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/train/e79de561c.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/test/afa5e8098.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/test/26dc41664.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/test/b9a3865fc.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentation/test/c68fe75ea.tiff
Processing file: /kaggle/input/hubmap-kidney-segmentatio
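
A note on the pattern used in the loop: without `recursive=True`, `glob` treats `**` like a single `*`, so the pattern matches files exactly one directory below the dataset root. That is enough for the `train/` and `test/` layout seen in the log. The sketch below illustrates the difference on a throwaway directory tree; the file names are made up for the illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one file at the root, one a level down, one two levels down.
    for rel in ('top.tiff', 'train/a.tiff', 'train/extra/b.tiff'):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()

    pattern = os.path.join(root, '**', '*.tiff')
    # Without recursive=True, '**' matches exactly one directory level.
    flat = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))
    # With recursive=True, '**' matches zero or more directory levels.
    deep = sorted(os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True))

print(flat)
print(deep)
```

If the dataset ever gained nested folders, passing `recursive=True` would be the variant to reach for.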

In [5]:
for info in meta:
    print(
        f'id: {info["identifier"]}, '
        f'memory: {info["memory_human_readable"]:>10s}, '
        f'shape: {info["image_shape"]}'
    )

id: 095bf7a1f, memory:   4.158 Gb, shape: (3, 38160, 39000)
id: 1e2425f28, memory:   2.411 Gb, shape: (3, 26780, 32220)
id: 54f2eec69, memory:   1.891 Gb, shape: (3, 30440, 22240)
id: cb2d976f4, memory:   4.837 Gb, shape: (3, 34940, 49548)
id: aaa6a05cc, memory: 688.168 Mb, shape: (3, 18484, 13013)
id: 0486052bb, memory:   2.517 Gb, shape: (3, 25784, 34937)
id: 2f6ecfcdf, memory:   2.254 Gb, shape: (3, 31278, 25794)
id: e79de561c, memory:   1.221 Gb, shape: (3, 16180, 27020)
id: afa5e8098, memory:   4.501 Gb, shape: (3, 36800, 43780)
id: 26dc41664, memory:   4.516 Gb, shape: (3, 38160, 42360)
id: b9a3865fc, memory:   3.535 Gb, shape: (3, 31295, 40429)
id: c68fe75ea, memory:   3.733 Gb, shape: (3, 26840, 49780)
id: b2dc8411c, memory:   1.297 Gb, shape: (3, 14844, 31262)
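
The reported sizes follow directly from the shapes: with one byte per `uint8` sample, an array occupies channels × height × width bytes. Here is a quick check of the first row, assuming the same 1024-based units as `human_readable_size` (the helper name `nbytes_from_shape` is hypothetical):

```python
import math


def nbytes_from_shape(shape, itemsize=1):
    """Bytes needed for a dense array of the given shape (itemsize=1 for uint8)."""
    return math.prod(shape) * itemsize


n = nbytes_from_shape((3, 38160, 39000))
print(n)                        # 4464720000
print(f'{n / 1024**3:.3f} Gb')  # 4.158 Gb
```

Because the shape is available from `RasterCount`, `RasterYSize`, and `RasterXSize` before any pixels are read, the same table could in principle be built without loading the images.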


In [6]:
with open('/kaggle/working/meta.json', 'w') as fp:
    json.dump(meta, fp)
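
When the file is read back, note that JSON has no tuple type, so `image_shape` comes back as a list. Below is a minimal round-trip sketch that uses a temporary file in place of `/kaggle/working/meta.json` and a single record with only some of the fields:

```python
import json
import os
import tempfile

# One record taken from the printed table; the other fields are omitted for brevity.
meta = [dict(identifier='095bf7a1f', subset='train',
             memory_bytes=4464720000, image_shape=(3, 38160, 39000))]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'meta.json')
    with open(path, 'w') as fp:
        json.dump(meta, fp)
    with open(path) as fp:
        loaded = json.load(fp)

# Tuples are serialized as JSON arrays and come back as lists.
shape = tuple(loaded[0]['image_shape'])
print(shape)
```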