Using XArray to read tiled GeoTIFF datasets
===========================================

This notebook shows how to use XArray and Dask to process large GeoTIFF datasets efficiently.

Download Data
-------------

Let's download a sample GeoTIFF dataset:

https://oin-hotosm.s3.amazonaws.com/5abae68e65bd8f00110f3e42/0/5abae68e65bd8f00110f3e43.tif

In [1]:
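import os
import requests

url = 'https://oin-hotosm.s3.amazonaws.com/5abae68e65bd8f00110f3e42/0/5abae68e65bd8f00110f3e43.tif'

# Download only if the file is not already on disk
if not os.path.exists('myfile.tif'):
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open('myfile.tif', 'wb') as f:
        f.write(response.content)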

Look at metadata with XArray and Rasterio
-----------------------------------------

In [3]:
import xarray as xr
xr.open_rasterio('myfile.tif')  # this only reads metadata to start

<xarray.DataArray (band: 3, y: 10376, x: 10211)>
[317848008 values with dtype=uint8]
Coordinates:
  * band     (band) int64 1 2 3
  * y        (y) float64 41.55 41.55 41.55 41.55 41.55 41.55 41.55 41.55 ...
  * x        (x) float64 -70.61 -70.61 -70.61 -70.61 -70.61 -70.61 -70.61 ...
Attributes:
    transform:   (-70.61029151325789, 2.858800307578235e-07, 0.0, 41.55024643...
    crs:         +init=epsg:4326
    res:         (2.858800307578235e-07, 2.1475464908746744e-07)
    is_tiled:    1
    nodatavals:  (0.0, 0.0, 0.0)

In [4]:
import rasterio
img = rasterio.open('myfile.tif')
img

<open RasterReader name='myfile.tif' mode='r'>

In [5]:
img.is_tiled  # can we read this data in chunks?

True

In [6]:
set(img.block_shapes)  # what are the block shapes that we expect from this file?

{(512, 512)}

Great, this dataset is stored band by band (each band separate), in x/y blocks of 512x512 pixels. We'll want our Dask array chunks to be somewhat larger than this, but still a clean multiple of the block size.
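
As a rough sketch of what a "clean multiple" means here, the helper below rounds a target chunk edge down to a multiple of the file's block edge, so that every Dask chunk covers whole 512x512 tiles. The name `aligned_chunk` is hypothetical and not part of XArray, Dask, or Rasterio; the block shape is the one reported by `img.block_shapes` above.

```python
def aligned_chunk(target, block):
    """Largest multiple of `block` that is <= `target` (never less than one block)."""
    if block <= 0:
        raise ValueError("block must be positive")
    return max(block, (target // block) * block)


block_y, block_x = 512, 512              # block shape of the GeoTIFF
chunk_y = aligned_chunk(2048, block_y)   # chunk edge along y
chunk_x = aligned_chunk(2048, block_x)   # chunk edge along x

# Number of on-disk tiles covered by one full Dask chunk
tiles_per_chunk = (chunk_y // block_y) * (chunk_x // block_x)
print(chunk_y, chunk_x, tiles_per_chunk)
```

With a 2048x2048 chunk, each full chunk maps onto a 4x4 group of tiles, so no tile has to be read by more than one chunk.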

Create lazy XArray dataset around GeoTIFF file
----------------------------------------------

In [9]:
ds = xr.open_rasterio('myfile.tif', 
                      chunks={'band': 1, 'x': 2048, 'y': 2048})
ds

<xarray.DataArray (band: 3, y: 10376, x: 10211)>
dask.array<shape=(3, 10376, 10211), dtype=uint8, chunksize=(1, 2048, 2048)>
Coordinates:
  * band     (band) int64 1 2 3
  * y        (y) float64 41.55 41.55 41.55 41.55 41.55 41.55 41.55 41.55 ...
  * x        (x) float64 -70.61 -70.61 -70.61 -70.61 -70.61 -70.61 -70.61 ...
Attributes:
    transform:   (-70.61029151325789, 2.858800307578235e-07, 0.0, 41.55024643...
    crs:         +init=epsg:4326
    res:         (2.858800307578235e-07, 2.1475464908746744e-07)
    is_tiled:    1
    nodatavals:  (0.0, 0.0, 0.0)

Note that the underlying data is a Dask array rather than a NumPy array:

In [10]:
ds.variable.data

dask.array<open_rasterio-68063c932a231ca3c4260a129dc70974<this-array>, shape=(3, 10376, 10211), dtype=uint8, chunksize=(1, 2048, 2048)>
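
To get a feel for how much work this chunking creates, the arithmetic below derives the chunk grid from the shape and chunk size shown in the repr above. It is plain Python rather than a Dask call, and it assumes one byte per value because the dtype is uint8.

```python
import math

shape = (3, 10376, 10211)     # (band, y, x), from the repr above
chunksize = (1, 2048, 2048)   # chunks passed to open_rasterio

# Chunks along each dimension, rounding up for the partial edge chunk
chunks_per_dim = tuple(math.ceil(n / c) for n, c in zip(shape, chunksize))
total_chunks = math.prod(chunks_per_dim)

# Size of the last (edge) chunk along each dimension
last_chunk = tuple(
    n - (k - 1) * c for n, c, k in zip(shape, chunksize, chunks_per_dim)
)

# Bytes held by one full chunk of uint8 values
bytes_per_chunk = math.prod(chunksize)

print(chunks_per_dim, total_chunks, last_chunk, bytes_per_chunk)
```

This gives a 3 x 6 x 5 grid, or 90 chunks of at most 4 MiB each, with smaller chunks along the bottom and right edges of the image.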

Optionally create a Dask Client
-------------------------------

I create a client only so that I can watch the dashboard during execution. You don't need to run this; everything works fine with the local thread pool scheduler.

In [8]:
from dask.distributed import Client
client = Client(processes=False)
client

Client
  Scheduler: inproc://192.168.50.100/7229/1
  Dashboard: http://localhost:8787/status
Cluster
  Workers: 1
  Cores: 4
  Memory: 16.68 GB


In [11]:
import dask
ds.sum(dim=['x', 'y']).compute()

<xarray.DataArray (band: 3)>
array([10452513028, 10553994024, 12156173330], dtype=uint64)
Coordinates:
  * band     (band) int64 1 2 3
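
The per-band sums above exceed 4294967295, the largest value an unsigned 32-bit integer can hold, which is why the result comes back as uint64 even though the pixels are uint8. Conceptually, a chunked reduction like this sums each chunk independently and then combines the partial results. The following is a minimal pure-Python sketch of that idea on a small made-up grid; `chunked_sum` is a hypothetical helper and does not reflect how Dask implements the reduction internally.

```python
def chunked_sum(grid, chunk):
    """Sum a 2-D list of ints tile by tile, then combine the partial sums."""
    rows, cols = len(grid), len(grid[0])
    partials = []
    for y0 in range(0, rows, chunk):
        for x0 in range(0, cols, chunk):
            # Sum one tile; edge tiles are simply smaller
            tile_total = sum(
                sum(row[x0:x0 + chunk]) for row in grid[y0:y0 + chunk]
            )
            partials.append(tile_total)
    return sum(partials), len(partials)


# Made-up 7x10 grid of values in the uint8 range
grid = [[(3 * y + x) % 256 for x in range(10)] for y in range(7)]
total, n_tiles = chunked_sum(grid, chunk=4)

# The tiled result matches a direct sum over the whole grid
assert total == sum(sum(row) for row in grid)
print(total, n_tiles)
```

Because each tile's partial sum is independent of the others, the tiles can be processed in any order or in parallel, which is what makes this kind of reduction a good fit for chunked arrays.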