# Raster acquisition, processing and analysis with Databricks

Today we'll answer the biggest open question in the field of Earth observation and GIS:
__which British golf course has the greenest, healthiest vegetation?__

<img src='./assets/John-Daly-4.jpg'/>

In this first notebook, we will demonstrate how to:
- Install and configure a Databricks cluster ready for raster processing, including installing the Databricks Labs Mosaic[↗︎](https://github.com/databrickslabs/mosaic) project and its GDAL[↗︎](https://gdal.org/) extensions;
- Read a publicly available vector dataset describing green space locations in Great Britain and prepare this for later use by reprojecting coordinates and converting the geometries into GeoJSON format;
- Query the Microsoft Planetary Computer's Sentinel 2 catalog[↗︎](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) to obtain links to the relevant imagery for our areas of interest; and
- Download the single-band GeoTIFF images to a location in the Databricks file system.

## Install the libraries and prepare the environment

This demo needs a few spatial libraries, all of which can be installed with pip. We will use gdal, rasterio, pystac and databricks-mosaic to download and manipulate the data, with the Microsoft Planetary Computer as the source of the raster data for the analysis.

In [0]:
import os

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
project_path = os.path.dirname(notebook_path)
os.environ["PROJECTCWD"] = project_path

%pip install /Workspace$PROJECTCWD/databricks_mosaic-0.4.3-py3-none-any.whl
%pip install --quiet rasterio==1.3.5 gdal==3.4.1 pystac pystac_client planetary_computer tenacity rich osdatahub

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Processing /Workspace/Users/stuart.lynn@databricks.com/customers/Arup/sentinel2-demo/databricks_mosaic-0.4.3-py3-none-any.whl
Collecting geopandas<0.14.4,>=0.14
  Using cached geopandas-0.14.3-py3-none-any.whl (1.1 MB)
Collecting h3<4.0,>=3.7
  Using cached h3-3.7.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting keplergl==0.3.2
  Using cached keplergl-0.3.2-py2.py3-none-any.whl
Collecting traittypes>=0.2.1
  Using cached traittypes-0.2.1-py2.py3-none-any.whl (8.6 kB)
Collecting Shapely>=1.6.4.post2
  Using cached shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Collecting pyproj>=3.3.0
  Using cached pyproj-3.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.3 MB)
Collecting fiona>=1.8.21
  Using cached fiona-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting 

In [0]:
%reload_ext autoreload
%autoreload 2

In [0]:
import library
import pystac_client
import planetary_computer
import mosaic as mos

from datetime import datetime
from osdatahub import OpenDataDownload
from pyspark.sql import functions as F
from pyspark.sql import Window

data_product = "OpenGreenspace"

current_user = spark.sql("select current_user() as user").first()["user"]
data_root = f"/tmp/{current_user}/{data_product}/data"
output_path = data_root.replace("/data", "/outputs")

dbutils.fs.mkdirs(data_root)
dbutils.fs.mkdirs(output_path)

os.environ["DATADIR"] = f"/dbfs{data_root}"
os.environ["OUTDIR"] = f"/dbfs{output_path}"

CATALOG = "stuart"
SCHEMA = "arup"

spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")

DataFrame[]

## 1a. Acquire areas of interest (AoI) dataset
For this example, we shall use a publicly available set of shape data: the [Ordnance Survey OpenGreenspace](https://www.ordnancesurvey.co.uk/products/os-open-greenspace) product. This dataset describes the locations of green spaces of various types across Great Britain.

The dataset is available in multiple formats but we'll take the GeoPackage since that's usually a simple, reliable format to work with.

In [0]:
downloader = OpenDataDownload(data_product)
os.environ["PRODUCT"] = product_filename = "opgrsp_gpkg_gb.zip"
downloader.download(output_dir=f"/dbfs{data_root}", file_name=product_filename, overwrite=True)


Finished downloading opgrsp_gpkg_gb.zip to /dbfs/tmp/stuart.lynn@databricks.com/OpenGreenspace/data/opgrsp_gpkg_gb.zip





['/dbfs/tmp/stuart.lynn@databricks.com/OpenGreenspace/data/opgrsp_gpkg_gb.zip']

In [0]:
%sh
mkdir -p $DATADIR/geopackage
unzip -n $DATADIR/opgrsp_gpkg_gb.zip -d $DATADIR/geopackage/

Archive:  /dbfs/tmp/stuart.lynn@databricks.com/OpenGreenspace/data/opgrsp_gpkg_gb.zip


In [0]:
%sh ls -lah $DATADIR/geopackage/Data/

total 122M
drwxrwxrwx 2 nobody nogroup 4.0K May 28 17:55 .
drwxrwxrwx 2 nobody nogroup 4.0K May 28 17:55 ..
-rwxrwxrwx 1 nobody nogroup 122M May 28 17:55 opgrsp_gb.gpkg


## 1b. Read AoI data into Spark, reproject and store in Delta Lake

We will use Mosaic to read in this vector dataset, reproject it into the WGS84 coordinate reference system (EPSG:4326) and write the feature geometries and their associated properties into a table in Unity Catalog.

- Enabling Mosaic is a straightforward call to `mosaic.enable_mosaic()`.
- If we want to use Mosaic's GDAL extensions (multiple vector and raster format readers, raster transformation functions etc.) then we also need to call `mosaic.enable_gdal()`.
- GDAL needs to be installed and available on the cluster, and Mosaic can also help us with this task (see the instructions [here](https://databrickslabs.github.io/mosaic/usage/install-gdal.html)).
- The raster functions in Mosaic have two modes of operation. We'll opt for the more stable 'checkpointing enabled' mode, which persists intermediate raster results to a location in DBFS during execution of Spark jobs.

In [0]:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark, with_checkpoint_path=f"/dbfs{output_path}/checkpoint/{datetime.now().isoformat()}")

GDAL enabled.

checkpoint path '/dbfs/tmp/stuart.lynn@databricks.com/OpenGreenspace/outputs/checkpoint/2024-09-11T15:35:23.953585' configured for this session.
GDAL 3.4.1, released 2021/12/27




Mosaic has specialised readers for geopackages, file geodatabases, shapefiles and other GDAL supported vector data formats (see [docs](https://databrickslabs.github.io/mosaic/api/vector-format-readers.html) for more info).

It even has a mechanism for parallelising the read process by 'chunking' the source data and allocating a chunk of features to be read to a Spark task ([docs](https://databrickslabs.github.io/mosaic/api/vector-format-readers.html#mos-read-format-multi-read-ogr)).

In [0]:
green_spaces = (
  mos.read().format("multi_read_ogr")
  .option("chunkSize", "500")
  .option("layerName", "greenspace_site")
  .load(f"{data_root}/geopackage/Data/opgrsp_gb.gpkg")
  ).cache()
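
To illustrate the chunking idea in isolation, here is a minimal plain-Python sketch. It is not Mosaic's actual implementation: it simply splits a layer's feature count into contiguous index ranges of at most `chunk_size` features, each of which could be handed to a separate task. The function name `feature_chunks` and the feature count used below are hypothetical.

```python
from typing import List, Tuple


def feature_chunks(feature_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split `feature_count` features into half-open (start, end) index ranges.

    Illustrative only: this mirrors the idea behind the `chunkSize` option,
    not the way Mosaic implements it internally.
    """
    if feature_count < 0:
        raise ValueError("feature_count must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, feature_count))
        for start in range(0, feature_count, chunk_size)
    ]


# Hypothetical layer of 1,234 features read in chunks of 500.
chunks = feature_chunks(1234, 500)
print(chunks)  # [(0, 500), (500, 1000), (1000, 1234)]
```

A smaller chunk size yields more, lighter tasks; a larger one yields fewer, heavier tasks. The right balance depends on the size of the source file and the cluster.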



In [0]:
green_spaces.display()

+--------------------+--------------------+--------------------+------------------+------------------+------------------+--------------------+-------------+
|                  id|            function|  distinctive_name_1|distinctive_name_2|distinctive_name_3|distinctive_name_4|            geometry|geometry_srid|
+--------------------+--------------------+--------------------+------------------+------------------+------------------+--------------------+-------------+
|10687FC3-B01A-64A...|   Religious Grounds|   St Helen's Church|                  |                  |                  |MULTIPOLYGON (((4...|        27700|
|10687FB4-EBA1-64A...|   Religious Grounds|Old Kilpatrick Pa...|                  |                  |                  |MULTIPOLYGON (((2...|        27700|
|10687F6C-4D71-64A...|       Playing Field|                    |                  |                  |                  |MULTIPOLYGON (((3...|        27700|
|10688009-8491-64A...|          Play Space|               

For the purposes of this demo, we'll select one type of green space for our analysis (golf courses).

The feature geometries are supplied in the British National Grid projection (EPSG:27700). To make our data processing easier, we'll reproject these into EPSG:4326.

Many of the feature geometries are compound geometries, i.e. `MULTIPOLYGON`. For the sake of keeping the task as simple as possible, we'll unpack these into their constituent parts.

In [0]:
aoi_type = "Golf Course"
aoi_table_ref = f"{CATALOG}.{SCHEMA}.aois"

aois = (
  green_spaces
  .where(F.col("function") == aoi_type)
  .withColumn("geometry_4326", mos.st_updatesrid("geometry", "geometry_srid", F.lit(4326)))
  )
aois.write.mode("overwrite").saveAsTable(aoi_table_ref)
aois = spark.table(aoi_table_ref)
aois.count()

2998
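
The cell above filters and reprojects the features but leaves each one as a single `MULTIPOLYGON` row. As a minimal sketch of what unpacking a compound geometry into its constituent parts means, the plain-Python function below splits a GeoJSON MultiPolygon into its Polygon members. The function name `explode_multipolygon` and the sample coordinates are hypothetical, and this is not the Mosaic API.

```python
from typing import Any, Dict, List


def explode_multipolygon(geometry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the Polygon parts of a GeoJSON geometry dictionary.

    A MultiPolygon's `coordinates` member is a list of Polygon coordinate
    arrays, so each entry becomes its own Polygon. A Polygon is returned
    unchanged as a single-element list.
    """
    geom_type = geometry.get("type")
    if geom_type == "Polygon":
        return [geometry]
    if geom_type == "MultiPolygon":
        return [
            {"type": "Polygon", "coordinates": rings}
            for rings in geometry["coordinates"]
        ]
    raise ValueError(f"Unsupported geometry type: {geom_type!r}")


# Two made-up squares standing in for a course made of two separate parcels.
course = {
    "type": "MultiPolygon",
    "coordinates": [
        [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
        [[[2.0, 0.0], [3.0, 0.0], [3.0, 1.0], [2.0, 1.0], [2.0, 0.0]]],
    ],
}
parts = explode_multipolygon(course)
print(len(parts))  # 2
```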

Let's go ahead and examine a subset of these using the excellent kepler.gl mapping tool.

In [0]:
filter_geom = "POLYGON (( -5.15 51.8, -4.9 51.8, -4.9 51.65, -5.15 51.65, -5.15 51.8 ))"

to_show = (
  aois.select("id", "geometry_4326")
  .where(mos.st_intersects("geometry_4326", F.lit(filter_geom)))
  )

to_show.count()

4

Here's an example of the map we'd expect to see if we run the following cell.
<img src='./assets/wales-course.png'/>

In [0]:
%%mosaic_kepler
to_show geometry_4326 geometry

## 2. Acquire imagery from the Planetary Computer

The `pystac_client` library makes it fairly easy to interface with remote raster data catalogs: we can browse resource collections and individual assets.

For our search here, we'll take the geometries of our AoIs, express them as GeoJSON and use them to query the catalog. A time range filter can also be supplied, as well as an upper bound on the level of cloud cover in the imagery.

In [0]:
time_range = "2021-05-01/2021-07-31"
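
The notebook's own search lives in a `library` helper module that is not shown here. As a rough sketch of what a query for a single AoI could look like with `pystac_client` directly, the block below assembles the search arguments (geometry, time range and a cloud-cover bound) and passes them to the Planetary Computer STAC endpoint. The function names, the 10% cloud-cover threshold and the sample polygon are assumptions made for illustration.

```python
from typing import Any, Dict, List

STAC_URL = "https://planetarycomputer.microsoft.com/api/stac/v1"


def build_search_kwargs(
    geojson: Dict[str, Any],
    time_range: str,
    max_cloud_cover: float,
    collection: str = "sentinel-2-l2a",
) -> Dict[str, Any]:
    """Assemble keyword arguments for `pystac_client.Client.search`."""
    return {
        "collections": [collection],
        "intersects": geojson,
        "datetime": time_range,
        # STAC query extension: keep items below the cloud-cover threshold.
        "query": {"eo:cloud_cover": {"lt": max_cloud_cover}},
    }


def search_items(search_kwargs: Dict[str, Any]) -> List[Any]:
    """Run the search against the Planetary Computer (needs network access)."""
    # Imported lazily so build_search_kwargs works without these packages.
    import planetary_computer
    import pystac_client

    catalog = pystac_client.Client.open(
        STAC_URL,
        modifier=planetary_computer.sign_inplace,  # signs asset URLs for download
    )
    return list(catalog.search(**search_kwargs).items())


# A made-up AoI polygon inside the filter box used for the map earlier.
aoi = {
    "type": "Polygon",
    "coordinates": [[
        [-5.0, 51.7], [-4.95, 51.7], [-4.95, 51.72], [-5.0, 51.72], [-5.0, 51.7],
    ]],
}
search_kwargs = build_search_kwargs(aoi, "2021-05-01/2021-07-31", max_cloud_cover=10)
print(search_kwargs["query"])
```

Calling `search_items(search_kwargs)` would return the matching STAC items, each exposing its downloadable files through its `assets` dictionary.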

In [0]:
json_geoms = (
  aois
  .select("id", "geometry_4326")
  .withColumn("geojson", mos.st_asgeojson("geometry_4326"))
)

In [0]:
json_geoms.display()

+--------------------+--------------------+--------------------+
|                  id|       geometry_4326|             geojson|
+--------------------+--------------------+--------------------+
|10687FF9-1ECE-64A...|MULTIPOLYGON (((-...|{"type":"MultiPol...|
|10687FB5-6AE6-64A...|MULTIPOLYGON (((-...|{"type":"MultiPol...|
|106880D8-695B-64A...|MULTIPOLYGON (((0...|{"type":"MultiPol...|
|10687FB5-9CD0-64A...|MULTIPOLYGON (((-...|{"type":"MultiPol...|
|10687F75-E624-64A...|MULTIPOLYGON (((-...|{"type":"MultiPol...|
+--------------------+--------------------+--------------------+



Our framework allows for easy preparation of STAC requests with only one line of code. The resulting data is Delta-ready at this point and can easily be stored for lineage purposes.

For the purposes of this exercise, we'll retain the relationship between granules / sweeps and areas of interest by creating 'sets' of identifiers against each granule.

In [0]:
eod_items = (
  library.get_assets_for_cells(
    json_geoms.repartition(sc.defaultParallelism),
    time_range,
    "sentinel-2-l2a"
    )
  .where(F.col("asset.type") == "image/tiff; application=geotiff; profile=cloud-optimized")
  .where(F.col("asset.name") != "preview")
  .groupBy("item_id", "asset.name", "item_properties.datetime")
  .agg(
    F.collect_set("id").alias("ids"),
    F.first("asset.href").alias("href"),
    )
  ).cache()
eod_items.display()

+--------------------+----+--------------------+--------------------+--------------------+
|             item_id|name|            datetime|                 ids|                href|
+--------------------+----+--------------------+--------------------+--------------------+
|S2A_MSIL2A_202105...| B12|2021-05-31T11:06:...|[10687FA8-890B-64...|https://sentinel2...|
|S2A_MSIL2A_202105...| B01|2021-05-31T11:06:...|[10687FA8-31EF-64...|https://sentinel2...|
|S2A_MSIL2A_202106...| B01|2021-06-29T11:43:...|[10687F88-C137-64...|https://sentinel2...|
|S2A_MSIL2A_202107...| B09|2021-07-20T11:06:...|[10687FD9-0307-64...|https://sentinel2...|
|S2A_MSIL2A_202107...| AOT|2021-07-20T11:06:...|[10687F95-83D6-64...|https://sentinel2...|
+--------------------+----+--------------------+--------------------+--------------------+
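
The `groupBy` and `collect_set` step can be pictured in plain Python as follows. This is only a sketch of the grouping logic with made-up identifiers, not the Spark job itself; the name `group_aois_by_asset` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (aoi_id, item_id, asset_name) rows, as a search per AoI might return them.
Row = Tuple[str, str, str]


def group_aois_by_asset(rows: Iterable[Row]) -> Dict[Tuple[str, str], Set[str]]:
    """Collect the set of AoI identifiers seen for each (item, asset) pair."""
    grouped: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for aoi_id, item_id, asset_name in rows:
        grouped[(item_id, asset_name)].add(aoi_id)
    return dict(grouped)


# Made-up identifiers: two courses fall inside the same granule.
rows = [
    ("course-a", "granule-1", "B04"),
    ("course-b", "granule-1", "B04"),
    ("course-a", "granule-1", "B08"),
    ("course-a", "granule-1", "B04"),  # duplicate hit, absorbed by the set
]
grouped = group_aois_by_asset(rows)
print(grouped[("granule-1", "B04")])
```

Grouping this way means each image asset is listed once, however many areas of interest it covers, while the set records which AoIs it serves.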



In [0]:
eod_items.count()

7120

### Download images into DBFS
Now that we have interrogated the catalogue, we can go ahead and download the imagery directly from the Planetary Computer storage account.

In [0]:
imagery_root = f"/tmp/{current_user}/{data_product}/imagery"
dbutils.fs.mkdirs(imagery_root)

imagery_table_ref = f"{CATALOG}.{SCHEMA}.imagery"

In [0]:
downloads = (
  eod_items
  .withColumn("downloaded_path", library.download_asset(F.col("href"), F.lit(f"/dbfs{imagery_root}")))
  )
downloads.write.mode("overwrite").saveAsTable(imagery_table_ref)

In [0]:
spark.table(imagery_table_ref).display()

+--------------------+----+--------------------+--------------------+--------------------+--------------------+
|             item_id|name|            datetime|                 ids|                href|     downloaded_path|
+--------------------+----+--------------------+--------------------+--------------------+--------------------+
|S2A_MSIL2A_202105...| B12|2021-05-31T11:06:...|[10687FA8-890B-64...|https://sentinel2...|/dbfs/tmp/stuart....|
|S2A_MSIL2A_202105...| B01|2021-05-31T11:06:...|[10687FA8-31EF-64...|https://sentinel2...|/dbfs/tmp/stuart....|
|S2A_MSIL2A_202106...| B01|2021-06-29T11:43:...|[10687F88-C137-64...|https://sentinel2...|/dbfs/tmp/stuart....|
|S2A_MSIL2A_202107...| B09|2021-07-20T11:06:...|[10687FD9-0307-64...|https://sentinel2...|/dbfs/tmp/stuart....|
|S2A_MSIL2A_202107...| AOT|2021-07-20T11:06:...|[10687F95-83D6-64...|https://sentinel2...|/dbfs/tmp/stuart....|
+--------------------+----+--------------------+--------------------+--------------------+--------------------+

In [0]:
spark.table(imagery_table_ref).count()

7120

In [0]:
spark.table(imagery_table_ref).where("downloaded_path = ''").display()

item_id,name,datetime,ids,href,downloaded_path
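
The final check filters for rows whose `downloaded_path` is empty, which suggests that the download helper signals failure with an empty string. `library.download_asset` itself is not shown in this notebook, so the following is only a sketch of how such a helper might behave, written with the standard library. The retry count, the timeout and the choice to name the local file after the last segment of the URL path are all assumptions, as is the example URL.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def local_path_for(href: str, root: str) -> str:
    """Map an asset URL to a file under `root`, ignoring any query string."""
    filename = os.path.basename(urlparse(href).path)
    return os.path.join(root, filename)


def download_asset(href: str, root: str, attempts: int = 3) -> str:
    """Download `href` into `root`; return the local path, or "" on failure."""
    target = local_path_for(href, root)
    os.makedirs(root, exist_ok=True)
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(href, timeout=60) as response, \
                    open(target, "wb") as out:
                shutil.copyfileobj(response, out)
            return target
        except OSError:
            continue  # network errors are OSError subclasses; try again
    return ""


# Hypothetical signed URL: the token in the query string is not part of the name.
href = "https://example.com/container/granule/B04.tif?sig=abc123"
print(local_path_for(href, "/dbfs/tmp/imagery"))  # /dbfs/tmp/imagery/B04.tif
```

Returning an empty string rather than raising keeps a single failed download from aborting a whole batch, at the cost of needing an explicit check like the one above afterwards.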
