diff --git a/datasets/ms-buildings/ms-buildings-example.ipynb b/datasets/ms-buildings/ms-buildings-example.ipynb index 843f3e98..0b326b8b 100755 --- a/datasets/ms-buildings/ms-buildings-example.ipynb +++ b/datasets/ms-buildings/ms-buildings-example.ipynb @@ -14,17 +14,20 @@ "cell_type": "code", "execution_count": 1, "id": "05217579-1f22-4c5c-a649-6fcbd2cc2773", - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ - "import geopandas\n", "import planetary_computer\n", "import pystac_client\n", "import dask.dataframe\n", "import dask_geopandas\n", "import dask.distributed\n", + "import deltalake\n", "import shapely.geometry\n", - "import contextily" + "import contextily\n", + "import mercantile" ] }, { @@ -34,74 +37,93 @@ "source": [ "### Data access\n", "\n", - "The datasets hosted by the Planetary Computer are available from [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/). We'll use [pystac-client](https://pystac-client.readthedocs.io/) to search the Planetary Computer's [STAC API](https://planetarycomputer.microsoft.com/api/stac/v1/docs) for the subset of the data that we care about, and then we'll load the data directly from Azure Blob Storage. We'll specify a `modifier` so that we can access the data stored in the Planetary Computer's private Blob Storage Containers. See [Reading from the STAC API](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/) and [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more." + "The datasets hosted by the Planetary Computer are available from [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/). We'll use [pystac-client](https://pystac-client.readthedocs.io/) to search the Planetary Computer's [STAC API](https://planetarycomputer.microsoft.com/api/stac/v1/docs) to get a link to the assets. 
We'll specify a `modifier` so that we can access the data stored in the Planetary Computer's private Blob Storage Containers. See [Reading from the STAC API](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/) and [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more." ] }, { "cell_type": "code", "execution_count": 2, "id": "76dfc9a9-c787-41c3-aaa8-95787bed362c", - "metadata": {}, + "metadata": { + "tags": [] + }, "outputs": [], "source": [ "catalog = pystac_client.Client.open(\n", - " \"https://planetarycomputer.microsoft.com/api/stac/v1\",\n", + " \"https://planetarycomputer-staging.microsoft.com/api/stac/v1\",\n", " modifier=planetary_computer.sign_inplace,\n", - ")" + ")\n", + "collection = catalog.get_collection(\"ms-buildings\")" ] }, { "cell_type": "markdown", - "id": "a2b99c9b-0efa-4342-a7e5-6141a305c74b", + "id": "46c937a5-8ceb-46e8-9b27-b154184236fb", "metadata": {}, "source": [ - "### Querying the STAC API\n", + "### Using Delta Table Files\n", "\n", - "The files are available as a set of GeoParquet datasets, released in batches by date. There's one parquet datsaet per region-date pair." + "The assets are a set of [geoparquet](https://geoparquet.org/) files grouped by a processing date. Newer files (since April 25th, 2023) are stored in [Delta Format](https://docs.delta.io/latest/delta-intro.html). This is a layer on top of parquet files offering scalable metadata handling, which is useful for this dataset." 
] }, { "cell_type": "code", "execution_count": 3, - "id": "9863d3da-ee62-44e0-8ff1-7c54432a06db", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], + "id": "aace14f8-5a33-42e1-9651-44e0f1826546", + "metadata": { + "tags": [] + }, + "outputs": [], "source": [ - "items = catalog.search(\n", - " collections=[\"ms-buildings\"], query={\"msbuildings:region\": {\"eq\": \"Vatican City\"}}\n", - ")\n", - "item = next(items.items())\n", - "item" + "asset = collection.assets[\"delta\"]" ] }, { "cell_type": "markdown", - "id": "78088f2c-b8af-422a-a9c0-cac7144633fa", - "metadata": {}, + "id": "c78f7f60-1761-4d2e-8e7f-938aad628b87", + "metadata": { + "tags": [] + }, "source": [ - "This STAC item has a `data` asset linking to the GeoParquet dataset with the building footprints." + "This Delta Table is partitioned by `RegionName` and `quadkey`. Each `(RegionName, quadkey)` pair will contain one or more parquet files (depending on how dense that particular quadkey is). The quadkeys are at level 9 of the [Bing Maps Tile System](https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system)." 
] }, { "cell_type": "code", "execution_count": 4, - "id": "2d31c615-2170-4765-97ac-67b0566f85a8", + "id": "c7d30272-1914-4c57-b0d6-c93a9db59c60", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "storage_options = {\n", + " \"account_name\": asset.extra_fields[\"table:storage_options\"][\"account_name\"],\n", + " \"sas_token\": asset.extra_fields[\"table:storage_options\"][\"credential\"],\n", + "}\n", + "table = deltalake.DeltaTable(asset.href, storage_options=storage_options)" + ] + }, + { + "cell_type": "markdown", + "id": "8a247424-0e04-460f-8b77-d333c715b217", "metadata": {}, + "source": [ + "You can load all the files for a given `RegionName` with a query like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "076c0903-a71c-4ca8-8500-3bf9f7edb1cc", + "metadata": { + "tags": [] + }, "outputs": [ { "data": { "text/html": [ + "
Dask-GeoPandas GeoDataFrame Structure:
\n", "
\n", "