# Part 1: STAC, Geopandas, Xarray, Dask, Holoviz

This notebook showcases foundational open-source Python libraries in the Pangeo stack of tools, working up from small data to datasets that exceed local memory.

## Learning objectives:

- discover data with [STAC](https://stacspec.org) APIs
- perform basic geospatial vector operations with [Geopandas](https://geopandas.org)
- perform basic geospatial raster operations with [Xarray](http://xarray.pydata.org/en/stable/)/[Rioxarray](https://corteva.github.io/rioxarray/stable/)
- single-machine scaling with [Dask](https://dask.org)
- interactive browser visualizations with [Holoviz](https://holoviz.org)

In [None]:
# STAC API search
import json
import pystac_client

# Vector utilities
import geopandas as gpd

# Raster utilities
import xarray as xr
import rioxarray

# Visualization
import hvplot.xarray
import hvplot.pandas

In [None]:
# It's good practice to keep track of library versions
print(f'pystac_client={pystac_client.__version__}')
print(f'geopandas={gpd.__version__}')
print(f'xarray={xr.__version__}')
print(f'rioxarray={rioxarray.__version__}')
print(f'hvplot={hvplot.__version__}')

## Vector data 

Geospatial vector data consists of basic geometries (Points, Lines, Polygons) with coordinate reference system (CRS) information. If you're new to vector data, check out this Software Carpentry [lesson](https://carpentries-incubator.github.io/geospatial-python/).

[Pandas](https://pandas.pydata.org) is a core scientific Python library to work with "Panel Data" (PanDas). Basically if you have a spreadsheet or database you should be using Pandas! Pandas has many input/output (I/O) functions, and two core data structures - the "Series" and "DataFrame". 
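
To make the two structures concrete, here is a minimal sketch with made-up values (the labels and numbers are purely illustrative, not real measurements). A "Series" is a single labeled column, and a "DataFrame" is a table of such columns sharing one index.

```python
import pandas as pd

# Illustrative, made-up values: a labeled one-dimensional Series
heights = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"], name="height_m")

# A DataFrame is a table whose columns share the same index
df = pd.DataFrame({"height_m": heights, "label": ["low", "mid", "high"]})

# Selecting one column of a DataFrame gives back a Series
column = df["height_m"]

# Boolean indexing keeps only the rows that satisfy a condition
tall = df[df["height_m"] > 15.0]
print(tall)
```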

[Geopandas](http://geopandas.org) extends Pandas to work efficiently with collections of geographic vector data: geometric shapes that are georeferenced to a position on Earth's surface. Geopandas data objects are, you might have guessed, called "GeoSeries" and "GeoDataFrame".

There are *many* vector formats for geospatial data. A very common one is [GeoJSON](https://gdal.org/drivers/vector/geojson.html), which can be easily represented as a Python dictionary:

In [None]:
# Barreal, Argentina location in GeoJSON
# from https://geojson.io

area_of_interest = {
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Point",
        "coordinates": [
          -69.466552734375,
          -31.62532121329918
          
        ]
      }
    }
  ]
}

with open('point.geojson', 'w') as f:
    json.dump(area_of_interest, f)
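
Because GeoJSON is plain JSON, the standard library alone is enough to serialize it and read values back out. The sketch below repeats the same point in a self-contained form (the variable names are only for illustration) and shows that GeoJSON positions are ordered longitude first, then latitude.

```python
import json

# Same structure as `area_of_interest`: a FeatureCollection with one Point
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Point",
                "coordinates": [-69.466552734375, -31.62532121329918],
            },
        }
    ],
}

# Serialize to a JSON string and parse it back into a dictionary
text = json.dumps(feature_collection)
parsed = json.loads(text)

# GeoJSON (RFC 7946) stores positions as [longitude, latitude]
lon, lat = parsed["features"][0]["geometry"]["coordinates"]
print(f"longitude={lon:.4f}, latitude={lat:.4f}")
```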

In [None]:
gf = gpd.read_file('point.geojson')
gf['id'] = 'barreal'
gf

In [None]:
# Geopandas facilitates geospatial operations such as reprojection
# https://epsg.io/32719
gf_utm = gf.to_crs('EPSG:32719')
gf_utm.buffer(100) 
# Why reproject? What are the units here?
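
As a hint for the questions above: GeoJSON coordinates are in degrees, so a buffer of `100` on the original data would mean 100 degrees, while UTM coordinates are in meters. The sketch below uses only the standard library; `utm_epsg` and `meters_per_degree_lon` are hypothetical helpers written for this illustration, and the spherical Earth radius is an approximation. For the Barreal point the first helper gives 32719, the code used above.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """EPSG code of the WGS 84 / UTM zone containing (lon, lat).

    Valid for -180 <= lon < 180; ignores the special zones
    around Norway and Svalbard.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    base = 32600 if lat >= 0 else 32700  # 326xx = north, 327xx = south
    return base + zone


def meters_per_degree_lon(lat: float, radius_m: float = 6_371_008.8) -> float:
    """Approximate ground length of one degree of longitude on a sphere."""
    return math.radians(1.0) * radius_m * math.cos(math.radians(lat))


lon, lat = -69.466552734375, -31.62532121329918
epsg = utm_epsg(lon, lat)
print(f"EPSG:{epsg}")
print(f"{meters_per_degree_lon(lat) / 1000:.1f} km per degree of longitude")
```

The length of a degree of longitude shrinks toward the poles, which is one reason distance-based operations such as buffering are done in a projected CRS.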

## Search for data

[SpatioTemporal Asset Catalogs (STAC)](https://stacspec.org) are a standard among imagery providers to simplify and unify search capabilities. Metadata is in JSON format and defined by a community-built standard [core specification](https://github.com/radiantearth/stac-spec) with optional [extensions](https://stac-extensions.github.io).

[pystac_client](https://github.com/stac-utils/pystac-client) is a Python client for working with STAC Catalogs and APIs. It uses [PySTAC](https://pystac.readthedocs.io) behind the scenes to navigate STAC metadata.

There are several public STAC API endpoints, which you can find on the [STAC Index Website](https://stacindex.org/catalogs?access=public&type=api). A few are listed below:

| provider | endpoint | datacenters |
| - | - | - | 
| [Microsoft Planetary Computer](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/) | https://planetarycomputer.microsoft.com/api/stac/v1 | Azure West Europe |
| [Element84 Earthsearch](https://www.element84.com/earth-search/) | https://earth-search.aws.element84.com/v0 | AWS multiple regions | 
| [NASA CMR STAC Cloud Proxy](https://github.com/nasa/cmr-stac) | https://cmr.earthdata.nasa.gov/cloudstac | AWS us-west-2 | 

For high-performance and cost-effective analysis, always keep in mind where data is located! For this workshop we are running on servers in Microsoft Azure’s `West Europe` region, so we'll use mostly datasets hosted in that region by the [Planetary Computer Initiative](https://planetarycomputer.microsoft.com). We run computations where the data is stored, and bring small subsets or visualizations back for download.

In [None]:
from pystac_client import Client

# Connect to a STAC API

# See documentation of all datasets at https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Use the 'geometry' information from GeoJSON
search = catalog.search(collections=["nasadem"], 
                        intersects=area_of_interest['features'][0]['geometry'], 
                       )
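
It can help to see what such a search looks like without the client library. In the STAC API specification, an item search is a request to the `/search` endpoint whose parameters include `collections`, `intersects`, and `limit`. The sketch below only builds roughly the JSON body that would be sent and does not contact the server; `build_search_request` is a hypothetical helper and the `limit` value is an arbitrary choice for illustration.

```python
import json

STAC_API = "https://planetarycomputer.microsoft.com/api/stac/v1"


def build_search_request(collections, geometry, limit=100):
    """Return the (url, body) pair for a STAC API item search.

    Hypothetical helper for illustration; pystac_client also handles
    paging and response parsing, which are omitted here.
    """
    url = f"{STAC_API}/search"
    body = {
        "collections": list(collections),
        "intersects": geometry,  # a GeoJSON geometry dictionary
        "limit": limit,          # maximum number of items per page
    }
    return url, body


point = {"type": "Point", "coordinates": [-69.466552734375, -31.62532121329918]}
url, body = build_search_request(["nasadem"], point)
print(url)
print(json.dumps(body, indent=2))
```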

STAC ItemCollections can be represented as *vector* data in GeoJSON format!
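
To see why, note that a STAC Item is a GeoJSON Feature with a few extra fields, and an ItemCollection is a FeatureCollection of Items. The item below is hand-written for illustration only: its id, footprint, datetime, and href are placeholders, not real search results.

```python
# Hypothetical, hand-written item: id, footprint, datetime, and href are placeholders
item_dict = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[-70.0, -32.0], [-69.0, -32.0], [-69.0, -31.0], [-70.0, -31.0], [-70.0, -32.0]]
        ],
    },
    "bbox": [-70.0, -32.0, -69.0, -31.0],
    "properties": {"datetime": "2000-01-01T00:00:00Z"},
    "links": [],
    "assets": {"elevation": {"href": "https://example.com/example-item.tif"}},
}

# An ItemCollection wraps items exactly like a GeoJSON FeatureCollection
item_collection = {"type": "FeatureCollection", "features": [item_dict]}


def bbox_of(feature):
    """Bounding box [min_lon, min_lat, max_lon, max_lat] of a Polygon's outer ring."""
    ring = feature["geometry"]["coordinates"][0]
    lons = [position[0] for position in ring]
    lats = [position[1] for position in ring]
    return [min(lons), min(lats), max(lons), max(lats)]


print(bbox_of(item_dict))
```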

In [None]:
# A convenient way to display results is as a Geopandas GeoDataFrame
gf = gpd.GeoDataFrame.from_features(search.get_all_items_as_dict())

# BUG: 'ids' dropped https://github.com/geopandas/geopandas/pull/2003
gf['id'] = [item.id for item in search.get_all_items()]
gf

**Aside:** Above we have a simple work-around for a geopandas bug. If open-source libraries are missing some functionality, you can help! In fact, the success of open-source software relies on community contributions and volunteer efforts. Remember, you are not just a *user* of these tools, but a *supporter* of these tools. Check out the excellent Xarray contributing guide for ideas on how to get started: http://xarray.pydata.org/en/stable/contributing.html.

In [None]:
# We can use hvplot to draw the item footprints on an interactive map
gf.hvplot.polygons(geo=True, tiles=True, alpha=0.2)

In [None]:
# Use PySTAC to iterate over STAC Items
for item in search.get_all_items():
    print(item.id)
    # print the full JSON metadata
    #display(item.to_dict())

In [None]:
# We have a single STAC Item, which can contain multiple STAC Assets
# (`item` is the last Item from the loop above):
for asset_key, asset in item.assets.items():
    print(f"{asset_key:<10} - {asset.href}")
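
Assets are a dictionary keyed by name, and each asset carries an `href` plus optional metadata such as a media `type` and `roles`. As a sketch of how one might pick out the raster data rather than previews, the example below filters a hand-written dictionary; the keys and hrefs are placeholders, and `data_assets` is a hypothetical helper.

```python
# Media type commonly used in STAC for Cloud-Optimized GeoTIFFs
COG_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"

# Hypothetical assets dictionary; keys and hrefs are placeholders
assets = {
    "elevation": {
        "href": "https://example.com/tile.tif",
        "type": COG_TYPE,
        "roles": ["data"],
    },
    "thumbnail": {
        "href": "https://example.com/tile.png",
        "type": "image/png",
        "roles": ["thumbnail"],
    },
}


def data_assets(assets):
    """Keep only the assets whose roles include 'data'."""
    return {key: a for key, a in assets.items() if "data" in a.get("roles", [])}


selected = data_assets(assets)
for key, a in selected.items():
    print(f"{key:<10} - {a['href']}")
```

Selecting an asset by key or role is more robust than relying on whichever asset happened to come last in a loop.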

## Raster data

In [None]:
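# Open the elevation raster by its asset key instead of relying on the
# leftover `asset` loop variable (assumes the item has an "elevation" asset)
da = rioxarray.open_rasterio(item.assets["elevation"].href)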
da

## Dask