## Setup environment

In the terminal, create a virtual environment in the parent `bigstac` directory:
```sh
dirs
# ~/projects/bigstac

python3 -m venv .
source bin/activate
```

You can open this notebook in VS Code, which will offer to install the missing `ipykernel` package automatically. Alternatively, to serve the notebook yourself with JupyterLab, run:

```sh
pip install jupyterlab
jupyter lab # default browser will open to localhost:8888
```

The rest of the cells can be executed in the running notebook.

In [None]:
%pip install pyarrow shapely

## Download one collection's STAC Items

In [2]:
import pyarrow.parquet as pq
from shapely import wkb, wkt
import json

In [3]:
%%bash
curl https://cmr.earthdata.nasa.gov/stac/GES_DISC/collections/LPRM_WINDSAT_NT_SOILM3_001/items -o windsat_items.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23071  100 23071    0     0  17676      0  0:00:01  0:00:01 --:--:-- 17678


In [4]:
with open('windsat_items.json', 'r') as file:
  windsat_json = json.load(file)

Check some top-level properties of the STAC API response.

In [5]:
json_keys = ['type', 'stac_version', 'description', 'numberMatched', 'numberReturned']
for key in json_keys:
  print(f"{key+':' : <17}{windsat_json[key]}")

type:            FeatureCollection
stac_version:    1.0.0
description:     Items in the collection LPRM_WINDSAT_NT_SOILM3_001
numberMatched:   3169
numberReturned:  20
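
The gap between `numberMatched` and `numberReturned` means the downloaded file holds only the first page of results. As a rough sketch, assuming the response also carries a top-level `links` array with a `rel: "next"` entry (which the STAC API specification describes, but which this notebook has not checked), the next page could be located like this. The `sample_page` dictionary and its URLs are made up for illustration.

```python
import math


def find_next_link(page):
    """Return the href of the first link whose rel is 'next', or None."""
    for link in page.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None


# Hypothetical, trimmed response; only the two counts come from the output above.
sample_page = {
    "numberMatched": 3169,
    "numberReturned": 20,
    "links": [
        {"rel": "self", "href": "https://example.com/items"},
        {"rel": "next", "href": "https://example.com/items?page=2"},
    ],
}

# Number of requests needed to fetch every matched item at this page size.
pages_needed = math.ceil(sample_page["numberMatched"] / sample_page["numberReturned"])
print(pages_needed)
print(find_next_link(sample_page))
```

In the notebook, `find_next_link(windsat_json)` would show whether the real response offers such a link.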


List all first-level elements of a **feature** along with their second-level elements.

Note that there is deeper nesting not shown: the asset has four child elements.

In [6]:
first_feature = windsat_json['features'][0]
for key, value in first_feature.items():
  elements = ''
  # Unwrap single-item lists so that their keys can be listed too
  if isinstance(value, list) and len(value) == 1:
    value = value[0]
  # For dict values, append the second-level keys on an indented line
  if isinstance(value, dict):
    elements = ': \n   ' + ', '.join(value.keys())
  print(key, elements, sep="")

type
id
stac_version
stac_extensions
properties: 
   datetime, start_datetime, end_datetime
geometry: 
   type, coordinates
bbox
assets: 
   001/2003/02/LPRM-WINDSAT_L3_NT_SOILM3_V001_20030201012753
links: 
   rel, href, type, title
collection
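
To see the deeper nesting that the listing above leaves out, a small recursive walker is one option. The following is a minimal sketch, not part of the original notebook: `key_tree` is a hypothetical helper, and `sample_feature` is an invented, trimmed stand-in for a real item (its asset fields are placeholders, not the actual four child elements).

```python
def key_tree(node, depth=0):
    """Yield one indented line per dict key, descending into nested dicts."""
    if isinstance(node, list) and node:
        node = node[0]  # describe a list by its first element only
    if isinstance(node, dict):
        for key, value in node.items():
            yield "   " * depth + key
            yield from key_tree(value, depth + 1)


# Invented example data, shaped loosely like a STAC Item.
sample_feature = {
    "id": "example-item",
    "properties": {"datetime": "2003-02-01T01:27:53.000Z"},
    "assets": {
        "example-asset": {"href": "https://example.com/data.nc", "title": "Example"}
    },
    "links": [{"rel": "self", "href": "https://example.com/item"}],
}

lines = list(key_tree(sample_feature))
print("\n".join(lines))
```

Passing `windsat_json['features'][0]` instead of `sample_feature` would print every level of the real item, including the asset's child elements.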


## Convert from STAC JSON to GeoParquet

In [None]:
%%bash
brew install planetlabs/tap/gpq

In [8]:
%%bash
gpq convert windsat_items.json windsat_items.parquet

In [9]:
windsat = pq.ParquetFile('windsat_items.parquet')

In [10]:
windsat.metadata

<pyarrow._parquet.FileMetaData object at 0x10b56b420>
  created_by: parquet-go version 14.0.2
  num_columns: 4
  num_rows: 20
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 923

In [11]:
windsat.schema

<pyarrow._parquet.ParquetSchema object at 0x10b592640>
repeated group field_id=-1 schema {
  optional binary field_id=-1 datetime (String);
  optional binary field_id=-1 end_datetime (String);
  optional binary field_id=-1 geometry;
  optional binary field_id=-1 start_datetime (String);
}

## Missing metadata after `gpq` conversion

Only `datetime`, `start_datetime`, and `end_datetime` (all part of the feature `properties`) and `geometry` were converted to GeoParquet.

This is because `gpq` reads the STAC JSON as if it were plain GeoJSON, and the [spec](https://datatracker.ietf.org/doc/html/rfc7946#page-3) describes a GeoJSON feature mainly in terms of its `geometry` and `properties`. Other top-level elements of a STAC Item, such as `assets`, `links`, and `collection`, are not carried over.
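
To make the loss concrete, the sketch below lists the top-level members of a feature that have no counterpart in the converted file. It assumes, based on the four-column schema shown earlier, that only `geometry` and the keys under `properties` become columns. `dropped_members` is a hypothetical helper, and the values in `sample_feature` are placeholders; only the key names come from the feature listing above.

```python
# Members read by the conversion, per the assumption stated above.
# `type` is structural and is not counted as lost metadata.
READ_BY_CONVERTER = {"type", "geometry", "properties"}


def dropped_members(feature):
    """Return the top-level keys of a feature that the conversion ignores."""
    return [key for key in feature if key not in READ_BY_CONVERTER]


# Key names match the feature listing above; values are placeholders.
sample_feature = {
    "type": "Feature",
    "id": "example-item",
    "stac_version": "1.0.0",
    "stac_extensions": [],
    "properties": {"datetime": "2003-02-01T01:27:53.000Z"},
    "geometry": {"type": "Polygon", "coordinates": []},
    "bbox": [-180, -90, 180, 90],
    "assets": {},
    "links": [],
    "collection": "LPRM_WINDSAT_NT_SOILM3_001",
}

lost = dropped_members(sample_feature)
print(lost)
```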

## Read parquet file and verify geometry

In [12]:
w_reader = windsat.read()

Slice a single row off the top to work with.

In [13]:
w_s1 = w_reader.slice(length = 1)

In [14]:
w_s1

pyarrow.Table
datetime: string
end_datetime: string
geometry: binary
start_datetime: string
----
datetime: [["2003-02-01T01:27:53.000Z"]]
end_datetime: [["2003-02-02T01:12:06.000Z"]]
geometry: [[0103000000010000000500000000000000008066C000000000008056C0000000000080664000000000008056C00000000000806640000000000080564000000000008066C0000000000080564000000000008066C000000000008056C0]]
start_datetime: [["2003-02-01T01:27:53.000Z"]]

In [15]:
w_s1['geometry'][0].as_py()

b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x80f\xc0\x00\x00\x00\x00\x00\x80V\xc0\x00\x00\x00\x00\x00\x80f@\x00\x00\x00\x00\x00\x80V\xc0\x00\x00\x00\x00\x00\x80f@\x00\x00\x00\x00\x00\x80V@\x00\x00\x00\x00\x00\x80f\xc0\x00\x00\x00\x00\x00\x80V@\x00\x00\x00\x00\x00\x80f\xc0\x00\x00\x00\x00\x00\x80V\xc0'

For converting WKB to WKT:
[https://stackoverflow.com/a/74399148](https://stackoverflow.com/a/74399148)

In [16]:
loaded = wkb.loads(w_s1['geometry'][0].as_py())

In [17]:
wkt.dumps(loaded)

'POLYGON ((-180.0000000000000000 -90.0000000000000000, 180.0000000000000000 -90.0000000000000000, 180.0000000000000000 90.0000000000000000, -180.0000000000000000 90.0000000000000000, -180.0000000000000000 -90.0000000000000000))'
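
As a cross-check that does not depend on `shapely`, the WKB bytes can also be unpacked by hand with the standard-library `struct` module. This is a minimal sketch that handles only the layout seen here (little-endian, 2D, single-ring polygon); `decode_simple_polygon` is a hypothetical helper, and the hex string is the `geometry` value printed earlier, split by field.

```python
import struct

# Hex WKB from the `geometry` column shown earlier, split by field.
wkb_hex = (
    "01"        # byte order: 1 = little-endian
    "03000000"  # geometry type: 3 = Polygon
    "01000000"  # number of rings
    "05000000"  # number of points in the ring
    "00000000008066C0" "00000000008056C0"
    "0000000000806640" "00000000008056C0"
    "0000000000806640" "0000000000805640"
    "00000000008066C0" "0000000000805640"
    "00000000008066C0" "00000000008056C0"
)


def decode_simple_polygon(data):
    """Decode a little-endian, single-ring, 2D polygon from WKB bytes."""
    byte_order, geom_type, n_rings = struct.unpack_from("<BII", data, 0)
    if (byte_order, geom_type, n_rings) != (1, 3, 1):
        raise ValueError("only little-endian single-ring polygons are handled")
    (n_points,) = struct.unpack_from("<I", data, 9)
    # Coordinates follow the 13-byte header as x, y pairs of 8-byte doubles.
    coords = struct.unpack_from(f"<{2 * n_points}d", data, 13)
    return list(zip(coords[0::2], coords[1::2]))


ring = decode_simple_polygon(bytes.fromhex(wkb_hex))
for x, y in ring:
    print(x, y)
```

The decoded points should match the corners in the WKT string above, with the first and last point identical because the ring is closed.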