# Create covariate stacks

### Covariates will be derived by reducing satellite imagery to useful modelling layers
#### Covariates can include: Landsat 8, NASADEM, Sentinel-2, Sentinel-1

### Inputs:
#### bbox_img_tile: the extent of an image tile, which supplies the bbox used to read from cloud-optimized GeoTIFFs
#### filters: search for imagery by days of year, cloud cover, and year range
##### Note: the image tile extent will be dictated by the data volume of the covariates and the available CPU RAM
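
A minimal sketch of how a days-of-year filter could be turned into the `start/end` interval string that the search expects. The helper name `doy_to_time_range` and its parameters are hypothetical, not part of the notebook below.

```python
import datetime

def doy_to_time_range(year, start_doy, end_doy):
    """Return a 'start/end' UTC interval string for a day-of-year window in one year."""
    first_day = datetime.date(year, 1, 1)
    start = first_day + datetime.timedelta(days=start_doy - 1)
    end = first_day + datetime.timedelta(days=end_doy - 1)
    return f"{start:%Y-%m-%d}T00:00:00Z/{end:%Y-%m-%d}T23:59:59Z"

# Day 152 to day 258 is 1 June to 15 September in a non-leap year
print(doy_to_time_range(2019, 152, 258))
```

Note that the same day numbers land one calendar day earlier in a leap year, so a fixed calendar window and a fixed day-of-year window are not interchangeable.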

In [1]:
# Search for imagery
# https://github.com/developmentseed/example-jupyter-notebooks/blob/landsat-search/notebooks/Landsat8-Search/L8-USGS-satapi.ipynb
import os
import json
import requests
import datetime

sat_api_url = "https://landsatlook.usgs.gov/sat-api"

def query_satapi(query):
    headers = {
        "Content-Type": "application/json",
        "Accept-Encoding": "gzip",
        "Accept": "application/geo+json",
    }

    url = f"{sat_api_url}/stac/search"
    data = requests.post(url, headers=headers, json=query).json()
    
    return data

def query_year(year, bbox, min_cloud, max_cloud):
    '''Given the year, queries scenes from 1 June to 15 September for the bbox and returns the full response.'''
    date_min = '-'.join([str(year), "06-01"])
    date_max = '-'.join([str(year), "09-15"])
    start_date = datetime.datetime.strptime(date_min, "%Y-%m-%d")
    end_date = datetime.datetime.strptime(date_max, "%Y-%m-%d") 
    start = start_date.strftime("%Y-%m-%dT00:00:00Z")
    end = end_date.strftime("%Y-%m-%dT23:59:59Z")
    
    query = {
        "time": f"{start}/{end}",
        "bbox": bbox,
        "query": {
            "eo:platform": {"eq": "LANDSAT_8"},
            "eo:cloud_cover": {"gte": min_cloud, "lt": max_cloud},
            "collection": {"eq": "landsat-c2l2-sr"},
            "landsat:collection_category": {"eq": "T1"}
        },
        # Cap the items per page (request) so sat-api doesn't fail on a large feature collection
        "limit": 20
    }
    
    data = query_satapi(query)
    
    # Return the full response: you can't troubleshoot without the actual results
    return data

In [2]:
# Accessing imagery
# Select an area of interest
bbox_list = [[-105,45,-100,50], [-101,45,-100,46]]
min_cloud = 0
max_cloud = 20
for bbox in bbox_list:
    # One GeoJSON response per year; TODO: change to a list of scenes
    response_by_year = [query_year(year, bbox, min_cloud, max_cloud) for year in range(2015, 2020 + 1)]
    scene_totals = [each['meta']['found'] for each in response_by_year]
    print(scene_totals)

[78, 80, 99, 95, 58, 97]
[12, 20, 19, 17, 17, 15]
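
The totals above exceed the `limit` of 20, so each response only carries the first page of features. A sketch of collecting every page, assuming the API honours a `page` field in the request body (true of sat-api generally, but not verified against this endpoint). `iter_features` and its `fetch` argument are hypothetical; `fetch` would be `query_satapi`.

```python
def iter_features(fetch, query, max_pages=50):
    """Yield every feature for `query`, requesting pages until `meta.found` is reached."""
    seen = 0
    for page in range(1, max_pages + 1):
        # ASSUMPTION: the endpoint accepts "page" alongside "limit"
        response = fetch({**query, "page": page})
        features = response.get("features", [])
        if not features:
            break
        yield from features
        seen += len(features)
        if seen >= response["meta"]["found"]:
            break
```

`max_pages` is a safety stop so a misbehaving endpoint cannot cause an endless loop.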


## Debugging

The next few code chunks were inserted just to debug why the query was returning so many results for such a small bounding box. The answer was that `bbox` needs to sit at the top level of the request body, outside the `query` section; adding a platform and tier to `query` reduces the results even more. Note: everything in the `query` section follows a very specific format. TODO: link the STAC docs explaining the query language.
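
A small sketch of a sanity check for the mistake described above. `check_search_body` is a hypothetical helper, and the rules it encodes are only the ones observed in this notebook (spatial and temporal filters at the top level, property filters as operator dicts under `query`).

```python
TOP_LEVEL_ONLY = ("bbox", "time", "limit")

def check_search_body(body):
    """Return a list of human-readable problems found in a search body."""
    problems = []
    query = body.get("query", {})
    for key in TOP_LEVEL_ONLY:
        if key in query:
            problems.append(f'"{key}" is nested inside "query"; move it to the top level')
    for name, condition in query.items():
        if name not in TOP_LEVEL_ONLY and not isinstance(condition, dict):
            problems.append(f'"{name}" should map to an operator dict such as {{"eq": value}}')
    if "bbox" not in body:
        problems.append('no top-level "bbox"; the search is not spatially restricted')
    return problems

# A body with the bbox misplaced inside "query"
print(check_search_body({"query": {"bbox": [-101, 45, -100, 46]}}))
```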

In [41]:
# It's better to just return the whole response, so you can iterate and debug through it.
bbox_list = [[-105,45,-100,50], [-101,45,-100,46]]
min_cloud = 0
max_cloud = 20

response = query_year(2020, bbox_list[1], min_cloud, max_cloud) 
scenes = response['meta']['found']
scenes

15

In [45]:
# Some helpful ways to debug
#response['features'][0]
for item in response['features']:
    print(item['id'])

LC08_L2SP_031029_20200915_20200919_02_T1
LC08_L2SP_032029_20200906_20200918_02_T1
LC08_L2SP_031029_20200830_20200906_02_T1
LC08_L2SP_033028_20200828_20200906_02_T1
LC08_L2SP_031029_20200814_20200920_02_T1
LC08_L2SP_033028_20200812_20200918_02_T1
LC08_L2SP_032029_20200805_20200916_02_T1
LC08_L2SP_032029_20200805_20200821_02_T1
LC08_L2SP_033028_20200727_20200908_02_T1
LC08_L2SP_033028_20200727_20200806_02_T1
LC08_L2SP_032029_20200704_20200913_02_T1
LC08_L2SP_031029_20200627_20200823_02_T1
LC08_L2SP_031028_20200627_20200824_02_T1
LC08_L2SP_031028_20200611_20200824_02_T1
LC08_L2SP_031029_20200611_20200824_02_T1
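
The listing contains the same acquisition more than once (for example path/row 032029 on 20200805 appears with two processing dates). A sketch of splitting the IDs into their parts and keeping only the most recently processed copy; both helper names are hypothetical, and whether dropping the older copy is the right choice for this analysis is an assumption.

```python
def parse_scene_id(scene_id):
    """Split a Landsat Collection 2 product ID into its underscore-separated fields."""
    _sensor, _level, path_row, acquired, processed, _collection, tier = scene_id.split("_")
    return {
        "id": scene_id,
        "path": path_row[:3],
        "row": path_row[3:],
        "acquired": acquired,    # YYYYMMDD
        "processed": processed,  # YYYYMMDD
        "tier": tier,
    }

def latest_processing(scene_ids):
    """Keep one ID per (path, row, acquisition date): the most recently processed."""
    best = {}
    for scene in map(parse_scene_id, scene_ids):
        key = (scene["path"], scene["row"], scene["acquired"])
        if key not in best or scene["processed"] > best[key]["processed"]:
            best[key] = scene
    return [scene["id"] for scene in best.values()]
```

Usage would be along the lines of `latest_processing(item['id'] for item in response['features'])`.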


## Accessing Data

Now that we've queried for which Landsat scenes to use in a given analysis tile, we need to set up AWS authentication and iterate over the assets, subsetting each one to the tile (bbox). An AWS IAM access key and secret are required so that any requester-pays costs are billed to the right account.
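
A sketch of failing early when the credentials are missing, rather than discovering it at the first read. It assumes the lowercase variable names used later in this notebook; `read_aws_credentials` is a hypothetical helper.

```python
import os

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")

def read_aws_credentials(env=os.environ):
    """Return the two credential values from `env`, or raise if either is unset or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing AWS credentials in environment: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned dict uses the same keyword names as `boto3.Session`, so it could be passed as `boto3.Session(**read_aws_credentials())` once `load_dotenv()` has run.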

In [3]:
# Skip if you already have everything installed
%pip install -q python-dotenv
%pip install -q boto3
%pip install -q rasterio
%pip install -q pyproj
# TODO add the CURL ssl path to the .env file
%env CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

Note: you may need to restart the kernel to use updated packages.
env: CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt


In [4]:
# load env variables from a .env file
import os
import rasterio as rio
from rasterio.session import AWSSession 
import numpy as np
import pyproj
import boto3
from matplotlib.pyplot import imshow
from dotenv import load_dotenv
load_dotenv()

True

In [5]:
# One option is to read this from the env, should work in ADE
# TODO: how do you pass secrets to DPS jobs?

aws_access_key_id=os.getenv('aws_access_key_id')
aws_secret_access_key=os.getenv('aws_secret_access_key')

In [6]:
# Processing imagery (band math)
# is this done on the fly? No


# Authenticate to aws to access a requester pays bucket
aws_session = AWSSession(boto3.Session(aws_access_key_id=aws_access_key_id,
                                      aws_secret_access_key=aws_secret_access_key),
                        requester_pays=True)

In [7]:
# The returned data is actually geojson, could load into a geopandas dataframe

# Get the second feature of the first year (2015), and look up the URL of the Band 3 asset
asset = response_by_year[0]['features'][1]['assets']['SR_B3.TIF']['href']

# Convert to S3 url for use with requester pays
cog = asset.replace('https://landsatlook.usgs.gov/data/', 's3://usgs-landsat/')
print(cog)

s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2015/032/028/LC08_L2SP_032028_20150824_20200908_02_T1/LC08_L2SP_032028_20150824_20200908_02_T1_SR_B3.TIF
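
`str.replace` returns the URL unchanged if the prefix is not there, so an unexpected asset URL would be passed on silently. A slightly stricter sketch of the same conversion; `to_s3_url` is a hypothetical helper and the two prefixes are the ones used above.

```python
HTTPS_PREFIX = "https://landsatlook.usgs.gov/data/"
S3_PREFIX = "s3://usgs-landsat/"

def to_s3_url(href):
    """Swap the landsatlook HTTPS prefix for the S3 bucket prefix, or raise if it is absent."""
    if not href.startswith(HTTPS_PREFIX):
        raise ValueError(f"Unexpected asset URL: {href}")
    return S3_PREFIX + href[len(HTTPS_PREFIX):]
```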


In [8]:
# Basic access example
with rio.Env(aws_session):
    with rio.open(cog) as src:
        profile = src.profile
        arr = src.read(1)
        print(profile)

#plot the raster
imshow(arr)

RasterioIOError: '/vsis3/usgs-landsat/collection02/level-2/standard/oli-tirs/2015/032/028/LC08_L2SP_032028_20150824_20200908_02_T1/LC08_L2SP_032028_20150824_20200908_02_T1_SR_B3.TIF' not recognized as a supported file format.
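
GDAL often reports "not recognized as a supported file format" when it cannot read the object at all, so one possible cause of the error above (an assumption, not confirmed in this notebook) is an access or credentials problem rather than a bad file. A sketch of checking the object directly through boto3, which surfaces the underlying S3 error; both function names are hypothetical.

```python
def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key)."""
    if not url.startswith("s3://"):
        raise ValueError(f"Not an S3 URL: {url}")
    bucket, _, key = url[len("s3://"):].partition("/")
    return bucket, key

def check_object(url, session):
    """Return the object's size in bytes, or let the underlying S3 error propagate.

    `session` is a boto3.Session; RequestPayer acknowledges requester-pays billing.
    """
    bucket, key = split_s3_url(url)
    response = session.client("s3").head_object(
        Bucket=bucket, Key=key, RequestPayer="requester"
    )
    return response["ContentLength"]
```

For example, `check_object(cog, boto3.Session(aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key))` would either return a size or raise an error that names the actual failure (such as a 403).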

In [1]:
from rasterio.windows import Window

def reproject_bbox(src, bbox, bbox_crs='epsg:4326'):
    '''
    Convert the bounding box to local coordinates of the data
    src is the raster data handler
    bbox [minX, minY, maxX, maxY] and bbox_crs supplied by user
    returns a rasterio window object 
    '''
    # always_xy=True keeps (x, y) = (lon, lat) order regardless of the CRS axis definition
    transformer = pyproj.Transformer.from_crs(bbox_crs, src.crs.to_wkt(), always_xy=True)
    lower = transformer.transform(bbox[0], bbox[1])
    upper = transformer.transform(bbox[2], bbox[3])
    
    # src.index returns (row, col) for an (x, y) coordinate
    bottom, left = src.index(lower[0], lower[1])
    top, right = src.index(upper[0], upper[1])
    
    # merge back into a window
    # Remember gdal reads from the upper left corner
    width = abs(right - left)    # number of columns
    height = abs(bottom - top)   # number of rows
    local_window = Window(min(left, right), min(top, bottom), width, height)
    #print(local_window)
    return local_window

def extract_subset(cog, bbox, band):
    '''
    Given a path to an S3 COG GeoTIFF, a bbox in lon/lat (e.g. WGS84), and a band,
    extracts a subset of an image by reading a window. 
    When used with COGs will only read the required portion of the file.
    BBOX format [minX, minY, maxX, maxY]
    e.g. bbox = [11.6, -0.1, 11.7, -0.0]
    Band is the layer in the file; a single-band tif only has 1.
    '''
    
    with rio.Env(aws_session):
        with rio.open(cog) as src:
            local_window = reproject_bbox(src, bbox, bbox_crs='epsg:4326')
            # query the subset with a window
            # todo: modify to allow multiband data sources
            subset = src.read(band, window=local_window)
            # Capture georeferencing while the file is open;
            # the transform returned is that of the window, not the full image
            crs = src.crs
            transform = src.window_transform(local_window)
        
    return subset, crs, transform
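
A plain-Python sketch of the row/column arithmetic behind a read window, assuming a north-up grid with square pixels. `bounds_to_window` and its parameters are hypothetical; the four values it returns are in the same order as rasterio's `Window(col_off, row_off, width, height)`.

```python
import math

def bounds_to_window(origin_x, origin_y, pixel_size, xmin, ymin, xmax, ymax):
    """Return (col_off, row_off, width, height) for projected bounds on a north-up grid.

    origin_x, origin_y is the upper-left corner of the raster.
    """
    col_off = math.floor((xmin - origin_x) / pixel_size)
    row_off = math.floor((origin_y - ymax) / pixel_size)  # rows count down from the top edge
    col_end = math.ceil((xmax - origin_x) / pixel_size)
    row_end = math.ceil((origin_y - ymin) / pixel_size)
    # width counts columns (x direction), height counts rows (y direction)
    return col_off, row_off, col_end - col_off, row_end - row_off

# 30 m pixels, bounds 900 m wide and 600 m tall
print(bounds_to_window(500000, 5000000, 30, 500300, 4999100, 501200, 4999700))
```

Rounding the offsets down and the far edges up means the window always covers the requested bounds, at the cost of up to one extra pixel on each side.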

In [10]:
## Retrieving Pixels
bands = [2,3,4,5,6]
bbox = bbox_list[1]
# TODO: 
# Loop over each scene
response = response_by_year[1]
# Each season should actually be its own DPS job
for item in response['features']:
    # for each scene Loop over bands 2,3,4,5,6 (assets)
    for band in bands:
        # For each scene, read subset set by bounding box
        asset = item['assets'][f'SR_B{band}.TIF']['href']
        # Convert to S3 url for use with requester pays
        cog = asset.replace('https://landsatlook.usgs.gov/data/', 's3://usgs-landsat/')
        print(cog) 
        # Since the source files are per band, the 1st band in a given file is default
        # Bound Box reprojected on the fly to the native projection of the asset
        #subset, crs, transform = extract_subset(cog, bbox, 1)

    # stack the bands into the same array with n layers (z direction)?
    # optional: calculate indices based on the bands and store as additional layers
    # save cog to disk (could be a KEA or Zarr (xarray))
# after looping, make a VRT of cogs so it can be treated as a single file?

# Questions:
# 1. Should the tiling scheme be 1-degree lon/lat, the final equal-area projection, or UTM-zone based? Either way, take the bbox and reproject it to lon/lat for the query.
# 2. Do we need the QA band, or can we handle that separately? The only limitation of COGs is that the storage type needs to be identical in all bands. KEA or Zarr could accommodate mixed types, or if the pixels are cloud filtered before saving the stack then a COG is ok.

0190727_20200827_02_T1/LC08_L2SP_031028_20190727_20200827_02_T1_SR_B3.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/031/028/LC08_L2SP_031028_20190727_20200827_02_T1/LC08_L2SP_031028_20190727_20200827_02_T1_SR_B4.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/031/028/LC08_L2SP_031028_20190727_20200827_02_T1/LC08_L2SP_031028_20190727_20200827_02_T1_SR_B5.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/031/028/LC08_L2SP_031028_20190727_20200827_02_T1/LC08_L2SP_031028_20190727_20200827_02_T1_SR_B6.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/033/028/LC08_L2SP_033028_20190725_20200827_02_T1/LC08_L2SP_033028_20190725_20200827_02_T1_SR_B2.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/033/028/LC08_L2SP_033028_20190725_20200827_02_T1/LC08_L2SP_033028_20190725_20200827_02_T1_SR_B3.TIF
s3://usgs-landsat/collection02/level-2/standard/oli-tirs/2019/033/028/LC08_L2SP_033028_20190725_20200827_02_T1/LC08_
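
The stacking and index steps noted in the comments above could look roughly like this with NumPy. It is a sketch only: `stack_with_ndvi` is a hypothetical helper, the scale and offset are the published Collection 2 Level-2 surface reflectance values, NDVI is used purely as an example index, and fill pixels (raw value 0) are not masked.

```python
import numpy as np

SR_SCALE, SR_OFFSET = 0.0000275, -0.2  # Collection 2 Level-2 surface reflectance scaling

def stack_with_ndvi(band_arrays):
    """Stack per-band subsets into one (n_bands + 1, rows, cols) array, NDVI last.

    band_arrays maps Landsat 8 band number to an equally shaped 2-D array of raw values
    and must include band 4 (red) and band 5 (near infrared).
    """
    bands = sorted(band_arrays)
    reflectance = np.stack(
        [band_arrays[b] * SR_SCALE + SR_OFFSET for b in bands]
    ).astype("float32")
    red = reflectance[bands.index(4)]
    nir = reflectance[bands.index(5)]
    denom = nir + red
    # Leave NaN where the denominator is zero instead of dividing
    ndvi = np.divide(nir - red, denom, out=np.full_like(denom, np.nan), where=denom != 0)
    return np.concatenate([reflectance, ndvi[np.newaxis]])
```

Converting everything to `float32` keeps a single storage type across layers, which is what a COG requires.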

In [None]:
# Adding processed imagery to a covariate stack