## ICESat-2 Filtering and Extractions
This notebook searches for a spatial subset of ATL08 data, runs DPS jobs that call extract_atl08.py (by Nathan Thomas & Paul Montesano), and then processes and visualizes the outputs. Configure the notebook run in the first cells, in particular whether DPS jobs should be submitted.

Outline:
1. Get BBOX of interest
2. Query CMR for ATL08 Granules
3. Convert ATL08 Granules to las format only keeping relevant vars/dimensions (h5py/pdal)
4. Convert las files to EPT data store

In [1]:
# 2.3 ICESat-2 extraction, merging, filtering, exploring, mapping
from maap.maap import MAAP
maap = MAAP()

import ipycmc
w = ipycmc.MapCMC()

if False:
    import importlib.util
    # find_spec returns None when the package is not installed
    lib_loader = importlib.util.find_spec('cartopy')

    if lib_loader is not None:
        REBUILD_CONDA_ENV = False
        print("No need to re-build conda env.")
    else:
        REBUILD_CONDA_ENV = True
        print("Re-build conda env...")

    if REBUILD_CONDA_ENV:
        #### This notebook uses a DPS job to run extract_atl08.py to convert h5's to csv's, then appends all csv's into a pandas geodataframe.
        #### Returns: a pandas geodataframe that should hold the entire set of ATL08 data for this project
        #### Notes:
        ###### ISSUE: how to reliably activate a conda env that can support this notebook.
        ###### Need to 'conda activate' an env that has geopandas - but where do I do this 'activate'. How does terminal env interact with nb?
        ###### Workaround: always do this to base:
        ! conda install -c conda-forge geopandas -y
        #! conda install -c conda-forge cartopy -y
        ! conda install -c conda-forge descartes -y
        ! conda install -c conda-forge seaborn -y
        ! conda install contextily --channel conda-forge -y
        #! conda install -c conda-forge matplotlib_scalebar -y
        ##https://www.essoar.org/doi/10.1002/essoar.10501423.1
        ##https://www.essoar.org/pdfjs/10.1002/essoar.10501423.1
        ##https://github.com/icesat2py/icepyx/blob/master/examples/ICESat-2_DEM_comparison_Colombia_working.ipynb
        ##https://github.com/ICESAT-2HackWeek/2020_ICESat-2_Hackweek_Tutorials
        ##https://icesat-2hackweek.github.io/learning-resources/logistics/schedule/
        ##https://github.com/giswqs/earthengine-py-notebooks

import geopandas as gpd
#import descartes
import numpy as np
#import seaborn as sb
from geopandas import GeoDataFrame
from geopandas.tools import sjoin
import pandas as pd
import glob
import os
import random 
import shutil
import time
import math

import matplotlib.pyplot as plt

#import cartopy.crs as ccrs
#from cartopy.feature import NaturalEarthFeature, LAND, COASTLINE
#from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

import datetime
from matplotlib.colors import LinearSegmentedColormap
from mpl_toolkits.axes_grid1 import make_axes_locatable
#from matplotlib_scalebar.scalebar import ScaleBar
import contextily as ctx



The Shapely GEOS version (3.8.0-CAPI-1.13.1 ) is incompatible with the GEOS version PyGEOS was compiled with (3.8.1-CAPI-1.13.3). Conversions between both will be slow.



In [10]:
import sys
sys.path.append('/projects/Developer/icesat2_boreal/notebooks/3.Gridded_product_development')
from CovariateUtils import *

import FilterUtils
import ExtractUtils

boreal_tile_index_path = '/projects/shared-buckets/nathanmthomas/boreal_grid_albers90k_gpkg.gpkg' #'/projects/maap-users/alexdevseed/boreal_tiles.gpkg'

# Run extract_atl08.py as a DPS job (see nb 1.3 for a template of how this can be done)
DPS_OUTPUT_DIR = '/projects/r2d2/dps_outputs/extract_atl08_dps_orig/master/2021/06'
#DPS_OUTPUT_DIR = '/projects/jabba/dps_output/2.3_output' #'/projects/above/processed_data/2.3_output'

RUN_DPS = False

EPT_APPROACH = False

H_DIFF_THRESH = 100
H_CAN_THRESH = 100

READ_PICKLE = False
DIR_PICKLE = '/projects/jabba/data'#'/projects/above'

DO_ATL08_CSV_SUBSET = False # <- set to True for testing
SUBSET_FRAC_SIZE = 0.50

#COPY_CSVS = False
CSV_TO_DIR = '/projects/jabba/data'#/projects/r2d2/above/atl08_csvs
%matplotlib inline

## MAAP.searchGranule, make h5 and CSV list: specify the tiles (bbox) and years, get a list of granules, and get a list of ATL08 CSVs

#### Set up the INPUT_TILE_NUM_LIST here (testing, or full runs)

In [13]:
# Make a list of ATL08 files you will want to DPS over

# Test tiles
#INPUT_TILE_NUM_LIST = [30542, 30543, 30821, 30822, 30823]

# NA tiles
# Read the boreal tile index file
boreal_tile_index_path = '/projects/my-public-bucket/boreal_tiles.gpkg'#'/projects/my-public-bucket/boreal_grid_albers90k_gpkg.gpkg'
boreal_tile_layer_name = 'boreal_tiles'#'grid_boreal_albers90k_gpkg'
boreal_tile_index = gpd.read_file(boreal_tile_index_path)
boreal_tile_index_subset = boreal_tile_index.to_crs(4326).cx[-170:-50, 50:75]

# Boreal NA tiles: need just a list of tile_ids
INPUT_TILE_NUM_LIST = boreal_tile_index_subset['layer'].astype(int).tolist()

INPUT_TILE_NUM_LIST=INPUT_TILE_NUM_LIST[0:5]
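
The `.cx[-170:-50, 50:75]` indexer keeps the tiles that intersect that longitude/latitude window. As a rough illustration of the selection logic only, the sketch below applies a bounding-box intersection test to plain `(xmin, ymin, xmax, ymax)` tuples. The tile ids and extents are made up, `bbox_intersects` is a hypothetical helper, and geopandas tests the actual geometries rather than only their envelopes.

```python
def bbox_intersects(a, b):
    """Return True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical tile extents in EPSG:4326, keyed by tile id (illustrative values only)
tiles = {
    1: (-120.0, 55.0, -118.5, 56.0),   # inside the window
    2: (-51.0, 49.5, -49.0, 50.5),     # straddles the window corner
    3: (30.0, 60.0, 31.5, 61.0),       # far outside the window
}

# Same window as .cx[-170:-50, 50:75]: lon -170..-50, lat 50..75
window = (-170.0, 50.0, -50.0, 75.0)

selected = [tile_id for tile_id, box in tiles.items() if bbox_intersects(box, window)]
print(selected)
```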

#### Loop over the INPUT_TILE_NUM_LIST
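
The loop in the next cell searches one summer window per year and one bounding box per tile. A minimal sketch of how those two search strings are assembled is given here; `summer_range` and `bbox_string` are hypothetical helper names, not part of the MAAP API.

```python
def summer_range(year, start="06-01T00:00:00Z", end="09-30T23:59:59Z"):
    """Build a 'start,end' temporal string covering one year's summer window."""
    return "{0}-{1},{0}-{2}".format(year, start, end)

def bbox_string(bounds):
    """Join (xmin, ymin, xmax, ymax) into a comma-separated bounding-box string."""
    return ",".join(str(v) for v in bounds)

for year in ["2018", "2019"]:
    print(summer_range(year))

print(bbox_string((-180, 50, -50, 75)))
```

Keeping the month and day fixed while varying only the year is one way to approximate a recurring seasonal search.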

In [14]:
INPUT_YEARS_LIST = ['2018','2019','2020','2021']

all_atl08 = []

for YEAR in INPUT_YEARS_LIST:
    
    print("Year: ", YEAR)
    all_atl08_year = []
    
    for TILE_NUM in INPUT_TILE_NUM_LIST:
        
        # This is fine for a single tile
        tile_parts = get_index_tile(boreal_tile_index_path, TILE_NUM, buffer=0, layer = boreal_tile_layer_name)

        BBOX_TILE = ','.join(str(x) for x in tile_parts['bbox_4326'])

        # Other BBOXs
        BBOX_NA = "-180,50,-50,75"
        BBOX_CIRC = "-180,40,180,75" # You'll need to edit run_above.sh to adjust the geo filtering called for with extract_atl08.py

        BBOX = BBOX_TILE
        print("\tTILE_NUM: {} ({})".format(TILE_NUM, BBOX) )

        COLLECTID_ATL08_V3 = "C1200235747-NASA_MAAP"

        # Note: we want to be able to do a 'recurring' seasonal search, regardless of year
        DATERANGE_SUMMER = YEAR+'-06-01T00:00:00Z,'+YEAR+'-09-30T23:59:59Z'

        # We don't really want a limit; it is unclear how to disable it, so use a very high number
        MAX_ATL08_ORBITS = 100000

        granules = maap.searchGranule(collection_concept_id=COLLECTID_ATL08_V3, 
                                      temporal=DATERANGE_SUMMER, 
                                      bounding_box=BBOX, 
                                      limit=MAX_ATL08_ORBITS)

        # This is a list of the granule URLs for processing
        granules_list_ATL08 = FilterUtils.get_granules_list(granules)
               
        print("\t\t# ATL08 in BBOX: {}".format(len(granules_list_ATL08)) )
        all_atl08_year += granules_list_ATL08
    print("\t# ATL08 in {}: {}".format(YEAR, len(all_atl08_year)) )
    all_atl08 += all_atl08_year
    
print('# ATL08 h5s total: {}'.format(len(all_atl08)) )

# Change h5s to CSVs
all_atl08_csvs = [os.path.join(DPS_OUTPUT_DIR, os.path.basename(f).replace('.h5','_30m.csv')) for f in all_atl08]

# Split the expected CSVs into those present on disk and those still missing
all_atl08_csvs_FOUND = [f for f in all_atl08_csvs if os.path.isfile(f)]
all_atl08_csvs_NOT_FOUND = [f for f in all_atl08_csvs if not os.path.isfile(f)]
print('# ATL08 CSVs found: {}'.format(len(all_atl08_csvs_FOUND)) ) 
#print('\n'.join(all_atl08_csvs_FOUND) )
print('# ATL08 CSVs not found in {}: {}'.format(DPS_OUTPUT_DIR, len(all_atl08_csvs_NOT_FOUND)) )  

Year:  2018


ValueError: Null layer: 'boreal_tiles'
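
The `Null layer` error above suggests that the layer name passed to the reader does not exist in the GeoPackage. A GeoPackage is an SQLite database whose `gpkg_contents` table lists its layers, so one way to check the available names is to query that table with the standard library. This is a minimal sketch: `list_gpkg_layers` is a hypothetical helper, and the demo file it reads is a stand-in that contains only that one table with a made-up layer name.

```python
import os
import sqlite3
import tempfile

def list_gpkg_layers(path):
    """Return the names of the vector (feature) layers registered in a GeoPackage."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT table_name FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    finally:
        con.close()
    return [r[0] for r in rows]

# Build a tiny stand-in file with just a gpkg_contents table (not a full GeoPackage)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gpkg")
con = sqlite3.connect(demo_path)
con.execute("CREATE TABLE gpkg_contents (table_name TEXT, data_type TEXT)")
con.execute("INSERT INTO gpkg_contents VALUES ('demo_layer', 'features')")
con.commit()
con.close()

layers = list_gpkg_layers(demo_path)
print(layers)
```

Running the helper against the real tile index would show which name to assign to `boreal_tile_layer_name`.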

## Run a single DPS job to test

In [None]:
if RUN_DPS:
    ##################################
    #Test DPS submission on a single file
    granule=granules_list_ATL08[0]

    submit_result = maap.submitJob(identifier="nothing", algo_id="run_above_ubuntu", 
                                       version="master", 
                                       username="r2d2", 
                                       icesat2_granule=granule)
    print(submit_result)

## Run DPS in Batch Mode

In [5]:
# Extraction
#
# DPS SUBMISSION
if RUN_DPS:
    # Here is where I submit a job 
    # identified with 'algo_id' (in yaml file)
    # that specifies a bash script /projects/above/gitlab_repos/atl08_extract_repo/run_above.sh 
    # that will call the 'algorithm' (extract_atl08.py)

    # Uses granule list from nb 2.1
    # CHANGE the submitJob args!
    for g, granule in enumerate(granules_list_ATL08):
        submit_result = maap.submitJob(identifier="nothing", algo_id="run_above_ubuntu", 
                                   version="master", 
                                   username="r2d2", 
                                   icesat2_granule=granule)
        # Print the submission result at a few checkpoints to monitor progress
        if g in (1, 100, 1000, 2000, 3000, 4000):
            print(submit_result)
        # Report completion after the last granule has been submitted
        if g == len(granules_list_ATL08) - 1:
            print(submit_result)
            print('done!')
        
else:
    print("Not running DPS; probably because output from extract_atl08 DPS job already exists.")
    print(DPS_OUTPUT_DIR)

Not running DPS; probably because output from extract_atl08 DPS job already exists.
/projects/above/processed_data/2.3_output
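
When thousands of jobs are submitted, it helps to report progress at fixed intervals and to keep every result for later inspection. A minimal sketch, assuming a generic `submit` callable: `submit_all` is a hypothetical helper, and `fake_submit` is a stand-in that does not reproduce the return value of `maap.submitJob`.

```python
def submit_all(granules, submit, report_every=1000):
    """Submit one job per granule, printing progress every `report_every` jobs and at the end."""
    results = []
    total = len(granules)
    for i, granule in enumerate(granules, start=1):
        results.append(submit(granule))
        if i % report_every == 0 or i == total:
            print("submitted {}/{}".format(i, total))
    return results

def fake_submit(granule):
    """Stand-in for a real job submission; returns a simple dict."""
    return {"granule": granule, "status": "queued"}

# Hypothetical granule names, for illustration only
granules = ["granule_{:04d}.h5".format(n) for n in range(2500)]
results = submit_all(granules, fake_submit)
```

Collecting the results in a list makes it possible to check afterwards which submissions failed, rather than relying on a handful of printed checkpoints.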


## Merge DPS outputs into data frame for visualization

In [147]:
%%time

if not READ_PICKLE:
    
    # Merging

    # List of CSVs made in MAAP.searchGranule chunk above
    print("# of ATL08 files found after DPS to extract atl08 to CSV: ",len(all_atl08_csvs_FOUND))
    
    if False:
        # Find and delete any CSV that has a size of 0
        #! find $DPS_OUTPUT_DIR -name "*.csv" -size 0 -delete
        print("Making list of ATL08 csv files...")
        # Find all remaining output CSVs from DPS jobs
        # recursive=True only takes effect when the pattern contains '**'
        all_atl08_csvs = glob.glob(os.path.join(DPS_OUTPUT_DIR, "**", "ATL08*.csv"), recursive=True)

        # This could break if you randomly grab an incomplete or empty CSV
        if DO_ATL08_CSV_SUBSET:
            all_atl08_csvs_FOUND = random.sample(all_atl08_csvs_FOUND, math.floor(SUBSET_FRAC_SIZE * len(all_atl08_csvs_FOUND)))
            print("# of ATL08 files after test sample: ",len(all_atl08_csvs_FOUND))

    # Merge all files in the list
    print("Creating pandas data frame...")
    atl08 = pd.concat([pd.read_csv(f) for f in all_atl08_csvs_FOUND ], sort=False)
    
    # Probably not necessary
    #print('finished pickle') #<--no; there isn't any pickling here; it's written after the Filtering chunk
    #atl08.to_csv( "/projects/above/processed_data/atl08_merged.csv", index=False, encoding='utf-8-sig')

# of ATL08 files found after DPS to extract atl08 to CSV:  76
Creating pandas data frame...
CPU times: user 19.3 s, sys: 1.13 s, total: 20.5 s
Wall time: 22.2 s
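
A zero-byte CSV left by a failed job makes `pd.read_csv` raise an error, which aborts the whole merge. The sketch below screens the file list by size before concatenation, as an alternative to deleting empty files with `find`. `nonempty_files` is a hypothetical helper, and the demo files are created in a temporary directory.

```python
import os
import tempfile

def nonempty_files(paths):
    """Keep only paths that exist and contain at least one byte."""
    return [p for p in paths if os.path.isfile(p) and os.path.getsize(p) > 0]

# Demo: one CSV with content, one empty CSV, one missing path
tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "ATL08_good_30m.csv")
empty = os.path.join(tmp_dir, "ATL08_empty_30m.csv")
missing = os.path.join(tmp_dir, "ATL08_missing_30m.csv")

with open(good, "w") as f:
    f.write("lon,lat,h_can\n-120.1,55.2,12.5\n")
open(empty, "w").close()

usable = nonempty_files([good, empty, missing])
print([os.path.basename(p) for p in usable])
```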


## Read in old data frame from a pickle

In [141]:
%%time
if False:
    if not READ_PICKLE and not EPT_APPROACH:

        if False:
            # Filtering    <------ THIS IS UPDATED USING THE METHOD IN FilterUtils.py (~5-26-2021)
            atl08 =  atl08[
                           (atl08.msw_flg == 0) & 
                           (atl08.beam_type == 'Strong') & 
                           (atl08.seg_snow == 'snow free land')
                            ]
            print(f"After filtering, there are {atl08.shape[0]} observations in this dataframe.")

        # Pickle the file
        cur_time = time.strftime("%Y%m%d%H%M%S")
        samp_frac_str = "samp-all"
        if DO_ATL08_CSV_SUBSET:
            samp_frac_str = "samp-" + '{:1.2f}'.format(SUBSET_FRAC_SIZE).replace('.','p')
        atl08.to_pickle(os.path.join(DIR_PICKLE, "atl08_"+samp_frac_str+"_"+cur_time+".pkl"))
    else:
        print("Getting the latest merged, filtered, & compressed file of ATL08 obs as a pandas dataframe...")
        list_of_pickles = glob.glob(DIR_PICKLE+'/atl08*.pkl') # * means all if need specific format then *.csv
        latest_pickle_file = max(list_of_pickles, key=os.path.getctime)
        print(latest_pickle_file)
        atl08 = pd.read_pickle(latest_pickle_file)
        print("ATL08 db now available from pickled file.")


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs
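
The pickle logic above encodes the sampling fraction and a timestamp in the file name, then reloads the newest match on a later run. A minimal sketch of those two pieces using only the standard library follows. The helper names are hypothetical, and `latest_file` differs from the cell above in two ways: it compares modification times rather than `getctime`, and it returns `None` instead of raising when nothing matches.

```python
import glob
import os
import tempfile
import time

def pickle_name(subset, frac, stamp=None):
    """Build a name such as atl08_samp-0p50_<timestamp>.pkl (samp-all when not subsetting)."""
    stamp = stamp or time.strftime("%Y%m%d%H%M%S")
    tag = "samp-" + "{:1.2f}".format(frac).replace(".", "p") if subset else "samp-all"
    return "atl08_{}_{}.pkl".format(tag, stamp)

def latest_file(pattern):
    """Return the most recently modified file matching `pattern`, or None if there is none."""
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None

name_subset = pickle_name(True, 0.50, stamp="20210601120000")
name_all = pickle_name(False, 0.50, stamp="20210601120000")
print(name_subset, name_all)

# An empty directory has no pickles, so nothing is returned
empty_dir = tempfile.mkdtemp()
print(latest_file(os.path.join(empty_dir, "atl08*.pkl")))
```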


## Applying filtering by bounds and by quality for each TILE

In [9]:
%%time
import warnings; warnings.simplefilter('ignore')

from importlib import reload 
reload(FilterUtils)


print("Applying additional filtering using FilterUtils; returning a final atl08 dataframe...")
print(atl08.shape)
in_tile_fn = '/projects/maap-users/alexdevseed/boreal_tiles.gpkg'
in_tile_layer = 'boreal_tiles_albers'

atl08_filt_df_dict = {}

for TILE_NUM in INPUT_TILE_NUM_LIST:
    
    print("\nFiltering by tile: {}".format(TILE_NUM))
    
    # Get tile bounds as xmin,xmax,ymin,ymax
    in_bounds = FilterUtils.reorder_4326_bounds(in_tile_fn, TILE_NUM, buffer=0, layer = in_tile_layer)

    # Filter by bounds
    if EPT_APPROACH:
        # INPUT EPT APPROACH HERE to return a filtered atl08 data frame?
        # https://docs.maap-project.org/en/latest/query/testing-ept-stores.html
        # Read a massive EPT of all ATL08 h5 files
        #in_ept_fn = ????
        # Filtering bounds of EPT by tile
        # EPT is filtered using 2.3_dps.py to subset by tile and filter with above_filter_atl08() function in FilterUtils.py
        atl08_tmp = FilterUtils.filter_atl08_bounds_tile_ept(in_ept_fn, in_tile_fn, TILE_NUM, in_tile_layer, output_dir, return_pdf=True)
    else:
        atl08_tmp = FilterUtils.filter_atl08_bounds(atl08_df=atl08, in_bounds=in_bounds)

    # Filter by quality
    atl08_tmp = FilterUtils.filter_atl08_qual(atl08_tmp, SUBSET_COLS=True, 
                                                       subset_cols_list=['rh25','rh50','rh60','rh70','rh75','rh80','rh85','rh90','rh95','h_can','h_max_can'], 
                                                       filt_cols=['h_can','h_dif_ref','m','msw_flg','beam_type','seg_snow'], 
                                                       thresh_h_can=H_CAN_THRESH, thresh_h_dif=H_DIFF_THRESH, month_min=6, month_max=9)
    # Build a dict of filtered atl08 by tile_num
    atl08_filt_df_dict[TILE_NUM] = atl08_tmp
    atl08_tmp = None

#import pprint
## Prints the nicely formatted dictionary
#pprint.pprint(atl08_filt_df_dict)        

Applying additional filtering using FilterUtils; returning a final atl08 dataframe...


NameError: name 'atl08' is not defined
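
The filtering step combines a bounds test with several per-observation quality tests. The actual logic lives in `FilterUtils`, which is not shown here; the sketch below only illustrates the kind of tests implied by the `filt_cols` and threshold arguments and by the earlier inline filter (`msw_flg == 0`, strong beams, snow-free land). The comparison directions, the meaning of `m` as a month number, and the sample records are assumptions.

```python
def passes_quality(obs, thresh_h_can=100, thresh_h_dif=100, month_min=6, month_max=9):
    """Return True if one ATL08 observation (a dict) passes all illustrative quality tests."""
    return (
        obs["msw_flg"] == 0
        and obs["beam_type"] == "Strong"
        and obs["seg_snow"] == "snow free land"
        and obs["h_can"] < thresh_h_can
        and abs(obs["h_dif_ref"]) < thresh_h_dif
        and month_min <= obs["m"] <= month_max
    )

def in_bounds(obs, bounds):
    """Test a point against bounds ordered as xmin, xmax, ymin, ymax."""
    xmin, xmax, ymin, ymax = bounds
    return xmin <= obs["lon"] <= xmax and ymin <= obs["lat"] <= ymax

# Made-up observations for illustration
observations = [
    {"lon": -120.2, "lat": 55.4, "msw_flg": 0, "beam_type": "Strong",
     "seg_snow": "snow free land", "h_can": 14.2, "h_dif_ref": 3.1, "m": 7},
    {"lon": -120.3, "lat": 55.5, "msw_flg": 2, "beam_type": "Strong",
     "seg_snow": "snow free land", "h_can": 9.8, "h_dif_ref": 1.0, "m": 7},
    {"lon": -120.4, "lat": 55.6, "msw_flg": 0, "beam_type": "Weak",
     "seg_snow": "snow free land", "h_can": 11.0, "h_dif_ref": 0.5, "m": 8},
    {"lon": -100.0, "lat": 55.6, "msw_flg": 0, "beam_type": "Strong",
     "seg_snow": "snow free land", "h_can": 11.0, "h_dif_ref": 0.5, "m": 8},
]

tile_bounds = (-121.0, -119.0, 55.0, 56.0)  # xmin, xmax, ymin, ymax
kept = [o for o in observations if in_bounds(o, tile_bounds) and passes_quality(o)]
print(len(kept))
```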

## Extracting covariates by filtered ATL08 by TILE and writing CSV
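
Extracting covariates amounts to looking up, for each ATL08 point, the raster cell that contains it. `ExtractUtils.extract_value_gdf` is not shown in this notebook, so the sketch below only illustrates the underlying index arithmetic for a north-up raster, with a nested list standing in for one band. The function name, origin, pixel size, and values are all made up.

```python
def sample_raster(grid, x_origin, y_origin, pixel_size, x, y, nodata=None):
    """Return the value of the cell containing (x, y) in a north-up raster.

    (x_origin, y_origin) is the outer corner of the upper-left pixel, so rows
    increase as y decreases.
    """
    col = int((x - x_origin) // pixel_size)
    row = int((y_origin - y) // pixel_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return nodata  # point falls outside the raster extent

# 3 x 3 made-up elevation grid with 30 m pixels, upper-left corner at (1000, 2090)
elevation = [
    [101, 102, 103],
    [111, 112, 113],
    [121, 122, 123],
]

points = [(1005.0, 2085.0), (1065.0, 2031.0), (2000.0, 2000.0)]
values = [sample_raster(elevation, 1000.0, 2090.0, 30.0, x, y) for x, y in points]
print(values)
```

The points must be in the same coordinate reference system as the raster before the lookup, which is why the topo extraction in the next cell is called with `reproject=True`.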

In [202]:
import ExtractUtils
from importlib import reload 
reload(ExtractUtils)

# Extract values to points; return a CSV for modelling
topo_covar_root = "/projects/jabba/dps_output/do_topo_stack_3-1-5_ubuntu/master/2021"
landsat_covar_root = "/projects/jabba/dps_output/do_landsat_stack_3-1-2_ubuntu/master/2021"
output_dir = "/projects/jabba/data"

for tile_num, atl08_df_filt in atl08_filt_df_dict.items():
    atl08_gdf_topo_landsat = None
    print("\nTile {} has {} filtered ATL08 obs".format(tile_num, len(atl08_df_filt)))
    # Convert to geopandas data frame in lat/lon
    atl08_gdf = GeoDataFrame(atl08_df_filt, geometry=gpd.points_from_xy(atl08_df_filt.lon, atl08_df_filt.lat), crs='epsg:4326')
    
    # Get the topo covar COG
    topo_covar_tile_list = ExtractUtils.get_covar_fn_list(topo_covar_root, tile_num)
    
    # Get most recent topo covar COG, reproject ATL08 to match, extract covars
    if len(topo_covar_tile_list)>0:
        topo_covar_fn = topo_covar_tile_list[-1]
        print(topo_covar_fn) 
        atl08_gdf_topo = ExtractUtils.extract_value_gdf(topo_covar_fn, atl08_gdf, ["elevation","slope","tsri","tpi", "slopemask"], reproject=True)
    else:
        print("-----> No topo covar COG for tile {}\n\n".format(tile_num))
        continue
    
    # Get the landsat covar COG
    landsat_covar_tile_list = ExtractUtils.get_covar_fn_list(landsat_covar_root, tile_num)
    
    # Get most recent landsat covar COG, extract covars
    if len(landsat_covar_tile_list)>0:
        landsat_covar_fn = landsat_covar_tile_list[-1]
        print(landsat_covar_fn)
        atl08_gdf_topo_landsat = ExtractUtils.extract_value_gdf(landsat_covar_fn, atl08_gdf_topo, ['Blue', 'Green', 'Red', 'NIR', 'SWIR', 'NDVI', 'SAVI', 'MSAVI', 'NDMI', 'EVI', 'NBR', 'NBR2', 'TCB', 'TCG', 'TCW', 'ValidMask', 'Xgeo', 'Ygeo'], reproject=False)
    else:
        print("-----> No landsat covar COG for tile {}\n\n".format(tile_num))
        continue
        
    if atl08_gdf_topo_landsat is not None:
        # CSV the file
        cur_time = time.strftime("%Y%m%d%H%M%S")
        out_csv_fn = os.path.join(output_dir, "atl08_filt_"+str(tile_num)+"_topo_landsat_"+cur_time+".csv")
        atl08_gdf_topo_landsat.to_csv(out_csv_fn,index=False, encoding="utf-8-sig")

        print("Wrote output csv of filtered ATL08 obs with topo and Landsat covariates for tile {}: {}".format(tile_num, out_csv_fn) )


Tile 30542 has 270 filtered ATL08 obs
/projects/jabba/dps_output/do_topo_stack_3-1-5_ubuntu/master/2021/05/28/18/40/58/507315/Copernicus_30542_covars_cog_topo_stack.tif
	Open the raster and store metadata...
	Re-project points to match raster...
	Dataframe has new raster value column: elevation
	Dataframe has new raster value column: slope
	Dataframe has new raster value column: tsri
	Dataframe has new raster value column: tpi
	Dataframe has new raster value column: slopemask
Returning re-projected points with 5 new raster value column: ['elevation', 'slope', 'tsri', 'tpi', 'slopemask']
/projects/jabba/dps_output/do_landsat_stack_3-1-2_ubuntu/master/2021/06/04/20/03/25/674070/Landsat8_30542_comp_cog_2015-2020_dps.tif
	Open the raster and store metadata...
	Dataframe has new raster value column: Blue
	Dataframe has new raster value column: Green
	Dataframe has new raster value column: Red
	Dataframe has new raster value column: NIR
	Dataframe has new raster value column: SWIR
	Datafram