# 01: Sampling

This notebook locates and samples plots, exporting an HDF5 dataset with arrays of LiDAR metrics and NAIP reflectance values for each plot footprint.

In [None]:
import ee

from naip_cnn import sampling
from naip_cnn.acquisitions import (
    MAL2007,
    MAL2008_CampCreek,
    MAL2008_2009_MalheurRiver,
    MAL2010,
    MAL2014,
    MAL2016_CanyonCreek,
    MAL2017_Crow,
    MAL2017_JohnDay,
    MAL2018_Aldrich_UpperBear,
    MAL2018_Rattlesnake,
    MAL2019,
    MAL2020_UpperJohnDay,
)
from naip_cnn.data import NAIPDatasetWrapper

# If you are not authenticated locally, run `ee.Authenticate()` first.
ee.Initialize()

## Load Data

Our training data will be a set of points with 1) LiDAR attributes, which serve as the prediction labels, and 2) corresponding NAIP reflectance values. Both will ultimately be stored as co-located 2D arrays of pixel values at footprint locations.

The first step will be to create a `NAIPDatasetWrapper` that will allow us to access LiDAR and NAIP data, as well as store relevant metadata such as the spatial resolution and sample footprint size in meters.

In [None]:
dataset = NAIPDatasetWrapper(
    acquisitions=[MAL2019],
    footprint=(30, 30),
    naip_res=1.0,
    lidar_res=30.0,
)
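
As a rough sanity check on these parameters, the sketch below estimates the pixel dimensions of each extracted array. It assumes the footprint is given in meters (as described above) and that each array simply spans the footprint at the given resolution. `pixel_shape` is a hypothetical helper, not part of `naip_cnn`, and the wrapper's own `naip_shape` and `lidar_shape` attributes may be computed differently.

```python
def pixel_shape(footprint, resolution):
    """Hypothetical helper: pixels spanned by a footprint (m) at a resolution (m/pixel)."""
    height, width = footprint
    return (round(height / resolution), round(width / resolution))


# Values mirror the parameters passed to NAIPDatasetWrapper above.
naip_pixels = pixel_shape((30, 30), 1.0)
lidar_pixels = pixel_shape((30, 30), 30.0)
print(naip_pixels, lidar_pixels)
```

Under these assumptions, each sample pairs a 30 by 30 pixel NAIP array with a single LiDAR pixel.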

Based on the acquisition and resolution parameters defined above, we can load coincident `ee.Image` mosaics of LiDAR and NAIP data.

In [None]:
lidar = dataset.load_lidar()
naip = dataset.load_naip()

## Generate Sample Footprints

We'll extract data across a collection of footprints. To begin, we'll distribute points across the LiDAR image with a minimum spacing.

In [None]:
samples = lidar.sample(
    scale=dataset.spacing,
    projection=dataset.acquisitions[0].proj,
    # This is set to keep the total number of samples exportable, and may need to be
    # adjusted based on the size of the LiDAR acquisition.
    factor=0.08,
    dropNulls=True,
    geometries=True,
)

We can call `samples.size()` to find out how many non-masked pixels were sampled at the given spacing.

In [None]:
samples.size().getInfo()

Next, we'll convert our collection of sample points to square footprints with a given size.

In [None]:
footprints = samples.map(
    lambda p: sampling.point_to_footprint(
        p, dims=dataset.footprint, proj=dataset.acquisitions[0].proj
    )
)

## Extract Pixel Values

With our footprints defined, we can extract arrays of pixel values from the LiDAR and NAIP images.

In [None]:
def extract_lidar(footprint: ee.Feature):
    return sampling.extract_values_at_footprint(
        footprint, img=lidar, proj=dataset.acquisitions[0].proj, scale=dataset.lidar_res
    )


def extract_naip(footprint: ee.Feature):
    return sampling.extract_values_at_footprint(
        footprint, img=naip, proj=dataset.acquisitions[0].proj, scale=dataset.naip_res
    )


footprints = footprints.map(extract_lidar, opt_dropNulls=True).map(
    extract_naip, opt_dropNulls=True
)

## Export to Drive

Extracting large numbers of pixel values across footprints is memory- and time-intensive, so we'll need to export the data to Drive rather than directly accessing it client-side. This process can take a while, and progress can be monitored in the [task manager](https://code.earthengine.google.com/tasks).

***Note**: Make sure you're authenticated with the correct Earth Engine account, as this will determine the Drive to which data is exported.*

In [None]:
task = ee.batch.Export.table.toDrive(
    collection=footprints,
    description=dataset.name,
)

task.start()
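
If you'd rather wait on the export from the notebook than watch the task manager, a polling loop along these lines may help. This is a minimal sketch: `wait_for_task` is a hypothetical helper that relies only on the `active()` and `status()` methods of `ee.batch.Task`, and a stand-in task object is used so the example runs without Earth Engine credentials.

```python
import time


def wait_for_task(task, poll_seconds=30.0):
    """Hypothetical helper: block until the task stops running, then return its state."""
    while task.active():
        time.sleep(poll_seconds)
    return task.status()["state"]


class FakeTask:
    """Stand-in for ee.batch.Task that reports as active for a fixed number of polls."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def active(self):
        self._remaining -= 1
        return self._remaining >= 0

    def status(self):
        return {"state": "COMPLETED"}


final_state = wait_for_task(FakeTask(polls_until_done=2), poll_seconds=0.01)
print(final_state)
```

With the real task from the cell above, this would be called as `wait_for_task(task)`. Check that the returned state is `COMPLETED` before downloading the CSV.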

## Convert CSV To HDF5

*Once the export task is complete*, you'll need to download the resulting CSV file to local storage. 

The CSV format exported by Earth Engine is not optimized for training, so we'll need to convert it to an HDF5 file. In the process, we'll reshape the pixel arrays, which are currently stored as comma-separated strings, into 2D arrays.

In [None]:
from dask.distributed import Client
import dask.dataframe as dd
import numpy as np
import pandas as pd
import h5py
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

We'll use Dask to parallelize the loading and processing steps. Creating a local Dask client will allow us to monitor progress in the Dask dashboard.

In [None]:
client = Client()
client

Load the CSV file as a deferred Dask dataframe. You'll see that it includes columns for each of the NAIP bands, LiDAR attributes, and a few other metadata columns.

In [None]:
df = dd.read_csv("../" + dataset.csv_path.as_posix())
df

Next, we'll parse the 1D strings into 2D arrays. We'll stack the NAIP arrays into a single `image` column, while each of the LiDAR labels will be kept in its own column, allowing us to easily train on a single label.

In [None]:
df["image"] = df.apply(
    sampling.parse_pixel_array,
    shape=dataset.naip_shape,
    col=("R", "G", "B", "N"),
    axis=1,
    meta=pd.Series(dtype=np.uint8),
)

for label in ("cover", "rh25", "rh50", "rh95", "rh100"):
    df[label] = df.apply(
        sampling.parse_pixel_array,
        shape=dataset.lidar_shape,
        col=label,
        axis=1,
        meta=pd.Series(dtype=np.float32),
    )
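
To make the parsing step concrete, the sketch below shows one way a row of comma-separated pixel strings could be turned into arrays. It is not the implementation of `sampling.parse_pixel_array`: `parse_pixels` is a hypothetical helper, and the string format (plain comma-separated values in row-major order) and the 2 by 2 shape are assumptions chosen for illustration.

```python
import numpy as np


def parse_pixels(text, shape, dtype):
    """Hypothetical helper: parse "1,2,3,4" into an array with the given 2D shape."""
    values = [float(v) for v in text.split(",")]
    return np.array(values).reshape(shape).astype(dtype)


# A toy row with four NAIP bands and one LiDAR label, each covering 2 x 2 pixels.
row = {
    "R": "1,2,3,4",
    "G": "5,6,7,8",
    "B": "9,10,11,12",
    "N": "13,14,15,16",
    "cover": "0.5,10.0,55.5,100.0",
}

# Stack the bands along the last axis to get a (height, width, bands) image.
image = np.stack(
    [parse_pixels(row[band], (2, 2), np.uint8) for band in ("R", "G", "B", "N")],
    axis=-1,
)
cover = parse_pixels(row["cover"], (2, 2), np.float32)
print(image.shape, cover.shape)
```

The channels-last layout matches the `[:, :, :3]` indexing used when plotting footprints at the end of this notebook.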

Now, we can drop the unused columns.

In [None]:
df = df.drop(columns=["R", "G", "B", "N", ".geo", "height", "width", "system:index"])

Next, we'll load the entire dataframe into memory so that we can shuffle it, split it, and save it out to HDF5 files.

In [None]:
df = df.compute().sample(frac=1, random_state=42)

Split the data into training (80%), validation (10%), and test (10%) sets. Splitting now means the training process never has to load the entire dataset into memory just to shuffle it properly.

In [None]:
train, holdout = train_test_split(df, train_size=0.8, random_state=42)
val, test = train_test_split(holdout, train_size=0.5, random_state=42)

In [None]:
# Write each split to its own gzip-compressed HDF5 file alongside the CSV.
for split, split_df in {"train": train, "val": val, "test": test}.items():
    dst_path = (
        "../"
        + dataset.csv_path.with_name(
            dataset.csv_path.stem + f"_{split}" + ".h5"
        ).as_posix()
    )
    with h5py.File(dst_path, "w") as f:
        for var in ("image", "cover", "rh25", "rh50", "rh95", "rh100"):
            f.create_dataset(
                var, data=np.stack(split_df[var].values), compression="gzip"
            )

Just to get an idea of the data we'll be training with, we can plot a few footprints.

In [None]:
n = 3
check_footprints = train.sample(n=n, random_state=99)

fig, ax = plt.subplots(n, 3, figsize=(6, n * 2))
for i in range(n):
    ax[i, 0].imshow(check_footprints["image"].values[i][:, :, :3])
    ax[i, 1].imshow(check_footprints["cover"].values[i], vmin=0, vmax=100)
    ax[i, 2].imshow(check_footprints["rh95"].values[i], vmin=0, vmax=100)

ax[0, 0].set_title("NAIP")
ax[0, 1].set_title("COVER")
ax[0, 2].set_title("RH95");