This notebook creates training patches for the landcover transfer learning task. The idea is to take the patches that we have already extracted, whose coordinates are stored in the patch metadata, and write an associated numpy mask of landcover labels for each one. We assume the raw landcover rasters have already been downloaded using the `landcover.ipynb` notebook. The overall process is:

1. Build a VRT from all the landcover tiles
2. Define patches whose locations will be used to extract masks

  a. Ensure that the patch metadata is in the same CRS as the VRT
  
  b. Keep all the patches with any glacier present
  
  c. (optional) Subsample the patches that contain no glacier
  
3. For each patch...

  a. Perform a windowed read of the underlying VRT at that patch's location
  
  b. Ensure that the mask is multichannel, with one channel per landcover type
  
  c. Write that data as a mask
  
  d. Copy the input data to the same transfer data directory
  
4. Shuffle those into training, development, and test datasets.
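
Step 3b can be sketched as follows. This is only a minimal illustration, assuming the landcover raster stores one integer class code per pixel; the helper name `to_multichannel` and the class codes in the example are hypothetical and are not taken from the landcover product. The channels-last layout is chosen to match the `(H, W, C)` arrays loaded from the image patches later in this notebook.

```python
import numpy as np

def to_multichannel(label_mask, class_codes):
    """Convert an (H, W) integer label mask into an (H, W, C) binary mask.

    Channel k is 1 where the pixel equals class_codes[k] and 0 elsewhere.
    """
    label_mask = np.asarray(label_mask)
    channels = [label_mask == code for code in class_codes]
    return np.stack(channels, axis=-1).astype(np.uint8)

# Hypothetical class codes, for illustration only.
labels = np.array([[10, 20], [20, 30]])
mask = to_multichannel(labels, class_codes=[10, 20, 30])
```

Pixels whose code is not listed in `class_codes` end up as all zeros across channels, so any "no data" value would need to be handled explicitly.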

In principle, we could also create metadata files with paths to the training / development / test data, but this would be a bit inconsistent with how we train the main glacier model, and we want to reuse as much code as possible.
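
The shuffle in step 4 could look roughly like the sketch below. The split fractions, the seed, the file names, and the helper name `split_paths` are all assumptions made for illustration; the notebook does not specify them.

```python
import random

def split_paths(paths, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle `paths` and split them into train / dev / test lists.

    `fractions` gives the train and dev shares; the test split receives
    whatever remains, so no path is dropped by rounding.
    """
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # seeded, so the split is reproducible
    n_train = int(fractions[0] * len(paths))
    n_dev = int(fractions[1] * len(paths))
    return {
        "train": paths[:n_train],
        "dev": paths[n_train:n_train + n_dev],
        "test": paths[n_train + n_dev:],
    }

# Hypothetical file names, for illustration only.
all_paths = [f"patch_{i}.npy" for i in range(10)]
splits = split_paths(all_paths)
```

Each list can then be copied into its own directory, which keeps the layout compatible with a directory-based data loader rather than a metadata file.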

Before implementing this more generally, let's test out the process on a single tile.

In [None]:
import rasterio
import numpy as np
import geopandas as gpd
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline

landcover_dir = Path("/datadrive/glaciers/landcover")
patch_geojson = Path("/datadrive/glaciers/processed/patches/patches.geojson")

In [None]:
import subprocess

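def vrt_from_dir(input_dir, output_path="./output.vrt", **kwargs):
    # Collect every GeoTIFF (*.tif / *.tiff) in the directory.
    inputs = [str(f) for f in input_dir.glob("*.tif*")]
    # gdalbuildvrt takes the output VRT as a positional argument, not via -o.
    subprocess.run(["gdalbuildvrt", str(output_path)] + inputs, check=True)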

vrt_from_dir(landcover_dir, landcover_dir / "landcover.vrt")

In [None]:
patch_metadata = gpd.read_file(patch_geojson)
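# Reproject the patches to the CRS of the landcover VRT built above.
with rasterio.open(landcover_dir / "landcover.vrt") as vrt:
    patch_metadata = patch_metadata.to_crs(vrt.crs)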
positive_ix = patch_metadata.mask_mean_0 > 0
patch_keep = patch_metadata[positive_ix]
patch_zero = patch_metadata[patch_metadata.mask_mean_0 == 0]

# Subsample the glacier-free patches, capped at 1000x the number of glacier patches.
n_sample = min(len(patch_zero), 1000 * len(patch_keep))
patch_random = patch_zero.sample(n = n_sample)
patch_keep = pd.concat([patch_keep, patch_random])

In [None]:
from rasterio.windows import from_bounds
from scipy import interpolate
grid_one = np.linspace(0, 512, 512, endpoint=False)
patch_grids = [grid_one, grid_one]

landcover = rasterio.open(landcover_dir / "landcover.vrt")

for i in range(min(200, len(patch_keep))):
    patch_ = list(patch_keep.geometry.iloc[i].bounds)
    mask = landcover.read(window=from_bounds(*patch_, landcover.transform))
    for channel in range(1):
        mask_im = mask[channel]
        nmax = np.nanmax(mask_im)
        if not (np.isnan(nmax) or nmax == 0):
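            # Use positional indexing: after filtering and concatenation the
            # index labels of patch_keep no longer run from 0 to n - 1.
            x = np.load(patch_keep.img_slice.iloc[i])[:, :, [4, 3, 1]]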
            nx = np.nanmax(x)
            if not (np.isnan(nx) or nx == 0):
                plt.imshow(x / nx)
                plt.show()

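                mask_im = mask_im / nmax
                # Stretch the mask's pixel grid over the 512 x 512 patch grid so
                # that evaluating at `patch_grids` resamples the whole mask
                # instead of extrapolating past its edge.
                mask_grids = [
                    np.linspace(0, 512, mask_im.shape[1], endpoint=False),
                    np.linspace(0, 512, mask_im.shape[0], endpoint=False)
                ]

                # Note: interp2d is deprecated and was removed in SciPy 1.14.
                f_interp = interpolate.interp2d(*mask_grids, mask_im)
                mask_im = f_interp(*patch_grids)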

                plt.imshow(mask_im)
                plt.show()
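
When the masks are written out rather than just plotted, linear interpolation may not be the best choice: landcover labels are categorical, and interpolating between two class codes can produce a value that corresponds to no class. One alternative is nearest-neighbour resampling, sketched below with plain numpy. The helper name `resize_nearest` is hypothetical, and the 512 x 512 default simply mirrors the patch grid used above.

```python
import numpy as np

def resize_nearest(mask, out_shape=(512, 512)):
    """Resample a 2D label array to `out_shape` by nearest-neighbour lookup."""
    mask = np.asarray(mask)
    # Map the centre of each output pixel to a source row / column index.
    rows = ((np.arange(out_shape[0]) + 0.5) * mask.shape[0] / out_shape[0]).astype(int)
    cols = ((np.arange(out_shape[1]) + 0.5) * mask.shape[1] / out_shape[1]).astype(int)
    return mask[np.ix_(rows, cols)]

small = np.array([[1, 2], [3, 4]])
big = resize_nearest(small, out_shape=(4, 4))
```

Because every output pixel is copied from an input pixel, the result contains only class codes that were present in the original mask.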