# Create Training Dataset for DRC Road Classification

The Random Forest classifier used in the [drc_roads_classification](drc_roads_classification.ipynb) notebook requires a training dataset. In this notebook, we create that dataset from pre-made label images. The label images were created in [GIMP](https://www.gimp.org/) by manually labeling road and forest regions in a small crop of the OrthoTile strip AOI image produced in [drc_roads_classification](drc_roads_classification.ipynb).

In [45]:
import os

import numpy as np
import rasterio
from skimage import feature, filters

## Define input data

In [46]:
# labeled images, generated in GIMP, along with cropped OrthoTile as source image
data_dir = 'pre-data'
forest_image_filename = os.path.join(data_dir, 'mosaic_crop_forest.tif')
road_image_filename = os.path.join(data_dir, 'mosaic_crop_road.tif')
bands_image_filename = os.path.join(data_dir, 'mosaic_crop.tif')

## Prepare feature bands

All of the code in this section was copied from the [drc_roads_classification](drc_roads_classification.ipynb) notebook. The training dataset must be processed exactly as the prediction dataset is, so that the classifier sees features computed the same way at training and prediction time.

In [47]:
def load_4band(filename):
    with rasterio.open(filename, 'r') as src:
        # orthotile band ordering: blue, green, red, nir
        b, g, r, n = src.read() 

        mask = b == 0 # 0 is NoData

    return [np.ma.array(band, mask=mask) for band in [b, g, r, n]]

bands = load_4band(bands_image_filename)

In [48]:
def get_texture_features(bands):
    green_band = bands[1]
    edges1 = feature.canny(green_band, low_threshold=0.0, high_threshold=0.01, sigma=2)
    blurred = filters.gaussian(edges1, sigma=2)
    blurred2 = filters.gaussian(edges1, sigma=6)
    return [np.ma.array(texture_band, mask=green_band.mask)
            for texture_band in (blurred, blurred2)]

feature_bands = bands + get_texture_features(bands)

## Apply Label masks to features

In this step, we load the images that represent the label masks for the forest and road classes. We convert them to binary masks and apply them to the feature dataset to obtain two sets of data points: one containing features for the forest class and one containing features for the road class.

In [49]:
# load labeled images as boolean masks

def get_label_mask(image_filename):
    with rasterio.open(image_filename, 'r') as src:
        band = src.read(1)
        label_data = band == 0 # valid data in black regions
        label_mask = ~label_data # mask True (masked) for not valid data
    return label_mask

def get_unmasked_count(mask):
    return np.size(mask) - np.count_nonzero(mask)

forest_mask = get_label_mask(forest_image_filename)
print(get_unmasked_count(forest_mask))
road_mask = get_label_mask(road_image_filename)
print(get_unmasked_count(road_mask))

578330
17757



In [50]:
# apply label masks to feature bands to get 2D array of feature values associated with
# pixels of that label
def get_label_pixels(label_mask, bands):
    pixels = np.array([np.ma.array(b, mask=np.ma.mask_or(b.mask, label_mask)).compressed()
                       for b in bands])
    return pixels.swapaxes(0,1) # order by pixel then by band

forest_pixels = get_label_pixels(forest_mask, feature_bands)
road_pixels = get_label_pixels(road_mask, feature_bands)
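
To make the mask logic above concrete, here is a minimal sketch on a toy 2x3 image with two bands. The arrays and their values are made up for illustration and only NumPy is assumed. It follows the same convention as `get_label_mask` (black pixels, value 0, are the labeled ones) and the same mask combination as `get_label_pixels`.

```python
import numpy as np

# Toy label image: 0 (black) marks labeled pixels, 255 marks everything else.
label_image = np.array([[0, 255, 0],
                        [255, 0, 0]], dtype=np.uint8)
label_mask = label_image != 0  # True (masked) where the pixel is NOT labeled

# Two toy feature bands sharing a NoData mask (top-left pixel is NoData).
nodata = np.array([[True, False, False],
                   [False, False, False]])
band_a = np.ma.array(np.arange(6).reshape(2, 3), mask=nodata)
band_b = np.ma.array(np.arange(6).reshape(2, 3) * 10, mask=nodata)

# Combine the NoData and label masks, then keep only unmasked values per band.
pixels = np.array([
    np.ma.array(b, mask=np.ma.mask_or(b.mask, label_mask)).compressed()
    for b in (band_a, band_b)
])
pixels = pixels.swapaxes(0, 1)  # rows are pixels, columns are bands

print(pixels.shape)  # (3, 2)
print(pixels)
```

Only the three pixels that are both labeled and not NoData survive, so the result has one row per surviving pixel and one column per band.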

## Prepare training datasets

First, we create a balanced training set by randomly sampling each class down to the size of the smallest class. Then we create the X (features) and y (class) datasets in the format used by scikit-learn.

In [51]:
def make_same_size_samples(list_of_pixels):
    sample_len = min([p.shape[0] for p in list_of_pixels])

    def sample_pixels(pixels):
        if pixels.shape[0] > sample_len:
            pixel_sample = pixels.copy()
            np.random.shuffle(pixel_sample)
            pixel_sample = pixel_sample[:sample_len]
        else:
            pixel_sample = pixels
        return pixel_sample
    
    return [sample_pixels(p) for p in list_of_pixels]

[forest_pixels_sample, road_pixels_sample] = \
    make_same_size_samples([forest_pixels, road_pixels])

print(forest_pixels_sample.shape)
print(road_pixels_sample.shape)

(17757, 6)
(17757, 6)
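
Because `np.random.shuffle` draws from NumPy's global random state, each run of the cell above selects a different forest sample. If a repeatable training set is wanted, one option is to sample with a seeded generator. The sketch below is a variation and not the notebook's code: the function name `sample_rows`, the seed value, and the toy arrays are all hypothetical.

```python
import numpy as np

def sample_rows(pixels, sample_len, rng):
    """Return sample_len rows of pixels, drawn without replacement."""
    if pixels.shape[0] <= sample_len:
        return pixels
    idx = rng.choice(pixels.shape[0], size=sample_len, replace=False)
    return pixels[idx]

rng = np.random.default_rng(42)      # seed value is arbitrary
big = np.arange(60).reshape(10, 6)   # stand-in for the larger class
small = np.arange(18).reshape(3, 6)  # stand-in for the smaller class

sample_len = min(big.shape[0], small.shape[0])
big_sample = sample_rows(big, sample_len, rng)
small_sample = sample_rows(small, sample_len, rng)

print(big_sample.shape, small_sample.shape)
```

Drawing row indices instead of shuffling a full copy also avoids duplicating the larger array, which may matter when one class has far more pixels than the other.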


In [52]:
forest_label_value = 0
road_label_value = 1
X = np.concatenate((forest_pixels_sample, road_pixels_sample), axis=0)
y = np.array(forest_pixels_sample.shape[0] * [forest_label_value] +
             road_pixels_sample.shape[0] * [road_label_value])
    
print(X.shape)
print(y.shape)

(35514, 6)
(35514,)
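
As a quick sanity check on this layout, the labels in `y` should line up with the rows of `X` and each class should contribute the same number of entries. The sketch below uses small stand-in arrays (hypothetical, 4 pixels per class) and builds `y` with `np.full`, an equivalent alternative to multiplying Python lists.

```python
import numpy as np

forest_label_value = 0
road_label_value = 1

# Stand-ins for the sampled pixel arrays: 4 pixels per class, 6 features each.
forest_sample = np.zeros((4, 6))
road_sample = np.ones((4, 6))

X = np.concatenate((forest_sample, road_sample), axis=0)
y = np.concatenate((
    np.full(forest_sample.shape[0], forest_label_value),
    np.full(road_sample.shape[0], road_label_value),
))

# Each class should contribute the same number of labels.
counts = np.bincount(y)
print(counts)  # [4 4]
```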


In [53]:
# save to file system; np.savez appends the .npz extension to the filename
output_file = os.path.join(data_dir, 'classification_training')
np.savez(output_file, X=X, y=y)
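
For reference, a notebook that consumes this file could read it back as sketched below. `np.savez` adds the `.npz` extension when the filename lacks one, so the loader has to include it. The toy arrays and the temporary directory are assumptions made so that the example runs on its own.

```python
import os
import tempfile

import numpy as np

X = np.arange(12, dtype=float).reshape(2, 6)  # toy features
y = np.array([0, 1])                          # toy labels

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = os.path.join(tmp_dir, 'classification_training')
    np.savez(output_file, X=X, y=y)  # writes classification_training.npz

    # np.load returns a lazily read NpzFile; the context manager closes it
    with np.load(output_file + '.npz') as data:
        X_loaded = data['X']
        y_loaded = data['y']

print(np.array_equal(X, X_loaded), np.array_equal(y, y_loaded))  # True True
```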