# Supervised Classification Training Data Selector
The markdown cells have been designed to work with the 'Table of Contents (2)' Jupyter notebook extension.
This extension is highly recommended. If you don't have it yet (and are working on the VDI with the 'agdc-py3-prod' module),
select "Edit" on the menu bar above, click the "nbextension config" button at the bottom of the menu, and enable
the extension. The 'Collapsible Headings' extension is also highly recommended.

This notebook lets you create a training dataset for supervised classification.
It was written specifically for the urban change detection project; however, it should not be hard to modify the code for use in a range of other applications.

The results of the training dataset building are exported as a pickled pandas dataframe.

This was written by Mike Barnes as part of his third graduate rotation, during January 2018.
Any questions, please contact me at michael.barnes@ga.gov.au

## Python Library Imports

In [1]:
%matplotlib notebook
import os

import numpy as np
import pandas as pd
import xarray as xr

import datacube
from datacube.helpers import ga_pq_fuser
from datacube.storage import masking
from datacube.storage.masking import mask_to_dict

from sklearn import preprocessing
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import matplotlib.colors as colors
import matplotlib.patches as mpatches

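# GDAL's Python bindings live in the osgeo package; the bare 'import gdal' is not available in newer releases
from osgeo import gdal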

import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider, FloatSlider, Dropdown
from IPython.display import display

from skimage import exposure
from scipy.signal import lfilter

import datetime

import warnings

import collections

## Functions for Loading Data and Building the Xarray
This project builds on existing work by Peter Tan. Peter's urban change detection algorithm outputs raster files containing all the relevant NBAR data (analysis-ready, satellite-derived surface reflectance) and saves them to the output directory. To speed up loading and analysis, this notebook uses those existing files if they are available. Otherwise, it loads the data from the Digital Earth Australia archive.
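
The choice between local files and the archive can be sketched on its own, without touching the datacube. This is a minimal illustration rather than the notebook's implementation: `choose_source` is a hypothetical helper, the folder layout (one directory per study area containing `NBAR_<band>.img` files) is taken from the functions below, and a temporary directory stands in for the real output directory.

```python
import os
import tempfile

def choose_source(study_area, root):
    """Return 'local' if root/study_area holds NBAR .img files, else 'datacube'."""
    folder = os.path.join(root, study_area)
    if os.path.isdir(folder):
        names = os.listdir(folder)
        if any(n.startswith('NBAR') and n.endswith('.img') for n in names):
            return 'local'
    return 'datacube'

# Build a throwaway directory tree to exercise both outcomes.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'mtbarker'))
    open(os.path.join(root, 'mtbarker', 'NBAR_red.img'), 'w').close()
    results = (choose_source('mtbarker', root), choose_source('swmelb', root))

print(results)
```

Checking for the NBAR files, not just the directory, covers the case of an appropriately named but empty folder.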

### function: checkForLocalFiles

In [2]:
def checkForLocalFiles(study_area):
    """A quick boolean function to test if the requested named study area has a local directory"""
    # the membership test already evaluates to True or False
    return study_area in os.listdir('../')

### function: getData

In [3]:
def getData(study_area):
    """This is the main function to retrieve the NBAR Landsat data and return it as an Xarray."""
    if isinstance(study_area, str):
        # if the study area is a string, and is accessible locally, load it
        if checkForLocalFiles(study_area):
            data = getLocalData(study_area)
        # otherwise, if it is one of the named study areas, load it from the datacube
        else:
            data = DCLoadName(study_area)
            data = transformXarrayToCustomStyle(data)
        return data

    # if the study area is a list of coordinates, use them to load the data
    elif isinstance(study_area, list) and len(study_area) == 4:
        data = DCLoad(study_area)
        data = transformXarrayToCustomStyle(data)
        return data        

    else:
        print('Data Loading Error')

### function: DCLoadName
This function is a wrapper for the DCLoad function. It allows previously used study areas to be restudied
by loading exactly the same area of interest (AOI).

In [4]:
def DCLoadName(study_area):
    """
    Quick way to load the study areas used for the associated project report if the local files are not available.
    
    This function is a wrapper on the DCLoad function.
    """
    if study_area == 'mtbarker':
        lat_min = -35.05
        lat_max = -35.08
        lon_min = 138.85
        lon_max = 138.895  
    elif study_area == 'swmelb':
        lat_min = -37.879
        lat_max = -37.91
        lon_min = 144.705
        lon_max = 144.76  
    elif study_area == 'gunghalin':
        lat_min = -35.18
        lat_max = -35.21
        lon_min = 149.14
        lon_max = 149.17
    elif study_area == 'goldengrove': 
        lat_min = -34.77
        lat_max = -34.8
        lon_min = 138.66
        lon_max = 138.73
    elif study_area == 'molonglo':
        lat_min = -35.3
        lat_max = -35.33
        lon_min = 149.015
        lon_max = 149.06
    elif study_area == 'nperth':
        lat_min = -31.686
        lat_max = -31.73
        lon_min = 115.79
        lon_max = 115.813
    elif study_area == 'swbris':
        lat_min = -27.66
        lat_max = -27.7 
        lon_min = 152.877
        lon_max = 152.93
    elif study_area == 'swsyd':
        lat_min = -33.993
        lat_max = -34.04
        lon_min = 150.715 
        lon_max = 150.78
    elif study_area == 'goolwa':
        lat_min = -35.49
        lat_max = -35.522
        lon_min = 138.761
        lon_max = 138.83
    elif study_area == 'gladstone':
        lat_min = -23.868
        lat_max = -23.903
        lon_min = 152.22
        lon_max = 152.265
    elif study_area == 'goldcoast':
        lat_min = -28.08
        lat_max = -28.125
        lon_min = 153.360
        lon_max = 153.4
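    elif study_area == 'newcastle':
        lat_min = -32.895
        lat_max = -32.918
        lon_min = 151.59
        lon_max = 151.62
    else:
        # fail early with a clear message instead of on undefined coordinates below
        raise ValueError('Unknown study area: ' + str(study_area))
    
    return DCLoad([lat_min, lat_max, lon_min, lon_max])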
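
The chain of named study areas can also be written as a lookup table. The following is a minimal sketch, not the notebook's implementation: `STUDY_AREA_BOUNDS` and `lookup_bounds` are hypothetical names, and only two of the areas are reproduced, using the coordinates listed above. An unknown name raises a `ValueError`.

```python
# Hypothetical lookup table: name -> [lat_min, lat_max, lon_min, lon_max],
# in the same order that DCLoad unpacks its argument.
STUDY_AREA_BOUNDS = {
    'mtbarker': [-35.05, -35.08, 138.85, 138.895],
    'swmelb': [-37.879, -37.91, 144.705, 144.76],
}

def lookup_bounds(study_area):
    """Return the bounding box for a named study area, or raise ValueError."""
    try:
        return list(STUDY_AREA_BOUNDS[study_area])
    except KeyError:
        raise ValueError('Unknown study area: ' + repr(study_area)) from None

print(lookup_bounds('mtbarker'))
```

Adding a new study area then means adding one dictionary entry instead of another branch.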

### function: DCLoad
This function is a variation of a datacube query supplied by Erin Telfer.

In [5]:
def DCLoad(study_area):
    """This function is a variation of a datacube query supplied by Erin Telfer.
    This function takes the 4 coordinates as an input, queries the AGDC, and concatenates the NBAR data from
    all three Landsat sensors into a single output file."""
    
    # to time how long the load takes
    start = datetime.datetime.now()
    print('Loading data') 
    print('Load Started At: ' + str(start))
    
    # define temporal range 
    start_of_epoch = '1987-01-01'
    end_of_epoch =  '2017-10-31'

    # define bands of interest
    bands_of_interest = ['blue', 'green', 'red', 
                         'nir', 'swir1', 'swir2']

    # Landsat sensors of interest are defined
    sensors = ['ls8', 'ls7', 'ls5'] 

    # unpack input parameter
    lat_min, lat_max, lon_min, lon_max = study_area    

    print('Bounding box: ' + str(lat_min) + ' S, ' + str(lon_min) +
          ' E to ' + str(lat_max) + ' S, ' + str(lon_max) + ' E' )
    print('Epoch: ' + start_of_epoch + ' to ' + end_of_epoch)
    print('Sensors: ' + str(sensors))
    print('Bands of Interest: ' + str(bands_of_interest))

    # create query
    query = {'time': (start_of_epoch, end_of_epoch),}
    query['x'] = (lon_min, lon_max)
    query['y'] = (lat_max, lat_min)
    query['crs'] = 'EPSG:4326'

    # Create cloud mask. This defines which pixel quality (PQ) artefacts would be removed from the results.
    # It should be noted the "land_sea" code will remove all ocean/sea pixels.
    # Note: this dictionary is defined here but is not applied anywhere else in this function.
    mask_components = {'cloud_acca':'no_cloud',
    'cloud_shadow_acca' :'no_cloud_shadow',
    'cloud_shadow_fmask' : 'no_cloud_shadow',
    'cloud_fmask' :'no_cloud',
    'blue_saturated' : False,
    'green_saturated' : False,
    'red_saturated' : False,
    'nir_saturated' : False,
    'swir1_saturated' : False,
    'swir2_saturated' : False,
    'contiguous':True,
    'land_sea': 'land'}

    # Connect to DataCube
    dc = datacube.Datacube(app='Urban Change Detection')
    
    # Data for each Landsat sensor is retrieved and saved in a dict for concatenation
    sensor_clean = {}
    
    for sensor in sensors:
        # Load the NBAR and corresponding PQ
        sensor_nbar = dc.load(product= sensor+'_nbar_albers', group_by='solar_day', 
                              measurements = bands_of_interest,  **query)
        sensor_pq = dc.load(product= sensor+'_pq_albers', group_by='solar_day', 
                            fuse_func=ga_pq_fuser, **query)

        # Retrieve the projection information before masking/sorting
        crs = sensor_nbar.crs
        crswkt = sensor_nbar.crs.wkt
        affine = sensor_nbar.affine        

        # Combine the PQ and the NBAR so they form a single dataset
        sensor_all = xr.auto_combine([sensor_pq,sensor_nbar])
        sensor_clean[sensor] = sensor_all

        print('Loaded %s' % sensor) 

    print('Concatenating')
    nbar_clean = xr.concat(sensor_clean.values(), 'time')
    nbar_clean = nbar_clean.sortby('time')
    nbar_clean.attrs['crs'] = crs
    nbar_clean.attrs['affine'] = affine

    print ('Load and Xarray build complete')
    print('Process took ' + str(datetime.datetime.now() - start))
    
    # return the combined xarray; getData converts it to the custom style used by this workflow
    return nbar_clean
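
The query dictionary built inside DCLoad can be factored into a small helper. This sketch only builds the dictionary and makes no datacube calls; `build_query` is a hypothetical name. It sorts each pair of bounds, which is an added assumption for convenience, since the named study areas list the less negative latitude as `lat_min`.

```python
def build_query(study_area, start='1987-01-01', end='2017-10-31'):
    """Build a datacube-style query dict from [lat_a, lat_b, lon_a, lon_b]."""
    lat_a, lat_b, lon_a, lon_b = study_area
    return {
        'time': (start, end),
        'x': tuple(sorted((lon_a, lon_b))),   # longitudes, west to east
        'y': tuple(sorted((lat_a, lat_b))),   # latitudes, south to north
        'crs': 'EPSG:4326',
    }

query = build_query([-35.05, -35.08, 138.85, 138.895])
print(query['y'])
```

Because the bounds are sorted, the caller does not need to remember which latitude goes first.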

### function: getLocalData

In [6]:
def getLocalData(study_area):
    """A quick helper function to load the output files from Peter's code for the given location.
    It returns an Xarray of the Landsat data for that study area."""
    # build a list of all files in the directory (ie the folder for that location)
    location = '../' + study_area + '/'
    files = os.listdir(location)

    print('Loading data from: ' + location)
    
    # build a list of all the NBAR*.img file names and which bands they represent
    NBARfiles = []
    bands = []
    for file in files:
        if file[-4::] == '.img' and file[0:4] == 'NBAR':
            NBARfiles.append(file)
            bands.append(file.split('NBAR_')[1].split('.img')[0])

    # just catching the random case of when an appropriately named directory exists, but there are no
    # relevant NBAR .img files
    if len(NBARfiles) == 0:
        return DCLoadName(study_area)

    # open all the .img files with NBAR in the name, convert to numpy array, swap axes so order is (x, y, t)
    # and save to dict
    raw_data = {}
    for i in range(len(NBARfiles)):
        raw_data[bands[i]] = gdal.Open(location + NBARfiles[i]).ReadAsArray().swapaxes(0,2)
#     num_scenes = len(raw_data['red'][0][0])   # delete this?

    # build a list of all the dates represented by each band in the NBAR files
    # reuse the list of NBAR file names, but this time access the .hdr file
    in_dates = False
    dates = []
    for line in open(location + NBARfiles[0].split('.img')[0] + '.hdr'):
        if line[0] == '}':
            continue
        if in_dates:
            dates.append(line.split(',')[0].strip())
        if line[0:10] == 'band names':
            in_dates = True

    # save list of satellite originated bands
    sat_bands = bands.copy()

    # add the yet to be calculated derivative bands to the overall bands list
    bands += ['cloud_mask']

    # building the Xarray
    # define the size for the numpy array that will hold all the data for conversion into XArray
    x = len(raw_data['red'])
    y = len(raw_data['red'][0])
    t = len(raw_data['red'][0][0])
    n = len(bands)

    # create an empty numpy array of the correct size
    alldata = np.zeros((x, y, t, n), dtype=np.float32)

    # populate the numpy array with the satellite data
    # turn all no data NBAR values to NaNs
    for i in range(len(sat_bands)):
        alldata[:,:,:,i] = raw_data[sat_bands[i]]
        alldata[:,:,:,i][alldata[:,:,:,i] == -999] = np.nan

    # convert the numpy array into an xarray, with appropriate labels and axes names
    data = xr.DataArray(alldata, coords = {'x':range(x), 'y':range(y), 'date':dates, 'band':bands},
                 dims=['x', 'y', 'date', 'band'])
    
    # import cloudmask and add to xarray
    # location already ends with '/', so no extra separator is needed
    cloudmask = gdal.Open(location + 'tsmask.img').ReadAsArray().swapaxes(0,2)
    data.loc[:,:,:,'cloud_mask'] = cloudmask
    
    return data
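
The date parsing in getLocalData can be tried in isolation. The sketch below is illustrative only: `read_band_dates` is a hypothetical helper, and the sample header is invented to resemble an ENVI-style `.hdr` file with one date per band name; the real files produced by Peter's code may differ. Unlike the loop above, it stops at the closing brace rather than skipping it.

```python
def read_band_dates(header_lines):
    """Collect the entries listed after a 'band names' line in an ENVI-style header."""
    dates = []
    in_dates = False
    for line in header_lines:
        if in_dates:
            if line.startswith('}'):
                break  # end of the band names block
            dates.append(line.split(',')[0].strip())
        elif line.startswith('band names'):
            in_dates = True
    return dates

# Hypothetical header content, for illustration only.
sample_header = [
    'ENVI',
    'bands = 3',
    'band names = {',
    '1987-05-22,',
    '1987-06-07,',
    '1987-09-11',
    '}',
]
print(read_band_dates(sample_header))
```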

### function: transformXarrayToCustomStyle

In [7]:
def transformXarrayToCustomStyle(data_new):
    """Function to convert the standard format Xarray from DCLoad into the proprietary format Xarray that this
    Notebook was written to work with"""
    
    # downscale the dataset to a dataarray, and transpose so the variable numbers are right
    datafixed = data_new.to_array().transpose('x','y','time','variable')
    
    # rename the variables into 'band'
    datafixed = datafixed.rename({'variable':'band','time':'date'})
    
    # pull out the current list of bands, find the index number of "pixelquality"
    # replace with 'cloud_mask', and reassign
    new_bands = list(datafixed.band.values)
    cm = new_bands.index('pixelquality')
    new_bands[cm] = 'cloud_mask'
    # assign_coords returns a new array with the renamed band labels
    datafixed = datafixed.assign_coords(band=new_bands)
    
    # changing the full datetime stamp to a simple date only stamp
    datafixed['date'] = pd.to_datetime(pd.DataFrame(datafixed.date.to_pandas()).index.date)
    
    # change pixel quality values to mask, 0 = good, 3 = bad
    # see https://www.sciencedirect.com/science/article/pii/S0034425717301086 for PQ value description
    cm_vals = datafixed.sel(band='cloud_mask').values.copy()
    cm_vals[cm_vals == 0] = 1
    cm_vals[cm_vals == 16383] = 0
    cm_vals[cm_vals != 0] = 3
    # assign through .loc, because assigning to the result of .sel() is not guaranteed to update the original
    datafixed.loc[dict(band='cloud_mask')] = cm_vals
    
    return datafixed
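
The three-step remapping of pixel quality values reduces to a single rule: 16383 becomes 0 (good) and everything else becomes 3 (bad). A minimal NumPy sketch of that rule follows; `pq_to_cloud_mask` and `CLEAR` are hypothetical names, and the input values other than 0 and 16383 are arbitrary examples.

```python
import numpy as np

CLEAR = 16383  # PQ value that the function above maps to "good"

def pq_to_cloud_mask(pq):
    """Map pixel quality values to 0 (good) or 3 (bad) without modifying the input."""
    pq = np.asarray(pq)
    return np.where(pq == CLEAR, 0, 3)

pq = np.array([[16383, 0], [13311, 16383]])
print(pq_to_cloud_mask(pq))
```

Using `np.where` avoids the intermediate value of 1 that the in-place version needs to keep original zeros from being read as good.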

### function: customStyleXarrayToStandard

In [8]:
def customStyleXarrayToStandard(data):
    """Convert the custom style Xarray back into a dataset with one variable per band"""
    return data.to_dataset(dim='band')

# Setting up broad scope variables

## Load Previous Training Data (or make blank dataframe)
This cell will need to be commented/uncommented as appropriate. 
The cell is divided into 3 sections, separated by **************

If you want to build on an existing training dataset, ensure that the top section refers to the location of the file and that the correct file will be opened. If you want to add to the existing training data, also uncomment the code in the middle section.

If you want to build a new training dataset, comment out the top section (Ctrl + /) and uncomment the bottom section.
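
One way to make "the correct file" predictable is to rely on timestamped file names and pick the alphabetically last one. The sketch below assumes that convention; `timestamped_name` and `latest_pickle` are hypothetical helpers, the prefix is a placeholder, and the dates are invented.

```python
import datetime

def timestamped_name(now, prefix='trainingdata_'):
    """Build a file name whose alphabetical order matches its chronological order."""
    return prefix + now.strftime('%Y-%m-%d_%H:%M:%S') + '.pkl'

def latest_pickle(filenames):
    """Return the alphabetically last .pkl name, or None if there is none."""
    pickles = sorted(name for name in filenames if name.endswith('.pkl'))
    return pickles[-1] if pickles else None

names = [
    timestamped_name(datetime.datetime(2018, 1, 30, 9, 15, 0)),
    'notes.txt',
    timestamped_name(datetime.datetime(2018, 1, 12, 16, 40, 5)),
]
print(latest_pickle(names))
```

This only works while every file shares the same prefix, because the prefix is compared before the date.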

In [9]:
# ************** SECTION 1 **********************
# load previous training data
# by taking the last .pkl file by name (ie the most recent, if the standard date is attached to the file name)
files = os.listdir('../')
pickles = []
for file in files:
    if file[-3::] == 'pkl':
        pickles.append(file)
# os.listdir returns names in arbitrary order, so sort before taking the last one
pickles.sort()
trainingdata = pd.read_pickle('../' + pickles[-1])

# ************** SECTION 2 ************************

# # if you are going to be adding to an existing dataset, uncomment the line below:
# trainingdata = trainingdata.drop(columns = trainingdata.columns[1::])

# ***************** SECTION 3 *******************************************

# # set up a multilevel hierarchical index dataframe to store the results
# # storing the training data in this format is far more memory efficient than in an Xarray of the same size as data
# # but it takes a lot of processing and manipulation to get it into a more usable form
# # note: newer pandas versions name the 'labels' argument 'codes'

# trainidx = pd.MultiIndex(levels = [[]]*4, labels = [[]]*4, names=['study_area', 'scene_num', 'row','column'])
# traincols = ['landcover']
# trainingdata = pd.DataFrame(index = trainidx, columns = traincols)

# ***********************************************************

# view the current status
trainingdata

| study_area | scene_num | row | column | landcover | blue | green | red | nir | swir1 | swir2 |
|---|---|---|---|---|---|---|---|---|---|---|
| mtbarker | 1 | 38 | 105 | 1 | 247.0 | 434.0 | 344.0 | 4094.0 | 1305.0 | 558.0 |
| mtbarker | 1 | 33 | 104 | 1 | 420.0 | 717.0 | 560.0 | 3877.0 | 2284.0 | 1088.0 |
| mtbarker | 1 | 32 | 105 | 1 | 420.0 | 677.0 | 524.0 | 3833.0 | 2284.0 | 1044.0 |
| mtbarker | 1 | 28 | 106 | 1 | 362.0 | 717.0 | 524.0 | 3877.0 | 2192.0 | 999.0 |
| mtbarker | 1 | 26 | 102 | 1 | 324.0 | 636.0 | 452.0 | 4483.0 | 2009.0 | 823.0 |
| mtbarker | 1 | 24 | 100 | 1 | 305.0 | 555.0 | 380.0 | 4093.0 | 1886.0 | 823.0 |
| mtbarker | 1 | 25 | 109 | 1 | 324.0 | 596.0 | 416.0 | 4527.0 | 1948.0 | 867.0 |
| mtbarker | 1 | 25 | 116 | 1 | 363.0 | 677.0 | 524.0 | 3746.0 | 2131.0 | 999.0 |
| mtbarker | 1 | 29 | 148 | 1 | 439.0 | 958.0 | 739.0 | 5734.0 | 2040.0 | 867.0 |
| mtbarker | 1 | 29 | 150 | 1 | 362.0 | 757.0 | 560.0 | 5390.0 | 1917.0 | 823.0 |
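
The four-level index (study_area, scene_num, row, column) allows partial selection, which the plotting and extraction code later relies on. The following sketch shows the idea on a tiny frame; the first two records reuse positions from the output above, the third (scene 2) is invented, and `picks` is a hypothetical stand-in for `trainingdata`.

```python
import pandas as pd

records = [
    {'study_area': 'mtbarker', 'scene_num': 1, 'row': 38, 'column': 105, 'landcover': 1},
    {'study_area': 'mtbarker', 'scene_num': 1, 'row': 33, 'column': 104, 'landcover': 1},
    {'study_area': 'mtbarker', 'scene_num': 2, 'row': 10, 'column': 20, 'landcover': 2},  # invented
]
picks = (pd.DataFrame(records)
         .set_index(['study_area', 'scene_num', 'row', 'column'])
         .sort_index())

# Selecting on the first two levels leaves (row, column) as the remaining index.
scene = picks.loc[('mtbarker', 1)]
positions = [tuple(int(v) for v in idx) for idx in scene.index]
print(positions)
```

Each remaining index entry is a (row, column) pair, which is what the plotting code uses to place a pick on the image.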


## Other broad scope variables

In [10]:
# easier to work with integers than strings, so map the planned training classes to integers
landcover = {'vegetation':1,'urban':2,'earth':3,'water':4}
# range of predetermined study areas to use as sources for training data
study_areas = ['mtbarker', 'swmelb', 'gunghalin', 'goldengrove', 'molonglo', 'nperth', 'swbris', 'swsyd', 'custom']

# not in broad scope yet
sat_bands = ['blue','green','red','nir','swir1','swir2']
dc_bands = sat_bands.copy() + ['cloud_mask']

colours = ['r', 'b', 'm', 'c']

# Training Data Generator Plot

## Functions

### function: drawTrainingPlot

In [11]:
def drawTrainingPlot(study_area, scene_num, covertype, scene_picks_arr):
    """Allows easy extension to extra subplots in the training plot figure.
    Currently just another layer of abstraction"""
    ax1, scene_picks_arr = drawTrainingScene(study_area, scene_num, covertype, scene_picks_arr)
    plt.draw()
    
    return scene_picks_arr

### function: drawTrainingScene

In [12]:
def drawTrainingScene(study_area, scene_num, covertype, scene_picks_arr):
    """This function draws the desired scene (study area and scene number).
    It also presents the existing training data for that scene if any exists.
    It returns the axes object for the image, along with a numpy array which is
    the existing picks for that scene"""
    
    # colour map included in case of a need to display false colour or other combinations in the future
    # could change this to an ordereddict and remove the RGB list created below...?
    # or have RGB a list, and use that in data.sel(band=RGB).values
    colourmap = {'R':'red', 'G':'green', 'B':'blue'}
    
    # combine the data for the 3 bands to be displayed into a single numpy array
    h = data.shape[1]
    w = data.shape[0]
    t = data.shape[2]
    if scene_num > (t -1):
        scene_num = t - 1
    RGB = ['R','G','B']
    date = str(data[:,:,scene_num].date.values)
    
    # create array to store the RGB info in, and fill by looping through the colourmap variable
    # note the .T at the end, because the data array is setup as a (x,y,t), but imshow works (y,x)
    rawimg = np.zeros((h, w, 3), dtype=np.float32)
    for i in range(len(RGB)):     
        rawimg[:,:,i] = data[:,:,scene_num].sel(band=colourmap[RGB[i]]).T
        
    # equalizing for all bands together
    # goal is to make it human interpretable
    img_toshow = exposure.equalize_hist(rawimg, mask = np.isfinite(rawimg))    

    # displaying the results and formatting the axes etc
    plt.imshow(img_toshow)
    ax = plt.gca()
    ax.set_title('True Colour Landsat Scene, taken\n' + date + ', over ' + study_area)
    
   
    if scene_picks_arr is None:
        
        if study_area in trainingdata.index and scene_num in trainingdata.loc[study_area].index:
            # if there aren't any picks yet, make the array
            scene_picks_arr = np.zeros((h,w), dtype=np.float32)
            # fill it with np.nan
            scene_picks_arr[scene_picks_arr == 0] = np.nan
            # make a dict with the key (study_area, scene_num)
            scene_picks_arr = {(study_area, scene_num): scene_picks_arr}
            temp = trainingdata.loc[(study_area, scene_num)]
            # loop through all relevant training points, and populate the array
            for i in range(len(temp)):
                position = temp.iloc[i].name
                scene_picks_arr[(study_area, scene_num)][position[0], position[1]] = temp['landcover'].iloc[i]
        else:
            # if not the right scene/location combination, set to None
            scene_picks_arr = None
    
    # if there are picks, then plot them up, coloured as per the environment level variable colours
    # should I better tie cmap colours to colours
    if scene_picks_arr is not None and (study_area,scene_num) in (scene_picks_arr.keys()):
        cmap = colors.ListedColormap(colours)
        # plot the training pixels
        ax.imshow(scene_picks_arr[(study_area, scene_num)], cmap)
        legend_patches = []
        # build the legend
        for cover in landcover.keys():
            legend_patches.append(mpatches.Patch(color = colours[landcover[cover]-1], label = cover))
        ax.legend(handles = legend_patches)
    else:
        # if not the right scene/location combination, set to None
        scene_picks_arr = None
    
    return ax, scene_picks_arr
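
The overlay of training picks is an array that is NaN everywhere except at tagged pixels; `imshow` typically leaves NaN cells unpainted, so the scene stays visible underneath. A minimal sketch of building such an array follows. `make_overlay` is a hypothetical helper, and `np.full` is used here as a more direct alternative to creating zeros and then replacing them with NaN.

```python
import numpy as np

def make_overlay(height, width, picks):
    """Return a (height, width) float array that is NaN except at picked pixels.

    picks maps (row, column) -> integer class code.
    """
    overlay = np.full((height, width), np.nan, dtype=np.float32)
    for (row, col), code in picks.items():
        overlay[row, col] = code
    return overlay

overlay = make_overlay(4, 5, {(0, 1): 1, (3, 4): 2})
print(np.count_nonzero(np.isfinite(overlay)))
```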

### function: train

In [13]:
# some broad scope variables specific to the plotting that need setting up and seem to be very fragile
# so I'm too scared to move them in case something breaks!
global xpos
global ypos
xpos = 0
ypos = 0
global scene_picks_arr
scene_picks_arr = None
colours = ['r', 'b', 'm', 'c']

def train(study_area, scene_num, covertype):
    """ This function is required to allow the onclick event to work with widgets."""
    def onclick(event):
        """This function tells the notebook what to do when the user tags a pixel as a training point"""
        # defining what to do on a click event
        
        # xpos and ypos are assigned inside this function, so they must be declared global here;
        # otherwise Python would treat them as new local variables
        global xpos
        global ypos
        # need to cast to int as result is a float, and can't index a list with a float
        xpos = int(event.xdata)
        ypos = int(event.ydata)
        # save the results of the click to the training data dataframe
        trainingdata.loc[(study_area, scene_num, ypos, xpos)] = landcover[covertype]
        # add the results to the current scenes overlay
        scene_picks_arr[(study_area,scene_num)][ypos, xpos] = landcover[covertype]
        # redraw with the trained pixels updated on the image
        drawTrainingPlot(study_area, scene_num, covertype, scene_picks_arr)
    
    # control the figure size
    fig = plt.figure(figsize=[10,10])
    axs = fig.axes
    plt.subplots_adjust(hspace = 0.6)
    
    # draw the figure
    global scene_picks_arr
    scene_picks_arr = drawTrainingPlot(study_area, scene_num, covertype, scene_picks_arr)
    # connect the click event action to the figure
    cid = fig.canvas.mpl_connect('button_press_event', onclick)
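
The click handling can be exercised without a figure by passing in a stand-in event. This is a sketch under stated assumptions, not the notebook's handler: `make_onclick` is a hypothetical factory, the picks are stored in a plain dictionary in place of the dataframe, and `SimpleNamespace` objects imitate the `xdata` and `ydata` attributes of a matplotlib mouse event. It also guards against clicks outside the axes, where matplotlib reports `None` for both attributes.

```python
from types import SimpleNamespace

def make_onclick(picks, study_area, scene_num, class_code):
    """Return a callback that records the clicked pixel in the picks dict."""
    def onclick(event):
        # matplotlib sets xdata/ydata to None when the click is outside the axes
        if event.xdata is None or event.ydata is None:
            return
        col = int(event.xdata)   # x runs along columns
        row = int(event.ydata)   # y runs along rows
        picks[(study_area, scene_num, row, col)] = class_code
    return onclick

picks = {}
handler = make_onclick(picks, 'mtbarker', 1, 2)
handler(SimpleNamespace(xdata=105.4, ydata=38.7))   # simulated click inside the image
handler(SimpleNamespace(xdata=None, ydata=None))    # simulated click outside the axes
print(picks)
```

Because the dictionary is mutated, not reassigned, the callback needs no `global` declarations.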

## The Training Figure

In [14]:
# create a study area drop down list and display it
study_area_dd = Dropdown(options=study_areas, value = study_areas[0], description='Study Area', disabled = False)
display(study_area_dd)

In [15]:
# work with the value of the dropdown list to get the data ready
study_area = study_area_dd.value
if study_area == 'custom':
    coords = ['lat_min', 'lat_max', 'lon_min', 'lon_max']
    spatial_query = collections.OrderedDict()
    for coord in coords:
        # input() returns a string, so convert to a number before using it as a coordinate
        spatial_query[coord] = float(input(coord + ': '))
    data = getData(list(spatial_query.values()))
else:
    data = getData(study_area)

print('\nStudy area data loaded.')

Loading data from: ../mtbarker/

Study area data loaded.


In [16]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    interact(train,
             study_area = fixed(study_area),
             scene_num = IntSlider(value = 1, min = 0, max = data.shape[2]-1,description = "Scene Number"),
             covertype = Dropdown(options=list(landcover.keys()), value=list(landcover.keys())[0], description='Landcover', disabled = False))

In [None]:
# check the outputs of the training data generation process
trainingdata

# Training Results Manipulation and Classifier Training Function

## Building the training dataset

In [None]:
# Aim is to get the data from the dataframe (which holds references to the pixel's location, along with the
# assigned class for that pixel), use it to extract the spectral data for that pixel, format it appropriately
# and pass it to the classification algorithm to teach it.

# useful variables for pulling out data from Xarray
sat_bands = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2']
dc_bands = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'cloud_mask']

# make the required columns
sat_bands_loc = []
for band in sat_bands:
    if band in trainingdata.columns:
        continue
    trainingdata[band] = np.nan
    sat_bands_loc.append(trainingdata.columns.get_loc(band))
    
# loop through the different locations used for the training data.
for loc in trainingdata.index.levels[0]:
    
    # build the Xarray for that location
    data = getData(loc)
    # only look at the training data for that location
    subset = trainingdata.loc[loc]
    # for each row (ie each pick) at that location
    for i in range(len(subset)):
        # unpack the multilevel pandas index into components for accessing the correct Xarray pixel
        scene, y, x = subset.iloc[i].name
        vals = data[x, y, scene].sel(band=dc_bands)
        if vals.sel(band='cloud_mask').values == 0:
            # if the pixel is valid (no cloud), take the spectral bands
            if np.isfinite(vals.sel(band=sat_bands).values).all():
                # if all the bands have readings (no NaNs), save the spectral values into the dataframe
                trainingdata.loc[(loc, scene, y, x), sat_bands] = vals.sel(band=sat_bands).values

# save the latest version of trainingdata somewhere good
time = str(datetime.datetime.now()).split('.')[0].replace(' ','_')
trainingdata.to_pickle('../traningdata_' + time + '.pkl')                
                
trainingdata
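
The extraction step can be illustrated on a synthetic array. The sketch below assumes the same axis order as the custom style Xarray, (x, y, scene, band), and the same rules: keep a pick only if its cloud mask is 0 and all spectral bands are finite. `extract_spectra` is a hypothetical helper, and plain NumPy indexing replaces the Xarray selection.

```python
import numpy as np

bands = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'cloud_mask']

def extract_spectra(cube, picks):
    """Return {(scene, row, col): spectral values} for clear, complete pixels.

    cube has shape (x, y, scene, band); picks is a list of (scene, row, col).
    """
    mask_idx = bands.index('cloud_mask')
    spectra = {}
    for scene, row, col in picks:
        pixel = cube[col, row, scene]          # x is the column, y is the row
        if pixel[mask_idx] != 0:
            continue                           # cloudy or otherwise flagged
        values = np.delete(pixel, mask_idx)    # drop the mask, keep the six bands
        if np.isfinite(values).all():
            spectra[(scene, row, col)] = values
    return spectra

# Synthetic cube: 3 x 2 pixels, 1 scene, every band set to 100 and all pixels clear.
cube = np.full((3, 2, 1, len(bands)), 100.0)
cube[..., bands.index('cloud_mask')] = 0
cube[2, 1, 0, bands.index('cloud_mask')] = 3   # flag one pixel as cloudy
cube[0, 0, 0, bands.index('nir')] = np.nan     # give another a missing band

result = extract_spectra(cube, [(0, 0, 0), (0, 1, 1), (0, 1, 2)])
print(sorted(result))
```

Only the pick at row 1, column 1 survives; the other two are dropped for a missing band and a cloud flag respectively.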